I consider myself a practical programmer. I work almost entirely in higher-level languages that don’t concern themselves with the fine details of a computer’s deep operations and core calculations. Whenever possible, I also use frameworks and libraries developed and maintained by other, smarter people to further abstract away the computational busywork between myself and my goal.
Certainly, I almost never work directly with binary numbers, the informational stuff that lies at the subatomic level of all software. My favorite general-purpose programming language, Perl, has — like most any language like it — a passel of binary-smart functions and “bitwise” operators with which a coder can manipulate numbers and other information at the binary level. I have never made use of that stuff, though. It simply doesn’t operate at the level I dwell in.
So you can understand why I felt pretty smart, late one night earlier this month, when I realized that a bizarre problem appearing on a client’s system had everything to do with binary numbers, and specifically the differences between 32-bit and 64-bit computing architectures, something I’ve seldom had reason to think about before.
I do, at least, have a vague understanding of how computers use binary numbers. I can express more or less accurately the literal difference between, say, an 8-bit and a 16-bit machine: that the binary-number “boxes” in the latter computer, as seen in its RAM and various other computery places, have twice the size of those in the former. An 8-bit machine can store, as atomic units, numbers as high as 255, or 11111111 in binary. A roomier 16-bit machine works with numbers as high as 1111111111111111, or 65,535. And so on, with newer 32- and then 64-bit systems offering the ability to store and transmit ever-larger numbers within themselves.
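For the curious, the arithmetic behind those “boxes” fits in a few lines. This is a minimal sketch in Python rather than Perl, and the helper name max_unsigned is made up for illustration. It assumes plain unsigned integers, which is the simplest way to picture a fixed-width box.

```python
def max_unsigned(bits: int) -> int:
    """Largest unsigned integer that fits in a box that is `bits` wide."""
    # A box with every bit set to 1 holds 2**bits - 1.
    return (1 << bits) - 1


for bits in (8, 16, 32, 64):
    print(f"{bits:>2}-bit: {max_unsigned(bits):,}")
```

The 8-bit and 16-bit rows reproduce the 255 and 65,535 figures mentioned above; the other two rows just continue the same doubling of width.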
I know just from observation — many years working across a variety of Unix-based systems belonging both to me and my customers — that contemporary computer architectures tend to be 32- or 64-bit, with the latter apparently becoming ever more commonly encountered in the context I have the most familiarity with: virtual private servers, physically located who-knows-where and made of hardware I couldn’t care less about. They run Linux, and that’s all I need.
Years ago, when I first adopted this style of work, I didn’t care about bits at all, as every VPS provider I used (such as Linode, my long-time favorite) sold access only to 32-bit machines. At some point, 64-bit servers started becoming available, and then I started having to pay attention to whether software I wanted to install expected a 32- or 64-bit computing environment. I never stopped to think of why it might care; it seemed a perfectly reasonable thing for a software package to feel picky about, so I just went along with it. And this is as far as my practical knowledge of architectures’ bit-width differences ran: effectively, just two different colors for the same thing.
Time passed, and earlier this year I became the maintainer for an ancient, sprawling codebase of deep importance to a client. I identify its contents as legacy code in both the casual sense of its adherence to engineering practices of antiquity (in this case, of the mid-to-late 1990s), and the more precise definition offered by Michael Feathers in his book Working Effectively with Legacy Code: code with negligible test coverage. It works great, and has worked great for many years, and the thought of changing any part of it even a little fills one with terror.
Before it passed into my hands, this system’s masters had already decided to migrate it to a new machine, away from the one it had run on for as long as anyone could remember. This, then, became my first major project with it. With carte blanche to set up the new machine as I saw fit, I improved as much as I could regarding the software’s environment. I set its processes to run as an ordinary user and not as the all-powerful (and asking-for-trouble) root user, for example, and made use of proper, Git-based version control to help separate out its hardcoded configuration settings from its program logic.
But, I couldn’t magically write the test coverage the system deserved, at least not in the time and budget allotted for this migration. All we could do was get the system apparently working on the new machine, invite the client to kick the tires for a bit, then have it go live at the appointed hour. We threw the switch with a song in our hearts, and waited hopefully to see how its first few hours of public service in its new home would go.
For the most part, everything worked quite well. We had some irritating issues involving the webserver not understanding the new database’s character encoding scheme, or applying the wrong SSL certificate, but these ended up being simple matters of configuration-tuning.
A far more bizarre problem showed up on a collection of web pages that displayed ranges of IP addresses associated with my client’s own customer accounts. These pages had always displayed address data in a neat table, with every row showing a high and low address from each contiguous block of a given customer’s IPs. Now, though, they collapsed every such table down to a single row, with the higher address involving an absurdly large number, impossible for any IP address. It looked something like this: 18.104.22.168 to 1099511627622.214.171.124. Oh dear.
After I confirmed that the old data had indeed transferred cleanly (for the database had migrated to a new server as well), I located the routines in the codebase that fetched the IP addresses from storage and displayed them. What I found, of course, was a screens-long logic-monolith that I could more or less suss out if I crossed my eyes. I’d never seen raw IP math before. My predecessor clearly had a strong grasp of it, making extensive use of those seldom-seen bitwise operators, confidently converting IPs to decimal numbers and back via bit-shifting (a practice of which I was previously unaware), applying “netmask” numbers to turn single IPs into ranges, and so on. I felt a little intimidated; low-level network-hacking, even just the manipulation of symbols that point to network nodes, I have always treated as sorcery beyond my ken. But I knew the problem lurked in there somewhere.
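To give a flavor of that raw IP math, here is a minimal sketch of the general technique, written in Python rather than the Perl of the codebase in question. It is not the original code: the function names are hypothetical, and the addresses come from a block reserved for documentation.

```python
def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 address into one integer via bit-shifting."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def int_to_ip(number: int) -> str:
    """Unpack a 32-bit integer back into dotted-quad form."""
    # Shift each octet down to the bottom and keep only its eight bits.
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def network_address(ip: str, netmask: str) -> str:
    """Bitwise-and an address with its netmask to find the start of its block."""
    return int_to_ip(ip_to_int(ip) & ip_to_int(netmask))


print(ip_to_int("192.0.2.77"))
print(network_address("192.0.2.77", "255.255.255.0"))
```

Each of the four octets occupies eight bits of the packed number, which is why the shifts step by eight and why a netmask such as 255.255.255.0 can carve an address into a “block” part and a “host” part.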
God help me, I walked into my personal library and pulled out my camel book. It had been a long time. As with any modern hacker, I use the internet as my primary language reference; for Perl-based work, that means the module documentation at MetaCPAN first and foremost, and after that perldoc.perl.org for those rarer times when I need to look up something about the actual core language, versus one of the third-party abstractions I so love applying. Something about the code at hand compelled me to seek paper documentation, though. Perhaps its circa-1995 style, written with an apparent pre-CPAN mindset, compelled me to pair it with a “search engine” as antique as a back-of-the-book index. More to the point, with a starting point of obscure, language-defined operators rather than some high-level API’s methods, it felt right and comfortable to turn to a reference I could physically flip through, seeing the textual content around whatever I’d looked up, rather than isolated definitions.
Happily, I found the key clue very quickly. A discussion of the bitwise “or,” “and,” and “not” operators (for which Perl uses the |, &, and ~ characters) noted that one should be careful about possible vagaries of machine architecture when manipulating numbers at the bit level. Until I read that, I had paid no mind at all to the fact that my client’s software had moved from an older 32-bit machine to a newer 64-bit one. For the first time in my career, a change in this number had an observable effect on the operation of a program whose reins I held.
It all came down to the code’s use of bitwise-not, an operation that transforms one number into another one by literally flipping all its constituent bits: ones become zeros, and vice-versa. (Apparently, this operation proves useful when pairing IP addresses with netmask numbers.) Looking again at the number 255, on an 8-bit system its bitwise-not becomes 00000000: all eight of its ones flipped into zeros, resulting in the decimal number 0. But if you bitwise-not 255 on a 16-bit system, you instead get 1111111100000000, or 65,280 in decimal notation. This is because 255 stored in a 16-bit computer begins with eight leading zeros, and those zeros all turn into ones upon the application of a bitwise-not. And thus can the same simple mathematical operation performed upon the same number, but on two different computers, output two very different results.
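The 255 example can be checked with a short sketch. It is written in Python, whose integers have no fixed width, so the machine width here is simulated by masking the result down to a chosen number of bits. The function name bitwise_not is hypothetical.

```python
def bitwise_not(value: int, bits: int) -> int:
    """Flip every bit of `value` as a machine with `bits`-wide boxes would."""
    width_mask = (1 << bits) - 1  # a box of the given width, filled with ones
    return ~value & width_mask    # flip the bits, then keep only what fits


for bits in (8, 16):
    result = bitwise_not(255, bits)
    # Show the result as a zero-padded binary string and as a decimal number.
    print(f"{bits:>2}-bit: {result:0{bits}b} = {result:,}")
```

The input never changes; only the width of the box around it does, and that alone is enough to move the answer from 0 to 65,280.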
More to the point, this is how the operation on a “more-bits” computer can result in numbers vastly larger than those seen on a “fewer-bits” one, which quite neatly matched the problem reported to me.
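Here is one way such a failure could arise, as a hedged reconstruction rather than the actual legacy code. The sketch is in Python with the machine width simulated by a mask. It assumes, purely for illustration, that the top of an address range is computed with a machine-width bitwise-not of the netmask, and that the routine converting the result back to dotted form trusts the leading chunk to fit in eight bits. All function names are hypothetical.

```python
def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 address into one integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def int_to_ip_unchecked(number: int) -> str:
    """Unpack to dotted form, deliberately leaving the leading chunk unmasked."""
    chunks = (number >> 24, (number >> 16) & 0xFF, (number >> 8) & 0xFF, number & 0xFF)
    return ".".join(str(chunk) for chunk in chunks)


def high_address(ip: str, netmask: str, word_bits: int) -> str:
    """Top of the block, using a bitwise-not as wide as the simulated machine."""
    word_mask = (1 << word_bits) - 1
    low = ip_to_int(ip) & ip_to_int(netmask)
    host_bits = ~ip_to_int(netmask) & word_mask  # simulated machine-width NOT
    return int_to_ip_unchecked(low | host_bits)


print(high_address("192.0.2.77", "255.255.255.0", 32))
print(high_address("192.0.2.77", "255.255.255.0", 64))
```

With a 32-bit word the flipped netmask contributes only the host bits. With a 64-bit word it also contributes thirty-two extra ones above the address, and those spill into the first chunk of the output. Explicitly and-ing the flipped netmask with 0xFFFFFFFF would make the calculation independent of word size.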
Kid stuff, to any computer scientist! But, again: I almost never have to think of binary numbers when programming. (Also, my college degree is in English.) Imagine a plumber suddenly realizing that the pipe in front of her had become clogged not due to the usual greasy hairballs, but instead via a novel reaction between the molecular structure of water and the chemical compounds found in the pipe’s interior coating. This was not an everyday customer issue.
Since inheriting it at the start of this year, I have held a policy of modifying the guts of this legacy codebase as little as I can. For this case I felt justified in risking an exception. A quick search of MetaCPAN revealed two modules that did precisely what this software wanted, but in a safe, portable, and easy-to-read way: Net::Netmask by David Muir Sharnoff, which performs all sorts of operations on IP addresses, and Net::IPAddress::Minimal by Tamir Lousky and Sawyer X, providing that IP-to-decimal-and-back conversion. I had never worked with either module before, but found both to have accurate and concise documentation, and I easily found the functions within them I needed. Thus did I scrub many lines out of the codebase, all those quite clever and perfectly well-intentioned bitwise operations, replacing them with a bare handful of calls to these two modules.
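The same idea, leaning on a well-tested library instead of handwritten bit-twiddling, can be sketched in Python with its standard ipaddress module. This is an analogue only: it does not show the interfaces of Net::Netmask or Net::IPAddress::Minimal, and the wrapper function names are hypothetical.

```python
import ipaddress


def block_bounds(ip: str, netmask: str) -> tuple[str, str]:
    """Return the low and high addresses of the block containing `ip`."""
    # strict=False lets us pass any address inside the block, not just its start.
    network = ipaddress.IPv4Network(f"{ip}/{netmask}", strict=False)
    return str(network.network_address), str(network.broadcast_address)


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad address to its integer form."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(number: int) -> str:
    """Convert an integer back to a dotted-quad address."""
    return str(ipaddress.IPv4Address(number))


print(block_bounds("192.0.2.77", "255.255.255.0"))
print(int_to_ip(ip_to_int("192.0.2.77")))
```

Because the library knows an IPv4 address is exactly 32 bits wide, the result does not depend on the word size of the machine running it, and an out-of-range integer raises an error instead of printing an impossible address.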
After that, following a dreadful hunch, I used the ack program to search for other invocations of bitwise-not in the codebase, and found one more: a line in a subroutine scheduled to run (for the first time, on this new machine) in about four hours. I may have begun subvocally humming the Mission: Impossible theme as I hurriedly set up some tests on both the old and the new machine, making sure that all calls to the changed functions would execute with the same results on both, once again replacing the headspinningly handwritten bit-twiddling with calls to those two CPAN modules.
This sort of time-pressure doesn’t at all resemble how I usually work, and I hope I don’t need to do it again for some time. But boy, did I ever feel wizardly once I had tied everything off successfully, and even more so the next day when my client confirmed that everything once again seemed to work in the way they expected. And that was that.
An unexpected postscript: Between the time of this incident and my posting this account of it, I described my little adventure to a colleague. Before I mentioned the specifics of how I changed the code, they interrupted me to ask: “Did you use Net::Netmask?”
This came to mind for them, it turns out, because they had coincidentally made some improvements to this module earlier this year as part of the CPAN Pull Request Challenge, a community phenomenon which I have written about before. My friend had no way of knowing that their work would help me pull my client’s code out of the fire during an eleventh-hour struggle in a few months’ time; like everyone else who participates in the challenge, they just pitched in because it felt like the right thing to do, a fun and self-rewarding way to give back to a technological community one cares about.
So, that all turned out pretty well for everyone.