I imagine everyone with enough years in the industry has a story like this, but this one is mine.
Call me Ishmael. Seven years ago I was a software engineer on a team building the Rackspace Cloud Control Panel (a now-defunct product). We had been part of Rackspace’s OpenStack Public Cloud launch, which itself was a break with the old and buggy Rackspace Cloud powered by Slicehost and a slow and buggy Control Panel with a Java tech stack. We were using the new hotness - modern JavaScript and Python on the backend. We had hit our launch dates (even meeting some bonus targets) and were deploying to production several times a day without waiting for weeks-long QA cycles. The main difference (in our minds) was that we were able to use production APIs for development and testing, rather than relying on the often-down Rackspace staging environment. In short, we were riding high.
We had a lurking performance issue in our product that we weren’t (yet) aware of: when you loaded a page with a list of servers, your browser would briefly “lock”. This was caused by the product behavior of middle truncation: for customers who had a number of similarly named servers such as “prodwebserver1”, “prodwebserver2”, “prodwebserver3”, and so forth, we would display “prodweb…ver1”, “prodweb…ver2”, and “prodweb…ver3” rather than “prodweb…” three times. This was behavior carried over from another acquired startup and nobody on the existing team really seemed to think it was that important, but basically everything got passed through the “middle truncation” code. Unlike CSS ellipsis truncation, which is done by the browser, middle truncation needed to be done by client-side JavaScript that we had written.
Building a feature in the DNS control panel, we ended up with a ticket from QA: sometimes the page would hang when saving a TXT record. Digging into the issue, I discovered the performance bug in the middle truncation algorithm that ran after the DNS TXT record was saved: it was recursive. Middle truncating a long string would make a number of recursive calls, essentially linear in the size of the passed-in string. Since this was executed in client-side JavaScript it would end up being super inefficient and causing the “page hangs” that we saw.
This was pretty silly since the algorithm didn’t need to be recursive at all - you could just look at the length of the string, peel off the edges, and return “…” for the stuff in the middle. I made the modifications, wrote the unit tests, and then, for the final test, verified that the performance bug was fixed by saving a TXT record of the maximum size, which at the time was something like 8k. That happened to match up with the size of the first two chapters of that wonderful classic of 19th century literature, Moby Dick. It’s a book I love - primarily for its beautiful prose and its description of a now-gone world.
Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears Hook to Coenties Slip, and from thence, by Whitehall, northward. What do you see?—Posted like silent sentinels all around the town, stand thousands upon thousands of mortal men fixed in ocean reveries. Some leaning against the spiles; some seated upon the pier-heads; some looking over the bulwarks of ships from China; some high aloft in the rigging, as if striving to get a still better seaward peep. But these are all landsmen; of week days pent up in lath and plaster— tied to counters, nailed to benches, clinched to desks. How then is this? Are the green fields gone? What do they here?
Great stuff. Anyways, I press “Save”, everything works. Ticket done.
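For the curious, the non-recursive approach described above can be sketched roughly like this. It is a minimal illustration and not the actual control panel code: the function name `middleTruncate`, the `maxLength` parameter, and the roughly even split between head and tail are all assumptions.

```javascript
// Hypothetical helper: the name, signature, and head/tail split are
// assumptions for illustration, not the original Rackspace code.
function middleTruncate(text, maxLength) {
  const ellipsis = "…";

  // Short strings pass through untouched.
  if (text.length <= maxLength) {
    return text;
  }

  // Degenerate case: no room for anything but (at most) the ellipsis.
  if (maxLength <= ellipsis.length) {
    return ellipsis.slice(0, Math.max(maxLength, 0));
  }

  // Split the remaining budget between the head and the tail,
  // giving the head the extra character when the budget is odd.
  const available = maxLength - ellipsis.length;
  const headLength = Math.ceil(available / 2);
  const tailLength = available - headLength;

  // Two slices and a concatenation: no recursion, no loop over the string.
  return (
    text.slice(0, headLength) +
    ellipsis +
    text.slice(text.length - tailLength)
  );
}

console.log(middleTruncate("prodwebserver1", 12)); // "prodwe…rver1"
console.log(middleTruncate("web1", 12)); // "web1"
```

The work here does not grow with a chain of nested calls the way the recursive version did, so an 8k input is handled the same way as a 14-character server name.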
About 45 minutes later, my team gets an email from someone who I’ll refer to as “the DNS Guy”, indicating that all Rackspace DNS updates had stopped being processed because “someone pasted a book” into the system. You see, while I was testing against my development environment, this hit the production DNS API servers, which fed the actual production Rackspace DNS servers (you know, like BIND servers), and something in the size of the record caused the (I’m sure ancient and rarely updated) DNS servers to slow to a crawl. Unlike our little control panel and API, the DNS Guy’s contention was that people actually cared about Rackspace DNS. Making this scenario better was how insanely dramatic the email was, as if I had intentionally tried to bring the system down, rather than the more mundane reality that I was clicking a bunch of buttons on top of systems I barely understood. Any customer could have done the same thing; I just happened to do it before they did.
The DNS Guy essentially wanted all testing against production to stop, and for all teams (including the Rackspace DNS API team that we were building a frontend for) to be pointed towards their staging environment, something that nobody had ever heard of prior to this incident. It wasn’t enough to just add a limit on the size of the TXT record. Directors were CCed. Meetings were had. The issue was no longer the thing that had happened, but instead trying to “scare straight” the entire Rackspace Cloud division. We pushed back - staging environments in the company were notoriously bad, and hearing that we needed to develop “safely” sounded the same as going back to the old world where we’d sit around waiting for a working environment to develop against. My manager realized that the CEO was the only point where our reporting structure and the DNS Guy’s met. Eventually things fizzled out without any resolution or change in behavior.
I’ve always found the story amazing on a number of levels. First, the complete idiocy that I operated with in the course of doing some pretty routine product development. Second, while the DNS Guy was probably right about actually sandboxing the insane things we were doing in order to provide a safer system to our customers, our testing strategy had been working great for us (and continued to for years to come), so we weren’t going to listen to him, as right as he may have been. Third, all these seemingly irreconcilable issues just sort of melted away because nobody in a position of authority actually cared enough to act. And of course fourth, the fact that for a brief moment one afternoon, the beautiful prose of Moby Dick took down Rackspace DNS.