Outages, you say? Of course I have stories about outages, and limits, and some limits causing outages, and other things just screwing life up. Here are some random thoughts which sprang to mind upon reading this morning’s popcorn-fest.

I was brand new at a company that “everybody knew” had AMAZING infrastructure. They could do things with Linux boxes that nobody else could. As part of the new employee process, I had to get accounts in a bunch of systems, and one of them was this database used to track the states of machines. It was where you could look to see if a given machine was (supposed to be) serving, or under repair, or whatever. You could also see (to some degree) what services were supposed to be running on it, which servers (that is, actual programs) backed them, what port numbers they used, and whether all of that stuff was synced to the files on the box or not.

My request didn’t go through for a while, and I found out that it had something to do with my employee ID being a bit over 32767. And yeah, for those of you who didn’t just facepalm at seeing that number, that’s one of those “magic numbers” which pop up a bunch when talking about limits. That one is what you get when you try to store numbers as 16-bit values… with a sign to allow negative values. Why you’d want a negative employee number is anyone’s guess, but that’s how they configured it.
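
If you’ve never watched that particular failure happen, here’s a tiny sketch of it. This is made-up C++, not their schema, and depending on what’s actually doing the storing you either get an error or a quiet wraparound like this one:

    #include <cstdint>
    #include <cstdio>

    int main() {
        int32_t real_id = 34567;                          // made-up ID a bit past the limit
        int16_t stored = static_cast<int16_t>(real_id);   // what a signed 16-bit field can keep
        // 34567 doesn't fit in [-32768, 32767], so on the usual two's complement
        // systems it wraps around modulo 2^16 and comes out negative.
        std::printf("wanted %d, stored %d\n", real_id, stored);
        return 0;
    }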

I assume they fixed the database schema at some point to allow more than ~15 bits of employee numbers, but they did an interesting workaround to get me going before then. They just shaved off the last digit and gave me that ID in their system instead. I ended up as 34xx instead of 34xxx, more or less.

This was probably my first hint that their “amazing infra” was in fact the same kind of random crazytown as everywhere else once you got to see behind the curtain.

Then there was the time someone decided that a log storage system with something like a quarter of a million machines (and growing fast) feeding it needed a static configuration. The situation unfolded like this:

(person 1) Hey, why is this thing crashing so much?

(person 2) Oh yeah, it’s dumping cores constantly! Wow!

(person 1) It’s running but there’s nothing in the log?

(person 2) Huh, “runtime error … bad id mapping?”

(person 2) It’s been doing this for a month… and wait, other machines are doing it, too!

(person 1) Guess I’ll dig into this.

(person 2) “range name webserv_log.building1.phase3 range [1-20000]”

(person 2) But this machine is named webserv20680…

(person 2) Yeah, that’s enough for me. Bye!

The machines were named with a ratcheting counter: any time they were assigned to be a web server, they got names like “webserv1”, “webserv2”, … and so on up the line. That had been the case all along.

Whoever designed this log system, years after that naming scheme was established, decided to put a hard-coded limiter into it. I don’t know if they did it because they wanted to feel useful every time it broke so they could race in and fix it, or if they didn’t care, or if they truly had no idea that numbers could in fact grow beyond 20000.

Incidentally, that particular “building1.phase3” location didn’t even have 20000 machines at that specific moment. It had maybe 15000 of them, but as things went away and came back, the ever-incrementing counter just went up and up and up. So, there _had been_ north of 20K machines in that spot overall, and that wasn’t even close to a surprising number.
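
To make the mismatch concrete, here’s a loose sketch with made-up names and parsing (nothing from their actual code): the hostname carries a counter that only ever goes up, while the log system compares it against a range somebody typed in once and never revisited.

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // Made-up stand-in for the hard-coded mapping in the story.
    constexpr int kRangeLow = 1;
    constexpr int kRangeHigh = 20000;   // fine on the day it was written, wrong soon after

    int main() {
        std::string host = "webserv20680";       // hypothetical hostname
        int n = std::atoi(host.c_str() + 7);     // the counter part after "webserv"
        if (n < kRangeLow || n > kRangeHigh) {
            std::fprintf(stderr, "runtime error: bad id mapping (%d not in [%d-%d])\n",
                         n, kRangeLow, kRangeHigh);
            std::abort();                        // the real thing dumped core; this does too
        }
        std::printf("%s maps into the static range, carry on\n", host.c_str());
        return 0;
    }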

At a particular gig where we had far too many Apache web servers running on various crusty Linux distributions, there was a single line that would catch obvious badness:

locate access_log | xargs ls -la | grep 2147

It was what I’d send in chat to someone who said “hey, the customer’s web server won’t stay up”. The odds were very good that they had a log file that had grown to 2.1 GB, and had hit a hard limit which was present in that particular system. Apache would try to write to it, that write would fail, and the whole process would abort.

“2147”, of course, is the first 4 digits of the expected file size: 2147483647 … or (2^31)-1.

Yep, that’s another one of those “not enough bits” problems like the earlier story, but this one is 32 bits with one of them being for the sign, not 16 like before. It’s the same problem, though: the counter maxes out and you’re done.
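
If you want to see the shape of that failure in code, here’s a rough sketch. It’s not Apache’s actual write path, just a stand-in that refuses to push a file past the old 32-bit signed offset ceiling, the way a build without large-file support would:

    #include <cerrno>
    #include <cstdint>
    #include <cstdio>

    // Without large-file support, file offsets are signed 32-bit,
    // so a file tops out at (2^31)-1 bytes.
    constexpr int64_t kMaxOffset = 2147483647;

    // Pretend write(): rejects anything that would push the offset past the cap,
    // which is roughly what a non-LFS process got back from the real one.
    int64_t fake_write(int64_t current_size, int64_t len) {
        if (current_size + len > kMaxOffset) {
            errno = EFBIG;   // "File too large"
            return -1;
        }
        return len;
    }

    int main() {
        int64_t log_size = kMaxOffset;            // the access_log has hit the wall
        if (fake_write(log_size, 100) < 0) {
            std::perror("append to access_log");  // the error Apache choked on before bailing
        }
        return 0;
    }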

These days, files can get quite a bit bigger… but you should still rotate your damn log files once in a while. You should probably also figure out what’s pooping in them so much and try to clean that up, too!

As the last one for now, there was an outage where someone reported that something like half of their machines were down. They had tried to do a kernel update, and wound up hitting half of them at once. I suspect they wanted to do a much smaller quantity, but messed up and hit fully half of them somehow. Or, maybe they pointed it at all of them, and only half succeeded at it. Whatever the cause, they now had 1000 freshly-rebooted machines.

The new kernel was fine, and the usual service manager stuff came back up, and it went to start the workload for those systems, and then the workload would immediately crash. The service manager would try to start it again. It would crash again. Crash crash crash. This is why we call it “crashlooping”.

Finally, the person in question showed up in the usual place where we discussed outages, and started talking about what was going on.

(person 1) Our stuff isn’t coming back.

(person 2) Oh yeah, that’s bad, they’re all trying to start.

(person 1) Start, abort, start, abort, …

(person 2) Yep, aborting… right about here: company::project::client::BlahClient::loadConfig … which is this code: <paste>

(person 2) It’s calling “get or throw” on a map for an ID number…

(person 1) My guess is the config provider service isn’t running.

(person 2) It’s there… it’s been up for 30 minutes…

(person 1) Restarting the jobs.

(person 2) Nooooooooooo…

<time passes>

(person 2) Why is there no entry for number 86 in the map in the config?

(person 1) Oh, I bet it’s problems with port takeover.

(person 3) I think entry 86 is missing from <file>.

(person 2) Definitely is missing.

(person 4) Hey everyone, we removed that a while back. Why would it only be failing now?

(person 2) It’s only loaded at startup, right?

(person 4) Right.

(person 2) So if they were running for a long time, then it changed, then they’re toast after a restart…

(person 3) Hey, this change looks related.

(person 4) I’m going to back that out.

This is a common situation: program A reads config C. When it starts up, config C is on version C1, and everything is fine. While A is running, the config is updated from C1 to C2, but nothing notices. Later, A tries to restart and it chokes on the C2 config, and refuses to start.
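
In code, the trap looks something like this. Made-up names, not theirs, but it’s the same “parse once at startup, get-or-throw on a map” shape:

    #include <cstdio>
    #include <map>
    #include <stdexcept>
    #include <string>

    // Stand-in for the config: parsed once, at startup, into a map of ID -> value.
    using Config = std::map<int, std::string>;

    // The "get or throw" lookup from the story, more or less.
    const std::string& get_or_throw(const Config& cfg, int id) {
        return cfg.at(id);   // throws std::out_of_range if the ID isn't there
    }

    int main() {
        Config c1 = {{85, "old thing"}, {86, "the thing we need"}};   // version at first startup
        Config c2 = {{85, "old thing"}};                              // entry 86 removed later

        // While the process is up, it never re-reads the config, so it keeps working...
        std::printf("running fine: %s\n", get_or_throw(c1, 86).c_str());

        // ...and the problem only shows up when a restart parses the newer version.
        try {
            get_or_throw(c2, 86);
        } catch (const std::out_of_range&) {
            std::printf("startup aborts: no entry for 86 in the map\n");
        }
        return 0;
    }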

Normally, you’d only restart a few things to get started, and you’d notice that your program can’t consume the new config at that point. You’d still have a few instances down, but that’s it - a *few* instances. Your service should keep running on whatever’s left over that you purposely didn’t touch.

This is why you strive to release things in increments.

Also, it helps when programs notice config changes while they’re running, so this doesn’t sneak up on you much later when you’re trying to restart. If the programs notice the bad config right after the change is made, it’s *far* easier to correlate it to the change just by looking at the timeline.
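
One way to get that early warning, sketched very loosely here with a hypothetical path and a placeholder validation check: watch the config file and re-validate it the moment it changes, so the complaint lands seconds after the bad push instead of weeks later at restart.

    #include <chrono>
    #include <cstdio>
    #include <filesystem>
    #include <thread>

    namespace fs = std::filesystem;

    // Placeholder check: a real version would re-parse the file and confirm that
    // every ID the program will ever ask for is still present.
    bool config_still_valid(const fs::path& p) {
        return fs::exists(p) && fs::file_size(p) > 0;
    }

    int main() {
        const fs::path cfg = "service.conf";   // hypothetical path; assumed to exist
        auto last_seen = fs::last_write_time(cfg);

        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(5));
            auto stamp = fs::last_write_time(cfg);
            if (stamp == last_seen) continue;   // nothing changed, keep napping
            last_seen = stamp;
            if (!config_still_valid(cfg)) {
                // This is the line you want in the logs four seconds after the bad
                // change lands, not two months later on a restart.
                std::fprintf(stderr, "config changed and no longer validates!\n");
            }
        }
    }

A real deployment would hang this off whatever config distribution system is already in place rather than polling a file, but even a dumb check beats finding out at the next reboot.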

Tuesday, 11:23:51: someone applies change.

Tuesday, 11:23:55: first 1% of machines which subscribe to the change start complaining.

… easy, right? Now compare it to this:

Tuesday, November 18: someone applies a change

Wednesday, January 7: 50% of machines fail to start “for some reason”

That’s a lot harder to nail down.

Random aside: restarting the jobs did not help. They were already restarting themselves. “Retry, reboot, reinstall, repeat” is NOT a strategy for success.

It was not the config system being down. It was up the whole time.

It was nothing to do with “port takeover”. What does that have to do with a config file being bad?

The evidence was there: the processes were crashing. They were logging a message about WHY they were killing themselves. It included a number they wanted to see, but couldn’t find. It also said what part of the code was blowing up.

*That* is where you start looking. You don’t just start hammering random things.

