Not long ago, in a datacenter not far away…this is a story about stuff going wrong and how we fixed it.
My team works on Socorro, the Firefox crash reporting system, among other things.
When Firefox crashes, it submits two files to us via HTTP POST. One is JSON metadata, and one is a minidump. This is similar in nature to a core dump. It’s a binary file, median size between 150 and 200 kB.
When we have a plugin problem (typically with Flash), we get two of these crash reports: one from the browser process, and one from the plugin container. Until recently it was challenging to reunite the two halves of the crash information. Benjamin Smedberg made a change to the way plugin crashes are reported. We now get a single JSON metadata file, with both minidumps, the one from the browser, and the one from the plugin container. We may at some point get another 1-2 dumps as part of the same crash report.
We needed to make a number of code changes to Socorro to support this change in our data format. From here on in, I shall refer to this architectural change as “multidump support”, or just “multidump”.
Crashes arrive via our collectors. This is a set of boxes that run two processes:
1. Collector: this is Python (web.py) running in a mod_wsgi process via Apache. Collector receives crashes via POST, and writes them to local filesystem storage.
2. Crash mover: This is a Python daemon that picks up crashes from the filesystem and writes them to HBase.
You may be saying, “Wow, local disk? That is the worst excuse for a queue I’ve ever seen.” You would be right. The collector uses pluggable storage, so it can write wherever you want (from Postgres, HBase, filesystem). We have previously written crashes to NFS, and more recently and less successfully directly to HBase. That turned out to be a Bad Idea ™, so about two years ago I suggested we write them to local disk “until we can implement a proper queue”. Local storage has largely turned out to be “good enough”, which is why it has persisted for so long.
Adding multidump support changed the filesystem code, among other things.
Act I: An Unexpected Journey
We had landed multidump support on our master and stage branch, but engineers and QA agreed that we were not quite comfortable enough with it to ship it. Although we had planned to ship it this day, we didn’t, but we had some other stuff we needed to ship. Instead of what we usually do (in git, push master to stage, which is our release branch), we stashed stage changes between the last release and now, and then cherry picked the stuff we needed to ship.
What we didn’t realize was that we accidentally left multidump in the stage branch, so when we pushed, we pushed multidump support. It ran for several hours in production seemingly without problems. We did not apply the PostgreSQL schema migration, but we had previously changed the HBase schema to support this, so it didn’t cause any problems, but was not end-user visible. When we realized the error, we rolled back, rebuilt, and pushed the intended changes. This happened within a couple of hours. (The rollback/rebuild/repush took a minute or two.)
We intentionally pushed multidump support. It passed QA, and everything seemed to be going swimmingly.
A Socorro user (Kairo) noticed that our crash volume had been lower than average for the last couple of days.
Investigation showed that many, many crashes were backed up in filesystem storage, and that HBase writes were giving frequent errors, meaning that the crashmovers were having trouble keeping up.
We decided to take one collector box at a time out of the pool, to allow it to catch up. We also noticed at this time that all the collectors were backed up except collector04, which was keeping up. This was a massive red herring as it later turned out. We ran around checking the config and build and netflows on collector04 were the same as on the other collectors. While we watched, collector04 gradually began backing up, and then was in the same boat as the others.
Based on previous experiences, many bad words were said about Thrift at this point. (If you don’t know Thrift, it’s a mechanism we use for talking to HBase. We use it because our code is in Python and not a JVM language, so we use Thrift as middleman.) But this was instinct, not empirical evidence, and therefore not useful for problem solving.
To actually diagnose the problem, we first tried strace-ing the crashmover process, and then decided to push an instrumented build to a single box. By “instrumented” I mean “it logs a lot”. As soon as we had the instrumentation in place, syslog began to tell a story. Each crash move was taking 4-5 seconds to complete. Our normal throughput on a single collector topped out at around 2800-3000 crashes/minute, so something was horribly wrong.
As it turned out the slow part was actually *deleting* the crashes from disk. That was consuming almost all of the 4-5 seconds.
While looking at the crashes on disk, trying to discern a pattern, we made an interesting discovery. Our filesystem code uses radix storage: files are distributed among directories on a YY/MM/DD/HH/MM/ basis. (There are also symlinks to access the crashes by the first digit of their hex OOID, or crash ID.) We discovered that instead of distributing crashes like this, all the crashes on each collector were in a directory named [YY]/[MM]/[DD]/00/00. Given the backlog, that meant that, on the worst collector, we had 750,000 crashes in a single directory, on ext4. What could possibly go wrong?
At this point we formed the hypothesis that deletes were taking so long because of the sheer number of files in a directory. (If there’s any kind of stat in the code - and strace showed there was - then this would perform poorly.)
We moved the crashes manually out of the way, as a test. This sped things up quite a bit.
We also noticed at this point that the 00/00 crashes had backed up on several days. We had some orphaned crashes on disk (a known bug, when multiple retries fail), and this was the pattern.
01/10/00/00 - a moderate number of crashes
01/17/00/00 - ditto
(same for each succeeding day)
01/22/00/00 - a huge number of crashes
These days correlated to the days we had multidump code running in production. We had kind of suspected that, but this was proof.
We rolled back a single collector to pre-multidump code, and it immediately resumed running at full speed. We then rolled back the remainder of the collectors, and took them out of the pool one at a time so they could catch up.
Somewhere during our investigation (my notes don’t show when) the intermittent failures from HBase had stopped.
By Saturday 1/26, we had caught up on the backlog. We had also by this time, discovered the code bug that wrote all files into a single directory, and patched it. (The filesystem code no long had access to the time, so all times were 00/00.)
We thought we were out of the woods, and scheduled a postmortem for 1/31. However, it wasn’t going to be that easy.
Act II: All this has happened before, and will happen again.
We ran backfill for our aggregate processing, in order to recalculate totals with the additional processed crashes included.
Our working hypothesis at this stage was as follows. An unknown event involving HBase connection outages (specifically on writes) had caused crashes to begin backing up, and then having a large number of crashes in a single directory made deletion slow. We still wanted to know what had caused the HBase issue, but there were two factors that we knew about. First, at the time of the problem, we had an outage on a single Region Server. This shouldn’t cause a problem, but the timing was suspicious. Secondly, we saw an increased number of errors from Thrift. This has happened periodically and is short-term solved by restarting Thrift. We believe it is partially caused by our code handling Thrift connections in a suboptimal way, something that is in the process of being solved by our intern.
Also on 1/28 we pushed a new build that incorporated the fix for the directory naming problem. (see https://github.com/mozilla/socorro/commit/9a376d8c1b2c9bf40b3b612661a971a311a9738c)
A big day. We had two things planned for this day: first, a postmortem for the multidump issue, and second, a PostgreSQL failover from our primary to secondary master so we could replace the disks with bigger ones.
Murphy, the god of outages, intermittent errors, and ironic timing, did not smile fondly upon us this day.
Crashes began backing up on collectors once again (see https://bugzilla.mozilla.org/show_bug.cgi?id=836845). We saw no HBase connection errors at this time, and hence realized at this point that we must have missed something. We rolled back to a pre-multidump build on collectors, and they immediately began catching up. We held off running backfill of aggregates at this time, because we wanted to go ahead with the failover. Disk was getting desperately short and we had already had to delay the failover once due to external factors.
We postponed the postmortem, because clearly we didn’t have a handle on the root cause(s) at this time.
We then discovered the cause. Multidump code was using remove() instead of the previously used quickDelete(), which was used to replace remove() a number of years ago because it was so slow. (a href=’https://bugzilla.mozilla.org/show_bug.cgi?id=836986)
We proceeded with the planned failover from master01 to master02, and replaced the disks in master01. Our plan was to maintain master02 as primary, with master01 replicating from master02. The failover went well, but the new disks for master01 turned out to be faulty, post-installation. We were now in a position where we no longer had a hot standby. Our disk vendor did not meet their SLA for replacement.
We ran backfill of aggregate reports, and from an end-user perspective everything was back to normal.
Replaced disks on master01 (again). These too had some errors but we managed to solve that.
Later, we pushed a new build that solved the quickDelete() issue. We were officially out of the woods.
Things that went well:
- The team, consisting of engineers, WebOps, and DCOps worked extremely well and constructively together.
- As a result of looking closely at our filesystem/HBase interactions, we tuned disk performance and ordered some SSDs which have effectively doubled performance since installation. Thrift appears to be the next bottleneck in the system.
Things we could have done better:
- Release management: we broke our RM process and that led us to accidentally ship the code prematurely.
- Not shipped broken code, you know, the usual. Although I do have to say this was more subtly broken than average. The preventative measures here would have been better in-code documentation in the old code (”Using quickDelete here instead of remove because remove performs badly.”) We did go through code review, unit and integration testing, and manual QA, as per usual, but given this code only performed poorly once other parts of the system showed degraded performance, this was hard to catch.
- Relying on end-user observation to discover how the system was broken. Monitoring can solve this.
Things we will change:
- Improvements to monitoring. We will now monitor the number of backed up crashes. It’s not a root cause monitor but an indicator of trouble somewhere in the system. We have a few others of these, and they are good catch-alls for things we haven’t thought to monitor yet. We are also working on better monitoring of Thrift errors using thresholds. Right now we consider a 1% error rate on Thrift connections normal, and support limited retries with exponential fallback. We want to alert if the percentage increases. We plan on doing more of these thresholded monitors by writing these errors to statsd, and pointing nagios at the rolling aggregates. This will also work for monitoring degraded performance over time.
- Improvements to our test and release cycles. We have seen a few times now an issue where when we get a feature to staging we decide it’s not ready to ship, and this involves git wrangling and introduces a level of human error. Our intention is to build out a set of “try” environments, that is parallel staging environments that run different branches from the repo.
I like disasters. They always lead to a better process and better code. Also, when the team works well together, it’s a positive trust-building and team-building experience. Much better than trust falls in my experience.
A final note
All of the troubleshooting was done with a remote team, working from various locations across North America, communicating via IRC and Vidyo. It works.
Thanks to everyone involved in troubleshooting this issue: Jake Maul, Selena Deckelmann, Rob Helmer, Chris Lonnen, Dumitru Gherman, and Lars Lohn.