
The future of crash reporting

This post first appeared in the Mozilla Webdev Blog on August 5 2010.

In recent blog posts I’ve talked about our plans for Socorro and our move to HBase.

Today, I’d like to invite community feedback on the draft of our plans for Socorro 2.0. In summary, we have been moving our data into HBase, the Hadoop database. In 1.7 we began exclusively using HBase for crash storage. In 1.8 we will move the processors and minidump_stackwalk to Hadoop.

Here comes the future

In 1.9, we will enable pulling data from HBase for the webapp via a web services layer. This layer is also known as “the pythonic middleware layer”. (Nominations for a catchier name are open. My suggestion of calling it “hoopsnake” was not well received.)

In 2.0 we will expose HBase functionality to the end user. We also have a number of other improvements planned for the 2.x releases, including:

  • Full text search of crashes
  • Faceted search
  • Ability for users to run MapReduce jobs from the webapp
  • Better visibility for explosive and critical crashes
  • Better post-crash user engagement via email

Full details can be found in the draft PRD. If you prefer the visual approach you can read the slides I presented at the Mozilla Summit last month.

Give us feedback!

We welcome all feedback from the community of users - please take a look and let us know what we’re missing. We’re also really interested in feedback about the best order in which to implement the planned features.

You can send your feedback to laura at mozilla dot com - I look forward to reading it.

Moving Socorro to HBase

This post first appeared in the Mozilla Webdev Blog on July 26 2010.

We’ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I’ll be catching up on what we’ve been doing in a series of blog posts. If you’re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey. You can see the output of the system at http://crash-stats.mozilla.com. The project’s code is also being used by people outside Mozilla: most recently Vigil Games are using it to catch crashes from Warhammer 40,000: Dark Millennium Online.

Back in June we launched Socorro 1.7, and we’re now approaching the release of 1.8. In this post, I’ll review what each of these releases represents on our roadmap.

First, a bit of history on data storage in Socorro. Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS). Because of capacity constraints, the collector follows a set of throttling rules in its configuration file to decide where each crash should go. Most crashes go to deferred storage and are not processed unless specifically requested. However, some crashes are queued into standard storage for processing. Generally this has been all crashes from alpha, beta, release candidate and other “special” versions; all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes. (Recently this has been between ten and fifteen percent.)
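To make the throttling decision concrete, here is a minimal Python sketch of rules like the ones described above. It is an illustration only, not Socorro’s actual configuration format: the `Crash` fields, the version markers, the `choose_storage` name, and the 10% sample rate are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Crash:
    # Hypothetical, simplified crash record for this example.
    product: str
    version: str
    comment: str = ""


# Assumed values for illustration; the real rules live in the
# collector's configuration file.
LOW_VOLUME_PRODUCTS = {"Thunderbird", "Camino"}
PRERELEASE_MARKERS = ("a", "b", "rc")  # e.g. 4.0a1, 4.0b3, 3.6rc1
SAMPLE_RATE = 0.10  # the post mentions ten to fifteen percent


def is_prerelease(version):
    """Treat any version string containing a pre-release marker as 'special'."""
    return any(marker in version for marker in PRERELEASE_MARKERS)


def choose_storage(crash, rng=random.random):
    """Return 'standard' (queued for processing) or 'deferred'."""
    if is_prerelease(crash.version):
        return "standard"
    if crash.comment.strip():
        return "standard"
    if crash.product in LOW_VOLUME_PRODUCTS:
        return "standard"
    # Everything else: keep only a sampled percentage.
    if rng() < SAMPLE_RATE:
        return "standard"
    return "deferred"
```

Passing the random source in as `rng` keeps the sampling rule easy to test with a fixed value.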

The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage. We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.

For some time prior to 1.7, we’d been storing all crashes in HBase in parallel with writing them into NFS. The main goal of 1.7 was to make HBase our chief storage mechanism. This involved rewriting the collector and processor to write into HBase. The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing. Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.

Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release. This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.
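The staged switchover amounts to a dual write guarded by a configuration flag. The sketch below shows the idea in Python under stated assumptions: `DictStore`, `save_processed_crash`, and the `use_fallback` flag are hypothetical names for this example, and the in-memory stores stand in for HBase and NFS rather than reproducing the real processor code.

```python
class DictStore:
    """Stand-in for a storage backend; keeps crashes in memory."""

    def __init__(self):
        self.crashes = {}

    def save(self, crash_id, processed_crash):
        self.crashes[crash_id] = processed_crash


def save_processed_crash(crash_id, processed_crash, primary,
                         fallback=None, use_fallback=False):
    """Always write to the primary store; optionally mirror to the fallback."""
    primary.save(crash_id, processed_crash)
    if use_fallback and fallback is not None:
        fallback.save(crash_id, processed_crash)


# Usage: hbase and nfs are hypothetical stand-ins for the two backends.
hbase, nfs = DictStore(), DictStore()
save_processed_crash("crash-1", {"signature": "example"}, hbase, nfs,
                     use_fallback=True)   # during the staged switchover
save_processed_crash("crash-2", {"signature": "example"}, hbase, nfs,
                     use_fallback=False)  # once the fallback is turned off
```

With the flag off, only the primary store receives new crashes, which is the state we reached once we were confident in HBase.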

During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version. We are now using Cloudera’s CDH2 Hadoop distribution and HBase 0.20.5.

The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.

We’re now in the final throes of 1.8. While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available. In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and moved onto the Hadoop nodes. This means that we will be able to horizontally scale the number of processors with the size of the data. Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.

Some of the associated changes in 1.8 are also really exciting. We are introducing a new component to the system, called the registrar. This process will track heartbeats for each of the processors. Also in this version, we have added an introspection API for the processors. The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors. We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.
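As a rough illustration of heartbeat tracking, here is a small Python sketch. It is not the registrar’s real interface: the `Registrar` class, its method names, and the 60 second timeout are assumptions for this example. It also simplifies the design by having processors push their statistics along with each heartbeat, whereas the real registrar proxies requests to the processors’ introspection API.

```python
import time


class Registrar:
    """Tracks the last heartbeat seen from each processor (hypothetical API)."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed value
        self.clock = clock
        self.last_seen = {}
        self.stats = {}

    def heartbeat(self, processor_id, stats=None):
        """Record that a processor is alive, with optional statistics."""
        self.last_seen[processor_id] = self.clock()
        if stats is not None:
            self.stats[processor_id] = stats

    def status(self):
        """Report, per processor, whether it is alive and its latest stats."""
        now = self.clock()
        return {
            pid: {
                "alive": now - seen <= self.timeout_seconds,
                "stats": self.stats.get(pid, {}),
            }
            for pid, seen in self.last_seen.items()
        }
```

A status page could then render the dictionary returned by `status()`; injecting `clock` makes the timeout logic testable without waiting.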

Update: we’re frozen on 1.8 and expect release later this month.