Capability For Continuous Deployment

Continuous deployment is very buzzword-y right now. I have some really strong opinions about deployment (just ask James Socol or Erik Kastner, who have heard me ranting about this). Here’s what I think, in a nutshell:

You should build the capability for continuous deployment even if you never intend to do continuous deployment. The machinery is more important than your deployment velocity.

Let me take a step back and talk about deployment maturity.

Immature deployment

At the immature end, a developer or team works on a system that has no staging environment. Code goes from the development environment straight to production. (I’m not even going to talk about the situation where the development environment is production. I am fully aware that these still exist, from asking the audience questions to that effect in conference talks.) I’m also assuming, in this era of GitHub, that everybody is using version control.

(I want to point out that it’s the easy availability of services like GitHub that has enabled even tiny, disorganized teams to use version control. VC is ubiquitous now, and that is a huge blessing.)

This sample scenario is very common: work on dev, make a commit, push it out to prod. Usually, no ops team is involved, or even exists. This can work really well in an early stage company or project, especially if you’re pre-launch.

This team likely has no automation, and a variable number of tests (0+). Even if they have tests, they may have no coverage numbers and no continuous integration.

When you hear book authors, conference speakers or tech bloggers talk about the wonders of continuous deployment, this scenario is not what they are describing.

The machinery of continuous deployment

Recently in the Mozilla webdev team, we’ve had a bunch of conversations about CD. When we talked about what was needed to do this, I had a revelation.

Although we were choosing not to do CD on my team, we had in place every requirement that was needed:

  • Continuous integration with build-on-commit
  • Tests with good coverage, and a good feel for the holes in our coverage
  • A staging environment that reflects production – our stage environment is a scale model of production, with the same ratios between boxes
  • Managed configuration
  • Scripted deployment to a large number of machines

I realized then that the machinery for continuous deployment is different from the deployment velocity that you choose for your team. If we need to, we can make a commit and push it out inside of a couple of minutes, without breaking a sweat.

Why we don’t do continuous deployment on Socorro

We choose not to, except in urgent situations, for a few reasons:

  • We like to performance test our stuff, and we haven’t yet automated that
  • We like to have a human QA team test in addition to automated tests
  • We like to version our code and do “proper” releases because it’s open source and other people use our packages
  • A commit to one component of our system is often related to other commits in other components, which make more sense to ship as a bundle

Our process looks like this:

  • The “dev” environment runs trunk, and the “stage” environment runs a release branch.
  • On commit, Jenkins builds packages and deploys them to the appropriate environment.
  • To deploy to prod, we run a script that pulls the specified package from Jenkins and pushes it out to our production environment. We also tag the revision that was packaged for others to use, and for future reference. Note we are pushing the same package to prod that we pushed to stage, and stage reflects production.
  • If we need to change configuration for a release, we change it first in Puppet on staging, and then update production the same way.
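
The key property of this process is that prod receives exactly the package that was tested on stage. Here is a minimal Python sketch of that check, not our actual deploy script: the `Environment` class, the `promote` function, and the package name in the usage note are all hypothetical, and a real script would push files to machines rather than record a fingerprint in memory.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Environment:
    """A hypothetical deployment target that records what was pushed to it."""
    name: str
    deployed: dict = field(default_factory=dict)  # package name -> sha256 digest


def digest(package_bytes: bytes) -> str:
    """Fingerprint a built package so two pushes can be shown to use the same bits."""
    return hashlib.sha256(package_bytes).hexdigest()


def deploy(env: Environment, package_name: str, package_bytes: bytes) -> str:
    """Push a package to an environment and record its fingerprint."""
    env.deployed[package_name] = digest(package_bytes)
    return env.deployed[package_name]


def promote(stage: Environment, prod: Environment,
            package_name: str, package_bytes: bytes) -> str:
    """Push to prod only if the exact same package was already pushed to stage."""
    staged = stage.deployed.get(package_name)
    if staged is None:
        raise RuntimeError(f"{package_name} was never deployed to {stage.name}")
    if staged != digest(package_bytes):
        raise RuntimeError(f"{package_name} differs from the build tested on {stage.name}")
    return deploy(prod, package_name, package_bytes)
```

Calling `deploy(stage, "socorro-1.0", build)` followed by `promote(stage, prod, "socorro-1.0", build)` succeeds, while promoting a rebuilt package under the same name raises an error instead of silently shipping untested bits.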

We do intend to increase our deployment velocity in the coming months, but for us that’s not about the machinery of deployment, it’s about increasing our delivery velocity.

Delivery velocity is a different problem, which I’m wrestling with right now. We have a small team, and the work we’re trying to do tends not to come in small chunks but big ones, like a new report, or a system-wide change to the way we aggregate data (the thing we’re working on at the moment). It’s not that changes sit in trunk, waiting for a release. It’s more that it takes us a while to get something to a deployable stage. That is, deployment for us is not the blocker to getting features to our users faster.

Finally

It’s the same old theme you’ve seen ten times before on this blog: everybody’s environment is different, and continuous deployment may not be for everybody. On the other hand, the machinery for continuous deployment can be critical to making your life easier. Automating all this stuff certainly helps me sleep at night much better than I used to when we did manual pushes.

(Incidentally, I’d like to thank Rob Helmer and Justin Dow for automating our world: you couldn’t find better engineers to work with.)

All systems suck

I’ve been thinking a lot about this idea lately.  I’ve spent a lot of years as an engineer and consultant fixing other people’s systems that suck, writing my own systems that suck, and working on legacy systems, that, well, suck.

Don’t let anyone fool you.  All systems suck, to a greater or lesser extent.

If it’s an old system, there’s the part of the code that everybody is afraid to work on: the fragile code that is easier to replace than maintain or refactor.  Sometimes this seems hard, or nobody really understands it.  These parts of the code are almost always surrounded by an SEP field.  If you’re unfamiliar with the term, it means “Somebody Else’s Problem”.  Items with an SEP field are completely invisible to the average human.

New systems have the parts that haven’t been built yet, so you’ll hear things like “This will be so awesome once we build feature X”.  That sucks.

There’s also the prototype that made it into production, a common problem.  Something somebody knocked together over a weekend, whether it was because of lack of time, or because of their utter brilliance, is probably going to suck in ways you just haven’t worked out yet.

All systems, old and crufty or new and shiny, have bottlenecks, where a bottleneck is defined as the slow part, the part that will break first when the system is under excessive load.  This is also part of your system that sucks.

If someone claims their system has no bugs, I have news for you: their system sucks.  And they are overly optimistic (or naive).  (Possibly they just suck as an engineer, too.)

In our heads as engineers we have the Platonic Form of a system: the system that doesn’t suck, that never breaks, that runs perfectly and quietly without anyone looking at it.  We work tirelessly to make our systems approach that system.

Even if you produce this Platonically perfect system, it will begin to suck as soon as you release it.  As data grows and changes, there will start to be parts of the system that don’t work right, or that don’t work fast enough.  Users will find ways to make your system suck in ways you hadn’t even anticipated.  When you need to add features to your perfect system, they will detract from its perfection, and make it suck more.

Here’s the punchline: sucking is like scaling.  You just have to keep on top of it, keep fixing and refactoring and improving and rewriting as you go.  Sometimes you can manage the suck in a linear fashion with bug fixes and refactoring, and sometimes you need a phase change where you re-do parts or all of the system to recover from suckiness.

This mess is what makes engineering and ops different from pure mathematics.  Embrace the suck.  It’s what gets me up in the mornings.

Books change the world

Last week I read a tweet that really got my goat, so much so that I stewed on it all weekend. The author, who is someone from the tech/startup community, said, to Tim O’Reilly no less:

“No-one ever changed the world by writing books.”

This pushed my rant button.

I thought about mentioning some books that have changed the face of civilization. Religious books: the Bible, which defines the shape of many Western civilizations, and the equivalent books of religious law in other cultures. Science books: Copernicus’ “On the Revolutions of the Celestial Spheres”, which began the scientific revolution, and defined a heliocentric model of the universe. Newton’s “Philosophiæ Naturalis Principia Mathematica”, which outlines classical mechanics and gravitation. Einstein’s multitude of publications. Books on economics: Keynes, anyone? Feminism: Friedan’s “Feminine Mystique”. Political thought. Philosophy. Need I go on?

On a micro level, think about a book you read as a child, as a teenager, last year, last week, that changed the way you felt, gave you hope, gave you relaxation that you needed, or an escape from an unpleasant reality.

I could go on about world-changing books all day. Instead, I’m going to tell you a story about a very unexciting book. This book happens to be one that I wrote, on a subject that I’m passionate about. It’s a book on web development.

Now, this book is only a technical book. It won’t start any revolutions or cause any epiphanies, neither will it make you laugh or cry. My family won’t ever read it, and when non-technical people who are excited to discover I’m a published author hear the topic, their faces fall. It will never be on the New York Times list, or on Oprah, or be banned in countries with oppressive governments. It is a humble technical book.

This technical book has, however, sold quite well over the years. Many people have bought it (thank you), some have liked it, and some have reviewed it. Copies are in libraries, used as the prescribed texts in colleges, sold secondhand, and pirated as PDFs on the internet.

Hundreds of thousands of people have read this book, and, I hope, learned a little something about coding for the web.

Some of those people have probably gotten jobs as a result. Some might have graduated college. Some have built a personal website. Some might have gotten a promotion.

Out of those people, I venture that there have to be a hundred, perhaps more, perhaps less, who have started a company that does some kind of web development, whether it’s a consulting company or a startup. Maybe some of those companies got funded, maybe some were bootstrapped, maybe some were successful.

I wonder if that benchmark is something that the author of the tweet might value.

I hope it’s not too arrogant as an author to hope these things: that the books you write change someone’s life for the better, and in doing so change the world. I continue to believe this, and that is why I continue to write.

Socorro’s Community

This post originally appeared on the Mozilla WebDev blog.

As readers of my blog posts know, Socorro is Mozilla’s crash reporting system. All of the code for Socorro (the crash catcher) and Breakpad (the client side) is open source and available on Google Code.

Some other companies are starting to use Socorro to process crashes. In particular, we are seeing adoption in the gaming and music sectors - people who ship connected client software.

One of these companies is Valve Software, the makers of Half-Life, Left 4 Dead, and Portal, among other awesome games. Recently Elan Ruskin, a game developer at Valve, gave a talk at the Game Developers Conference about Valve’s use of crash analysis. His slides are up on his blog and are well worth a read.

If you’re thinking about trying Socorro, I’d encourage you to join the general discussion mailing list (or you can follow it on Google Groups). It’s very low traffic at present but I anticipate that it will grow as more people join.

Later in the year, we plan on hosting the inaugural Crash Summit at Mozilla, where we’ll talk about tools, crash analysis, and the future of crash reporting. Email me (laura at mozilla) if you’re interested in attending or would like to present. The event will be open to Mozillians and others. I’ll post updates on this blog as we develop the event.

Big Data at SXSW (begin shameless plug)

On Monday March 14, I’ll be one of the presenters at a SXSW workshop called “Big Data and APIs for PHP Developers”.

We’ll be talking about what Big Data is, how to work with it, Big Data APIs (how to design and implement your own, and how to consume them), data visualization, and the wonders of MapReduce. I’ll talk through a case study around Socorro: the nature of the data we have, how we manage it, and some of the challenges we have faced so far.

Workshops are new at SXSW. They are longer than the traditional panel - 2.5 hours - so we can actually get into some technical content. We plan on making our presentation a conversation about data, with plenty of war stories.

Hope to see you there!

Being Open

I was recently privileged to be invited to come and give a talk at AOL, on the work we do with Socorro, how Mozilla works, and what it means to be open.

The audience for the talk was a new group at AOL called the Technology Leadership Group. It consists of exceptional technical people - engineers and operational staff - from all parts of the organization, who have come together to form a group of thought leaders.

One of the first items on their agenda is, as Erynn Petersen, who looks after Developer and Open Source Evangelism, puts it: “how we scale, how we use data in new and interesting ways, and what it means to treat all of our projects as open source projects.” My task was partly to talk about how open source communities work, what the challenges are, and how AOL might go about becoming more open.

It’s amazing how things come full circle.

I think every person in the audience was a user of Open Source, and many of them were already Open Source contributors on a wide range of projects. Some had been around since the days when Netscape was acquired by AOL.

I’ll include the (limited) slides here, but the best part of the session in my opinion was the Q&A. We discussed some really interesting questions, and I’ll feature some of those here. (I want to note that I am paraphrasing/summarizing the questions as I remember them, and am not quoting any individual.)

Q: Some of our software and services are subscription-based. If we give that code away, we lose our competitive advantage - no one will pay for it anymore.

A: There are a bunch of viable business models that revolve around making money out of open source. The Mozilla model is fairly unusual in the space. The most common models are:

  • Selling support, training, or a built and bundled version of the software. This model is used by Red Hat, Canonical, Cloudera, and many others.
  • Dual licensing models. One version of the software is available under an open source license and another version is available with a commercial license for embedding. This is (or has been) the MySQL model.
  • Selling a hosted version of Open Source software as a service. This model is used by GitHub (git) and Automattic (WordPress), among others.
  • It’s also perfectly valid to make some of your software open and leave some proprietary. This is the model used by 37signals - they work on Ruby on Rails and sell SaaS such as Backpack and Basecamp.

Another point is that at Mozilla, our openness *is* our competitive advantage. Our users know that we have no secret agenda: we’re not in it for the money, but we’re also not in it to mine or exploit their personal data. We exist to care for the Open Web. There has been a lot of talk lately about this, best summarized by this statement, which you’ll see in blog posts and tweets from Mozillians:

Firefox answers to no-one but you.

Q: How do we get started? There’s a lot of code - how do we get past the cultural barriers of sharing it?

A: It’s easier to start open than to become open after the fact. However, it can be done - if it couldn’t be done Mozilla wouldn’t exist. Our organization was born from the opening of Netscape. A good number of people in the room were at AOL during the Netscape era, too, which must give them a sense of déjà vu. After I drafted this post, I revisited jwz’s blog post from those days about leaving the Mozilla project, and I recommend reading it, as it talks about a lot of the issues.

My answer is that there’s a lot to think about here:

  • What code are we going to make open source? Not everything has to be open source, and it doesn’t have to happen all at once. I suggest starting up a site and repository that projects can graduate to as they become ready for sharing. Here at Mozilla basically everything we work on is open source as a matter of principle (“open by default”), but some of it is more likely to be reused than other parts. Tools and libraries are a great starting point.
  • How will that code be licensed? This is partly a philosophical question and partly a legal question. Legal will need to examine the licensing and ownership status of existing code. You might want a contributors’ agreement for people to sign too. Licenses vary and the answer to this question is also dependent on the business model you want to use.
  • How will we share things other than the code? This includes bug reports, documentation, and so on.
  • How will the project be governed? If I want to submit a patch, how do I do that? Who decides if, when, and how that patch will be applied? There are various models for this ranging from the benevolent dictator model to the committee voting model.

I would advise starting with a single project and going from there.

Q: How will we build a community and encourage contributions?

A: This is a great question. We’re actually trying to answer this question on Socorro right now. Here’s what we are doing:

  • Set up paths of communication for the community: mailing lists, newsgroups, discussion forums
  • Make sure you have developer documentation as well as end user documentation
  • If the project is hard to install, consider providing a VM with everything already installed. (We plan to do this both for development and for users who have a small amount of data.)
  • Choose some bugs and mark them as “good first bug” in your bug tracking system.
  • Make the patch submission process transparent and documented.

There was a lot more discussion. I really enjoyed talking to such a smart and engaging group, and I wish AOL the very best in their open source initiative.

The new Socorro

(This post first appeared on the Mozilla Webdev Blog.)

As per my previous blog post, we migrated Socorro to the new data center in Phoenix successfully on Saturday.

This has been a mammoth exercise that sprang from two separate roots:

  • In June last year, we had a configuration problem that looked like a spike in crashes for a pre-release version of Firefox (3.6.4). This incident (known, inaccurately, as “the crash spike”) made it clear that crash-stats is on the critical path to shipping Firefox releases.
  • Near the end of Q3 we realized we were rapidly approaching the capacity of the current Socorro infrastructure.

Around that time we also had difficulty with Socorro stability. This spawned the creation of a master Socorro Stability Plan, with six key areas. I’ll talk about each of these in turn.

Improve stability

Here, we solved some HBase-related issues, upgraded our software, scheduled restarts three times a week and, most importantly, sat down to conduct an architectural review.

The specific HBase issue that we finally solved had to do with intra-cluster communication. (We needed to upgrade our Broadcom NIC drivers and kernel to solve a known issue when used with HBase. This problem surfaces as the number of TCP connections growing and growing until the box crashes. Solving this removed the need for constant system restarts.)

Architectural improvements

We collect incoming crashes directly to HBase. Because HBase is relatively new and we’d had stability problems, we determined that we should get HBase off the critical path for production uptime. We rewrote the collectors to fall back seamlessly to disk if HBase was unavailable, and optionally to use disk as primary storage. As part of this process, the system that moves crashes from disk to HBase was replaced. It went from single-threaded to multi-threaded, which makes playing catch-up after an HBase downtime much, much faster.
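
To make the fallback concrete, here is a minimal Python sketch of the idea, not the Socorro collector itself. `FlakyStore` is a hypothetical in-memory stand-in for HBase that can be switched off, `Collector.save` spools a crash to local disk when the primary store refuses the write, and `drain_spool` moves spooled crashes back using a thread pool. All class, function, and field names here are assumptions made for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary store when it cannot accept writes."""


class FlakyStore:
    """Stand-in for HBase: an in-memory dict that can be switched off."""

    def __init__(self):
        self.available = True
        self.rows = {}

    def put(self, crash_id, crash):
        if not self.available:
            raise PrimaryUnavailable(crash_id)
        self.rows[crash_id] = crash


class Collector:
    def __init__(self, primary, spool_dir):
        self.primary = primary
        self.spool_dir = spool_dir

    def save(self, crash_id, crash):
        """Try the primary store; on failure, spool the crash to local disk."""
        try:
            self.primary.put(crash_id, crash)
            return "primary"
        except PrimaryUnavailable:
            path = os.path.join(self.spool_dir, crash_id + ".json")
            with open(path, "w") as handle:
                json.dump(crash, handle)
            return "disk"


def drain_spool(primary, spool_dir, workers=4):
    """Move spooled crashes into the primary store using several threads."""
    def move(filename):
        path = os.path.join(spool_dir, filename)
        with open(path) as handle:
            crash = json.load(handle)
        primary.put(filename[:-len(".json")], crash)
        os.remove(path)  # delete only after the write has succeeded
        return filename

    names = [n for n in os.listdir(spool_dir) if n.endswith(".json")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(move, names))


def demo():
    """Walk through an outage: one crash spools to disk, then is drained."""
    with tempfile.TemporaryDirectory() as spool:
        store = FlakyStore()
        collector = Collector(store, spool)
        first = collector.save("a1", {"product": "Firefox"})
        store.available = False   # simulate an outage of the primary store
        second = collector.save("b2", {"product": "Thunderbird"})
        store.available = True    # the primary store comes back
        moved = drain_spool(store, spool)
        return first, second, moved, sorted(store.rows)
```

The point of the design is that the collector never refuses a crash: an outage in the primary store only changes where the crash lands, and the drain step catches up afterwards in parallel.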

We still want to put a write-through caching layer in front of HBase. This quarter, the Metrics team is prototyping a system for us to test, using Hazelcast.
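
For readers unfamiliar with the term, a write-through cache sends every write to the backing store and to the cache together, so reads can be served from the cache without it ever holding data the store lacks. The Python sketch below shows only the pattern, using plain dicts; it is not the Hazelcast prototype, and the `WriteThroughCache` class is a hypothetical name.

```python
class WriteThroughCache:
    """Minimal write-through cache in front of a dict-like backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # anything supporting store[key] get/set
        self.cache = {}

    def put(self, key, value):
        # Write to the backing store first, so the cache only ever holds
        # values that were actually stored.
        self.backing_store[key] = value
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store[key]  # raises KeyError on a true miss
        self.cache[key] = value          # remember it for the next read
        return value
```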

Build process

We now have an automated build system. When code is checked into svn, Hudson notices, creates a build and runs tests. This build is then deployed in staging. Now that we are on the new infrastructure, we will deploy new releases from Hudson as well.

Improved release practices

We have a greatly improved set of release guidelines, including writing a rollback plan for every major release. We did this as part of the migration, too. Developers now have read-only access to everything in production: we can audit configs and tail logs.

The biggest change here, though, is switching to Puppet to manage all of our configurations. Socorro has a lot of moving parts, each with its own configuration, and thanks to the fine work by the team, all of these configurations are automatically and centrally managed and can be changed and deployed at the push of a button.

Improved insight into systems

As part of the migration, we audited our monitoring systems. We now have many more Nagios monitors on all components, and have spent a lot of time tuning these. We also set up Ganglia and have a good feel for what the system looks like under normal load.

We still intend to build a better ops dashboard so we can see all of these checks and balances in one place.

Move to bigger, better hardware in PHX

This one is the doozy. Virtually all Socorro team resources have been on this task full time since October, and we managed to subsume half of the IT/Ops team as well. I’d like to give a special shout-out here to NetOps, who patiently helped with our many, many requests.

It’s worth noting that part of the challenge here is that Socorro has a fairly large number of moving parts. I present, for your perusal, a diagram of our current architecture.

Figure: Socorro Architecture, 1.7.6

I’ve blogged about our architecture before, and will do so again, soon, so this is just a teaser.

One of the most interesting parts of the migration was the extensive smoke and load testing we performed. We set up a farm of 40 SeaMicro nodes and used them to launch crashes at the new infrastructure. This allowed us to find network bottlenecks and misconfigurations and to tune the system, so that on the day of the migration we were really confident that things would go well. QA also really helped here: with the addition of a huge number of automated tests on the UI, we knew that things were looking great from an end user perspective.

The future

We now have a great infrastructure and set of processes to build on. The goal of Socorro - and Webtools in general - is to help Firefox ship as fast and seamlessly as possible, by providing information and tools that help to build the best possible browser. (I’ll talk more about this in future blog posts, too.)

Thanks

I’ll wrap up by thanking every member of the team that worked on the migration, in no particular order:

  • Justin Dow, Socorro Lead Sysadmin, Puppetmaster
  • Rob Helmer, Engineer, Smoke Tester Extraordinaire and Automater of Builds
  • Lars Lohn, Lead Engineer
  • Ryan Snyder, Web Engineer
  • Josh Berkus, PostgreSQL consultant DBA
  • Daniel Einspanjer, Metrics Architect
  • Xavier Stevens, HBase wizard
  • Corey Shields, Manager, Systems
  • Matthew Zeier, Director of Ops and IT
  • Ravi Pina, NetOps
  • Derek Moore, NetOps
  • Stephen Donner, WebQA lead
  • David Burns, Automated Tester
  • Justin Lazaro, Ops

Good job!
Mozilla Crash Reporting Team

PHP and Big Data

This post first appeared as part of the PHP Advent Calendar.

Big data, data science, analytics. These are some of the hottest buzzwords in tech right now. Five years ago, the boasting rights went to the geek with the largest number of users: these days he with the biggest data wins.

There are a number of approaches to dealing with vast quantities of data, but one of the best known is Apache Hadoop. Hadoop is a toolkit for managing large data sets, based originally on the Google whitepapers about MapReduce and the Google File System. For Socorro, the Mozilla crash reporting system, we use HBase, a non-relational (NoSQL) database built on the Hadoop ecosystem.

The Hadoop world is largely a Java world, since all the tools are written in Java. However, if you feel the same way about Java as Sean Coates, you should not lose hope. You, too, can use PHP to work with Hadoop.

Let’s start by understanding MapReduce. This is a framework for distributed processing of large datasets.

A MapReduce job consists of two pieces of code:

  • A Mapper, whose job is to map input key-value pairs to output key-value pairs.
  • A Reducer, which receives and collates results from Mappers.

More parts are needed to make this work:

  • An Input reader generates splits of data for each Mapper to work through.
  • A Partition function takes the output of Mappers and chooses a destination Reducer.
  • An Output writer takes the output of the Reducers and writes it to the Hadoop Distributed File System (HDFS).
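
To see how these pieces fit together, here is a small in-memory simulation in Python. It is only a sketch of the data flow: nothing here talks to Hadoop, each input line is treated as its own split, the result is collected into a dict instead of being written to HDFS, and the function names (`mapper`, `partition`, `reducer`, `run_job`) are hypothetical.

```python
import zlib
from collections import defaultdict


def mapper(line):
    """Map one input record to (word, 1) key-value pairs."""
    for word in line.lower().split():
        yield word, 1


def partition(key, num_reducers):
    """Choose a destination reducer; a stable hash keeps each key on one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Collate every value emitted for one key."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    # Input reader: here each line is its own split.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in mapper(line):
            buckets[partition(key, num_reducers)][key].append(value)
    # Output writer: gather reducer results into one dict rather than HDFS.
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            word, total = reducer(key, bucket[key])
            output[word] = total
    return output
```

For example, `run_job(["to be or not to be"])` counts "to" and "be" twice each and "or" and "not" once each, regardless of how many reducers the keys are spread across.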

In summary, the Mapper and Reducer are the core functionality of a MapReduce job. Now, let’s get set up to write a Mapper and Reducer against Hadoop with PHP.

Setting up Hadoop is a non-trivial task; luckily, a number of VMs are available to help. For this example, I am using the Training VM from Cloudera. (You’ll need VMware Player for Windows or Linux, or VMware Fusion for OS X to run this VM.)

Once you’ve started the VM, open up a terminal window. (This VM is Ubuntu based.)

The VM you have just installed comes with a sample data set of the complete works of Shakespeare. You’ll need to put these files into HDFS so that we can work with them. Run the following commands to put the files into HDFS:

cd ~/git/data
tar vzxf shakespeare.tar.gz
hadoop fs -put input /user/training/input

You can confirm this worked by viewing the files in the input directory on HDFS: hadoop fs -ls /user/training

Next, we need to create the mapper and reducer. To demonstrate these, we’ll reproduce what is often referred to as the canonical MapReduce example: word count.

You can find the Java version of this code in the Cloudera Hadoop Tutorial.

As you can see (if you know Java), the mapper reads words from input, and for each word it encounters, emits to standard output the word and the value 1 to indicate that the word has been encountered. The reducer takes output from mappers and aggregates it to produce a set of words and counts.

The easiest way to communicate from PHP to Hadoop and back again is using the Hadoop Streaming API. This expects mappers and reducers to use standard input and output as a pipe for communication.

This is how we write the word count mapper in PHP, which we’ll name mapper.php:

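#!/usr/bin/php
<?php

// Read input records (lines of text) from standard input.
$input = fopen("php://stdin", "r");

while (($line = fgets($input)) !== false) {
    $line = strtolower($line);
    // Split on runs of non-word characters, dropping the empty strings
    // that would otherwise be emitted as words.
    $words = preg_split("/\W+/", $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Emit a tab-delimited pair: the word and a count of 1.
        echo "$word\t1\n";
    }
}

fclose($input);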

We open standard input for reading a line at a time, split that line into an array along word boundaries using a regular expression, and emit output as the word encountered followed by a 1. (I delimited this with tabs, but you may use whatever you like.)

Now, here’s the reducer (reducer.php):

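#!/usr/bin/php
<?php

$input = fopen("php://stdin", "r");
$counts = array();

while (($line = fgets($input)) !== false) {
    // Each line is "word<TAB>count"; strip the trailing newline first.
    $tuple = explode("\t", rtrim($line, "\n"));
    if (count($tuple) < 2) {
        continue; // skip malformed lines
    }
    list($word, $count) = $tuple;
    // Initialize the counter the first time a word is seen.
    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word] += (int) $count;
}

fclose($input);

// Write each word and its total to standard output.
foreach ($counts as $word => $count) {
    echo "$word $count\n";
}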

Again, we read a line at a time from standard input, and summarize the results in an array. Finally, we write out the array to standard output.
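
The reducer above holds every word in memory until the end. The framework sorts mapper output by key before handing it to a reducer, so identical words arrive on adjacent lines and a reducer can emit each total as soon as the word changes. Here is a minimal sketch of that variation, written in Python rather than PHP purely for illustration; the function names are hypothetical and the input is assumed to be well-formed, tab-delimited, and already sorted.

```python
import io
from itertools import groupby


def parse(stream):
    """Yield (word, count) pairs from tab-delimited mapper output."""
    for line in stream:
        word, _, count = line.rstrip("\n").partition("\t")
        if word:
            yield word, int(count)


def reduce_sorted(stream):
    """Sum counts per word, assuming identical words arrive on adjacent lines."""
    for word, group in groupby(parse(stream), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Stand-in for sorted mapper output arriving on standard input.
    sample = io.StringIO("be\t1\nbe\t1\nnot\t1\nto\t1\nto\t1\n")
    for word, total in reduce_sorted(sample):
        print(f"{word} {total}")
```

In a real streaming job you would pass `sys.stdin` instead of the sample. Memory use then depends on the longest run of a single word rather than on the size of the vocabulary.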

Copy these scripts to your VM, and once you have saved them, make them executable:

chmod a+x mapper.php
chmod a+x reducer.php

You can run this example code in the VM using the following command:

hadoop \
jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+320.jar \
-mapper mapper.php \
-reducer reducer.php \
-input input \
-output wordcount-php-output

In the output from the command you will see a URL where you can trace the execution of your MapReduce job in a web browser as it runs. Once the job has finished running, you can view the output in the location you specified:

hadoop fs \
-ls /user/training/wordcount-php-output

You should see something like:

Found 2 items
drwxr-xr-x   - training supergroup          0 2010-12-14 15:40 /user/training/wordcount-php-output/_logs
-rw-r--r--   1 training supergroup     279706 2010-12-14 15:40 /user/training/wordcount-php-output/part-00000

You can view the output, too:

hadoop fs \
-cat /user/training/wordcount-php-output/part-00000

An excerpt from the output should look like this:

yeoman 13
yeomen 1
yerk 2
yes 211
yest 1
yesterday 25

This is a pretty trivial example, but once you have this set up and running, it’s easy to extend this to whatever you need to do. Some examples of the kinds of things you can use it for are inverted index construction, machine learning algorithms, and graph traversal. The data you can transform is limited only by your imagination, and, of course, the size of your Hadoop cluster. That’s a topic for another day.

The future of crash reporting

This post first appeared in the Mozilla Webdev Blog on August 5 2010.

In recent blog posts I’ve talked about our plans for Socorro and our move to HBase.

Today, I’d like to invite community feedback on the draft of our plans for Socorro 2.0. In summary, we have been moving our data into HBase, the Hadoop database. In 1.7 we began exclusively using HBase for crash storage. In 1.8 we will move the processors and minidump_stackwalk to Hadoop.

Here comes the future

In 1.9, we will enable pulling data from HBase for the webapp via a web services layer. This layer is also known as “the pythonic middleware layer”. (Nominations for a catchier name are open. My suggestion of calling it “hoopsnake” was not well received.)

In 2.0 we will expose HBase functionality to the end user. We also have a number of other improvements planned for the 2.x releases, including:

  • Full text search of crashes
  • Faceted search
  • Ability for users to run MapReduce jobs from the webapp
  • Better visibility for explosive and critical crashes
  • Better post-crash user engagement via email

Full details can be found in the draft PRD. If you prefer the visual approach you can read the slides I presented at the Mozilla Summit last month.

Give us feedback!

We welcome all feedback from the community of users - please take a look and let us know what we’re missing. We’re also really interested in feedback about the best order in which to implement the planned features.

You can send your feedback to laura at mozilla dot com - I look forward to reading it.

Moving Socorro to HBase

This post first appeared in the Mozilla Webdev Blog on July 26 2010.

We’ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I’ll be catching up on what we’ve been doing in a series of blog posts. If you’re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey. You can see the output of the system at http://crash-stats.mozilla.com. The project’s code is also being used by people outside Mozilla: most recently Vigil Games are using it to catch crashes from Warhammer 40,000: Dark Millennium Online.

Back in June we launched Socorro 1.7, and we’re now approaching the release of 1.8. In this post, I’ll review what each of these releases represents on our roadmap.

First, a bit of history on data storage in Socorro. Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS). Because of capacity constraints, the collector follows a set of throttling rules in its configuration file in order to make a decision about how to disseminate crashes. Most crashes go to deferred storage and are not processed unless specifically requested. However, some crashes are queued into standard storage for processing. Generally this has been all crashes from alpha, beta, release candidate and other “special” versions; all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes. (Recently this has been between ten and fifteen percent.)
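
The throttling decision described above can be sketched in a few lines of Python. This is an illustration of the rules as stated, not the collector's actual configuration format: the crash fields (`channel`, `comment`, `product`), the constant names, and the `throttle` function are hypothetical, and the sample rate is set to ten percent only because that is the low end of the range mentioned.

```python
import random

LOW_VOLUME_PRODUCTS = {"Thunderbird", "Camino"}
SPECIAL_CHANNELS = {"alpha", "beta", "rc"}
SAMPLE_PERCENT = 10  # the post mentions between ten and fifteen percent


def throttle(crash, rng=random):
    """Return "standard" to queue a crash for processing, else "deferred"."""
    # Pre-release and other special versions are always processed.
    if crash.get("channel") in SPECIAL_CHANNELS:
        return "standard"
    # So is any crash where the user took the time to leave a comment.
    if crash.get("comment"):
        return "standard"
    # Low-volume products are cheap enough to process in full.
    if crash.get("product") in LOW_VOLUME_PRODUCTS:
        return "standard"
    # Everything else is sampled at a fixed percentage.
    if rng.uniform(0, 100) < SAMPLE_PERCENT:
        return "standard"
    return "deferred"
```

The rules are checked from most to least specific, so the random sample only applies to crashes that no earlier rule has claimed. Passing `rng` as a parameter makes the sampling step easy to test with a fixed value.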

The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage. We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.

For some time prior to 1.7, we’d been storing all crashes in HBase in parallel with writing them into NFS. The main goal of 1.7 was to make HBase our chief storage mechanism. This involved rewriting the collector and processor to write into HBase. The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing. Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.

Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release. This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.

During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version. We are now using Cloudera’s CDH2 Hadoop distribution and HBase 0.20.5.

The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.

We’re now in the final throes of 1.8. While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available. In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and move to run on the Hadoop nodes. This means that we will be able to horizontally scale the number of processors with the size of the data. Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.

Some of the associated changes in 1.8 are also really exciting. We are introducing a new component to the system, called the registrar. This process will track heartbeats for each of the processors. Also in this version, we have added an introspection API for the processors. The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors. We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.
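
As a rough picture of what the registrar does, here is a minimal Python sketch. It is not the 1.8 implementation: the `Registrar` class, its method names, the 30-second timeout, and the shape of the status dictionary are all assumptions. Each processor registers a callable that reports its own statistics, sends heartbeats, and is treated as dead once it has been quiet for longer than the timeout.

```python
import time


class Registrar:
    """Tracks the last heartbeat from each processor and proxies status requests."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}   # processor name -> time of last heartbeat
        self.status_fns = {}  # processor name -> callable returning its stats

    def register(self, name, status_fn):
        """Add a processor; registering counts as its first heartbeat."""
        self.status_fns[name] = status_fn
        self.heartbeat(name)

    def heartbeat(self, name):
        self.last_seen[name] = self.clock()

    def is_alive(self, name):
        seen = self.last_seen.get(name)
        return seen is not None and self.clock() - seen <= self.timeout_seconds

    def status(self, name):
        """Proxy an introspection request, flagging processors that went quiet."""
        if not self.is_alive(name):
            return {"processor": name, "alive": False}
        return {"processor": name, "alive": True, **self.status_fns[name]()}
```

A status page built on something like this asks the registrar about every processor, rather than contacting each processor directly, and gets a uniform answer even for processors that have stopped responding.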

Update: we’re frozen on 1.8 and expect release later this month.