The future of crash reporting

This post first appeared in the Mozilla Webdev Blog on August 5 2010.

In recent blog posts I’ve talked about our plans for Socorro and our move to HBase.

Today, I’d like to invite community feedback on the draft of our plans for Socorro 2.0. In summary, we have been moving our data into HBase, the Hadoop database. In 1.7 we began exclusively using HBase for crash storage. In 1.8 we will move the processors and minidump_stackwalk to Hadoop.

Here comes the future

In 1.9, we will enable pulling data from HBase for the webapp via a web services layer. This layer is also known as “the pythonic middleware layer”. (Nominations for a catchier name are open. My suggestion of calling it “hoopsnake” was not well received.)

In 2.0 we will expose HBase functionality to the end user. We also have a number of other improvements planned for the 2.x releases, including:

  • Full text search of crashes
  • Faceted search
  • Ability for users to run MapReduce jobs from the webapp
  • Better visibility for explosive and critical crashes
  • Better post-crash user engagement via email

Full details can be found in the draft PRD. If you prefer the visual approach you can read the slides I presented at the Mozilla Summit last month.

Give us feedback!

We welcome all feedback from the community of users - please take a look and let us know what we’re missing. We’re also really interested in feedback about the best order in which to implement the planned features.

You can send your feedback to laura at mozilla dot com - I look forward to reading it.

Moving Socorro to HBase

This post first appeared in the Mozilla Webdev Blog on July 26 2010.

We’ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I’ll be catching up on what we’ve been doing in a series of blog posts. If you’re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey. You can see the output of the system at http://crash-stats.mozilla.com. The project’s code is also being used by people outside Mozilla: most recently Vigil Games are using it to catch crashes from Warhammer 40,000: Dark Millennium Online.

Back in June we launched Socorro 1.7, and we’re now approaching the release of 1.8. In this post, I’ll review what each of these releases represents on our roadmap.

First, a bit of history on data storage in Socorro. Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS). Because of capacity constraints, the collector follows a set of throttling rules in its configuration file in order to make a decision about how to disseminate crashes. Most crashes go to deferred storage and are not processed unless specifically requested. However, some crashes are queued into standard storage for processing. Generally this has been all crashes from alpha, beta, release candidate and other “special” versions; all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes. (Recently this has been between ten and fifteen percent.)
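
To make the throttling rules a little more concrete, here is a minimal sketch in Python. It is not Socorro's actual configuration format: the field names (`product`, `version`, `comment`), the pre-release markers, and the function name are all assumptions made for illustration, and the sample rate is simply the upper end of the range quoted above.

```python
import random

# Hypothetical values for illustration; not Socorro's real configuration.
LOW_VOLUME_PRODUCTS = {"Thunderbird", "Camino"}
PRERELEASE_MARKERS = ("a", "b", "rc", "pre")  # alpha, beta, release candidate, nightly
SAMPLE_RATE = 0.15  # "between ten and fifteen percent"


def choose_storage(crash, rng=random.random):
    """Return "standard" (queued for processing) or "deferred"."""
    version = crash.get("version", "")
    # Rule 1: alpha, beta, release candidate and other "special" versions.
    if any(marker in version for marker in PRERELEASE_MARKERS):
        return "standard"
    # Rule 2: any crash that carries a user comment.
    if crash.get("comment"):
        return "standard"
    # Rule 3: low volume products.
    if crash.get("product") in LOW_VOLUME_PRODUCTS:
        return "standard"
    # Rule 4: a fixed percentage of everything else.
    if rng() < SAMPLE_RATE:
        return "standard"
    return "deferred"
```

The rules are checked in order and the first match wins, so the random sample only applies to crashes that none of the earlier rules picked up.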

The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage. We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.

For some time prior to 1.7, we’d been storing all crashes in HBase in parallel with writing them into NFS. The main goal of 1.7 was to make HBase our chief storage mechanism. This involved rewriting the collector and processor to write into HBase. The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing. Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.

Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release. This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.
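
The idea behind the fallback option can be sketched in a few lines of Python. This is an illustration only: the function name and the `fallback_enabled` flag are hypothetical, and plain dictionaries stand in for HBase and NFS.

```python
def store_processed_crash(crash_id, processed, primary, fallback=None,
                          fallback_enabled=False):
    """Write a processed crash to primary storage, and optionally to a fallback."""
    # The primary store (HBase in the post) always receives the crash.
    primary[crash_id] = processed
    # While the hypothetical flag is on, also keep a copy in the old store (NFS).
    if fallback_enabled and fallback is not None:
        fallback[crash_id] = processed


hbase, nfs = {}, {}
store_processed_crash("crash-1", {"signature": "example"}, hbase, nfs,
                      fallback_enabled=True)
store_processed_crash("crash-2", {"signature": "example"}, hbase, nfs,
                      fallback_enabled=False)
```

With the flag on, both stores receive the crash; turning it off leaves only the primary write, which is the switch a staged cutover needs.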

During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version. We are now using Cloudera’s CDH2 Hadoop distribution and HBase 0.20.5.

The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.

We’re now in the final throes of 1.8. While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available. In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and move to run on the Hadoop nodes. This means that we will be able to horizontally scale the number of processors with the size of the data. Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.

Some of the associated changes in 1.8 are also really exciting. We are introducing a new component to the system, called the registrar. This process will track heartbeats for each of the processors. Also in this version, we have added an introspection API for the processors. The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors. We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.
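
As a rough illustration of heartbeat tracking, here is a small Python sketch. It is not the registrar's real interface: the class layout, the method names, and the 60 second timeout are assumptions chosen for the example.

```python
import time


class Registrar:
    """Tracks the last heartbeat seen from each processor (illustrative only)."""

    def __init__(self, timeout_seconds=60.0, clock=time.time):
        self.timeout_seconds = timeout_seconds  # hypothetical default
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, processor_id):
        # Record the time at which this processor last checked in.
        self.last_seen[processor_id] = self.clock()

    def status(self):
        # A processor is "alive" if it checked in within the timeout window.
        now = self.clock()
        return {
            pid: ("alive" if now - seen <= self.timeout_seconds else "missing")
            for pid, seen in self.last_seen.items()
        }
```

Passing the clock in as a parameter keeps the sketch easy to test, since a fake clock can stand in for `time.time`.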

Update: we’re frozen on 1.8 and expect release later this month.

Socorro: Mozilla’s Crash Reporting System

(Cross-posted from the Mozilla WebDev blog.)

Recently, we’ve been working on planning out the future of Socorro.  If you’re not familiar with it, Socorro is Mozilla’s crash reporting system.

You may have noticed that Firefox has become a lot less crashy recently - we’ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers to find, diagnose, and fix the most common crashes, so crash reporting is critical to these improvements.

On our peak day each week we receive 2.5 million crash reports and process 15% of those, for a total of 50 GB.  In total, we receive around 320 GB each day!  Right now we are handicapped by the limitations of our file system storage (NFS) and our database’s ability to handle really large tables.   However, we are in the process of moving to Hadoop, and currently all our crashes are also being written to HBase.  Soon this will become our main data storage, and we’ll be able to do a lot more interesting things with the data.  We’ll also be able to process 100% of crashes.  We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.
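
For a rough sense of scale, the figures above can be checked with a few lines of Python. Two assumptions are mine rather than stated facts: that the 320 GB figure applies to the peak day, and that units are decimal (1 GB = 1,000,000 KB).

```python
reports_per_peak_day = 2_500_000
processed_fraction = 0.15
received_gb = 320  # assumed to be the peak-day volume

# Number of reports that make it through the 15% throttle.
processed_reports = round(reports_per_peak_day * processed_fraction)

# Volume of data processed, in GB.
processed_gb = received_gb * processed_fraction

# Average size of a single submitted report, in KB (decimal units assumed).
average_report_kb = received_gb * 1_000_000 / reports_per_peak_day

print(processed_reports, processed_gb, average_report_kb)
```

Under those assumptions, about 375,000 reports are processed on a peak day, amounting to roughly 48 GB, which is in line with the 50 GB quoted, and the average report is around 128 KB.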

I’ll start by taking a look at how things have worked to date.

History of Crash Reporting

Current Socorro Architecture

The data flows as follows:

  • When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla’s end, this is where Socorro comes into play.
  • Crashes are submitted to the collector, which writes them to storage.
  • The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)
  • Processors pick up crashes and process them.  A processor gets its next job from a queue in our database and invokes minidump_stackwalk (a part of Breakpad), which combines the crash with symbols where available.  The results are written back into the database.  Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.
  • Finally, the data is available to Firefox and Platform engineers (and anyone else that is interested) via the webui, at http://crash-stats.mozilla.com
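
The flow above can be sketched as a toy pipeline in Python. Everything here is a stand-in: dictionaries and a deque replace NFS and the database, the function names are hypothetical, and `fake_stackwalk` is only a placeholder that does not call the real minidump_stackwalk.

```python
from collections import deque

raw_storage = {}     # stands in for file system storage (NFS)
job_queue = deque()  # stands in for the job queue held in the database
processed = {}       # stands in for the processed-crash tables


def collect(crash_id, dump):
    """Collector: write the submitted crash to storage."""
    raw_storage[crash_id] = dump


def monitor(should_process):
    """Monitor: queue some of the stored crashes for processing."""
    for crash_id in raw_storage:
        already_handled = crash_id in processed or crash_id in job_queue
        if not already_handled and should_process(crash_id):
            job_queue.append(crash_id)


def fake_stackwalk(dump):
    """Placeholder for minidump_stackwalk; real processing is far richer."""
    return {"signature": dump.upper()}


def process_next():
    """Processor: take the next job, process it, and write back the result."""
    if not job_queue:
        return None
    crash_id = job_queue.popleft()
    processed[crash_id] = fake_stackwalk(raw_storage[crash_id])
    return crash_id
```

The `should_process` callback is where throttling would plug in: the monitor queues only the crashes for which it returns true, and the rest stay in storage until someone asks for them.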

Implementation Details

  • The collector, processor, monitor and cron jobs are all written in Python.
  • Crashes are currently stored in NFS, and processed crash information in a PostgreSQL database.
  • The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.

Roadmap

Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.

  • 1.6 - 1.6.3 (in production)

    The current production version is 1.6.3, which was released last Wednesday.  We don’t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.

    When an OOPP becomes unresponsive, a pair of twin crashes is generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.

    You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that’s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.

    If you are viewing one half of a hang pair for a pre-release or beta product, you’ll find a link to the other half at the top right of the report.

    You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.

  • 1.7 (Q2)

    We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.

  • 1.8 (Q2)

    In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.

    New Socorro Architecture

    With this release, the 15% throttling of Firefox release channel crashes goes away entirely.

  • 2.0 (Q3 2010)

    You may have noticed 1.9 is missing.  In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.

    Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven’t scheduled you for an interview and you think we ought to talk to you, please let us know.

Features under consideration

  • Full text search of crashes
  • Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.
    Which of these crashes involved a particular extension or plugin?  Which ones occurred within a short time after startup?
  • The ability to write and run your own Map/Reduce jobs (training will be provided)
  • Detection of “explosive crashes” that appear quickly
  • Viewing crashes by “build time” instead of clock time
  • Classification of crashes by component

This is a big list, obviously. We need your feedback - what should we work on first?

One thing that we’ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!

How to get involved

As always, we welcome feedback and input on our plans.

You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.

In addition, we always welcome contributions.  You can find our code repository at
http://code.google.com/p/socorro/

We hold project meetings on a Wednesday afternoon - details and agendas are here
https://wiki.mozilla.org/Breakpad/Status_Meetings

Parenting Versus Programming

This post was written for and first appeared in the PHP Advent Calendar 2009.

Advent calendars are about Christmas, and for me Christmas has always been a time for family. This year I have recently joined the ranks of the parents among you. I am taking a short break from work and focusing on being a mother rather than being a programmer. This has led me to reflect on the similarities between parenting and coding. I present these here for your enlightenment, or so you can laugh at me.

Lesson One: Smells

Babies, like programs, are associated with a variety of interesting smells. If you try to ignore the smells associated with babies, they only get worse over time. Then the screaming starts.

It was Martin Fowler who popularized the notion of code smells. These are defined as parts of your code that do things in an ugly way, or to put it a different way, they are hacks. Typically, when we find these parts of code, our eyes begin to glaze over, and we enter a strange, zombie-like state. (This will also be familiar to parents.) In this state, we are paralyzed and thus prevented from doing anything about the smell. It is only when we emerge from the cave of code smells into the daylight of clean code that we can again be productive. I am not sure whether it is the horror of the code, a tendency to procrastinate that is endemic among programmers, or simply a fear of breaking things that prevents most people from doing something about the mess. It is always the smelliest parts of the code that are the most fragile.

The lesson programmers can learn from babies, here, is to face your smells and get rid of them as quickly as possible. Then everybody’s happy.

Lesson Two: Sleep

I have now been programming for mumble years. Not long after I started work at Mozilla, a delightful Canadian journalist asked me what it was like to work with a group of people “so much younger than myself.” While I am not actually that old, I am old enough to have picked up some skills. One of these skills is the ability to survive on 4 to 6 hours of sleep a night for extended periods. Although this is not my favorite hobby, I have gained a certain level of mastery.

Another thing I’ve learned is that surviving on 4 to 6 hours of sleep a night for an extended period makes you mildly deranged. With parenting, as with programming, it is sometimes required. When you achieve a certain level of derangement, you find that very silly things start to sound like good ideas, or they just start to happen, whether you intended them to or not. For example, you find yourself accidentally filing the baby in the filing cabinet or implementing a new framework.

This is a lesson both programmers and parents can learn from the military: sleep when you can do so safely, and as often as possible. If you can snatch a few minutes of nap here and there, it may not make you feel a lot better, but your degraded IQ will recover somewhat.

Lesson Three: With Great Power Comes Great Responsibility

When you set out to implement a new web app or raise a child — and yes, I do realize these things are not quite on the same scale — you have a great responsibility to do a good job. Otherwise, everyone who has to interact with your app or child in the future will curse your name. Repeatedly. Your baby — whether it is human or code — is totally dependent upon you to do a good job.

Lesson Four: It Takes a Village

On that note, it is important to realize that it is very hard to raise a child or write a web app completely on your own. Some things are better done with a little help from your friends. Whether that help is providing a role model, a shoulder to cry on, advice when you just don’t know what to do, or purely someone to vent to, it really helps to surround yourself with people you can rely on.

That summarizes the commonalities between parenting and programming that I have learned over the last 7 weeks. I suspect I have a great deal more to learn.

One final note: If anyone approaches you to work on a web app that will take 18 years, run away as fast as you can, and do not look back.

Seven Things

I feel like I’m about last to the party, but after getting tagged by Ben Ramsey I thought I’d contribute to this meme/tag/whatever that’s going around the PHP community.  I intend to blog more in 2009, so this is a starting point anyway.

Seven Things you didn’t know about me:

1.   I learned to program in the 4th grade on an Apple II, in LOGO.  A high school near me had a program where a group of selected nerds from local schools would go there once a week and share two machines.  I was the only one in primary (elementary) school.

2. I used to do a lot of singing - school choir, madrigals, musicals, an a cappella group with my friends, and a youth gospel band.  I sing alto.

3. I’m not religious and consider myself a (non militant) atheist.  We were raised that way since my family is a combination of Catholic, Jewish and Presbyterian and my parents didn’t believe in organized religion.  Despite this I went to an Anglican girls’ school for 11 years (also see the above mentioned gospel band) and am technically Jewish.

4. I met my husband Luke Welling in Advanced Software Engineering at college (RMIT University in Melbourne, Australia).  We had to do a review of each other’s code as part of an assignment.  The first words he ever said to me were, “This is crap.”

5.  I moved house every 1-2 years when I was a kid, left school at 16, went back at 19, and then spent far too much time nerding out at college, which I loved.

6.  I’ve been riding horses since I was four years old.  At four, I rode my Shetland pony Froggy through the house to annoy my mother.  (I believe I succeeded.)

7.  I sold my first article to a magazine when I was about 13 years old.  It was a light humorous piece about the challenges involved in buying a horse and was printed in a national horse magazine.  (I always wanted to be a writer when I grew up…or possibly a veterinarian…or maybe a secret agent.  I never thought it would end up being tech books that I wrote.)  I have also written one complete bad novel, and the beginnings of several others.  (Nanowrimo, I’m looking at you.)

OK, now here are my tag-ees.

Tag-wise, here are the rules:

  • Link your original tagger(s), and list these rules on your blog.
  • Share seven facts about yourself in the post—some random, some weird.
  • Tag seven people at the end of your post by leaving their names and the links to their blogs.
  • Let them know they’ve been tagged by leaving a comment on their blogs and/or Twitter.

A Year at Mozilla

This week marks one year I have been at Mozilla.  I’ve always found milestones a good time for reflection, so I tend to think back around these times.

Since I started at Mozilla, I’ve been lucky enough to work on some great projects, including:

- Developing the AMO (http://addons.mozilla.org) API, used by Firefox 3 for the Addons Manager

- Scaling SUMO (http://support.mozilla.com) in preparation for Download Day

- Leading development for SUMO

- Helping plan the PHP5 migration for our web properties, and migrating AMO

- Working with Chris Pollett on full text search for AMO

- Working with Jacob Potter, one of our awesome interns this summer, on Melanion, our loadtesting webapp

- Working with Legal on an upcoming project

- Designing and planning a Single Sign On solution for all of the Mozilla web properties.

There’s been a lot of travel including to the superb Firefox Summit at Whistler, which was one of the highlights of my year.

I’ve also been pretty slack about blogging over the last year, I note, because some of these things really deserve their own entries.

The Mozilla firehose takes a while to absorb, but finally it dawns on you that this place is really really different from other companies, and in a very good way.  John Lilly was calling it “chaord” which is an excellent description - pushing control and responsibility out to the edges.  In some ways it reminds me of academia, with regard to both the autonomy we have and the rigor in the way we do things; in other ways it reminds me of the organic anarchy of many other Open Source projects.

I’m also really lucky, and feel privileged, to work with such a good group of people, both in my own team and in the whole of the organization.

On a more personal note, I’m a much happier person now than I was when I started this job.  I don’t think I’ll ever be the same person who came to the USA for three months three years ago, but I guess time changes everyone.  (Even this year hasn’t been straightforward or quiet on a personal level, but it’s been easier.)

Here’s to many more years of good work with the good people at Mozilla.

A rare and special day

Yesterday was one of the best days I have ever had.

We rented a car and drove from Portland through twisty mountain roads (all alike) to Nehalem State Park on the Oregon Coast.  Luke and I were guided on our adventure by Wendy, her daughter Blake, and helper Tracy from Northwest Equine Outfitters.  (Wendy started running trail rides because she is a quarter horse breeder, and the horses needed to help with the bills.)

After mounting up we rode through the dunes and scrub of the state park, past deer and giant gulls.  A mist of rain freshened our faces as we cantered over the sand and down to the beach.  What I thought at first was a pile of rocks turned out to be the local seal colony, some dozing idly, others keeping a watchful eye on us, with tails raised.

We dismounted at the beach and left the horses.  Wendy called for a ride and a tiny boat, with pirate flag astern, came speeding across the estuary to pick us up.  We sailed past the seals - a bit closer this time, so we could see their black little eyes and long whiskers - and then crossed the water to the Jetty, where we bought freshly caught clams, crabs, and oysters.  The fishermen steamed the clams and crabs for us, while we took our oysters to the campfire and cooked them over the open flames. The oysters were the biggest I have ever seen - the meat in each one the size of two hands.  Wendy basted them in a mix of butter, garlic, chili powder, hot sauce, and lime juice.  We sat on giant seats carved out of thousand year old trees around the camp fire, patted the dogs, ate oysters til our fingers dripped with juice, and heard stories of great courage and being lost at sea.

Eventually we sailed back over the water and were reunited with our horses.  We rode down the beach for a couple more hours, returning at last, exhausted and sandy, to the corrals.

Write Beautiful Code at OSCON

I gave my talk yesterday at OSCON 2008, and here are the slides.

It’s interesting - I think every time I have given this talk I focus on a slightly different aspect. Yesterday it was the importance of decoupling parts of your application architecture as much as possible. This is better for security reasons (allows paranoid coding practices), for scaling (allows you to switch out and/or scale components independently and quickly), and for maintainability.

OSCON is good as usual - if you’re here be sure to join Mozilla at Beerforge tonight, and come say hi.

Why Open Source rocks

The interview I did with Bruce Byfield at OpenWeb Vancouver has been posted on linux.com.  In it, I talk about why Free and Open Source Software makes for better programmers, how to make developers happy, and explain why all the passionate people at Mozilla make it a cool place to live.

Foxes

We have a litter of fox kits in the back field at our house.  Today we managed to catch them on film.  Please enjoy our very own foxkehs.  :)

Foxkeh?

Edited to add: Some people apparently don’t know Foxkeh…for comparison:
Foxkeh,  (C) 2006 Mozilla Japan

You can view the whole set here:
http://flickr.com/photos/lauraxthomson/sets/72157605003262452/