Archive for the ‘Mozilla’ Category.

Socorro: Mozilla’s Crash Reporting System

(Cross-posted from the Mozilla WebDev blog.)

Recently, we’ve been working on planning out the future of Socorro.  If you’re not familiar with it, Socorro is Mozilla’s crash reporting system.

You may have noticed that Firefox has become a lot less crashy recently - we’ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers to find, diagnose, and fix the most common crashes, so crash reporting is critical to these improvements.

On our peak day each week we receive 2.5 million crash reports and process 15% of those, for a total of 50 GB.  In total, we receive around 320 GB each day!  Right now we are handicapped by the limitations of our file system storage (NFS) and our database’s ability to handle really large tables.  However, we are in the process of moving to Hadoop, and currently all our crashes are also being written to HBase.  Soon this will become our main data storage, and we’ll be able to do a lot more interesting things with the data.  We’ll also be able to process 100% of crashes.  We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.
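As a rough sanity check on those figures, here is a back-of-envelope sketch in Python.  It reads the daily volume as 320 gigabytes and assumes the processed 15% is representative of the whole by size; both are my assumptions, not measurements.

```python
# Figures quoted in the post; everything derived from them is an estimate.
PEAK_DAY_REPORTS = 2_500_000   # crash reports received on the peak day
THROTTLE_PERCENT = 15          # share of reports that get processed
DAILY_VOLUME_GB = 320          # data received per day (read as gigabytes)


def processed_reports(total_reports: int, percent: int) -> int:
    """Number of reports that make it through the throttle."""
    return total_reports * percent // 100


def processed_volume_gb(total_gb: int, percent: int) -> float:
    """Volume processed if the sample is representative by size (assumption)."""
    return total_gb * percent / 100


def average_report_kb(total_gb: int, total_reports: int) -> float:
    """Average report size in KB, using decimal units (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / total_reports


print(processed_reports(PEAK_DAY_REPORTS, THROTTLE_PERCENT))
print(processed_volume_gb(DAILY_VOLUME_GB, THROTTLE_PERCENT))
print(average_report_kb(DAILY_VOLUME_GB, PEAK_DAY_REPORTS))
```

Under those assumptions, 15% of the daily volume comes out close to the 50 GB quoted above, so the numbers hang together reasonably well.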

I’ll start by taking a look at how things have worked to date.

History of Crash Reporting

[Diagram: Current Socorro Architecture]

The data flows as follows:

  • When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla’s end, this is where Socorro comes into play.
  • Crashes are submitted to the collector, which writes them to storage.
  • The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)
  • Processors pick up crashes and process them.  A processor gets its next job from a queue in our database and invokes minidump_stackwalk (a part of Breakpad), which combines the crash with symbols, where available.  The results are written back into the database.  Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.
  • Finally, the data is available to Firefox and Platform engineers (and anyone else who is interested) via the web UI, at http://crash-stats.mozilla.com
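The flow above can be sketched in a few lines of Python.  This is a toy model, not Socorro's actual code: the `Pipeline` class, its method names, the in-memory dictionaries standing in for NFS and PostgreSQL, and the `fake_stackwalk` function standing in for minidump_stackwalk are all hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    throttle: float = 0.15                                # fraction queued by the monitor
    rng: random.Random = field(default_factory=random.Random)
    storage: dict = field(default_factory=dict)           # raw crashes (stand-in for NFS)
    seen: set = field(default_factory=set)                # crashes the monitor has looked at
    queue: deque = field(default_factory=deque)           # jobs waiting for a processor
    processed: dict = field(default_factory=dict)         # results (stand-in for the database)

    def collect(self, crash_id, raw_dump):
        """Collector: write every submitted crash to storage."""
        self.storage[crash_id] = raw_dump

    def monitor(self):
        """Monitor: look at each new crash once and queue a throttled sample."""
        for crash_id in self.storage:
            if crash_id in self.seen:
                continue
            self.seen.add(crash_id)
            if self.rng.random() < self.throttle:
                self.queue.append(crash_id)

    def request(self, crash_id):
        """On demand: queue a specific crash that a user asked for."""
        if (crash_id in self.storage and crash_id not in self.processed
                and crash_id not in self.queue):
            self.queue.append(crash_id)

    def process_next(self, stackwalk):
        """Processor: take one job, run the stackwalker, store the result."""
        if not self.queue:
            return None
        crash_id = self.queue.popleft()
        self.processed[crash_id] = stackwalk(self.storage[crash_id])
        return crash_id


def fake_stackwalk(raw_dump):
    """Hypothetical stand-in for minidump_stackwalk."""
    return {"signature": raw_dump.upper()}


pipeline = Pipeline(rng=random.Random(0))
for i in range(1000):
    pipeline.collect(f"crash-{i}", f"dump-{i}")
pipeline.monitor()
queued = len(pipeline.queue)          # roughly 15% of the 1000 submissions
pipeline.request("crash-0")           # a user asks for one specific crash
while pipeline.process_next(fake_stackwalk):
    pass
```

Every crash is stored, but only the sampled ones (plus anything requested explicitly) ever reach a processor, which is the capacity trade-off described above.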

Implementation Details

  • The collector, processor, monitor and cron jobs are all written in Python.
  • Raw crashes are currently stored on NFS, and processed crash information is stored in a PostgreSQL database.
  • The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.

Roadmap

Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.

  • 1.6 - 1.6.3 (in production)

    The current production version is 1.6.3, which was released last Wednesday.  We don’t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.

    When an OOPP becomes unresponsive, a pair of twin crashes is generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.

    You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that’s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.

    If you are viewing one half of a hang pair for a pre-release or beta product, you’ll find a link to the other half at the top right of the report.

    You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.

  • 1.7 (Q2)

    We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.

  • 1.8 (Q2)

    In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.

    [Diagram: New Socorro Architecture]

    With this release, the 15% throttling of Firefox release channel crashes goes away entirely.

  • 2.0 (Q3 2010)

    You may have noticed 1.9 is missing.  In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.

    Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven’t scheduled you for an interview and you think we ought to talk to you, please let us know.
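To make the hang-pair and filtering behaviour described under 1.6.x more concrete, here is a minimal Python sketch.  The `Report` fields, the `hang_id` link between twins, and the sample signatures are all hypothetical; they illustrate the idea and are not Socorro's real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    report_id: str
    signature: str
    is_hang: bool                    # True for a hang, False for a crash
    process: str                     # "browser" or "plugin"
    hang_id: Optional[str] = None    # shared by both halves of a hang pair (assumed)


def filter_reports(reports, is_hang=None, process=None):
    """Return reports matching the filters; None means 'no filter'."""
    return [
        r for r in reports
        if (is_hang is None or r.is_hang == is_hang)
        and (process is None or r.process == process)
    ]


def find_twin(report, reports):
    """Return the other half of a hang pair, or None if it is not available."""
    if report.hang_id is None:
        return None
    for other in reports:
        if other.hang_id == report.hang_id and other.report_id != report.report_id:
            return other
    return None


reports = [
    Report("r1", "hang | sigA", True, "browser", hang_id="h1"),
    Report("r2", "hang | sigB", True, "plugin", hang_id="h1"),
    Report("r3", "sigC", False, "browser"),
    # The browser half of this pair was throttled away, so no twin exists.
    Report("r4", "hang | sigD", True, "plugin", hang_id="h2"),
]

plugin_hangs = filter_reports(reports, is_hang=True, process="plugin")
twin_of_r1 = find_twin(reports[0], reports)
twin_of_r4 = find_twin(reports[3], reports)
```

The last report shows the released-product case: with throttling in place, one half of a pair can be missing, so the twin lookup has to cope with finding nothing.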

Features under consideration

  • Full text search of crashes
  • Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.
    Which of these crashes involved a particular extension or plugin?  Which ones occurred within a short time after startup?
  • The ability to write and run your own Map/Reduce jobs (training will be provided)
  • Detection of “explosive crashes” that appear quickly
  • Viewing crashes by “build time” instead of clock time
  • Classification of crashes by component

This is a big list, obviously. We need your feedback - what should we work on first?
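To give a feel for the faceted search idea in the list above, here is a small Python sketch.  The record layout (`signature`, `extensions`, `uptime_seconds`), the sample data, and the 60-second startup window are made up for illustration.

```python
from collections import Counter

# Hypothetical crash records; real reports carry far more fields.
crashes = [
    {"signature": "sigA", "extensions": ["ext1", "ext2"], "uptime_seconds": 4},
    {"signature": "sigA", "extensions": ["ext1"], "uptime_seconds": 3600},
    {"signature": "sigA", "extensions": [], "uptime_seconds": 10},
    {"signature": "sigB", "extensions": ["ext2"], "uptime_seconds": 2},
]


def match_signature(records, signature):
    """Step 1: narrow down to crashes with a particular signature."""
    return [c for c in records if c["signature"] == signature]


def facet_by_extension(records):
    """Step 2: count how many of the matching crashes involved each extension."""
    counts = Counter()
    for c in records:
        counts.update(set(c["extensions"]))   # count each extension once per crash
    return counts


def soon_after_startup(records, max_uptime_seconds=60):
    """Step 2 (another facet): crashes within a short time after startup."""
    return [c for c in records if c["uptime_seconds"] <= max_uptime_seconds]


matching = match_signature(crashes, "sigA")
by_extension = facet_by_extension(matching)
startup_crashes = soon_after_startup(matching)
```

Each facet is just another way of grouping or filtering the same matched set, which is why this kind of drill-down fits well on top of a store that can scan all crashes.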

One thing that we’ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!

How to get involved

As always, we welcome feedback and input on our plans.

You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.

In addition, we always welcome contributions.  You can find our code repository at
http://code.google.com/p/socorro/

We hold project meetings on Wednesday afternoons; details and agendas are here:
https://wiki.mozilla.org/Breakpad/Status_Meetings

A Year at Mozilla

This week marks one year since I joined Mozilla.  I’ve always found milestones a good time for reflection, so I tend to think back around these times.

Since I started at Mozilla, I’ve been lucky enough to work on some great projects, including:

- Developing the AMO (http://addons.mozilla.org) API, used by Firefox 3 for the Addons Manager

- Scaling SUMO (http://support.mozilla.com) in preparation for Download Day

- Leading development for SUMO

- Helping plan the PHP5 migration for our web properties, and migrating AMO

- Working with Chris Pollett on full text search for AMO

- Working with Jacob Potter, one of our awesome interns this summer, on Melanion, our loadtesting webapp

- Working with Legal on an upcoming project

- Designing and planning a Single Sign On solution for all of the Mozilla web properties.

There’s been a lot of travel, including to the superb Firefox Summit at Whistler, which was one of the highlights of my year.

I’ve also been pretty slack about blogging over the last year, I note, because some of these things really deserve their own entries.

The Mozilla firehose takes a while to absorb, but finally it dawns on you that this place is really really different from other companies, and in a very good way.  John Lilly was calling it “chaord” which is an excellent description - pushing control and responsibility out to the edges.  In some ways it reminds me of academia, with regard to both the autonomy we have and the rigor in the way we do things; in other ways it reminds me of the organic anarchy of many other Open Source projects.

I’m also really lucky, and feel privileged, to work with such a good group of people, both in my own team and in the whole of the organization.

On a more personal note, I’m a much happier person now than I was when I started this job.  I don’t think I’ll ever be the same person who came to the USA for three months three years ago, but I guess time changes everyone.  (Even this year hasn’t been straightforward or quiet on a personal level, but it’s been easier.)

Here’s to many more years of good work with the good people at Mozilla.

Why Open Source rocks

The interview I did with Bruce Byfield at OpenWeb Vancouver has been posted on linux.com.  In it, I talk about why Free and Open Source Software makes for better programmers, how to make developers happy, and explain why all the passionate people at Mozilla make it a cool place to live.

Firefox 3 Beta 3 Add-ons Manager and Add-ons API

Yesterday beta 3 of Firefox 3 was released to the world.  This beta contains the new Add-ons Manager, and people seem to be liking it so far - Ars Technica says:

One of the most promising and impressive new features in beta 3 is an
integrated add-on installer system that allows users to search for and
install add-ons from addons.mozilla.org directly through the Add-ons
Manager user interface.

The new Add-ons Manager is the result of collaboration between a bunch of smart Mozilla people - Madhava Enros and Dave Townsend to name two - and a small contribution from yours truly. 

The Add-ons Manager pulls data about Recommended addons and search results from the main addons.mozilla.org (AMO) website via the AMO API, which is my project.  When you ask for a recommendation, the Add-ons Manager pulls a RESTian URL like

https://services.addons.mozilla.org/en-US/firefox/api/list/recommended/all/

checks for addons that you don’t yet have installed from that list, and displays details of the remaining addons.
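Here is a minimal Python sketch of that request-and-filter step.  The URL pattern is copied from the example above; the parameter names (`locale`, `app`, `addon_type`), the guess that the trailing `all` is an add-on type, and the already-parsed list of dictionaries are my assumptions for illustration.  The sketch does not touch the network or parse the real response format.

```python
API_ROOT = "https://services.addons.mozilla.org"


def recommended_url(locale="en-US", app="firefox", addon_type="all"):
    """Build the recommended-list URL following the pattern shown above."""
    return f"{API_ROOT}/{locale}/{app}/api/list/recommended/{addon_type}/"


def not_yet_installed(recommended, installed_ids):
    """Keep only the recommended addons the user does not already have."""
    installed = set(installed_ids)
    return [addon for addon in recommended if addon["id"] not in installed]


# Hypothetical, already-parsed response data.
recommended = [
    {"id": 1, "name": "Addon One"},
    {"id": 2, "name": "Addon Two"},
    {"id": 3, "name": "Addon Three"},
]

url = recommended_url()
to_display = not_yet_installed(recommended, installed_ids=[2])
```

Filtering on the client side keeps the server response identical for every user in a locale, which should make it easy to cache.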

The API will be (is) available to the community as well, and promoted once testing is complete.  If you’d like to experiment with the API then draft documentation is available at
http://wiki.mozilla.org/Update:Remora_API_Docs
(This will move to the Mozilla Developer Center once it’s more fleshed out.)  Please file any bugs you find.

I’m still working on tweaks and bug fixes: I’ve already fixed a bunch of character encoding issues in different languages, and applied some performance tweaks. (Some still to go into production.)  Right now, I’m working on speeding up search.  Search is slow on the whole of AMO, and later this year I plan to implement a full text search.  Right now it’s just tweaking.  Search is slow because every possible translation is searched (think many left joins), and the plan is to rejig the database to search only your local translation plus English (since many add-ons are only available in English, and we wouldn’t want you to miss out).
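The planned behaviour (search your own locale, then fall back to English) can be illustrated with a tiny in-memory Python sketch.  The data layout, the add-on names, and the `search` function are hypothetical; the real change would happen in the database queries, not in application code like this.

```python
def search(addons, term, locale, fallback="en-US"):
    """Match the term against the user's locale plus English only."""
    term = term.lower()
    locales = [locale] if locale == fallback else [locale, fallback]
    hits = []
    for addon in addons:
        # Look at no more than two translations per add-on, not all of them.
        texts = (addon["translations"].get(loc, "") for loc in locales)
        if any(term in text.lower() for text in texts):
            hits.append(addon["name"])
    return hits


# Hypothetical add-ons with per-locale descriptions.
addons = [
    {"name": "A", "translations": {"en-US": "Block ads", "de": "Werbung blockieren"}},
    {"name": "B", "translations": {"en-US": "Download manager"}},
    {"name": "C", "translations": {"fr": "Gestionnaire de téléchargement"}},
]

german_download = search(addons, "download", "de")
german_werbung = search(addons, "werbung", "de")
french_werbung = search(addons, "werbung", "fr")
```

A German-locale user still finds the English-only add-on B, but the search never has to look at translations in other languages.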

Anyway, it’s been great fun working on this project so far, and it’s incredibly rewarding to think that something I wrote is incorporated into my favorite browser. 

Frameworks, Addons, Firefox, busy busy busy.

I’m about to leave for Orlando where I will speak at CakeFest One tomorrow on the subject of building the addons.mozilla.org API using CakePHP.  The whole of the addons.mozilla.org website is built with Cake, and we believe it to be the biggest installation (in terms of traffic) in the world.  I’ll post slides after the presentation, and a bit more information about the numbers and so on.

Building the API has consumed my thoughts for the last few months.  It’s used by the new Addons Manager in Firefox 3, which will be in beta 3.  (You can read Madhava Enros’ blog entry on the subject for a preview).  After beta 3 is out, I plan on blogging more about the API details.  I’m still ironing out bugs and doing some performance tuning.

In addition to my involvement with Cake these days, I have recently been associated with two new framework books.  I acted as a tech reviewer for Mike Naberezny and Derek DeVries’ "Rails for PHP Developers" (Pragmatic, 2008) and wrote the foreword for Cal Evans’ "Guide to Programming with Zend Framework" (php|architect, 2008).  These books are now available, so please enjoy the fruits of the authors’ labor.

I can’t help but find it amusing that something I’m (in)famous for not being a fan of has dominated my professional life for the last six or so months.  I’ll have to write more about my thoughts on these three frameworks soon…but right now I’ve got too much work to do :) and a plane to catch, besides.