A billion crashes: 2012 in review

In 2012, on the Socorro project, we:

  • Collected more than one billion crashes: more than 150TB of raw data, amounting to around half a petabyte stored. (Not all at once: we now have a data expiration policy.)
  • Shipped 54 releases
  • Resolved 1010 bugs.  Approximately 10% of these were the Django rewrite, and 40% were UI bugs.  Many of the others were backend changes to support the front end work (new API calls, stored procedures, and so on).

New features include:

  • Reports available in build time as well as clock time (graphs, crashes/user, topcrashers)
  • Rapid beta support
  • Multiple dump support for plugin crashes
  • New signature summary report
  • Per OS top crashers
  • Addition of memory usage information, Android hardware information, and other new metadata
  • Timezone support
  • Correlation reports for Java
  • Better admin navigation
  • New crash trends report
  • Added exploitability analysis to processing and exposed this in the UI (for authorized users)
  • Support for ESR channel and products
  • Support for WebRT
  • Support for WebappRTMobile
  • Support for B2G
  • Explosiveness reporting (back end)
  • More than 50 UI tweaks for better UX

Non-user facing work included:

  • Automated most parts of our release process
  • All data access moved into a unified REST API
  • Completely rewrote front end in Python/Django (from old KohanaPHP version with no upgrade path)
  • Implemented a unified configuration management solution
  • Implemented unified cron job management
  • Implemented auto-recovery in connections for resilience
  • Added statsd data collection
  • Implemented fact tables for cleaner data reporting
  • Added rules-based transforms to support greater flexibility in adding new products
  • Refactored back end into pluggable fetch-transform-save architecture
  • Automated data export to stage and development environments
  • Created fakedata sandbox for development for both Mozilla employees and outside contributors
  • Implemented automated reprocessing of elfhack broken crashes
  • Automated tests run on all pull requests
  • Added views and stored procedures for metrics analysts
  • Opened read-only access to PostgreSQL and HBase (via Pig) for internal users

I believe we run one of the biggest software error collection services in the world.  Our code is used by open source users across the internet, games, gaming (casino), music, and audio industries.

As well as working on Socorro, the Webtools team worked on more than 30 other projects, fixed countless bugs, shipped many, many releases, and supported critical organizational goals such as stub installer and Firefox Health Report.  We contributed to Gaia, too.

We could not have done any of this without help from IT (especially WebOps, SRE, and DB Ops) and WebQA.  A huge thank you to those teams. <3

I’ll write a part two of this blog post to talk more about our work on projects other than crash reporting, but I figured collecting a billion crashes deserved its own blog post.

Edited to add: I learned from Corey Shields, our Systems Manager, that we had 100% uptime in Q4.  (He’s still working on statistics for the whole of 2012.)

2 Comments

  1. Paul C:

    Very nice, I enjoyed reading, thanks for writing!

    I think it’s actually hugely educational to see what a project goes through as it grows and becomes increasingly stable and streamlined. It’s interesting to think about the balance between architecture, UI and other changes as a product evolves. We need a software evolution Darwinian-like theory.

    Also awesome that the project is used by casinos! Haha.

  2. tech ramblings » Blog Archive » Webtools in 2012: Part 2:

    […] couple of weeks ago I put up a blog post about what we did on Socorro in 2012.  I promised another blog post about all the non-Socorro things we did.  I have probably missed […]

Leave a comment