Archive for the ‘Mozilla’ Category.

DXR: code search and static analysis

I asked Erik Rose from my team to blog about his work on DXR (docs), the code search and static analysis tool for the Firefox codebase.  He did so on the Mozilla Webdev blog, so it would show up on Planet Mozilla. Today, it was pointed out to me that the Webdev blog is not on Planet.

It’s a great post, summarizing all the things he’s done in the last few months.  Go read the article: DXR digests the Firefox codebase.

Webtools in 2012: Part 2

A couple of weeks ago I put up a blog post about what we did on Socorro in 2012.  I promised another blog post about all the non-Socorro things we did.  I have probably missed some things, but here’s the list:

Elmo

Elmo is a localization management dashboard.  We worked this into a Playdoh app, completed a redesign, built a new homepage, deployed it on new infrastructure, moved it to a new domain, added metrics and launched the app!

Bouncer

Bouncer is the download redirector and is one of the oldest webapps at Mozilla. In 2012 we revived the project in order to support the stub installer.  We worked with IT to build out new dev, stage and prod clusters.  We added support for the redirects that stub installer needed, and made Bouncer SSL aware.  We also fixed a number of other issues.

Air Mozilla

We built and launched the brand-new Air Mozilla webapp, including support for Persona, secure/private streams, integrated event scheduling, and a bunch of other exciting features.

Perfomatic

We worked with A-team to update graph server into Datazilla to support changes to make Talos more statistically reliable.

DXR

DXR is a code search tool based on static analysis of the code.  We ran a usability study and built mockups in preparation for the work we’ve been doing this year (new UI, MXR parity).

Etherpad / Etherpad Lite

We deployed Etherpad with Persona support, and added Persona and Teampad support to Etherpad Lite (staged on the PaaS so I won’t link it here).  We are working on security review of EL prior to deployment, and also on getting our changes upstreamed.

Whistlepig

We developed the UI for a new IT maintenance and outage dashboard. Bonus: This looks like it will be part of the NOC we are going to build out in 2013.

Tools-as-a-service

We developed a product plan for building out a webtools-as-a-service offering for Marketplace.

MozLDAP

We built an API that wraps LDAP, so if you want to write a Mozilla webapp that uses LDAP for auth you can use this library and avoid having to build your own LDAP code.

Balrog (AUS)

We built an admin UI for the new Automated Update Service for Release Engineering.

Plugincheck

We developed a new UI for plugincheck. This is about to launch.

Dragnet

We developed code for a new crowdsourced DLL directory, based on the DLL information that we have in Socorro.  This is code complete and in a pre-launch state.

Mediawiki-Bugzilla plugin

We took over development and launched this plugin after Legneato left. It provides integration between Bugzilla and Mediawiki.

Privacy Hub

We developed a new set of pages (UI and code development) to hold Mozilla’s Privacy policies.  These will form part of mozilla.org.

Gaia

We worked on the UX for Gaia, the user-facing layer of FirefoxOS.

Playdoh

We contributed many patches to Playdoh, the Mozilla version of Django.

Verbatim

We added support for contributor acknowledgments which was accepted upstream in Pootle.

PTO

We built out a new PTO app for reporting vacation. This was completed but did not launch as a different approach is being pursued.

Sheriffs

We built out a new app for co-ordinating the Sheriffs calendar. This was completed but did not launch due to hiring a perma-sheriff (probably a better solution than a webapp).

Bramble/Briar-patch

We prototyped a monitoring and capacity planning dashboard for the build farm.  This project was later put on hold and did not launch.

Team growth and development

During the year, we welcomed new team members Selena Deckelmann and Erik Rose, and intern Tim Mickel.  We participated in several Mozilla workweeks, including a Stability themed work week with Engineering, a team-only workweek at DjangoCon, and a Webdev workweek.  We gave talks at several conferences and participated in HackerSchool.

We got better at working with Ops, QA, and RelEng and built trust and relationships with those groups.

We automated a bunch of processes, perhaps most notably building on pull requests with Leeroy (awesome!).

finally:

My New Year’s Resolution for 2012 was, “Do more. Go Faster.”  Mission accomplished.

If I could change anything it would be avoiding the rabbithole of projects that were later killed - it’s a waste of team effort.  We had a small handful of these.

Overall, it was an awesome, invigorating, and exhausting year. I hope we can do even more and cooler things in 2013.

One point to note is that we are a broadly distributed and largely remote team, but we work well together and ship a lot of stuff.  We are currently spread across Mountain View, northern California, Oregon (multiple locations), Maryland (multiple locations), France, and South Africa.

My thanks to the Webtools team: Adrian Gaudebert, Brandon Savage, Chris Lonnen, Erik Rose, K Lars Lohn, Peter Bengtsson, Rob Helmer, Schalk Neethling, and Selena Deckelmann; and interns Tim Mickel and Tony Young.  You are all awesome.

Dumps, disks, and disasters: a detective story

Not long ago, in a datacenter not far away…this is a story about stuff going wrong and how we fixed it.

Prologue

My team works on Socorro, the Firefox crash reporting system, among other things.

When Firefox crashes, it submits two files to us via HTTP POST. One is JSON metadata, and one is a minidump. This is similar in nature to a core dump. It’s a binary file, median size between 150 and 200 kB.

When we have a plugin problem (typically with Flash), we get two of these crash reports: one from the browser process, and one from the plugin container.  Until recently it was challenging to reunite the two halves of the crash information.  Benjamin Smedberg made a change to the way plugin crashes are reported.  We now get a single JSON metadata file, with both minidumps, the one from the browser, and the one from the plugin container.  We may at some point get another 1-2 dumps as part of the same crash report.

We needed to make a number of code changes to Socorro to support this change in our data format.  From here on in, I shall refer to this architectural change as “multidump support”, or just “multidump”.

Crashes arrive via our collectors.  This is a set of boxes that run two processes:
1. Collector: this is Python (web.py) running in a mod_wsgi process via Apache.  Collector receives crashes via POST, and writes them to local filesystem storage.

2. Crash mover: This is a Python daemon that picks up crashes from the filesystem and writes them to HBase.

You may be saying, “Wow, local disk? That is the worst excuse for a queue I’ve ever seen.” You would be right.  The collector uses pluggable storage, so it can write wherever you want (from Postgres, HBase, filesystem).  We have previously written crashes to NFS, and more recently and less successfully directly to HBase.  That turned out to be a Bad Idea ™, so about two years ago I suggested we write them to local disk “until we can implement a proper queue”.  Local storage has largely turned out to be “good enough”, which is why it has persisted for so long.

Adding multidump support changed the filesystem code, among other things.

Act I: An Unexpected Journey

1/10/2013
We had landed multidump support on our master and stage branch, but engineers and QA agreed that we were not quite comfortable enough with it to ship it.  Although we had planned to ship it this day, we didn’t, but we had some other stuff we needed to ship.  Instead of what we usually do (in git, push master to stage, which is our release branch), we stashed stage changes between the last release and now, and then cherry picked the stuff we needed to ship.

What we didn’t realize was that we accidentally left multidump in the stage branch, so when we pushed, we pushed multidump support.  It ran for several hours in production seemingly without problems.  We did not apply the PostgreSQL schema migration, but we had previously changed the HBase schema to support this, so it didn’t cause any problems, but was not end-user visible.  When we realized the error, we rolled back, rebuilt, and pushed the intended changes.  This happened within a couple of hours.  (The rollback/rebuild/repush took a minute or two.)

1/17/2013
We intentionally pushed multidump support.  It passed QA, and everything seemed to be going swimmingly.

1/22/2013
A Socorro user (Kairo) noticed that our crash volume had been lower than average for the last couple of days.

Investigation showed that many, many crashes were backed up in filesystem storage, and that HBase writes were giving frequent errors, meaning that the crashmovers were having trouble keeping up.

We decided to take one collector box at a time out of the pool, to allow it to catch up.  We also noticed at this time that all the collectors were backed up except collector04, which was keeping up.  This was a massive red herring as it later turned out.  We ran around checking the config and build and netflows on collector04 were the same as on the other collectors.  While we watched, collector04 gradually began backing up, and then was in the same boat as the others.

Based on previous experiences, many bad words were said about Thrift at this point.  (If you don’t know Thrift, it’s a mechanism we use for talking to HBase. We use it because our code is in Python and not a JVM language, so we use Thrift as middleman.)   But this was instinct, not empirical evidence, and therefore not useful for problem solving.

To actually diagnose the problem, we first tried strace-ing the crashmover process, and then decided to push an instrumented build to a single box.  By “instrumented” I mean “it logs a lot”.  As soon as we had the instrumentation in place, syslog began to tell a story.  Each crash move was taking 4-5 seconds to complete.  Our normal throughput on a single collector topped out at around 2800-3000 crashes/minute, so something was horribly wrong.

As it turned out the slow part was actually *deleting* the crashes from disk.  That was consuming almost all of the 4-5 seconds.

While looking at the crashes on disk, trying to discern a pattern, we made an interesting discovery.  Our filesystem code uses radix storage: files are distributed among directories on a YY/MM/DD/HH/MM/ basis.  (There are also symlinks to access the crashes by the first digit of their hex OOID, or crash ID.)  We discovered that instead of distributing crashes like this, all the crashes on each collector were in a directory named [YY]/[MM]/[DD]/00/00.  Given the backlog, that meant that, on the worst collector, we had 750,000 crashes in a single directory, on ext4. What could possibly go wrong?

At this point we formed the hypothesis that deletes were taking so long because of the sheer number of files in a directory.  (If there’s any kind of stat in the code - and strace showed there was - then this would perform poorly.)

We moved the crashes manually out of the way, as a test.  This sped things up quite a bit.

We also noticed at this point that the 00/00 crashes had backed up on several days.  We had some orphaned crashes on disk (a known bug, when multiple retries fail), and this was the pattern.
01/10/00/00 - a moderate number of crashes
01/17/00/00 - ditto
(same for each succeeding day)
01/22/00/00 - a huge number of crashes

These days correlated to the days we had multidump code running in production.  We had kind of suspected that, but this was proof.

We rolled back a single collector to pre-multidump code, and it immediately resumed running at full speed.  We then rolled back the remainder of the collectors, and took them out of the pool one at a time so they could catch up.

Somewhere during our investigation (my notes don’t show when) the intermittent failures from HBase had stopped.

By Saturday 1/26, we had caught up on the backlog.  We had also by this time, discovered the code bug that wrote all files into a single directory, and patched it.  (The filesystem code no long had access to the time, so all times were 00/00.)

We thought we were out of the woods, and scheduled a postmortem for 1/31.  However, it wasn’t going to be that easy.

Act II: All this has happened before, and will happen again.

1/28/2013
We ran backfill for our aggregate processing, in order to recalculate totals with the additional processed crashes included.

Our working hypothesis at this stage was as follows.  An unknown event involving HBase connection outages (specifically on writes) had caused crashes to begin backing up, and then having a large number of crashes in a single directory made deletion slow.  We still wanted to know what had caused the HBase issue, but there were two factors that we knew about.  First, at the time of the problem, we had an outage on a single Region Server.  This shouldn’t cause a problem, but the timing was suspicious.  Secondly, we saw an increased number of errors from Thrift.  This has happened periodically and is short-term solved by restarting Thrift.  We believe it is partially caused by our code handling Thrift connections in a suboptimal way, something that is in the process of being solved by our intern.

Also on 1/28 we pushed a new build that incorporated the fix for the directory naming problem.  (see https://github.com/mozilla/socorro/commit/9a376d8c1b2c9bf40b3b612661a971a311a9738c)

1/31
A big day.  We had two things planned for this day: first, a postmortem for the multidump issue, and second, a PostgreSQL failover from our primary to secondary master so we could replace the disks with bigger ones.

Murphy, the god of outages, intermittent errors, and ironic timing, did not smile fondly upon us this day.

Crashes began backing up on collectors once again (see https://bugzilla.mozilla.org/show_bug.cgi?id=836845).  We saw no HBase connection errors at this time, and hence realized at this point that we must have missed something.  We rolled back to a pre-multidump build on collectors, and they immediately began catching up.  We held off running backfill of aggregates at this time, because we wanted to go ahead with the failover.  Disk was getting desperately short and we had already had to delay the failover once due to external factors.

We postponed the postmortem, because clearly we didn’t have a handle on the root cause(s) at this time.

We then discovered the cause.  Multidump code was using remove() instead of the previously used quickDelete(), which was used to replace remove() a number of years ago because it was so slow.  (a href=’https://bugzilla.mozilla.org/show_bug.cgi?id=836986)

We proceeded with the planned failover from master01 to master02, and replaced the disks in master01.  Our plan was to maintain master02 as primary, with master01 replicating from master02.  The failover went well, but the new disks for master01 turned out to be faulty, post-installation.  We were now in a position where we no longer had a hot standby.  Our disk vendor did not meet their SLA for replacement.

2/1
We ran backfill of aggregate reports, and from an end-user perspective everything was back to normal.

2/2
Replaced disks on master01 (again).  These too had some errors but we managed to solve that.

Later, we pushed a new build that solved the quickDelete() issue. We were officially out of the woods.

Epilogue

Things that went well:

  • The team, consisting of engineers, WebOps, and DCOps worked extremely well and constructively together.
  • As a result of looking closely at our filesystem/HBase interactions, we tuned disk performance and ordered some SSDs which have effectively doubled performance since installation.  Thrift appears to be the next bottleneck in the system.

Things we could have done better:

  • Release management: we broke our RM process and that led us to accidentally ship the code prematurely.
  • Not shipped broken code, you know, the usual. Although I do have to say this was more subtly broken than average.  The preventative measures here would have been better in-code documentation in the old code (”Using quickDelete here instead of remove because remove performs badly.”)  We did go through code review, unit and integration testing, and manual QA, as per usual, but given this code only performed poorly once other parts of the system showed degraded performance, this was hard to catch.
  • Relying on end-user observation to discover how the system was broken.  Monitoring can solve this.

Things we will change:

  • Improvements to monitoring.  We will now monitor the number of backed up crashes. It’s not a root cause monitor but an indicator of trouble somewhere in the system.  We have a few others of these, and they are good catch-alls for things we haven’t thought to monitor yet.  We are also working on better monitoring of Thrift errors using thresholds.  Right now we consider a 1% error rate on Thrift connections normal, and support limited retries with exponential fallback. We want to alert if the percentage increases.  We plan on doing more of these thresholded monitors by writing these errors to statsd, and pointing nagios at the rolling aggregates.  This will also work for monitoring degraded performance over time.
  • Improvements to our test and release cycles.  We have seen a few times now an issue where when we get a feature to staging we decide it’s not ready to ship, and this involves git wrangling and introduces a level of human error.  Our intention is to build out a set of “try” environments, that is parallel staging environments that run different branches from the repo.

Confession:
I like disasters.  They always lead to a better process and better code.  Also, when the team works well together, it’s a positive trust-building and team-building experience.  Much better than trust falls in my experience.

A final note
All of the troubleshooting was done with a remote team, working from various locations across North America, communicating via IRC and Vidyo.  It works.

Thanks to everyone involved in troubleshooting this issue: Jake Maul, Selena Deckelmann, Rob Helmer, Chris Lonnen, Dumitru Gherman, and Lars Lohn.

A billion crashes: 2012 in review

In 2012, on the Socorro project, we:

  • Collected more than one billion crashes: more than 150TB of raw data, amounting to around half a petabyte stored. (Not all at once: we now have a data expiration policy.)
  • Shipped 54 releases
  • Resolved 1010 bugs.  Approximately 10% of these were the Django rewrite, and 40% were UI bugs.  Many of the others were backend changes to support the front end work (new API calls, stored procedures, and so on).

New features include:

  • Reports available in build time as well as clock time (graphs, crashes/user, topcrashers)
  • Rapid beta support
  • Multiple dump support for plugin crashes
  • New signature summary report
  • Per OS top crashers
  • Addition of memory usage information, Android hardware information, and other new metadata
  • Timezone support
  • Correlation reports for Java
  • Better admin navigation
  • New crash trends report
  • Added exploitability analysis to processing and exposed this in the UI (for authorized users)
  • Support for ESR channel and products
  • Support for WebRT
  • Support for WebappRTMobile
  • Support for B2G
  • Explosiveness reporting (back end)
  • More than 50 UI tweaks for better UX

Non-user facing work included:

  • Automated most parts of our release process
  • All data access moved into a unified REST API
  • Completely rewrote front end in Python/Django (from old KohanaPHP version with no upgrade path)
  • Implemented a unified configuration management solution
  • Implemented unified cron job management
  • Implemented auto-recovery in connections for resilience
  • Added statsd data collection
  • Implemented fact tables for cleaner data reporting
  • Added rules-based transforms to support greater flexibility in adding new products
  • Refactored back end into pluggable fetch-transform-save architecture
  • Automated data export to stage and development environments
  • Created fakedata sandbox for development for both Mozilla employees and outside contributors
  • Implemented automated reprocessing of elfhack broken crashes
  • Automated tests run on all pull requests
  • Added views and stored procedures for metrics analysts
  • Opened read-only access to PostgreSQL and HBase (via Pig) for internal users

I believe we run one of the biggest software error collection services in the world.  Our code is used by open source users across the internet, games, gaming (casino), music, and audio industries.

As well as working on Socorro, the Webtools team worked on more than 30 other projects, fixed countless bugs, shipped many, many releases, and supported critical organizational goals such as stub installer and Firefox Health Report.  We contributed to Gaia, too.

We could not have done any of this without help from IT (especially WebOps, SRE, and DB Ops) and WebQA.  A huge thank you to those teams. <3

I’ll write a part two of this blog post to talk more about our work on projects other than crash reporting, but I figured collecting a billion crashes deserved its own blog post.

Edited to add: I learned from Corey Shields, our Systems Manager, that we had 100% uptime in Q4.  (He’s still working on statistics for the whole of 2012.)

Ship it: a big week in Webtools

They say multi-tasking is hard. They also say DevOps is hard. Let me tell you about a bunch of engineers who think “hard” means “a nice challenge”.

Last week was an amazing one for the Webtools family. We pushed three releases to three major products. People inside Mozilla don’t always know exactly what types of things the Webtools team works on, so allow me to tell you about them.

1. Bouncer

Bouncer is Mozilla’s download redirector. When you click on one of those nifty “Download Firefox” buttons on mozilla.org, that takes you to Bouncer, which redirects you to the correct CDN or mirror where you can actually get the product that you want. Bouncer is also one of the oldest webapps at Mozilla, having been originally authored by my boss, Mike Morgan, many years ago.

Bouncer hadn’t had code changes in a very long time, and when we realized we needed to change it to support the new stub installer for Firefox, we had to spin up new development and staging environments. In addition, IT built out a new production cluster up to the new standards that have come into use since the last time it was deployed.

The code changes for stub installer are mainly around being intelligent enough to understand that some products, like the stub, can only be served from an SSL CDN or mirror. We don’t want to serve all products over SSL because of cost.

On Wednesday we shipped the new infrastrucure, and the code changes. You can read more about that it in bug 800042.

Thanks to Brandon Savage (Webtools), Anthony Ricaud (Websites), Fred Wenzel (Dev Ecosystem), Jake Maul (WebOps), Chris Turra (WebOps), Corey Shields (Systems), Stephen Donner (Web QA), Matt Brandt (Web QA), Raymond Etnoram (Web QA), and Ben Hearsum (RelEng) for making this possible.

2. Air Mozilla

As you probably know, Air Mozilla is the website that broadcasts Mozilla meetings, brownbags and presentations. On Friday we shipped a brand new version, built on top of Django. The old version was hosted in Wordpress, and was really a simple way to present content. The new version has full calendaring integration, LDAP and BrowserID support, and better ways to find old presentations.

Thanks to Tim Mickel (Webtools Intern), Peter Bengtsson (Webtools), Richard Milewski (Air Mozilla), Zandr Milewski (SpecOps), Dan Maher (WebOps), Chris Turra (WebOps), Brandon Burton (WebOps), Jason Crowe (WebOps), and Corey Shields (Systems).

You can see details of the release in bug 799745.

3. Socorro

We also shipped a regular Wednesday Socorro release. Socorro is the crash reporting service for Mozilla products, including Firefox, Firefox for Mobile (”Fennec”), Firefox OS (”Boot to Gecko”), and Thunderbird.

In this release we shipped five bug fixes and enhancements. This number was a bit lower than usual, as most people are crunching to complete the front end rewrite (more on that in a moment).

You can read more about the release in bug 800140.

Thanks to the whole team for working on this: Adrian Gaudebert, Brandon Savage, Chris Lonnen, Lars Lohn, Peter Bengtsson, Robert Helmer, Schalk Neethling, Selena Deckelmann, and of course Matt Brandt (Web QA) and Brandon Burton (IT).

An aside: Socorro on Django

We are also very close to feature parity with the new Django-based version of the Socorro webapp to the old PHP webapp. We needed to rewrite this code, because the version of the framework used in the old version is four years out of date, and there was no upgrade path for it - newer versions break backwards compatibility. Since we had to rewrite it anyway, we have moved to use the same framework as the majority of other webapps at Mozilla. This allows for easier contributions by other Mozillians. We should reach parity in the next couple of days, and plan to ship the new code in parallel with the old, subject to secreview timing.

finally:

I am incredibly proud of the impact, quality, and sheer quantity of our work over the last weeks. These projects will enable many good things throughout Mozilla. Good work, people, stand tall.

Webtools is a small team, and we could not do what we do with incredible support from IT and QA. I like to think of this as the Webtools family: we are all one team; we all work together to get the job done come hell, high water, or zombies in the data center.

Just remember, there’s a reason the Webtools mascot is Ship It Squirrel.

Rapid releases: one webdev’s perspective

People still seem to be very confused about why Mozilla has moved to the new rapid release system. I thought I’d try and explain it from my perspective. I should point out that I am not any kind of official spokesperson, and should not be quoted as such. The following is just my own personal opinion.

Imagine, now, you work on a team of web developers, and you only get to push new code to production once a year, or once every eighteen months. Your team has decided to wait until the chosen twenty new features are finished, and not ship until those are totally done and passed through a long staging and QA period. The other hundreds of bugs/tickets you closed out in that 12-18 months would have to wait too.

Seems totally foreign in these days of continuous deployment, doesn’t it?

When I first heard about rapid releases, back at our December All Hands, I had two thoughts. The first was that this was absolutely the right thing to do. When stuff is done we should give it to users. We shouldn’t make them wait, especially when other browsers don’t make them wait.

The second thought was that this was completely audacious and I didn’t know if we could pull it off. Amazingly, it happened, and now Mozilla releases use the train model and leave the station every six weeks.

So now users get features shortly after they are done (big win), but there’s been a lot of fallout. Some of the fallout has been around internal tools breaking – we just pushed out a total death sprint release of Socorro to alleviate some of this, for example. Most of the fallout, however, has been external. I see three main areas, and I’ll talk about each one in turn.

Version numbers

The first thing is pushback on version numbers. I see lots of things like:
“Why is Mozilla using marketing-driven version numbers now?”
“What are they trying to prove?”
“How will I know which versions my addons are compatible with?”
“How will I write code (JS/HTML/CSS) that works on a moving target?”

Version numbers are on the way to becoming much less visible in Firefox, like they are in webapps, or in Chrome, for that matter. (As I heard a Microsoft person say, “Nobody asks ‘Which version of Facebook are you running?’”) So to answer: it’s not marketing-driven. In fact, I think not having big versions full of those twenty new features has been much, much harder for the Engagement (marketing) team to know how to market. I see a lot of rage around version numbers in the newsgroups and on tech news sites (HN, Slashdot, etc), which tells me that we haven’t done a good job communicating this to users. I believe this is a communication issue rather than because it’s a bad idea: nowhere do you see these criticisms of Chrome, which uses the same method.

(This blog post is, in part, my way of trying to help with this.)

Add-on compatibility

The add-ons team has been working really hard to minimize add-on breakage. In realistic terms, most add-ons will continue to work with each new release, they just need a version bump. The team has a process for bumping the compatible versions of an add-on automatically, which solves this problem for add-ons that are hosted on addons.mozilla.org. Self-hosted add-ons will continue to need manual updating, and this has caused problems for people.

The goal is, as I understand it, for add-on authors to use the Add-on SDK wherever possible, which will have versions that are stable for a long time. (Read the Add-ons blog on the roadmap for more information on this.)

Enterprise versions

The other thing that’s caused a lot of stress for people at large companies is the idea that versions won’t persist for a long time. Large enterprises tend not to upgrade desktop software frequently. (This is the sad reason why so many people are still on IE6.)

There is an Enterprise Working Group working on these problems: we are taking it seriously.

finally:

Overall, getting Firefox features to users faster is a good thing. Some of the fallout issues were understood well in advance and had a mitigation plan: add-on incompatibility for example. Some others we haven’t done a good job with.

I truly believe if we had continued to release Firefox at the rate of a major version every 18 months or so, that we would have been on a road to nowhere. We had to get faster. It’s a somewhat risky strategy, but it’s better to take that risk than quietly fade away.

At the end of the day we have to remember the Mozilla mission: to promote openness and innovation on the web. It’s hard to promote innovation within the browser if you ship orders of magnitude more slowly than your competitors.

Notice that I mention the mission: people often don’t know or tend to forget that Mozilla isn’t in this for the money. We’re a non-profit. We’re in it for the good of the web, and we want to do whatever’s needed to make the web a better and more open place. We do these things because we’re passionate about the web.

I’ve never worked anywhere else where the mission and the passion were so prominent. We may sometimes do things you don’t agree with, or communicate them in a less-than-optimal way, but I really want people who read this to understand that our intentions are positive and our goal is the same as it’s always been.

Being Open

I was recently privileged to be invited to come and give a talk at AOL, on the work we do with Socorro, how Mozilla works, and what it means to be open.

The audience for the talk was a new group at AOL called the Technology Leadership Group. It consists of exceptional technical people - engineers and operational staff - from all parts of the organization, who have come together to form a group of thought leaders.

One of the first items on their agenda is, as Erynn Petersen, who looks after Developer and Open Source Evangelism, puts it: “how we scale, how we use data in new and interesting ways, and what it means to treat all of our projects as open source projects.” My task was partly to talk about how open source communities work, what the challenges are, and how AOL might go about becoming more open.

It’s amazing how things come full circle.

I think every person in the audience was a user of Open Source, and many of them were already Open Source contributors on a wide range of projects. Some had been around since the days when Netscape was acquired by AOL.

I’ll include the (limited) slides here, but the best part of the session in my opinion was the Q&A. We discussed some really interesting questions, and I’ll feature some of those here. (I want to note that I am paraphrasing/summarizing the questions as I remember them, and am not quoting any individual.)

Q: Some of our software and services are subscription-based. If we give that code away, we lose our competitive advantage - no one will pay for it anymore.

A: There are a bunch of viable business models that revolve around making money out of open source. The Mozilla model is fairly unusual in the space. The most common models are:

  • Selling support, training, or a built and bundled version of the software. This model is used by Red Hat, Canonical, Cloudera, and many others.
  • Dual licensing models. One version of the software is available under an open source license and another version is available with a commercial license for embedding. This is (or has been) the MySQL model.
  • Selling a hosted version of Open Source software as a service. This model is used by Github (git) and Automattic (Wordpress), among others.
  • It’s also perfectly valid to make some of your software open and leave some proprietary. This is the model used by 37signals - they work on Ruby on Rails and sell SaaS such as Backpack and Basecamp.

Another point is that at Mozilla, our openness *is* our competitive advantage. Our users know that we have no secret agenda: we’re not in it for the money, but we’re also not in it to mine or exploit their personal data. We exist to care for the Open Web. There has been a lot of talk lately about this, best summarized by this statement, which you’ll see in blog posts and tweets from Mozillians:

Firefox answers to no-one but you.

Q: How do we get started? There’s a lot of code - how do we get past the cultural barriers of sharing it?

A: It’s easier to start open than to become open after the fact. However, it can be done - if it couldn’t be done Mozilla wouldn’t exist. Our organization was born from the opening of Netscape. A good number of people in the room were at AOL during the Netscape era, too, which must give them a sense of deja vu. I revisited jwz’s blog post about leaving the Mozilla project, back in those days after I drafted this post, and I recommend reading it as it talks about a lot of the issues.

My answer is that there’s a lot to think about here:

  • What code are we going to make open source? Not everything has to be open source, and it doesn’t have to happen all at once. I suggest starting up a site and repository that projects can graduate to as they become ready for sharing. Here at Mozilla basically everything we work
    on is open source as a matter of principle (”open by default”), but someof it is more likely to be reused than other parts. Tools and libraries are a great starting point.
  • How will that code be licensed? This is partly a philosophical question and partly a legal question. Legal will need to examine the licensing and ownership status of existing code. You might want a contributors’ agreement for people to sign too. Licenses vary and the answer to this question is also dependent on the business model you want to use.
  • How will we share things other than the code? This includes bug reports, documentation, and so on.
  • How will the project be governed? If I want to submit a patch, how do I do that? Who decides if, when, and how that patch will be applied? There are various models for this ranging from the benevolent dictator model to the committee voting model.

I would advise starting with a single project and going from there.

Q: How will we build a community and encourage contributions?
A: This is a great question. We’re actually trying to answer this question on Socorro right now. Here’s what we are doing:

  • Set up paths of communication for the community: mailing lists, newsgroups, discussion forums
  • Make sure you have developer documentation as well as end user documentation
  • If the project is hard to install, consider providing a VM with everything already installed. (We plan to do this both for development and for users who have a small amount of data.)
  • Choose some bugs and mark them as “good first bug” in your bug tracking system.
  • Make the patch submission process transparent and documented.

There was a lot more discussion. I really enjoyed talking to such a smart and engaging group, and I wish AOL the very best in their open source initiative.

The future of crash reporting

This post first appeared in the Mozilla Webdev Blog on August 5 2010.

In recent blog posts I’ve talked about our plans for Socorro and our move to HBase.

Today, I’d like to invite community feedback on the draft of our plans for Socorro 2.0. In summary, we have been moving our data into HBase, the Hadoop database. In 1.7 we began exclusively using HBase for crash storage. In 1.8 we will move the processors and minidump_stackwalk to Hadoop.

Here comes the future

In 1.9, we will enable pulling data from HBase for the webapp via a web services layer. This layer is also known as “the pythonic middleware layer”. (Nominations for a catchier name are open. My suggestion of calling it “hoopsnake” was not well received.)

In 2.0 we will expose HBase functionality to the end user. We also have a number of other improvements planned for the 2.x releases, including:

  • Full text search of crashes
  • Faceted search
  • Ability for users to run MapReduce jobs from the webapp
  • Better visibility for explosive and critical crashes
  • Better post-crash user engagement via email

Full details can be found in the draft PRD. If you prefer the visual approach you can read the slides I presented at the Mozilla Summit last month.

Give us feedback!

We welcome all feedback from the community of users - please take a look and let us know what we’re missing. We’re also really interested in feedback about the best order in which to implement the planned features.

You can send your feedback to laura at mozilla dot com - I look forward to reading it.

Moving Socorro to HBase

This post first appeared in the Mozilla Webdev Blog on July 26 2010.

We’ve been incredibly busy over on the Socorro project, and I have been remiss in blogging. Over the next week or so I’ll be catching up on what we’ve been doing in a series of blog posts. If you’re not familiar with Socorro, it is the crash reporting system that catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and Seamonkey. You can see the output of the system at http://crash-stats.mozilla.com. The project’s code is also being used by people outside Mozilla: most recently Vigil Games are using it to catch crashes from Warhammer 40,000: Dark Millenium Online.

Back in June we launched Socorro 1.7, and we’re now approaching the release of 1.8. In this post, I’ll review what each of these features represents on our roadmap.

First, a bit of history on data storage in Socorro. Until recently, when crashes were submitted, the collector placed them into storage in the file system (NFS). Because of capacity constraints, the collector follows a set of throttling rules in its configuration file in order to make a decision about how to disseminate crashes. Most crashes go to deferred storage and are not processed unless specifically requested. However, some crashes are queued into standard storage for processing. Generally this has been all crashes from alpha, beta, release candidate and other “special” versions; all crashes with a user comment; all crashes from low volume products such as Thunderbird and Camino; and a specified percentage of all other crashes. (Recently this has been between ten and fifteen percent.)

The monitor process watched standard storage and assigned jobs to processors. A processor would pick up crashes from standard storage, process them, and write them to two places: our PostgreSQL database, and back into file system storage. We had been using PostgreSQL for serving data to the webapp, and the file system storage for serving up the full processed crash.

For some time prior to 1.7, we’d been storing all crashes in HBase in parallel with writing them into NFS. The main goal of 1.7 was to make HBase our chief storage mechanism. This involved rewriting the collector and processor to write into HBase. The monitor also needed to be rewritten to look in HBase rather than NFS for crashes awaiting processing. Finally, we have a web service that allows users to pull the full crash, and this also needed to pull crashes from HBase rather than NFS.

Not long before code freeze, we decided we should add a configuration option to the processor to continue storing crashes in NFS as a fallback, in case we had any problems with the release. This would allow us to do a staged switchover, putting processed crashes in both places until we were confident that HBase was working as intended.

During the maintenance window for 1.7 we also took the opportunity to upgrade HBase to the latest version. We are now using Cloudera’s CDH2 Hadoop distribution and HBase 0.20.5.

The release went fairly smoothly, and three days later we were able to turn off the NFS fallback.

We’re now in the final throes of 1.8. While we now have crashes stored in HBase, we are still capacity constrained by the number of processors available. In 1.8, the processors and their associated minidump_stackwalk processes will be daemonized and move to run on the Hadoop nodes. This means that we will be able to horizontally scale the number of processors with the size of the data. Right now we are running fifteen Hadoop nodes in production and this is planned to increase over the rest of the year.

Some of the associated changes in 1.8 are also really exciting. We are introducing a new component to the system, called the registrar. This process will track heartbeats for each of the processors. Also in this version, we have added an introspection API for the processors. The registrar will act as a proxy, allowing us to request status and statistical information for each of the processors. We will need to rebuild the status page (visible at http://crash-stats.mozilla.com/status) to use this new API, but we will have much better information about what each processor is doing.

Update: we’re frozen on 1.8 and expect release later this month.

Socorro: Mozilla’s Crash Reporting System

(Cross-posted from the Mozilla WebDev blog.)

Recently, we’ve been working on planning out the future of Socorro.  If you’re not familiar with it, Socorro is Mozilla’s crash reporting system.

You may have noticed that Firefox has become a lot less crashy recently - we’ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers to find, diagnose, and fix the most common crashes, so crash reporting is critical to these improvements.

We receive on our peak day each week 2.5 million crash reports, and process 15% of those, for a total of 50 GB.  In total, we receive around 320Gb each day!  Right now we are handicapped by the limitations of our file system storage (NFS) and our database’s ability to handle really large tables.   However, we are in the process of moving to Hadoop, and currently all our crashes are also being written to HBase.  Soon this will become our main data storage, and we’ll be able to do a lot more interesting things with the data.  We’ll also be able to process 100% of crashes.  We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.

I’ll start by taking a look at how things have worked to date.

History of Crash Reporting

Current Socorro Architecture

The data flows as follows:

  • When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla’s end, this is where Socorro comes into play.
  • Crashes are submitted to the collector, which writes them to storage.
  • The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)
  • Processors pick up crashes and process them.  A processor gets its next job from a queue in our database, invokes minidump_stackwalk (a part of Breakpad) which combines the crash with symbols, where available.  The results are written back into the database.   Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.
  • Finally, the data is available to Firefox and Platform engineers (and anyone else that is interested) via the webui, at http://crash-stats.mozilla.com

Implementation Details

  • The collector, processor, monitor and cron jobs are all written in Python.
  • Crashes are currently stored in NFS, and processed crash information in a PostgreSQL database.
  • The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.

Roadmap

Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.

  • 1.6 - 1.6.3 (in production)

    The current production version is 1.6.3, which was released last Wednesday.  We don’t usually do second dot point releases but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.

    When an OOPP becomes unresponsive, a pair of twin crashes are generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.

    You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that’s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.

    If you are viewing one half of a hang pair for a pre-release or beta product, you’ll find a link to the other half at the top right of the report.

    You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.

  • 1.7 (Q2)

    We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.

  • 1.8 (Q2)

    In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.

    New Socorro Architecture

    With this release, the 15% throttling of Firefox release channel crashes goes away entirely.

  • 2.0 (Q3 2010)

    You may have noticed 1.9 is missing.  In this release we will be making the power of Hbase available to the end user, so expect some significant UI changes.

    Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven’t scheduled you for an interview and you think we ought to talk to you, please let us know.

Features under consideration

  • Full text search of crashes
  • Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.
    Which of these crashes involved a particular extension or plugin?  Which ones occured within a short time after startup?
  • The ability to write and run your own Map/Reduce jobs (training will be provided)
  • Detection of “explosive crashes” that appear quickly
  • Viewing crashes by “build time” instead of clock time
  • Classification of crashes by component

This is a big list, obviously. We need your feedback - what should we work on first?

One thing that we’ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!

How to get involved

As always, we welcome feedback and input on our plans.

You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.

In addition, we always welcome contributions.  You can find our code repository at
http://code.google.com/p/socorro/

We hold project meetings on a Wednesday afternoon - details and agendas are here
https://wiki.mozilla.org/Breakpad/Status_Meetings