The dark craft of engineering management

VM Brasseur and I had a chat about what it means to be an engineering manager, as a follow-up to her excellent talk on the subject at Open Source Bridge.  I promised her I would put my (lengthy, rambling) thoughts into an essay of sorts, so here it is.

“Management is the art of getting things done through people.”
This is a nice pithy quote, but I prefer my version:

“Management is the craft of enabling people to get things done.”
Yes, it’s less grammatical.  Sue me.

Why is management a craft?

It’s a craft for the same reasons engineering is a craft.  You can read all the books you want, but a craft is learned by getting your hands into it and getting them dirty.  Crafts have rough edges, and shortcuts, and rules of thumb, and things that are held together with duct tape.  The product of craft is something useful and pleasing.

(Art to me is a good deal purer: more about aesthetics and making a statement than it is about making a thing.  Craft suits my analogy much better.)

Why enabling people to get things done?

Engineers, in general, know their jobs, to a greater or lesser extent.  My job, as an engineering manager, is to make their jobs easier.

What do engineers value?  This is of course going to be a sweeping generalization, but I’m going to resort to quoting Dan Pink: Mastery, autonomy, and purpose.

Mastery

Mastery has all kinds of implications.  As a manager, my job is to enable engineers to achieve and maintain mastery.  This means helping them to be good at and get better at their jobs.  Enabling them to ship stuff they are passionate about.  To learn the skills they need to do that.  To work alongside others who they can teach and learn from.  To have the right tools to do their jobs.

Autonomy

Autonomy is the key to scaling yourself as an engineering manager.  As an engineer, I hate nothing more than being micromanaged.  As an engineering manager, my job is to communicate the goals and where we want to get to, and work with you to determine how we’re going to get there.  Then I’m going to leave you the hell alone to get stuff done.

The two most important things I do as a manager are in this section.
The first is to act as a BS umbrella for my people.  This means going to meetings, fighting my way through the uncertainty, and coming up with clear goals for the team.  I am the wall that stands between bureaucracy and engineers.  This is also the most stressful part of my job.

The second is in 1:1s.  While I talk to my remote, distributed team all day every day in IRC as needed, this is the sacrosanct time each week when we get to talk.  There are three questions that make up the core of the 1:1:

  • How is everything going?  This is an opportunity for any venting, and lets the engineer set the direction of the conversation.
  • What are you going to do next?  Here, as a manager, I can help clarify priorities, and suggest next steps if the person is blocked.
  • What do you need? This can be anything from political wrangling to hardware.  I will do my best to get them what they need.

In Vicky’s talk she talked about getting all your ducks in a row.  In my view, the advantage of empowering your engineers with autonomy is that you get self-organizing ducks.

The key thing to remember with autonomy is this: Hire people you can trust, and then trust them to do their best.

Purpose

This is key to being a good manager, because you’re providing the purpose.  You help engineers work out what the goals should be, prioritize them, clarify requirements, and make sure everybody has a clear thing they are working towards.  Clarity of purpose is a powerful motivator.  Dealing with uncertainty is yet another roadblock you remove from the path of your team.

Why is management fun?  Why should I become a manager?

Don’t become an engineering manager because you want power - that’s the worst possible reason.  A manager is a servant to their team.  Become a manager if you want to serve.  Become a manager if you want to work on many things at once.  Being a manager makes you a fulcrum for the engineering lever, and that’s a remarkably awesome place to be.

The Fifteen Minute Maker’s Schedule

If you haven’t read Paul Graham’s “Maker’s Schedule, Manager’s Schedule”, I recommend doing that before you read this or it won’t make any sense.

The Maker’s Schedule makes sense to me in a work setting, but how about for side projects, things you’re trying to do after hours?

I started fomenting this blog post a while ago.  A very good engineer I know said something to me which I must admit rubbed me up the wrong way.  He said something along the lines of, “See, you like to write for fun, and I like to code for fun.” Actually, I really like to code for fun too, but it’s much easier to write than code in fifteen minute increments, which is often all I have available to me on any given day.

Let’s be clear about one thing: I don’t think of myself as a consumer.  I barely watch TV, only when my two-year-old insists.  I can’t tell you the last time I had time to watch a movie, and I haven’t played a non-casual video game since college.  I do read books, but books, too, lend themselves well to being read in fifteen minute increments.

I want to be a producer: someone who makes things.  Unfortunately my life is not compatible with these long chunks of time that Paul Graham talks about.  I think any parent of small children would say the same.  When you’re not at work you are on an interrupt-driven schedule: not controlled by management, but controlled by the whims of the little people who are the center of your universe.

This is how I work:

When I’m doing one of the mindless things that consume some of my non-work time - showering, driving, grocery shopping, cleaning the house, laundry, barn chores - I’m planning.  It might be cranking away on a work problem, planning a blog post or a plot for a novel that I want to write, thinking of what projects to build for our next PHP book, mapping out a conference talk, or planning code that I want to work on.  This is brain priming time.  When I get fifteen minutes to myself I can act on those things.

In other words, planning is parallelizable.  Doing is not.  Since I have so little uninterrupted time to *do*, I plan it carefully, and use it as much as I can.

When I get the occasional hour or two - nap time on a weekend (and to hell with the laundry), my husband taking our child out somewhere, or those blessed, perfect hours on a transcontinental flight - I can get so much done it makes my head hurt.  But those are the exceptions, not the norm.  I expect that to be the case until our child is a good deal older.

I had to train myself to do *anything* in fifteen minutes.  It didn’t come naturally, but I heard the advice over and over again, particularly from women writers, some of them New York Times bestsellers.  One has five children and wrote six books last year, so it can be done.  The coding is coming.  Training myself to code in fifteen minute increments has taken a lot longer than training myself to write in the same time.

The trick is to do that planning.  Train your mind to immerse itself in the problem as soon as you get into the zone where your brain is being underutilized.  This kind of immersion thinking has been useful to me for years for problem solving, and I just had to retrain myself to use it for planning.

In summary: don’t despair of Graham’s Maker’s Schedule if you just don’t have those big chunks of time outside of work.  You can still be a maker.  You can still be a creative person.  You just have to practice.  Remember: the things that count are the things we do every day, even if it’s only for fifteen minutes.

Rapid releases: one webdev’s perspective

People still seem to be very confused about why Mozilla has moved to the new rapid release system. I thought I’d try and explain it from my perspective. I should point out that I am not any kind of official spokesperson, and should not be quoted as such. The following is just my own personal opinion.

Imagine, now, you work on a team of web developers, and you only get to push new code to production once a year, or once every eighteen months. Your team has decided to wait until the chosen twenty new features are finished, and not ship until those are totally done and have passed through a long staging and QA period. The other hundreds of bugs/tickets you closed out in those 12-18 months would have to wait too.

Seems totally foreign in these days of continuous deployment, doesn’t it?

When I first heard about rapid releases, back at our December All Hands, I had two thoughts. The first was that this was absolutely the right thing to do. When stuff is done we should give it to users. We shouldn’t make them wait, especially when other browsers don’t make them wait.

The second thought was that this was completely audacious and I didn’t know if we could pull it off. Amazingly, it happened, and now Mozilla releases use the train model and leave the station every six weeks.

So now users get features shortly after they are done (big win), but there’s been a lot of fallout. Some of the fallout has been around internal tools breaking – we just pushed out a total death sprint release of Socorro to alleviate some of this, for example. Most of the fallout, however, has been external. I see three main areas, and I’ll talk about each one in turn.

Version numbers

The first thing is pushback on version numbers. I see lots of things like:
“Why is Mozilla using marketing-driven version numbers now?”
“What are they trying to prove?”
“How will I know which versions my addons are compatible with?”
“How will I write code (JS/HTML/CSS) that works on a moving target?”

Version numbers are on the way to becoming much less visible in Firefox, like they are in webapps, or in Chrome, for that matter. (As I heard a Microsoft person say, “Nobody asks ‘Which version of Facebook are you running?’”) So to answer: it’s not marketing-driven. In fact, I think not having big versions full of those twenty new features has been much, much harder for the Engagement (marketing) team to know how to market. I see a lot of rage around version numbers in the newsgroups and on tech news sites (HN, Slashdot, etc), which tells me that we haven’t done a good job communicating this to users. I believe this is a communication issue rather than because it’s a bad idea: nowhere do you see these criticisms of Chrome, which uses the same method.

(This blog post is, in part, my way of trying to help with this.)

Add-on compatibility

The add-ons team has been working really hard to minimize add-on breakage. In realistic terms, most add-ons will continue to work with each new release; they just need a version bump. The team has a process for bumping the compatible versions of an add-on automatically, which solves this problem for add-ons that are hosted on addons.mozilla.org. Self-hosted add-ons will continue to need manual updating, and this has caused problems for people.

The goal is, as I understand it, for add-on authors to use the Add-on SDK wherever possible, which will have versions that are stable for a long time. (Read the Add-ons blog on the roadmap for more information on this.)

Enterprise versions

The other thing that’s caused a lot of stress for people at large companies is the idea that versions won’t persist for a long time. Large enterprises tend not to upgrade desktop software frequently. (This is the sad reason why so many people are still on IE6.)

There is an Enterprise Working Group working on these problems: we are taking it seriously.

finally:

Overall, getting Firefox features to users faster is a good thing. Some of the fallout issues were understood well in advance and had a mitigation plan: add-on incompatibility for example. Some others we haven’t done a good job with.

I truly believe if we had continued to release Firefox at the rate of a major version every 18 months or so, that we would have been on a road to nowhere. We had to get faster. It’s a somewhat risky strategy, but it’s better to take that risk than quietly fade away.

At the end of the day we have to remember the Mozilla mission: to promote openness and innovation on the web. It’s hard to promote innovation within the browser if you ship orders of magnitude more slowly than your competitors.

Notice that I mention the mission: people often don’t know or tend to forget that Mozilla isn’t in this for the money. We’re a non-profit. We’re in it for the good of the web, and we want to do whatever’s needed to make the web a better and more open place. We do these things because we’re passionate about the web.

I’ve never worked anywhere else where the mission and the passion were so prominent. We may sometimes do things you don’t agree with, or communicate them in a less-than-optimal way, but I really want people who read this to understand that our intentions are positive and our goal is the same as it’s always been.

Capability For Continuous Deployment

Continuous deployment is very buzzword-y right now. I have some really strong opinions about deployment (just ask James Socol or Erik Kastner, who have heard me ranting about this). Here’s what I think, in a nutshell:

You should build the capability for continuous deployment even if you never intend to do continuous deployment. The machinery is more important than your deployment velocity.

Let me take a step back and talk about deployment maturity.

Immature deployment

At the immature end, a developer or team works on a system that has no staging environment. Code goes from the development environment straight to production. (I’m not going to even talk about the situation where the development environment is production. I am fully aware that these still exist, from asking audiences about it in conference talks.) I’m also assuming, in this era of github, that everybody is using version control.

(I want to point out that it’s the easy availability of services like github that has enabled even tiny, disorganized teams to use version control. VC is ubiquitous now, and that is a huge blessing.)

This sample scenario is very common: work on dev, make a commit, push it out to prod. Usually, no ops team is involved, or even exists. This can work really well in an early stage company or project, especially if you’re pre-launch.

This team likely has no automation, and a variable number of tests (0+). Even if they have tests, they may have no coverage numbers and no continuous integration.

When you hear book authors, conference speakers or tech bloggers talk about the wonders of continuous deployment, this scenario is not what they are describing.

The machinery of continuous deployment

Recently in the Mozilla webdev team, we’ve had a bunch of conversations about CD. When we talked about what was needed to do this, I had a revelation.

Although we were choosing not to do CD on my team, we had in place every requirement that was needed:

  • Continuous integration with build-on-commit
  • Tests with good coverage, and a good feel for the holes in our coverage
  • A staging environment that reflects production – our stage environment is a scale model of production, with the same ratios between boxes
  • Managed configuration
  • Scripted deployment to a large number of machines
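
As a rough illustration of the first two items, here is a minimal sketch, in Python, of the kind of gate a build-on-commit job might run: execute the test suite under coverage, enforce a coverage bar, and only package a deployable build if both pass. The test runner, paths, and threshold are hypothetical examples, not our actual CI configuration.

    #!/usr/bin/env python
    """Hypothetical build-on-commit gate: run tests, check coverage, then package."""
    import subprocess
    import sys

    COVERAGE_THRESHOLD = 80  # assumed policy for illustration, not a real Socorro number


    def main():
        # 1. Run the test suite under coverage.py (the runner is an assumption).
        if subprocess.call(["coverage", "run", "-m", "pytest", "tests"]) != 0:
            sys.exit("tests failed; not packaging this build")

        # 2. Read the total percentage from `coverage report` and compare to the bar.
        report = subprocess.check_output(["coverage", "report"], text=True)
        total_line = report.strip().splitlines()[-1]   # e.g. "TOTAL  1234  160  87%"
        percent = int(total_line.split()[-1].rstrip("%"))
        if percent < COVERAGE_THRESHOLD:
            sys.exit("coverage %d%% is below the %d%% bar" % (percent, COVERAGE_THRESHOLD))

        # 3. Only now build the deployable artifact (a plain tarball here).
        if subprocess.call(["tar", "czf", "build.tar.gz", "socorro/"]) != 0:
            sys.exit("packaging failed")
        print("build is deployable")


    if __name__ == "__main__":
        main()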

I realized then that the machinery for continuous deployment is different from the deployment velocity that you choose for your team. If we need to, we can make a commit and push it out inside of a couple of minutes, without breaking a sweat.

Why we don’t do continuous deployment on Socorro

We choose not to, except in urgent situations, for a few reasons:

  • We like to performance test our stuff, and we haven’t yet automated that
  • We like to have a human QA team test in addition to automated tests
  • We like to version our code and do “proper” releases because it’s open source and other people use our packages
  • A commit to one component of our system is often related to commits in other components, so it makes more sense to ship them as a bundle

Our process looks like this:

  • The “dev” environment runs trunk, and the “stage” environment runs a release branch.
  • On commit, Jenkins builds packages and deploys them to the appropriate environment.
  • To deploy to prod, we run a script that pulls the specified package from Jenkins and pushes it out to our production environment (a rough sketch of such a script follows this list). We also tag the revision that was packaged, both for others to use and for future reference. Note we are pushing the same package to prod that we pushed to stage, and stage reflects production.
  • If we need to change configuration for a release, we change it first in Puppet on staging, and then update production the same way.
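
The deploy script itself isn’t anything exotic. A stripped-down sketch of its general shape might look like the following; the Jenkins URL, host names, repository URLs, and paths are all placeholders for illustration, not our real configuration.

    #!/usr/bin/env python
    """Sketch of a prod push: fetch the package the CI server built, push it out, tag it.

    Every URL, host, and path below is a hypothetical placeholder.
    """
    import subprocess
    import sys
    import urllib.request

    PACKAGE_URL = ("https://jenkins.example.com/job/socorro-release/"
                   "lastSuccessfulBuild/artifact/socorro.tar.gz")
    PROD_HOSTS = ["prod-app1.example.com", "prod-app2.example.com"]


    def main():
        release_tag = sys.argv[1] if len(sys.argv) > 1 else "release-unnamed"

        # 1. Pull the same package that was already deployed to stage.
        urllib.request.urlretrieve(PACKAGE_URL, "socorro.tar.gz")

        # 2. Push it to every production box and unpack it there.
        for host in PROD_HOSTS:
            subprocess.check_call(["scp", "socorro.tar.gz", host + ":/tmp/"])
            subprocess.check_call(
                ["ssh", host, "tar -C /data/socorro -xzf /tmp/socorro.tar.gz"])

        # 3. Tag the revision that was packaged, for others and for future reference
        #    (svn-style tagging shown; the repository layout is made up).
        subprocess.check_call(
            ["svn", "copy", "https://svn.example.com/socorro/branches/release",
             "https://svn.example.com/socorro/tags/" + release_tag,
             "-m", "Tag " + release_tag])


    if __name__ == "__main__":
        main()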

We do intend to increase our deployment velocity in the coming months, but for us that’s not about the machinery of deployment, it’s about increasing our delivery velocity.

Delivery velocity is a different problem, which I’m wrestling with right now. We have a small team, and the work we’re trying to do tends not to come in small chunks but big ones, like a new report, or a system-wide change to the way we aggregate data (the thing we’re working on at the moment). It’s not that changes sit in trunk, waiting for a release. It’s more that it takes us a while to get something to a deployable stage. That is, deployment for us is not the blocker to getting features to our users faster.

finally:

It’s the same old theme you’ve seen ten times before on this blog: everybody’s environment is different, and continuous deployment may not be for everybody. On the other hand, the machinery for continuous deployment can be critical to making your life easier. Automating all this stuff certainly helps me sleep at night much better than I used to when we did manual pushes.

(Incidentally, I’d like to thank Rob Helmer and Justin Dow for automating our world: you couldn’t find better engineers to work with.)

All systems suck

I’ve been thinking a lot about this idea lately.  I’ve spent a lot of years as an engineer and consultant fixing other people’s systems that suck, writing my own systems that suck, and working on legacy systems, that, well, suck.

Don’t let anyone fool you.  All systems suck, to a greater or lesser extent.

If it’s an old system, there’s the part of the code that everybody is afraid to work on: the fragile code that is easier to replace than to maintain or refactor.  Sometimes that work just seems too hard, or nobody really understands the code.  These parts of the code are almost always surrounded by an SEP field.  If you’re unfamiliar with the term, it means “Somebody Else’s Problem”.  Items with an SEP field are completely invisible to the average human.

New systems have the parts that haven’t been built yet, so you’ll hear things like “This will be so awesome once we build feature X”.  That sucks.

There’s also the prototype that made it into production, a common problem.  Something somebody knocked together over a weekend, whether it was because of lack of time, or because of their utter brilliance, is probably going to suck in ways you just haven’t worked out yet.

All systems, old and crufty or new and shiny, have bottlenecks, where a bottleneck is defined as the slow part, the part that will break first when the system is under excessive load.  This is also part of your system that sucks.

If someone claims their system has no bugs, I have news for you: their system sucks.  And they are overly optimistic (or naive).  (Possibly they just suck as an engineer, too.)

In our heads as engineers we have the Platonic Form of a system: the system that doesn’t suck, that never breaks, that runs perfectly and quietly without anyone looking at it.  We work tirelessly to make our systems approach that system.

Even if you produce this Platonically perfect system, it will begin to suck as soon as you release it.  As data grows and changes, there will start to be parts of the system that don’t work right, or that don’t work fast enough.  Users will find ways to make your system suck in ways you hadn’t even anticipated.  When you need to add features to your perfect system, they will detract from its perfection, and make it suck more.

Here’s the punchline: sucking is like scaling.  You just have to keep on top of it, keep fixing and refactoring and improving and rewriting as you go.  Sometimes you can manage the suck in a linear fashion with bug fixes and refactoring, and sometimes you need a phase change where you re-do parts or all of the system to recover from suckiness.

This mess is what makes engineering and ops different from pure mathematics.  Embrace the suck.  It’s what gets me up in the mornings.

Books change the world

Last week I read a tweet that really got my goat, so much so that I stewed on it all weekend. The author, who is someone from the tech/startup community, said, to Tim O’Reilly no less:

“No-one ever changed the world by writing books.”

This pushed my rant button.

I thought about mentioning some books that have changed the face of civilization. Religious books: the Bible, which defines the shape of many Western civilizations, and the equivalent books of religious law in other cultures. Science books: Copernicus’ “On the Revolutions of the Celestial Spheres”, which began the scientific revolution, and defined a heliocentric model of the universe. Newton’s “Philosophiæ Naturalis Principia Mathematica”, which outlines classical mechanics and gravitation. Einstein’s multitude of publications. Books on economics: Keynes, anyone? Feminism: Friedan’s “The Feminine Mystique”. Political thought. Philosophy. Need I go on?

On a micro level, think about a book you read as a child, as a teenager, last year, last week, that changed the way you felt, gave you hope, gave you relaxation that you needed, or an escape from an unpleasant reality.

I could go on about world-changing books all day. Instead, I’m going to tell you a story about a very unexciting book. This book happens to be one that I wrote, on a subject that I’m passionate about. It’s a book on web development.

Now, this book is only a technical book. It won’t start any revolutions or cause any epiphanies, nor will it make you laugh or cry. My family won’t ever read it, and when non-technical people who are excited to discover I’m a published author hear the topic, their faces fall. It will never be on the New York Times list, or on Oprah, or be banned in countries with oppressive governments. It is a humble technical book.

This technical book has, however, sold quite well over the years. Many people have bought it (thank you), some have liked it, and some have reviewed it. Copies are in libraries, used as the prescribed texts in colleges, sold secondhand, and pirated as PDFs on the internet.

Hundreds of thousands of people have read this book, and, I hope, learned a little something about coding for the web.

Some of those people have probably gotten jobs as a result. Some might have graduated college. Some have built a personal website. Some might have gotten a promotion.

Out of those people, I venture that there have to be a hundred, perhaps more, perhaps less, who have started a company that does some kind of web development, whether it’s a consulting company or a startup. Maybe some of those companies got funded, maybe some were bootstrapped, maybe some were successful.

I wonder if that benchmark is something that the author of the tweet might value.

I hope it’s not too arrogant as an author to hope these things: that the books you write change someone’s life for the better, and in doing so change the world. I continue to believe this, and that is why I continue to write.

Socorro’s Community

This post originally appeared on the Mozilla WebDev blog.

As readers of my blog posts know, Socorro is Mozilla’s crash reporting system. All of the code for Socorro (the crash catcher) and Breakpad (the client side) is open source and available on Google Code.

Some other companies are starting to use Socorro to process crashes. In particular, we are seeing adoption in the gaming and music sectors - people who ship connected client software.

One of these companies is Valve Software, the makers of Half Life, Left 4 Dead, and Portal, among other awesome games. Recently Elan Ruskin, a game developer at Valve, gave a talk at the Game Developers Conference about Valve’s use of crash analysis. His slides are up on his blog and are well worth a read.

If you’re thinking about trying Socorro, I’d encourage you to join the general discussion mailing list (or you can follow it on Google Groups). It’s very low traffic at present but I anticipate that it will grow as more people join.

Later in the year, we plan on hosting the inaugural Crash Summit at Mozilla, where we’ll talk about tools, crash analysis, and the future of crash reporting. Email me if you’re interested in attending (laura at mozilla) or would like to present. The event will be open to Mozillians and others. I’ll post updates on this blog as we develop the event.

Big Data at SXSW (begin shameless plug)

On Monday March 14, I’ll be one of the presenters at a SXSW workshop called “Big Data and APIs for PHP Developers”, along with:

We’ll be talking about what Big Data is, how to work with it, Big Data APIs (how to design and implement your own, and how to consume them), data visualization, and the wonders of MapReduce. I’ll talk through a case study around Socorro: the nature of the data we have, how we manage it, and some of the challenges we have faced so far.

Workshops are new at SXSW. They are longer than the traditional panel - 2.5 hours - so we can actually get into some technical content. We plan on making our presentation a conversation about data, with plenty of war stories.

Hope to see you there!

Being Open

I was recently privileged to be invited to come and give a talk at AOL, on the work we do with Socorro, how Mozilla works, and what it means to be open.

The audience for the talk was a new group at AOL called the Technology Leadership Group. It consists of exceptional technical people - engineers and operational staff - from all parts of the organization, who have come together to form a group of thought leaders.

One of the first items on their agenda is, as Erynn Petersen, who looks after Developer and Open Source Evangelism, puts it: “how we scale, how we use data in new and interesting ways, and what it means to treat all of our projects as open source projects.” My task was partly to talk about how open source communities work, what the challenges are, and how AOL might go about becoming more open.

It’s amazing how things come full circle.

I think every person in the audience was a user of Open Source, and many of them were already Open Source contributors on a wide range of projects. Some had been around since the days when Netscape was acquired by AOL.

I’ll include the (limited) slides here, but the best part of the session in my opinion was the Q&A. We discussed some really interesting questions, and I’ll feature some of those here. (I want to note that I am paraphrasing/summarizing the questions as I remember them, and am not quoting any individual.)

Q: Some of our software and services are subscription-based. If we give that code away, we lose our competitive advantage - no one will pay for it anymore.

A: There are a bunch of viable business models that revolve around making money out of open source. The Mozilla model is fairly unusual in the space. The most common models are:

  • Selling support, training, or a built and bundled version of the software. This model is used by Red Hat, Canonical, Cloudera, and many others.
  • Dual licensing models. One version of the software is available under an open source license and another version is available with a commercial license for embedding. This is (or has been) the MySQL model.
  • Selling a hosted version of Open Source software as a service. This model is used by GitHub (Git) and Automattic (WordPress), among others.
  • It’s also perfectly valid to make some of your software open and leave some proprietary. This is the model used by 37signals - they work on Ruby on Rails and sell SaaS such as Backpack and Basecamp.

Another point is that at Mozilla, our openness *is* our competitive advantage. Our users know that we have no secret agenda: we’re not in it for the money, but we’re also not in it to mine or exploit their personal data. We exist to care for the Open Web. There has been a lot of talk lately about this, best summarized by this statement, which you’ll see in blog posts and tweets from Mozillians:

Firefox answers to no-one but you.

Q: How do we get started? There’s a lot of code - how do we get past the cultural barriers of sharing it?

A: It’s easier to start open than to become open after the fact. However, it can be done - if it couldn’t be done Mozilla wouldn’t exist. Our organization was born from the opening of Netscape. A good number of people in the room were at AOL during the Netscape era, too, which must give them a sense of deja vu. After I drafted this post, I revisited jwz’s blog post about leaving the Mozilla project back in those days, and I recommend reading it, as it talks about a lot of these same issues.

My answer is that there’s a lot to think about here:

  • What code are we going to make open source? Not everything has to be open source, and it doesn’t have to happen all at once. I suggest starting up a site and repository that projects can graduate to as they become ready for sharing. Here at Mozilla basically everything we work on is open source as a matter of principle (“open by default”), but some of it is more likely to be reused than other parts. Tools and libraries are a great starting point.
  • How will that code be licensed? This is partly a philosophical question and partly a legal question. Legal will need to examine the licensing and ownership status of existing code. You might want a contributors’ agreement for people to sign too. Licenses vary and the answer to this question is also dependent on the business model you want to use.
  • How will we share things other than the code? This includes bug reports, documentation, and so on.
  • How will the project be governed? If I want to submit a patch, how do I do that? Who decides if, when, and how that patch will be applied? There are various models for this ranging from the benevolent dictator model to the committee voting model.

I would advise starting with a single project and going from there.

Q: How will we build a community and encourage contributions?
A: This is a great question. We’re actually trying to answer this question on Socorro right now. Here’s what we are doing:

  • Set up paths of communication for the community: mailing lists, newsgroups, discussion forums
  • Make sure you have developer documentation as well as end user documentation
  • If the project is hard to install, consider providing a VM with everything already installed. (We plan to do this both for development and for users who have a small amount of data.)
  • Choose some bugs and mark them as “good first bug” in your bug tracking system.
  • Make the patch submission process transparent and documented.

There was a lot more discussion. I really enjoyed talking to such a smart and engaging group, and I wish AOL the very best in their open source initiative.

The new Socorro

(This post first appeared on the Mozilla Webdev Blog.)

As per my previous blog post, we migrated Socorro to the new data center in Phoenix successfully on Saturday.

This has been a mammoth exercise that sprang from two separate roots:

  • In June last year, we had a configuration problem that looked like a spike in crashes for a pre-release version of Firefox (3.6.4). This incident (known, inaccurately, as “the crash spike”) made it clear that crash-stats is on the critical path to shipping Firefox releases.
  • Near the end of Q3 we realized we were rapidly approaching the capacity of the current Socorro infrastructure.

Around that time we also had difficulty with Socorro stability. This spawned the creation of a master Socorro Stability Plan, with six key areas. I’ll talk about each of these in turn.

Improve stability

Here, we solved some HBase-related issues, upgraded our software, scheduled restarts three times a week and, most importantly, sat down to conduct an architectural review.

The specific HBase issue that we finally solved had to do with intra-cluster communication. (We needed to upgrade our Broadcom NIC drivers and kernel to solve a known issue when used with HBase. The problem surfaced as the number of TCP connections growing and growing until the box crashed. Solving this removed the need for constant system restarts.)
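
For the curious, the symptom is easy to watch for. Below is a tiny sketch of the kind of check that would have caught it early: count established TCP connections to the HBase region server port and complain when the count only ever climbs. The port and ceiling are illustrative assumptions, not our actual monitoring values.

    #!/usr/bin/env python
    """Watch established TCP connections to a port and warn if the count keeps climbing.

    The port and ceiling are illustrative assumptions, not real Socorro monitoring values.
    """
    import subprocess
    import time

    HBASE_PORT = ":60020"   # classic HBase region server default; adjust for your cluster
    CHECK_EVERY = 60        # seconds between samples
    CEILING = 5000          # arbitrary level at which we start worrying


    def established_connections():
        out = subprocess.check_output(["netstat", "-tan"], text=True)
        return sum(1 for line in out.splitlines()
                   if "ESTABLISHED" in line and HBASE_PORT in line)


    def main():
        previous = 0
        while True:
            current = established_connections()
            print(time.strftime("%H:%M:%S"), current, "established connections")
            if current > CEILING and current > previous:
                print("WARNING: connection count is still climbing; this box is in trouble")
            previous = current
            time.sleep(CHECK_EVERY)


    if __name__ == "__main__":
        main()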

Architectural improvements

We collect incoming crashes directly to HBase. We determined that, as HBase is relatively new and we’d had stability problems, we should get HBase off the critical path for production uptime. We rewrote the collectors to fall back seamlessly to disk if HBase is unavailable, and optionally to use disk as primary storage. As part of this process, the system that moves crashes from disk to HBase was replaced. It went from single-threaded to multi-threaded, which makes playing catch-up after an HBase downtime much, much faster.
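
The fallback logic itself is conceptually simple. Here is a minimal sketch of the idea; the storage classes and the disk-first flag are hypothetical stand-ins, not Socorro’s actual collector code.

    """Sketch of collector storage that prefers HBase but falls back to local disk.

    DiskStore, the HBase store interface, and the disk-first flag are hypothetical
    stand-ins, not Socorro's real classes.
    """
    import json
    import logging
    import os

    log = logging.getLogger("collector")


    class DiskStore:
        """Trivial stand-in: spool each crash as a dump file plus a JSON metadata file."""

        def __init__(self, root):
            self.root = root

        def save(self, crash_id, dump, metadata):
            os.makedirs(self.root, exist_ok=True)
            with open(os.path.join(self.root, crash_id + ".dump"), "wb") as f:
                f.write(dump)
            with open(os.path.join(self.root, crash_id + ".json"), "w") as f:
                json.dump(metadata, f)


    class FallbackCrashStorage:
        """Prefer HBase, but never refuse a crash just because the cluster is down."""

        def __init__(self, hbase_store, disk_store, disk_is_primary=False):
            self.hbase = hbase_store
            self.disk = disk_store
            self.disk_is_primary = disk_is_primary

        def save(self, crash_id, dump, metadata):
            if self.disk_is_primary:
                # Disk-first mode: a separate (multi-threaded) mover ships the
                # spooled crashes into HBase later.
                self.disk.save(crash_id, dump, metadata)
                return
            try:
                self.hbase.save(crash_id, dump, metadata)
            except Exception:
                log.exception("HBase unavailable; spooling %s to disk", crash_id)
                self.disk.save(crash_id, dump, metadata)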

We still want to put a write-through caching layer in front of HBase. This quarter, the Metrics team is prototyping a system for us to test, using Hazelcast.

Build process

We now have an automated build system. When code is checked into svn, Hudson notices, creates a build, and runs the tests. This build is then deployed to staging. Now that we are on the new infrastructure, we will deploy new releases from Hudson as well.

Improved release practices

We have a greatly improved set of release guidelines, including writing a rollback plan for every major release. We did this as part of the migration, too. Developers now have read-only access to everything in production: we can audit configs and tail logs.

The biggest change here, though, is switching to Puppet to manage all of our configurations. Socorro has a lot of moving parts, each with its own configuration, and thanks to the fine work by the team, all of these configurations are automatically and centrally managed and can be changed and deployed at the push of a button.

Improved insight into systems

As part of the migration, we audited our monitoring systems. We now have many more Nagios monitors on all components, and have spent a lot of time tuning them. We also set up Ganglia and have a good feel for what the system looks like under normal load.
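
For readers who haven’t written one, a Nagios check is just a script that prints a status line and exits 0, 1, or 2 for OK, WARNING, or CRITICAL. A hypothetical check against a collector health URL, with made-up thresholds, might look like this; it is not one of our actual monitors.

    #!/usr/bin/env python
    """Hypothetical Nagios check: is the crash collector answering, and answering quickly?

    The URL and thresholds are placeholders, not Socorro's real monitoring config.
    """
    import sys
    import time
    import urllib.request

    URL = "https://crash-reports.example.com/status"
    WARN_SECONDS = 2.0
    CRIT_SECONDS = 5.0

    OK, WARNING, CRITICAL = 0, 1, 2


    def main():
        start = time.time()
        try:
            response = urllib.request.urlopen(URL, timeout=CRIT_SECONDS)
            elapsed = time.time() - start
            if response.status != 200:
                print("CRITICAL: collector returned HTTP %d" % response.status)
                sys.exit(CRITICAL)
        except Exception as exc:
            print("CRITICAL: collector unreachable: %s" % exc)
            sys.exit(CRITICAL)

        if elapsed > WARN_SECONDS:
            print("WARNING: collector responded in %.1fs" % elapsed)
            sys.exit(WARNING)
        print("OK: collector responded in %.1fs" % elapsed)
        sys.exit(OK)


    if __name__ == "__main__":
        main()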

We still intend to build a better ops dashboard so we can see all of these checks and balances in one place.

Move to bigger, better hardware in PHX

This one is the doozy. Virtually all Socorro team resources have been on this task full time since October, and we managed to subsume half of the IT/Ops team as well. I’d like to give a special shout-out here to NetOps, who patiently helped with our many, many requests.

It’s worth noting that part of the challenge here is that Socorro has a fairly large number of moving parts. I present, for your perusal, a diagram of our current architecture.
[Diagram: Socorro Architecture, 1.7.6]
I’ve blogged about our architecture before, and will do so again, soon, so this is just a teaser.

One of the most interesting parts of the migration was the extensive smoke and load testing we performed. We set up a farm of 40 SeaMicro nodes and used them to launch crashes at the new infrastructure. This allowed us to find network bottlenecks and misconfigurations, and to tune the system, so that on the day of the migration we were really confident that things would go well. QA also really helped here: with the addition of a huge number of automated tests on the UI, we knew that things were looking great from an end user perspective.
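
To give a flavour of what the load generators did: each node essentially ran a loop that submitted synthetic, Breakpad-style crash reports to the new collectors as multipart form posts and kept score. The sketch below is illustrative only; the endpoint, form fields, and payload are assumptions, not the scripts we actually ran.

    #!/usr/bin/env python
    """Bare-bones load generator: submit synthetic crash reports to a collector and keep score.

    The endpoint, form fields, and fake minidump contents are illustrative assumptions.
    """
    import io
    import time
    import uuid

    import requests

    COLLECTOR_URL = "https://crash-reports.example.com/submit"
    SUBMISSIONS = 1000


    def submit_one():
        fake_minidump = io.BytesIO(b"MDMP" + uuid.uuid4().bytes * 64)  # junk payload
        annotations = {
            "ProductName": "LoadTest",
            "Version": "1.0",
            "CrashTime": str(int(time.time())),
        }
        files = {"upload_file_minidump": ("test.dmp", fake_minidump)}
        return requests.post(COLLECTOR_URL, data=annotations, files=files, timeout=30)


    def main():
        failures = 0
        start = time.time()
        for _ in range(SUBMISSIONS):
            if submit_one().status_code != 200:
                failures += 1
        elapsed = time.time() - start
        print("%d submissions in %.1fs (%.1f/s), %d failures"
              % (SUBMISSIONS, elapsed, SUBMISSIONS / elapsed, failures))


    if __name__ == "__main__":
        main()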

The future

We now have a great infrastructure and set of processes to build on. The goal of Socorro - and Webtools in general - is to help Firefox ship as fast and seamlessly as possible, by providing information and tools that help to build the best possible browser. (I’ll talk more about this in future blog posts, too.)

Thanks

I’ll wrap up by thanking every member of the team that worked on the migration, in no particular order:

  • Justin Dow, Socorro Lead Sysadmin, Puppetmaster
  • Rob Helmer, Engineer, Smoke Tester Extraordinaire and Automater of Builds
  • Lars Lohn, Lead Engineer
  • Ryan Snyder, Web Engineer
  • Josh Berkus, PostgreSQL consultant DBA
  • Daniel Einspanjer, Metrics Architect
  • Xavier Stevens, HBase wizard
  • Corey Shields, Manager, Systems
  • Matthew Zeier, Director of Ops and IT
  • Ravi Pina, NetOps
  • Derek Moore, NetOps
  • Stephen Donner, WebQA lead
  • David Burns, Automated Tester
  • Justin Lazaro, Ops

Good job!
Mozilla Crash Reporting Team