Postmortem: an unfortunate name for a useful process

Let’s face it, unless your development process or self-control has utterly failed, nobody died.

The purpose of a postmortem is to meet and review a project, release, or other situation with the goal of reflection and development. To put it simply, you want to work out what went well and how to repeat it, and what went badly, and how to prevent that from happening again, or to mitigate it if it does. You can think of this process as being similar to a performance review for a team or project in a particular situation.

A postmortem shouldn’t devolve into finger pointing or shouting, or leave anyone walking away feeling miserable or full of rage. It’s not a kangaroo court, held purely to assign blame. Neither should it resemble the traditional Festivus “Airing of the Grievances”. People shouldn’t dread coming to postmortems, or they will avoid doing so.

Don’t just run postmortems for projects that have failed or otherwise been problematic. Run them for the successful projects as well. It’s important to capture what the team did that went well and made the project succeed. Make a postmortem part of your normal process: the bookend to a kickoff meeting.

Here, then, are some tips on running a constructive postmortem.

Timing

The ideal postmortem happens soon enough after the event that everybody remembers what happened. You need to balance this with giving people enough time to reflect and, if things have gone badly, to calm down. A few days to a week afterwards is often about right.

Leadership

Typically, a postmortem will be led by a project or release manager, or lead developer or sysadmin. If you’re reading this, this may well be you.

If you have strong emotions or opinions about what’s happened, I’d recommend getting them out beforehand. You can do this by working out the stress in whatever way appeals to you: writing a long angry email and then deleting it, going for a run, talking to a friend or spouse, or spending an evening gunning down zombies. The main thing is to have vented whatever steam you have built up before arriving at the meeting or writing the agenda.

Agenda

Have a detailed agenda. I’d suggest:

  • Set the scope of what you’re talking about and stick to it. If the topic is Release X, say that upfront. Don’t stray off into general problems that exist with, for example, relationships between the development team and ops or marketing.
  • Write down some facts about the topic. This might include the timeline, who was responsible for what, and links to any external documents (project plan, budget, or bug reports, for example).
  • What went well? Even in the worst situation, something probably went well. Did the team work well together? Did all your boxes stay up under load? If they crashed, did your monitoring system tell you so? Seed this with a couple of items beforehand and add to it during the postmortem.
  • What could have gone better? Remember, avoid finger pointing. Chances are, if someone screwed up, they know it. If they are oblivious to their own poor performance, bringing it up in a group meeting won’t help, and you’ll need to address it via other avenues. Focus on tasks and items that failed or could have gone better, not on people who could have done better. Again, seed this with a couple of items.
  • Suggested improvements for next time: This is the best and most constructive part of a postmortem. Given the facts that have just been discussed, this can be a brainstorming session on how to work better in future.
  • Actions: Improvements will go nowhere without actions. I recommend that each action have an owner, a deadline, and a deliverable, even if it’s just emailing the group with the result of your research.

During the postmortem

Begin by making sure everyone understands the parameters of the meeting. Your job as leader is not to do all the talking or share out the blame, but to go over the facts, and make sure people stay on track.

If the discussion gets too heated or off track, go to your happy place, put down the Nerf gun, and get people back on the agenda. Sometimes you can achieve this by asking people to take a long, heated, or irrelevant discussion offline, or simply by saying “Let’s move on.”

You might be surprised at how creative and constructive people can be, especially in the face of failure. I think the best, most constructive postmortem I have been involved in was the one after my biggest disaster. Engineers and sysadmins hate to fail. Focus on problem solving for the next iteration.

Afterwards

These discussions can be draining. I tend to coast on adrenalin after a release or crisis, and only hit the post-adrenalin exhaustion after the postmortem. It’s not a bad thing to schedule a postmortem just before lunch, or at the end of the day, to give people a chance to relax and refuel afterwards.

Author’s Note: I originally drafted this post as part of an idea for a book last year. I still hope to write that book at some point, but I thought it would make a good blog post in the meantime.

Ship it: a big week in Webtools

They say multi-tasking is hard. They also say DevOps is hard. Let me tell you about a bunch of engineers who think “hard” means “a nice challenge”.

Last week was an amazing one for the Webtools family. We pushed three releases to three major products. People inside Mozilla don’t always know exactly what types of things the Webtools team works on, so allow me to tell you about them.

1. Bouncer

Bouncer is Mozilla’s download redirector. When you click one of those nifty “Download Firefox” buttons on mozilla.org, it takes you to Bouncer, which redirects you to the correct CDN or mirror where you can actually get the product you want. Bouncer is also one of the oldest webapps at Mozilla, having been originally authored by my boss, Mike Morgan, many years ago.

Bouncer hadn’t had code changes in a very long time, and when we realized we needed to change it to support the new stub installer for Firefox, we had to spin up new development and staging environments. In addition, IT built out a new production cluster to the standards that have come into use since Bouncer was last deployed.

The code changes for the stub installer mainly involve being intelligent enough to understand that some products, like the stub, can only be served from an SSL CDN or mirror. We don’t want to serve all products over SSL because of cost.
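
To make that concrete, here’s a minimal sketch (in Python) of the kind of check involved. This is not Bouncer’s actual code: the product names, the ssl_only flag, and the mirror list are all made up for illustration.

    import random

    # Hypothetical mirror list: each mirror advertises whether it serves SSL.
    MIRRORS = [
        {"base_url": "http://mirror1.example.com/pub", "ssl": False},
        {"base_url": "https://mirror2.example.com/pub", "ssl": True},
    ]

    # Hypothetical product table: the stub installer must be served over SSL.
    PRODUCTS = {
        "firefox-latest": {"path": "/firefox/firefox.exe", "ssl_only": False},
        "firefox-stub": {"path": "/firefox/stub.exe", "ssl_only": True},
    }

    def pick_redirect(product_name):
        """Pick a redirect target, restricting SSL-only products to SSL mirrors."""
        product = PRODUCTS[product_name]
        candidates = [m for m in MIRRORS if m["ssl"] or not product["ssl_only"]]
        if not candidates:
            raise RuntimeError("no eligible mirror for %s" % product_name)
        mirror = random.choice(candidates)  # the real thing also weights mirrors
        return mirror["base_url"] + product["path"]

    print(pick_redirect("firefox-stub"))    # always an https:// URL
    print(pick_redirect("firefox-latest"))  # may be plain http://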

On Wednesday we shipped the new infrastructure and the code changes. You can read more about it in bug 800042.

Thanks to Brandon Savage (Webtools), Anthony Ricaud (Websites), Fred Wenzel (Dev Ecosystem), Jake Maul (WebOps), Chris Turra (WebOps), Corey Shields (Systems), Stephen Donner (Web QA), Matt Brandt (Web QA), Raymond Etnoram (Web QA), and Ben Hearsum (RelEng) for making this possible.

2. Air Mozilla

As you probably know, Air Mozilla is the website that broadcasts Mozilla meetings, brownbags, and presentations. On Friday we shipped a brand new version, built on top of Django. The old version was hosted in WordPress, and was really a simple way to present content. The new version has full calendaring integration, LDAP and BrowserID support, and better ways to find old presentations.

Thanks to Tim Mickel (Webtools Intern), Peter Bengtsson (Webtools), Richard Milewski (Air Mozilla), Zandr Milewski (SpecOps), Dan Maher (WebOps), Chris Turra (WebOps), Brandon Burton (WebOps), Jason Crowe (WebOps), and Corey Shields (Systems).

You can see details of the release in bug 799745.

3. Socorro

We also shipped a regular Wednesday Socorro release. Socorro is the crash reporting service for Mozilla products, including Firefox, Firefox for Mobile (“Fennec”), Firefox OS (“Boot to Gecko”), and Thunderbird.

In this release we shipped five bug fixes and enhancements. This number was a bit lower than usual, as most people are crunching to complete the front end rewrite (more on that in a moment).

You can read more about the release in bug 800140.

Thanks to the whole team for working on this: Adrian Gaudebert, Brandon Savage, Chris Lonnen, Lars Lohn, Peter Bengtsson, Robert Helmer, Schalk Neethling, Selena Deckelmann, and of course Matt Brandt (Web QA) and Brandon Burton (IT).

An aside: Socorro on Django

The new Django-based version of the Socorro webapp is also very close to feature parity with the old PHP webapp. We needed to rewrite this code because the framework version used in the old webapp is four years out of date, and there was no upgrade path for it - newer versions break backwards compatibility. Since we had to rewrite it anyway, we moved to the same framework as the majority of other webapps at Mozilla, which allows for easier contributions by other Mozillians. We should reach parity in the next couple of days, and plan to ship the new code in parallel with the old, subject to secreview timing.

finally:

I am incredibly proud of the impact, quality, and sheer quantity of our work over the last few weeks. These projects will enable many good things throughout Mozilla. Good work, people; stand tall.

Webtools is a small team, and we could not do what we do without the incredible support of IT and QA. I like to think of this as the Webtools family: we are all one team; we all work together to get the job done come hell, high water, or zombies in the data center.

Just remember, there’s a reason the Webtools mascot is Ship It Squirrel.

A visit to Hacker School

In July, I was privileged to visit Hacker School as part of their Open Source week. Hacker School is an amazing place, where hackers from all walks of life work together to level up as programmers. It reminded me of all the good things about grad school. I really loved the atmosphere.

During Open Source Week, the students’ goal was to submit their first patch to an existing Open Source project. The students chose a wide variety of projects.

I gave a talk on getting started in Open Source, and then two of my Mozilla colleagues and I helped some students get started on some Mozilla projects. At the end of the week, the organizers gathered a list of what the students had contributed to our projects. I’d like to share those contributions with you. They include patches, pull requests, and filed bugs.

That’s a lot of contributions, right there.

Observations

Part of the reason the school is so successful, in my view, is the encouraging and non-judgemental atmosphere.  They have two rules about communication:

  1. No “Well-actually”.  This is that thing where we, as geeks, feel the need to correct one another to the nth degree.
  2. No feigned surprise.  That’s saying things like “I can’t believe you’ve never heard of Richard Stallman!”

The skill range of students runs from self-taught in the last six months, to several years’ experience, to PhD students on summer vacation.  But everyone works side by side, productively and enthusiastically.

Calls to action

I learned a lot from my day at Hacker School, and it inspired me to issue these calls to action:

  1. Coders: If you’re thinking about applying to Hacker School, do it.  It’s a truly amazing place.  Applications are open for the fall batch.
  2. Hackers: Nominate people (including yourself!) to be a Hacker School resident, working alongside students for a couple of weeks.
  3. Tech companies: consider sponsoring the next batch of students.
  4. Mozillians: we should sponsor, run, and be involved with more hackathons on Mozilla projects.  We should host Hackdays where we get brand new contributors involved with our projects.  I propose we do this at existing Open Source conferences, get-togethers, and MozCamps, and at informal hackathons wherever the opportunity presents itself.

Finally

I’d like to thank Nick Bergson-Shilcock, David Albert, Sonali Sridhar, Thomas Ballinger, and Alan O’Donnell for running Hacker School and hosting us, and Etsy, 37signals, and Yammer for their sponsorship of the school. And of course, I’d like to thank the students for being awesome, and for their contributions!

The dark craft of engineering management

VM Brasseur and I had a chat about what it means to be an engineering manager, as a follow up to her excellent talk on the subject at Open Source Bridge.  I promised her I would put my (lengthy, rambling) thoughts into an essay of sorts, so here it is.

“Management is the art of getting things done through people.”

This is a nice pithy quote, but I prefer my version:

“Management is the craft of enabling people to get things done.”

Yes, it’s less grammatical.  Sue me.

Why is management a craft?

It’s a craft for the same reasons engineering is a craft.  You can read all the books you want on something, but crafts are learned by getting your hands in it and getting them dirty.  Crafts have rough edges, and shortcuts, and rules of thumb, and things that are held together with duct tape.  The product of craft is something useful and pleasing.

(Art to me is a good deal purer: more about aesthetics and making a statement than it is about making a thing.  Craft suits my analogy much better.)

Why enabling people to get things done?

Engineers, in general, know their jobs, to a greater or lesser extent.  My job, as an engineering manager, is to make their jobs easier.

What do engineers value?  This is of course going to be a sweeping generalization, but I’m going to resort to quoting Dan Pink: Mastery, autonomy, and purpose.

Mastery

Mastery has all kinds of implications.  As a manager, my job is to enable engineers to achieve and maintain mastery.  This means helping them to be good at and get better at their jobs.  Enabling them to ship stuff they are passionate about.  To learn the skills they need to do that.  To work alongside others who they can teach and learn from.  To have the right tools to do their jobs.

Autonomy

Autonomy is the key to scaling yourself as an engineering manager.  As an engineer, I hate nothing more than being micromanaged.  As an engineering manager, my job is to communicate the goals and where we want to get to, and work with you to determine how we’re going to get there.  Then I’m going to leave you the hell alone to get stuff done.

The two most important things I do as a manager are in this section.

The first is to act as a BS umbrella for my people.  This means going to meetings, fighting my way through the uncertainty, and coming up with clear goals for the team.  I am the wall that stands between bureaucracy and engineers.  This is also the most stressful part of my job.

The second is in 1:1s.  While I talk to my remote, distributed team all day every day in IRC as needed, this is the sacrosanct time each week where we get to talk.  There are three questions that make up the core of the 1:1:

  • How is everything going?  This is an opportunity for any venting, and lets the engineer set the direction of the conversation.
  • What are you going to do next?  Here, as a manager, I can help clarify priorities, and suggest next steps if the person is blocked.
  • What do you need? This can be anything from political wrangling to hardware.  I will do my best to get them what they need.

In her talk, Vicky talked about getting all your ducks in a row.  In my view, the advantage of empowering your engineers with autonomy is that you get self-organizing ducks.

The key thing to remember with autonomy is this: Hire people you can trust, and then trust them to do their best.

Purpose

This is key to being a good manager, because you’re providing the purpose.  You help engineers work out what the goals should be, prioritize them, clarify requirements, and make sure everybody has a clear thing they are working towards.  Clarity of purpose is a powerful motivator.  Dealing with uncertainty is yet another roadblock you remove from the path of your team.

Why is management fun?  Why should I become a manager?

Don’t become an engineering manager because you want power - that’s the worst possible reason.  A manager is a servant to their team.  Become a manager if you want to serve.  Become a manager if you want to work on many things at once.  Becoming a manager helps you become a fulcrum for the engineering lever, and that’s a remarkably awesome place to be.

The Fifteen Minute Maker’s Schedule

If you haven’t read Paul Graham’s “Maker’s Schedule, Manager’s Schedule”, I recommend doing that before you read this or it won’t make any sense.

The Maker’s Schedule makes sense to me in a work setting, but how about for side projects, things you’re trying to do after hours?

I started fomenting this blog post a while ago.  A very good engineer I know said something to me which I must admit rubbed me up the wrong way.  He said something along the lines of, “See, you like to write for fun, and I like to code for fun.” Actually, I really like to code for fun too, but it’s much easier to write than code in fifteen minute increments, which is often all I have available to me on any given day.

Let’s be clear about one thing: I don’t think of myself as a consumer.  I barely watch TV, and only when my two-year-old insists.  I can’t tell you the last time I had time to watch a movie, and I haven’t played a non-casual video game since college.  I do read books, but books, too, lend themselves well to being read in fifteen minute increments.

I want to be a producer: someone who makes things.  Unfortunately my life is not compatible with these long chunks of time that Paul Graham talks about.  I think any parent of small children would say the same.  When you’re not at work you are on an interrupt-driven schedule: not controlled by management, but controlled by the whims of the little people who are the center of your universe.

This is how I work:

When I’m doing one of the mindless things that consume some of my non-work time - showering, driving, grocery shopping, cleaning the house, laundry, barn chores - I’m planning.  It might be cranking away on a work problem, planning a blog post or a plot for a novel that I want to write, thinking of what projects to build for our next PHP book, mapping out a conference talk, or planning code that I want to work on.  This is brain priming time.  When I get fifteen minutes to myself I can act on those things.

In other words, planning is parallelizable.  Doing is not.  Since I have so little uninterrupted time to *do*, I plan it carefully, and use it as much as I can.

When I get the occasional hour or two - nap time on a weekend (and to hell with the laundry), my husband taking our child out somewhere, or those blessed, perfect hours on a transcontinental flight - I can get so much done it makes my head hurt.  But those are the exceptions, not the norm.  I expect that to be the case until our child is a good deal older.

I had to train myself to do *anything* in fifteen minutes.  It didn’t come naturally, but I heard the advice over and over again, particularly from women writers, some of them New York Times bestsellers.  One has five children and wrote six books last year, so it can be done.  The coding is coming.  Training myself to code in fifteen minute increments has taken a lot longer than training myself to write in the same time.

The trick is to do that planning.  Train your mind to immerse itself in the problem as soon as you get into the zone where your brain is being underutilized.  This kind of immersion thinking has been useful to me for years for problem solving, and I just had to retrain myself to use it for planning.

In summary: don’t despair of Graham’s Maker’s Schedule if you just don’t have those big chunks of time outside of work.  You can still be a maker.  You can still be a creative person.  You just have to practice.  Remember: the things that count are the things we do every day, even if it’s only for fifteen minutes.

Rapid releases: one webdev’s perspective

People still seem to be very confused about why Mozilla has moved to the new rapid release system. I thought I’d try and explain it from my perspective. I should point out that I am not any kind of official spokesperson, and should not be quoted as such. The following is just my own personal opinion.

Imagine, now, you work on a team of web developers, and you only get to push new code to production once a year, or once every eighteen months. Your team has decided to wait until the chosen twenty new features are finished, and not ship until those are totally done and have passed through a long staging and QA period. The other hundreds of bugs/tickets you closed out in those 12-18 months would have to wait too.

Seems totally foreign in these days of continuous deployment, doesn’t it?

When I first heard about rapid releases, back at our December All Hands, I had two thoughts. The first was that this was absolutely the right thing to do. When stuff is done we should give it to users. We shouldn’t make them wait, especially when other browsers don’t make them wait.

The second thought was that this was completely audacious and I didn’t know if we could pull it off. Amazingly, it happened, and now Mozilla releases use the train model and leave the station every six weeks.

So now users get features shortly after they are done (big win), but there’s been a lot of fallout. Some of the fallout has been around internal tools breaking – we just pushed out a total death sprint release of Socorro to alleviate some of this, for example. Most of the fallout, however, has been external. I see three main areas, and I’ll talk about each one in turn.

Version numbers

The first thing is pushback on version numbers. I see lots of things like:

  • “Why is Mozilla using marketing-driven version numbers now?”
  • “What are they trying to prove?”
  • “How will I know which versions my addons are compatible with?”
  • “How will I write code (JS/HTML/CSS) that works on a moving target?”

Version numbers are on the way to becoming much less visible in Firefox, like they are in webapps, or in Chrome, for that matter. (As I heard a Microsoft person say, “Nobody asks ‘Which version of Facebook are you running?’”) So to answer: it’s not marketing-driven. In fact, I think not having big versions full of those twenty new features has been much, much harder for the Engagement (marketing) team to know how to market. I see a lot of rage around version numbers in the newsgroups and on tech news sites (HN, Slashdot, etc.), which tells me that we haven’t done a good job communicating this to users. I believe this is a communication issue rather than because it’s a bad idea: nowhere do you see these criticisms of Chrome, which uses the same method.

(This blog post is, in part, my way of trying to help with this.)

Add-on compatibility

The add-ons team has been working really hard to minimize add-on breakage. In realistic terms, most add-ons will continue to work with each new release; they just need a version bump. The team has a process for bumping the compatible versions of an add-on automatically, which solves this problem for add-ons that are hosted on addons.mozilla.org. Self-hosted add-ons will continue to need manual updating, and this has caused problems for people.
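
To illustrate just how mechanical a version bump is, here’s a toy sketch in Python. It uses simplified dotted version strings rather than Mozilla’s real version format, and the add-on metadata shown is hypothetical.

    def parse(version):
        """Parse a simplified dotted version string like '7.0.1' into a tuple."""
        return tuple(int(part) for part in version.split("."))

    def needs_bump(addon_max_version, new_firefox_version):
        """True if the add-on's declared maxVersion predates the new release."""
        return parse(addon_max_version) < parse(new_firefox_version)

    # A hypothetical add-on last marked compatible with Firefox 6.
    addon = {"name": "ExampleAddon", "max_version": "6.0"}

    if needs_bump(addon["max_version"], "7.0"):
        # The "version bump": no code changes, just updated compatibility metadata.
        addon["max_version"] = "7.0"
        print("bumped %(name)s to maxVersion %(max_version)s" % addon)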

The goal is, as I understand it, for add-on authors to use the Add-on SDK wherever possible, which will have versions that are stable for a long time. (Read the Add-ons blog on the roadmap for more information on this.)

Enterprise versions

The other thing that’s caused a lot of stress for people at large companies is the idea that versions won’t persist for a long time. Large enterprises tend not to upgrade desktop software frequently. (This is the sad reason why so many people are still on IE6.)

There is an Enterprise Working Group working on these problems: we are taking it seriously.

finally:

Overall, getting Firefox features to users faster is a good thing. Some of the fallout issues were understood well in advance and had a mitigation plan: add-on incompatibility for example. Some others we haven’t done a good job with.

I truly believe that if we had continued to release Firefox at the rate of a major version every 18 months or so, we would have been on a road to nowhere. We had to get faster. It’s a somewhat risky strategy, but it’s better to take that risk than quietly fade away.

At the end of the day we have to remember the Mozilla mission: to promote openness and innovation on the web. It’s hard to promote innovation within the browser if you ship orders of magnitude more slowly than your competitors.

Notice that I mention the mission: people often don’t know or tend to forget that Mozilla isn’t in this for the money. We’re a non-profit. We’re in it for the good of the web, and we want to do whatever’s needed to make the web a better and more open place. We do these things because we’re passionate about the web.

I’ve never worked anywhere else where the mission and the passion were so prominent. We may sometimes do things you don’t agree with, or communicate them in a less-than-optimal way, but I really want people who read this to understand that our intentions are positive and our goal is the same as it’s always been.

Capability For Continuous Deployment

Continuous deployment is very buzzword-y right now. I have some really strong opinions about deployment (just ask James Socol or Erik Kastner, who have heard me ranting about this). Here’s what I think, in a nutshell:

You should build the capability for continuous deployment even if you never intend to do continuous deployment. The machinery is more important than your deployment velocity.

Let me take a step back and talk about deployment maturity.

Immature deployment

At the immature end, a developer or team works on a system that has no staging environment. Code goes from the development environment straight to production. (I’m not even going to talk about the situation where the development environment is production. I am fully aware that these still exist, from asking audiences at conference talks.) I’m also assuming, in this era of github, that everybody is using version control.

(I want to point out that it’s the easy availability of services like github that has enabled even tiny, disorganized teams to use version control. VC is ubiquitous now, and that is a huge blessing.)

This sample scenario is very common: work on dev, make a commit, push it out to prod. Usually, no ops team is involved, or even exists. This can work really well in an early stage company or project, especially if you’re pre-launch.

This team likely has no automation, and a variable number of tests (0+). Even if they have tests, they may have no coverage numbers and no continuous integration.

When you hear book authors, conference speakers or tech bloggers talk about the wonders of continuous deployment, this scenario is not what they are describing.

The machinery of continuous deployment

Recently in the Mozilla webdev team, we’ve had a bunch of conversations about CD. When we talked about what was needed to do this, I had a revelation.

Although we were choosing not to do CD on my team, we had in place every requirement that was needed:

  • Continuous integration with build-on-commit
  • Tests with good coverage, and a good feel for the holes in our coverage
  • A staging environment that reflects production – our stage environment is a scale model of production, with the same ratios between boxes
  • Managed configuration
  • Scripted deployment to a large number of machines (a bare-bones sketch of this follows the list)
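
As a taste of what that last item can look like, here’s a fan-out sketch in Python. The hostnames, paths, and scp/ssh approach are assumptions for illustration; a real deployment script adds error handling, health checks, and rollback.

    import subprocess

    # Hypothetical cluster; a real host list would come from managed configuration.
    HOSTS = ["web1.example.com", "web2.example.com", "web3.example.com"]
    PACKAGE = "/tmp/myapp-1.2.3.tar.gz"  # hypothetical build artifact

    def deploy_to(host):
        """Copy the package to one host and unpack it; raises on any failure."""
        subprocess.check_call(["scp", PACKAGE, "%s:/tmp/release.tar.gz" % host])
        subprocess.check_call(
            ["ssh", host, "tar -xzf /tmp/release.tar.gz -C /srv/myapp"])

    for host in HOSTS:
        deploy_to(host)
        print("deployed to %s" % host)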

I realized then that the machinery for continuous deployment is different from the deployment velocity that you choose for your team. If we need to, we can make a commit and push it out inside of a couple of minutes, without breaking a sweat.

Why we don’t do continuous deployment on Socorro

We choose not to, except in urgent situations, for a few reasons:

  • We like to performance test our stuff, and we haven’t yet automated that
  • We like to have a human QA team test in addition to automated tests
  • We like to version our code and do “proper” releases because it’s open source and other people use our packages
  • A commit to one component of our system is often related to other commits in other components, which make more sense to ship as a bundle

Our process looks like this:

  • The “dev” environment runs trunk, and the “stage” environment runs a release branch.
  • On commit, Jenkins builds packages and deploys them to the appropriate environment.
  • To deploy to prod, we run a script that pulls the specified package from Jenkins and pushes it out to our production environment; a sketch of this flow follows the list. We also tag the revision that was packaged, for others to use and for future reference. Note that we push the same package to prod that we pushed to stage, and stage reflects production.
  • If we need to change configuration for a release, we change it first in Puppet on staging, and then update production the same way.
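
For the curious, the shape of that prod push is roughly the following sketch. Every name here, from the Jenkins URL to the deploy-fanout helper, is illustrative rather than our actual tooling, and the tagging is shown with git; adapt it to your own VCS.

    import subprocess
    import urllib.request

    # Illustrative locations; a real script would take these as arguments.
    JENKINS_PKG = ("http://jenkins.example.com/job/socorro-release/"
                   "lastSuccessfulBuild/artifact/socorro.tar.gz")
    LOCAL_PKG = "/tmp/socorro.tar.gz"

    def fetch_package():
        """Pull the exact package that was already deployed to stage."""
        urllib.request.urlretrieve(JENKINS_PKG, LOCAL_PKG)

    def push_to_prod():
        """Hand off to a hypothetical fan-out script (like the earlier sketch)."""
        subprocess.check_call(["deploy-fanout", LOCAL_PKG])

    def tag_release(version):
        """Tag the packaged revision so others can use exactly what shipped."""
        subprocess.check_call(["git", "tag", "-a", "v" + version,
                               "-m", "release " + version])

    fetch_package()
    push_to_prod()
    tag_release("2.3.1")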

We do intend to increase our deployment velocity in the coming months, but for us that’s not about the machinery of deployment, it’s about increasing our delivery velocity.

Delivery velocity is a different problem, which I’m wrestling with right now. We have a small team, and the work we’re trying to do tends not to come in small chunks but big ones, like a new report, or a system-wide change to the way we aggregate data (the thing we’re working on at the moment). It’s not that changes sit in trunk, waiting for a release. It’s more that it takes us a while to get something to a deployable stage. That is, deployment for us is not the blocker to getting features to our users faster.

finally:

It’s the same old theme you’ve seen ten times before on this blog: everybody’s environment is different, and continuous deployment may not be for everybody. On the other hand, the machinery for continuous deployment can be critical to making your life easier. Automating all this stuff certainly helps me sleep at night much better than I used to when we did manual pushes.

(Incidentally, I’d like to thank Rob Helmer and Justin Dow for automating our world: you couldn’t find better engineers to work with.)

All systems suck

I’ve been thinking a lot about this idea lately.  I’ve spent a lot of years as an engineer and consultant fixing other people’s systems that suck, writing my own systems that suck, and working on legacy systems that, well, suck.

Don’t let anyone fool you.  All systems suck, to a greater or lesser extent.

If it’s an old system, there’s the part of the code that everybody is afraid to work on: the fragile code that is easier to replace than maintain or refactor.  Sometimes this code just seems hard; sometimes nobody really understands it anymore.  These parts of the code are almost always surrounded by an SEP field.  If you’re unfamiliar with the term, it means “Somebody Else’s Problem”.  Items with an SEP field are completely invisible to the average human.

New systems have the parts that haven’t been built yet, so you’ll hear things like “This will be so awesome once we build feature X”.  That sucks.

There’s also the prototype that made it into production, a common problem.  Something somebody knocked together over a weekend, whether it was because of lack of time, or because of their utter brilliance, is probably going to suck in ways you just haven’t worked out yet.

All systems, old and crufty or new and shiny, have bottlenecks, where a bottleneck is defined as the slow part, the part that will break first when the system is under excessive load.  This is also part of your system that sucks.

If someone claims their system has no bugs, I have news for you: their system sucks.  And they are overly optimistic (or naive).  (Possibly they just suck as an engineer, too.)

In our heads as engineers we have the Platonic Form of a system: the system that doesn’t suck, that never breaks, that runs perfectly and quietly without anyone looking at it.  We work tirelessly to make our systems approach that system.

Even if you produce this Platonically perfect system, it will begin to suck as soon as you release it.  As data grows and changes, there will start to be parts of the system that don’t work right, or that don’t work fast enough.  Users will find ways to make your system suck in ways you hadn’t even anticipated.  When you need to add features to your perfect system, they will detract from its perfection, and make it suck more.

Here’s the punchline: sucking is like scaling.  You just have to keep on top of it, keep fixing and refactoring and improving and rewriting as you go.  Sometimes you can manage the suck in a linear fashion with bug fixes and refactoring, and sometimes you need a phase change where you re-do parts or all of the system to recover from suckiness.

This mess is what makes engineering and ops different from pure mathematics.  Embrace the suck.  It’s what gets me up in the mornings.

Books change the world

Last week I read a tweet that really got my goat, so much so that I stewed on it all weekend. The author, who is someone from the tech/startup community, said, to Tim O’Reilly no less:

“No-one ever changed the world by writing books.”

This pushed my rant button.

I thought about mentioning some books that have changed the face of civilization. Religious books: the Bible, which defines the shape of many Western civilizations, and the equivalent books of religious law in other cultures. Science books: Copernicus’ “On the Revolutions of the Celestial Spheres”, which began the scientific revolution and defined a heliocentric model of the universe. Newton’s “Philosophiæ Naturalis Principia Mathematica”, which outlines classical mechanics and gravitation. Einstein’s multitude of publications. Books on economics: Keynes, anyone? Feminism: Friedan’s “The Feminine Mystique”. Political thought. Philosophy. Need I go on?

On a micro level, think about a book you read as a child, as a teenager, last year, or last week, that changed the way you felt, gave you hope, gave you relaxation that you needed, or an escape from an unpleasant reality.

I could go on about world-changing books all day. Instead, I’m going to tell you a story about a very unexciting book. This book happens to be one that I wrote, on a subject that I’m passionate about. It’s a book on web development.

Now, this book is only a technical book. It won’t start any revolutions or cause any epiphanies, nor will it make you laugh or cry. My family won’t ever read it, and when non-technical people who are excited to discover I’m a published author hear the topic, their faces fall. It will never be on the New York Times list, or on Oprah, or be banned in countries with oppressive governments. It is a humble technical book.

This technical book has, however, sold quite well over the years. Many people have bought it (thank you), some have liked it, and some have reviewed it. Copies are in libraries, used as the prescribed texts in colleges, sold secondhand, and pirated as PDFs on the internet.

Hundreds of thousands of people have read this book, and, I hope, learned a little something about coding for the web.

Some of those people have probably gotten jobs as a result. Some might have graduated college. Some have built a personal website. Some might have gotten a promotion.

Out of those people, I venture that there have to be a hundred, perhaps more, perhaps less, who have started a company that does some kind of web development, whether it’s a consulting company or a startup. Maybe some of those companies got funded, maybe some were bootstrapped, maybe some were successful.

I wonder if that benchmark is something that the author of the tweet might value.

I hope it’s not too arrogant as an author to hope these things: that the books you write change someone’s life for the better, and in doing so change the world. I continue to believe this, and that is why I continue to write.

Socorro’s Community

This post originally appeared on the Mozilla WebDev blog.

As readers of my blog posts know, Socorro is Mozilla’s crash reporting system. All of the code for Socorro (the crash catcher) and Breakpad (the client side) is open source and available on Google Code.

Some other companies are starting to use Socorro to process crashes. In particular, we are seeing adoption in the gaming and music sectors - people who ship connected client software.

One of these companies is Valve Software, the makers of Half Life, Left 4 Dead, and Portal, among other awesome games. Recently Elan Ruskin, a game developer at Valve, gave a talk at the Game Developers Conference about Valve’s use of crash analysis. His slides are up on his blog and are well worth a read.

If you’re thinking about trying Socorro, I’d encourage you to join the general discussion mailing list (or you can follow it on Google Groups). It’s very low traffic at present but I anticipate that it will grow as more people join.

Later in the year, we plan on hosting the inaugural Crash Summit at Mozilla, where we’ll talk about tools, crash analysis, and the future of crash reporting. Email me if you’re interested in attending (laura at mozilla) or would like to present. The event will be open to Mozillians and others. I’ll post updates on this blog as we develop the event.