
Capability For Continuous Deployment

Continuous deployment is very buzzword-y right now. I have some really strong opinions about deployment (just ask James Socol or Erik Kastner, who have heard me ranting about this). Here’s what I think, in a nutshell:

You should build the capability for continuous deployment even if you never intend to do continuous deployment. The machinery is more important than your deployment velocity.

Let me take a step back and talk about deployment maturity.

Immature deployment

At the immature end, a developer or team works on a system that has no staging environment. Code goes from the development environment straight to production. (I’m not even going to talk about the situation where the development environment is production. I know these setups still exist, because I ask conference audiences about them.) I’m also assuming, in this era of GitHub, that everybody is using version control.

(I want to point out that it’s the easy availability of services like GitHub that has enabled even tiny, disorganized teams to use version control. VC is ubiquitous now, and that is a huge blessing.)

This sample scenario is very common: work on dev, make a commit, push it out to prod. Usually, no ops team is involved, or even exists. This can work really well in an early stage company or project, especially if you’re pre-launch.

This team likely has no automation, and a variable number of tests (0+). Even if they have tests, they may have no coverage numbers and no continuous integration.

When you hear book authors, conference speakers or tech bloggers talk about the wonders of continuous deployment, this scenario is not what they are describing.

The machinery of continuous deployment

Recently in the Mozilla webdev team, we’ve had a bunch of conversations about CD. When we talked about what was needed to do this, I had a revelation.

Although we were choosing not to do CD on my team, we had in place every requirement that was needed:

  • Continuous integration with build-on-commit
  • Tests with good coverage, and a good feel for the holes in our coverage
  • A staging environment that reflects production – our stage environment is a scale model of production, with the same ratios between boxes
  • Managed configuration
  • Scripted deployment to a large number of machines

I realized then that the machinery for continuous deployment is different from the deployment velocity that you choose for your team. If we need to, we can make a commit and push it out inside of a couple of minutes, without breaking a sweat.

Why we don’t do continuous deployment on Socorro

We choose not to, except in urgent situations, for a few reasons:

  • We like to performance test our stuff, and we haven’t yet automated that
  • We like to have a human QA team test in addition to automated tests
  • We like to version our code and do “proper” releases because it’s open source and other people use our packages
  • A commit to one component of our system is often related to other commits in other components, which make more sense to ship as a bundle

Our process looks like this:

  • The “dev” environment runs trunk, and the “stage” environment runs a release branch.
  • On commit, Jenkins builds packages and deploys them to the appropriate environment.
  • To deploy to prod, we run a script that pulls the specified package from Jenkins and pushes it out to our production environment. We also tag the revision that was packaged for others to use, and for future reference. Note we are pushing the same package to prod that we pushed to stage, and stage reflects production.
  • If we need to change configuration for a release, we change it first in Puppet on staging, and then update production the same way.
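
The promote-to-prod step might be sketched roughly like this. Everything concrete here is an assumption made for illustration, not Socorro's actual script: the Jenkins URL, the job and package names, the host list, the install command, the use of git for tagging, and the DRY_RUN switch are all hypothetical. What it tries to show is the property that matters: prod receives the exact artifact that stage already ran, identified by build number, and nothing is rebuilt along the way.

```shell
#!/bin/sh
# Hypothetical promote-to-prod sketch. URLs, job names, hosts and the
# install command are placeholders, not Socorro's real configuration.
set -eu

JENKINS_URL="${JENKINS_URL:-https://jenkins.example.com}"
JOB="${JOB:-socorro-release}"
PROD_HOSTS="${PROD_HOSTS:-prod1.example.com prod2.example.com}"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

# Build the download URL for the package produced by a given Jenkins build.
package_url() {
    echo "$JENKINS_URL/job/$JOB/$1/artifact/socorro.tar.gz"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Fetch the already-built package once, then push that same file everywhere.
deploy() {
    build=$1
    run curl -fsSL -o "/tmp/socorro-$build.tar.gz" "$(package_url "$build")"
    for host in $PROD_HOSTS; do
        run scp "/tmp/socorro-$build.tar.gz" "$host:/tmp/"
        run ssh "$host" "sudo /usr/local/bin/install-socorro /tmp/socorro-$build.tar.gz"
    done
    # Record what shipped so the exact revision can be found later.
    run git tag "prod-build-$build"
}

deploy "${1:-42}"
```

With the default DRY_RUN=1 the script only prints what it would do, which makes it safe to read through the plan before a real push.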

We do intend to increase our deployment velocity in the coming months, but for us that’s not about the machinery of deployment, it’s about increasing our delivery velocity.

Delivery velocity is a different problem, which I’m wrestling with right now. We have a small team, and the work we’re trying to do tends not to come in small chunks but big ones, like a new report, or a system-wide change to the way we aggregate data (the thing we’re working on at the moment). It’s not that changes sit in trunk, waiting for a release. It’s more that it takes us a while to get something to a deployable stage. That is, deployment for us is not the blocker to getting features to our users faster.

Finally

It’s the same old theme you’ve seen ten times before on this blog: everybody’s environment is different, and continuous deployment may not be for everybody. On the other hand, the machinery for continuous deployment can be critical to making your life easier. Automating all this stuff certainly helps me sleep at night much better than I used to when we did manual pushes.

(Incidentally, I’d like to thank Rob Helmer and Justin Dow for automating our world: you couldn’t find better engineers to work with.)

All systems suck

I’ve been thinking a lot about this idea lately.  I’ve spent a lot of years as an engineer and consultant fixing other people’s systems that suck, writing my own systems that suck, and working on legacy systems, that, well, suck.

Don’t let anyone fool you.  All systems suck, to a greater or lesser extent.

If it’s an old system, there’s the part of the code that everybody is afraid to work on: the fragile code that is easier to replace than maintain or refactor.  Sometimes this seems hard, or nobody really understands it.  These parts of the code are almost always surrounded by an SEP field.  If you’re unfamiliar with the term, it means “Somebody Else’s Problem”.  Items with an SEP field are completely invisible to the average human.

New systems have the parts that haven’t been built yet, so you’ll hear things like “This will be so awesome once we build feature X”.  That sucks.

There’s also the prototype that made it into production, a common problem.  Something somebody knocked together over a weekend, whether it was because of lack of time, or because of their utter brilliance, is probably going to suck in ways you just haven’t worked out yet.

All systems, old and crufty or new and shiny, have bottlenecks, where a bottleneck is defined as the slow part, the part that will break first when the system is under excessive load.  This is also part of your system that sucks.

If someone claims their system has no bugs, I have news for you: their system sucks.  And they are overly optimistic (or naive).  (Possibly they just suck as an engineer, too.)

In our heads as engineers we have the Platonic Form of a system: the system that doesn’t suck, that never breaks, that runs perfectly and quietly without anyone looking at it.  We work tirelessly to make our systems approach that system.

Even if you produce this Platonically perfect system, it will begin to suck as soon as you release it.  As data grows and changes, there will start to be parts of the system that don’t work right, or that don’t work fast enough.  Users will find ways to make your system suck in ways you hadn’t even anticipated.  When you need to add features to your perfect system, they will detract from its perfection, and make it suck more.

Here’s the punchline: sucking is like scaling.  You just have to keep on top of it, keep fixing and refactoring and improving and rewriting as you go.  Sometimes you can manage the suck in a linear fashion with bug fixes and refactoring, and sometimes you need a phase change where you re-do parts or all of the system to recover from suckiness.

This mess is what makes engineering and ops different from pure mathematics.  Embrace the suck.  It’s what gets me up in the mornings.

Socorro’s Community

This post originally appeared on the Mozilla WebDev blog.

As readers of my blog posts know, Socorro is Mozilla’s crash reporting system. All of the code for Socorro (the crash catcher) and Breakpad (the client side) is open source and available on Google Code.

Some other companies are starting to use Socorro to process crashes. In particular, we are seeing adoption in the gaming and music sectors - people who ship connected client software.

One of these companies is Valve Software, the makers of Half Life, Left 4 Dead, and Portal, among other awesome games. Recently Elan Ruskin, a game developer at Valve, gave a talk at the Game Developers Conference about Valve’s use of crash analysis. His slides are up on his blog and are well worth a read.

If you’re thinking about trying Socorro, I’d encourage you to join the general discussion mailing list (or you can follow it on Google Groups). It’s very low traffic at present but I anticipate that it will grow as more people join.

Later in the year, we plan on hosting the inaugural Crash Summit at Mozilla, where we’ll talk about tools, crash analysis, and the future of crash reporting. Email me if you’re interested in attending (laura at mozilla) or would like to present. The event will be open to Mozillians and others. I’ll post updates on this blog as we develop the event.

Big Data at SXSW (begin shameless plug)

On Monday March 14, I’ll be one of the presenters at a SXSW workshop called “Big Data and APIs for PHP Developers”.

We’ll be talking about what Big Data is, how to work with it, Big Data APIs (how to design and implement your own, and how to consume them), data visualization, and the wonders of MapReduce. I’ll talk through a case study around Socorro: the nature of the data we have, how we manage it, and some of the challenges we have faced so far.

Workshops are new at SXSW. They are longer than the traditional panel - 2.5 hours - so we can actually get into some technical content. We plan on making our presentation a conversation about data, with plenty of war stories.

Hope to see you there!

The new Socorro

(This post first appeared on the Mozilla Webdev Blog.)

As per my previous blog post, we migrated Socorro to the new data center in Phoenix successfully on Saturday.

This has been a mammoth exercise that sprang from two separate roots:

  • In June last year, we had a configuration problem that looked like a spike in crashes for a pre-release version of Firefox (3.6.4). This incident (known, inaccurately, as “the crash spike”) made it clear that crash-stats is on the critical path to shipping Firefox releases.
  • Near the end of Q3 we realized we were rapidly approaching the capacity of the current Socorro infrastructure.

Around that time we also had difficulty with Socorro stability. This spawned the creation of a master Socorro Stability Plan, with six key areas. I’ll talk about each of these in turn.

Improve stability

Here, we solved some HBase-related issues, upgraded our software, scheduled restarts three times a week and, most importantly, sat down to conduct an architectural review.

The specific HBase issue that we finally solved had to do with intra-cluster communication. (We needed to upgrade our Broadcom NIC drivers and kernel to fix a known issue that appears when they are used with HBase. The symptom is that the number of TCP connections grows and grows until the box crashes. Solving this removed the need for constant system restarts.)

Architectural improvements

We collect incoming crashes directly to HBase. Because HBase is relatively new and we’d had stability problems, we decided to get HBase off the critical path for production uptime. We rewrote the collectors to fall back seamlessly to disk if HBase is unavailable, and optionally to use disk as primary storage. As part of this process, we replaced the system that moves crashes from disk to HBase. It went from single-threaded to multi-threaded, which makes catching up after HBase downtime much, much faster.
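
The fallback idea can be sketched in a few lines. This is not the Socorro collector code; it is a toy in shell, where `save_to_primary` stands in for the HBase write and `SPOOL_DIR` and `PRIMARY_UP` are hypothetical names invented for the sketch. The mover here is also sequential, not multi-threaded.

```shell
#!/bin/sh
# Toy sketch of "try the primary store, fall back to local disk".
# save_to_primary, SPOOL_DIR and PRIMARY_UP are made up for illustration.

SPOOL_DIR="${SPOOL_DIR:-$(mktemp -d)}"   # local fallback directory
PRIMARY_UP="${PRIMARY_UP:-0}"            # simulates primary store availability

# Stand-in for a write to the primary store; fails unless PRIMARY_UP is 1.
save_to_primary() {
    [ "$PRIMARY_UP" = "1" ]
}

# Accept a crash. An unavailable primary store must never lose the crash.
store_crash() {
    crash_id=$1
    payload=$2
    if save_to_primary "$crash_id" "$payload"; then
        echo "primary"
    else
        printf '%s\n' "$payload" > "$SPOOL_DIR/$crash_id.json"
        echo "disk"
    fi
}

# Mover: once the primary store is back, drain the spool into it.
drain_spool() {
    for f in "$SPOOL_DIR"/*.json; do
        [ -e "$f" ] || continue
        if save_to_primary "$(basename "$f" .json)" "$(cat "$f")"; then
            rm "$f"
        fi
    done
}

store_crash abc123 '{"product": "Firefox"}'   # primary is down: goes to disk
PRIMARY_UP=1
drain_spool                                   # primary is back: spool is emptied
```

The point of the design is that accepting a crash and storing it in the primary store are decoupled, so an outage only delays processing.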

We still want to put a write-through caching layer in front of HBase. This quarter, the Metrics team is prototyping a system for us to test, using Hazelcast.

Build process

We now have an automated build system. When code is checked into svn, Hudson notices, creates a build and runs tests. This build is then deployed in staging. Now that we are on the new infrastructure, we will deploy new releases from Hudson as well.

Improved release practices

We have a greatly improved set of release guidelines, including writing a rollback plan for every major release. We did this as part of the migration, too. Developers now have read-only access to everything in production: we can audit configs and tail logs.

The biggest change here, though, is switching to Puppet to manage all of our configurations. Socorro has a lot of moving parts, each with its own configuration, and thanks to the fine work by the team, all of these configurations are automatically and centrally managed and can be changed and deployed at the push of a button.

Improved insight into systems

As part of the migration, we audited our monitoring systems. We now have many more Nagios monitors on all components, and have spent a lot of time tuning these. We also set up Ganglia and have a good feel for what the system looks like under normal load.

We still intend to build a better ops dashboard so we can see all of these checks and balances in one place.

Move to bigger, better hardware in PHX

This one is the doozy. Virtually all Socorro team resources have been on this task full time since October, and we managed to subsume half of the IT/Ops team as well. I’d like to give a special shout-out here to NetOps, who patiently helped with our many many requests.

It’s worth noting that part of the challenge here is that Socorro has a fairly large number of moving parts. I present, for your perusal, a diagram of our current architecture.

[Figure: Socorro Architecture, 1.7.6]

I’ve blogged about our architecture before, and will do so again, soon, so this is just a teaser.

One of the most interesting parts of the migration was the extensive smoke and load testing we performed. We set up a farm of 40 SeaMicro nodes and used them to launch crashes at the new infrastructure. This allowed us to find network bottlenecks and misconfigurations and to tune the system, so that on the day of the migration we were really confident that things would go well. QA also really helped here: with the addition of a huge number of automated tests on the UI, we knew that things were looking great from an end user perspective.

The future

We now have a great infrastructure and set of processes to build on. The goal of Socorro - and Webtools in general - is to help Firefox ship as fast and seamlessly as possible, by providing information and tools that help to build the best possible browser. (I’ll talk more about this in future blog posts, too.)

Thanks

I’ll wrap up by thanking every member of the team that worked on the migration, in no particular order:

  • Justin Dow, Socorro Lead Sysadmin, Puppetmaster
  • Rob Helmer, Engineer, Smoke Tester Extraordinaire and Automater of Builds
  • Lars Lohn, Lead Engineer
  • Ryan Snyder, Web Engineer
  • Josh Berkus, PostgreSQL consultant DBA
  • Daniel Einspanjer, Metrics Architect
  • Xavier Stevens, HBase wizard
  • Corey Shields, Manager, Systems
  • Matthew Zeier, Director of Ops and IT
  • Ravi Pina, NetOps
  • Derek Moore, NetOps
  • Stephen Donner, WebQA lead
  • David Burns, Automated Tester
  • Justin Lazaro, Ops

Good job!
Mozilla Crash Reporting Team

PHP and Big Data

This post first appeared as part of the PHP Advent Calendar.

Big data, data science, analytics. These are some of the hottest buzzwords in tech right now. Five years ago, the boasting rights went to the geek with the largest number of users: these days he with the biggest data wins.

There are a number of approaches to dealing with vast quantities of data, but one of the best known is Apache Hadoop. Hadoop is a toolkit for managing large data sets, based originally on the Google whitepapers about MapReduce and the Google File System. For Socorro, the Mozilla crash reporting system, we use HBase, a non-relational (NoSQL) database built on the Hadoop ecosystem.

The Hadoop world is largely a Java world, since all the tools are written in Java. However, if you feel the same way about Java as Sean Coates, you should not lose hope. You, too, can use PHP to work with Hadoop.

Let’s start by understanding MapReduce. This is a framework for distributed processing of large datasets.

A MapReduce job consists of two pieces of code:

  • A Mapper maps input key-value pairs to output key-value pairs.
  • A Reducer receives and collates results from Mappers.

More parts are needed to make this work:

  • An Input reader generates splits of data for each Mapper to work through.
  • A Partition function takes the output of Mappers and chooses a destination Reducer.
  • An Output writer takes the output of the Reducers and writes it to the Hadoop Distributed File System (HDFS).
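
To make these stages concrete, here is a toy, single-machine walk through them for word count. No Hadoop is involved, every function name is made up for the sketch, and `cksum` is only a stand-in hash for the partition function. The one property it shares with a real partitioner is that the same key always lands on the same reducer.

```shell
#!/bin/sh
# Toy imitation of the MapReduce stages for word count. Illustration only.

NUM_REDUCERS=2
TAB=$(printf '\t')

# Mapper: emit "word<TAB>1" for every word on standard input.
map_words() {
    tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Partition function: choose a reducer number for a key.
partition_for() {
    hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $((hash % NUM_REDUCERS))
}

# Reducer: collate the counts it receives for each word.
reduce_counts() {
    awk -F '\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }'
}

# Run the flow: map, route each record to a partition, reduce each partition.
run_job() {
    workdir=$(mktemp -d)
    map_words | while IFS="$TAB" read -r word count; do
        printf '%s\t%s\n' "$word" "$count" >> "$workdir/part-$(partition_for "$word")"
    done
    for part in "$workdir"/part-*; do
        [ -e "$part" ] || continue
        reduce_counts < "$part"
    done
    rm -rf "$workdir"
}

printf 'To be, or not to be\n' | run_job | sort
```

Because each word is routed to exactly one partition, the per-partition totals are already the final totals. That is why reducers can run independently of each other.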

In summary, the Mapper and Reducer are the core functionality of a MapReduce job. Now, let’s get set up to write a Mapper and Reducer against Hadoop with PHP.

Setting up Hadoop is a non-trivial task; luckily, a number of VMs are available to help. For this example, I am using the Training VM from Cloudera. (You’ll need VMware Player for Windows or Linux, or VMware Fusion for OS X to run this VM.)

Once you’ve started the VM, open up a terminal window. (This VM is Ubuntu based.)

The VM you have just installed comes with a sample data set of the complete works of Shakespeare. You’ll need to put these files into HDFS so that we can work with them. Run the following commands to put the files into HDFS:

cd ~/git/data
tar vzxf shakespeare.tar.gz
hadoop fs -put input /user/training/input

You can confirm this worked by viewing the files in the input directory on HDFS: hadoop fs -ls /user/training

Next, we need to create the mapper and reducer. To demonstrate these, we’ll reproduce what is often referred to as the canonical MapReduce example: word count.

You can find the Java version of this code in the Cloudera Hadoop Tutorial.

As you can see (if you know Java), the mapper reads words from input, and for each word it encounters, emits to standard output the word and the value 1 to indicate that the word has been encountered. The reducer takes output from mappers and aggregates it to produce a set of words and counts.

The easiest way to communicate from PHP to Hadoop and back again is using the Hadoop Streaming API. This expects mappers and reducers to use standard input and output as a pipe for communication.

This is how we write the word count mapper in PHP, which we’ll name mapper.php:

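#!/usr/bin/php
<?php
// Word count mapper: reads text on stdin, writes "word<TAB>1" per word.

$input = fopen("php://stdin", "r");

// Compare against false so a final line containing only "0" is not skipped.
while (($line = fgets($input)) !== false) {
    $line = strtolower($line);
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced between
    // consecutive non-word characters and at the end of each line.
    $words = preg_split("/\W+/", $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo "$word\t1\n";
    }
}

fclose($input);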

We open standard input for reading a line at a time, split that line into an array along word boundaries using a regular expression, and emit output as the word encountered followed by a 1. (I delimited this with tabs, but you may use whatever you like.)

Now, here’s the reducer (reducer.php):

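#!/usr/bin/php
<?php
// Word count reducer: reads "word<TAB>count" lines, prints a total per word.

$input = fopen("php://stdin", "r");
$counts = array();

while (($line = fgets($input)) !== false) {
    $tuple = explode("\t", rtrim($line, "\r\n"));
    if (count($tuple) < 2) {
        continue; // skip malformed lines
    }
    list($word, $count) = $tuple;
    // Initialise the slot first to avoid an undefined index notice.
    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word] += (int) $count;
}

fclose($input);

foreach ($counts as $word => $count) {
    echo "$word $count\n";
}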

Again, we read a line at a time from standard input, and summarize the results in an array. Finally, we write out the array to standard output.
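
Holding every word in an array is fine for Shakespeare, but Hadoop Streaming hands each reducer its input sorted by key, so a reducer can also emit a total whenever the key changes and keep only one running sum in memory. Here is a minimal sketch of that variation, written with awk in shell so it can be tried without PHP or Hadoop. The function name is hypothetical, and the output is tab-delimited, unlike the space-delimited output of reducer.php.

```shell
#!/bin/sh
# Sketch of a reducer that relies on its input arriving sorted by key.
# It keeps one running total and emits it when the key changes.

reduce_sorted() {
    awk -F '\t' '
        NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
        { key = $1; sum += $2 }
        END { if (NR > 0) print key "\t" sum }
    '
}

# "sort" plays the part of the framework's shuffle step here.
printf 'yes\t1\nyeoman\t1\nyes\t1\n' | LC_ALL=C sort | reduce_sorted
```

The same pattern translates directly to PHP: remember the previous key, and echo the running total when a new key appears and once more at end of input.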

Copy these scripts to your VM, and once you have saved them, make them executable:

chmod a+x mapper.php
chmod a+x reducer.php

You can run this example code in the VM using the following command:

hadoop \
jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+320.jar \
-mapper mapper.php \
-reducer reducer.php \
-input input \
-output wordcount-php-output

In the output from the command you will see a URL where you can trace the execution of your MapReduce job in a web browser as it runs. Once the job has finished running, you can view the output in the location you specified:

hadoop fs \
-ls /user/training/wordcount-php-output

You should see something like:

Found 2 items
drwxr-xr-x   - training supergroup          0 2010-12-14 15:40 /user/training/wordcount-php-output/_logs
-rw-r--r--   1 training supergroup     279706 2010-12-14 15:40 /user/training/wordcount-php-output/part-00000

You can view the output, too:

hadoop fs \
-cat /user/training/wordcount-php-output/part-00000

An excerpt from the output should look like this:

yeoman 13
yeomen 1
yerk 2
yes 211
yest 1
yesterday 25

This is a pretty trivial example, but once you have this set up and running, it’s easy to extend this to whatever you need to do. Some examples of the kinds of things you can use it for are inverted index construction, machine learning algorithms, and graph traversal. The data you can transform is limited only by your imagination, and, of course, the size of your Hadoop cluster. That’s a topic for another day.

Parenting Versus Programming

This post was written for and first appeared in the PHP Advent Calendar 2009.

Advent calendars are about Christmas, and for me Christmas has always been a time for family. This year I have recently joined the ranks of the parents among you. I am taking a short break from work and focusing on being a mother rather than being a programmer. This has led me to reflect on the similarities between parenting and coding. I present these here for your enlightenment, or so you can laugh at me.

Lesson One: Smells

Babies, like programs, are associated with a variety of interesting smells. If you try to ignore the smells associated with babies, they only get worse over time. Then the screaming starts.

It was Martin Fowler who popularized the notion of code smells. These are defined as parts of your code that do things in an ugly way, or to put it a different way, they are hacks. Typically, when we find these parts of code, our eyes begin to glaze over, and we enter a strange, zombie-like state. (This will also be familiar to parents.) In this state, we are paralyzed and thus prevented from doing anything about the smell. It is only when we emerge from the cave of code smells into the daylight of clean code that we can again be productive. I am not sure whether it is the horror of the code, a tendency to procrastinate that is endemic among programmers, or simply a fear of breaking things that prevents most people from doing something about the mess. It is always the smelliest parts of the code that are the most fragile.

The lesson programmers can learn from babies, here, is to face your smells and get rid of them as quickly as possible. Then everybody’s happy.

Lesson Two: Sleep

I have now been programming for mumble years. Not long after I started work at Mozilla, a delightful Canadian journalist asked me what it was like to work with a group of people “so much younger than myself.” While I am not actually that old, I am old enough to have picked up some skills. One of these skills is the ability to survive on 4 to 6 hours of sleep a night for extended periods. Although this is not my favorite hobby, I have gained a certain level of mastery.

Another thing I’ve learned is that surviving on 4 to 6 hours of sleep a night for an extended period makes you mildly deranged. With parenting, as with programming, it is sometimes required. When you achieve a certain level of derangement, you find that very silly things start to sound like good ideas, or they just start to happen, whether you intended them to or not. For example, you find yourself accidentally filing the baby in the filing cabinet or implementing a new framework.

This is a lesson both programmers and parents can learn from the military: sleep when you can do so safely, and as often as possible. If you can snatch a few minutes of nap here and there, it may not make you feel a lot better, but your degraded IQ will recover somewhat.

Lesson Three: With Great Power Comes Great Responsibility

When you set out to implement a new web app or raise a child — and yes, I do realize these things are not quite on the same scale — you have a great responsibility to do a good job. Otherwise, everyone who has to interact with your app or child in the future will curse your name. Repeatedly. Your baby — whether it is human or code — is totally dependent upon you to do a good job.

Lesson Four: It Takes a Village

On that note, it is important to realize that it is very hard to raise a child or write a web app completely on your own. Some things are better done with a little help from your friends. Whether that help is providing a role model, a shoulder to cry on, advice when you just don’t know what to do, or purely someone to vent to, it really helps to surround yourself with people you can rely on.

That summarizes the commonalities between parenting and programming that I have learned over the last 7 weeks. I suspect I have a great deal more to learn.

One final note: If anyone approaches you to work on a web app that will take 18 years, run away as fast as you can, and do not look back.

Seven Things

I feel like I’m about last to the party, but after getting tagged by Ben Ramsey I thought I’d contribute to this meme/tag/whatever that’s going around the PHP community.  I intend to blog more in 2009, so this is a starting point anyway.

Seven Things you didn’t know about me:

1.   I learned to program in the 4th grade on an Apple II, in LOGO.  A high school near me had a program where a group of selected nerds from local schools would go there once a week and share two machines.  I was the only one in primary (elementary) school.

2. I used to do a lot of singing - school choir, madrigals, musicals, an a cappella group with my friends, and a youth gospel band.  I sing alto.

3. I’m not religious and consider myself a (non militant) atheist.  We were raised that way since my family is a combination of Catholic, Jewish and Presbyterian and my parents didn’t believe in organized religion.  Despite this I went to an Anglican girls’ school for 11 years (also see the above mentioned gospel band) and am technically Jewish.

4. I met my husband Luke Welling in Advanced Software Engineering at college (RMIT University in Melbourne, Australia).  We had to do a review of each other’s code as part of an assignment.  The first words he ever said to me were, “This is crap.”

5.  I moved house every 1-2 years when I was a kid, left school at 16, went back at 19, and then spent far too much time nerding out at college, which I loved.

6.  I’ve been riding horses since I was four years old.  At four, I rode my Shetland pony Froggy through the house to annoy my mother.  (I believe I succeeded.)

7.  I sold my first article to a magazine when I was about 13 years old.  It was a light humorous piece about the challenges involved in buying a horse and was printed in a national horse magazine.  (I always wanted to be a writer when I grew up…or possibly a veterinarian…or maybe a secret agent.  I never thought it would end up being tech books that I wrote.)  I have also written one complete bad novel, and the beginnings of several others.  (Nanowrimo, I’m looking at you.)

OK, now here are my tag-ees.

Tag-wise, here are the rules:

  • Link your original tagger(s), and list these rules on your blog.
  • Share seven facts about yourself in the post—some random, some weird.
  • Tag seven people at the end of your post by leaving their names and the links to their blogs.
  • Let them know they’ve been tagged by leaving a comment on their blogs and/or Twitter.

Write Beautiful Code at OSCON

I gave my talk yesterday at OSCON 2008, and here are the slides.

It’s interesting - I think every time I have given this talk I focus on a slightly different aspect. Yesterday it was the importance of decoupling parts of your application architecture as much as possible. This is better for security reasons (allows paranoid coding practices), for scaling (allows you to switch out and/or scale components independently and quickly), and for maintainability.

OSCON is good as usual - if you’re here be sure to join Mozilla at Beerforge tonight, and come say hi.

Foxes

We have a litter of fox kits in the back field at our house.  Today we managed to catch them on film.  Please enjoy our very own foxkehs.  :)

Foxkeh?

Edited to add: Some people apparently don’t know Foxkeh…for comparison:
Foxkeh,  (C) 2006 Mozilla Japan

You can view the whole set here:
http://flickr.com/photos/lauraxthomson/sets/72157605003262452/