<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>tech ramblings</title>
	<atom:link href="http://www.laurathomson.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.laurathomson.com</link>
	<description>Laura Thomson's random thoughts and rants about tech and FOSS</description>
	<pubDate>Fri, 14 Jun 2013 20:21:20 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
	<language>en</language>
			<item>
		<title>DXR: code search and static analysis</title>
		<link>http://www.laurathomson.com/2013/06/dxr-code-search-and-static-analysis/</link>
		<comments>http://www.laurathomson.com/2013/06/dxr-code-search-and-static-analysis/#comments</comments>
		<pubDate>Fri, 14 Jun 2013 20:21:20 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Mozilla]]></category>

		<category><![CDATA[DXR]]></category>

		<category><![CDATA[Webtools]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=148</guid>
		<description><![CDATA[I asked Erik Rose from my team to blog about his work on DXR (docs), the code search and static analysis tool for the Firefox codebase.  He did so on the Mozilla Webdev blog, so it would show up on Planet Mozilla. Today, it was pointed out to me that the Webdev blog is not [...]]]></description>
			<content:encoded><![CDATA[<p>I asked <a href="http://www.grinchcentral.com/">Erik Rose</a> from my team to blog about his work on <a href="http://dxr.mozilla.org/">DXR</a> (<a href="https://wiki.mozilla.org/DXR">docs</a>), the code search and static analysis tool for the Firefox codebase.  He did so on the <a href="https://blog.mozilla.org/webdev">Mozilla Webdev blog</a>, so it would show up on <a href="http://planet.mozilla.org/">Planet Mozilla</a>. Today, it was pointed out to me that the Webdev blog is not on Planet.</p>
<p>It&#8217;s a great post, summarizing all the things he&#8217;s done in the last few months.  Go read the article: <a href="https://blog.mozilla.org/webdev/2013/06/13/dxr-digests-the-firefox-codebase/">DXR digests the Firefox codebase</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2013/06/dxr-code-search-and-static-analysis/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Weight of Impostor Syndrome</title>
		<link>http://www.laurathomson.com/2013/04/the-weight-of-impostor-syndrome/</link>
		<comments>http://www.laurathomson.com/2013/04/the-weight-of-impostor-syndrome/#comments</comments>
		<pubDate>Tue, 23 Apr 2013 19:54:30 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Personal]]></category>

		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=147</guid>
		<description><![CDATA[Talk of impostor syndrome is almost memetic at the moment. If you don&#8217;t know what it is, go look it up.  I&#8217;ll wait.
Like lots of other people, I struggle with this constantly. I&#8217;m not as smart as everybody else in the room. I&#8217;m not as good a coder. I&#8217;m not as good a manager. Sooner [...]]]></description>
			<content:encoded><![CDATA[<p>Talk of impostor syndrome is almost memetic at the moment. If you don&#8217;t know what it is, <a href="http://en.wikipedia.org/wiki/Impostor_syndrome">go look it up</a>.  I&#8217;ll wait.</p>
<p>Like lots of other people, I struggle with this constantly. I&#8217;m not as smart as everybody else in the room. I&#8217;m not as good a coder. I&#8217;m not as good a manager. Sooner or later I will be found out for what I am: an impostor.</p>
<p>Thing is, I can rationally defeat many of those things by looking at objective evidence. I recite the evidence to myself. I am smart: my IQ is nearly 150. I wrote a programming book that some people really like - note I first wrote that as &#8220;great&#8221;, deleted it, wrote &#8220;best-selling&#8221;, deleted it, and settled for &#8220;some people really like&#8221;. I have worked on some interesting coding projects. I manage a successful team at an interesting company doing things that are technically difficult and that will hopefully make a difference in the world.</p>
<p>But in the back of my brain, a little voice says, that was just luck.</p>
<p>I recently realized that impostor syndrome is present in all parts of my life, not just in my career.  Everyone is better at riding horses than I am, even though I&#8217;ve been doing it since I was four. My fiction writing sucks, and my critique group will eject me once they figure it out.  My house is messier than everyone else&#8217;s, and I think I&#8217;m a terrible cook. I can&#8217;t co-ordinate my wardrobe.</p>
<p>The worst part is standing at the playground, thinking that every other parent there knows what they are doing except for me.</p>
<p>I have to remind myself these things aren&#8217;t true. Every day. I heard some good advice recently, which was to speak to yourself as if you were your best friend. You wouldn&#8217;t say to your best friend, &#8220;You&#8217;re an idiot&#8221;, now, would you? Even if your BFF did something objectively stupid, you might tell them, &#8220;You&#8217;re not stupid. We all do dumb things, sometimes.&#8221;</p>
<p>How about you? If you have strategies for overcoming impostor syndrome, share them in the comments.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2013/04/the-weight-of-impostor-syndrome/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Webtools in 2012: Part 2</title>
		<link>http://www.laurathomson.com/2013/03/webtools-in-2012-part-2/</link>
		<comments>http://www.laurathomson.com/2013/03/webtools-in-2012-part-2/#comments</comments>
		<pubDate>Tue, 05 Mar 2013 18:56:21 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=146</guid>
		<description><![CDATA[A couple of weeks ago I put up a blog post about what we did on Socorro in 2012.  I promised another blog post about all the non-Socorro things we did.  I have probably missed some things, but here&#8217;s the list:
Elmo
Elmo is a localization management dashboard.  We worked this into a Playdoh app, completed a [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago I put up a <a href="http://www.laurathomson.com/2013/02/a-billion-crashes-2012-in-review/">blog post about what we did on Socorro in 2012</a>.  I promised another blog post about all the non-Socorro things we did.  I have probably missed some things, but here&#8217;s the list:</p>
<h2>Elmo</h2>
<p><a href="https://l10n.mozilla.org/">Elmo</a> is a localization management dashboard.  We worked this into a Playdoh app, completed a redesign, built a new homepage, deployed it on new infrastructure, moved it to a new domain, added metrics and launched the app!</p>
<h2>Bouncer</h2>
<p><a href="https://github.com/mozilla/tuxedo/">Bouncer</a> is the download redirector and is one of the oldest webapps at Mozilla. In 2012 we revived the project in order to support the stub installer.  We worked with IT to build out new dev, stage and prod clusters.  We added support for the redirects that stub installer needed, and made Bouncer SSL aware.  We also fixed a number of other issues.</p>
<h2>Air Mozilla</h2>
<p>We built and launched the brand-new <a href="https://air.mozilla.org/">Air Mozilla</a> webapp, including support for Persona, secure/private streams, integrated event scheduling, and a bunch of other exciting features.</p>
<h2>Perfomatic</h2>
<p>We worked with A-team to update <a href="http://graphs.mozilla.org/">graph server</a> into Datazilla to support changes to make Talos more statistically reliable.</p>
<h2>DXR</h2>
<p><a href="http://dxr.mozilla.org/">DXR</a> is a code search tool based on static analysis of the code.  We ran a <a href="https://wiki.mozilla.org/Webtools/DXR_User_Research">usability study</a> and built mockups in preparation for the work we&#8217;ve been doing this year (new UI, MXR parity).</p>
<h2>Etherpad / Etherpad Lite</h2>
<p>We deployed <a href="https://etherpad.mozilla.org/">Etherpad with Persona support</a>, and added Persona and Teampad support to <a href="https://github.com/rhelmer/etherpad-lite">Etherpad Lite</a> (staged on the PaaS so I won&#8217;t link it here).  We are working on security review of EL prior to deployment, and also on getting our changes upstreamed.</p>
<h2>Whistlepig</h2>
<p>We developed the UI for a <a href="https://github.com/ossreleasefeed/WhistlePig">new IT maintenance and outage dashboard</a>.  Bonus: This looks like it will be part of the NOC we are going to build out in 2013.</p>
<h2>Tools-as-a-service</h2>
<p>We developed a product plan for building out a webtools-as-a-service offering for <a href="https://marketplace.mozilla.org/">Marketplace</a>.</p>
<h2>MozLDAP</h2>
<p>We built an <a href="https://github.com/mozilla/moz-ldap">API</a> that wraps LDAP, so if you want to write a Mozilla webapp that uses LDAP for auth you can use this library and avoid having to build your own LDAP code.</p>
<h2>Balrog (AUS)</h2>
<p>We built an admin UI for the new <a href="https://github.com/mozilla/balrog">Automated Update Service</a> for Release Engineering.</p>
<h2>Plugincheck</h2>
<p>We developed a new UI for <a href="http://www.mozilla.org/plugincheck">plugincheck</a>. This is about to launch.</p>
<h2>Dragnet</h2>
<p>We developed code for a new <a href="https://github.com/mozilla/dragnet">crowdsourced DLL directory</a>, based on the DLL information that we have in Socorro.  This is code complete and in a pre-launch state.</p>
<h2>Mediawiki-Bugzilla plugin</h2>
<p>We took over development and launched this <a href="https://github.com/mozilla/mediawiki-bugzilla">plugin</a> after <a href="http://twitter.com/legneato">Legneato</a> left. It provides integration between Bugzilla and Mediawiki.</p>
<h2>Privacy Hub</h2>
<p>We developed a new set of pages (UI and code development) to hold Mozilla&#8217;s Privacy policies.  These will form part of <a href="http://mozilla.org">mozilla.org</a>.</p>
<h2>Gaia</h2>
<p>We worked on the UX for Gaia, the user-facing layer of <a href="https://www.mozilla.org/en-US/firefox/partners/">FirefoxOS</a>.</p>
<h2>Playdoh</h2>
<p>We contributed many patches to <a href="https://github.com/mozilla/playdoh">Playdoh</a>, the Mozilla version of Django.</p>
<h2>Verbatim</h2>
<p>We added support for <a href="https://localize.mozilla.org/verbatim-contributors.html">contributor acknowledgments</a>, which was accepted upstream in Pootle.</p>
<h2>PTO</h2>
<p>We built out a new PTO app for reporting vacation. This was completed but did not launch as a different approach is being pursued.</p>
<h2>Sheriffs</h2>
<p>We built out a new app for co-ordinating the Sheriffs calendar. This was completed but did not launch due to hiring a perma-sheriff (probably a better solution than a webapp).</p>
<h2>Bramble/Briar-patch</h2>
<p>We prototyped a monitoring and capacity planning dashboard for the build farm.  This project was later put on hold and did not launch.</p>
<h2>Team growth and development</h2>
<p>During the year, we welcomed new team members <a href="http://www.chesnok.com/">Selena Deckelmann</a> and <a href="http://twitter.com/erikrose">Erik Rose</a>, and intern Tim Mickel.  We participated in several Mozilla workweeks, including a Stability themed work week with Engineering, a team-only workweek at DjangoCon, and a Webdev workweek.  We gave talks at several conferences and participated in HackerSchool.</p>
<p>We got better at working with Ops, QA, and RelEng and built trust and relationships with those groups.</p>
<p>We automated a bunch of processes, perhaps most notably building on pull requests with <a href="https://github.com/Lonnen/leeroy">Leeroy</a> (awesome!).</p>
<h2>finally:</h2>
<p>My New Year&#8217;s Resolution for 2012 was, &#8220;Do more. Go Faster.&#8221;  Mission accomplished.</p>
<p>If I could change anything it would be avoiding the rabbithole of projects that were later killed - it&#8217;s a waste of team effort.  We had a small handful of these.</p>
<p>Overall, it was an awesome, invigorating, and exhausting year. I hope we can do even more and cooler things in 2013.</p>
<p>One point to note is that we are a broadly distributed and largely remote team, but we work well together and ship a lot of stuff.  We are currently spread across Mountain View, northern California, Oregon (multiple locations), Maryland (multiple locations), France, and South Africa.</p>
<p>My thanks to the Webtools team: Adrian Gaudebert, Brandon Savage, Chris Lonnen, Erik Rose, K Lars Lohn, Peter Bengtsson, Rob Helmer, Schalk Neethling, and Selena Deckelmann; and interns Tim Mickel and Tony Young.  You are all awesome.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2013/03/webtools-in-2012-part-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Dumps, disks, and disasters: a detective story</title>
		<link>http://www.laurathomson.com/2013/02/dumps-disks-and-disasters-a-detective-story/</link>
		<comments>http://www.laurathomson.com/2013/02/dumps-disks-and-disasters-a-detective-story/#comments</comments>
		<pubDate>Tue, 26 Feb 2013 21:46:00 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[DevOps]]></category>

		<category><![CDATA[Mozilla]]></category>

		<category><![CDATA[postmortem]]></category>

		<category><![CDATA[troubleshooting]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=145</guid>
		<description><![CDATA[Not long ago, in a datacenter not far away&#8230;this is a story about stuff going wrong and how we fixed it.
Prologue
My team works on Socorro, the Firefox crash reporting system, among other things.
When Firefox crashes, it submits two files to us via HTTP POST. One is JSON metadata, and one is a minidump. This is [...]]]></description>
			<content:encoded><![CDATA[<p>Not long ago, in a datacenter not far away&#8230;this is a story about stuff going wrong and how we fixed it.</p>
<h2>Prologue</h2>
<p>My team works on Socorro, the Firefox crash reporting system, among other things.</p>
<p>When Firefox crashes, it submits two files to us via HTTP POST. One is JSON metadata, and one is a minidump. This is similar in nature to a core dump. It&#8217;s a binary file, median size between 150 and 200 kB.</p>
<p>When we have a plugin problem (typically with Flash), we get two of these crash reports: one from the browser process, and one from the plugin container.  Until recently it was challenging to reunite the two halves of the crash information.  Benjamin Smedberg made a change to the way plugin crashes are reported.  We now get a single JSON metadata file, with both minidumps, the one from the browser, and the one from the plugin container.  We may at some point get another 1-2 dumps as part of the same crash report.</p>
<p>We needed to make a number of code changes to Socorro to support this change in our data format.  From here on in, I shall refer to this architectural change as &#8220;multidump support&#8221;, or just &#8220;multidump&#8221;.</p>
<p>Crashes arrive via our collectors.  This is a set of boxes that run two processes:</p>
<ol>
<li>Collector: this is Python (web.py) running in a mod_wsgi process via Apache.  Collector receives crashes via POST, and writes them to local filesystem storage.</li>
<li>Crash mover: this is a Python daemon that picks up crashes from the filesystem and writes them to HBase.</li>
</ol>
<p>You may be saying, &#8220;Wow, local disk? That is the worst excuse for a queue I&#8217;ve ever seen.&#8221; You would be right.  The collector uses pluggable storage, so it can write wherever you want (Postgres, HBase, or the filesystem).  We have previously written crashes to NFS, and more recently, and less successfully, directly to HBase.  That turned out to be a Bad Idea&#8482;, so about two years ago I suggested we write them to local disk &#8220;until we can implement a proper queue&#8221;.  Local storage has largely turned out to be &#8220;good enough&#8221;, which is why it has persisted for so long.</p>
<p>Adding multidump support changed the filesystem code, among other things.</p>
<h2>Act I: An Unexpected Journey</h2>
<p>1/10/2013<br />
We had landed multidump support on our master and stage branch, but engineers and QA agreed that we were not quite comfortable enough with it to ship it.  Although we had planned to ship it this day, we didn&#8217;t, but we had some other stuff we needed to ship.  Instead of what we usually do (in git, push master to stage, which is our release branch), we stashed stage changes between the last release and now, and then cherry picked the stuff we needed to ship.</p>
<p>What we didn&#8217;t realize was that we had accidentally left multidump in the stage branch, so when we pushed, we pushed multidump support.  It ran for several hours in production, seemingly without problems.  We had not applied the PostgreSQL schema migration, but we had previously changed the HBase schema to support multidump, so the new code didn&#8217;t cause any problems; it just wasn&#8217;t end-user visible.  When we realized the error, we rolled back, rebuilt, and pushed the intended changes.  This happened within a couple of hours.  (The rollback/rebuild/repush itself took a minute or two.)</p>
<p>1/17/2013<br />
We intentionally pushed multidump support.  It passed QA, and everything seemed to be going swimmingly.</p>
<p>1/22/2013<br />
A Socorro user (Kairo) noticed that our crash volume had been lower than average for the last couple of days.</p>
<p>Investigation showed that many, many crashes were backed up in filesystem storage, and that HBase writes were giving frequent errors, meaning that the crashmovers were having trouble keeping up.</p>
<p>We decided to take one collector box at a time out of the pool, to allow it to catch up.  We also noticed at this time that all the collectors were backed up except collector04, which was keeping up.  This was a massive red herring, as it later turned out.  We ran around checking that the config, build, and netflows on collector04 were the same as on the other collectors.  While we watched, collector04 gradually began backing up, and then was in the same boat as the others.</p>
<p>Based on previous experiences, many bad words were said about Thrift at this point.  (If you don&#8217;t know Thrift, it&#8217;s a mechanism we use for talking to HBase. We use it because our code is in Python and not a JVM language, so we use Thrift as middleman.)   But this was instinct, not empirical evidence, and therefore not useful for problem solving.</p>
<p>To actually diagnose the problem, we first tried strace-ing the crashmover process, and then decided to push an instrumented build to a single box.  By &#8220;instrumented&#8221; I mean &#8220;it logs a lot&#8221;.  As soon as we had the instrumentation in place, syslog began to tell a story.  Each crash move was taking 4-5 seconds to complete.  Our normal throughput on a single collector topped out at around 2800-3000 crashes/minute, so something was horribly wrong.</p>
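<p>To put those two numbers side by side, here is a back-of-the-envelope sketch.  It assumes a single crash mover handling crashes strictly one at a time, which is a simplification for illustration; the function names are made up for this example and are not Socorro code.</p>
<pre>
```python
# Assumption: one crash mover processes crashes serially, one at a time.
NORMAL_PER_MINUTE = (2800, 3000)   # normal single-collector throughput
SECONDS_PER_MOVE = (4, 5)          # observed time per crash move


def crashes_per_minute(seconds_per_crash):
    """Throughput of a serial worker that spends this long on each crash."""
    return 60.0 / seconds_per_crash


def normal_ms_per_crash(per_minute):
    """Time budget per crash, in milliseconds, at a given throughput."""
    return 60000.0 / per_minute


for seconds in SECONDS_PER_MOVE:
    print(seconds, "s per move is", crashes_per_minute(seconds), "crashes/minute")

for rate in NORMAL_PER_MINUTE:
    print(rate, "crashes/minute is", round(normal_ms_per_crash(rate), 1), "ms per crash")
```
</pre>
<p>Under that assumption, 4-5 seconds per move works out to 12-15 crashes per minute, against a normal budget of roughly 20 milliseconds per crash.</p>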
<p>As it turned out the slow part was actually *deleting* the crashes from disk.  That was consuming almost all of the 4-5 seconds.</p>
<p>While looking at the crashes on disk, trying to discern a pattern, we made an interesting discovery.  Our filesystem code uses radix storage: files are distributed among directories on a YY/MM/DD/HH/MM/ basis.  (There are also symlinks to access the crashes by the first digit of their hex OOID, or crash ID.)  We discovered that instead of distributing crashes like this, all the crashes on each collector were in a directory named [YY]/[MM]/[DD]/00/00.  Given the backlog, that meant that, on the worst collector, we had 750,000 crashes in a single directory, on ext4. What could possibly go wrong?</p>
<p>At this point we formed the hypothesis that deletes were taking so long because of the sheer number of files in a directory.  (If there&#8217;s any kind of stat in the code - and strace showed there was - then this would perform poorly.)</p>
<p>We moved the crashes manually out of the way, as a test.  This sped things up quite a bit.</p>
<p>We also noticed at this point that the 00/00 crashes had backed up on several days.  We had some orphaned crashes on disk (a known bug, when multiple retries fail), and this was the pattern.<br />
01/10/00/00 - a moderate number of crashes<br />
01/17/00/00 - ditto<br />
(same for each succeeding day)<br />
01/22/00/00 - a huge number of crashes</p>
<p>These days correlated to the days we had multidump code running in production.  We had kind of suspected that, but this was proof.</p>
<p>We rolled back a single collector to pre-multidump code, and it immediately resumed running at full speed.  We then rolled back the remainder of the collectors, and took them out of the pool one at a time so they could catch up.</p>
<p>Somewhere during our investigation (my notes don&#8217;t show when) the intermittent failures from HBase had stopped.</p>
<p>By Saturday 1/26, we had caught up on the backlog.  By this time we had also discovered the code bug that wrote all files into a single directory, and patched it.  (The filesystem code no longer had access to the time, so all times were 00/00.)</p>
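<p>The directory bug can be illustrated with a minimal sketch.  This is not the Socorro filesystem code: the function name is hypothetical, and it assumes the directory is derived directly from a timestamp with one bucket per minute.</p>
<pre>
```python
from datetime import datetime


def radix_dir(timestamp):
    """Return a YY/MM/DD/HH/MM directory for a crash received at `timestamp`.

    Hypothetical helper for illustration only.
    """
    return timestamp.strftime("%y/%m/%d/%H/%M")


# With the full timestamp, crashes spread out across hour and minute buckets.
print(radix_dir(datetime(2013, 1, 22, 14, 35)))

# If the time of day is lost and only the date survives, every crash
# received that day lands in the same 00/00 directory.
print(radix_dir(datetime(2013, 1, 22)))
```
</pre>
<p>The first call yields 13/01/22/14/35 and the second yields 13/01/22/00/00, which is the pattern we saw on disk.</p>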
<p>We thought we were out of the woods, and scheduled a postmortem for 1/31.  However, it wasn&#8217;t going to be that easy.</p>
<h2>Act II: All this has happened before, and will happen again.</h2>
<p>1/28/2013<br />
We ran backfill for our aggregate processing, in order to recalculate totals with the additional processed crashes included.</p>
<p>Our working hypothesis at this stage was as follows.  An unknown event involving HBase connection outages (specifically on writes) had caused crashes to begin backing up, and then having a large number of crashes in a single directory made deletion slow.  We still wanted to know what had caused the HBase issue, but there were two factors that we knew about.  First, at the time of the problem, we had an outage on a single Region Server.  This shouldn&#8217;t cause a problem, but the timing was suspicious.  Secondly, we saw an increased number of errors from Thrift.  This has happened periodically and is short-term solved by restarting Thrift.  We believe it is partially caused by our code handling Thrift connections in a suboptimal way, something that is in the process of being solved by our intern.</p>
<p>Also on 1/28 we pushed a new build that incorporated the fix for the directory naming problem.  (see <a href="https://github.com/mozilla/socorro/commit/9a376d8c1b2c9bf40b3b612661a971a311a9738c">https://github.com/mozilla/socorro/commit/9a376d8c1b2c9bf40b3b612661a971a311a9738c</a>)</p>
<p>1/31<br />
A big day.  We had two things planned for this day: first, a postmortem for the multidump issue, and second, a PostgreSQL failover from our primary to secondary master so we could replace the disks with bigger ones.</p>
<p>Murphy, the god of outages, intermittent errors, and ironic timing, did not smile fondly upon us this day.</p>
<p>Crashes began backing up on collectors once again (see <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=836845">https://bugzilla.mozilla.org/show_bug.cgi?id=836845</a>).  We saw no HBase connection errors at this time, and hence realized at this point that we must have missed something.  We rolled back to a pre-multidump build on collectors, and they immediately began catching up.  We held off running backfill of aggregates at this time, because we wanted to go ahead with the failover.  Disk was getting desperately short and we had already had to delay the failover once due to external factors.</p>
<p>We postponed the postmortem, because clearly we didn&#8217;t have a handle on the root cause(s) at this time.</p>
<p>We then discovered the cause.  Multidump code was using remove() instead of quickDelete(), which had replaced remove() a number of years ago because remove() was so slow.  (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=836986">https://bugzilla.mozilla.org/show_bug.cgi?id=836986</a>)</p>
<p>We proceeded with the planned failover from master01 to master02, and replaced the disks in master01.  Our plan was to maintain master02 as primary, with master01 replicating from master02.  The failover went well, but the new disks for master01 turned out to be faulty, post-installation.  We were now in a position where we no longer had a hot standby.  Our disk vendor did not meet their SLA for replacement.</p>
<p>2/1<br />
We ran backfill of aggregate reports, and from an end-user perspective everything was back to normal.</p>
<p>2/2<br />
Replaced disks on master01 (again).  These too had some errors but we managed to solve that.</p>
<p>Later, we pushed a new build that solved the quickDelete() issue.  We were officially out of the woods.</p>
<h2>Epilogue</h2>
<p>Things that went well:</p>
<ul>
<li> The team, consisting of engineers, WebOps, and DCOps worked extremely well and constructively together.</li>
<li> As a result of looking closely at our filesystem/HBase interactions, we tuned disk performance and ordered some SSDs which have effectively doubled performance since installation.  Thrift appears to be the next bottleneck in the system.</li>
</ul>
<p>Things we could have done better:</p>
<ul>
<li> Release management: we broke our RM process and that led us to accidentally ship the code prematurely.</li>
<li> Not shipped broken code, you know, the usual. Although I do have to say this was more subtly broken than average.  The preventative measure here would have been better in-code documentation in the old code (&#8220;Using quickDelete here instead of remove because remove performs badly.&#8221;)  We did go through code review, unit and integration testing, and manual QA, as per usual, but given that this code only performed poorly once other parts of the system showed degraded performance, this was hard to catch.</li>
<li> Relying on end-user observation to discover how the system was broken.  Monitoring can solve this.</li>
</ul>
<p>Things we will change:</p>
<ul>
<li> Improvements to monitoring.  We will now monitor the number of backed up crashes. It&#8217;s not a root cause monitor but an indicator of trouble somewhere in the system.  We have a few other monitors like this, and they are good catch-alls for things we haven&#8217;t thought to monitor yet.  We are also working on better monitoring of Thrift errors using thresholds.  Right now we consider a 1% error rate on Thrift connections normal, and support limited retries with exponential backoff. We want to alert if the percentage increases.  We plan on doing more of these thresholded monitors by writing these errors to statsd, and pointing nagios at the rolling aggregates.  This will also work for monitoring degraded performance over time.</li>
<li> Improvements to our test and release cycles.  A few times now, we have gotten a feature to staging and then decided it&#8217;s not ready to ship. Backing it out involves git wrangling and introduces a level of human error.  Our intention is to build out a set of &#8220;try&#8221; environments, that is, parallel staging environments that run different branches from the repo.</li>
</ul>
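<p>For readers unfamiliar with the retry pattern mentioned in the monitoring item above, here is a minimal sketch of limited retries with exponentially growing waits.  It is not our Thrift connection code: the function name, the retry count, the base delay, and the choice of IOError as the failure type are all assumptions made for illustration.</p>
<pre>
```python
import time


def with_retries(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on IOError with exponentially growing waits.

    Hypothetical helper: all names and defaults are illustrative.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_retries:
                # Out of retries: re-raise so the caller can count an error.
                raise
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * 2 ** attempt)
```
</pre>
<p>An error that survives all the retries is the kind of event one would count and feed into a thresholded monitor.</p>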
<p>Confession:<br />
I like disasters.  They always lead to a better process and better code.  Also, when the team works well together, it&#8217;s a positive trust-building and team-building experience.  Much better than trust falls in my experience.</p>
<p>A final note<br />
All of the troubleshooting was done with a remote team, working from various locations across North America, communicating via IRC and Vidyo.  It works.</p>
<p>Thanks to everyone involved in troubleshooting this issue: Jake Maul, Selena Deckelmann, Rob Helmer, Chris Lonnen, Dumitru Gherman, and Lars Lohn.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2013/02/dumps-disks-and-disasters-a-detective-story/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A billion crashes: 2012 in review</title>
		<link>http://www.laurathomson.com/2013/02/a-billion-crashes-2012-in-review/</link>
		<comments>http://www.laurathomson.com/2013/02/a-billion-crashes-2012-in-review/#comments</comments>
		<pubDate>Sat, 09 Feb 2013 22:42:30 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Firefox]]></category>

		<category><![CDATA[Mozilla]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[big data]]></category>

		<category><![CDATA[socorro]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=144</guid>
		<description><![CDATA[In 2012, on the Socorro project, we:

Collected more than one billion crashes: more than 150TB of raw data, amounting to around half a petabyte stored. (Not all at once: we now have a data expiration policy.)
Shipped 54 releases
Resolved 1010 bugs.  Approximately 10% of these were the Django rewrite, and 40% were UI bugs.  Many of [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">In 2012, on the Socorro project, we:</p>
<ul>
<li>Collected more than one billion crashes: more than 150TB of raw data, amounting to around half a petabyte stored. (Not all at once: we now have a data expiration policy.)</li>
<li>Shipped 54 releases</li>
<li>Resolved 1010 bugs.  Approximately 10% of these were the Django rewrite, and 40% were UI bugs.  Many of the others were backend changes to support the front end work (new API calls, stored procedures, and so on).</li>
</ul>
<p class="MsoNormal">New features include:</p>
<ul>
<li>Reports available in build time as well as clock time (graphs, crashes/user, topcrashers)</li>
<li>Rapid beta support</li>
<li>Multiple dump support for plugin crashes</li>
<li>New signature summary report</li>
<li>Per OS top crashers</li>
<li>Addition of memory usage information, Android hardware information, and other new metadata</li>
<li>Timezone support</li>
<li>Correlation reports for Java</li>
<li>Better admin navigation</li>
<li>New crash trends report</li>
<li>Added exploitability analysis to processing and exposed this in the UI (for authorized users)</li>
<li>Support for ESR channel and products</li>
<li>Support for WebRT</li>
<li>Support for WebappRTMobile</li>
<li>Support for B2G</li>
<li>Explosiveness reporting (back end)</li>
<li>More than 50 UI tweaks for better UX</li>
</ul>
<p class="MsoNormal">Non-user facing work included:</p>
<ul>
<li>Automated most parts of our release process</li>
<li>All data access moved into a unified REST API</li>
<li>Completely rewrote front end in Python/Django (from old KohanaPHP version with no upgrade path)</li>
<li>Implemented a unified configuration management solution</li>
<li>Implemented unified cron job management</li>
<li>Implemented auto-recovery in connections for resilience</li>
<li>Added statsd data collection</li>
<li>Implemented fact tables for cleaner data reporting</li>
<li>Added rules-based transforms to support greater flexibility in adding new products</li>
<li>Refactored back end into pluggable fetch-transform-save architecture</li>
<li>Automated data export to stage and development environments</li>
<li>Created fakedata sandbox for development for both Mozilla employees and outside contributors</li>
<li>Implemented automated reprocessing of elfhack broken crashes</li>
<li>Automated tests run on all pull requests</li>
<li>Added views and stored procedures for metrics analysts</li>
<li>Opened read-only access to PostgreSQL and HBase (via Pig) for internal users</li>
</ul>
<p>I believe we run one of the biggest software error collection services in the  world.  Our code is used by open source users across the internet, games, gaming (casino),  music, and audio industries.</p>
<p>As well as  working on Socorro, the Webtools team worked on more than 30 other projects, fixed  countless bugs, shipped many, many releases, and supported critical  organizational goals such as stub installer and Firefox Health Report.   We contributed to Gaia, too.</p>
<p>We could not have done any of this without help from IT (especially WebOps, SRE, and DB Ops) and WebQA.  A huge thank you to those teams. &lt;3</p>
<p>I&#8217;ll write a part two of this blog post to talk more about our work on projects other than crash reporting, but I figured collecting a billion crashes deserved its own blog post.</p>
<p><em>Edited to add: </em>I learned from Corey Shields, our Systems Manager, that we had <strong>100% uptime</strong> in Q4.  (He&#8217;s still working on statistics for the whole of 2012.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2013/02/a-billion-crashes-2012-in-review/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Postmortem: an unfortunate name for a useful process</title>
		<link>http://www.laurathomson.com/2012/11/postmortem-an-unfortunate-name-for-a-useful-process/</link>
		<comments>http://www.laurathomson.com/2012/11/postmortem-an-unfortunate-name-for-a-useful-process/#comments</comments>
		<pubDate>Tue, 20 Nov 2012 16:11:47 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[DevOps]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=143</guid>
		<description><![CDATA[Let’s face it, unless your development process or self-control has utterly failed, nobody died.
The purpose of a postmortem is to meet and review a project, release, or other situation with the goal of reflection and development.  To put it simply, you want to work out what went well and how to repeat it, and [...]]]></description>
			<content:encoded><![CDATA[<p>Let’s face it, unless your development process or self-control has utterly failed, nobody died.</p>
<p>The purpose of a postmortem is to meet and review a project, release, or other situation with the goal of reflection and development.  To put it simply, you want to work out what went well and how to repeat it, and what went badly, and how to prevent that from happening again, or to mitigate it if it does.  You can think of this process as being similar to a performance review for a team or project in a particular situation.</p>
<p>A postmortem shouldn’t devolve into finger pointing or shouting, or leave anyone walking away feeling miserable or full of rage.  It’s not a kangaroo court, held purely to assign blame.  Neither should it resemble the traditional Festivus “Airing of the Grievances”.  People shouldn’t dread coming to postmortems, or they will avoid doing so.</p>
<p>Don’t just run postmortems for projects that have failed or otherwise been problematic.  Run them for the successful projects as well.  It’s important to capture what the team did that went well and made the project succeed.  Make a postmortem part of your normal process: the bookend to a kickoff meeting.<br />
Here, then, are some tips on running a constructive postmortem.</p>
<h2>Timing</h2>
<p>The ideal postmortem happens soon enough after the event that everybody remembers what happened.  You need to balance this with giving people enough time to reflect and, if things have gone badly, to calm down.  A few days to a week afterwards is often about right.</p>
<h2>Leadership</h2>
<p>Typically, a postmortem will be led by a project or release manager, or lead developer or sysadmin.  If you’re reading this, this may well be you.  </p>
<p>If you have strong emotions or opinions about what’s happened, I’d recommend getting them out beforehand.  You can do this by working out stress in whatever way appeals to you: writing a long angry email and then deleting it, going for a run, talking to a friend or spouse, or spending an evening gunning down zombies.  The main thing is to have vented whatever steam you have built up before arriving at the meeting, or writing the agenda.</p>
<h2>Agenda</h2>
<p>Have a detailed agenda.  I’d suggest:</p>
<ul>
<li>Set the scope of what you’re talking about and stick to it.   If the topic is Release X, say that upfront.  Don’t stray off into general problems that exist with, for example, relationships between the development team and ops or marketing.</li>
<li>Write down some facts about the topic.  This might include the timeline, who was responsible for what, and links to any external documents (project plan, budget, or bug reports, for example).</li>
<li>What went well?  Even in the worst situation, something probably went well.   Did the team work well together?  Did all your boxes stay up under load?  If they crashed, did your monitoring system tell you so?  Seed this with a couple of items beforehand and add to it during the postmortem.</li>
<li>What could have gone better: Remember, avoid finger pointing.  Chances are, if someone screwed up, they know it.  If they are oblivious to personal poor performance, bringing it up in a group meeting won’t help, and you’ll need to address this via other avenues.  Focus on tasks and items that failed or could have gone better, not people who could have done better.  Again, seed this with a couple of items.</li>
<li>Suggested improvements for next time: This is the best and most constructive part of a postmortem.   Given the facts that have just been discussed, this can be a brainstorming session on how to work better in future.</li>
<li>Actions: Improvements will go nowhere without actions.  I recommend each action has an owner, a deadline, and a deliverable, even if it’s just emailing the group with the result of your research.</li>
</ul>
<h2>During the postmortem</h2>
<p>Begin by making sure everyone understands the parameters of the meeting.  Your job as leader is not to do all the talking or share out the blame, but to go over the facts, and make sure people stay on track.</p>
<p>If the discussion gets too heated or off track, go to your happy place, put down the Nerf gun, and get people back on the agenda.  Sometimes you can achieve this by asking people to take a long, heated, or irrelevant discussion offline, or simply by saying “Let’s move on.”</p>
<p>You might be surprised at how creative and constructive people can be, especially in the face of failure.  I think the best, most constructive postmortem I have been involved in was the one after my biggest disaster.  Engineers and sysadmins hate to fail.  Focus on problem solving for the next iteration.</p>
<h2>Afterwards</h2>
<p>These discussions can be draining.  I tend to coast on adrenalin after a release or crisis, and only hit the post-adrenalin exhaustion after the postmortem.  It’s not a bad thing to schedule a postmortem just before lunch, or at the end of the day, to give people a chance to relax and refuel afterwards.</p>
<p><strong>Author&#8217;s Note:</strong> I originally drafted this post as part of an idea for a book last year. I still hope to write that book at some point, but I thought it would make a good blog post in the meantime.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2012/11/postmortem-an-unfortunate-name-for-a-useful-process/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Ship it: a big week in Webtools</title>
		<link>http://www.laurathomson.com/2012/10/ship-it-a-big-week-in-webtools/</link>
		<comments>http://www.laurathomson.com/2012/10/ship-it-a-big-week-in-webtools/#comments</comments>
		<pubDate>Mon, 15 Oct 2012 16:34:05 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[DevOps]]></category>

		<category><![CDATA[Firefox]]></category>

		<category><![CDATA[Mozilla]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=142</guid>
		<description><![CDATA[They say multi-tasking is hard.  They also say DevOps is hard.  Let me tell you about a bunch of engineers who think &#8220;hard&#8221; means &#8220;a nice challenge&#8221;.
Last week was an amazing one for the Webtools family.  We pushed three releases to three major products. People inside Mozilla don&#8217;t always know exactly what [...]]]></description>
			<content:encoded><![CDATA[<p>They say multi-tasking is hard.  They also say DevOps is hard.  Let me tell you about a bunch of engineers who think &#8220;hard&#8221; means &#8220;a nice challenge&#8221;.</p>
<p>Last week was an amazing one for the Webtools family.  We pushed three releases to three major products. People inside Mozilla don&#8217;t always know exactly what types of things the Webtools team works on, so allow me to tell you about them.</p>
<h2>1.  Bouncer</h2>
<p>Bouncer is Mozilla&#8217;s download redirector.  When you click on one of those nifty &#8220;Download Firefox&#8221; buttons on mozilla.org, that takes you to Bouncer, which redirects you to the correct CDN or mirror where you can actually get the product that you want.  Bouncer is also one of the oldest webapps at Mozilla, having been originally authored by my boss, Mike Morgan, many years ago.</p>
<p>Bouncer hadn&#8217;t had code changes in a very long time, and when we realized we needed to change it to support the new stub installer for Firefox, we had to spin up new development and staging environments.  In addition, IT built out a new production cluster up to the new standards that have come into use since the last time it was deployed.</p>
<p>The code changes for stub installer are mainly around being intelligent enough to understand that some products, like the stub, can only be served from an SSL CDN or mirror.  We don&#8217;t want to serve all products over SSL because of cost.</p>
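<p>To make that rule concrete, here is a minimal Python sketch of the selection logic.  It is not Bouncer&#8217;s actual code: the <code>Mirror</code> class, the function names, and the example URLs in the comments are hypothetical, and it assumes a mirror counts as SSL-capable when its base URL starts with <code>https://</code>.</p>
<pre>
```python
import random
from dataclasses import dataclass


# Hypothetical data model: none of these names come from Bouncer's real code.
@dataclass(frozen=True)
class Mirror:
    base_url: str

    @property
    def supports_ssl(self):
        # Assumption: an https:// base URL marks an SSL-capable mirror or CDN.
        return self.base_url.startswith("https://")


def pick_mirror(mirrors, ssl_only, rng=random):
    """Choose a mirror at random, using only SSL mirrors for SSL-only products."""
    if ssl_only:
        candidates = [m for m in mirrors if m.supports_ssl]
    else:
        candidates = list(mirrors)
    if not candidates:
        raise LookupError("no suitable mirror available")
    return rng.choice(candidates)


def redirect_url(product_path, mirrors, ssl_only, rng=random):
    """Build the URL a download request would be redirected to."""
    mirror = pick_mirror(mirrors, ssl_only, rng)
    return mirror.base_url.rstrip("/") + "/" + product_path.lstrip("/")
```
</pre>
<p>With one plain HTTP mirror and one HTTPS CDN in the list, an SSL-only product such as the stub would always be sent to the HTTPS one, while other products could go to either.</p>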
<p>On Wednesday we shipped the new infrastructure and the code changes.  You can read more about it in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=800042">bug 800042</a>.</p>
<p>Thanks to Brandon Savage (Webtools), Anthony Ricaud (Websites), Fred Wenzel (Dev Ecosystem), Jake Maul (WebOps), Chris Turra (WebOps), Corey Shields (Systems), Stephen Donner (Web QA), Matt Brandt (Web QA), Raymond Etnoram (Web QA), and Ben Hearsum (RelEng) for making this possible.</p>
<h2>2. Air Mozilla</h2>
<p>As you probably know, <a href="https://air.mozilla.org/">Air Mozilla</a> is the website that broadcasts Mozilla meetings, brownbags and presentations.  On Friday we shipped a brand new version, built on top of Django.  The old version was hosted in WordPress, and was really a simple way to present content.  The new version has full calendaring integration, LDAP and BrowserID support, and better ways to find old presentations.</p>
<p>Thanks to Tim Mickel (Webtools Intern), Peter Bengtsson (Webtools), Richard Milewski (Air Mozilla), Zandr Milewski (SpecOps), Dan Maher (WebOps), Chris Turra (WebOps), Brandon Burton (WebOps), Jason Crowe (WebOps), and Corey Shields (Systems).</p>
<p>You can see details of the release in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=799745">bug 799745</a>.</p>
<h2>3.  Socorro</h2>
<p>We also shipped a regular Wednesday <a href="http://crash-stats.mozilla.com/">Socorro</a> release.  Socorro is the crash reporting service for Mozilla products, including Firefox, Firefox for Mobile (&#8220;Fennec&#8221;), Firefox OS (&#8220;Boot to Gecko&#8221;), and Thunderbird.</p>
<p>In this release we shipped <a href="https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced;target_milestone=22;product=Socorro;list_id=4632641">five bug fixes and enhancements</a>.  This number was a bit lower than usual, as most people are crunching to complete the front end rewrite (more on that in a moment).</p>
<p>You can read more about the release in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=800140">bug 800140</a>.</p>
<p>Thanks to the whole team for working on this: Adrian Gaudebert, Brandon Savage, Chris Lonnen, Lars Lohn, Peter Bengtsson, Robert Helmer, Schalk Neethling, Selena Deckelmann, and of course Matt Brandt (Web QA) and Brandon Burton (IT).</p>
<h3>An aside: Socorro on Django</h3>
<p>The new Django-based version of the Socorro webapp is also very close to feature parity with the old PHP webapp.  We needed to rewrite this code, because the version of the framework used in the old version is four years out of date, and there was no upgrade path for it - newer versions break backwards compatibility.  Since we had to rewrite it anyway, we have moved to use the same framework as the majority of other webapps at Mozilla.  This allows for easier contributions by other Mozillians.  We should reach parity in the next couple of days, and plan to ship the new code in parallel with the old, subject to secreview timing.</p>
<h2>Finally</h2>
<p>I am incredibly proud of the impact, quality, and sheer quantity of our work over the last weeks.  These projects will enable many good things throughout Mozilla.  Good work, people, stand tall.</p>
<p>Webtools is a small team, and we could not do what we do without incredible support from IT and QA.  I like to think of this as the Webtools family: <strong>we are all one team</strong>; we all work together to get the job done come hell, high water, or zombies in the data center.</p>
<p>Just remember, there&#8217;s a reason the Webtools mascot is Ship It Squirrel.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2012/10/ship-it-a-big-week-in-webtools/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A visit to Hacker School</title>
		<link>http://www.laurathomson.com/2012/08/a-visit-to-hacker-school/</link>
		<comments>http://www.laurathomson.com/2012/08/a-visit-to-hacker-school/#comments</comments>
		<pubDate>Fri, 17 Aug 2012 15:35:04 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=141</guid>
		<description><![CDATA[In July, I was privileged to visit Hacker School as part of their Open Source week.  Hacker School is an amazing place, where hackers from all walks of life work together to level up as programmers.  It reminded me of all the good things about grad school.  I really loved the atmosphere.
During [...]]]></description>
			<content:encoded><![CDATA[<p>In July, I was privileged to visit <a href="https://www.hackerschool.com/">Hacker School</a> as part of their Open Source week.  Hacker School is an amazing place, where hackers from all walks of life work together to level up as programmers.  It reminded me of all the good things about grad school.  I really loved the atmosphere.</p>
<p>During Open Source Week, students&#8217; goal is to submit their first patch to an existing Open Source project.  A wide variety of projects were chosen by the students.</p>
<p>I gave a talk on <a href="http://www.slideshare.net/lauraxthomson/hacker-school-gettingstartedinopensource">getting started in Open Source</a>, and then two of my Mozilla colleagues and I helped some students get started on some Mozilla projects.  At the end of the week, the organizers gathered together a list of what the students had contributed on our projects.  I&#8217;d like to share those contributions with you.  They include patches, pull requests, and filed bugs.</p>
<ul>
<li><a href="https://github.com/mozilla/socorro-crashstats/pull/59">https://github.com/mozilla/socorro-crashstats/pull/59</a></li>
<li><a href="https://github.com/mozilla/socorro-crashstats/issues/50">https://github.com/mozilla/socorro-crashstats/issues/50</a></li>
<li><a href="https://github.com/jns2/moz-x-tags/tree/listview-branch/listview">https://github.com/jns2/moz-x-tags/tree/listview-branch/listview</a> with sample code:<br />
<a href="http://ec2-23-22-34-255.compute-1.amazonaws.com/x-tag/newelements/listview/test.html">http://ec2-23-22-34-255.compute-1.amazonaws.com/x-tag/newelements/listview/test.html</a></li>
<li><a href="https://github.com/csuwldcat/moz-x-tags/pull/8/">https://github.com/csuwldcat/moz-x-tags/pull/8/</a></li>
<li><a href="https://github.com/csuwldcat/moz-x-tags/pull/10">https://github.com/csuwldcat/moz-x-tags/pull/10</a></li>
<li><a href="https://github.com/mozilla/socorro-crashstats/issues/3">https://github.com/mozilla/socorro-crashstats/issues/3</a></li>
<li><a href="https://github.com/mozilla/socorro-crashstats/issues/51">https://github.com/mozilla/socorro-crashstats/issues/51</a></li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=772228">https://bugzilla.mozilla.org/show_bug.cgi?id=772228</a> (patch)</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=772628">https://bugzilla.mozilla.org/show_bug.cgi?id=772628</a> (bug)</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=772655">https://bugzilla.mozilla.org/show_bug.cgi?id=772655</a> (bug)</li>
<li><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=648681">https://bugzilla.mozilla.org/show_bug.cgi?id=648681</a> (bug)</li>
<li><a href="https://github.com/csuwldcat/moz-x-tags/pull/4/">https://github.com/csuwldcat/moz-x-tags/pull/4/</a></li>
<li><a href="https://github.com/csuwldcat/moz-x-tags/pull/7/">https://github.com/csuwldcat/moz-x-tags/pull/7/</a></li>
<li><a href="https://github.com/csuwldcat/moz-x-tags/pull/1">https://github.com/csuwldcat/moz-x-tags/pull/1</a></li>
</ul>
<p>That&#8217;s a lot of contributions, right there.</p>
<h2>Observations</h2>
<p>Part of the reason the school is so successful, in my view, is the encouraging and non-judgemental atmosphere.  They have two rules about communication:</p>
<ol>
<li>No &#8220;Well-actually&#8221;.  This is that thing where we, as geeks, feel the need to correct one another to the nth degree.</li>
<li>No feigned surprise.  That&#8217;s saying things like &#8220;I can&#8217;t believe you&#8217;ve never heard of Richard Stallman!&#8221;</li>
</ol>
<p>The skill range of students varies from self-taught in the last six months, to several years&#8217; experience, to PhD students on summer vacation.  But everyone works side by side, productively and enthusiastically.</p>
<h2>Calls to action</h2>
<p>I learned a lot from my day at Hacker School, and it inspired me to issue these calls to action:</p>
<ol>
<li>Coders: If you&#8217;re thinking about applying to Hacker School, do it.  It&#8217;s a truly amazing place.  <a href="https://www.hackerschool.com/blog/4-fall-2012-applications-open">Applications are open</a> for the fall batch.</li>
<li>Hackers: Nominate people (including yourself!) to be a <a href="https://docs.google.com/spreadsheet/viewform?formkey=dG0tbDdSc2t6U3RWdWxfWnowZ0MyTFE6MQ">Hacker School resident,</a> working alongside students for a couple of weeks.</li>
<li>Tech companies: consider <a href="mailto:sponsors@hackerschool.com">sponsoring</a> the next batch of students.</li>
<li>Mozillians: we should sponsor, run, and be involved with more hackathons on Mozilla projects.  We should host Hackdays where we get brand new contributors involved with our projects.  I propose we do this at existing Open Source conferences, get-togethers, and MozCamps, and at informal hackathons wherever the opportunity presents itself.</li>
</ol>
<h2>Finally</h2>
<p>I&#8217;d like to thank Nick Bergson-Shilcock, David Albert, Sonali Sridhar, Thomas Ballinger, and Alan O&#8217;Donnell for running Hacker School and hosting us, and <a href="http://etsy.com/">Etsy</a>, <a href="http://37signals.com/">37signals</a>, and <a href="https://www.yammer.com/">Yammer</a> for their sponsorship of the school.  And of course, I&#8217;d like to thank the students for being awesome, and for their contributions!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2012/08/a-visit-to-hacker-school/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The dark craft of engineering management</title>
		<link>http://www.laurathomson.com/2012/08/the-dark-craft-of-engineering-management/</link>
		<comments>http://www.laurathomson.com/2012/08/the-dark-craft-of-engineering-management/#comments</comments>
		<pubDate>Tue, 07 Aug 2012 18:35:58 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=140</guid>
		<description><![CDATA[VM Brasseur and I had a chat about what it means to be an engineering manager, as a follow up to her excellent talk on the subject at Open Source Bridge.  I promised her I would put my (lengthy, rambling) thoughts into an essay of sorts, so here it is.
&#8220;Management is the art of getting [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://anonymoushash.vmbrasseur.com/" target="_blank">VM Brasseur</a> and I had a chat about what it means to be an engineering manager, as a follow up to her excellent <a href="http://opensourcebridge.org/sessions/806">talk</a> on the subject at <a href="http://opensourcebridge.org/" target="_blank">Open Source Bridge</a>.  I promised her I would put my (lengthy, rambling) thoughts into an essay of sorts, so here it is.</p>
<p>&#8220;Management is the art of getting things done through people.&#8221;<br />
This is a nice pithy quote, but I prefer my version:</p>
<p>&#8220;Management is the craft of enabling people to get things done.&#8221;<br />
Yes, it&#8217;s less grammatical.  Sue me.</p>
<h2>Why is management a craft?</h2>
<p>It&#8217;s a craft for the same reasons engineering is a craft.  You can read all the books you want on something but crafts are learned by getting your hands in it and getting them dirty.  Crafts have rough edges, and shortcuts, and rules of thumb, and things that are held together with duct tape.  The product of craft is something useful and pleasing.</p>
<p>(Art to me is a good deal purer: more about aesthetics and making a statement than it is about making a thing.  Craft suits my analogy much better.)</p>
<h2>Why enabling people to get things done?</h2>
<p>Engineers, in general, know their jobs, to a greater or lesser extent.  My job, as an engineering manager, is to make their jobs easier.</p>
<p>What do engineers value?  This is of course going to be a sweeping generalization, but I&#8217;m going to resort to quoting <a href="http://www.danpink.com/drive">Dan Pink</a>: Mastery, autonomy, and purpose.</p>
<h3>Mastery</h3>
<p>Mastery has all kinds of implications.  As a manager, my job is to enable engineers to achieve and maintain mastery.  This means helping them to be good at and get better at their jobs.  Enabling them to ship stuff they are passionate about.  To learn the skills they need to do that.  To work alongside others who they can teach and learn from.  To have the right tools to do their jobs.</p>
<h3>Autonomy</h3>
<p>Autonomy is the key to scaling yourself as an engineering manager.  As an engineer, I hate nothing more than being micromanaged.  As an engineering manager, my job is to communicate the goals and where we want to get to, and work with you to determine how we&#8217;re going to get there.  Then I&#8217;m going to leave you the hell alone to get stuff done.</p>
<p>The two most important things I do as a manager are in this section.<br />
The first is to act as a BS umbrella for my people.  This means going to meetings, fighting my way through the uncertainty, and coming up with clear goals for the team.  I am the wall that stands between bureaucracy and engineers.  This is also the most stressful part of my job.</p>
<p>The second is in 1:1s.  While I talk to my remote, distributed team all day every day in IRC as needed, this is the sacrosanct time each week where we get to talk.  There are three questions that make up the core of the 1:1:</p>
<ul>
<li>How is everything going?  This is an opportunity for any venting, and lets the engineer set the direction of the conversation.</li>
<li>What are you going to do next?  Here, as a manager, I can help clarify priorities, and suggest next steps if the person is blocked.</li>
<li>What do you need?  This can be anything from political wrangling to hardware.  I will do my best to get them what they need.</li>
</ul>
<p>In Vicky&#8217;s talk she talked about getting all your ducks in a row.  In my view, the advantage of empowering your engineers with autonomy is that you get self-organizing ducks.</p>
<p>The key thing to remember with autonomy is this: Hire people you can trust, and then trust them to do their best.</p>
<h3>Purpose</h3>
<p>This is key to being a good manager, because you&#8217;re providing the purpose.  You help engineers work out what the goals should be, prioritize them, clarify requirements, and make sure everybody has a clear thing they are working towards.  Clarity of purpose is a powerful motivator.  Dealing with uncertainty is yet another roadblock you remove from the path of your team.</p>
<h2>Why is management fun?  Why should I become a manager?</h2>
<p>Don&#8217;t become an engineering manager because you want power - that&#8217;s the worst possible reason.  A manager is a servant to their team.  Become a manager if you want to serve.  Become a manager if you want to work on many things at once.  Becoming a manager helps you become a fulcrum for the engineering lever, and that&#8217;s a remarkably awesome place to be.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2012/08/the-dark-craft-of-engineering-management/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Fifteen Minute Maker&#8217;s Schedule</title>
		<link>http://www.laurathomson.com/2012/02/the-fifteen-minute-makers-schedule/</link>
		<comments>http://www.laurathomson.com/2012/02/the-fifteen-minute-makers-schedule/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 15:39:58 +0000</pubDate>
		<dc:creator>laura</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[creativity]]></category>

		<category><![CDATA[makers schedule]]></category>

		<category><![CDATA[productivity]]></category>

		<guid isPermaLink="false">http://www.laurathomson.com/?p=139</guid>
		<description><![CDATA[If you haven&#8217;t read Paul Graham&#8217;s &#8220;Maker&#8217;s Schedule, Manager&#8217;s Schedule&#8221;, I recommend doing that before you read this or it won&#8217;t make any sense.
The Maker&#8217;s Schedule makes sense to me in a work setting, but how about for side projects, things you&#8217;re trying to do after hours?
I started fomenting this blog post a while ago.  [...]]]></description>
			<content:encoded><![CDATA[<p>If you haven&#8217;t read Paul Graham&#8217;s <a href='http://www.paulgraham.com/makersschedule.html'>&#8220;Maker&#8217;s Schedule, Manager&#8217;s Schedule&#8221;</a>, I recommend doing that before you read this or it won&#8217;t make any sense.</p>
<p>The Maker&#8217;s Schedule makes sense to me in a work setting, but how about for side projects, things you&#8217;re trying to do after hours?</p>
<p>I started fomenting this blog post a while ago.  A very good engineer I know said something to me which I must admit rubbed me up the wrong way.  He said something along the lines of, &#8220;See, you like to write for fun, and I like to code for fun.&#8221;  Actually, I really like to code for fun too, but it&#8217;s much easier to write than code in fifteen minute increments, which is often all I have available to me on any given day.</p>
<p>Let&#8217;s be clear about one thing: I don&#8217;t think of myself as a consumer.  I barely watch TV, only when my two year old insists.  I can&#8217;t tell you the last time I had time to watch a movie, and I haven&#8217;t played a non-casual video game since college.  I do read books, but books, too, lend themselves well to being read in fifteen minute increments.</p>
<p>I want to be a producer: someone who makes things.  Unfortunately my life is not compatible with these long chunks of time that Paul Graham talks about.  I think any parent of small children would say the same.  When you&#8217;re not at work you are on an interrupt-driven schedule: not controlled by management, but controlled by the whims of the little people who are the center of your universe.</p>
<p>This is how I work:</p>
<p>When I&#8217;m doing one of the mindless things that consume some of my non-work time - showering, driving, grocery shopping, cleaning the house, laundry, barn chores - I&#8217;m planning.  That might mean cranking away on a work problem, planning a blog post or a plot for a novel that I want to write, thinking of what projects to build for our next PHP book, mapping out a conference talk, or planning code that I want to work on.  This is brain priming time.  When I get fifteen minutes to myself I can act on those things.</p>
<p>In other words, planning is parallelizable.  Doing is not.  Since I have so little uninterrupted time to <em>do</em>, I plan it carefully, and use it as much as I can.</p>
<p>When I get the occasional hour or two - nap time on a weekend (and to hell with the laundry), my husband taking our child out somewhere, or those blessed, perfect hours on a transcontinental flight - I can get so much done it makes my head hurt.  But those are the exceptions, not the norm.  I expect that to be the case until our child is a good deal older.</p>
<p>I had to train myself to do <em>anything</em> in fifteen minutes.  It didn&#8217;t come naturally, but I heard the advice over and over again, particularly from women writers, some of them New York Times bestsellers.  One has five children and wrote six books last year, so it can be done.  The coding is coming.  Training myself to code in fifteen minute increments has taken a lot longer than training myself to write in the same time.</p>
<p>The trick is to do that planning.  Train your mind to immerse itself in the problem as soon as you get into the zone where your brain is being underutilized.  This kind of immersion thinking has been useful to me for years for problem solving, and I just had to retrain myself to use it for planning.</p>
<p>In summary: don&#8217;t despair of Graham&#8217;s Maker&#8217;s Schedule if you just don&#8217;t have those big chunks of time outside of work.  You can still be a maker.  You can still be a creative person.  You just have to practice.  Remember: the things that count are the things we do every day, even if it&#8217;s only for fifteen minutes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.laurathomson.com/2012/02/the-fifteen-minute-makers-schedule/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
