Archive for May 2014

Data center consolidation (or how to save $900k a year)

tl;dr: The move of Mozilla’s Release Engineering infrastructure from the SCL1 datacenter to SCL3 will begin having an impact May 19, and continue for the following six weeks. No tree closures are anticipated.

On Monday, we will begin the major part of the work of moving out of the SCL1 datacenter. Some of our pandas have already been relocated, as a test run. The rest of our machines will now move on a series of “move trains”. The majority of non-pandas are in four big trains, which will move each Monday for the next few weeks.

Impact

The implications for engineering include:

  • Build farm capacity will be degraded at times (especially on Mondays and Tuesdays).
  • No tree closures are anticipated.
  • Datacenter Ops, Release Operations, and Release Engineering will be busy during the moves.

Background

We are moving out of SCL1 and consolidating our infrastructure into SCL3. This will save $900K/year and speed up problem resolution, since we have staff located in SCL3.

The end-user-visible part of the process will start on Monday, May 19, and continue for five to six weeks. We’ll make a general announcement in the Monday project meeting on May 12, with a follow-up posting to dev-platform.

Data Center Operations, Release Operations, and Release Engineering have been planning for the move for several months. There are approximately 20 racks of equipment to move. The plan is to move key machines in batches (“trains”), spread across functional areas. That is, we’ll degrade all platforms slightly, rather than take a single platform offline. We don’t believe there will be any need to close the trees with this approach, although sheriffs may not merge as often.

The key systems will be moved each Monday morning, and should be back online no later than Wednesday noon, worst case. We are leaving slack in each week’s schedule to ensure some uninterrupted dev time each week, and to allow for minimal impact to release schedules. There is also some slack in the overall schedule for contingencies. If a critical event such as a chemspill occurs, we will be able to cancel that week’s move on short notice.
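
For anyone who wants to plan around the degraded windows, here is a rough sketch of the weekly cadence described above. It assumes the four big trains run on consecutive Mondays starting May 19, 2014, with no week skipped, and that each train is back online by Wednesday noon at the latest. The function name `train_windows` and the constants are made up for illustration; the real dates may shift if a move is cancelled.

```python
from datetime import date, datetime, time, timedelta

FIRST_MOVE = date(2014, 5, 19)  # first move Monday, as announced in the post
BIG_TRAINS = 4                  # "four big trains" (assumed to run on consecutive weeks)


def train_windows(first_move=FIRST_MOVE, trains=BIG_TRAINS):
    """Return (move_day, worst_case_back_online) pairs, one per weekly train."""
    windows = []
    for week in range(trains):
        move_day = first_move + timedelta(weeks=week)
        # Worst case per the post: back online by Wednesday noon of the same week.
        back_online = datetime.combine(move_day + timedelta(days=2), time(12, 0))
        windows.append((move_day, back_online))
    return windows


for move_day, back_online in train_windows():
    print(f"{move_day:%a %b %d}: capacity degraded until "
          f"{back_online:%a %b %d %H:%M} at the latest")
```

Under those assumptions, Thursday and Friday of each week stay clear, which is the slack the schedule is meant to leave for uninterrupted dev time.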

My thanks to IT for conceiving this grand plan, and all y’all in DCOps, Release Ops, and Release Engineering for execution.

If you have questions about any of this, please let me know!