Monday, August 31, 2009

Wasted Hours, Hidden Costs, and Perverse Incentives

I didn't get to save the company $15 million today; and I'm a little upset about it.

So for the last five months I've been working on a major project, one that's been taking up about a third of my work hours.

We've got a building, which isn't supposed to be used as a datacenter, but which has sort of organically grown into one. Right now there are over 1000 lab and dev servers in there, and even a few production servers (which is a BIG no-no).

We've decided not to renew the lease on this building, and now we've got to have everything out of there before the end of first quarter 2010 or we face a $1 million a month penalty.

If you've ever moved a datacenter (and unfortunately I have) you know that seven months is basically NO time to do the job; especially when those months are split across two budgeting years.

Well, let me tell you a very long and very irritating story... or more accurately, let me bitch for a few minutes.

Ok, before we go on, this middle section here is going to be large enterprise IT geekish, with a lot of rather boring numbers, unless you're in the business. Y'all may just want to skip down a bit.

On the other hand if you ARE in the biz; or if perhaps you are an accountant, or just want to see how ridiculous large corporations can be; this may be interesting to you.

Actually, yaknow what? I'mna just cut a big portion of the numbers bit out entirely, and put it into a separate post from my bitching.

Ok, moving on... Sorry, bad and unintentional pun...

As part of the move, we've been trying to get rid of a bunch of old servers that are beyond end of life (some of them as much as 9 years old). Especially since a lot of the older models are rather large and rather power inefficient; and all of them are long out of software support, and security patching etc...

Really, they're a big security risk; and they're just slow and underpowered as hell, while using up way too much juice, and putting out way too much heat.

Then there are the facilities charges (the actual cost of occupying space in the datacenter). They run about $85 per rack unit per month in the enterprise datacenters, and $185 per rack unit per month in the high cost (downtown in a major city) datacenters we're dealing with for this project (and the average server in question is 4 rack units).

Now, I don't know if you've ever moved a datacenter, but for secure moves (we are a financial institution after all; when we move boxes we need to either destroy the hard drives, securely wipe them, or move them under bonded physical security) the cost runs from $1500 to $2500 per box, not including end user labor (getting the applications ready for a move, moving over to the business continuity site, testing the box when it comes back up in the new site, etc...)

That sounds like an awful lot of money just for moving a box; but again, that's taking into account ALL the costs. You have to power the thing down, unrack it and uncable it, then rack it, power it, and cable it at the other end (which takes a quantity of rather expensive labor that can be surprising to the uninitiated); plus wiping the drives and moving the box by freight, or moving it "securely"; and finally you have to test the box in its new location for power and proper network and application connectivity (before you hand it back to the end users for their own testing).

Frankly, it just isn't cheap.

I tell you this so you understand: just MOVING 600 boxes, without otherwise touching them or changing anything, will run between $900,000 and $1.5 million.
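The math there is trivial, but just so you can see it's not hand waving, here's the back-of-the-envelope version (the per-box range is the one quoted above):

    # Moving 600 boxes, at the secure-move range quoted above
    boxes = 600
    low, high = 1500, 2500          # dollars per box, all-in
    print(boxes * low)              # 900000  -> $900,000
    print(boxes * high)             # 1500000 -> $1.5 million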

By decommissioning everything we can, consolidating a bunch of what's left into a smaller number of more powerful machines, then virtualizing everything we can, we managed to bring the numbers down from 1100 to 600.

We figured on an average cost to reprovision in the standard enterprise process (explained in another post) of about $19,000 per system to cover the entire 4 year extended cost (that's hardware, support, maintenance etc... for 4 years. The actual hardware cost is between 10% and 25% of that number). We only needed to budget for the upfront costs, because the extended costs get covered on the annual run rate budgets.

We run everything on a 49 month total cost accounting basis. That means we try to account for every cost associated with a system that we can; we take the acquisition cost, and the first year of support and maintenance, plus labor and install fees, and pay that as "upfront", and the rest comes out as monthly "maintenance" charges over 49 months.
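If it helps, here's a simplified sketch of how that split works. The numbers in it are made up purely for illustration; the real model has a lot more line items:

    # Simplified sketch of the 49 month cost model (illustrative numbers only)
    acquisition   = 5000    # hypothetical hardware purchase price
    yr1_support   = 1200    # hypothetical first year of support and maintenance
    labor_install = 1800    # hypothetical labor and install fees
    later_support = 4800    # hypothetical remaining support and maintenance

    upfront = acquisition + yr1_support + labor_install   # paid now
    monthly = later_support / 49                          # monthly "maintenance"
    print(upfront, round(monthly, 2))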

There is one significant element not included in that per system charge though, and that's the $85 to $185 cost per rack unit per month, for the facilities.

Through virtualization, consolidation, decommissioning and repurposing, we managed to cut the host count down from over 1080 to about 600 total (including virtuals), about 200 of which are going to go into other less expensive and better served locations.

The virtual count though...

We actually started off with the directive to virtualize everything that couldn't be decommissioned, except the systems that absolutely couldn't be virtualized (with approved business case and exception required). That was a tall order.

After months of interviews and discovery sessions with server and application end users, I managed to put 300 additional systems on the de-com list; intending to either eliminate them entirely, or consolidate their function with another box. I also managed to come up with a list of 600 potential virtualization candidates. The remaining 180 or so were just going to have to move, but not necessarily to the expensive location.

So that's what we asked for as the budget for the 600: $4.1 million up front, which would cover the virtualization, plus the moves for whatever ones we couldn't virtualize.

That is what we honestly needed to do that many systems. It wasn't a blue sky "nice to have" number, it was the real number.

That's not what we got.

They gave us $1.8 million for upfront.

Not 1.8 million for virtualization, 1.8 million.

Total.

Including all the moves, and anything going into the enterprise solution.

Oh crap...

Ok... so, there was no way we were going to be able to do that using the standard enterprise methodologies. It just wasn't going to happen.

I said flat out, "at that budget, using standard enterprise costings, this project cannot be completed".

I was actually putting my neck on the line by saying that, but it was the truth, and it had to be said...

but...

I had an idea.

So I said to the CIO, and the head of the "steering committee": "Ok... let us get creative here, and I think we can do it. We're going to need to throw the rules out the window, but I have an idea, and I know who to work with and how to get it done... If we have a free hand, we can do it."

He told me to run with it.

So, we went back to all our end users, and pared down even more, consolidated even more, convinced them we were going to do something really cool but they'd have to take the risk with us...

And after about a month of non stop effort, we whittled the "must move" list to about 240, 120 of which were going to go to cheaper facilities where there was more space. We found a few more decoms through creative consolidation, and moving more functions into virtual environments. We got the virtualization list down to about 290, with about 50 where they would be in the standard enterprise virtual farms, and the other 240 on the new "creative" solution.

Then I went heads down with some of the lead architects at VMware, HP, Sun, and IBM for an entire month; and we came up with something.

We used new products, new technologies, and new ways of putting things together, to come up with a certified and supportable software stack that could be virtualized at less than half the cost we had before.

I even got them all to give us SERIOUSLY special pricing, and agree to only charge us 25% up front, with the remainder payable as we filled up the virtual farms, in a capacity on demand model.

These are lab machines. They don't need the kind of support, or the kind of overhead, that running in the enterprise farms entails. They also don't need the high cost ultra reliable high performance storage, or the high cost ultra reliable everything else that goes with it.

All of those costs add up.

Like I've said several times above, it isn't the hardware that makes up most of the cost. In the enterprise farms, a single virtual machine's share of the hardware cost is less than $2000, including all the storage, etc... It's all the support, and the supporting infrastructure.

By getting creative, we were able to cut the average 49 month cost per virtual machine down to about $5,000.

Of course, we still have to account for the staffing to support these NEW virtual farms: 4 employees at 50% time for 4 years, at an average of about $70 an hour (the "standard rate" we use for internal chargeback of mid level sysadmins and system engineers). Basically 4160 hours per year, or about $300,000 a year, over 4 years.
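The arithmetic, for anyone checking (2080 hours being the standard work year):

    # Staffing the new virtual farms: 4 people at 50% time, $70/hour
    annual_hours = 4 * 0.5 * 2080   # 4160 hours per year
    per_year = annual_hours * 70    # 291200 -> call it $300k a year
    print(per_year * 4)             # 1164800 -> ~$1.2 million over 4 years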

Total solution cost worked out to about $2.7 million, with $1.8 million in up front costs (not exactly the typical ratio, because we weren't doing things the typical way):

New virtual farms: $1.2 million with $700,000 up front (that's TCA not just hardware)
Staff: $1.2 million, with $300,000 up front
Moves: $600,000 (just in case, we used the $2500 estimate)
Enterprise VMs: $600,000 with $200,000 up front

If it came down to it, we could probably trim out another $240k on the upfronts by using the lower moving estimate, to get down to $1.56 million.

Now remember, when we started, it looked like we were going to have to move 1100 servers, at a total cost of almost $2.8 million just for moving, and a cumulative cost to the company of $814,000 per month in facilities fees; for a 49 month total cost of about $40 million.
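Those figures fall straight out of the numbers above, assuming the 4 rack unit average and the $185 downtown rate:

    # Worst case: move all 1100 servers and keep paying downtown floor space
    servers = 1100
    print(servers * 2500)        # 2750000 -> almost $2.8 million just to move
    monthly = servers * 4 * 185  # 4 RU each, at the high cost rate
    print(monthly)               # 814000 per month in facilities fees
    print(monthly * 49)          # 39886000 -> about $40 million over 49 months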

We're talking about going from $40 million, down to $2.7 million, with only $1.8 million up front.

We went back before the steering committee, presented the new solution along with the planning and the end users' signoffs that they were good to go...

And they cut our budget again...

To 1.1 million...

Including the moves, and the stuff going into the enterprise...

Now, you can't even MOVE 530 boxes for $1.1 million; at least not without hitting the bottom end of projections. Not only that, but without doing the creative solution thing, we weren't going to be able to cut the numbers so far. We'd be back to 600 boxes.

Thing is though, we were the victims of our own success.

If we'd had to move the original 1100 boxes, or even virtualize more than a couple hundred, there's no way they could have contemplated doing anything other than our new solution. The costs would have just been too high.

But we'd managed to cut that number down so far, and to split the moves up across several different locations so the new building (which only had capacity for 400 of those 1100) wouldn't take the full load, that they were able to contemplate just moving everything.

If you add up the totals... $2.7 million total and $1.8 million up front to implement the technically correct, better-for-the-business solution; vs. the cost of moving 600 boxes at the low end estimate of $1500 a pop, for a total of $900,000...

Of course that discounts the facilities costs, but apparently that's how they want to account for it.

I should say, there IS a valid reason for it.

The moving expenses can be written down as a one-time structural charge related to our recent big mergers and acquisitions (like every surviving major bank, we ate another bank whole last year).

The ongoing floorspace charges on the other hand are already sunk into the buildings' fixtures and facilities budgets (and reported as such); so by breaking them out into a cross charge, we actually disadvantage ourselves for earnings purposes.

This is completely honest by the way, all costs are being accounted for; it just skews the real cost of IT, by pushing some of that cost out onto facilities.

So, this morning I reported to the committee, the CIO and everyone else, that with the newly revised budget, and the numbers we managed to cut down to, and how they wanted to account for costs... that just moving all of the boxes was the "right" solution.

Actually, I put it differently. I said that it was technically the wrong thing to do, but for accounting reasons, it was the right thing to do.

I did make clear I thought that long term, the new solution was better both technically and financially; and that in prioritizing the upfront costs we were going to miss a potential savings of well over $15 million over four years.

But that's over four years, vs. the $900,000 today... Who was going to provide those staff members? Where would we get clearance to hire them? Whose books would they be carried on? Who was going to cover all that maintenance, when the lab machines weren't seeing any maintenance charges now, and the end users didn't want to assume those charges in 12 months?...

What I didn't mention, was that we had just flushed two and a third full time senior staff members' worth of labor (a full time senior technical project manager, a full time senior finance manager, and me, a 1/3 time chief architect) for 5 months as well, at $120 an hour (the internal chargeback rate for senior staff).

There's another $250k wasted.
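If you want to see where that number comes from:

    # 2 1/3 senior FTEs, for 5 months, at the $120/hour senior chargeback rate
    fte = 1 + 1 + 1/3                # PM + finance manager + a third of me
    hours_per_month = 2080 / 12      # ~173 hours in a standard work month
    print(round(fte * hours_per_month * 5 * 120))   # ~242667 -> call it $250k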

Nor did I mention how great wasting those hours made us all feel.

Oh and now, we're going to have to go back to all those end users we convinced to come try something new with us, and we're going to have to get them to sign off on a straight move.

It's going to take us at least six weeks, by which time we're only going to have 5 months left to get everything moved, one month of which will be taken up by our mid December to mid January change freeze for end of year.

That means everything is going to have to be on an emergency escalated basis, and will take longer and cost more, and mistakes will be made...

The wasted political capital, wasted good will, lost trust, and wasted energy... They dwarf the wasted $250k... in fact they probably dwarf the wasted $15 million in the long run, because they make us all less effective at our jobs.

All because people don't want to think beyond the first 12 months... or the next 90 days for that matter. 4 years isn't even long term, but it's not THIS year, and THIS earnings period, and THIS bonus check...

What's scary is that we're actually BETTER about this than almost every large company I've worked with. In fact, I've worked with every other major bank in the U.S. and many of them in Europe; and we're the best of all of them when it comes to being efficient, adopting good solutions, etc...

Yes, seriously. Everyone else is even more convoluted with more perverse incentives than we are.

The one bright spot is that I think I've convinced the management who counts that we can use the new architecture we developed for our datacenter reclamation efforts next year, as we try to take 25% of all our servers off the floor, while still expanding our capacity by 40%.

By the by, when I say reducing footprint by 25%, that's from about 50,000 servers. About half of those are development and testing systems that don't need all that overhead associated with production systems.

By going that way, we could save some serious money. Enough to make the $15 million we didn't save today look like nothing (maybe $200 million), and the $250k we spent developing the solution look like less than nothing.
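That $200 million is strictly napkin math, mind you, but here's one way to get in the neighborhood, counting floor space alone (and assuming the same 4 rack unit average, at only the cheap $85 enterprise rate):

    # Rough sanity check on the "maybe $200 million" figure.
    # BIG assumptions: 4 RU per server on average, $85/RU/month floor rate,
    # and counting nothing but the floor space itself.
    servers_off_the_floor = 50000 * 0.25    # 25% of ~50,000 servers
    monthly = servers_off_the_floor * 4 * 85
    print(monthly * 49)                     # 208250000 -> ~$200 million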

If they let us do it.

There are days I love my job. When I'm allowed to do it, it's great.

Then there are days like today.