Friday, April 20, 2012

Stress Test for IT Ops Shops

Cringely wrote something interesting today, ostensibly about how to fix IBM (a cause I believe is currently lost; their management is too far down the hole to see daylight anymore); but more particularly, it includes what I think is actually a pretty great stress test for any IT ops shop.

For those of you unfamiliar with the IT outsourcing business: many companies, both large and small, contract out some or all of the design, implementation, maintenance, service, and support of their IT operations to professional service providers.

These services can be on a long term or short term basis; and in scope can extend anywhere from staff supplementation (a few extra bodies), to staff replacement, all the way to complete "blackbox" outsourcing; where your company literally has no IT staff OR infrastructure whatsoever (sometimes even no desktop PCs), and everything is handled by an outside company.

This has of course been going on since the 1950s and 1960s with mainframes, and 70s and 80s with minicomputers; but over the past 15 years or so, has dramatically accelerated, to the point that many organizations even outsource all of their IT desktop and server operations, support, application management, administration... Some even outsource their IT management, policy setting... really everything related to information technology; except perhaps the senior management (CIO, CTO etc...).

Very frequently, this means that no single person actually employed by an organization, has any IT knowledge, expertise, or access to the IT resources of that organization; or, if they do, they don't have the time necessary to properly oversee these functions (because, after all, it's about cost savings; which means man hour savings more than anything else).

What results from this, is in theory a cost savings (and often in practice); but when no-one employed by your company has the skill or understanding to properly evaluate the service they are getting from these contractors... well, things can get out of control very quickly.

The U.S. Navy, U.S. Air Force, DOD, the State of California, and New York City (among thousands of other organizations around the world); have all found this out the hard way, to the tune of billions of dollars in cost overruns, and project failures.

As it happens, I have worked in this business much of my career; mostly in the design, architecture, and implementation functions, but some in the operations and support functions. In many of the roles I've held along the way, I've dealt with a lot of these contracts, in all their many variants; and the many companies that provide them (including IBM, and my new employer; both quite extensively).

In fact, the new job I start Monday is an operations role (managing information security operations), on a "staff" contract (which is one of the variants of these types of contracts, where instead of completely outsourcing the function, the company brings in outsourced contractors, who act within the structure of the parent company as if they were employees... thus "staff"... even though most of the actual IT roles are in fact filled by contractors).

I don't disagree with anything Cringely has written in his post; or for that matter, the rest of the post series on IBM (which, if you have any interest in IBM or in IT, you should read, if you haven't already. It's big news, and I think it's correct).

The test he proposes, SHOULD be one that any competent and well run IT organization SHOULD be able to pass; whether outsourced, or organic to a company.

Frankly, I know a lot of shops that wouldn't pass this test... In fact I don't know many shops that could pass every part of this test in a reasonable amount of time... and I'm willing to bet a lot of folks are going to be seeing this come from their clients or their management in the next few days or weeks.

So, the test that Cringely proposes:
Ask your IT outsourcing provider to produce the following:

1) A list of all your servers under their support. That list should include:

  • Make
  • Model
  • Serial Number
  • Purchase Date
  • Original and current asset value
  • Processor type and speed
  • Memory
  • Disk Storage
  • Hostname
  • IP address(s)
  • Operating system(s)
  • Software product(s)
  • Business Application(s)

Is this list complete? How long did it take your provider to produce the list? Did they have all this information readily accessible and in one place?
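
As an illustration only (this is not part of Cringely's test), the "is this list complete?" question can be partly automated once the provider hands over the inventory in some structured form. The sketch below assumes each server arrives as a dictionary; the field names in REQUIRED_FIELDS are hypothetical labels for the items in the list above, not any real tool's schema.

```python
# Hypothetical field names, one per item in the checklist above.
REQUIRED_FIELDS = [
    "make", "model", "serial_number", "purchase_date",
    "original_value", "current_value", "processor", "memory",
    "disk_storage", "hostname", "ip_addresses",
    "operating_systems", "software_products", "business_applications",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in one server record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


def audit_inventory(records):
    """Map each incomplete server to the list of fields it is missing."""
    gaps = {}
    for i, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            # Fall back to serial number or position if the hostname is the gap.
            key = record.get("hostname") or record.get("serial_number") or f"record #{i}"
            gaps[key] = missing
    return gaps
```

An empty result from audit_inventory only says every field is filled in; it says nothing about whether the values are correct, or whether servers are missing from the list entirely. That still takes a human walking the floor.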

2) A report on the backup for your servers for the last 2 weeks.

  • Are all servers being backed up?
  • Are all the backups running in the planned time window? Is there ample time left over, or is the operation using every minute of the backup window?
  • When backup runs on a server there are always files that are open or locked and the backup cannot copy them. Every day the backup team needs to look at their reports and make sure that files that were missed are backed up. In your examination of the backup reports you should see evidence of this being done.
  • If you spot any potential problems with a server ask for a list of all the files on the server. The list should show the filenames, dates, and whether the archive (backup) bit has been flipped.

Is this list complete? How long did it take your provider to produce the report? How often does your provider conduct a data recovery test? If a file is accidentally deleted, how long does it take your provider to recover it? Can your provider perform a “bare metal” restoration? (Bare metal is the recovery of everything, the operating system included, onto a blank system.)
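
Again as a rough sketch rather than anything from the original test: the first two backup questions (is every server backed up, and is the window nearly full?) are easy to check mechanically for a single night. The record layout, the function name review_backups, and the 20% slack threshold are all assumptions made for the example.

```python
from datetime import datetime


def review_backups(jobs, all_servers, window_start, window_end, min_slack=0.2):
    """Flag servers with no backup, a backup outside the window, or little slack.

    jobs: list of dicts with 'server', 'start' and 'end' (datetime objects)
    for one night's backup run. Returns {server: [findings]} for servers
    that have at least one finding.
    """
    findings = {s: [] for s in all_servers}
    window = (window_end - window_start).total_seconds()
    seen = set()
    for job in jobs:
        server = job["server"]
        seen.add(server)
        notes = findings.setdefault(server, [])
        if job["start"] < window_start or job["end"] > window_end:
            notes.append("ran outside window")
        else:
            # Fraction of the window still unused when this job finished.
            slack = (window_end - job["end"]).total_seconds() / window
            if slack < min_slack:
                notes.append("little time left in window")
    for server in all_servers:
        if server not in seen:
            findings[server].append("no backup recorded")
    return {s: n for s, n in findings.items() if n}
```

This covers timing and coverage only. Whether skipped or locked files were picked up later, and whether a restore actually works, cannot be read off start and end times.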

3) A report on the antivirus software on your Windows servers.

  • Is antivirus software running on all your Windows servers?
  • Is it the same (standard) version?
  • Are the virus signature file(s) current?
  • Ask for case information on any recent virus infections

Is this list complete? How long did it take your provider to produce the report? When a virus is detected on a server, how is the alert communicated to your IT provider? How fast do they log the event and act on it?

4) A report on your network. It should include:

  • Illustrations of the major network equipment including routers, switches, firewalls, etc.
  • IP address allocations.
  • Internal DNS entries.
  • Current routing and firewall rules.

Is this information complete and current? How long did it take your provider to produce this information? Is this information stored in a readily accessible place so that anyone from your IT provider can use it to diagnose problems?

5) Information on your Disaster Recovery plans. Here is what you want to know:

  • Documentation on a recent DR test, the plan and results. It should show the actual times tasks were started and completed. Problems should be logged. (it is okay for there to be some problems, that is the purpose of the test)
  • Ask for a list of names from the IT provider of the people who worked on the test.
  • How many people who worked on the test live full time in the same country as your DR facility?
  • Did your IT provider fly in an army of offshore support folks for the test?
  • If there was a real disaster how long would it take your IT provider to assemble a team to support your emergency?
  • Ask for a list of your critical applications to be provided and supported in a disaster.

Is the list complete and correct? Is there sufficiently detailed information on each critical application? How much data is involved? Is the data actively sync’d over a network? How often is the sync’ing process checked? What hostnames and filesystems need to be restored? What application skills are needed to start up the applications?

6) Help desk information. Here is what you want to know:

  • Ask for a report of all the help desk tickets for the last 2 weeks.
  • Independently ask your company (not your IT provider) for information on known IT problems over the last two weeks.
  • Compare the information from the helpdesk and your company sources.
  • Pick a few random incidents from the help desk ticket report. How long did it take to discover the problem? How long did it take your IT provider to begin to work on the problem? How long did it take your IT provider to fix the problem? Was the problem really fixed?
  • Is there an active problem prevention program? Is your IT provider examining the reported IT problems and finding ways to reduce the number and frequency of problems?
  • How long did it take your provider to produce this report? Did they have all the help desk ticket information readily accessible to everyone and in one place?
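
For the "how long did it take" questions, a small sketch like the following can turn a ticket export into numbers you can compare month to month. It is only an illustration: the ticket fields ('opened', 'first_response', 'resolved', 'reopened') are hypothetical names, and a real help desk system will have its own export format.

```python
from datetime import datetime
from statistics import median


def ticket_timings(ticket):
    """Hours from report to first response, and from report to resolution."""
    opened = ticket["opened"]
    to_respond = (ticket["first_response"] - opened).total_seconds() / 3600
    to_fix = (ticket["resolved"] - opened).total_seconds() / 3600
    return to_respond, to_fix


def summarize(tickets):
    """Median response and fix times in hours, plus a count of reopened tickets.

    Expects a non-empty list of closed tickets; median() raises
    StatisticsError on an empty list.
    """
    timings = [ticket_timings(t) for t in tickets]
    return {
        "median_hours_to_respond": median(r for r, _ in timings),
        "median_hours_to_fix": median(f for _, f in timings),
        "reopened": sum(1 for t in tickets if t.get("reopened")),
    }
```

The reopened count is a crude stand-in for "was the problem really fixed?". The comparison against what your own people say went wrong still has to be done by hand.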

7) Look for evidence of continuous improvement.

  • Repeat this process once a month.
  • Look for changes and improvements month to month and over several months.
  • Is the total number of problems being reduced?
  • Is the response time to fix problems being improved?
  • Is there clear evidence your IT provider has an active and effective continuous improvement program?

A good IT provider will have the tools to automatically collect this data and will have reports like these readily available. It should be very easy and quick for a good IT provider to produce this information.

A key thing to observe is how much time and effort does it take your IT provider to produce this information. If they can’t produce it quickly, then they don’t have it. If they don’t have it they can’t be using it to support you. This then will lead you to the most important question: are they doing the work you are paying them for?

If you're in IT, I'd be willing to bet your own shop can't pass this test in all aspects; at least not within 30 minutes or an hour, or even the same business day. In my rather broad experience across hundreds of clients, if you can even get most of it within a business day, you're doing pretty well.

That's not as it should be; but it's often how it is. Many shops, if not most, just don't have all the elements they need to maintain this level of operational fitness.

Something that Cringely didn't explicitly write here, but which should be addressed (and is implied in the test); is that passing this sort of test is really dependent on four elements:

  1. Proper tools: Your team needs to have the right tools, access to them, and have them properly configured; so that they can do all of these tasks efficiently, effectively, and consistently.
  2. Proper process: Your IT processes need to account for all aspects of your operations. They need to be easy to understand, well documented, readily accessible and readable by anyone who needs them, consistent but flexible, goal oriented and mission focused, PROPERLY TESTED; and your staff must be properly trained on them.
  3. Adequate staff levels: You need to have enough people to cover all the work required for your IT needs. To an extent, good tools and good process can reduce your staffing requirements (and in some ways, the skill and training requirements for that staff), but you can only cut so far. You MUST have adequate coverage, and that coverage must have sufficient skill, knowledge, and training to meet your needs. Further, you must understand that your staff are human beings, with lives and pursuits outside of work. They have vacations, and family emergencies, and they get sick. They have different skillsets, and different skill levels. Some are more or less efficient or effective than others at some tasks. Treating your staff as fungible man hours is a sure and certain recipe for failure.
  4. Good IT management: Without good management, none of these things will happen. Even if they are handed to you on a silver platter, without good management they will stop working. Management must keep all these factors in mind at all times, and maintain MISSION FOCUS above all else. You are not here to meet a metric, you are here to get a job done; a job that enables other people's jobs. Meeting your metric isn't doing your job; making sure others can do their jobs is.

Of course, that's just my opinion, I could be wrong.