Monday 16 May 2011

Should We Abandon the Cloud?

It's been a bad month for the cloud.

First there was the major Amazon EC2 (Elastic Compute Cloud) outage April 21-22 that brought down many businesses and websites.  Some of the data was unrecoverable and transactions were lost.

Next, the May 10-13 outage of Microsoft's cloud-based email and Office services (Business Productivity Online Suite) caused major angst among its customers, who thought that the cloud offered increased reliability.

Then we had the May 11-13 Google Blogger outage which brought down editing, commenting, and content for thousands of blogs.

Outages from the 3 largest providers of cloud services within a 2-week period do not bode well.

Yesterday, Twitter went down as well.

Many have suggested we abandon a cloud-only strategy.

Should we abandon the cloud for healthcare?  Absolutely not.

Should we reset our expectations that highly reliable, secure computing can be provided at very low cost by "top men" in the cloud?  Absolutely yes.

I am a cloud provider.   At my Harvard Medical School Data Center, I provide 4,000 cores and 2 petabytes of data to thousands of faculty and staff.   At BIDMC, I provide 500 virtualized servers and a petabyte of data to 12,000 users.   Our BIDPO/BIDMC Community EHR Private Cloud provides electronic health records to 300 providers.

I know what it takes to provide 99.999% uptime.  Multiple redundant data centers, clustered servers, arrays of tiered storage, and extraordinary power engineering.
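To put that number in perspective, here is a rough sketch of the arithmetic behind an availability target. It assumes a 365-day year, and the function name is only illustrative. At 99.999%, the budget works out to about 5.26 minutes of downtime per year.

```python
# Rough downtime-budget arithmetic for an availability target.
# Assumption: a 365-day year (525,600 minutes); names are illustrative.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

A single incident that takes more than a few minutes to diagnose uses up the whole year's allowance.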

With all of this amazing infrastructure comes complexity.   With complexity comes unanticipated consequences, change control challenges, and human causes of failure.

Let's look at the downtime I've had this year.

1.  BIDMC has a highly redundant, geographically dispersed Domain Name System (DNS) architecture.   In theory it should not be able to fail.  In practice it did.  The vendor was attempting to add features that would make us even more resilient.  Instead of making changes to a test DNS appliance, they accidentally made changes to a production DNS appliance.   We experienced downtime in several of our applications.

2.  HMS has clustered thousands of computing cores together to create a highly robust community resource connected to a petabyte of distributed storage nodes.   In theory it should be invincible.   In practice it went down.   A user with limited high performance computing experience launched a poorly written job on 400 cores in parallel; it produced a core dump every second, with all of the cores contending for the same disk space.   Storage was overwhelmed and went offline for numerous applications.

3.  BIDMC has a highly available cluster to support clinical applications.    We upgraded to the most advanced and feature-rich Linux operating system.  Unfortunately, it had a bug that made the entire storage filesystem unavailable when used in a very high performance clustered environment.  We had downtime.

4.  BIDMC has one of the most sophisticated power management systems in the industry - every component is redundant.   As we added features to make us even more redundant, we needed to temporarily reroute power, which is not an issue for us because every network router and switch has two power supplies.   We had completed 4 of 5 data center switch migrations when the redundant power supply failed on the 5th switch, bringing down several applications.

5.  The BIDPO EHR hosting center has a highly redundant and secure network.  Unfortunately, bugs in the network operating system on some of the key components caused all traffic to stop flowing.

These examples illustrate that even the most well engineered infrastructure can fail due to human mistakes, operating system bugs, and unanticipated consequences of change.

The cloud is truly no different.  Believing that Microsoft, Google, Amazon or anyone else can engineer perfection at low cost is fantasy.   Technology is changing so fast and increasing demand requires so much change that every day is like replacing the wings on a 747 while it's flying.   On occasion bad things will happen.   We need to have robust downtime procedures and business continuity planning to respond to failures when they occur.

The idea of creating big data in the cloud, clusters of processors, and leveraging the internet to support software as a service applications is sound.

There will be problems.   New approaches to troubleshooting issues in the cloud will make "diagnosis and treatment" of slowness and downtime faster.

Problems on a centralized cloud architecture that is homogeneous, well documented, and highly staffed can be more rapidly resolved than problems in distributed, poorly staffed, one-off installations.

Thus, I'm a believer in the public cloud and private clouds.  I will continue to use them for EHRs and other healthcare applications.   However, I have no belief that the public cloud will have substantially less downtime or lower cost than I can engineer myself.

The reason to use the public cloud is so that my limited staff can spend their time innovating - creating infrastructure and applications that the public cloud has not yet envisioned or refuses to support because of regulatory requirements  (such as HIPAA).

Despite the black cloud of the past two weeks, the future of the cloud, tempered by a dose of reality to reset expectations, is bright.
