How Complex Systems Fail

EDN Admin

[Image: A shelf cloud over Enschede, Netherlands (http://upload.wikimedia.org/wikipedia/commons/d/da/Rolling-thunder-cloud.jpg)]
(Photo by John Kerstholt)

The notion that non-tech-industry people should care about cloud computing continues to weasel its way into mainstream media and popular culture, most recently with a slew of stories about how last week's East Coast storms were to blame (http://www.ibtimes.com/articles/358221/20120630/pinterest-netflix-instagram-down-storm-virginia.htm) for the unscheduled downtime that affected a bunch of popular web services like Instagram, Pinterest, and Netflix. Who knew there was actually such a thing as a derecho (http://en.wikipedia.org/wiki/Derecho) or a shelf cloud (http://en.wikipedia.org/wiki/Arcus_cloud), like the sinister one in the photo above? I digress.

Anyway, as people continue to come to terms with their reliance on the availability and uptime of the unseen back ends of their favorite web destinations, the discussion rages on about what we can learn from this and what we'd do differently next time. A few have posed the question (http://techcrunch.com/2012/06/30/could-instagram-and-other-sites-avoid-going-down-with-amazons-ship/), but it's kind of the same old same old ... a discussion of computing architecture, availability zones, replication, failover, and other IT-specific stuff. It's useful, for sure, but it raises the most fundamental question of all: how do complex systems fail?

If you ask the question in the context of cloud computing, you'll get to comb through lots of tech papers by tech people talking all sorts of tech jargon about stuff that only other tech people would understand. I guess that's fine, but it puts a tech lens on something that should be more basic and fundamental than that. So I went poking around and found a proverbial diamond in the rough ... a paper written over 12 years ago by a doctor, and not a guy with a doctoral degree in computer science, but a medical doctor.
Richard Cook is a director at the Cognitive Technologies Laboratory at the University of Chicago who has done a bunch of work examining the impact of health IT on patient safety. In 2000 he published a paper he described as a "short treatise" on the nature of failure, entitled How Complex Systems Fail (http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf). Reading this paper, you'd think he'd had the web on his mind, even back then, but no ... this short 4-pager rises above any industry-specific context and gets to the heart of how stuff breaks. It's very, very good. Reading it last week, while news of the storm's impact on the web was making headlines here in the U.S., made for an obvious connection back to how we need to think about these new architectures and design points. I thought I knew a thing or two about this stuff, but this paper was pretty eye-opening.

- Tim
