Thursday, March 1, 2012

Bug 2000 (Y2K) is back in Microsoft's Azure

In my Vendors Survival posts I wrote about many leading vendors. A few of those posts were about Microsoft:

Will Microsoft survive until 2021? - revisited


I also posted about vendors' technologies, strategies and acquisitions as they relate to long-term survival. The post

Microsoft's Skype acquisition: Warning Signs ahead

is an example.

Yesterday, there was a long Microsoft Azure service outage.
The reason for the outage was a time miscalculation for a leap year.
A long service outage is not unique to Windows Azure's PaaS platform. There have been long outages in services provided by other Cloud Computing vendors, such as Amazon and Google.
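This post does not have the details of the miscalculation. One common form of leap-day bug, shown here only as an illustration and not as Microsoft's actual code, is computing "one year from today" by incrementing the year and keeping the month and day. The function names below are hypothetical.

```python
from datetime import date


def one_year_later_naive(d: date) -> date:
    """Hypothetical buggy logic: bump the year and keep month and day."""
    return d.replace(year=d.year + 1)


def one_year_later_safe(d: date) -> date:
    """Fall back to February 28th when the target year has no February 29th."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    one_year_later_naive(leap_day)
except ValueError as err:
    # February 29th, 2013 does not exist, so the naive version raises.
    print("naive calculation failed:", err)

print(one_year_later_safe(leap_day))  # 2013-02-28
```

The naive version works on every other day of the year, which is why a bug of this kind can pass testing and then fail only on February 29th.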

The outage itself has nothing to do with Microsoft's long-term survival probability.
The reason for the outage, however, is related to Microsoft's survival: it is an indication of very poor Software Quality Assurance.
Yesterday Microsoft proved that it learned nothing from the Year 2000 Bug (Y2K), which was a major issue more than a decade ago.

We may discover tomorrow, or in a year, or in two or four years, which other lessons Microsoft did not learn.

Y2K, IT and Business
My blog's name is SOA Filling the Gaps because of the gaps between IT and Business, which SOA is meant to address (as are BPM and Business Oriented Architecture).

Y2K was a major cause of widening these gaps: IT managers promised the Business managers that if they spent a lot of money and resources, the organization's computerized systems would not collapse.
The IT managers did not promise any functional extensions or software improvements.

Y2K FUD and Facts
Y2K is about system failures caused by using a two-digit year representation in date fields. As a result, both 1900 and 2000 are represented by 00, and calculations mistakenly use 1900 instead of 2000.
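A minimal Python sketch of this failure mode follows. The function names and the fixed "always 19xx" assumption are illustrative and are not taken from any particular system.

```python
def expand_two_digit_year(yy: int) -> int:
    """Interpret a two-digit year as many pre-Y2K systems did: always 19xx."""
    return 1900 + yy


def age_in_years(birth_yy: int, current_yy: int) -> int:
    """Compute an age when only two-digit years are stored."""
    return expand_two_digit_year(current_yy) - expand_two_digit_year(birth_yy)


# A person born in 1965, evaluated in 1999: the result is correct.
print(age_in_years(65, 99))  # 34

# The same person evaluated in 2000 (stored as 00): the result is negative.
print(age_in_years(65, 0))   # -65
```

The stored data is not wrong in itself; the error appears only when a calculation has to guess the century.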

Y2K projects were actually Risk Management projects. However, they were justified by a lot of FUD and little real evidence of risk.
The so-called risks ranged from sending a 105-year-old woman an invitation to join kindergarten (low-severity risk) to nuclear reactor explosions (high-severity risk).

The only real evidence was case studies of a few installations that failed to handle the 1996 leap year.
Y2K is a higher-severity risk than a leap year miscalculation.
Addressing Y2K is also a lot more complex than addressing a leap year miscalculation.
The conclusion was that the Y2K damage potential could be quantified as hundreds or thousands of times the damage measured in the 1996 leap year miscalculations.
 
If I remember correctly, the most cited example of the 1996 leap year problem was the Brussels Stock Exchange. This risk event could be quantified as actual sums of money lost because February 29th was treated as if it were March 1st.
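The kind of miscalculation described above can be sketched in Python. The "naive" function is a hypothetical example of date logic that forgets February 29th; it is not the actual code of the Brussels Stock Exchange or of any other system.

```python
from datetime import date, timedelta


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def naive_day_after_feb_28(year: int) -> date:
    """Hypothetical buggy logic: assumes February always has 28 days."""
    return date(year, 3, 1)


def correct_next_day(d: date) -> date:
    """Let the standard library handle month lengths and leap years."""
    return d + timedelta(days=1)


for year in (1900, 1996, 2000, 2012):
    print(year, is_leap_year(year))

print(naive_day_after_feb_28(1996))         # 1996-03-01, skips February 29th
print(correct_next_day(date(1996, 2, 28)))  # 1996-02-29
```

Note that 1900 is not a leap year while 2000 is, so a simplified "divisible by 4" rule gives the right answer for 2000 only by coincidence.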

I am sure that the Windows Azure outage caused by a leap year miscalculation cost a lot more than the same miscalculation cost the Brussels Stock Exchange in 1996.

4 comments:

Avi Rosenthal said...

LinkedIn Groups

Group: KnowYourCloud
Discussion: Bug 2000 (Y2K) is back in Microsoft's Azure
I wonder if there are MSFT customers that overcome the Azure downtime by a specific architecture of failover.
Posted by Ofir Nachmani

Avi Rosenthal said...

LinkedIn Groups

Group: iCMG Architecture World
Discussion: Bug 2000 (Y2K) is back in Microsoft's Azure
No surprise there. When you convert the 6-digit year to an eight-digit year, there is a cutoff point (in our case it was 2032) where the two-digit year assumes its century from. For example, is '09 = 1909 or 2009? For the conversions we did, '31 became 1931 and '32 became 2032.

I should imagine it only relates to converted dates at the time of conversion. Surely they didn't retain the 6-digit date format and continued to convert everything?
Posted by Doug Scott

Avi Rosenthal said...

LinkedIn Groups

Group: iCMG Architecture World
Discussion: Bug 2000 (Y2K) is back in Microsoft's Azure
Glad we did not use a "windowing" solution when we fixed our systems a couple of jobs ago!

Ah, but don't forget the Unix problem if you are running 32-bit libraries. Runs out of digits around 2038 (see http://en.wikipedia.org/wiki/Year_2038_problem ). I'll be retired by then but guess there'll be a few issues.

BTW, we did see one issue in my project that would have had a major impact if not detected at the time it happened. And there were a couple of problems on Dec-31-2000 due to the leap year (exception to an exception): 7-11 had a POS problem and a Scandinavian train system (I think it was Norway) also had issues. Finally, Apple's website showed 1/1/19100 on Jan-01-2000. :)
Posted by Jose Solera, MBA, PMP®, CSM, CSPO, CSP

Avi Rosenthal said...

LinkedIn Groups

Group: iCMG Architecture World
Discussion: Bug 2000 (Y2K) is back in Microsoft's Azure
I must be getting old. I had forgotten the term "Windowing solution"; I had always thought it risky, particularly for an insurance company that was selling policies for people aged over 100 years old (there are an increasing number of such people). Still for normal accounting purposes, it worked fine - it was messy, that's all, and I hate messy solutions.
Posted by Doug Scott
