The Context
If you’ve ever had an unstable e-commerce site to contend with, you know how frustrating an experience it can be.

Usually the scenario goes like this: just as the November holiday traffic starts to build, something flips and the online store becomes unresponsive to the point of failure. Sometimes this descent into the abyss takes several hours. Other times it is near-instantaneous. The applications are rebooted while everyone waits anxiously to see if it will happen again.

And it does. Over and over again.

There’s no predictability to it. There are only vague hunches put forward by network operators trying to explain possible causes. All the while, hitting those comparables for the year becomes more and more elusive.


We started Tacit Knowledge troubleshooting these issues. Perhaps it was because in 2002 most software development had moved offshore. Perhaps it was because of the glut of bad, hurried software written during the dotcom boom. Whatever the reasons, production outage resolution became our bread and butter during the first few years. I’m proud to say that with just one exception, we have stabilized every system we have come across in under six weeks.

The exception? Most stability problems have between one and five root causes. Like a whodunit mystery, it is the investigation that takes time. The fixes are like a Scooby-Doo ending: masks removed and the plot unveiled. The application that departed from this pattern was a case of death by a thousand cuts. No operational maintenance had been performed for years, and the entire system needed an overhaul.

it’s the approach

The secret to our success has not been the proprietary technology we’ve developed or the experience gleaned over time. It’s all in the approach, and in this article I’m going to explain how and why we do what we do. I’m going to bring in some lean terminology from time to time, but don’t worry – you won’t have to have a poster of W. Edwards Deming on your wall to follow these concepts. I’m also going to touch a little on the nature of these systems, but I promise to keep it out of the technical weeds. My first point on approach:

1. You must work as a team and avoid rogue troubleshooting.

I’ve seen this too often. The crisis hits. A response team is organized. Each member of the team has an opinion about a possible cause. Often, these hunches aren’t based on anything more than a similar problem the person witnessed years ago. One of the classic recommendations is to update the patch level on a particular piece of software. In an effort to be helpful, these opinions are voiced, the team lead takes the best suggestion, and days or weeks are spent applying the fix to production.

It never works.

It doesn’t work because these systems have so many moving pieces (complexity) that the statistical likelihood of a trial-and-error approach succeeding in days or weeks is as remote as hitting a hole in one. Consider the diagram below. The circle represents the problem domain: a map of all possible causes. Each dot on the map is an actual cause (and associated fix). Trial and error amounts to throwing darts at this domain blindfolded. Little to no information is derived from each throw: there isn’t any feedback.

[Diagram: the problem domain, with the actual causes marked as dots]

Now consider an approach that uses reductionism. Aspects of the system are isolated. Each applied change is carefully constructed to prove or disprove that one or more of the culprits reside in a part of the system. Put differently, the goal is to reduce the problem domain – not to complete the Hail Mary pass.
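To make the contrast concrete, here is a small, purely illustrative simulation in Python. The domain size and the assumption that each experiment cleanly halves the remaining domain are mine, invented for the sketch, not drawn from any real engagement. It compares blind trial and error against changes designed to rule half of the remaining domain in or out:

```python
import random

# Toy model: the problem domain holds N equally likely candidate causes,
# exactly one of which is the real culprit. N is an arbitrary illustration.
N = 1024
root_cause = random.randrange(N)

def trial_and_error(cause, n):
    """Guess one candidate fix at a time, with no feedback beyond 'it didn't work'."""
    candidates = list(range(n))
    random.shuffle(candidates)
    return candidates.index(cause) + 1  # attempts until the lucky dart lands

def reduce_domain(cause, n):
    """Each experiment proves the culprit sits in one half of the remaining domain."""
    lo, hi, experiments = 0, n, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        experiments += 1
        if cause < mid:
            hi = mid  # culprit shown to be in the lower half
        else:
            lo = mid  # culprit shown to be in the upper half
    return experiments

print("trial and error:", trial_and_error(root_cause, N), "attempts")
print("domain reduction:", reduce_domain(root_cause, N), "experiments")
```

With roughly a thousand possible causes, blind guessing averages hundreds of attempts, while halving the domain each time converges in about ten experiments. Real systems are messier than a sorted list of candidates, but that arithmetic is why reduction beats darts.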

It is important that the charter of the team is to reduce this domain as quickly as possible. This requires a united effort.

2. The team’s charter is to reduce the problem domain quickly – not to fix the system.

For the more technical audience, I describe how this approach is applied to multi-tiered e-commerce stacks in Part 2 of this article.

small batches

Now it’s time to introduce the first Lean concept: batch-size. The easiest way to explain batch-size in an e-commerce context is to look at a software release schedule. Some retailers we’ve encountered maintain a cadence built around a big holiday release (one per year). Others are more initiative-based with five or six smaller releases rolling out in a given year. The big-bang holiday pattern is an example of a release strategy with a large batch-size. The latter schedule with five or six releases has a smaller batch-size.

There is a cost-benefit analysis that Lean practitioners perform here. On one side of the ledger, each release carries a transaction cost, often borne in the form of QA, release engineering, scheduled outages, operational support, code merges, and so on. Taken ad absurdum, a ten-releases-per-day strategy would see the transaction costs outweigh the economic benefits.

On the other side of the ledger, there is a financial case for the small-batch approach, calculated from the opportunity cost of having those new features sit on the shelf.

For example, a new promotion designed to have a positive impact on average order value loses revenue each day its release is delayed. This is the cost of delay, or holding cost, in Lean parlance.

Plotted together, the transaction cost and the cost of delay are in tension, and their sum forms a U-curve. The U-curve optimization is just a way to visualize how we find the economic sweet spot.
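To make that sweet spot concrete, here is a minimal toy model. The dollar figures and the cost-of-delay formula are assumptions I’ve invented for illustration, not numbers from a real retailer:

```python
# Toy U-curve: total annual cost of a release schedule as a function of
# how many releases ship per year. All figures are illustrative.
TRANSACTION_COST = 20_000        # hypothetical cost per release (QA, release eng, outages, merges)
ANNUAL_FEATURE_VALUE = 600_000   # hypothetical yearly value of the features waiting to ship

def total_cost(releases_per_year, transaction_cost=TRANSACTION_COST):
    release_cost = releases_per_year * transaction_cost
    # Cost of delay: with n releases a year, a finished feature waits on
    # average half a release interval before it ships and starts earning.
    cost_of_delay = ANNUAL_FEATURE_VALUE / (2 * releases_per_year)
    return release_cost + cost_of_delay

# The bottom of the U-curve is the economic sweet spot.
sweet_spot = min(range(1, 53), key=total_cost)
print("sweet spot:", sweet_spot, "releases per year")

# Drive the transaction cost toward zero (point 4 below) and the sweet
# spot shifts toward many small releases.
cheap_spot = min(range(1, 366), key=lambda n: total_cost(n, transaction_cost=100))
print("with a near-zero transaction cost:", cheap_spot, "releases per year")
```

Change either assumption and the bottom of the curve moves; the shape of the trade-off is the point, not the specific numbers.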

There is another issue that factors into the cost of delay, and it is crucial to the quick remediation of stability problems. That issue is feedback. Starting with the promotions example above, releasing the promotion sooner gives a merchant feedback as to whether or not the promotion worked. This feedback is incorporated into the business strategy. If it was wildly successful, perhaps similar promotions are prioritized to the head of the development queue. If not, similar promotions may be deprioritized. Either way, the business adapts to feedback.

Within the context of application troubleshooting, this feedback is critical to shrinking the problem domain quickly. It’s so critical that we won’t agree to help a merchant unless they agree to a rapid pace of change.
3. For troubleshooting, plan on one change to production every day.

These changes aren’t invasive and are often no more than additional logging or monitoring. But there’s an obstacle: the transaction cost of a software release. Luckily, we can turn the economic model on its head by reducing this cost to near zero.
4. Automate every step of a software release and deployment to production so that you can push a new release with a single-line command.
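What that single command wraps will vary by stack; the sketch below is only a shape, and every script name and step in it is a hypothetical placeholder rather than a prescription:

```python
#!/usr/bin/env python3
"""release.py -- a minimal sketch of a one-command release pipeline.

Every command, script, and environment name here is a made-up placeholder;
the point is that build, test, packaging, and deployment run unattended
from one entry point, pushing the transaction cost of a release toward zero.
"""
import subprocess
import sys

# Hypothetical pipeline steps; substitute your own build/deploy tooling.
STEPS = [
    ["./scripts/build.sh"],                      # compile and unit-test the application
    ["./scripts/integration_tests.sh"],          # run smoke/integration tests
    ["./scripts/package.sh"],                    # produce a deployable artifact
    ["./scripts/deploy.sh", "production"],       # roll the artifact out to production
    ["./scripts/healthcheck.sh", "production"],  # verify the site came back healthy
]

def release():
    for step in STEPS:
        print("==>", " ".join(step))
        result = subprocess.run(step)
        if result.returncode != 0:
            sys.exit(f"release aborted: {' '.join(step)} failed")
    print("release complete")

if __name__ == "__main__":
    release()
```

The particulars matter less than the property that the whole path from source to production runs unattended, because that is what makes a daily change to production economically painless.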

With the transaction cost driven toward zero, the cost-benefit optimization shifts: the bottom of the U-curve moves toward many small releases.

Even if the automation takes several days to put in place, it will be well worth it, and you’ll emerge from this crisis with a more adaptable digital business.

reduce cycle-time

At this point, I’ve proceeded with the assumption that the stability problem only occurs in production and cannot be replicated in any other environment. We contend with this frequently, and stability issues are often isolated and resolved in production before a test environment is set up. It is common that a performance test environment will diverge from a live environment over time. It is also common that companies won’t have a test environment at all.

We strongly advocate creating a test environment that mirrors production. In fact, there are several milestones we use to measure the progress of a stabilization initiative between kickoff and the identification of one or more root causes. These milestones have one thing in common: each is significant because it reduces cycle time.

 
 

Cycle time is the time it takes to run a process from start to finish. This process could be a step on an assembly line. The same idea applies to e-commerce sites. For example, the approximate time it takes for a submitted order to be transferred to a warehouse for fulfillment is the average cycle time for order processing.
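As a concrete (and entirely made-up) example of that order-processing measurement, the cycle time per order is simply warehouse handoff minus submission, averaged over orders:

```python
from datetime import datetime

# Hypothetical order records: when each order was submitted and when it was
# handed off to the warehouse. Timestamps are invented for illustration.
orders = [
    ("1001", "2014-11-25 09:12", "2014-11-25 11:40"),
    ("1002", "2014-11-25 09:30", "2014-11-25 13:05"),
    ("1003", "2014-11-25 10:02", "2014-11-25 12:15"),
]

FMT = "%Y-%m-%d %H:%M"

def avg_cycle_time_minutes(rows):
    # Cycle time per order = warehouse handoff minus submission, in minutes.
    durations = [
        (datetime.strptime(done, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60
        for _, start, done in rows
    ]
    return sum(durations) / len(durations)

print(f"average order-processing cycle time: {avg_cycle_time_minutes(orders):.0f} minutes")
```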

In any site stabilization exercise, the cycle time we are interested in is the time it takes a team to formulate a hypothesis that will reduce the problem domain, code the experiment to prove or disprove that hypothesis, apply the test and extract the results.

Even with all the best practices proposed, when contending with a sporadic production issue, the formulate/test/interpret cycle time is measured in days. If the issue can be replicated in a load test, that cycle time shrinks to hours. If a developer can reproduce the issue on his or her workstation, it shrinks to minutes. Therefore, we always strive to reduce cycle time while we troubleshoot, because every reduction dramatically speeds up how quickly we converge on the cause and the associated fix. Because of the significance of cycle time, we measure progress against the following milestones:
5. Track these milestones: a) one change per day is being pushed to production, b) the problem is replicated in a test environment with a load test, c) the problem is reproduced in a developer sandbox environment.
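A back-of-the-envelope calculation shows why these milestones matter. Assuming, purely for illustration, that roughly ten well-designed experiments are enough to corner a root cause (as in the domain-reduction sketch earlier), the calendar time to resolution is driven almost entirely by the cycle time of each experiment:

```python
import math

# Illustrative assumption: ~1,000 plausible causes, each experiment halves
# the remaining domain, so ~10 experiments suffice.
experiments_needed = math.ceil(math.log2(1000))

# Hypothetical formulate/test/interpret cycle times, in hours.
cycle_times = {
    "production only, one change per day": 24,
    "replicated in a load-test environment": 4,
    "reproduced on a developer workstation": 0.25,
}

for stage, hours in cycle_times.items():
    total = experiments_needed * hours
    print(f"{stage}: ~{total:g} hours to root cause ({total / 24:.1f} days)")
```

The numbers are invented, but the relationship is the point: shrinking the cycle time, not working harder within it, is what shortens the road to a root cause.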

In order to maximize the likelihood of reproducing the problem in another environment, we try to make all environments mirror production as closely as possible.

By this time, I hope I’ve successfully conveyed the thinking behind the methodology and why the approach, not tools or IP, really is the crucial determinant of timely success. In the second installment of this series, I dive into how a team can most effectively reduce the problem domain on an n-tiered e-commerce system. It’s a bit more technical, but not rocket science.

 

>> Read Part 2 of the article