
In the previous article, I covered the team approach to solving
stability issues, and put forward the argument that this approach was the
linchpin in the timely resolution of site stability problems. I left out
an important topic however: how one reduces the problem domain in an
e-commerce solution.

The simple answer is by applying queuing theory – another lean concept.
I’ll abandon the technology and substitute a car service center example
to illustrate.

A customer is having their car serviced, and they want to find out if it
is ready yet. They call the shop, and the next available receptionist
answers. The receptionist puts the customer on hold and has two
courses of action. First, he looks out the window at the pick-up
lot. If the car is there, the response is immediate
(“your car is ready”). If the car isn’t there, the receptionist calls a
service center rep charged with tracking all of the service orders.

The next available rep looks up the customer and either discovers the
car is serviced or that it is currently in the shop. If the car is
still in the shop, the rep puts the receptionist on hold while he dials
the mechanic to get an estimated completion time. The next
available mechanic answers and provides the estimate.
The process looks like this:
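The chain can also be sketched as nested, blocking calls; a minimal sketch, in which every function name and return value is invented for illustration:

```python
# A minimal sketch of the service-center call chain as nested, blocking
# calls; all function names and return values here are invented.
def mechanic(order_id):
    # The deepest tier does the actual estimating.
    return "ready in 2 hours"

def service_rep(order_id, in_shop=True):
    if not in_shop:
        return "serviced"
    # The rep blocks (stays on hold) until the mechanic answers.
    return mechanic(order_id)

def receptionist(order_id, in_lot=False):
    if in_lot:
        return "your car is ready"   # answered without going deeper
    # The receptionist blocks on the rep, who may block on the mechanic.
    return service_rep(order_id)

print(receptionist(42))  # -> ready in 2 hours
```

Note that each caller simply waits on the callee; nothing ever calls back up the chain.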

Real life gets a little more complicated; this request/response cycle is
happening with multiple customers, receptionists, reps, and mechanics at
the same time. The diagram below illustrates the work queues associated
with each request and response.

Note there are six pending requests above. A close examination of each
queue shows that reception is actively working four requests, but is
reliant on the response time of the service department for three of them.
The service department is actively working two requests, and is beholden
to the shop for both.

Note that only two requests in total are actually being handled: one
by reception and one by the shop. Every other request is queued
somewhere in the chain.
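One way to see this is as a snapshot of where each request sits; the per-request details below are invented, but the totals mirror the example (six pending, only two being actively worked):

```python
# Hypothetical snapshot of the chain; the per-request details are
# invented, but the totals mirror the example: six pending, two active.
requests = [
    {"id": 1, "deepest_tier": "reception", "state": "working"},  # checking the lot
    {"id": 2, "deepest_tier": "reception", "state": "queued"},   # on hold for a rep
    {"id": 3, "deepest_tier": "reception", "state": "queued"},
    {"id": 4, "deepest_tier": "reception", "state": "queued"},
    {"id": 5, "deepest_tier": "service", "state": "queued"},     # rep on hold for the shop
    {"id": 6, "deepest_tier": "shop", "state": "working"},       # mechanic on the job
]

working = sum(1 for r in requests if r["state"] == "working")
queued = sum(1 for r in requests if r["state"] == "queued")
print(f"{working} active, {queued} queued")  # -> 2 active, 4 queued
```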

The diagram above illustrates queuing theory, which is also the basis
for troubleshooting e-commerce solutions. A deeper treatment of queuing
theory can be found online. The reason for covering these concepts is to
address the question: how do you explain long hold times? (It is
analogous to explaining poor application performance.)

The delay could arise from a slowdown at any point in the chain. One
scenario is that there are three receptionists and fifty customers
calling in simultaneously. Even if every customer’s car is done and in
the lot, the experience would be slow because of the queue (AKA line) of
customers waiting on a receptionist. On the other hand, the queuing
could be worse somewhere deeper in the chain, e.g., a shortage of mechanics.

Note this pattern: you will never have a quick response if the
receptionists’ work queue is large, but that queue may be large because
all of the receptionists are waiting for reps that are in turn waiting
for mechanics. In other words, there may be plenty of receptionists,
but a shortage of mechanics. Or the inverse might be true. It’s
impossible to tell by just looking at the receptionists’ queue.

Now let’s examine the mechanics’ queue. If the mechanics’ queue is empty
and you still have long customer hold times, you know that increasing
the number of mechanics won’t help. If, on the other hand, the mechanics’
queue is large, you can surmise that you’ve got a capacity problem
in the shop.

The point is that we are able to derive useful information from the mechanics’
queue, but not from the receptionists’ queue. By examining the mechanics’
queue we can either reduce the problem domain to the shop or eliminate
the shop from the problem domain.
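That elimination logic can be sketched as a small function; the tier names and queue depths below are invented, and the chain is assumed to be strictly one-directional:

```python
# A sketch of the elimination logic; tier names and queue depths are
# invented, and the chain is assumed to be strictly one-directional.
def diagnose(queue_depths):
    """Walk from the deepest tier back toward the caller and return the
    first tier whose own queue has pending work; tiers deeper than it
    are empty and can be eliminated from the problem domain."""
    for tier, depth in reversed(queue_depths):
        if depth > 0:
            return tier
    return None  # no queuing anywhere in the chain

# Shop and service queues are empty, so adding mechanics won't help;
# the pile-up must be at reception itself (e.g. too few receptionists).
print(diagnose([("reception", 12), ("service", 0), ("shop", 0)]))  # -> reception

# A large shop queue implicates shop capacity, even though reception
# looks equally backed up in both scenarios.
print(diagnose([("reception", 12), ("service", 3), ("shop", 9)]))  # -> shop
```

Both scenarios show the same receptionists’ queue, yet the diagnosis differs; only the deeper queues disambiguate.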

Why is this? It has to do with the directional nature of the
request/response process. In this back and forth, there is never an
instance where the mechanic proactively calls a rep or a receptionist.
Requests are always passed in the same direction.

Queuing Theory Applied to E-Commerce

This request/response pattern is similar to a typical three-tiered e-commerce
system. Each tier has a number of workers (processes) that are finite
and require access to finite resources from time to time. The web server
may handle a request by a user or the web server may pass on the request
to the application server, but this direction of request/response is never
reversed. The database will never make a request of the application server,
and the application server will never require information from the web
server. It is this directional nature of communication that we exploit.

  1. Always start with the data tier and work
up through the systems in the direction of the response.

In practice it gets a bit more complicated: within each server, there
are often several of these queues to analyze. By looking at the wait
times for finite resources like connection pools and thread pools, one can
determine where the performance bottleneck resides.
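Applying the rule above to those per-server queues might look like the following sketch; the tier names and wait counts are invented, and no real hybris or app-server API is being queried:

```python
# Hypothetical monitoring snapshot; tier names and wait counts are
# invented, and no real hybris or app-server API is being queried.
pools = [
    # Listed data tier first, per the rule: start at the data tier and
    # work up in the direction of the response.
    ("database", "connection pool", 0),        # threads waiting to acquire
    ("application", "worker thread pool", 28),
    ("web", "httpd worker pool", 35),
]

# Walking up from the data tier, the first pool with waiters (every
# deeper tier being clear) is the bottleneck candidate.
candidate = next((tier for tier, _res, waiting in pools if waiting > 0), None)
print("bottleneck candidate:", candidate)  # -> application
```

Here the empty database connection pool eliminates the data tier, so the waiters at the web tier are a symptom, not the cause.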

For the purpose of illustration, a typical hybris
solution is shown here, but this construct applies to any n-tiered system.

The queues are the key. By examining each in the correct order, the
problem domain is methodically reduced.

Layers of an Onion

I want to reinforce that this sequential process of reducing the problem
domain is often run through more than once. There are often multiple
stability or performance issues with any problematic e-commerce site.
Once the most significant problem is identified and fixed, the process
can be repeated until the solution is good enough. There is always a
performance bottleneck. If there weren’t, page load times would be 0
seconds. The success of this endeavor is achieved by getting to an
acceptable performance level. And that brings me to the final point:

  2. Establish what adequate performance (AKA success) looks
like before you start to troubleshoot.

What’s Next?

We’ve applied this approach successfully time after time to solve
e-commerce performance and stability issues. Once problematic queuing is
identified, the source of the queuing is usually fairly easy to
determine.

Take a queuing problem with a database connection pool as an example.
Which are the longest-running queries? Which ones are run most often?
Most importantly, which query has the highest response time multiplied
by execution count? The solution may be SQL optimization, applying an
index, archiving old order data, caching at the application tier, or
something else. The important takeaway is that by reducing the problem
domain, the effort is focused on a solvable subset.
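That ranking exercise can be sketched in a few lines; the SQL fragments and numbers below are invented for illustration:

```python
# Toy query statistics; the SQL fragments and numbers are invented.
# Total load = average response time x execution count, i.e. how much
# connection-pool time each statement consumes overall.
stats = [
    # (query, avg_seconds, executions)
    ("SELECT * FROM orders WHERE created < :cutoff", 0.90, 1_200),
    ("SELECT p.* FROM products p JOIN stock s ON ...", 0.05, 90_000),
    ("UPDATE cart SET modified = :now WHERE ...", 0.30, 2_000),
]

ranked = sorted(stats, key=lambda s: s[1] * s[2], reverse=True)
for query, avg, count in ranked:
    print(f"{avg * count:8.0f}s total  {query}")
```

In this toy data the cheap but frequent join (4,500 seconds of pool time) outranks the slowest individual query (1,080 seconds), which is why frequency matters as much as raw response time.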

Learn More