Jump forward in time to the day of your next big product launch (first release, new features, new market segment, etc). And your site/application crashes due to the “unexpected” demand. All you can do now is look for a bucket of water to put out the fire. What could you have done to prevent this disaster? Jump back to today and start doing it!
Depending on how you look at things, this is a backwards planning exercise, or a variation of the Remember the Future innovation game, or risk management, or proactive product management. You can avoid a disaster by imagining what might happen, then hypothetically figuring out why it (would have) happened. That leads to planning how you could prevent it. And now you’ve left the dream world of a Gedanken experiment and returned to the real world of product management.
The way to approach this is straightforward. Imagine some failure scenarios and gauge the importance of preventing each:
- Imagine a failure scenario.
- “Predict” the likelihood of the failure.
- “Estimate” the impact of the failure.
- Repeat for each scenario.
You can prioritize your failure scenarios by multiplying the likelihood of each by its impact, and sorting them from largest to smallest. Then determine which ones you’re willing to address, and which ones you’re willing to risk. You may not be able to predict the likelihood of some failures (at least until you do a root cause analysis). Place each of these directly above the scenario with the next highest impact. The rationale is that these are so bad that you really want to find out how likely they are to happen. Once you predict likelihood (see below) you can reprioritize.
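The ranking rule above could be sketched roughly as follows. This is only an illustration: the `Scenario` class, the 1 to 10 impact scale, and the example scenarios and numbers are all made up for the sketch. It also assumes one reading of "next highest impact": an unscored scenario goes directly above the scored scenario whose impact is the highest one at or below its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scenario:
    name: str
    impact: float                       # assumed scale: 1 (minor) to 10 (catastrophic)
    likelihood: Optional[float] = None  # probability 0..1; None = not yet predicted


def prioritize(scenarios: List[Scenario]) -> List[Scenario]:
    known = [s for s in scenarios if s.likelihood is not None]
    unknown = [s for s in scenarios if s.likelihood is None]

    # Scored scenarios: largest likelihood x impact first.
    ranked = sorted(known, key=lambda s: s.likelihood * s.impact, reverse=True)

    # Unscored scenarios, highest impact first: place each directly above the
    # scored scenario with the next highest impact (at or below its own).
    for u in sorted(unknown, key=lambda s: s.impact, reverse=True):
        lower = [s for s in known if s.impact <= u.impact]
        if lower:
            target = max(lower, key=lambda s: s.impact)
            ranked.insert(ranked.index(target), u)
        else:
            ranked.append(u)  # no scored scenario has an impact this low
    return ranked


# Hypothetical scenarios, for illustration only.
scenarios = [
    Scenario("Confirmation emails delayed", impact=4, likelihood=0.5),
    Scenario("Site crashes at launch", impact=10, likelihood=0.5),
    Scenario("Signup data lost", impact=8),  # likelihood not yet predicted
    Scenario("Typo on landing page", impact=1, likelihood=0.9),
]

for s in prioritize(scenarios):
    print(s.name)
```

Once the root cause analysis gives you a likelihood for an unscored scenario, fill it in and rerun the ranking.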
Root Cause Analysis
For the failure scenarios you choose to address, the next step is to do a root cause analysis that identifies why it might have happened. The best tool for capturing this analysis is an Ishikawa diagram. Consider that one problem you might face is your website crashing.
Essentially, you can crash your site by having too many users, too many concurrent users, or too many concurrent sign-ups. Developing a cause-and-effect diagram (another name for an Ishikawa diagram) is usually an iterative and exploratory process. You probably won’t create the simple version above first. You may ask your implementation team “What can cause the website to crash?” For each of their answers, you identify when that situation can happen. Or you start top down. Most likely, a mix of the two. Your completed root cause analysis may look like the following:
At this point, your team can probably predict many (maybe all) of the root causes of a website crash. The predictions may be conditional – “we can handle 10 concurrent users, but 20 would probably kill us, and 100 definitely would.” Developers are notoriously good at answering questions with conditional statements that reveal the nuances of their thinking.
Remember that you’re looking back from the future. At product launch, what are you hoping for / reasonably expecting? For this example, assume it is 10,000 total users, with 100 concurrent users (normally) and 500 concurrent signups. You determine these numbers by working with your PR, marketing, or mar-com people (or wearing those hats, when it is all you). Your plan is to do a big launch with a demo and a promo code for signup. You know your audience will have internet connections, and will have Twitter running at the time of your presentation. You expect/dream of an immediate burst of signups, followed by tweets and word of mouth, and eventually blog articles causing additional growth over the next couple of weeks.
Feed this data back into the developers’ conditional responses. If you’re like me, you will have found “absolute certainty of failure” from something. And you may have even identified the thresholds for each element. For example, database loading can handle 75 concurrent users, but with the current implementation, you only have enough database connections available to support 25 concurrent users.
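To make the comparison concrete, here is a minimal sketch that checks the example's thresholds against the 100 concurrent users expected at launch. The function name `find_shortfalls` and the dictionary layout are assumptions made for illustration. The numbers come from the example above.

```python
from typing import Dict, List, Tuple


def find_shortfalls(expected: int, limits: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (component, limit, gap) for every limit below the expected load,
    tightest limit first."""
    gaps = [(name, limit, expected - limit)
            for name, limit in limits.items() if limit < expected]
    return sorted(gaps, key=lambda g: g[1])


expected_concurrent_users = 100      # launch expectation from the example
limits = {
    "database loading": 75,          # concurrent users it can handle
    "database connections": 25,      # concurrent users the connections support
}

for name, limit, gap in find_shortfalls(expected_concurrent_users, limits):
    print(f"{name}: supports {limit}, short by {gap}")
```

The tightest limit is the one that matters first: fixing database loading alone would still leave you capped at 25 concurrent users by the available connections.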
Jumping back to the present, you now have some very discrete and very important things to do before your launch. If you need to, revisit the prioritized list of failure scenarios. By looking at the next level of detail, have you found that the order of importance (to fix) has changed? What about the “must fix” versus “willing to risk” line? Has it moved?
Fold the “must fix” items into your backlog, and prioritize them relative to the other capabilities on your roadmap. As a side note – build in some testing to confirm that you actually prevent the problems. This might even be a great opportunity to implement “performance regression tests” – it is not enough to prevent bugs; you also have to prevent slowdowns.
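One possible shape for a performance regression test is sketched below: time an operation several times and fail if the median exceeds a budget. Everything here is a placeholder. `handle_signup` stands in for whatever operation you care about, and the one-second budget is an arbitrary number, not a recommendation.

```python
import statistics
import time
from typing import Callable, Tuple


def check_budget(func: Callable[[], object], budget_seconds: float,
                 runs: int = 5) -> Tuple[bool, float]:
    """Run func several times; return (within_budget, median_seconds)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    measured = statistics.median(samples)  # median damps one-off spikes
    return measured <= budget_seconds, measured


def handle_signup() -> int:
    # Hypothetical stand-in for the real signup code path.
    return sum(range(10_000))


ok, measured = check_budget(handle_signup, budget_seconds=1.0)
print(f"within budget: {ok} (median {measured:.6f}s)")
```

Timing checks like this are noisy, so a real suite would likely run on consistent hardware and leave some headroom in the budget. A fuller version would also exercise concurrent load, since the failure scenarios above are about concurrent users.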
Rethinking the Problem
Without going into details on how the team will solve each problem, make sure that together you keep the Ishikawa diagrams in mind, and see how any proposed solutions might “reappear” on the diagram. For example, rewriting your database connections to use asynchronous processes and a set of pooled connections may prevent a crash, but it may really hurt performance. You may not have time to find an elegant solution. So stop and rethink the problem.
At this point, you’ve said:
- Given a marketing plan / launch strategy, we would crash the website.
- We can make changes between now and the launch that will double the number of concurrent users we can support (or whatever), but that is not enough to support the launch strategy.
- Solution: Change the launch strategy.
Maybe you can’t support a wide-open, promo-code-based signup. You should modify your launch so that it can only create as much demand as your product (including pending improvements) can support. Maybe you limit it to the first 1,000 new users (probably more code to write to enforce the limit). Maybe you launch with per-user invitations, where you can control the speed of propagation of invites (start with 100; when those have been sent, make another 100 available; etc.).
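As a rough sketch of the invitation idea, the class below releases invites in batches of 100 and only opens the next batch once the current one has been fully sent. The `InviteGate` name and its methods are invented for this illustration. The optional overall cap is an added assumption that combines the batching idea with the "first 1,000 users" limit.

```python
from typing import Optional


class InviteGate:
    """Releases invitations in batches so the team controls how fast demand grows."""

    def __init__(self, batch_size: int = 100, max_invites: Optional[int] = None):
        self.batch_size = batch_size
        self.max_invites = max_invites  # optional overall cap (assumption)
        self.released = batch_size      # invites made available so far
        self.sent = 0                   # invites actually sent

    def send_invite(self) -> bool:
        """Send one invite if any released invites remain."""
        if self.sent >= self.released:
            return False
        self.sent += 1
        return True

    def release_next_batch(self) -> bool:
        """Open another batch, but only after the current one is used up."""
        if self.sent < self.released:
            return False
        if self.max_invites is not None and self.released >= self.max_invites:
            return False
        self.released += self.batch_size
        if self.max_invites is not None:
            self.released = min(self.released, self.max_invites)
        return True


gate = InviteGate(batch_size=100, max_invites=1000)
sent = sum(gate.send_invite() for _ in range(150))
print(sent)                       # only the first batch goes out
print(gate.release_next_batch())  # the team decides when to open the next batch
```

Keeping `release_next_batch` as an explicit call, rather than automatic, means someone can check that the site is coping before letting in the next wave.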
Entire Team Problem
This is a problem that is solved collaboratively, by the entire team. It is not just a “go write the code” problem. What your product can support at a launch should drive how you choose to launch, just as how you choose to launch should drive what you want your product to support.
You may have to delay a key capability in order to scale. Does your marketing team know this? Slightly less bad than crashing would be announcing a feature that is disabled. Still need to announce the feature? Pre-announce it: “Coming in a month…”
This stuff is important for every company and product, but it is especially critical for start-ups. As a start-up, you have limited opportunities to grow, and a limited safety-net to catch you when you fail to capitalize on those opportunities. So make sure everyone (not just the development team) is aligned to make the best use of each opportunity.
You have an opportunity to prevent problems. All you have to do is imagine that they have happened in the future, figure out why they would have happened, then do what it takes (in software, or organizationally) to prevent them.