Drawing accurate conclusions from real-world experiments

July 27, 2016

“Test, measure, and scale” has become the mantra for staying competitive in today’s business world. And while most enterprises don’t have a shortage of ideas, all have limited resources to test and validate new concepts. The manager’s dilemma has become deciding which experiments to run and which results to trust. Failing to properly test a good idea is a missed opportunity, while scaling up an idea based on bad data can be catastrophic. The best companies in today’s economy are continually testing new ideas, products, and processes, but most lack an efficient, reliable platform for managing these tests across the enterprise.

In the digital world, enterprises have become skilled at A/B testing of web pages, marketing campaigns, and promotions. There are great tools available to randomize digital tests and learn quickly and cost effectively. The leading digital companies, like Amazon.com, have become masters of an agile, test-and-measure culture because a pure digital environment is well suited to low-cost, large-scale randomized testing. However, most businesses are not purely digital and need to apply that same test-and-measure philosophy to more complex systems, including retail stores and branches, human resources, and distribution channels.

With complex, multi-channel systems, cost and complexity must be carefully considered when deciding what to test and how. The gold standard for testing is a randomized experiment, in which customers, stores, or branches are randomly assigned to different treatments so results can be compared. The challenge is ensuring the treated group and the untreated control group are statistically indistinguishable. For many experiments, it’s impossible or too costly to create truly random treatment and control groups, and if the groups differ materially before the experiment, their responses will likely differ as well.
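One common diagnostic for whether two groups are comparable is the standardized mean difference of each attribute. The sketch below is illustrative rather than drawn from any project described here; the attribute values are hypothetical, and a common rule of thumb reads values below roughly 0.1 as balanced.

```python
from statistics import mean, stdev

def standardized_mean_diff(treated, control):
    """Absolute difference in group means, scaled by the pooled
    standard deviation, for one numeric attribute."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return abs(mean(treated) - mean(control)) / pooled_sd

# Hypothetical pre-period new-customer acquisition rates per branch
treated_branches = [0.12, 0.15, 0.11, 0.14, 0.13]
control_branches = [0.12, 0.14, 0.12, 0.15, 0.13]
print(standardized_mean_diff(treated_branches, control_branches))
```

Computing this for every attribute, before the intervention begins, is a quick way to see whether a comparison is even worth running.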

So how can we run proper experiments in complex enterprises and be confident in the results? One paradigm is known as matched control: the treated group is matched to a subset of untreated subjects based on observable attributes, reducing or eliminating pre-intervention differences. The assumption is that, absent any intervention, the treated and untreated groups would continue to look similar on these attributes. This “matched control group” is the next best option for obtaining usable results where randomization is impractical. Various methods can be used to find matched control groups, including propensity score matching, and each has its own tradeoffs in data requirements and precision.
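To make the idea concrete, here is a minimal sketch of the matching step in propensity score matching. It assumes each subject’s propensity score (the estimated probability of receiving treatment given its attributes) has already been fit, for example with a logistic regression; the scores and the caliper below are hypothetical.

```python
def greedy_nearest_match(treated_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity scores,
    without replacement. Returns (treated_idx, control_idx) pairs."""
    available = dict(enumerate(control_scores))
    pairs = []
    for t_idx, t_score in enumerate(treated_scores):
        if not available:
            break
        # Closest still-unmatched control subject
        c_idx = min(available, key=lambda i: abs(available[i] - t_score))
        # Only accept the pair if it is within the caliper
        if abs(available[c_idx] - t_score) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]
    return pairs

# Hypothetical scores: P(treatment | attributes) for each subject
treated = [0.62, 0.48, 0.71]
control = [0.30, 0.50, 0.60, 0.70, 0.90]
print(greedy_nearest_match(treated, control))  # → [(0, 2), (1, 1), (2, 3)]
```

Production implementations typically add refinements (optimal rather than greedy matching, matching with replacement, multiple controls per treated subject), but the core idea is the same: compare each treated subject only against untreated subjects that looked equally likely to be treated.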

The concept is powerful, but how can businesses put matched control into practice? Consider a recent campaign to acquire new customers at a retail bank. The goal was to test the program at select branches and then roll it out more broadly if the results were positive. The practical issue is that branches often cannot be chosen randomly: they might be selected for geographic or performance reasons, require buy-in from the local manager, be hand-picked by an executive, or be non-random for any number of other reasons. To construct a match, we might pair branches on attributes such as prior new-customer acquisition rates. If the campaign targeted a particular demographic, each branch’s known demographic footprint could be an additional matching attribute.
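In code, pairing branches on observable attributes can be as simple as a nearest-neighbor search. The sketch below matches each test branch to its most similar untreated branch on two hypothetical attributes (prior acquisition rate and share of the target demographic); it matches with replacement, so a real implementation would likely remove each control branch once used.

```python
# Hypothetical branch attributes: (prior acquisition rate, target-demo share)
branches = {
    "A": (0.12, 0.40), "B": (0.08, 0.55), "C": (0.11, 0.42),
    "D": (0.09, 0.52), "E": (0.15, 0.30),
}
test_branches = ["A", "B"]  # chosen non-randomly for the campaign
pool = [b for b in branches if b not in test_branches]

def distance(x, y):
    """Euclidean distance between two attribute vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# For each test branch, pick the most similar untreated branch
matched = {t: min(pool, key=lambda c: distance(branches[t], branches[c]))
           for t in test_branches}
print(matched)  # → {'A': 'C', 'B': 'D'}
```

With raw attribute vectors like these, the attributes should be put on a common scale before computing distances; otherwise the attribute with the largest numeric range dominates the match.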

Recently, a bank we work with wanted to understand the impact of a branded credit card compared with non-branded cards. Does the branding affect sales and card usage? There is natural self-selection with a branded card, which might draw people from a certain demographic or income bracket. Using a matched control group to find similar-looking customers across the branded and non-branded cards gave us a much more accurate estimate of how the branded experience affects sales.

In another recent project, an insurance company was trying to measure the effect of going to a “preferred” provider compared with an average provider. In the data used to determine a “preferred” provider, people weren’t randomly selected to go to one or the other—certain people might have chosen certain providers. It’s possible that wealthy, healthy people go to certain providers and were going to get better faster anyway. It’s also possible that the insurance company steered the most troubling cases to their defined list of “preferred” providers. When comparing the raw averages, the preferred provider appears to be much more cost effective. However, with propensity score matching, the selection bias was largely eliminated and we were able to get a much better quantitative measure of the real “preferred provider effect”—which in this case was much lower than the raw data suggested.
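The pattern from that engagement can be sketched as comparing a raw difference in group averages with a difference taken only over matched pairs. All numbers below are hypothetical and chosen purely to illustrate the mechanics, not results from the actual study.

```python
# Hypothetical per-claim costs (all numbers illustrative only)
preferred = {"p1": 800, "p2": 900, "p3": 2000}
standard = {"s1": 1500, "s2": 1600, "s3": 2100, "s4": 2200}

# Naive comparison: difference in raw group averages, which mixes in
# whatever selection bias drove patients to one provider type
raw_gap = (sum(standard.values()) / len(standard)
           - sum(preferred.values()) / len(preferred))

# Matched comparison: average only over pairs of comparable patients
# (pairs here are assumed to come from a propensity score match)
matched_pairs = [("p1", "s1"), ("p3", "s3")]
matched_gap = sum(standard[s] - preferred[p]
                  for p, s in matched_pairs) / len(matched_pairs)

print(raw_gap, matched_gap)  # the matched estimate is smaller
```

In this toy example the matched estimate of the savings is well below the raw gap, mirroring the direction of the finding described above: once like is compared with like, the apparent “preferred provider effect” shrinks.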

To embrace an agile, test-and-measure philosophy, enterprises need a comprehensive solution for experimentation that includes a way to manage quasi-experiments where true randomized experiments are impractical. At a minimum, it must:

  • Be easy to use. Business users should be able to set up, run, and measure experiments without a statistics background.
  • Find the best control group. The platform should be capable of identifying the most appropriate matched control group.
  • Run quickly. Results must be produced in minutes, not days.
  • Archive results. The platform should store and archive each treated and matched control group, as well as evaluation metrics, to allow future meta-studies.

These are the basic requirements, but to get the most benefit from a rapid test-and-measure culture, the platform must also:

  • Be transparent to the user. The platform must tell managers how good the matched control is. Sometimes it’s simply not possible to find a great control, and managers need to know how much weight to give the results. Knowing that an experiment is invalid, and why, can be just as important as understanding a good test.
  • Be comprehensive. The solution should have knowledge of all experimentation across the enterprise so that it can incorporate the effect of other experiments when choosing a control and analyzing results. If multiple experiments are being conducted simultaneously, they may impact one another, which will certainly affect the outcome. Results from previous experiments may also be helpful in creating new experiments and setting baseline expectations for certain classes of interventions.
  • Easily integrate outside data sources. For example, weather could have a big impact on an experiment. If there was severe weather in a region during a test, it could impact results, so the ability to plug in outside data is an important requirement.

For more than 20 years, Fulcrum has been at the forefront of advanced data analytics, implementing solutions that have unlocked multi-million dollar opportunities for the world’s top companies. As businesses continue to grow in complexity, operating across physical and digital channels, their data assets also become more complex. The companies that leverage that data effectively will have a strategic advantage. Our recent work in matched control groups is a great example of bringing together expertise in data science, business process, and big data technology to drive real profits for our partners.

For more information, contact us today.