Data Science in Practice: Leveraging Matched Control Groups to Accurately Measure Business Experiments

Published by Sean Christy, Group Director, Fulcrum Analytics on September 1st, 2016

It’s challenging these days for managers to measure the financial effects of the changes they make to their enterprise. Interventions such as marketing campaigns, human and physical capital expenditures, changes to business processes, new product introductions, and outlet openings or closures all require rigorous assessment of their effects on the bottom line as justification for investment. Such assessment lets managers learn from success or failure and drives future business decisions. A manager’s new idea can be viewed as a hypothesis to test, and the intervention is the experiment that will validate or refute the hypothesis.

The gold standard when it comes to experimental testing is a randomized block design. Subjects (customers, employees, stores, branches, markets) for the experiment are randomly selected, and those receiving differentiated treatment are themselves randomly selected. Often subjects are sub-grouped or blocked by common features in order to measure the effects of treatment for those specific attributes. Grouping by features of interest ensures that sufficient observations are available to detect statistically significant effects. Within this framework, the treated subjects and non-treated controls are statistically indistinguishable on any observable business metric as well as on unobserved features. Because treatment is randomized, any influence other than the treatment affects test and control alike, so a difference in outcomes between the two groups can be attributed to the treatment.
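As an illustration of the blocking idea, here is a minimal Python sketch of random assignment within blocks. The function name, the `region` field, the 50% treatment fraction, and the store data are all hypothetical choices made for this example, not part of any particular tool.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_key, treat_fraction=0.5, seed=0):
    """Randomly assign subjects to 'treated' or 'control' within each block.

    subjects: list of dicts, each with an 'id' and the blocking field.
    Returns a dict mapping subject id -> 'treated' or 'control'.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible

    # Group subject ids by the value of the blocking feature.
    blocks = defaultdict(list)
    for s in subjects:
        blocks[s[block_key]].append(s["id"])

    assignment = {}
    for ids in blocks.values():
        rng.shuffle(ids)  # random order within the block
        n_treat = round(len(ids) * treat_fraction)
        for i, sid in enumerate(ids):
            assignment[sid] = "treated" if i < n_treat else "control"
    return assignment

# Hypothetical example: ten stores blocked by region (six north, four south).
stores = [{"id": i, "region": "north" if i < 6 else "south"} for i in range(10)]
assignment = blocked_assignment(stores, "region")
```

Because the shuffle happens separately inside each block, every region contributes both treated and control stores, which is what makes block-level effects measurable.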

Randomized block experiments are commonplace in direct marketing, where managers have access to many observed customer features and full control over who is marketed to, when, and how. They can therefore observe every response and non-response to that marketing. The effectiveness of the marketing is then measured as the difference in response between the treated and control groups.
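One common way to quantify such a difference in response is a two-proportion z-test. The sketch below is an assumption about how the comparison might be done, not a method prescribed here, and the campaign counts are invented for illustration.

```python
import math

def response_lift(treated_responders, treated_n, control_responders, control_n):
    """Difference in response rates with a two-sided two-proportion z-test.

    Returns (lift, z, p_value). Assumes both groups are non-empty and that
    the pooled response rate is strictly between 0 and 1.
    """
    p_treated = treated_responders / treated_n
    p_control = control_responders / control_n
    lift = p_treated - p_control

    # Pooled rate under the null hypothesis of no treatment effect.
    pooled = (treated_responders + control_responders) / (treated_n + control_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / treated_n + 1 / control_n))
    z = lift / se

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, z, p_value

# Hypothetical campaign: 600 of 10,000 treated customers responded,
# versus 500 of 10,000 in the randomized control group.
lift, z, p_value = response_lift(600, 10_000, 500, 10_000)
```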

Randomizing the subjects for the experiment and treatment, however, is not always cost effective or even possible. Capital improvements, for example, are usually too costly to justify a purely random experiment. Experiments in human resource management could be illegal at worst or unethical at best. Managers may have limited or no control over who is subject to treatment, such as with online or mass media or any situation where a customer opts in to a program, such as a loyalty card.

Simply measuring the post-intervention differences between treated and non-treated subjects will likely lead to false conclusions because the selection for treatment was biased. For instance, only underperforming stores may be selected for new training, a new product launch may be limited to stores with available shelf space, or only qualifying customers may be approved for a store-branded credit card. The treated and untreated subjects differ materially prior to the intervention, and it is likely that their response to the intervention would differ as well.

Measuring the difference between pre- and post-intervention results within only the treated subjects has its own difficulty: it requires controlling for seasonality and trends, which can vary from subject to subject and add modeling requirements.

One solution that allows for accurately measuring treatment effects without a randomized test is to match treated subjects to untreated subjects on similar pre-intervention features. Matching treated subjects based on observable attributes reduces or eliminates the pre-intervention differences between treated and untreated. A subset of untreated subjects then acts as the comparison control group. The assumption is that without any intervention, both the treated and the matched untreated would continue to look similar by these features post-intervention. Any observed differences in actual outcome can then be attributed to the effect from treatment. In many cases, the manager is then able to measure the effects from intervention with nearly the same precision as a randomized experiment.
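To make the idea concrete, here is a minimal Python sketch using greedy one-to-one nearest-neighbor matching on standardized pre-intervention features. This is only one of several possible matching approaches. The store records and the field names `pre_sales` and `post_sales` are hypothetical.

```python
def match_controls(treated, untreated, features):
    """Greedy 1:1 nearest-neighbor matching without replacement."""
    # Scale each feature by its standard deviation so no feature dominates.
    everyone = treated + untreated
    scale = {}
    for f in features:
        vals = [s[f] for s in everyone]
        mean = sum(vals) / len(vals)
        scale[f] = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0

    def distance(a, b):
        return sum(((a[f] - b[f]) / scale[f]) ** 2 for f in features) ** 0.5

    available = list(untreated)
    pairs = []
    for t in treated:
        if not available:
            break  # no untreated subjects left to match against
        best = min(available, key=lambda u: distance(t, u))
        available.remove(best)  # each control is used at most once
        pairs.append((t, best))
    return pairs

def estimated_effect(pairs, outcome):
    """Mean post-intervention difference between treated and matched control."""
    diffs = [t[outcome] - c[outcome] for t, c in pairs]
    return sum(diffs) / len(diffs)

# Hypothetical stores with weekly sales before and after an intervention.
treated = [
    {"id": "T1", "pre_sales": 100, "post_sales": 115},
    {"id": "T2", "pre_sales": 200, "post_sales": 222},
]
untreated = [
    {"id": "U1", "pre_sales": 98, "post_sales": 101},
    {"id": "U2", "pre_sales": 150, "post_sales": 153},
    {"id": "U3", "pre_sales": 205, "post_sales": 210},
    {"id": "U4", "pre_sales": 300, "post_sales": 310},
]
pairs = match_controls(treated, untreated, ["pre_sales"])
effect = estimated_effect(pairs, "post_sales")
```

Greedy matching depends on the order in which treated subjects are processed, so a production analysis might prefer an optimal matching routine. The sketch is only meant to show the mechanics.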

Various methods are available to match treated to untreated subjects and balance the pre-intervention features. Propensity score matching, exact matching, coarsened exact matching, nearest neighbor matching, and optimal matching are among the commonly used approaches. Each has its own benefits and tradeoffs in terms of data requirements and precision. The choice of which features to use as the basis for matching must also be made. Propensity score matching is the most commonly used and simplest to implement. However, since it relies on a generalized linear model, it’s sensitive to the usual model assumptions, and in practice that model is rarely put through a thorough model building process. It is also possible that pre-intervention features remain statistically unbalanced after matching. Combining propensity scores with additional constraints can improve results but requires additional technical sophistication on the part of the analyst.
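Whichever method is used, balance on the pre-intervention features can be checked directly. A widely used diagnostic is the standardized mean difference (SMD). The sketch below assumes the common rule of thumb that an absolute SMD below about 0.1 indicates acceptable balance. That threshold is a convention, not a requirement stated here, and the sales figures are invented.

```python
from statistics import mean, variance

def standardized_mean_difference(treated_values, control_values):
    """Difference in group means divided by the pooled standard deviation.

    Each group needs at least two values for the sample variance.
    """
    pooled_sd = ((variance(treated_values) + variance(control_values)) / 2) ** 0.5
    diff = mean(treated_values) - mean(control_values)
    if pooled_sd == 0:
        # No spread in either group: balanced only if the means coincide.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_sd

# Hypothetical pre-intervention weekly sales.
treated_pre = [100, 120, 140]
all_untreated_pre = [60, 80, 100, 120, 140, 160]
matched_pre = [100, 118, 142]  # the untreated subset chosen by matching

smd_before = standardized_mean_difference(treated_pre, all_untreated_pre)
smd_after = standardized_mean_difference(treated_pre, matched_pre)
```

In this made-up data, the imbalance against the full untreated pool largely disappears once only the matched subset is used as the comparison group.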

The features selected for matching should be ones expected to correlate with the outcome metrics used to assess the effectiveness of the intervention, and should also include features related to how subjects were selected for treatment. For example, if a marketing campaign to acquire new customers is launched, then prior new customer acquisition rates would be a reasonable baseline on which to match stores. If that same campaign was directed toward a certain demographic, then each store’s known demographic footprint could be an additional attribute to match on. Ultimately, validation of the manager’s hypothesis rests on finding a valid control group for the experiment on a complex set of criteria.

In many ways, each set of matched control groups found through one of the matching algorithms or matched using different criteria is analogous to a sample in the statistical sense. It is a single point estimate of the true effect from the intervention. Likewise, with repeated sampling and repeated matching on different criteria or with different algorithms, additional point estimates are generated. A manager now has at their disposal a range of values that point to the true result of the intervention.

The intervention itself is a single event, and repeated interventions with like criteria provide additional data points to the measured effects. Structuring interventions and measurement of effects to allow for meta-analysis leads to a more robust estimate of the true intervention effects.
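One simple way to combine results from repeated interventions is fixed-effect, inverse-variance pooling. This is a standard meta-analysis technique chosen here for illustration. The sketch assumes each intervention yields an independent effect estimate with a standard error, and the numbers are invented.

```python
import math

def pooled_effect(estimates):
    """Fixed-effect inverse-variance pooling.

    estimates: list of (effect, standard_error) pairs, one per intervention.
    Returns (pooled_effect, pooled_standard_error, approx_95_percent_interval).
    """
    # More precise estimates (smaller standard error) get larger weights.
    weights = [1 / se ** 2 for _, se in estimates]
    total = sum(weights)
    effect = sum(w * e for w, (e, _) in zip(weights, estimates)) / total
    se = math.sqrt(1 / total)
    interval = (effect - 1.96 * se, effect + 1.96 * se)
    return effect, se, interval

# Hypothetical sales lifts (and standard errors) from three similar campaigns.
campaigns = [(10.0, 2.0), (14.0, 2.0), (12.0, 1.0)]
effect, se, interval = pooled_effect(campaigns)
```

If the interventions differ materially from one another, a random-effects model would be the more cautious choice, since the fixed-effect version assumes a single common underlying effect.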

Another often-overlooked aspect is that no manager’s intervention is done in isolation. Other interventions are likely to have occurred beforehand, affecting the features on which treated and untreated subjects are matched, or concurrently, biasing the measurement of the intervention. Capturing and cataloging all interventions enterprise-wide on customers, employees, or outlets is a continuing challenge, but it is key to the proper assessment of any intervention and experiment.

Outside the enterprise are other factors that could influence the effects of an intervention. Competitor behavior, weather, and macro-economic changes can have an impact on the experiment directly as well as how it is measured.

Any method of matching test subjects to control will remain, however, a second-best solution compared to a randomized test design. Although balance between test and control may be achieved for observed features, there remains a possibility that unobserved features, such as attitudes toward a brand, are not balanced and could impact the treatment effects. Additionally, balancing features may not be possible if, for example, all subjects with a given feature are treated and none remain to match against. Knowing these limits and being able to measure the effectiveness of the best possible non-randomized test is itself of value.

Not every intervention directed by a manager needs to be thought of as an experiment to be rigorously measured with a randomized design or matched controls. Those that are big bets, or where future decisions depend on the intervention’s outcome, do require a sophisticated analytical methodology. Enterprises committed to data-driven business decision-making are the ones investing in the resources necessary to capture interventions and accurately measure their effects.
