Overcoming survivorship bias in data analytics
Young students might be forgiven for believing that dropping out of college to pursue big ideas is a key to success. After all, Steve Jobs, Bill Gates, and Mark Zuckerberg all did so. But this perception hides the many more people who left college and failed in their business endeavors. By focusing on famous tech moguls and ignoring unsuccessful dropouts, students exhibit what is known as “survivorship bias.”
This tendency to draw conclusions based on the things that have survived and ignore those that didn’t creeps into different aspects of everyday life, including data science and artificial intelligence (AI). And biased interpretations of data have major consequences. They can lead to inaccurate and overly optimistic predictions and conclusions, prompting companies to engage in futile actions. Knowing how to spot and avoid survivorship bias is thus vital for anyone looking to take advantage of data and smart algorithms.
Two ways survivorship bias shows up
Survivorship bias affects decision-making in two main ways: inferring a norm and inferring causality. Inferring a norm is the tendency to believe that the things that survived a process, and whose existence can be proved or directly observed today, are the only ones that ever existed. For instance, the medieval monuments we see today are primarily made of stone. This might prompt tourists to believe that most buildings in the past were made of stone, despite wood being an equally important construction material.
Inferring causality is the tendency to believe that anything that survived a process was shaped by it. People might assume, for instance, that working at Microsoft makes someone a great startup founder, given that 50% of the startups that became unicorns (valued at $1B or more) have founders from Microsoft. These are made-up numbers. And while the assumption could be true, the claim cannot be made without also counting the non-unicorn startups with founders from Microsoft.
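The Microsoft example can be made concrete with a little arithmetic. The sketch below uses invented counts (every number and name in it is a hypothetical chosen for illustration) to show how a group can account for half of all unicorns and still have a lower success rate than everyone else once its failed startups are counted.

```python
# All counts are hypothetical, invented purely for illustration.
# founder background -> (startups founded, startups that became unicorns)
startups = {
    "microsoft": (2000, 50),
    "other": (1000, 50),
}


def share_of_unicorns(group, data):
    """Fraction of all unicorns that come from `group` (looks at survivors only)."""
    total_unicorns = sum(unicorns for _, unicorns in data.values())
    return data[group][1] / total_unicorns


def success_rate(group, data):
    """Fraction of the group's startups that became unicorns (counts failures too)."""
    founded, unicorns = data[group]
    return unicorns / founded


for group in startups:
    print(
        group,
        f"share of unicorns: {share_of_unicorns(group, startups):.1%},",
        f"success rate: {success_rate(group, startups):.1%}",
    )
```

With these made-up counts, both groups produce half of the unicorns, yet the Microsoft group needed twice as many startups to get there. Only the second function, which includes the non-survivors, says anything about whether the background helps.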
Survivorship bias in the business world
Biased interpretations of events and data come in many shapes and forms. Take, for example, a company that launched a data analytics tool. After one month, it discovered that marketers are the only user group that opts for paid plans and creates sophisticated analyses. Managers might conclude that the tool resonates with marketing professionals and empowers their work. This reasoning, however, is flawed as the company is only considering active users.
To avoid the survivorship bias trap, managers also need to look at the people who gave up on the tool. A large number of churned marketers, for instance, would undermine the hypothesis that the software is particularly suited to this user group. Also, the remaining users might be skilled at creating analyses despite a flawed design, not because of the tool. Knowing customers’ skill levels before onboarding may allow managers to better judge the impact of their product.
Another likely scenario of survivorship bias in action is a startup building a machine learning model for churn prediction. Its engineers have looked only at active customers to identify factors useful for the forecast and are now developing the new tool. What they overlooked, however, is that the customers who already churned are missing from the data, so the model never sees an example of the very outcome it is supposed to predict. By considering only surviving customers, the software will produce biased insights that are unlikely to be effective at churn prevention.
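A minimal sketch of that check, assuming the company keeps a record of every sign-up and not just of current users. The records, group names, and counts below are hypothetical. The first function reproduces the biased view (active users only); the second divides by everyone who ever signed up.

```python
from collections import Counter

# Hypothetical sign-up records: (user_group, still_active). Invented for illustration.
signups = (
    [("marketer", True)] * 30
    + [("marketer", False)] * 170
    + [("analyst", True)] * 10
    + [("analyst", False)] * 10
)


def active_counts(records):
    """Biased view: count only the users who are still active."""
    return Counter(group for group, active in records if active)


def retention_by_group(records):
    """Fuller view: active users divided by everyone who ever signed up."""
    total = Counter(group for group, _ in records)
    active = active_counts(records)
    return {group: active[group] / total[group] for group in total}


print(active_counts(signups))
print(retention_by_group(signups))
```

In this invented data set marketers dominate the active user base, but most marketers who tried the tool left, which is the pattern the active-only view hides.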
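One cheap safeguard is to check, before any training happens, that the data set contains both outcomes. The sketch below is an assumption about how such a check might look, not a description of any particular product; the field names (`churned`, `monthly_logins`) and the records are hypothetical.

```python
from collections import Counter


def missing_outcomes(customers, label="churned"):
    """Return the outcome values (True or False) that never appear in the data."""
    seen = Counter(bool(customer[label]) for customer in customers)
    return [outcome for outcome in (True, False) if seen[outcome] == 0]


# Hypothetical records; a snapshot of today's customer table contains survivors only.
active_only = [
    {"id": 1, "monthly_logins": 12, "churned": False},
    {"id": 2, "monthly_logins": 3, "churned": False},
]

# The same customers plus one who left, e.g. recovered from historical records.
full_history = active_only + [
    {"id": 3, "monthly_logins": 1, "churned": True},
]

print(missing_outcomes(active_only))
print(missing_outcomes(full_history))
```

If the first call reports that churned customers are missing, the fix is to go back to historical records, not to tune the model.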
Biased actions can have even more serious consequences. Say, for instance, that a software development company is building a new fraud prevention algorithm for a bank. Engineers could decide to build a machine learning model to protect against the fraudsters that breached the bank’s previous solution. But accounting only for ‘survivors’ would mean that all the other attackers the old system blocked are now ignored and might find it easier to trick the new algorithm.
How to avoid survivorship bias?
Survivorship bias affects individuals and companies in various ways. Typically, it leads to overly optimistic conclusions, because the analysis rests only on the resilient data points that survived while the observations that dropped out along the way are ignored. Predictions shaped by survivorship bias aren’t representative of real-life environments.
Preventing survivorship bias, therefore, requires constant vigilance. It’s important to be selective with data sources and to take into consideration factors or observations that no longer exist. And before every analysis, think about the data that’s not present but should be. That will help keep your own reasoning, as well as your machine learning models, as objective as possible.
Working on AI projects the right way
AI and machine learning technologies enable companies to retain a competitive edge by providing forecasting, automation, and optimization capabilities. But getting the most out of your data requires accounting for survivorship bias. Being aware of this issue is the first step to avoiding it. Then, actively looking for signs of the bias will help ensure that your data sets reflect reality and aren’t unrealistically optimistic. And with these foundations in place, you’ll be ready to get the most out of AI-driven efforts.
This article was originally published in 2020 on Kurvv.ai’s blog.