The potential for bias in the data analysis process is high, and it can enter anywhere from the way a hypothesis is formulated and a question is framed to the way data is selected and organized. Bias can be introduced at any stage, from the design of data collection to the execution of the analysis by an AI or machine learning system. Indeed, it is all but impossible to be completely unbiased; bias is an inherent element of human nature.
The human catalyst
Bias in data analysis can come from human sources: unrepresentative datasets, leading survey questions, and biased reporting and measurement. It often goes unnoticed until a data-driven decision is made, such as deploying a predictive model that turns out to be wrong. While data scientists can never fully eliminate bias from data analysis, they can take countermeasures to identify it and reduce its practical impact.
The social catalyst
Bias is also a moving target, as social definitions of fairness evolve. Reuters reported on a case in which the International Baccalaureate programme was forced to cancel its annual exams for high school students in May 2020 because of COVID-19. Instead of using exams to assess students, the IB programme used an algorithm to assign grades, and those grades were significantly lower than many students and their teachers expected.
Distortion of existing data
Amazon’s earlier recruitment tool showed a preference for men, who were overrepresented in the company’s existing workforce. The algorithm did not explicitly know or consider candidates’ gender, but it ended up being influenced by other attributes it did consider that were indirectly related to gender, such as sports, social activities and the adjectives used to describe accomplishments. In effect, the AI picked up on these subtle differences and tried to find candidates who matched what it had internally learned to consider successful.
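To see how a proxy feature can smuggle a sensitive attribute back into a model that never sees it, here is a minimal sketch using entirely synthetic data and an illustrative scoring rule (the feature names, rates and scoring are assumptions for demonstration, not a reflection of Amazon's actual system):

```python
import random

random.seed(0)

# Synthetic applicants: gender is never shown to the model, but a proxy
# feature ("competitive_sports") is assumed to correlate with it.
applicants = []
for _ in range(1000):
    gender = random.choice(["M", "F"])
    sports = random.random() < (0.7 if gender == "M" else 0.2)
    applicants.append({"gender": gender, "sports": sports})

def score(app):
    # A naive model trained on a historically male-dominated set of hires
    # learns to reward the proxy feature itself.
    return 1.0 if app["sports"] else 0.3

hired = [a for a in applicants if score(a) > 0.5]
by_gender = lambda g, pool: sum(a["gender"] == g for a in pool)
rate_m = by_gender("M", hired) / by_gender("M", applicants)
rate_f = by_gender("F", hired) / by_gender("F", applicants)
print(f"selection rate, men:   {rate_m:.2f}")
print(f"selection rate, women: {rate_f:.2f}")
```

Even though gender is absent from the inputs, the selection rates diverge sharply, because the proxy carries the correlation on its own.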
Another important source of bias in data analysis arises when certain populations are underrepresented in the data. This type of bias can have tragic implications in medicine, for example when training data fails to capture important differences in heart disease symptoms between men and women, says Carlos Melendez, COO and co-founder of Wovenware, a nearshore services provider based in Puerto Rico. It happens when the data used to train an algorithm does not reflect the many factors involved in the decisions it is meant to support.
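A first defence against this kind of underrepresentation is simply to audit group shares in the training data before fitting anything. The sketch below uses made-up record counts and an illustrative 30% threshold; both are assumptions, not a clinical standard:

```python
from collections import Counter

# Hypothetical cardiac-symptom training records; the field and counts
# are illustrative only.
records = [{"sex": "M"}] * 900 + [{"sex": "F"}] * 100

counts = Counter(r["sex"] for r in records)
total = sum(counts.values())
for group, n in sorted(counts.items()):
    share = n / total
    # 0.3 is an arbitrary illustrative cutoff for flagging a group.
    flag = "  <- underrepresented" if share < 0.3 else ""
    print(f"{group}: {n} records ({share:.0%}){flag}")
```

A model trained on these records would see nine male cases for every female one, so any symptom pattern specific to women is at risk of being averaged away.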
Cognitive bias leads to statistical bias, such as sampling or selection bias. Sampling bias arises both from the initial data collection and from the analyst’s choices about which data to include or exclude. Selection bias occurs when the sample collected is not representative of the future population of cases the model will actually see. In these cases, it helps to move from static datasets to event-based data sources that allow the data to be updated over time, so it more accurately reflects the world we live in. This may mean moving to dynamic dashboards and machine learning models that can be monitored and measured over time.
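The effect of selection bias on even a simple estimate can be sketched with synthetic data: if only cases above some threshold ever get recorded, every statistic computed from that sample is skewed relative to the population the model will actually face. The distribution parameters and cutoff below are assumptions chosen for illustration:

```python
import random
import statistics

random.seed(1)

# The population the deployed model will actually see.
population = [random.gauss(50, 15) for _ in range(10_000)]

# Selection-biased sample: only cases above a cutoff were ever recorded.
biased_sample = [x for x in population if x > 55][:500]

# A properly randomized sample of the same size, for comparison.
random_sample = random.sample(population, 500)

print(f"true mean:          {statistics.mean(population):.1f}")
print(f"random sample mean: {statistics.mean(random_sample):.1f}")
print(f"biased sample mean: {statistics.mean(biased_sample):.1f}")
```

The random sample tracks the true mean closely, while the biased sample overshoots it badly; monitoring such estimates against fresh, event-based data over time is what surfaces this kind of drift.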