Bias in AI occurs when results cannot be generalized widely. Although most people associate algorithmic bias with preferences or exclusions in training data, bias can also be introduced by how data is obtained, how algorithms are designed, and how AI outputs are interpreted.
This issue touches on concerns that are more social in nature and that may require broader steps to resolve, such as understanding how the processes used to collect training data influence the behavior of the models trained on it. For example, unintended biases can be introduced when training data is not representative of the larger population to which an AI model is applied. Facial recognition models trained on faces matching the demographics of AI developers, for instance, can struggle when applied to populations with more diverse characteristics.
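As a rough illustration of that representativeness problem, the sketch below compares the demographic make-up of a hypothetical training set against the population a model is meant to serve. The group names and population shares are invented for the example, not taken from any real data set.

```python
# Sketch: compare the demographic make-up of a training set against the
# target population it will be applied to. Group labels and population
# shares below are hypothetical placeholders.
from collections import Counter

def representation_gap(train_labels, population_shares):
    """Return per-group difference between training share and population share."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    gaps = {}
    for group, pop_share in population_shares.items():
        train_share = counts.get(group, 0) / total
        gaps[group] = train_share - pop_share
    return gaps

# Hypothetical example: a face data set skewed toward one demographic group.
train_labels = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50
population_shares = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}

for group, gap in representation_gap(train_labels, population_shares).items():
    print(f"{group}: {gap:+.2%} over/under-represented")
```

A check like this does not remove bias, but it makes the mismatch between who is in the data and who the model will be used on visible before training begins.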
The fact is that bias in training data – the data used to develop an algorithm before it is tested in the wider world – is only the tip of the iceberg. All data is biased to some degree: it represents a particular geography and set of demographics, and certain diseases may be over- or under-represented. That is simply a fact, and the bias may not be deliberate. It may be unavoidable because of the way measurements are made, but it means that we must estimate the error (confidence intervals) around each data point in order to interpret the results.
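One simple way to put those error bars on an estimate drawn from an imperfect sample is a bootstrap confidence interval. The sketch below is a minimal version assuming only NumPy; the measurement values are simulated for illustration.

```python
# Sketch: a percentile-bootstrap confidence interval for a statistic computed
# from a (possibly biased) sample. The measurements are simulated values.
import numpy as np

rng = np.random.default_rng(0)
measurements = rng.normal(loc=120, scale=15, size=200)  # e.g. a clinical measurement

def bootstrap_ci(data, stat=np.mean, n_boot=10_000, alpha=0.05, rng=rng):
    """Resample with replacement and take percentiles of the recomputed statistic."""
    boots = np.array([stat(rng.choice(data, size=len(data), replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

low, high = bootstrap_ci(measurements)
print(f"mean = {measurements.mean():.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The interval quantifies uncertainty from sampling noise; it cannot correct for a sample that systematically excludes part of the population, which is why the two problems have to be considered together.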
Today’s AI developers lack access to large, diverse data sets on which to train and validate new tools (Figure). They often need to leverage open-source data sets, but many of these were built and labeled by volunteer computer programmers, a predominantly white population. Because algorithms are often trained on single-origin data samples with limited diversity, technology that appeared highly accurate in research may prove unreliable when applied in real-world scenarios to a broader population of different races, genders, ages and more.
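One practical consequence is that a single aggregate accuracy number can hide exactly this kind of failure. The sketch below, using made-up labels, predictions, and group assignments, reports accuracy separately for each demographic subgroup rather than overall.

```python
# Sketch: report accuracy per demographic subgroup instead of one aggregate
# number. `y_true`, `y_pred`, and `groups` are hypothetical example data.
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Per-subgroup accuracy; a large spread suggests aggregate accuracy hides bias."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / totals[g] for g in totals}

# Illustrative numbers only: the model does well on group "a" but not on "b".
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(accuracy_by_group(y_true, y_pred, groups))  # {'a': 1.0, 'b': 0.25}
```

Breaking results out this way is a basic first step toward noticing that a tool validated on one population may not transfer to another.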

Think of heights in the U.S. If you collected them and put them all onto a chart, you’d find overlapping groups (or clusters) of taller and shorter people, broadly corresponding to adults, children, and those in between. But who was surveyed to get the heights? Was the survey done on weekdays or on weekends, when different groups of people are at work? If heights were measured at medical offices, people without health insurance may be left out. If it was done in the suburbs, you’ll get a different group of people than in the countryside or in cities. And how large was the sample?
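The sketch below simulates that kind of sampling-frame effect. The height distributions and the survey mix are invented for illustration, not real U.S. figures, but they show how a survey that reaches mostly one cluster shifts the estimated average away from the true population value.

```python
# Sketch: how the sampling frame changes what "average height" looks like.
# The distributions and proportions below are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: 75% adults, 25% children (heights in cm).
adults = rng.normal(170, 10, size=75_000)
children = rng.normal(130, 15, size=25_000)
population = np.concatenate([adults, children])

# A survey run only where adults happen to be (e.g. weekday workplaces)
# over-samples adults relative to the true mix.
biased_sample = np.concatenate([
    rng.choice(adults, size=1_900, replace=False),
    rng.choice(children, size=100, replace=False),
])

print(f"true population mean: {population.mean():.1f} cm")
print(f"biased sample mean:   {biased_sample.mean():.1f} cm")
```

The arithmetic is trivial, but the lesson carries over directly to AI training data: the answer you get depends on who was within reach when the data was collected.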