Open data and data bias

09 June 2020

Common mistakes or a lack of (open) data can leave a dataset unrepresentative of a population, but these issues can be prevented with our tips and tricks.

Along with the proliferation of (open) data and new applications such as machine learning (ML) and artificial intelligence (AI), there has been a rise in reports of gender, race and other types of bias in these systems. A principal source of these biases is the data used to build the applications: biased datasets can produce biased algorithms, which raises ethical concerns. Three types of bias are common: interaction bias, latent bias and selection bias. Interaction bias is illustrated by facial recognition algorithms trained on datasets that contain more Western (e.g. Caucasian) faces than faces of people from other ethnic backgrounds (e.g. African, Asian). Latent bias occurs when an algorithm incorrectly identifies something because of patterns in historical data or a stereotype that already exists in society. Selection bias occurs when a dataset overrepresents one group and underrepresents another.
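Overrepresentation and underrepresentation of this kind can be checked before any model is trained. The sketch below is only an illustration: the function name `representation_gaps`, the group labels and the reference shares are all hypothetical and not taken from any particular dataset. It compares each group's share of a dataset with the share that group is assumed to have in the target population.

```python
from collections import Counter


def representation_gaps(groups, reference_shares):
    """Return (dataset share - reference share) for each reference group.

    A positive value means the group is overrepresented in the dataset,
    a negative value means it is underrepresented.
    """
    total = len(groups)
    if total == 0:
        raise ValueError("dataset contains no records")
    counts = Counter(groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


# Hypothetical group labels for a ten-record dataset.
sample = ["A"] * 8 + ["B"] * 2
# Hypothetical population shares the dataset is meant to reflect.
reference = {"A": 0.5, "B": 0.5}

gaps = representation_gaps(sample, reference)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

In this made-up sample, group A is overrepresented and group B is underrepresented by the same margin. How large a gap is acceptable depends on the application and is not something the sketch decides.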

The good news is that there are ways to address these ethical concerns and to prevent data bias. First, ML and AI applications should be trained on representative datasets. Second, data scientists can choose the learning model best suited to each ML or AI application in order to prevent or detect data bias. Lastly, the performance of the application should be monitored so that adjustments can be made if a data or algorithm bias appears.
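One simple way to monitor an application for bias is to report its accuracy separately for each group rather than as a single overall figure. The following is a minimal sketch under stated assumptions: the function name `accuracy_by_group`, the labels, the predictions and the group assignments are invented for illustration, and accuracy is only one of several metrics that could be compared.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute the fraction of correct predictions within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical true labels, model predictions and group membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
# Difference between the best and worst served group.
accuracy_gap = max(per_group.values()) - min(per_group.values())

print(per_group)
print(f"accuracy gap: {accuracy_gap:.2f}")
```

A large gap between groups is a signal to revisit the training data or the model, even when overall accuracy looks acceptable.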

There are also positive examples in which data is used to reduce gender or racial bias. For example, researchers in the US have developed an AI tool for accurately predicting a woman’s future risk of developing breast cancer, and it is particularly effective for African American women. The researchers used a representative dataset of almost 90,000 screening mammograms from about 40,000 women of different races to train, validate and test the deep learning model. These results are a promising example of how data can be used to create a positive impact on our society.
