Open data and AI: A symbiotic relationship for progress
Publication Date/Time
2023-06-09T10:00:00+00:00
Country
Europe
Unleashing the potential of this powerful duo
Artificial Intelligence (AI) systems are complex mathematical models
trained on data. These systems are designed to process and analyse
large volumes of data with the purpose of recognising patterns and
making predictions. AI systems are becoming increasingly accurate and
sophisticated due, in part, to advances in the techniques and
algorithms used for AI, access to greater computer processing power,
and the wider availability of data.  

With these technological developments, the use of AI is becoming more
common across sectors in Europe. Figure 1 shows how in 2021 the
percentage of enterprises using AI technologies was varying across
countries in Europe. To foster the use of AI across Europe, the EU
launched the European AI Strategy
[https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence]
aiming to boost research and industrial capacity while ensuring safety
and fundamental rights. 
[https://data.europa.eu/sites/default/files/img/media/Figure%201_1.png]
_Figure 1: Percentage of enterprises in Europe using AI technology in
2021 (Eurostat
[https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Use_of_artificial_intelligence_in_enterprises#Enterprises_using_artificial_intelligence_technologies]) _

Data is a critical component of AI systems. This story highlights the
connection between open data and the performance of AI systems. With
real-life use cases, this story showcases how these two topics are
interlinked and are dependent on each other to foster Europe’s
digital transition
[https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/europe-fit-digital-age_en].


HOW OPEN DATA AND AI ARE CLOSELY RELATED

Open data and AI have the potential to support and enhance each
other’s capabilities. On the one hand, open data can improve AI
systems. In general, exposing AI systems to a larger volume and
variety of data increases the chance of the system returning accurate
and useful predictions. As such, open data can be a supply of large
amounts of diverse information for AI systems. In this way, the
availability of open data contributes to better performing AI. For
example, an AI system trained to predict consumer consumption patterns
[https://data.europa.eu/en/publications/datastories/measuring-inflation-eurozone-open-data]
in Europe will likely perform better if it includes a representative
selection of goods and service transactions by consumers across
different countries, municipalities, and income groups.

On the other hand, AI can unlock additional value from open data. AI
can analyse large volumes of data and identify trends and patterns
that might not have been revealed through other analysis techniques.
Open data contains rich information and complex patterns from which
insights can be derived. As a powerful analysis tool, AI can thereby
leverage the value of open data. For example, an AI system trained to
predict forest fires
[https://data.europa.eu/en/publications/datastories/high-value-datasets-forest-fires-eu]
can search for patterns across weather data, satellite images, and
historical trends that standard statistical comparisons cannot
identify.

AI CAN BENEFIT FROM THE BREADTH OF OPEN DATA

Open data encompasses a broad range of subject areas. This variety of
data widens the possible use cases for which AI systems can be
developed, making new AI-driven products and services possible. These
use cases can only be developed if the relevant data is available and
easy-to-access. The free availability of diverse datasets as through
open data, is essential to drive innovation and bring new economic
opportunities. The hope is that innovative AI systems can then be used
to help solve challenges facing society, creating socioeconomic value.


The diverse pool of information that open data can provide for AI
systems especially can enable multidisciplinary applications that
combine data on several topics to derive new insights. For example, a
weather dataset can be used to make weather predictions. But combined
with data on seed genetics, soil characteristics, and environmental
conditions, an AI system can be trained to have good contextual
knowledge of the variables affecting agriculture production
[https://data.europa.eu/en/publications/datastories/open-data-agricultural-sector].
Such a system could be used to aid decision-making to increase crop
yields, prevent plant disease, or optimise other business decisions.

AI CAN BENEFIT FROM THE DEPTH OF OPEN DATA

While the breadth of subject areas covered by open data enables broad
use cases for AI systems, having extensive open data available on a
specific use case allows AI systems to perform better. Take, for
example, an AI system developed to recognise residential buildings. If
the AI is trained on only images captured in the summertime of
mansions in the countryside, the model will perform poorly when asked
to recognise a city apartment as a residential building. The model
therefore needs to be trained on a comprehensive set of examples to
understand the variations of what is considered a residential building
(in this case factors such as architectural style, building size, and
surrounding environment may be relevant). 

Data that comprehensively represent the subject matter exposes AI
systems to a broader range of scenarios and variations. Ultimately,
this allows AI models to perform better in real-world situations and
generalise their knowledge when confronted with new data (such as a
picture of an individual house that the model has not seen before). AI
systems trained on unrepresentative or incomplete data risks making
biased predictions and being unreliable.

HIGH-QUALITY AI RELIES ON HIGH-QUALITY OPEN DATA

The completeness of open data contributes to the ability of AI systems
to generalise to unseen examples once these are deployed into
“real-world” operation, but it also contributes to the concept of
data quality. Some characteristics of data quality includes the
completeness, relevance, consistency, uniformity, and trustworthiness
of the data for the use case being developed. Figure 2 displays the
relationship between six data quality dimensions and the performance
of three types of AI algorithms from a paper
[https://arxiv.org/abs/2207.14529] by researchers at the University of
Potsdam. For example, the figure shows that completeness (no missing
data) and feature accuracy (no erroneous data) have a strong effect on
the performance of all three AI algorithms. On the other hand,
clustering algorithms are less affected by target accuracy (no
mislabelling of data), uniqueness (no redundant or duplicated data),
and class balance (having groups equally represented).
[https://data.europa.eu/sites/default/files/img/media/Figure%202_3.png]
_Figure 2: The effect of data quality on the performance of AI
algorithms (__GitHub_
[https://github.com/HPI-Information-Systems/DQ4AI]_)_

Several open data initiatives in the scientific field demonstrate the
impact of open repositories with structured catalogues of data and
standardised data formats. For example, the German government funds a
national research data infrastructure including a consortium (called
NFDI4Chem [https://www.nfdi4chem.de/]) that makes chemical data
findable, accessible, interoperable, and reusable by setting best
practices including machine-readable chemical structures. These
databases typically include a quality review process or curation
method to ensure data quality and trustworthiness.

A greater demand for open data for new products and services might
encourage the release of more datasets and improvements in data
quality. It is therefore promising to see in the Open Data Maturity
Report 2022
[https://data.europa.eu/en/publications/datastories/open-data-maturity-report-2022-countries-perspectives-open-data-quality]
how European countries are working to improve the quality of the data
published on their national open data portals.

OPEN DATA IS ENABLING REAL-WORLD AI APPLICATIONS

There are several examples of open data being used in AI systems for
novel applications in Europe.

As a first example, the Croatian CROZ RenEUwable
[https://data.europa.eu/en/news-events/news/eu-datathon-2022-teams-behind-apps-meet-croz-reneuwable]
app combines climate and energy data into a machine learning model
that provides citizens with personal recommendations for adopting more
sustainable decisions on energy. The AI-driven app, which won the EU
Datathon 2022 [https://op.europa.eu/en/web/eudatathon] in the category
of ‘A European Green Deal’, is based on open data selected by the
team based on quality, completeness, consistency, timeliness,  and
usability. 

Another example is a project for the Land Registry and Topography
Administration [https://act.gouvernement.lu/en.html] in Luxembourg
that illustrates the use of AI to analyse aerial images saved in
geographic databases
[https://data.europa.eu/data/datasets/extopia-extraction-dojects-topographiques-par-intelligence-artificielle?locale=en].
Figure 3 displays an annual series of such aerial photographs. These
databases need to be continuously updated and maintained, with new
aerial images added every year.  This task includes identifying all
newly constructed, demolished, or updated buildings. Performing such
an inspection manually is very laborious, so the Luxembourgish
government launched a project to develop a proof of concept of an
AI-based tool that can automatically identify changes to buildings.
The project produced satisfying results and a goal for subsequent
versions is to include other topographical objects such as walking
paths.
[https://data.europa.eu/sites/default/files/img/media/Figure%203_1.png]
_Figure 3: Landscape changes in Beaufort, Luxembourg from 2001 –
2022. Aerial images have been geometrically corrected to ensure a
uniform scale (Ministry of Finance, Luxembourg)_

As a final use case example, the European Commission launched the
European Cancer Imaging Initiative
[https://digital-strategy.ec.europa.eu/en/policies/cancer-imaging] to
leverage data and digital technologies, such as AI, to combat cancer.
This initiative aims to create an open dataset
[https://cancerimage.eu/achievements/#tools] that links all existing
resources and databases across Europe, working towards a more open,
available, and user-friendly infrastructure for cancer imaging. The
infrastructure is expected to be completed by December 2023, after
which data providers will be able to connect to the platform.

CONCLUSION

The potential of AI systems in society is vast. When combined with
open data, new opportunities become possible both for deriving new
insights from open data and powering AI systems for new uses. The free
availability of open data provided to all citizens without limits on
its reuse enables enterprises to implement this data in their AI
systems. Novel use cases are supported by the diversity of open data
both in its potential breadth, which would offer multiple use cases
for AI systems, and its potential depth, which would offer a deep
representation of a specific use case. Policy initiatives that promote
the sharing of certain datasets support the diversity of open data in
Europe.

Increasing access to high-quality open data is a priority to unlock
the synergy between open data and AI. Further improvements to data
quality are supported by policies and community initiatives that
impose quality standards and curation methods on open data.
Data.europa.eu contributes to this goal through, among other means,
its metadata quality dashboard [https://data.europa.eu/mqa/?locale=en]
which intends to help data providers and national data portals
evaluate their metadata against various indicators such as
accessibility and reusability. 

If you want to find out more about the work of data.europa.eu and real
examples on what to do with data, don´t miss out on the Use Case
Observatory report
[https://data.europa.eu/sites/default/files/report/The_Use_Case_Observatory_Volume_I_0.pdf]
and related learnings on the data.europa academy
[https://data.europa.eu/en/academy/measuring-impact-open-data]. Also,
stay tuned for our next data stories and webinars by subscribing to
our newsletter [https://data.europa.eu/en/newsletter] and following
data.europa.eu on social media. 
