AI and Open Data: a crucial combination
Publication Date/Time
2018-07-04T09:00:00+00:00
Unlocking the potential of Artificial Intelligence
AI AND DATA: A CRUCIAL COMBINATION

Artificial Intelligence (AI) technology has the capacity to extract
deeper insights from datasets than other techniques. Examples of AI
applications are speech recognition, natural language processing,
chatbots and voice bots. To get AI applications to work, large sets of
high-quality (open) data are necessary. But what requirements does
this data have to meet to enable successful AI applications?

 

USE OF OPEN DATA FOR AI

Access to (open) data can unlock the potential of AI applications. One
AI project worth noting shows how a police agency that focuses on
crime prevention uses large amounts of Open Data.
Based on occurrences of crimes (data) and their frequency, the
algorithm developed patterns to identify 'hotspots', areas where
certain crimes are likely to occur in the future. These 'hotspots' are
defined geographic areas within which the algorithm can make specific
predictions as to the type of crime that could take place and when it
is most likely to occur. Underlying these predictions are certain
assumptions and patterns, such as that criminals often operate within
the same area for long periods of time. This project
[http://www.policyconnect.org.uk/appgda/research/crime-prevention-through-artificial-intelligence]
was very successful, reducing burglaries by 33% and violent crimes by
21% in the city where it was applied.
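
The hotspot idea described above can be illustrated with a minimal
Python sketch. It is not the method used in the cited project: it
simply assumes that past incidents are available as (x, y) coordinates
in kilometres, bins them into a square grid, and ranks the grid cells
by how many incidents they contain. The function name `find_hotspots`,
the parameters `cell_size_km` and `top_n`, and the sample data are all
hypothetical.

```python
from collections import Counter

def find_hotspots(incidents, cell_size_km=1.0, top_n=3):
    """Return the top_n grid cells ranked by number of past incidents.

    incidents: iterable of (x_km, y_km) coordinates (assumed format).
    """
    counts = Counter()
    for x, y in incidents:
        # Map each incident to the grid cell that contains it.
        cell = (int(x // cell_size_km), int(y // cell_size_km))
        counts[cell] += 1
    # Cells with the most past incidents are treated as likely hotspots.
    return counts.most_common(top_n)

# Made-up sample data, for illustration only.
sample = [(0.2, 0.3), (0.7, 0.9), (0.5, 0.1),
          (3.4, 2.2), (3.9, 2.8), (7.1, 5.5)]
hotspots = find_hotspots(sample, cell_size_km=1.0, top_n=2)
```

A real system would also take the type of crime and the time of day
into account, as the article notes; this sketch only captures the
assumption that crimes tend to recur within the same area.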

 

THE INGREDIENTS THAT LET AI FUNCTION PROPERLY

AI systems need Open Data to function. To function properly,
significant 1) data volume, 2) data variety and 3) data veracity (the
truthfulness of data) are crucial. In traditional data analysis, when
bad data is discovered, one can exclude the bad data and start over.
However, this kind of data cleansing does not work at a larger scale:
bad data cannot be detected or 'pulled out' of an AI system that easily.
AI techniques draw conclusions from large masses of data and at some
point, it becomes impossible to determine on which data elements these
predictions are based. Often, when bad data is detected, the whole
learning process starts again from the beginning, which is time- and
cost-intensive.

For this reason, besides the V's 'volume' and 'variety', 'veracity'
must be assessed to determine whether the data is of good quality.
There are several questions to answer to find out whether the data is
'truthful':

 	* WHY & WHO: Data from a reliable source is likely to be more
accurate than data from a random online poll. Data is sometimes
collected, or even fabricated, to serve an agenda. Therefore, a user
of data should establish the credibility of the data source and the
purpose for which the data was collected.
 	* WHERE: Data is often geographically or culturally biased. Consumer
data collected in one area of the world may not be representative of
consumers 200 kilometres away. In addition, even objectively measured
data, such as temperatures, can be interpreted differently: what is
considered cold or warm?
 	* WHEN: Most data is linked to time in some way: it might be a time
series, or it can be a snapshot from a specific period. Out-of-date
data should be omitted. However, when using AI over a longer time
span, data can become old or obsolete during the process.
"Machine unlearning [https://www1.lehigh.edu/news/machine-unlearning]"
will be needed to get rid of data that is no longer valid.
 	* HOW: It is important to understand how the data of interest was
collected. Domain knowledge is essential here. For instance, when
collecting consumer data, we can fall back on the decades-old methods
of market research. Answers to an ill-constructed questionnaire will
almost certainly yield poor-quality data.
 	* WHAT: To use data well, you need to know what it is about, and
before you can do that, you should know the context that surrounds the
numbers. Humans can sometimes detect bad data, for instance because it
looks illogical. However, AI does not have this form of common sense
and tends to take all data as true.
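
Some of the questions above can be turned into simple automated checks
that run before data reaches an AI system. The Python sketch below is
only an illustration under stated assumptions: the record fields
(`source`, `collected_on`, `value`), the allow-list of trusted sources,
the one-year age limit and the plausible temperature range are all
hypothetical choices, not requirements taken from this article.

```python
from datetime import date

# Hypothetical allow-list of sources considered credible (WHY & WHO).
TRUSTED_SOURCES = {"national_statistics_office"}

def veracity_issues(record, today, max_age_days=365,
                    valid_range=(-60.0, 60.0)):
    """Return a list of reasons why a record may not be 'truthful'."""
    issues = []
    # WHY & WHO: is the source one we consider credible?
    if record["source"] not in TRUSTED_SOURCES:
        issues.append("untrusted source")
    # WHEN: is the record older than the assumed age limit?
    if (today - record["collected_on"]).days > max_age_days:
        issues.append("out of date")
    # WHAT: a plausibility check a human would apply by common sense.
    low, high = valid_range
    if not low <= record["value"] <= high:
        issues.append("implausible value")
    return issues

# Made-up temperature records, for illustration only.
records = [
    {"source": "national_statistics_office",
     "collected_on": date(2018, 5, 1), "value": 21.5},
    {"source": "online_poll",
     "collected_on": date(2015, 1, 1), "value": 250.0},
]
today = date(2018, 7, 4)
report = [veracity_issues(r, today) for r in records]
```

Here the first record passes all three checks, while the second is
flagged on every one. Checks like these cannot answer the WHERE and
HOW questions, which still depend on domain knowledge about how and
where the data was collected.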

Together with the three 'V's, Open Data can play a pivotal role in
achieving benefits through the use of AI. Open Data
[http://theodi.org/guides/what-open-data] is data that is made
available by (public) organisations, businesses and individuals for
anyone to access, use and share. When data is not open, it cannot be
freely re-used for other purposes, such as AI. Access to open data
will 'unlock' the potential of data-hungry AI applications.

 [https://data.europa.eu/sites/default/files/img/media/20180704-fig1.png]


  
_Figure 1 Open Data and AI_

 

HOW TO ACHIEVE DATA QUALITY? 

Data quality requires an integral approach to make sure that people,
processes and systems within organisations meet the demands to achieve
continuous data quality. Three factors are important:

 	* Assign clear roles and responsibilities to EMPLOYEES for efficient
data management;
 	* Describe PROCESSES for data quality assurance;
 	* Invest in sufficient TECHNOLOGY (software and IT architecture) to
support employees.

With the right software, people and rules in place, organisations can
ensure that the organisational data system detects errors and delivers
valuable new insights by making use of AI.

THE FUTURE OF AI? 

Data is essential for AI, as the algorithms in these systems need
large quantities of high-quality data to perform properly and to
develop further by 'learning'. The continuous promotion of data quality within
organisations when using (open) data for AI is therefore essential to
gain reliable insights. An important focus for (public) organisations
is to make data openly available where possible to let other
organisations (or people within their own organisation) use the data.
Increasing access to high-quality data volumes is key to unleashing
the potential of AI and to allow innovation to flourish.
