Linking data: European Science Vocabulary
In our ‘Linking data’ series, we are presenting EU projects that use linked open data (LOD). What data is linked in these projects? Why did they decide to use LOD? What benefits does it bring? Follow the series to find out.
In this episode, we are presenting the European Science Vocabulary. Read on to find out what it is and how and why it uses LOD.
CORDIS – EU research and development database
The Community Research and Development Information Service (CORDIS) is a multilingual platform offering access to data about EU-funded research and innovation projects. Its mission is to bring research results to professionals in the field, foster open science, create innovative products and services and stimulate economic and scientific growth across Europe.
The platform is made up of several databases and contains information on all EU-supported research and innovation (R & I) activities, including funding programmes (such as Horizon 2020), projects, results and publications.
The project database is at the heart of CORDIS. It gives access to public information about EU-funded R & I projects, including details such as objectives, dates and funding programmes.
Improving findability with EuroSciVoc
To make projects easier to find in the database, CORDIS developed the European Science Vocabulary (EuroSciVoc). This is a taxonomy – a way of describing data in which all the terms belong to a single hierarchical structure and have parent/child or broader/narrower relationships to other terms. The structure is sometimes referred to as a ‘tree’. EuroSciVoc allows the classification of projects according to the precise scientific field(s) which they relate to.
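As a rough illustration of what 'a single hierarchical structure' means, the following minimal Python sketch models a taxonomy as a tree in which every category has at most one parent (its broader term) and any number of children (its narrower terms). The class name, its methods and the three labels are assumptions made for this example; they are not taken from the published EuroSciVoc data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Category:
    """One taxonomy term: a label, one broader term, any number of narrower terms."""
    label: str
    parent: Optional["Category"] = field(default=None, repr=False)
    children: list["Category"] = field(default_factory=list)

    def add_child(self, label: str) -> "Category":
        """Create a narrower term under this category and return it."""
        child = Category(label, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Labels from the root down to this category (broadest first)."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels[::-1]


# Illustrative labels only, not an extract of the real taxonomy.
root = Category("natural sciences")
cs = root.add_child("computer and information sciences")
ai = cs.add_child("artificial intelligence")

print(" > ".join(ai.path()))
```

Because each category stores exactly one parent, walking upwards from any term always ends at a single root, which is the property that distinguishes a tree from a more general graph of terms.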
EuroSciVoc’s root is based on the two levels of the Fields of Research and Development classification, developed by the Organisation for Economic Co-operation and Development. To offer a comprehensive categorisation, the taxonomy tree was further enriched with additional branches, based on the scientific fields collected from the abstracts of projects stored on the CORDIS platform.
The taxonomy contains more than 1 000 categories available in six languages (English, French, German, Italian, Polish and Spanish). Starting from its seven root categories, the EuroSciVoc classification can reach a maximum depth of six levels.
Each category is enriched with one or more relevant keywords, that is, alternative or related terms used to classify projects in addition to the category’s main term. The keywords are extracted from the textual description of the projects. For instance, the keywords for the category ‘water supply systems’ are ‘water supply network’ and ‘water supply infrastructure’. Stop-keywords can be used to exclude certain projects from categories (e.g. to exclude a project mentioning ‘state of the art’ from the category ‘arts’).
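The interplay of keywords and stop-keywords can be sketched in a few lines of Python. This is a simplified whole-word matcher written for illustration, not the classification tool CORDIS uses. The dictionary layout, the function names and the keyword 'art' attached to the category 'arts' are assumptions made for the example.

```python
import re


def contains(term: str, text: str) -> bool:
    """True if `term` occurs in `text` as whole words, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def classify(text: str, categories: list[dict]) -> list[str]:
    """Return labels of categories matched by label or keyword,
    skipping any category whose stop-keywords appear in the text."""
    matches = []
    for cat in categories:
        if any(contains(stop, text) for stop in cat.get("stop_keywords", [])):
            continue  # a stop-keyword vetoes this category
        terms = [cat["label"], *cat.get("keywords", [])]
        if any(contains(term, text) for term in terms):
            matches.append(cat["label"])
    return matches


categories = [
    {
        "label": "water supply systems",
        "keywords": ["water supply network", "water supply infrastructure"],
    },
    {
        "label": "arts",
        "keywords": ["art"],                    # hypothetical keyword
        "stop_keywords": ["state of the art"],
    },
]

review = "A review of the state of the art in water supply network monitoring"
exhibition = "An exhibition exploring contemporary art"

print(classify(review, categories))      # ['water supply systems']
print(classify(exhibition, categories))  # ['arts']
```

In the first text the word 'art' would on its own suggest the category 'arts', but the stop-keyword 'state of the art' prevents that assignment, while the keyword 'water supply network' still places the project under 'water supply systems'.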
The major benefit of EuroSciVoc is that it allows users to find projects belonging to specific domains of science in a standardised way.
EuroSciVoc follows a pragmatic approach of combining artificial intelligence and human expertise. Artificial intelligence algorithms are used to extract and suggest categories and their keywords from project descriptions. Those suggestions are then validated by humans.
As the project database expands, EuroSciVoc evolves and is maintained constantly. Its maintenance and update workflow consists of four phases.
A dedicated tool using a combination of natural language processing algorithms helps to classify CORDIS projects according to the EuroSciVoc taxonomy.
Dedicated algorithms detect all new categories issued and all modifications of existing categories. They compile a list that is crucial for the cleansing phase.
During cleansing, the EuroSciVoc team analyses the list and applies the necessary modifications, either directly or following a discussion.
Finally, once the list has been verified and changes have been made, EuroSciVoc is released. It is made available on the EU Vocabularies website and in the integrated CORDIS architecture (the repository of CORDIS content).
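The detection phase of this workflow, in which new and modified categories are compiled into a list for cleansing, can be pictured as a comparison of two snapshots of the taxonomy. The sketch below is a minimal illustration under that assumption. The snapshot format (a mapping from category label to its set of keywords), the function name and the category 'robotics' are hypothetical, and the real algorithms are not described in enough detail here to reproduce them.

```python
def detect_changes(old: dict[str, set[str]], new: dict[str, set[str]]) -> dict[str, list[str]]:
    """Compare two snapshots mapping category label -> set of keywords.

    Returns the categories that are new and the existing categories
    whose keywords have changed, each as a sorted list.
    """
    added = sorted(new.keys() - old.keys())
    modified = sorted(
        label for label in old.keys() & new.keys() if old[label] != new[label]
    )
    return {"new": added, "modified": modified}


previous = {
    "water supply systems": {"water supply network"},
    "arts": set(),
}
current = {
    "water supply systems": {"water supply network", "water supply infrastructure"},
    "arts": set(),
    "robotics": {"robot"},  # hypothetical new category
}

changes = detect_changes(previous, current)
print(changes)
```

A list of this kind would then be what the team reviews during cleansing before a new version is released.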
Linked open data approach
EuroSciVoc is formalised using the Simple Knowledge Organization System, a common data model for sharing and linking knowledge organisation systems via the web.
The main benefit of using this data model is that it allows concepts and their labels to be linked and aligned across different controlled vocabularies. Thanks to these connections, users can compare different resources that are classified using equivalent categories, irrespective of lexical and semantic differences.
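To make the idea of linking and aligning concepts more concrete, here is a minimal sketch using only the Python standard library. It describes two concepts from two vocabularies with the standard SKOS properties skos:prefLabel, skos:altLabel and skos:exactMatch, serialises them as N-Triples and looks up equivalent concepts. The example.org URIs, the concept identifiers, the second vocabulary and its label are invented for illustration, and treating a keyword as an skos:altLabel is an assumption rather than a statement about how EuroSciVoc is actually modelled.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
VOCAB_A = "https://example.org/vocab-a/"  # hypothetical vocabulary
VOCAB_B = "https://example.org/vocab-b/"  # hypothetical vocabulary

# (subject, predicate, object); a literal object is a (text, language) tuple.
triples = [
    (VOCAB_A + "c1", SKOS + "prefLabel", ("water supply systems", "en")),
    (VOCAB_A + "c1", SKOS + "altLabel", ("water supply network", "en")),
    (VOCAB_B + "x9", SKOS + "prefLabel", ("drinking water networks", "en")),
    (VOCAB_A + "c1", SKOS + "exactMatch", VOCAB_B + "x9"),
]


def to_ntriples(triples):
    """Serialise the triples as N-Triples (labels here need no escaping)."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, tuple):
            text, lang = obj
            rendered = f'"{text}"@{lang}'
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


def equivalents(concept, triples):
    """Concepts linked to `concept` by skos:exactMatch, in either direction."""
    found = set()
    for subject, predicate, obj in triples:
        if predicate != SKOS + "exactMatch":
            continue
        if subject == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subject)
    return found


print(to_ntriples(triples))
print(equivalents(VOCAB_B + "x9", triples))
```

The two concepts carry different labels, yet the skos:exactMatch link lets an application treat resources classified under either one as belonging to the same category. In practice a dedicated RDF library would normally replace the hand-written serialisation.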
Thanks to LOD, EuroSciVoc can be seamlessly reused by any other organisation to classify their own data. As part of the EU controlled vocabularies, EuroSciVoc is periodically published on the EU Vocabularies website with a persistent uniform resource identifier. The taxonomy is free for reuse in accordance with the CC BY 4.0 licence.
Reusing EuroSciVoc in your projects
EuroSciVoc allows you to classify your data using a taxonomy built on a corpus of the textual resources of more than 5 000 R & I projects. It is frequently updated to accommodate new information and improve its accuracy and scope. Its evolution, while data-driven, is controlled by human experts, which ensures that it is semantically consistent.
Reusability and flexibility are among the major features of EuroSciVoc. The taxonomy can be used to represent fields of science in six languages and can easily be adapted to other controlled vocabularies using the Fields of Research and Development classification (since the latter acts as its root). Finally, its formalisation in the Simple Knowledge Organization System makes it straightforward to reuse.
The taxonomy is currently at version 1.3 but, as mentioned above, it is an ongoing project. Apart from expanding and aligning with other relevant taxonomies, EuroSciVoc will potentially evolve into a thesaurus – a controlled vocabulary with concepts represented by labels, which extends taxonomies’ hierarchical structure with associative properties.
What does this mean in practice? What’s the added value?
Thesauri can determine that a concept is more or less specific than another, but also that a concept is related to another because they cover aspects of a similar domain. For instance, ‘artificial intelligence’ is related to ‘computational fluid dynamics’ since the latter studies fluid dynamics by exploiting techniques like machine learning.
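The difference between the two kinds of relationship can be sketched as follows. In this minimal Python illustration, hierarchical links are stored as a mapping from a narrower term to its broader term, while associative links are stored as unordered pairs, since 'related to' holds in both directions. The data structures, function names and the placement of 'machine learning' under 'artificial intelligence' are assumptions made for the example, not content from EuroSciVoc.

```python
# Hierarchical relation: narrower term -> broader term (illustrative).
broader = {
    "machine learning": "artificial intelligence",
}

# Associative relation: unordered pairs, because 'related' is symmetric.
related = {
    frozenset({"artificial intelligence", "computational fluid dynamics"}),
}


def broader_terms(concept: str) -> list[str]:
    """All ancestors of `concept`, nearest first."""
    ancestors = []
    while concept in broader:
        concept = broader[concept]
        ancestors.append(concept)
    return ancestors


def related_terms(concept: str) -> list[str]:
    """Concepts associated with `concept`, regardless of hierarchy."""
    return sorted(
        other
        for pair in related
        if concept in pair
        for other in pair
        if other != concept
    )


print(broader_terms("machine learning"))
print(related_terms("computational fluid dynamics"))
print(related_terms("artificial intelligence"))
```

A pure taxonomy only supports the first kind of lookup. The second lookup returns a result from either side of the pair, which is what allows a user browsing one field to be pointed towards a neighbouring one.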
If EuroSciVoc evolves into a thesaurus, it will provide CORDIS users with more complex and exhaustive information. It will also benefit the re-users of EuroSciVoc, as they will be provided with a more extensive reference data asset.
Graphics used in this article (available for reuse under CC-BY-4.0)