Spanish-Italian website parallel corpus (Processed)

Name: Spanish-Italian website parallel corpus (Processed)
Creator: Directorate-General for Communications Networks, Content and Technology
Published: 2020-02-11T12:57:48
License: https://creativecommons.org/licenses/by/4.0

Publisher

Directorate-General for Communications Networks, Content and Technology

Description

This is a parallel corpus of bilingual texts crawled from multilingual websites. It contains 3,319 translation units (TUs). Date of crawling: 23/01/2017.

The source data had already undergone a strict validation process, which discarded:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
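The website-level rejection rule can be sketched as follows. This is only an illustration, not the validation tooling actually used for the corpus: the names `THRESHOLDS` and `website_passes` are hypothetical, and the sketch assumes a website is rejected as soon as any single error rate in its manually validated sample is strictly above the corresponding threshold.

```python
# Thresholds from the description, expressed as fractions of the sampled TUs.
# The dictionary keys are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translation": 0.20,
    "translation": 0.50,
}


def website_passes(error_rates):
    """Return True if no error rate is strictly above its threshold.

    `error_rates` maps an error type to the fraction of sampled TUs
    showing that error; missing error types are treated as 0.0.
    Assumption: exceeding any one threshold rejects the whole website.
    """
    return all(
        error_rates.get(kind, 0.0) <= limit
        for kind, limit in THRESHOLDS.items()
    )
```

Under these assumptions, a sample with exactly 50% alignment errors is still accepted, because the rule is "strictly above", while a sample with 25% machine translated content is rejected.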

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.