This is a parallel corpus of bilingual texts crawled from multilingual websites. It contains 15,797 translation units (TUs). Crawling period: 15/11/2016 - 23/01/2017. A strict validation process was followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs in which more than 99% of tokens are misspelled,
- TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine translated content, 50% of TUs with translation errors.
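The discarding rules described above can be sketched in Python as follows. This is only an illustration, not the pipeline actually used to build the corpus: the TU fields (`site`, `misspelled_ratio`, `flagged_in_manual_validation`), the function names, and the dictionary keys are all hypothetical. It also assumes that a website is rejected as soon as any single error rate exceeds its threshold, which the description does not state explicitly.

```python
# Hypothetical sketch of the validation filter; names and data layout are assumptions.

# Maximum tolerated error rates in the manually validated sample of a website.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translation": 0.20,
    "translation": 0.50,
}

# TUs with more than 99% misspelled tokens are discarded.
MAX_MISSPELLED_RATIO = 0.99


def site_is_rejected(sample_error_rates):
    """Return True if any error rate is strictly above its threshold.

    Assumption: exceeding one threshold is enough to reject the website.
    Error types missing from the sample are treated as 0.0.
    """
    return any(
        sample_error_rates.get(kind, 0.0) > limit
        for kind, limit in THRESHOLDS.items()
    )


def keep_tu(tu, site_error_rates, psi_compliant_sites):
    """Apply the discarding rules to a single TU (a plain dict here)."""
    site = tu["site"]
    if site not in psi_compliant_sites:
        return False  # website does not comply with the PSI directive
    if tu["misspelled_ratio"] > MAX_MISSPELLED_RATIO:
        return False  # more than 99% of tokens are misspelled
    if tu["flagged_in_manual_validation"]:
        return False  # identified during manual validation
    # Discard every TU of a website whose sample exceeds a threshold.
    return not site_is_rejected(site_error_rates.get(site, {}))
```

Because the comparison is strict, a website whose sample contains exactly 20% machine translated TUs is kept, while one at 25% is dropped together with all of its TUs.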
- Languages: Spanish, French
European Language Resource Coordination (ELRC)