We audit 5 multilingual corpora, finding that lower-resource corpora have systematic issues.
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
We explore the impact of the training data size on a French version of RoBERTa. (Equal contribution by the first three authors.)
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect.
We explore the impact of the training data size and heterogeneity on French language modeling. (Equal contribution by the first three authors.)
We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.
We investigate the impact of different types and sizes of training corpora on language models.
We explore the impact of the OCR quality on grobid-dictionaries models.
We propose a new pipeline to filter, clean, and classify Common Crawl by language, and we publish the final corpus under the name OSCAR.