Searching for Discriminative Metadata of Heterogenous Corpora

Abstract

In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus used is a treebank of Old French made of texts which differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments in order to determine which of these metadata are the most discriminative and to induce a general methodology.

Publication
In the 14th International Workshop on Treebanks and Linguistic Theories (TLT14)
Gaël Guibon
Associate Professor

My research ranges from emoji and emotion prediction and recommendation to meta-learning, few-shot learning, and studies of French lexical evolution.
