Searching for Discriminative Metadata of Heterogenous Corpora

Abstract

In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus used is a treebank of Old French made of texts which differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments in order to determine which of these metadata are the most discriminative and to induce a general methodology.

Publication
In the 14th International Workshop on Treebanks and Linguistic Theories (TLT14)
Gaël Guibon
Post-doctoral Researcher

My research ranges from emoji and emotion prediction and recommendation to the study of French lexical evolution.
