Searching for Discriminative Metadata of Heterogenous Corpora


In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus is a treebank of Old French composed of texts that differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments to determine which of these metadata are the most discriminative and to derive a general methodology.

In the 14th International Workshop on Treebanks and Linguistic Theories (TLT14)
Gaël Guibon
