Searching for Discriminative Metadata of Heterogenous Corpora

Gaël Guibon, Isabelle Tellier, Sophie Prévost, Mathieu Constant, Kim Gerdes

December 2015

Abstract

In this paper, we use machine learning techniques for part-of-speech tagging and parsing to explore the specificities of a highly heterogeneous corpus. The corpus used is a treebank of Old French made of texts which differ with respect to several types of metadata: production date, form (verse/prose), domain, and dialect. We conduct experiments in order to determine which of these metadata are the most discriminative and to induce a general methodology.

Type

Conference paper

Publication

In the 14th International Workshop on Treebanks and Linguistic Theories (TLT14)

Gaël Guibon

Associate Professor

My research ranges from emoji and emotion prediction and recommendation to meta-learning, few-shot learning, and the study of French lexical evolution.
