Datahub SSH

AnnCor and Multiword Expression Identifier

The central goal of this project is to create a Multiword Expression Identifier for Dutch (MWEIDD) and to enrich various Dutch text corpora with annotations based on this identifier. Multiword expressions (MWEs) are combinations of words that have idiosyncratic properties, e.g. de plaat poetsen (literally "to polish the plate", meaning "to run off").

In addition, the project prepares text corpora (AnnCor and Childes) for the MWEIDD algorithm, parses them, and stores the results in a new version of GreTel (Greedy Extraction of Trees for Empirical Linguistics), which will then serve as a user-friendly search engine for exploring these annotated corpora.

For a detailed project description, click here.

The results of the DataHub’s AnnCor project:

  • Development of an algorithm, the Multiword Expression Identifier for Dutch (MWEIDD)
  • Parsing of the current AnnCor corpus
  • Parsing and annotation of the Childes corpus
  • Development of a new version of GreTel
  • Upload of the parsed corpora to GreTel

This results in a toolset that will boost and accelerate research into multiword expressions, from both a theoretical and a computational linguistics viewpoint. It will also benefit research into language acquisition and education. In addition, it will enable Utrecht University to play a leading role in the international MWE research community. The toolset will in time be disseminated via this website.