DataHub SSH

Project description

Jan Odijk’s team (UiL-OTS/TLC) cooperates with the Digital Humanities Lab in the DataHub SSH project. Together they work on the following items:

  • MWEID: development of a Multi-Word Expression Identifier. This application will automatically identify multiword expressions (e.g. idioms and collocations) in a Dutch text corpus and generate appropriate annotations. The LASSY-Small corpus will be enriched with annotations generated by this application. MWEID will be a new application, initially hosted by the DH Lab and later most likely also by the recognized CLARIN centre Institute for the Dutch Language (INT). The data and associated metadata will be integrated in the GrETEL application and stored on UU DataHub servers as well as on servers of INT.
  • CHILDES treebanks: Automatically generated syntactic structures for CHILDES data will be made available via the GrETEL application and as data via DataHub SSH and INT servers.
  • Manual annotation of CHILDES corpora: For the CHILDES Van Kampen corpus, a version with manually annotated syntactic structures for each utterance will be made available. This continues the work started earlier in the UU AnnCor project.
  • GrETEL improvements: Multiple improvements will be made to the GrETEL application, in particular to its upload functionality. GrETEL is an application that enables searching in a treebank, analysing the search results, and uploading one’s own text corpus, which is then made available as a searchable treebank. GrETEL runs on servers of the UU DH Lab and on servers of INT.
  • The Lassy-Large treebank (which includes the SoNaR corpus) will be made available on the DataHub SSH servers so that all Utrecht researchers can use it in their research.
  • More generally, all data and applications that we develop will be integrated in the European CLARIN research infrastructure and its Dutch part (CLARIAH) via the recognized CLARIN centre INT.