Datahub SSH

Corpora

Under "Corpora" we group all activities that deal with the creation, deployment, and dissemination of text corpora.

Current activities and results:

  • English-language historical newspapers
    This project focuses on the dissemination of three corpora of English-language historical journals and magazines, such as the Herald Tribune and the Economist.
  • A Corpus of Islamic legal texts (8th century – 19th century) 
    In this project, fifty-five works of substantive Islamic law were prepared for analysis with the Footprinter tool.
  • Encyclopedia of Arabic poetry and belles-lettres (9th – 18th century) 
    This encyclopedia is a collection of fourteen encyclopedic anthologies of poetry and belles-lettres, all written in the Sunni world between the 9th and the 18th century. The anthologies will be subjected to a sentiment analysis that specifically targets the diachronic appreciation of the five bodily senses.
  • AnnCor and Multiword Expression Identifier
    The central goal of this project is to create a Multiword Expression Identifier for Dutch (MWEIDD) and enrich various Dutch text corpora with annotations based on this Identifier.
     
    In addition to the activities above, the project includes preparing the text corpora AnnCor and CHILDES for the MWEIDD algorithm.