The University of Helsinki's English E-thesis 1999-2016, Korp version 1.1


Helsingin yliopiston englanninkielinen E-thesis 1999-2016, Korp versio 1.1

e-thesis-en-korp-v1-1

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2020031301

Access location:

The corpus is available in Korp at Kielipankki, the Language Bank of Finland.

The corpus contains the University of Helsinki's English master's theses as well as the doctoral theses and their summaries published at https://ethesis.helsinki.fi by September 2016.

This version fixes issues with tokenization, language recognition and OCR in the master's theses and dissertations, 23 subcorpora in total. These subcorpora have also been parsed with the Turku Neural Parser Pipeline (TNPP). For more information, see http://turkunlp.org/Turku-neural-parser-pipeline/.

The two subcorpora containing abstracts (ethesis_en_dissabs and ethesis_en_maabs) are unchanged from the previous version.

The subcorpus ethesis_en_phd_math has been renamed to ethesis_en_phd_sci.

Texts with fewer than 1000 words have been left out. Most of them contain only an abstract, often in both Finnish and English, and/or just the first page and possibly a table of contents. This is especially common in dissertations.

Texts that contain more than 1000 words are included if they contain at least 50 English words, tested with a very simple search for common English words. Most texts that do not pass this test are either not in English or have been badly OCR'd.
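The two inclusion rules above can be illustrated with a minimal Python sketch. It is not the script used to build the corpus: the word list `COMMON_ENGLISH_WORDS`, the tokenizer and the function name `keep_text` are all hypothetical, and the sketch assumes that the 50-word threshold counts every occurrence of a common English word and that a text of exactly 1000 words is kept. The record does not specify either point.

```python
import re

# Hypothetical list of common English words. The actual list used
# for the corpus is not given in this record.
COMMON_ENGLISH_WORDS = frozenset({
    "the", "of", "and", "to", "in", "is", "that", "for", "with", "as",
    "was", "this", "are", "be", "by", "on", "which", "from", "were", "it",
})

MIN_WORDS = 1000         # shorter texts are left out
MIN_ENGLISH_WORDS = 50   # required number of common English word hits


def tokenize(text):
    """Split text into lowercase alphabetic tokens (an assumed tokenization)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keep_text(text):
    """Return True if the text passes both the length and the English test."""
    words = tokenize(text)
    if len(words) < MIN_WORDS:
        return False
    # Assumption: every occurrence of a common word counts as one hit.
    hits = sum(1 for word in words if word in COMMON_ENGLISH_WORDS)
    return hits >= MIN_ENGLISH_WORDS
```

With these assumptions, a long English text passes, whereas a long Finnish text fails the English-word test and a short text of any language fails the length test.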
