The Suomi 24 Corpus (2017H2) (deprecated)

Suomi 24 -korpus (2017H2) (käytöstä poistunut)

Suomi24-2017H2

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2019010801

This resource has been replaced with version 1.1: http://urn.fi/urn:nbn:fi:lb-2020021801

Compared with VRT version 1.1, this version 1.0
– contains major discrepancies in dependency annotations, resulting from a mistake in the parsing process;
– has only numeric topics (discussion area numbers) for messages, not topic names;
– lacks several other text attributes;
– has many text attributes with different names;
– lacks base forms without compound-boundary markers;
– has base forms with doubly XML-encoded &, < and > (e.g., “&amp;lt;” instead of “&lt;” for <);
– has the data divided into 99 files with at most a million messages each, instead of 17 files divided by the year;
– has thread start messages and their comments in different files;
– has the messages sorted according to the thread or comment id (number), instead of having all the messages of a thread (within a year) consecutively in thread order;
– contains some completely empty messages (text elements with no tokens);
– contains 143 messages with the dummy timestamp “1970-01-01 00:00:00”; and
– contains extra (non-initial) positional-attributes comments.
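
As an illustration of the doubly XML-encoded base forms mentioned in the list above, the following Python sketch removes one surplus level of encoding from &, < and >. It is not part of the corpus tooling: the function names are hypothetical, and the base form column index is an assumption that should be checked against the positional attributes of the file being processed.

```python
import re

# Matches the surplus encoding layer on &, < and > only, e.g. "&amp;lt;".
_DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt);")


def fix_double_encoding(value: str) -> str:
    """Remove one surplus level of XML encoding from &, < and >."""
    return _DOUBLE_ENCODED.sub(r"&\1;", value)


def fix_token_line(line: str, lemma_index: int) -> str:
    """Fix the base form field of one tab-separated VRT token line.

    lemma_index is an assumption supplied by the caller. Structural
    lines (tags and comments, which start with "<") are returned unchanged.
    """
    if line.startswith("<"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if lemma_index < len(fields):
        fields[lemma_index] = fix_double_encoding(fields[lemma_index])
    return "\t".join(fields) + newline


# Hypothetical token line: word form, base form, part of speech.
print(fix_token_line("&lt;\t&amp;lt;\tPunct", 1))  # the base form becomes "&lt;"
```

Note that a base form that legitimately contains the literal text "&lt;" would also appear as "&amp;lt;" in correctly encoded data, so this correction is only appropriate for data known to be doubly encoded, such as the base forms of this version 1.0.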

The corpus is available in Kielipankki - the Language Bank of Finland. License details: http://urn.fi/urn:nbn:fi:lb-20150304151

The corpus contains all the texts available through the Suomi24 API from the discussion forums of the Suomi24 online social networking website from 1.1.2001 to 31.12.2017. Jussi Piitulainen created the tokenized version and then carried out the annotation.

Researchers who have a user name and a password can download the entire corpus in the VRT format.

The corpus was reparsed on 27.12.2019. The dependency parses and relations in search results obtained from 2017H2 before the correction differ significantly from the parses in other corpora parsed earlier with the same parser.


  • Turku Dependency Treebank parser