Finnish News Agency Archive 1992-2018, CoNLL-U, source

STT:n uutisarkisto 1992-2018, CoNLL-U, lähdemateriaali

stt-fi-1992-2018-conllu-src

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2020031201

This is the parsed version of the Finnish News Agency Archive 1992-2018 corpus (http://urn.fi/urn:nbn:fi:lb-2019041501). The corpus was parsed by Khalid Alnajjar (University of Helsinki) using the Turku neural parser pipeline (http://turkunlp.org/Turku-neural-parser-pipeline/).

The Finnish News Agency Archive corpus comprises newswire articles in Finnish sent to media outlets by the Finnish News Agency (STT) between 1992 and 2018. The corpus includes about 2.8 million items in total. Most of the material consists of news articles, ranging from short “news flashes” to telegrams and longer articles. News articles are categorized by department (domestic, foreign, economy, politics, culture, entertainment and sports) as well as by metadata (IPTC subject categories or keywords and location data). The archive also includes other material that STT has created or forwarded, such as news planning lists, sports results, analysis articles and press releases.

The corpus is available as whole texts for non-commercial research through the download service korp.csc.fi/download. Access is granted on the basis of a research plan submitted with the application in Language Bank Rights.

Notes:
-) Headlines and news content were parsed, and the output is in CoNLL-U format.
-) Filenames in the original corpus are preserved; only the file extension was changed. This allows the parsed corpus to be mapped back to the original corpus to obtain additional metadata if needed.
-) Files with the prefix "h_" contain the parsed headline; all other files contain the parsed news content.
-) Not all documents in the corpus contained a headline and/or news content. In such cases, the file was ignored.
-) The corpus contains some English documents, for which the output of the parser is usually incorrect. Language identification could be used to detect these documents and handle them appropriately.
-) UralicNLP (https://github.com/mikahama/uralicNLP/wiki/UD-parser) can be used to read and work with the parsed corpus in Python.
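
As an illustration of the notes above, the following is a minimal sketch of reading a parsed file and relating its name back to the original corpus. It uses only the Python standard library rather than UralicNLP, so it does not reflect that library's API. The function names, the ".conllu" extension, the assumption that "h_" is prepended to the otherwise unchanged original filename, and the sample sentence are all hypothetical and are not taken from the corpus.

```python
from pathlib import Path

# CoNLL-U defines ten tab-separated columns per token line.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(CONLLU_FIELDS):
                tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:
        sentences.append(tokens)
    return sentences


def describe_parsed_file(filename, extension=".conllu"):
    """Return (original_stem, is_headline) for a parsed file name.

    Assumption: the parsed files end in `extension`, and headline files
    carry the "h_" prefix in front of the original file name.
    """
    name = Path(filename).name
    if name.endswith(extension):
        name = name[: -len(extension)]
    is_headline = name.startswith("h_")
    stem = name[2:] if is_headline else name
    return stem, is_headline


# Invented sample sentence, for demonstration only.
SAMPLE = (
    "# text = Sää on hyvä.\n"
    "1\tSää\tsää\tNOUN\t_\t_\t3\tnsubj:cop\t_\t_\n"
    "2\ton\tolla\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\thyvä\thyvä\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sentence])
    print(describe_parsed_file("h_example.conllu"))
```

The stem returned by describe_parsed_file could then be matched against the filenames of the original corpus to look up additional metadata, provided the naming assumptions above hold for the downloaded files.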

Acknowledgments:
-) This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
-) The corpus was processed on the Finnish Grid and Cloud Infrastructure (urn:nbn:fi:research-infras-2016072533).

Licence: http://urn.fi/urn:nbn:fi:lb-2019041502