Helsinki Corpus of Swahili 2.0 (HCS 2.0) Downloadable Annotated Version

Helsinki Swahili -korpus 2.0 (HCS 2.0), ladattava annotoitu versio

hcs-a-v2-dl

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-201803271

Access location:

The corpus is available for Download at Kielipankki - the Language Bank of Finland at http://urn.fi/urn:nbn:fi:lb-201803272

This is the downloadable version of Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version, see http://urn.fi/urn:nbn:fi:lb-2016011301 for more information.

The Helsinki Corpus of Swahili 2.0 Annotated Version, containing about 25 million words, will be available in Kielipankki - the Language Bank of Finland in Download (https://korp.csc.fi/download) for academic use. Students and staff of universities can therefore use the corpus simply by logging in with their university credentials. Alumni can apply for access via https://lbr.csc.fi.

Access rights instructions: https://www.kielipankki.fi/support/access (in Finnish: https://www.kielipankki.fi/tuki/kayttooikeudet/).

Instructions on how to access Kielipankki corpora: https://www.kielipankki.fi/support/corpus-location/ (in Finnish: https://www.kielipankki.fi/tuki/aineiston-sijainti/).

The corpus contains various kinds of linguistic information attached to each token. The corpus was annotated using the Salama Tagger.

Preparation of the material

Most of the corpus material was retrieved from the Web, a method used increasingly as texts became available online. Only texts from news media and open government pages were retrieved. Some types of texts, such as books, were scanned and proofread. Part of the oldest news material, dating from the 1980s before scanners were in use, was typed manually.

The corpus material has gone through a series of formatting and correction routines.

1. The text was converted into ASCII format, as required by the tagger. Web texts use a wide variety of codes for diacritics, and these had to be standardized.
2. The text was proofread and corrected with a spell checker.
3. The proofread text was analyzed to find remaining typos and possible new words.
4. A correction program was constructed that automatically corrects those typos that can be corrected safely. More than 8000 such error types were identified.
5. New words found in the corpus were added to the parser.
6. The texts were corrected using the constructed correction program.
7. Metadata in the text files were formalized.
8. The texts were converted into sentence-per-line format.
9. The text within each file was randomly shuffled to mix the sentence order.
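
As a rough illustration of steps 1, 8 and 9, the following Python sketch folds diacritics to ASCII, splits text into sentences, and shuffles them. It is not the tooling used to build the corpus: the function names, the naive punctuation-based sentence splitter, and the fixed random seed are all assumptions made for this example.

```python
import random
import re
import unicodedata


def to_ascii(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def split_sentences(text: str) -> list[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def shuffle_sentences(sentences: list[str], seed: int = 0) -> list[str]:
    """Return a shuffled copy; the seed only makes this example reproducible."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def prepare(text: str, seed: int = 0) -> str:
    """ASCII-fold, split into sentences, shuffle, and emit one sentence per line."""
    sentences = split_sentences(to_ascii(text))
    return "\n".join(shuffle_sentences(sentences, seed))
```

Because the sentences are shuffled within each file, the output keeps every sentence but no longer preserves the running order of the source text.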

These routines produced the Helsinki Corpus of Swahili 2.0 Not Annotated Version.

That text was then annotated with the Salama Tagger, producing the Korp format of the corpus.

Metadata were added to each file.

Structure of the corpus

HCS 2.0 contains the following types of material:

Old material

1. Books
2. News

New material

1. Bunge
2. News

Old material contains material from before 2003. Much of it is also in the Helsinki Corpus of Swahili 1.0. The main difference is that the earlier corpus included only sections of books, whereas the new corpus includes whole texts. A further difference is that text sections in the old corpus are in their original order, whereas sentences in the new corpus are randomly shuffled.

Most of the new material consists of news texts from 2004-2015. The section ‘Bunge’ contains Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006. Metadata at the beginning of each file give more information, and the file names also hint at the contents.

A word in the annotated corpus normally carries the following types of information:

1. token
2. stem
3. part-of-speech
4. morphological description
5. gloss in English
6. syntactic tag
7. rest of verb description

The last item applies only to verbs.
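
A minimal Python sketch of how one annotated token might be read into a record with these seven fields is given below. The on-disk layout is not described here, so the tab-separated, one-token-per-line format is an assumption, as are the class name, the field names, and the sample line, whose tag values are invented placeholders rather than real Salama Tagger output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    token: str
    stem: str
    pos: str
    morph: str
    gloss: str
    syntax: str
    verb_rest: Optional[str] = None  # present only for verbs


def parse_token_line(line: str, sep: str = "\t") -> Token:
    """Parse one token line; assumes six fields, plus a seventh for verbs."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}")
    # An empty seventh field is treated the same as a missing one.
    verb_rest = fields[6] or None if len(fields) == 7 else None
    return Token(*fields[:6], verb_rest=verb_rest)


# Hypothetical sample line; the tag values are placeholders, not corpus data.
sample = "anasoma\tsoma\tV\tMORPH-PLACEHOLDER\tread\tSYN-PLACEHOLDER\tREST-PLACEHOLDER"
record = parse_token_line(sample)
```

Keeping the verb-only field optional lets the same record type serve verbs and non-verbs alike.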

Detailed license information: See Documentation section below.
