Helsinki Corpus of Swahili 2.0 (HCS 2.0) Downloadable Annotated Version

Last update: 2021-06-15

Helsinki Swahili -korpus 2.0 (HCS 2.0), ladattava annotoitu versio

hcs-a-v2-dl

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-201803271

Access location: http://urn.fi/urn:nbn:fi:lb-201803272

The corpus is available for Download at Kielipankki - the Language Bank of Finland at http://urn.fi/urn:nbn:fi:lb-201803272

This is the downloadable version of Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version, see http://urn.fi/urn:nbn:fi:lb-2016011301 for more information.

The Helsinki Corpus of Swahili 2.0 Annotated Version, which contains about 25 million words, will be available in the Download service (https://korp.csc.fi/download) of Kielipankki - the Language Bank of Finland for academic use. This means that students and staff of universities can use the corpus simply by logging in with their university credentials. Alumni can apply for access via https://lbr.csc.fi.

Access rights instructions: https://www.kielipankki.fi/support/access (in Finnish: https://www.kielipankki.fi/tuki/kayttooikeudet/).

Instructions on how to access Kielipankki corpora: https://www.kielipankki.fi/support/corpus-location/ (in Finnish: https://www.kielipankki.fi/tuki/aineiston-sijainti/).

The corpus contains various kinds of linguistic information attached to each token. The corpus was annotated using the Salama Tagger.

Preparation of the material

Most of the corpus material was retrieved from the Web, a method that was used increasingly as texts became available there. Only texts from news media and open government pages were retrieved. Some types of texts, such as books, were scanned and proofread. Part of the oldest news material, from the 1980s before scanners came into use, was typed manually.

The corpus material has gone through a series of formatting and correction routines.

1. Converting the text into ASCII format, as required by the tagger. Web texts use a wide variety of codes for diacritics, and these had to be formalized.
2. Proofreading and correcting the text with a speller.
3. Analyzing the proofread text to find remaining typos and possible new words.
4. Constructing a correction program that automatically corrects those typos that can be corrected safely. More than 8000 such mistake types were identified.
5. Adding the new words found in the corpus to the parser.
6. Correcting the texts with the constructed correction program.
7. Formalizing the metadata in the text files.
8. Converting the texts into sentence-per-line format.
9. Randomly shuffling the text within each file to mix the sentence order.

The output of these routines constitutes the Helsinki Corpus of Swahili 2.0 Not Annotated Version.

This output was then annotated with the Salama Tagger, producing the Korp format of the corpus.

Metadata were added to each file.
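The formatting routines listed above can be illustrated with a minimal Python sketch. It is not the program used to build the corpus: the function names, the one-entry correction table, and the naive sentence splitter are all assumptions made for illustration, and the real correction program covers more than 8000 mistake types.

```python
import random
import re
import unicodedata

# Hypothetical correction table with a single illustrative entry.
CORRECTIONS = {"serekali": "serikali"}

# Map typographic apostrophes to ASCII so that forms like ng'ombe survive.
QUOTES = str.maketrans({"\u2019": "'", "\u2018": "'"})


def to_ascii(text):
    """Step 1 (sketch): decompose characters and drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text.translate(QUOTES))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def correct_typos(text, corrections):
    """Steps 4 and 6 (sketch): replace whole words found in the table."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)


def split_sentences(text):
    """Step 8 (sketch): naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shuffle_sentences(sentences, seed=None):
    """Step 9 (sketch): return a shuffled copy of the sentence list."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def prepare(text, seed=None):
    """Run the sketched steps in order and return shuffled sentences."""
    text = correct_typos(to_ascii(text), CORRECTIONS)
    return shuffle_sentences(split_sentences(text), seed)
```

Because the sentences within each file are shuffled, the output keeps every sentence but not the original discourse order, which matters for any analysis that depends on context beyond the sentence.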

Structure of the corpus

HCS 2.0 contains the following types of material:

Old material

1. Books
2. News

New material

1. Bunge
2. News

The old material dates from before 2003. Much of it is also in Helsinki Corpus of Swahili 1.0. The main difference is that the earlier corpus included only sections of books, whereas the new corpus includes whole texts. The other difference is that text sections in the old corpus are in their original order, whereas sentences in the new corpus are randomly shuffled.

Most of the new material consists of news texts from 2004-2015. The section ‘Bunge’ contains Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006. The metadata at the beginning of each file give more information, and the file names also hint at the contents.

A word in the annotated corpus normally contains the following types of information:

1. token
2. stem
3. part-of-speech
4. morphological description
5. gloss in English
6. syntactic tag
7. rest of verb description

The last point concerns only verbs.
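As a rough illustration, the sketch below reads one token line into named fields. It assumes that the downloadable files follow the tab-separated VRT layout used by Korp, with one token per line, structural tags on lines starting with `<`, and the attributes in the order listed above. That column order, the `Token` type, the `parse_vrt_line` function, and the sample values are assumptions for illustration and should be checked against the corpus manual.

```python
from typing import NamedTuple, Optional

FIELD_COUNT = 7


class Token(NamedTuple):
    """One annotated word; field order is an assumption, not confirmed."""
    token: str
    stem: str
    pos: str
    morph: str
    gloss: str
    syntax: str
    verb_rest: str  # expected to be empty for non-verbs


def parse_vrt_line(line: str) -> Optional[Token]:
    """Return a Token for a token line, or None for blank or tag lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("<"):
        return None  # structural tag or empty line
    fields = line.split("\t")
    # Pad short lines so that missing trailing fields become empty strings.
    fields += [""] * (FIELD_COUNT - len(fields))
    return Token(*fields[:FIELD_COUNT])


# Made-up sample line; MORPH, SYN and REST are placeholders, not real tags.
sample = "anasoma\tsoma\tV\tMORPH\tread\tSYN\tREST\n"
parsed = parse_vrt_line(sample)
```

A named tuple keeps the fields positional, which mirrors the column layout, while still allowing access by name such as `parsed.stem`.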

Detailed license information: See Documentation section below.

Distribution

Availability

Available - Restricted Use

Licence

CLARIN ACA - NC

Restrictions: Academic - Non Commercial Use, Attribution

User Nature: Academic

Attribution Details: See Documentation section

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Helsinki

IPR Holder

Hurskainen Arvi

Contact Person

User support at CSC - IT Center for Science Ltd. The Language Bank of Finland

text

Monolingual text corpus

Languages

Swahili

Linguality

Linguality type: Monolingual

Size

25,000,000 Tokens

Resource Creation

Resource Creator

Hurskainen Arvi

Metadata

Created: 03/23/2018

Last Updated: 06/15/2021

Metadata Language: English (en)

Revision: attribution details fixed

Metadata Creator

Hanna Westerlund

Ute Dieckmann

Relation

Related Resource: Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version http://urn.fi/urn:nb...

Relation Type: IsVariantFormOf

Documentation

How to cite: https://www.kielipan...

Document Type: Manual

Helsinki Corpus of Swahili 2.0 (HCS 2.0) Manual, https://www.kielipan...

Resource group page: http://urn.fi/urn:nb...

Document Type: Manual

Jyrki Niemi, The Korp corpus input format, http://urn.fi/urn:nb...

Keywords: VRT

Document Type: Other

License information, http://urn.fi/urn:nb...
