Aalto Finnish Parliament ASR Corpus 2008-2020

Aallon puheentunnistuskorpus eduskunnan istunnoista 2008-2020

fi-parliament-asr

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2021051903

NOTE: there is a newer version available: http://urn.fi/urn:nbn:fi:lb-2022052002. The older version is available on request.

This corpus is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group. The original session transcripts and videos are available at the web
portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi). The
corpus is split into three parts:
1. 2015-2020 set
2. 2008-2016 set
3. Development and evaluation sets

A non-overlapping combination of the 2008-2016 set and the 2015-2020 set forms a training set of size:
- 1 422 318 sample pairs
- 3 130 hours of speech
- 19 356 831 word tokens

All audio files in this corpus are single-channel wavs with sample rate 16 kHz and 16-bit precision.
The transcript files (.trn) are plain text files.
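
The stated audio format can be checked with Python's standard `wave` module. The following is a minimal sketch; the function name `check_wav_format` is hypothetical and not part of the corpus tooling.

```python
import wave


def check_wav_format(path):
    """Return True if the file is single-channel, 16 kHz, 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1            # single channel
            and wav.getframerate() == 16000    # 16 kHz sample rate
            and wav.getsampwidth() == 2        # 16-bit = 2 bytes per sample
        )
```

Running it over a handful of files after download is a quick way to confirm that nothing was resampled or converted along the way.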

---
## 2015-2020 set
This subset is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group in 2021.

The tools and code used to produce this subset:
- Preprocessing and postprocessing: https://github.com/aalto-speech/fi-parliament-tools
- Decoding and segmentation: Kaldi, https://github.com/kaldi-asr/kaldi

### Data
This subset contains samples of speech (.wav) and their corresponding transcripts (.trn) from sessions
between 1/2015 and 104/2020. A few sessions that had a broken or empty session transcript are left out,
so the session range has some gaps. Samples are grouped by session. Each filename is formed from the
following components:

> Filename (Kaldi-compatible utterance id): <mpid>-<session_number>-<session_year>-<startsec>-<endsec>
> e.g.: 00259-001-2015-00186868-00187044

Further details:

| Component | Definition |
|:----------------:|:----------------------------------------------------------------------------------------------------------------------------:|
| <mpid> | The unique Member of Parliament identifier given to the MPs in the parliament's public databases. |
| <session_number> | A running number given to the plenary session which together with the working year uniquely identifies the session. |
| <session_year> | The parliamentary working year of the session. In election years, the working year differs from the calendar year. |
| <startsec> | The start timestamp of the segment in the full plenary session audio. The value is in seconds with two decimals and no decimal point: 00186868 = 1868.68 s |
| <endsec> | Like start timestamp, this marks the end timestamp of the segment in the original audio. |
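
The filename scheme above can be split into its components programmatically. This is a minimal sketch, assuming the filename follows the pattern exactly; the names `Utterance` and `parse_utterance_id` are hypothetical and do not come from fi-parliament-tools.

```python
import os
from typing import NamedTuple


class Utterance(NamedTuple):
    mpid: str            # kept as a string to preserve leading zeros
    session_number: int
    session_year: int
    start_sec: float
    end_sec: float


def parse_utterance_id(name: str) -> Utterance:
    """Parse '<mpid>-<session_number>-<session_year>-<startsec>-<endsec>'."""
    # Drop any directory part and a file extension such as .wav or .trn.
    utt_id = os.path.splitext(os.path.basename(name))[0]
    mpid, number, year, start, end = utt_id.split("-")
    # Timestamps are stored in hundredths of a second.
    return Utterance(mpid, int(number), int(year), int(start) / 100, int(end) / 100)


utt = parse_utterance_id("00259-001-2015-00186868-00187044")
print(utt.start_sec, utt.end_sec)              # 1868.68 1870.44
print(round(utt.end_sec - utt.start_sec, 2))   # 1.76 (segment length in seconds)
```

Because the MP identifier is the prefix of the utterance id, sorting the ids also groups them by speaker.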

This subset is machine-extracted, so some inaccuracies remain in the samples. The audio quality
also varies.

### Statistics
In total, there are:
- 984 676 sample pairs
- 1 780 hours of speech
- 11 234 724 word tokens

### Note about MPIDs
There is one speaker in this subset who is not an MP, Risto Hiekkataipale (MPID: 00002). His MPID
is arbitrary. The 2008-2016 set and dev-eval set use different speaker IDs. A mapping is provided in
`speaker-id-mapping.csv`.

---
## 2008-2016 set
This subset is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group in 2017.

Code used to produce this subset:
- https://github.com/aalto-speech/finnish-parliament-scripts

### Data
This subset contains samples of speech (.wav) and their corresponding transcripts (.trn) from sessions
between 71/2008 and 77/2016. A list of the samples from sessions held in 2008-2014 that do not overlap
with samples in the 2015-2020 set is provided in `2008-2014-samples.list`. Samples are grouped by speaker.
Each filepath is formed from the following components:

> Utterance id: <speaker-id>/<speaker-name>_<sample-id>
> e.g.: 0004/aila_paloniemi_00045.wav

Further details:

| Component | Definition |
|:--------------:|:-----------------------------------------------:|
| <speaker-id> | A number identifier for the speaker. |
| <speaker-name> | Speaker's name in "firstname_lastname" order. |
| <sample-id> | A number identifier assigned to each sample. |
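
The filepath components can be separated with a few lines of Python. This is a minimal sketch that assumes the sample id never contains an underscore; the function name `parse_sample_path` is hypothetical and not part of finnish-parliament-scripts.

```python
from pathlib import PurePosixPath


def parse_sample_path(path):
    """Split '<speaker-id>/<speaker-name>_<sample-id>.wav' into its parts."""
    p = PurePosixPath(path)
    speaker_id = p.parent.name  # empty string when the path has no directory
    # Split from the right: the speaker name itself contains underscores.
    speaker_name, sample_id = p.stem.rsplit("_", 1)
    return speaker_id, speaker_name, sample_id


print(parse_sample_path("0004/aila_paloniemi_00045.wav"))
# ('0004', 'aila_paloniemi', '00045')
```

Splitting on the last underscore matters because a speaker name can have more than two parts. All identifiers are kept as strings so that leading zeros survive.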

This subset is machine-extracted, so some inaccuracies remain in the samples. The audio quality
also varies. A mapping to the 2015-2020 set MP IDs is provided in `speaker-id-mapping.csv`.

### Splits
The paper "Automatic Construction of the Finnish Parliament Speech Corpus" by Mansikkaniemi et al.
(see citation) uses training splits which are defined in the following files:
- `parl-all.train.list`
- `parl-400.train.list`
- `parl-60min.train.list`
- `parl-30min.train.list`

### Statistics
In total, there are:
- 522 543 sample pairs
- 1 560 hours of speech
- 9 743 296 word tokens (in .trn files)

In the 2008-2014 subset, there are:
- 437 642 sample pairs
- 1 350 hours of speech
- 8 122 107 word tokens (in .trn files)
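
As a consistency check, the 2008-2014 figures and the 2015-2020 figures add up to the combined training set size given at the top of this document. The short sketch below only repeats that arithmetic; the dictionary names are illustrative.

```python
# Figures copied from the Statistics sections of this document.
set_2015_2020 = {"sample_pairs": 984_676, "hours": 1_780, "word_tokens": 11_234_724}
set_2008_2014 = {"sample_pairs": 437_642, "hours": 1_350, "word_tokens": 8_122_107}

# Sum the two non-overlapping subsets key by key.
combined = {key: set_2015_2020[key] + set_2008_2014[key] for key in set_2015_2020}
print(combined)
# {'sample_pairs': 1422318, 'hours': 3130, 'word_tokens': 19356831}
```

The sums match the stated 1 422 318 sample pairs, 3 130 hours and 19 356 831 word tokens, which is consistent with the training set being the 2015-2020 set plus the samples listed in `2008-2014-samples.list`.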

### Text data
This subset comes with a 20 million word token in-domain text corpus in the file `parl-transcripts.train`.
The text corpus is extracted from the 2008-2016 session transcripts.

---
## Development and evaluation sets
This subset contains the dev and eval sets for the Finnish Parliament ASR corpus. Both the dev and eval
sets have been cleaned and corrected by hand.

Code used to produce this subset:
- https://github.com/aalto-speech/finnish-parliament-scripts

### Data
This subset contains samples of speech (.wav) and their corresponding transcripts (.trn) from the
same sessions as the 2008-2016 subset. The samples are split into seen and unseen speakers. Read
more about the seen/unseen split in the paper "Automatic Construction of the Finnish Parliament
Speech Corpus" by Mansikkaniemi et al. (see citation below). Each filename is formed from the
following components:

> Utterance id: <speaker-name>_<sample-id>
> e.g.: anne_mari_virolainen_04297.wav

Further details:

| Component | Definition |
|:--------------:|:-----------------------------------------------:|
| <speaker-name> | Speaker's name in "firstname_lastname" order. |
| <sample-id> | A number identifier assigned to each sample. |

A mapping that connects `<speaker-name>` to the speaker IDs used in training sets 2008-2016 and
2015-2020 is provided in `dev-eval-speakers.csv`.

### Splits
The paper "Automatic Construction of the Finnish Parliament Speech Corpus" uses seen and unseen
speaker splits for dev and eval sets. These splits are defined in the files (subset duration,
HH:MM:SS, in parentheses):
- `seen_dev.list` (2:36:51)
- `seen_eval.list` (2:53:51)
- `unseen_dev.list` (2:45:27)
- `unseen_eval.list` (2:48:20)
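
The total amount of hand-corrected audio follows from the four durations above. The sketch below sums them with the standard `datetime` module; the helper name `parse_hms` is hypothetical.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Durations copied from the list above.
durations = {
    "seen_dev.list": "2:36:51",
    "seen_eval.list": "2:53:51",
    "unseen_dev.list": "2:45:27",
    "unseen_eval.list": "2:48:20",
}

total = sum((parse_hms(value) for value in durations.values()), timedelta())
print(total)  # 11:04:29
```

Together the dev and eval splits therefore cover a little over 11 hours of speech.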

More details in the paper.

---
## Citation
The 2008-2016 set and dev-eval set are detailed in the following publication:

```
@conference{Aaltodoc:http://urn.fi/URN:NBN:fi:aalto-201710157137,
title={Automatic Construction of the Finnish Parliament Speech Corpus},
author={Mansikkaniemi, Andre and Smit, Peter and Kurimo, Mikko},
year={2017-08},
language={en},
pages={3762-3766},
keyword={automatic speech recognition; speech-to-text alignment; DNN acoustic models; parliament speech dat; transcribed speech corpus},
series={Interspeech 2017},
doi={10.21437/Interspeech.2017-1115},
url={http://urn.fi/URN:NBN:fi:aalto-201710157137},
}
```

---
## License
See the `LICENSE.md` file.

---
## Contact
Authors: Anja Virkkunen, André Mansikkaniemi, and Mikko Kurimo of the Aalto Speech Recognition Group
Contact via kielipankki@csc.fi
