Persistent Identifier of this document (to bookmark): urn:nbn:fi:lb-202311261
- Out-of-copyright newspapers, periodicals and eventually everything since 1771.
- Comes with page scans and OCRed text.
- This collection is periodically harvested from NLF's OAI-PMH interface, and is intended to ultimately provide all of NLF's freely available material as it becomes available.
- Metadata comes as METS, text and layout as ALTO, and page scans as JPEG 2000 files.
- The dataset is currently based on collection 861 at the NLF.
Approximate figures as of 2023:
- ~700K bindings (one binding typically being one issue of a newspaper or periodical, or a book)
- ~4M pages
- ~10 TB
The data set is downloaded based on combined metadata from the National Library of Finland OAI-PMH interface (lists of bindings) and the METS files for each binding (number of pages, file formats). If the METS for a binding is not downloadable at the time of harvesting, neither the METS nor any pages from the binding will be present in the data set. Similarly, if any page, in either ALTO or image format, cannot be downloaded from digi.kansalliskirjasto.fi, whether temporarily or permanently, the binding will lack those files.
The METS files are also not a 1-to-1 match to the public data. Currently we attempt to download all image files whose fileGrp's USE is either Images or reference and whose ID is IMGGRP or ACIMGGRP. Both available and unavailable ALTO files look the same in the METS, so currently they are downloaded based on the access image page numbers. The scanned page files on the file system can be matched to their METS metadata by matching the number in the file name to the SEQ attribute of the corresponding file element in the METS. If you suspect that some data is missing, please contact the Language Bank of Finland.
At the moment, the harvesting does not automatically produce a list of files that were not available for download, but one will be made available later.
For immediate access on the shared file system, the newest dataset is kept in the directory /scratch/project_2006633/nlf-harvester/zip/. That directory should be accessible to all users on Puhti.
The recommended way to process the data on Puhti is to use a suitable compute node with an SSD drive and extract parts of the data to that drive. On interactive compute nodes (launched with sinteractive), the SSD is available as $TMPDIR, and for batch jobs as $LOCAL_SCRATCH.
For example, the following command will extract binding 1416885 into $TMPDIR:
unzip /scratch/project_2006633/nlf-harvester/zip/col-861_14.zip "*/1416885/1416885/*" -d $TMPDIR
The following will extract all metadata files:
for filename in /scratch/project_2006633/nlf-harvester/zip/*.zip; do unzip "$filename" "*METS.xml" -d "$TMPDIR"; done
The resulting target directories will have segmented paths, like 1/16/163/1631/16318/16318/mets/16318_METS.xml, and you can extract multiple times with different filters into the same target directory.
For more advanced searching and extracting, consider extracting the metadata files as above and using them to generate a list of binding IDs of interest. See our example of parsing and matching, and then use either unzip or, e.g., Python's zipfile library to extract the files you want.
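As a rough sketch of the zipfile route, the following function extracts every file belonging to a list of binding IDs from one zip. The function name `extract_bindings` is hypothetical. It relies on the full binding ID appearing twice in a row in each path, as in the unzip pattern "*/1416885/1416885/*" above.

```python
import zipfile
from typing import Iterable, List


def extract_bindings(zip_path: str, binding_ids: Iterable[str],
                     target_dir: str) -> List[str]:
    """Extract all files of the given bindings and return their member names."""
    # A binding's files sit under .../<id>/<id>/, so look for that marker.
    markers = ["/{0}/{0}/".format(binding_id) for binding_id in binding_ids]
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Prepend "/" so the marker also matches at the start of the path.
            padded = "/" + name
            if any(marker in padded for marker in markers):
                archive.extract(name, target_dir)
                extracted.append(name)
    return extracted
```

On Puhti the target directory would typically be the value of $TMPDIR or $LOCAL_SCRATCH, read for example with `os.environ["TMPDIR"]`.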
Warning
The dataset will be updated periodically to keep up with newly digitized bindings and remove bindings that are no longer available for public access. This means that the dataset can change while your computations are in progress, which can lead to your analysis crashing or producing inconsistent results.
Currently the only way to see if that has happened is to check the last modification time of the zip files by running ls -l /scratch/project_2006633/nlf-harvester/zip/ on Puhti: if it is before your job started, your results have not been affected by an update.
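The same check can be done from inside a job. The sketch below assumes you record a start time (seconds since the epoch, e.g. from `time.time()`) before reading any data. The function name `zips_modified_since` is hypothetical.

```python
from pathlib import Path
from typing import List


def zips_modified_since(zip_dir: str, start_time: float) -> List[str]:
    """Return names of .zip files modified at or after start_time."""
    return sorted(
        path.name
        for path in Path(zip_dir).glob("*.zip")
        if path.stat().st_mtime >= start_time
    )
```

An empty list at the end of the job suggests that no zip file was updated while the job was running. A non-empty list names the files whose contents may have changed.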
Versions of the data set will also be made available as restic backups on Allas. The retention policy of the previous versions is still open, so don't rely on old versions being available in the long term.
See versioning.md for more information about previous versions of the dataset and how to access them.
Each .zip file in the distribution contains all the bindings whose IDs share a particular prefix. For example, the file col-861_13.zip contains all the bindings with IDs beginning with 13, like 130010 and 1329879, while col-861_3.zip contains bindings with IDs beginning with 3, like 30038 and 399915.
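Given a list of zip file names, the zip holding a particular binding can be picked by its prefix. The sketch below assumes that the prefix is the run of digits after the last underscore in the file name, and that if several prefixes match, the longest one is the right one. Neither assumption is confirmed here, and the function name `zip_for_binding` is hypothetical.

```python
from typing import Iterable, Optional


def zip_for_binding(binding_id: str, zip_names: Iterable[str]) -> Optional[str]:
    """Return the zip name whose numeric prefix best matches binding_id."""
    best_name = None
    best_length = 0
    for name in zip_names:
        stem = name.rsplit(".", 1)[0]      # e.g. "col-861_13"
        prefix = stem.rsplit("_", 1)[-1]   # e.g. "13"
        if not prefix.isdigit():
            continue
        # Assumption: prefer the longest matching prefix.
        if binding_id.startswith(prefix) and len(prefix) > best_length:
            best_name = name
            best_length = len(prefix)
    return best_name
```

The list of names can be obtained with, for example, `os.listdir("/scratch/project_2006633/nlf-harvester/zip/")`. The function returns None when no prefix matches.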
Inside the zips, each binding is in its own directory, named with the binding ID number, and found at the end of a directory hierarchy. In the hierarchy, the first directory contains the first digit of the binding IDs contained within it, and each successive directory name has one more digit from the binding IDs contained. So the files for binding ID 123012 are found in the directory 1/12/123/1230/12301/123012/123012. Note that the full binding ID appears twice at the end of the hierarchy: this prevents the same directory from containing both the files for an individual binding and further directory hierarchy for other bindings.
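The directory for a binding can be computed from its ID alone. The following sketch (with the hypothetical name `binding_dir`) builds one path segment per ID prefix and then repeats the full ID:

```python
def binding_dir(binding_id: str) -> str:
    """Return the directory path of a binding inside its zip file."""
    # One segment per prefix: "1", "12", "123", ... up to the full ID.
    prefixes = [binding_id[:length] for length in range(1, len(binding_id) + 1)]
    # The full ID is repeated once more as the final segment.
    return "/".join(prefixes + [binding_id])


print(binding_dir("123012"))  # 1/12/123/1230/12301/123012/123012
```

Appending mets/, alto/ or access_img/ to the result gives the location of the corresponding files for that binding.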
For each binding, we have three kinds of files:
- A METS file containing metadata about the binding, under mets/
- ALTO files containing the text and layout information as reported by the OCR performed by the National Library of Finland, under alto/
- The image files (most often JPEG2000 files, ending in .jp2, but other formats such as TIFF may also be present) containing the scanned pages, under access_img/
The files for a four-page binding with ID 123 would be found in the zip with the following directory structure:
.
└── 1/
    └── 12/
        └── 123/
            └── 123/
                ├── mets/
                │   └── 123_METS.xml
                ├── alto/
                │   ├── 00001.xml
                │   ├── 00002.xml
                │   ├── 00003.xml
                │   └── 00004.xml
                └── access_img/
                    ├── 00001.jp2
                    ├── 00002.jp2
                    ├── 00003.tif
                    └── 00004.jp2