Sauris German ASR dataset

Full Official Name: Sauris German dataset
Submission date: Oct. 20, 2023, 4:32 p.m.

The dataset consists of 46 audio files of Sauris German (SG) speech and the corresponding texts, collected in July 2023. Texts were taken from Insera börtlan (Comitato unitario delle isole linguistiche storiche germaniche in Italia, 2013), a didactic tool for young Saurans containing sentences and short tales, and from Schneider (2020), a collection of stories written by various community members. To avoid prosodic problems, we did not include nursery rhymes or songs. The texts were divided into short, self-contained sentences of approximately ten words each, which six speakers were asked to read aloud. The speakers were all over 50 years old, gender-balanced, and able to read SG. Readers were given comparable sets of sentences, with 20% juxtapositions.

We saved the audio in mono MP3 format with a sampling rate of 16 kHz. The recordings, ≈85 minutes in total, are segmented into 1053 audio files, each corresponding to one utterance of ≈5 seconds. The corresponding texts have a mean length of approximately 9 words and 50 characters.

The CSV file contains a column with the audio file paths, a column with the transcribed sentences, and a supplementary column with the speaker ID (F stands for female, M for male).
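The CSV layout described above can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution: the column names `path`, `sentence`, and `speaker_id`, the assumption that each speaker ID begins with `F` or `M`, and the two placeholder rows are all hypothetical and should be adapted to the actual file.

```python
import csv
import io
from collections import Counter


def summarize(csv_text, text_col="sentence", speaker_col="speaker_id"):
    """Compute simple corpus statistics from CSV text.

    The column names are assumptions; the dataset description does not
    state the actual header names.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    if n == 0:
        raise ValueError("no rows found")
    # Mean sentence length in whitespace-separated words and in characters.
    mean_words = sum(len(r[text_col].split()) for r in rows) / n
    mean_chars = sum(len(r[text_col]) for r in rows) / n
    # Assumes the speaker ID starts with F (female) or M (male).
    by_gender = Counter(r[speaker_col][:1].upper() for r in rows)
    return {
        "utterances": n,
        "mean_words": mean_words,
        "mean_chars": mean_chars,
        "by_gender": dict(by_gender),
    }


# Placeholder rows for illustration only; these are not real dataset entries.
sample = (
    "path,sentence,speaker_id\n"
    "clips/a.mp3,one two three,F1\n"
    "clips/b.mp3,four five,M1\n"
)
stats = summarize(sample)
print(stats)
```

Run on the real CSV (read with `open(..., encoding="utf-8", newline="").read()`), the result should be close to the figures reported above: 1053 utterances, about 9 words and 50 characters per sentence. The stated durations are also mutually consistent, since 85 minutes divided by 1053 files gives roughly 4.8 seconds per utterance.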

