The inaugural Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.
This dataset contains the development and evaluation sets for the two tasks in the 2023 MERLIon CCS Challenge, a special session at INTERSPEECH 2023 (Theme: 'Inclusive Spoken Language Science and Technology – Breaking Down Barriers').
As video calls become increasingly ubiquitous, we present a first-of-its-kind Zoom video call dataset. The MERLIon CCS Challenge tackles automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, where parents narrated an onscreen wordless picturebook to their child. The main objectives of this inaugural challenge are:
• to benchmark current and novel language identification and language diarization systems in a code-switching scenario that includes extremely short utterances;
• to test the robustness of such systems under accented speech;
• to challenge the research community to propose novel solutions in terms of adaptation, training, and novel embedding extraction for this particular set of tasks.
The challenge features two tasks: language identification (Task 1) and language diarization (Task 2). Two tracks, open and closed, are available; they differ in the data used during system training. More information can be found in the MERLIon CCS Challenge Evaluation Plan and the MERLIon CCS Challenge GitHub repository.
The public release of the Challenge audio data includes minor revisions made after the conclusion of the challenge, affecting no more than 0.0001% of the labeled data.
Due to the nature of the audio and the data release agreement with the participants, all downloads from this repository require agreement to the terms of use. To preview the metadata associated with the datasets contained here, you can access the documentation here without downloading any files.
This collection contains two versions of the data: a legacy archive (LEGACY_ARCHIVE) containing all original files in their original file structure, and a set of download files (DOWNLOAD_FILES) formatted for efficient download. In the section below, click Tree view to see the file structure.