LT World

Spoken Language Corpora

Handbook of Standards and Resources for Spoken Language Systems.
Dafydd Gibbon, Roger Moore and Richard Winski (eds.).
Walter de Gruyter. Berlin, Germany. 1997.

Language Resources and Evaluation Conference.


  • Bavarian Archive for Speech Signals (BAS)
  • International Committee for Co-ordination and Standardisation of Speech Databases (COCOSDA)
  • Speech Processing Expertise Center (SPEX)
  • European Language Resources Association (ELRA)
  • Linguistic Data Consortium (LDC)

  • BAS Infrastructures for Technical Speech Processing (BITS)
  • HCRC MapTask Corpus
  • SpeechDAT
  • Annotation Graph Toolkit (AGTK)
  • British National Corpus (BNC)
  • The International Corpus of English (ICE)

Spoken language corpora are collections of recorded speech, generally accompanied by transcriptions of the speech and of non-speech noises, and by annotations at different linguistic levels. A corpus can contain read speech, spontaneous speech, or dialogues, and can be recorded under different conditions with regard to microphone, environment (e.g., laboratory, office, background noise), and transmission channel (e.g., telephone, broadcast). Speech corpora serve a range of purposes, including the training and evaluation of speech recognisers, phonetic and phonological research, dialect research, dialogue research, and speech synthesis.
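
The description above can be read as a simple data model: each recording carries metadata about its recording conditions plus time-aligned annotations on several levels. The following Python sketch is only an illustration of that reading. The class names, field names, level names, and the file name `rec001.wav` are all hypothetical and do not correspond to the format of any corpus or toolkit listed on this page.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """One time-aligned label on a given level (hypothetical schema)."""
    level: str    # e.g. "orthographic", "phonetic", "noise"
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str


@dataclass
class Recording:
    """One recorded item plus its recording conditions (hypothetical schema)."""
    audio_path: str
    speech_style: str  # "read", "spontaneous" or "dialogue"
    microphone: str
    environment: str   # e.g. "laboratory", "office"
    channel: str       # e.g. "telephone", "broadcast"
    annotations: List[Annotation] = field(default_factory=list)

    def labels(self, level: str) -> List[str]:
        """Return the labels on one annotation level, ordered by start time."""
        tier = [a for a in self.annotations if a.level == level]
        return [a.label for a in sorted(tier, key=lambda a: a.start)]


def group_by_channel(corpus: List[Recording]) -> Dict[str, List[Recording]]:
    """Group recordings by transmission channel, e.g. to build a test set."""
    groups: Dict[str, List[Recording]] = defaultdict(list)
    for rec in corpus:
        groups[rec.channel].append(rec)
    return dict(groups)


# Illustrative data only; not taken from any real corpus.
rec = Recording("rec001.wav", "read", "headset", "office", "telephone")
rec.annotations.append(Annotation("orthographic", 0.5, 0.9, "world"))
rec.annotations.append(Annotation("orthographic", 0.0, 0.4, "hello"))
rec.annotations.append(Annotation("noise", 0.4, 0.5, "<breath>"))
```

Keeping each annotation level as a separate tier over the same time axis lets one recording serve several of the purposes named above: a recogniser can be trained on the orthographic tier while phonetic research uses a finer-grained tier of the same signal.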