LT World

TDT2 Mandarin Audio Corpus

Contains a total of 1,036 waveform files.

  • Chinese

  • Monolingual

Topic Detection and Tracking Multilingual Corpus

  • Speech Corpus

  • Spoken

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking). This CD-ROM release consists of the English and Mandarin text components of the TDT2 corpus. The data were collected daily over a period of six months (January to June 1998) from the following sources: American Broadcasting Company (ABC); Associated Press; Cable News Network, Inc. (CNN); New York Times; Public Radio International (PRI); Voice of America (VOA); Xinhua News Agency; ZaoBao News. The two subcorpora were also released separately, cf. TDT2 English Text corpus Version 2 and TDT2 Mandarin Text Corpus.



The TDT2 Audio Corpus contains a total of 1,036 waveform files. Each file is a complete single-channel recording of a 30- or 60-minute broadcast, digitized at a sample rate of 16 kHz using 16-bit samples.
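
As a rough illustration of what these recording parameters imply, the sketch below estimates the size of one broadcast as uncompressed PCM audio. It assumes raw samples with no file header and no compression; the source does not state the on-disk file format, so actual file sizes may differ. The function name `pcm_size_bytes` is hypothetical and used here only for illustration.

```python
def pcm_size_bytes(minutes, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed PCM audio of the given duration.

    Defaults follow the corpus description: 16 kHz, 16-bit, single channel.
    Assumption: raw samples only, no header and no compression.
    """
    seconds = minutes * 60
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


# The corpus description mentions 30- and 60-minute broadcasts.
for minutes in (30, 60):
    size = pcm_size_bytes(minutes)
    print(f"{minutes}-minute broadcast: {size:,} bytes (~{size / 1e6:.1f} MB)")
```

Under these assumptions, a 30-minute recording comes to 57,600,000 bytes and a 60-minute recording to 115,200,000 bytes.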

Available through LDC membership or for a price of $2,500.

  • Linguistic Data Consortium (LDC)