LT World


TDT2 English Audio Corpus


contains a total of 1,036 waveform files

network news

  • English

  • Monolingual

Topic Detection and Tracking Multilingual Corpus

  • Speech Corpus

  • Spoken

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking). For further information on TDT2 please visit our TDT2 Information Pages.

 

Data

The TDT2 Audio Corpus contains a total of 1,036 waveform files. Each file is a complete single-channel recording of a 30- or 60-minute broadcast, digitized at a sample rate of 16 kHz using 16-bit samples.
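The figures above allow a rough estimate of how much sample data one recording holds. The following is a minimal sketch, assuming uncompressed linear PCM and counting only the sample payload; the on-disk file format, any header, and any compression are not stated here, so actual file sizes may differ. The names `pcm_bytes` and `duration_minutes` are hypothetical helpers introduced for illustration.

```python
# Parameters taken from the corpus description above.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sample rate
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # single-channel recordings

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def pcm_bytes(minutes: int) -> int:
    """Size of the raw sample payload for a recording of the given length.

    Assumption: uncompressed PCM, no header or container overhead.
    """
    return minutes * 60 * BYTES_PER_SECOND


def duration_minutes(payload_bytes: int) -> float:
    """Inverse of pcm_bytes: recording length implied by a payload size."""
    return payload_bytes / BYTES_PER_SECOND / 60


if __name__ == "__main__":
    for minutes in (30, 60):
        print(f"{minutes}-minute broadcast: {pcm_bytes(minutes):,} bytes of samples")
```

Under these assumptions a 30-minute broadcast corresponds to 57,600,000 bytes of sample data and a 60-minute broadcast to 115,200,000 bytes.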

 

 

The four broadcast sources represented in the corpus are as follows:

 

Source  Program             Format/frequency
------------------------------------------------------------------------------------
ABC     World News Tonight  "traditional" network news, 30 minutes/day
CNN     Headline News       continuous news summaries, up to four 30-minute samples/day
PRI     The World           "in-depth" radio news, 60 minutes/weekday
VOA     varied              60-minute news programs, up to 2/day

--------------------------------------------------------------------------------------------

 

This CD-ROM release consists of the English and Mandarin text components of the TDT2 corpus. The data were collected daily over a period of six months (January-June 1998) from the following sources: American Broadcasting Company (ABC); Associated Press; Cable News Network, Inc. (CNN); New York Times; Public Radio International (PRI); Voice of America (VOA); Xinhua News Agency; ZaoBao News. The two subcorpora were also released separately; cf. TDT2 English Text Corpus Version 2 and TDT2 Mandarin Text Corpus.


Available through LDC membership or for a price of $2500.


http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S84

  • Linguistic Data Consortium (LDC)