Multilanguage Text Version 4.0 (TDT2)
TDT2 Multilanguage Text Corpus Version 4.0 contains news data collected daily from nine news sources in two languages (American English and Mandarin Chinese) over a period of six months (January-June 1998). Both manually created reference text and automatically generated text (ASR and/or machine translation) are provided for all broadcast and all Mandarin data.
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: finding topically homogeneous sections (segmentation), detecting the occurrence of new events (detection), and tracking the reoccurrence of old or new events (tracking). For further information on TDT2, please visit our TDT2 Information Pages.
This release consists of the English and Mandarin text components of the TDT2 corpus. The data was collected daily over a period of six months (January-June 1998) from the following sources:
- American Broadcasting Company (ABC)
- Associated Press
- Cable News Network, Inc. (CNN)
- New York Times
- Public Radio International (PRI)
- Voice of America (VOA)
- Xinhua News Agency
- ZaoBao News
- Audio CD
- Linguistic Data Consortium (LDC)
Non-member Fee: US $1000.00, Reduced-License Fee: US $500.00, Extra-Copy Fee: US $150.00
- Charles Wayne
topic detection and tracking