The Penn Treebank (PTB)
POS, skeletal syntactic bracketing, argument-predicate structure
syntactic trees, PTB Bracketing
PTB and then corrected by human annotators
- University of Pennsylvania
- Department of Information Engineering and Computer Science
The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information, forming a bank of linguistic trees. It also annotates text with part-of-speech tags and, for the Switchboard corpus of telephone conversations, with dysfluency annotation.
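As an illustration of the bracketed tree notation, here is a minimal Python sketch that reads one PTB-style bracketed parse and lists its (word, tag) pairs. The function names `parse_bracketed` and `tagged_words` and the example sentence are hypothetical, made up for this illustration and not taken from the corpus or from any Treebank tool. The sketch assumes a single tree with a labeled root, so it does not handle the extra unlabeled outer parentheses found in the distributed files.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumption: the input is a single tree whose root carries a label,
    e.g. "(S (NP ...) (VP ...))".
    """
    # Pad the parentheses with spaces so a plain split() yields tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("expected a label at token %d" % pos)
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing closing ')'")
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def tagged_words(tree):
    """Collect (word, tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical example sentence, not taken from the corpus.
example = "(S (NP (DT The) (NN market)) (VP (VBD fell)) (. .))"
print(tagged_words(parse_bracketed(example)))
```

For work with the actual distribution, an established corpus reader is likely a better choice than a hand-written parser like this one.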
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via FTP; they give the correspondence between the 2,499 PTB filenames and the WSJ DOCNO strings in TIPSTER.
Other Web links:
The Penn Treebank Project: http://www.cis.upenn.edu/~treebank/
Penn Treebank Online: http://www.ldc.upenn.edu/ldc/online/treebank/
1 million words
Member Fee: $0 for 1999 members; Non-member Fee: US $3150.00; Reduced-License Fee: US $1575.00; Extra-Copy Fee: US $150.00
- Audio CD