LT World

The Penn Treebank (PTB)

POS, skeletal syntactic bracketing, argument-predicate structure


syntactic trees, PTB Bracketing

PTB and then corrected by human annotators

  • University of Pennsylvania
  • Department of Information Engineering and Computer Science

The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information (a bank of linguistic trees). It also annotates text with part-of-speech tags and, for the Switchboard corpus of telephone conversations, with dysfluency annotation.
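To make the bracketed-tree and part-of-speech annotation concrete, here is a minimal Python sketch that reads one PTB-style bracketing and lists its (word, tag) pairs. The example sentence is invented for illustration and is not taken from the corpus, and the function names are hypothetical. The sketch assumes a well-formed tree with a labeled root; distributed treebank files may wrap each tree in an extra unlabeled pair of parentheses, which this sketch does not handle.

```python
def tokenize(text):
    """Split a bracketed string into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos].

    Returns ((label, children), next_pos), where each child is either
    a nested (label, children) tuple or a word string.
    """
    assert tokens[pos] == "(", "a constituent must start with '('"
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    return (label, children), pos + 1  # skip the closing ')'


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Invented example in PTB-style bracketing (not a corpus sentence).
EXAMPLE = "(S (NP (DT The) (NN dog)) (VP (VBD barked)) (. .))"

tree, _ = parse(tokenize(EXAMPLE))
print(tagged_words(tree))
```

The nested tuples keep the phrase labels (S, NP, VP), while the innermost label of each word is its part-of-speech tag, so both annotation layers can be read from the same bracketing.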


The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files are available in a compressed file via ftp; they relate the 2,499 PTB filenames to the corresponding WSJ DOCNO strings in TIPSTER.


Other Web links:

The Penn Treebank Project:

Penn Treebank Online:

  • English

  • Monolingual

  • Syntax
  • Morphosyntax
  • Morphology

 1 Million texts

Member fee: $0 for 1999 members; Non-member fee: US $3150.00; Reduced-license fee: US $1575.00; Extra-copy fee: US $150.00

  • Audio CD