LT World


NEGRA Corpus



syntactic trees, syntactic dependencies, POS

  • Saarland University (UdS)

  • Treebank
  • POS-tagged Text Corpus

Version 2 of the NEGRA corpus consists of 355,096 tokens (20,602 sentences) of German newspaper text taken from the Frankfurter Rundschau, as contained on the CD "Multilingual Corpus 1" of the European Corpus Initiative. It is based on approx. 60,000 tokens that were tagged for part of speech at the Institut für maschinelle Sprachverarbeitung, Stuttgart. This initial corpus was then extended, tagged with part-of-speech labels, and completely annotated with syntactic structures. The corpus was created in the projects NEGRA (DFG Sonderforschungsbereich 378, Projekt C3) and LINC (Universität des Saarlandes) in Saarbrücken.


Within the project, the corpus is stored in an SQL database. Externally, we represent the annotations in a line-oriented export format. The corpus contains context-free structures with crossing branches. If required, crossing branches can be converted to traces, and the corpus can then be represented in the same format as the Penn Treebank.
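To illustrate what "crossing branches" means in practice, here is a minimal Python sketch that checks whether a phrase covers a non-contiguous span of tokens. The `Node` class, the helper names, and the four-word example sentence are assumptions made for illustration; they are not part of the corpus distribution or its export format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical phrase node; children are Node objects or token positions (int)."""
    label: str
    children: list = field(default_factory=list)


def token_positions(node):
    """Collect the positions of all tokens dominated by this node."""
    positions = set()
    for child in node.children:
        if isinstance(child, int):
            positions.add(child)
        else:
            positions |= token_positions(child)
    return positions


def is_discontinuous(node):
    """A phrase has crossing branches if its tokens do not form one contiguous span."""
    positions = token_positions(node)
    return bool(positions) and max(positions) - min(positions) + 1 != len(positions)


# Illustrative sentence (an assumption, not quoted from the corpus data).
tokens = ["Darüber", "muss", "nachgedacht", "werden"]

inner_vp = Node("VP", [0, 2])         # "Darüber ... nachgedacht"
outer_vp = Node("VP", [inner_vp, 3])  # "Darüber ... nachgedacht werden"
sentence = Node("S", [outer_vp, 1])   # the finite verb "muss" sits inside the gap

for node in (inner_vp, outer_vp, sentence):
    print(node.label, sorted(token_positions(node)), is_discontinuous(node))
```

Both verb phrases are reported as discontinuous because the finite verb interrupts them, while the sentence node covers a contiguous span. A conversion to Penn Treebank style would have to replace exactly these gaps with traces.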


The following types of information are encoded in the corpus:

  • Part-of-speech tags. We use the Stuttgart-Tübingen-Tagset (STTS).
  • Morphological analysis (only for the first 60,000 tokens). These are the tags of the expanded STTS.
  • The grammatical function in the directly dominating phrase. List of the grammatical functions
  • The category of nonterminal nodes (phrases). List of the phrasal categories
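The four annotation layers listed above could be held in memory roughly as follows. This is a sketch under stated assumptions: the class and field names, the node ids (500, 501), the `--` placeholder for the root, and the three-word example are hypothetical and do not reproduce the corpus's export format. The tag values shown (ART, NN, VVFIN, NK, HD, SB, NP, S) are used only as plausible examples of STTS tags, grammatical functions, and phrasal categories.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    word: str
    pos: str                      # part-of-speech tag (STTS)
    function: str                 # grammatical function in the directly dominating phrase
    parent: int                   # hypothetical id of the dominating phrase node
    morph: Optional[str] = None   # expanded-STTS morphology; only part of the corpus has it


@dataclass
class Phrase:
    node_id: int                  # hypothetical node id
    category: str                 # phrasal category of the nonterminal node
    function: str                 # grammatical function in the directly dominating phrase
    parent: Optional[int]         # None for the root node


# Illustrative annotation of "Der Hund bellt" (assumed, not taken from the corpus).
tokens = [
    Token("Der", "ART", "NK", 500),
    Token("Hund", "NN", "NK", 500),
    Token("bellt", "VVFIN", "HD", 501),
]
phrases = [
    Phrase(500, "NP", "SB", 501),
    Phrase(501, "S", "--", None),
]


def functions_under(node_id, tokens, phrases):
    """Return (label, grammatical function) pairs for the direct children of a node."""
    children = [(t.word, t.function) for t in tokens if t.parent == node_id]
    children += [(p.category, p.function) for p in phrases if p.parent == node_id]
    return children


print(functions_under(500, tokens, phrases))
print(functions_under(501, tokens, phrases))
```

Keeping `morph` optional mirrors the fact that morphological analysis is available only for the first 60,000 tokens.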

  • German

  • Monolingual

  • Syntax

Release 2