LT World

The Bank of English (COBUILD Corpus)


  • University of Helsinki
  • The University of Alabama at Birmingham (UAB)

  • POS-tagged Text Corpus
  • Speech Corpus

  • Spoken
  • Written

The Bank of English is an international English language project sponsored by HarperCollins Publishers, Glasgow, and conducted by the COBUILD team at the University of Birmingham, UK. The text bank will comprise some 200 million words of both written and spoken English. The whole 200-million-word corpus is being annotated morphologically and syntactically during 1993-94 at the Research Unit for Computational Linguistics (RUCL), University of Helsinki, using the English morphological analyser (ENGTWOL) and the English Constraint Grammar (ENGCG) parser. The first half of the texts (103 million words) was already processed in 1993. The project is led by Prof. John Sinclair in Birmingham and Prof. Fred Karlsson in Helsinki. The present author is responsible for conducting the annotation.


Our analysing system, which is presented in detail in [Karlsson, 1994], consists of the following successive stages:

  • preprocessing
  • ENGTWOL lexical analysis
  • ENGCG morphological disambiguation
  • ENGCG syntactic mapping and disambiguation
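
The four successive stages above can be pictured as a chain of functions, each consuming the output of the previous one. The following is only a toy sketch in Python, not the ENGTWOL or ENGCG software: the lexicon, the tag names, the function labels, and the single constraint are all hypothetical stand-ins chosen for illustration. It assumes the general Constraint Grammar idea that lexical analysis first assigns every candidate reading to a word and that disambiguation then discards readings that the context rules out.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon standing in for ENGTWOL: word form -> candidate tags.
TOY_LEXICON = {
    "the": ["DET"],
    "round": ["N", "V", "ADJ"],
    "table": ["N", "V"],
}

# Hypothetical function labels; these are not real ENGCG syntactic tags.
TOY_FUNCTIONS = {"DET": "DETERMINER", "N": "NOMINAL_HEAD"}


@dataclass
class Token:
    form: str
    readings: list = field(default_factory=list)
    function: str = ""


def preprocess(text):
    """Stage 1: lower-case the text and split it on whitespace."""
    return [Token(form=word.lower()) for word in text.split()]


def lexical_analysis(tokens):
    """Stage 2: attach every candidate reading found in the toy lexicon."""
    for token in tokens:
        token.readings = list(TOY_LEXICON.get(token.form, ["UNKNOWN"]))
    return tokens


def morphological_disambiguation(tokens):
    """Stage 3: apply one illustrative constraint.

    A verb reading is discarded directly after an unambiguous determiner.
    The last remaining reading of a token is never removed.
    """
    for previous, current in zip(tokens, tokens[1:]):
        if previous.readings == ["DET"] and len(current.readings) > 1:
            remaining = [r for r in current.readings if r != "V"]
            if remaining:
                current.readings = remaining
    return tokens


def syntactic_mapping(tokens):
    """Stage 4: label tokens that have exactly one reading left."""
    for token in tokens:
        if len(token.readings) == 1:
            token.function = TOY_FUNCTIONS.get(token.readings[0], "UNMAPPED")
        else:
            token.function = "AMBIGUOUS"
    return tokens


def analyse(text):
    """Run the four stages in their successive order."""
    tokens = preprocess(text)
    tokens = lexical_analysis(tokens)
    tokens = morphological_disambiguation(tokens)
    return syntactic_mapping(tokens)


if __name__ == "__main__":
    for token in analyse("The round table"):
        print(token.form, token.readings, token.function)
```

In this sketch the single constraint removes the verb reading of "round" but leaves "table" with two readings, because no constraint covers its context. A token is left ambiguous rather than being forced to a guess when no rule applies.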

The main routines performed on the monthly data, which include constant monitoring of both incoming texts and analysed output as well as data management (documentation, backups), are closely linked to the updating of the preprocessing module and the ENGTWOL lexicon.

  • English

  • Monolingual

  • Morphology
  • Morphosyntax

200 million words of both written and spoken English

  • Timo Järvinen