LT World

Personal tools
Log in

Skip to content. | Skip to navigation


provided by

dfki logo

with support by

eu star logofp7 logo


meta logo
clarin logo

as well as by

bmbf logo


take logo

You are here: Home kb Resources & Tools Language Data British National Corpus (BNC)

British National Corpus (BNC)

100 Mio words

  • English

  • Monolingual

human annotation, 124 volunteers

POS, structural properties of texts


  • Oxford University Computing Services
  • Oxford University

  • POS-tagged Text Corpus

  • Spoken
  • Written

The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.

The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres).

  • inline XML

  • Data DVD

  • Oxford University Computing Services

BNC XML Edition with a personal licence: single copy: GBP 75 packs of ten: GBP 500 BNC XML Edition with an institutional licence GBP 500 BNC Baby single-user licence: GBP 21 per CD for orders of 10 or more: GBP 7 per CD

  • Creative Commens

  • Oxford University

linguistic research, language identification, language modeling, language teaching