LT World

The American National Corpus (ANC)


http://www.americannationalcorpus.org/OANC/index.html#

The ANC has also released an "Open" portion of the full ANC consisting of approximately 15 million words

government, technical, travel guides, fiction, letters, journal

  • English

  • Monolingual

POS

morphosyntactically

All ANC and OANC data include annotations for word and sentence boundaries, part of speech (4 tagsets), and noun and verb chunks

  • Linguistic Data Consortium (LDC)

  • POS-tagged Text Corpus

  • Written

The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.

 

When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundred million words, chosen to provide the broadest and largest possible selection of texts (and, where available, annotations).

The file organization and encoding conventions for the OANC are the same as in the ANC Second Release. Please consult the Second Release document encoding conventions for a full description.

The OANC data is distributed with the following annotations:

  • Structural markup (sections, chapters, etc.) down to the level of paragraph
  • Sentence boundaries
  • Words (tokens) with part of speech annotations using the Penn tagset
  • Noun chunks
  • Verb chunks
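The annotation layers listed above can be pictured as labelled spans over the text. The following is a minimal sketch only, not the ANC encoding format: the `Span` class, the layer names, the sample sentence, and its tags are all hypothetical and chosen for illustration. The part-of-speech labels follow the Penn tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotation: a character range, the layer it belongs to, and an optional label."""
    start: int
    end: int
    layer: str       # hypothetical layer names: "sentence", "token", "noun_chunk", "verb_chunk"
    label: str = ""  # Penn part-of-speech tag for tokens, empty otherwise


# Hypothetical sample text and hand-written annotations (not taken from the corpus).
TEXT = "The corpus contains letters."

ANNOTATIONS = [
    Span(0, 28, "sentence"),
    Span(0, 3, "token", "DT"),
    Span(4, 10, "token", "NN"),
    Span(11, 19, "token", "VBZ"),
    Span(20, 27, "token", "NNS"),
    Span(27, 28, "token", "."),
    Span(0, 10, "noun_chunk"),
    Span(11, 19, "verb_chunk"),
    Span(20, 27, "noun_chunk"),
]


def covered(text, spans, layer):
    """Return (covered text, label) pairs for every span in the given layer."""
    return [(text[s.start:s.end], s.label) for s in spans if s.layer == layer]


if __name__ == "__main__":
    for layer in ("sentence", "token", "noun_chunk", "verb_chunk"):
        print(layer, covered(TEXT, ANNOTATIONS, layer))
```

Keeping each layer as an independent list of spans means that sentence boundaries, tokens, and chunks can be read or replaced separately, which is convenient when only some layers (for example, sentence boundaries) have been manually validated.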

All annotations were originally produced automatically using GATE's ANNIE system. Some of the texts in the OANC include manually validated sentence boundaries; the OANC site lists the texts that were validated. Note that the validated sentence boundaries are not included in the ANC Second Release.



http://www.americannationalcorpus.org/OANC/index.html

  • Web

  • Linguistic Data Consortium (LDC)

  • GNU GPL