The British component of the International Corpus of English (ICE-GB)
syntactic trees, syntactic dependencies, POS
- Speech Corpus
ICE-GB is the British component of the International Corpus of English (ICE).
WHAT IS SPECIAL ABOUT ICE-GB?
ICE-GB is fully grammatically analysed. Like all the ICE corpora, ICE-GB consists of a million words of spoken and written English and adheres to the common ICE corpus design. The million words are made up of 300 spoken and 200 written texts. Every text is grammatically annotated, permitting complex and detailed searches across the whole corpus. ICE-GB contains 83,394 parse trees, of which 59,640 are in the spoken part of the corpus. This is the largest collection of parsed spoken material anywhere, with the exception of DCPSE (which contains only spoken material).
The picture below shows ICECUP 3.1 displaying a single tree from the spoken part of the corpus.
ICE-GB has been fully checked. Linguists checked it at several stages in its completion, using both a traditional ‘post-checking’ strategy and cross-sectional error-based searches. We do not believe that the analysis in the corpus is perfect, but it is not systematically imperfect, unlike even the best parser output.
ICE-GB comes complete with ICECUP. ICECUP lets you run many kinds of query, including Fuzzy Tree Fragments: partial tree patterns that draw on the parse analysis to search the corpus.
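To give a rough feel for searching a parsed corpus with a tree fragment, here is a minimal Python sketch. It is a toy model and not ICECUP's implementation or query format: the `Node` and `Fragment` classes, the `search` function, the matching rules (child fragments matched in order, with gaps allowed) and the example tree are all assumptions made for illustration, and the node labels are only meant to resemble function and category labels.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """One node of a parse tree (hypothetical model, not the ICE-GB file format)."""
    function: str
    category: str
    children: list[Node] = field(default_factory=list)


@dataclass
class Fragment:
    """A partial tree pattern; a field left as None matches any value."""
    function: Optional[str] = None
    category: Optional[str] = None
    children: list[Fragment] = field(default_factory=list)


def node_matches(frag: Fragment, node: Node) -> bool:
    """True if the fragment, rooted at this node, fits the tree."""
    if frag.function is not None and frag.function != node.function:
        return False
    if frag.category is not None and frag.category != node.category:
        return False
    # Child fragments must match distinct children in left-to-right order;
    # unmatched children may sit between them (an assumed matching rule).
    pos = 0
    for child_frag in frag.children:
        while pos < len(node.children) and not node_matches(child_frag, node.children[pos]):
            pos += 1
        if pos == len(node.children):
            return False
        pos += 1
    return True


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def search(frag: Fragment, tree: Node) -> list[Node]:
    """Return every node in the tree at which the fragment matches."""
    return [n for n in walk(tree) if node_matches(frag, n)]


# Illustrative tree for a two-word utterance such as "She left".
tree = Node("PU", "CL", [
    Node("SU", "NP", [Node("NPHD", "PRON")]),
    Node("VB", "VP", [Node("MVB", "V")]),
])

# Fragment: a clause containing a subject NP that has a pronoun somewhere
# among its children; the pronoun's function is left unspecified.
pattern = Fragment(category="CL", children=[
    Fragment(function="SU", category="NP", children=[Fragment(category="PRON")]),
])

hits = search(pattern, tree)
```

Leaving a field unspecified is what makes the pattern approximate in this sketch: the same fragment would also match a clause whose subject contains other material around the pronoun.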
- GNU GPL