provided by
German Research Center for Artificial Intelligence

Tokenization and Segmentation

definition: Tokenization is commonly seen as an independent process of linguistic analysis in which the input stream of characters is segmented into an ordered sequence of word-like units, usually called tokens, which serve as input items for subsequent steps of linguistic processing. Tokens may correspond to words, numbers, punctuation marks or even proper names. The recognized tokens are usually classified according to their syntax. Since the notion of tokenization means different things to different people, some tokenization tools perform additional tasks such as sentence boundary detection or the handling of end-of-line hyphenation, conjoined clitics and contractions.
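
example: The tasks named in the definition can be illustrated with a minimal Python sketch. It is not the method of any of the systems listed below. The token classes (NUMBER, CLITIC, WORD, PUNCT), the regular expressions, the toy abbreviation list and all function names are assumptions chosen for illustration, and the rules are tuned to simple English text only.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (token text, token class)

# Hypothetical token classes; order matters because alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    (?P<NUMBER>\d+(?:[.,]\d+)*)
  | (?P<CLITIC>(?:n't|'(?:s|re|ve|ll|d|m))\b)
  | (?P<WORD>[^\W\d_]+?(?=n't\b)|[^\W\d_]+(?:-[^\W\d_]+)*)
  | (?P<PUNCT>[^\w\s])
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Toy abbreviation list (an assumption); real systems learn or look these up.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}


def join_line_hyphenation(text: str) -> str:
    """Rejoin words split by a hyphen at a line end.

    Naive: a genuine hyphenated compound broken at the line end
    loses its hyphen as well.
    """
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


def tokenize(text: str) -> List[Token]:
    """Segment text into an ordered list of (token, class) pairs."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into sentences at '.', '!' or '?'.

    A period directly after a known abbreviation is not treated
    as a sentence boundary.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for i, (text, kind) in enumerate(tokens):
        current.append((text, kind))
        is_terminator = kind == "PUNCT" and text in ".!?"
        after_abbreviation = (
            text == "." and i > 0 and tokens[i - 1][0].lower() in ABBREVIATIONS
        )
        if is_terminator and not after_abbreviation:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    raw = "Dr. Smith didn't pay 3.50 dollars for the tokeni-\nzation tool. It's state-of-the-art!"
    for sentence in split_sentences(tokenize(join_line_hyphenation(raw))):
        print(sentence)
```

In this sketch "didn't" is split into "did" and "n't", "3.50" stays a single NUMBER token so its period is not mistaken for a sentence end, and the period after "Dr" is skipped only because "dr" is in the toy abbreviation list. Abbreviations outside that list would still trigger a false sentence boundary, which is the kind of ambiguity that dedicated sentence boundary detectors address.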
related organisation(s):
  • XEROX Research Center Europe (XRCE)
related person(s):
  • Andrei Mikheev
  • Lauri Karttunen
  • Pasi Tapanainen
  • Gregory Grefenstette
  • Stefan Klatt
  • R. Gillam
related system(s) / resource(s):
  • SATZ: An Adaptive Sentence Boundary Detector
  • Tokenization in OpenNLP
  • Adwait Ratnaparkhi's MXTERMINATOR
  • LTG software: LT TTT
related publication(s):

The Stakes of multilinguality: Multilingual text tokenization in Natural Language Diagnosis.
E. Giguet.
Proceedings of the 4th Pacific Rim International Conference on Artificial Intelligence Workshop Future issues for Multilingual Text Processing. 1996.

Text Boundary Analysis in Java.
R. Gillam.

What is a word, what is a sentence? Problems of tokenization.
G. Grefenstette, P. Tapanainen.
Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX'94). 1994. 79-87.

LT TTT - A Flexible Tokenisation Tool.
C. Grover, C. Matheson, A. Mikheev and M. Moens.
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). 2000.

Xerox Finite-state Tool.
L. Karttunen, T. Gaal and A. Kempe.
Xerox Research Centre Europe. 1997.

Adaptive sentence boundary disambiguation.
D. Palmer, M. Hearst.
Proceedings of the 4th Conference on Applied Natural Language Processing. 1994.

A maximum entropy approach to identifying sentence boundaries.
J. Reynar, A. Ratnaparkhi.
Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., USA. 1997.

Document Centered Approach to Text Normalization.
A. Mikheev.
Proceedings of SIGIR'2000. 2000.

You don't have to think twice if you carefully tokenize.
S. Klatt.
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04). Sanya, Hainan Island, China. 2004.