LT World

Tokenization and Segmentation


The Stakes of multilinguality: Multilingual text tokenization in Natural Language Diagnosis.
E. Giguet.
Proceedings of the 4th Pacific Rim International Conference on Artificial Intelligence Workshop Future issues for Multilingual Text Processing. 1996.

Text Boundary Analysis in Java.
R. Gillam.

What is a word, what is a sentence? Problems of tokenization.
G. Grefenstette, P. Tapanainen.
Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX'94). 1994. 79-87.

LT TTT - A Flexible Tokenisation Tool.
C. Grover, C. Matheson, A. Mikheev and M. Moens.
Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000). 2000.

Xerox Finite-state Tool.
L. Karttunen, T. Gaál and A. Kempe.
Xerox Research Centre Europe. 1997.

Adaptive sentence boundary disambiguation.
D. Palmer, M. Hearst.
Proceedings of the 4th Conference on Applied Natural Language Processing. 1994.

A maximum entropy approach to identifying sentence boundaries.
J. Reynar, A. Ratnaparkhi.
Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., USA. 1997.

Document Centered Approach to Text Normalization.
A. Mikheev.
Proceedings of SIGIR'2000. 2000.

You don't have to think twice if you carefully tokenize.
S. Klatt.
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04). Sanya, Hainan Island, China. 2004.



Tokenization is commonly seen as an independent step of linguistic analysis in which the input stream of characters is segmented into an ordered sequence of word-like units, usually called tokens, which serve as input items for subsequent steps of linguistic processing. Tokens may correspond to words, numbers, punctuation marks or even proper names. The recognized tokens are usually classified according to their syntax. Because tokenization means different things to different people, some tokenization tools also perform additional tasks such as sentence boundary detection and the handling of end-of-line hyphenation, conjoined clitics and contractions.
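
The description above can be illustrated with a minimal sketch in Python. It is not taken from any of the tools listed on this page: the names `Token` and `tokenize`, the token classes, and the English clitic list are all assumptions chosen for illustration. It segments a character stream into classified tokens, rejoins words hyphenated at a line end, and splits off conjoined clitics. Sentence boundary detection is deliberately left out.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    kind: str  # "word", "number", "punct" or "clitic" (illustrative classes)


# Rejoin a word hyphenated across a line break: "tokeni-\nzation" -> "tokenization".
# Assumption: every end-of-line hyphen is a soft hyphen. This is wrong in general,
# e.g. "well-\nknown" would become "wellknown".
_EOL_HYPHEN = re.compile(r"(\w)-\n(\w)")

# Alternatives are tried left to right; the name of the matching group
# becomes the token class.
_TOKEN = re.compile(
    r"""
    (?P<number>\d+(?:[.,]\d+)*)           # 42, 3.50, 1,000
  | (?P<word>[^\W\d_]+(?:'[^\W\d_]+)*)    # letters, with internal apostrophes
  | (?P<punct>[^\w\s]|_)                  # any other non-space character
    """,
    re.VERBOSE,
)

# A small, hypothetical list of English clitics and contractions.
_CLITIC = re.compile(r"^(.+?)(n't|'(?:s|re|ve|ll|d|m))$", re.IGNORECASE)


def tokenize(text: str) -> List[Token]:
    """Segment text into an ordered sequence of classified tokens."""
    text = _EOL_HYPHEN.sub(r"\1\2", text)
    tokens: List[Token] = []
    for match in _TOKEN.finditer(text):
        kind = match.lastgroup
        piece = match.group()
        split = _CLITIC.match(piece) if kind == "word" else None
        if split:
            # "doesn't" -> "does" + "n't"
            tokens.append(Token(split.group(1), "word"))
            tokens.append(Token(split.group(2), "clitic"))
        else:
            tokens.append(Token(piece, kind))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Dr. Smith doesn't pay $3.50 for tokeni-\nzation."):
        print(token.kind, token.text)
```

In this sketch "Dr." is split into a word and a full stop, because abbreviations are not recognized. Deciding whether such a full stop ends a sentence is the sentence boundary problem that several of the publications above address.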


Tokenization; Tokenisation; Segmentation; Segmentation and Tokenization; Segmentation and Tokenisation