
Tokenization and Segmentation


The Stakes of multilinguality: Multilingual text tokenization in Natural Language Diagnosis.
E. Giguet.
Proceedings of the 4th Pacific Rim International Conference on Artificial Intelligence, Workshop on Future Issues for Multilingual Text Processing. 1996.

Text Boundary Analysis in Java.
R. Gillam.

What is a word, what is a sentence? Problems of tokenization.
G. Grefenstette, P. Tapanainen.
Proceedings of the Third International Conference on Computational Lexicography (COMPLEX'94). 1994. 79-87.

LT TTT - A Flexible Tokenisation Tool.
C. Grover, C. Matheson, A. Mikheev and M. Moens.
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). 2000.

Xerox Finite-state Tool.
L. Karttunen, T. Gaál and A. Kempe.
Xerox Research Centre Europe. 1997.

Adaptive sentence boundary disambiguation.
D. Palmer, M. Hearst.
Proceedings of the 4th Conference on Applied Natural Language Processing. 1994.

A maximum entropy approach to identifying sentence boundaries.
J. Reynar and A. Ratnaparkhi.
Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., USA. 1997.

Document Centered Approach to Text Normalization.
A. Mikheev.
Proceedings of SIGIR'2000. 2000.

You don't have to think twice if you carefully tokenize.
S. Klatt.
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04), Sanya, Hainan Island, China. 2004.



Tokenization is commonly seen as an independent step of linguistic analysis in which the input stream of characters is segmented into an ordered sequence of word-like units, usually called tokens, which serve as input items for subsequent stages of linguistic processing. Tokens may correspond to words, numbers, punctuation marks or even proper names. The recognized tokens are usually classified according to their syntax. Since the notion of tokenization means different things to different people, some tokenization tools take on additional tasks such as sentence boundary detection or the handling of end-of-line hyphenation, conjoined clitics and contractions.
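
To make this description concrete, the Python fragment below is a minimal, purely illustrative sketch (it is not drawn from any of the tools cited above): it segments a character stream into tokens classified as words, numbers or punctuation, and then groups the tokens into sentences with a naive period-based heuristic. Its misbehaviour on abbreviations such as "Mr." is exactly the sentence-boundary ambiguity addressed by Palmer and Hearst (1994) and Reynar and Ratnaparkhi (1997).

import re

# Minimal illustrative tokenizer: classifies tokens with a regular expression
# and applies a naive period-based sentence-boundary heuristic. Real tools
# handle abbreviations, end-of-line hyphenation, clitics and contractions
# explicitly.
TOKEN_RE = re.compile(r"""
      (?P<NUMBER>\d+(?:[.,]\d+)*)     # 12, 3.14, 1,000
    | (?P<WORD>\w+(?:['-]\w+)*)       # words, contractions ("wasn't"), hyphenated words
    | (?P<PUNCT>[^\w\s])              # any single non-space, non-word character
""", re.VERBOSE | re.UNICODE)

def tokenize(text):
    """Yield (token, token_class) pairs from a raw character stream."""
    for m in TOKEN_RE.finditer(text):
        yield m.group(), m.lastgroup

def sentences(token_stream):
    """Group classified tokens into sentences at '.', '!' or '?'."""
    current = []
    for token, token_class in token_stream:
        current.append(token)
        if token_class == "PUNCT" and token in ".!?":
            yield current
            current = []
    if current:
        yield current

if __name__ == "__main__":
    sample = "Mr. Smith paid 3.5 dollars. Wasn't that cheap?"
    for sentence in sentences(tokenize(sample)):
        print(sentence)
    # "Mr." wrongly ends the first sentence: the classic abbreviation problem.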


Tokenization; Tokenisation; Segmentation; Segmentation and Tokenization; Segmentation and Tokenisation