Tokenization and Segmentation
- Xerox Research Centre Europe (XRCE)
- Andrei Mikheev
- Lauri Karttunen
- Pasi Tapanainen
- Gregory Grefenstette
- Stefan Klatt
- R. Gillam
- SATZ: An Adaptive Sentence Boundary Detector
- Tokenization in OpenNLP
- Adwait Ratnaparkhi's MXTERMINATOR
- LTG software: LT TTT
The Stakes of multilinguality: Multilingual text tokenization in Natural Language Diagnosis.
E. Giguet.
Proceedings of the 4th Pacific Rim International Conference on Artificial Intelligence, Workshop on Future Issues for Multilingual Text Processing. 1996.
Text Boundary Analysis in Java.
R. Gillam.
What is a word, what is a sentence? Problems of tokenization.
G. Grefenstette, P. Tapanainen.
Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX'94). 1994. 79-87.
LT TTT - A Flexible Tokenisation Tool.
C. Grover, C. Matheson, A. Mikheev and M. Moens.
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). 2000.
Xerox Finite-state Tool.
L. Karttunen, T. Gaal and A. Kempe.
Xerox Research Centre Europe. 1997.
Adaptive sentence boundary disambiguation.
D. Palmer, M. Hearst.
Proceedings of the 4th Conference on Applied Natural Language Processing. 1994.
A maximum entropy approach to identifying sentence boundaries.
J. Reynar, A. Ratnaparkhi.
Proceedings of the 5th Conference on Applied Natural Language Processing, Washington D.C., USA. 1997.
Document Centered Approach to Text Normalization.
A. Mikheev.
Proceedings of SIGIR'2000. 2000.
You don't have to think twice if you carefully tokenize.
S. Klatt.
Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04). Sanya, Hainan Island, China. 2004.