Syntax for Statistical Machine Translation Workshop
In recent evaluations of machine translation systems, statistical systems based on probabilistic models have outperformed classical approaches based on interpretation, transfer, and generation. Nonetheless, the output of statistical systems often contains obvious grammatical errors. This can be attributed to the fact that syntactic well-formedness is influenced only by local n-gram language models and simple alignment models. We aim to integrate syntactic structure into statistical models to address this problem. A convenient and promising approach for this integration is the maximum entropy framework, which allows many different knowledge sources to be combined in one overall model and the combination weights to be trained discriminatively. This approach will allow us to extend a baseline system easily by adding new feature functions.
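As a rough illustration of how a maximum entropy (log-linear) model combines knowledge sources, the sketch below scores candidate translations as a weighted sum of feature function values and normalizes the scores into probabilities. The feature names (`lm`, `tm`, `syntax`), the weights, and the candidate list are hypothetical placeholders, not values from the baseline system, and the weights are set by hand here rather than trained discriminatively.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_candidate(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

def posteriors(candidates, weights):
    """Normalize exponentiated scores over the candidate list (softmax)."""
    scores = [loglinear_score(c["features"], weights) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights and candidates, for illustration only.
weights = {"lm": 1.0, "tm": 0.5, "syntax": 2.0}
candidates = [
    {"text": "candidate A", "features": {"lm": -2.0, "tm": -1.0, "syntax": 0.0}},
    {"text": "candidate B", "features": {"lm": -2.5, "tm": -1.0, "syntax": 1.0}},
]

best = best_candidate(candidates, weights)
probs = posteriors(candidates, weights)
```

In this setup, adding a new knowledge source amounts to adding one more key to each feature dictionary and one more weight, which is what makes the framework easy to extend.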
The workshop will start with a strong baseline -- the alignment template statistical machine translation system that obtained the best results in the 2002 DARPA MT evaluations. During the workshop, we will incrementally add new features that represent syntactic knowledge and address specific problems of the underlying baseline. We want to investigate a broad range of possible feature functions, from very simple binary features to sophisticated tree-to-tree translation models. Simple feature functions might test whether a certain constituent occurs in both the source and the target language parse trees. More sophisticated features will be derived from an alignment model in which whole sub-trees in source and target can be aligned node by node. We also plan to investigate features based on the projection of parse trees from one language onto strings of another, a useful technique when parses are available for only one of the two languages. We will extend previous tree-based alignment models by allowing partial tree alignments when the two syntactic structures are not isomorphic.
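A simple binary feature of the kind mentioned above might look like the following minimal sketch. It assumes parse trees are represented as nested tuples of the form `(label, child, ...)` with words as plain strings, and that both parsers use the same label inventory; the tree encoding, the function names, and the example sentences are illustrative assumptions rather than part of the workshop system.

```python
def constituent_labels(tree):
    """Collect the labels of all constituents in a nested-tuple parse tree."""
    if isinstance(tree, str):  # a leaf is a word, not a constituent
        return set()
    label, *children = tree
    found = {label}
    for child in children:
        found |= constituent_labels(child)
    return found

def constituent_in_both(source_tree, target_tree, label):
    """Binary feature: 1.0 if `label` occurs in both trees, else 0.0."""
    in_source = label in constituent_labels(source_tree)
    in_target = label in constituent_labels(target_tree)
    return 1.0 if in_source and in_target else 0.0

# Hypothetical source (German) and target (English) parses.
source_tree = ("S", ("NP", "das", "Haus"), ("VP", "ist", ("ADJP", "klein")))
target_tree = ("S", ("NP", "the", "house"), ("VP", "is", ("ADJP", "small")))

np_feature = constituent_in_both(source_tree, target_tree, "NP")
pp_feature = constituent_in_both(source_tree, target_tree, "PP")
```

A feature like this ignores where the constituent appears and what it covers; the tree-to-tree alignment features described above would be needed to capture that.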