University of Edinburgh developing attention-based NMT model called Nematus

Tuesday, December 20, 2016

Over the last two years, there has been a leap forward in machine translation research as deep learning methods, or neural networks, have been applied to translation. In September, Google published a widely reported paper which claimed that, for some types of content, neural machine translation (NMT) models were reaching human levels of performance. This claim is unfortunate because we are still a long way off human levels of performance, and in all likelihood we will never reach them for many types of content. However, Google and a number of other organisations such as WIPO and Systran have already put their NMT models into production, and we have seen a noticeable improvement in MT quality, especially for language pairs which were previously considered challenging, such as English into and out of Chinese, Japanese, German and Czech.

At Edinburgh we have been at the forefront of neural machine translation research. This year we participated in the shared task on news translation at the First Conference on Machine Translation (WMT16), where we won or tied for first place in 7 of the 8 language pairs in which we competed. We are developing an attention-based NMT model called Nematus, which is being widely adopted in the research community. Our innovations include automatically segmenting words into sub-word units, using monolingual data in the target language, and adding linguistic knowledge as features.

For the last ten years, deep learning has been increasingly adopted by communities who pursue research in artificial intelligence. A better understanding of how to train these models, together with huge increases in affordable parallel processing power, has meant that larger neural network models can be trained on more data. These advances have led to significant improvements in state-of-the-art image recognition and speech recognition. After a decade of relatively small improvements on phrase-based models, attention-based NMT models, which were initially proposed in 2015, have quickly equalled and now surpassed their performance on standard benchmark tasks.

An NMT model is made up of various building blocks, and the most popular version can be viewed as an encoder-decoder with attention. This means that the source sentence is first encoded as a sequence of vectors, and then the target sentence (the output) is produced one word at a time by the decoder. To select each output word, the decoder uses the encoded version of the source and the words it has already generated in the target, and it can selectively attend to portions of the source representation.
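
To make the attention step more concrete, here is a minimal sketch of a single decoder step in Python with NumPy. It is an illustration under simplifying assumptions, not the Nematus implementation: the function names (`softmax`, `attend`), the random vectors and the dot-product scoring are all chosen for brevity, and real systems learn the encoder, the decoder and the scoring function from data.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Compute attention for one decoder step.

    encoder_states: (source_len, dim) array, one vector per source word.
    decoder_state:  (dim,) array summarising the target words produced so far.
    Returns the context vector and the attention weights.
    """
    # Assumption: plain dot-product scoring, one relevance score per source word.
    scores = encoder_states @ decoder_state
    # Normalise the scores into a probability distribution over source positions.
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the source vectors.
    context = weights @ encoder_states
    return context, weights

# Hypothetical 5-word source sentence encoded as 8-dimensional vectors.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)

context, weights = attend(decoder_state, encoder_states)
```

The weights show how strongly the decoder attends to each source position at this step, and the context vector is what the decoder would combine with its own state to choose the next target word.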

There are a number of reasons why neural models are improving on previous phrase-based and syntax-based machine translation models. Previous statistical machine translation (SMT) models used large amounts of parallel data to estimate hand-crafted features, which were stored in phrase translation tables and reordering tables. Training SMT models involved finding weights for combining these features in a log-linear model. NMT models, however, view translation as one complex mathematical function on matrices. NMT models are also trained on parallel (translated) data, but they distil this knowledge into matrices of learned weights rather than phrase tables.
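
As a rough illustration of the log-linear combination used in SMT, the sketch below scores two candidate translations as a weighted sum of log-domain feature values and picks the higher-scoring one. The feature names, probabilities and weights are invented for this example; in a real system the feature values come from phrase tables, reordering tables and a language model, and the weights are tuned on held-out data.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values for one candidate translation."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature weights (a real system would tune these).
weights = {"translation_model": 1.0, "language_model": 0.5, "reordering": 0.3}

# Hypothetical candidates with made-up feature probabilities.
candidates = {
    "the house is small": {
        "translation_model": math.log(0.4),
        "language_model": math.log(0.2),
        "reordering": math.log(0.9),
    },
    "small is the house": {
        "translation_model": math.log(0.4),
        "language_model": math.log(0.05),
        "reordering": math.log(0.5),
    },
}

# The decoder prefers the candidate with the highest combined score.
best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
```

Both candidates have the same translation model score here, so the language model and reordering features decide between them.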

So why are NMT models outperforming previous MT approaches? One important reason is that NMT models do not need to make the independence assumptions that SMT made, where each phrase pair is translated largely independently of the previous phrase pair. In NMT, as we produce each target word, we are able to use information from all source words and all previous target words to guide us. Furthermore, in NMT, words are represented as real-valued vectors which can capture similarity in many different ways (semantic, e.g. dog and cat, and morphological, e.g. dog and dogs), thereby making the best use of similar contexts for similar words.
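
The idea that word vectors capture similarity can be sketched with cosine similarity. The three-dimensional vectors below are hand-picked purely for illustration (the word "tax" is added as an unrelated contrast); in an NMT model the vectors have hundreds of dimensions and are learned during training.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy embeddings, not taken from any trained model.
embeddings = {
    "dog":  np.array([0.9, 0.8, 0.1]),
    "dogs": np.array([0.9, 0.7, 0.3]),
    "cat":  np.array([0.8, 0.9, 0.1]),
    "tax":  np.array([0.1, 0.0, 0.9]),
}

sim_dog_cat = cosine(embeddings["dog"], embeddings["cat"])    # semantic neighbours
sim_dog_dogs = cosine(embeddings["dog"], embeddings["dogs"])  # morphological variants
sim_dog_tax = cosine(embeddings["dog"], embeddings["tax"])    # unrelated words
```

With these toy values, "dog" is far closer to "cat" and "dogs" than to "tax", which is the kind of structure that lets a model reuse what it has learned about one word when translating a similar one.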

However, there is a downside – current NMT models are not as well anchored in the source sentence as earlier statistical models and occasionally depart from it altogether, producing repetitive nonsense, or – perhaps worse – completely fluent but wrong translations. Despite these problems, NMT offers many advantages, such as the ease with which additional information (e.g. context, images, non-linguistic constraints) can be incorporated, and the elegance of its training pipeline.

In the SUMMA project we are actively investigating how to improve NMT for spoken language translation and for translating social media in the context of the news. We are also working on developing an open source toolkit for faster decoding and training.

[Figure: Phrase-based MT model]

[Figure: Neural MT model]

(c) Copyright SUMMA EU project.