Integrated Language Technology
Goals
The main focus of ILT is threefold: to improve existing machine translation (MT) models and develop new ones that are capable of high-quality output and support a range of input and output modalities (ILT1); to produce intelligent speech recognition and synthesis engines for multilingual eyes-busy, hands-busy scenarios (ILT2); and to develop novel methods for automatically annotating monolingual and multilingual data according to well-defined linguistic and localisation criteria, in order to facilitate improved MT technology (ILT3).
Methodology
There is a growing belief that state-of-the-art phrase-based approaches to MT have reached a ceiling, beyond which significant improvements will not come about unless such models can capture more linguistic information. In ILT1, we are enhancing our systems with syntax and semantics at all levels of the MT pipeline. In parallel, we are developing different kinds of MT systems, including hybrid and combined systems and machine learning-based transfer systems, all of which can be tuned to specific domains and to different modes of input and output. Similarly, in ILT2, research is addressing the shortcomings of current spoken language technologies in scaling to open domains and other languages, by explicitly using fine-grained linguistic knowledge and by automating as much of the data acquisition and structuring as possible. Recognising that there are synergies between the technologies and methodologies used in MT and in speech technology which can be exploited for innovations in both areas, ILT is seeking to couple the MT engines from ILT1 tightly with the speech recognition and synthesis engines from ILT2. In ILT3, research addresses the level of linguistic and localisation metadata required to support MT in the localisation process, and also focuses on domain and text classification using a variety of supervised and unsupervised approaches.
Industry Engagement
The industrial partners, especially Symantec, IBM, Microsoft, Traslán, Alchemy and VistaTEC, are all well integrated into research track ILT1. Staff from Microsoft, Traslán, Symantec and IBM have all featured in peer-reviewed publications in this track. Alchemy and VistaTEC feature in the CNGL EYECON project, which centres on the integration of MT and Translation Memory systems, with eye-tracking used as a predictor of the cognitive load involved in post-editing MT output. This initiative has received extra financial support from Alchemy over and above its initial commitments to the CNGL.
The ILT1 MT group is already heavily engaged in commercialisation activities. ILT1 researchers were part of the group (with ILT2) that developed the first CNGL patent in 2009. Contract research is already being carried out, which may lead to Innovation Partnerships and new CNGL industrial partners. In addition, steps have been taken to create spin-off companies centred on the MaTrEx MT system, including, as part of the PLuTO FP7 project featuring Prof. Way and Dr. Sheridan, a spin-off company to facilitate multilingual patent search.
While limited-domain speech recognition and synthesis can provide a speech interface for specific applications, much of the research in ILT2 aims to underpin open-domain recognition and synthesis, making interfaces extensible not only to other languages but also to more natural interactions rather than restricted dialogues. This facility is of interest to industry partner SpeechStorm, who provide self-service solutions for managing customer interactions. SpeechStorm are providing speech data and are advising on the prioritisation of tasks in the expansion to open domains. In the past year, ILT3 researchers have continued to work with the datasets supplied by Symantec and VistaTEC, and have supported mutual scientific interests in industrially relevant problems.


