Announcing the End of the SUMMA Project
It’s done! The SUMMA project officially ended on 31 January 2019, but that does not mean we stop here. The outcome is very promising, the tools will lead their own lives, and we will continue to inform and support the SUMMA community.
Below is a brief overview of the results and achievements of this very interesting project.
We have advanced research on the 10 SUMMA components and their underlying technologies:
- speech recognition
- metadata extraction from broadcast media
- machine translation
- streaming implementation of storyline clustering and topic detection
- entity tagging and linking
- knowledge base construction
- forecasting and fact checking
- story-level semantic parsing
- story highlight generation and summarisation
Our machine translation research, for instance, pushed the state of the art in both quality and speed. We investigated deeper recurrent models and document-level translation, and our models ranked first for many language pairs in the WMT16 and WMT17 evaluations. In SUMMA we also spearheaded the development of the Marian toolkit for training and deploying neural MT systems, which is now being deployed at the EU, the UN and Microsoft.
Another example: one of the main novelties of the SUMMA platform is the ability to provide automatic summaries for news articles and for groups of related stories (clusters). To this end, we developed eight novel summarisation models and published over ten papers on the subject, covering three types of approach: extractive (sentence selection), compressive (sentence compression) and abstractive (text generation). We advanced the state of the art on several fronts and developed models that learn to represent text and to select or generate a summary with the most relevant content. Among other approaches, we leveraged neural architectures such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), and representation formalisms such as Abstract Meaning Representation (AMR). Check out our blog over the next few weeks for more details on this.
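To make the "extractive (sentence selection)" idea concrete, here is a minimal sketch. It is not one of the SUMMA models, which are neural; it is a crude word-frequency baseline, and the function name `extractive_summary` and its parameters are made up for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order.

    Illustrative baseline only: a sentence's score is the average corpus
    frequency of its words. No stop-word handling, no learning.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokens(sentence):
        return re.findall(r"[a-z']+", sentence.lower())

    # Word frequencies over the whole text.
    freq = Counter(t for s in sentences for t in tokens(s))

    def score(sentence):
        toks = tokens(sentence)
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    # Rank sentence indices by score, keep the best, then restore text order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Compressive and abstractive approaches go further: they shorten or rewrite the selected content instead of copying whole sentences.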
The integrated SUMMA platform is a fact. It was developed as a prototype, but a very solid one, and it serves as a basis for full and very powerful monitoring systems. We have tested it extensively, in particular at BBC Monitoring and Deutsche Welle, involving evaluators from a range of sections, including innovation specialists, journalists, editors and monitors covering all languages supported by the platform. In addition, it has been demonstrated widely and has raised considerable interest from other media companies and broadcaster groups, such as the EBU. The SUMMA user days and other occasions gave external users a chance to try out the systems themselves, consider potential uses and provide feedback.
We have built a common monitoring platform which is robust and flexible. It caters for different applications, target groups and supporting modules. The platform is Docker-based, which makes it possible to add or change components smoothly. Flexibility was a key objective throughout, and the platform is highly customisable. It currently processes nine languages (English, German, Spanish, Portuguese, Arabic, Russian, Farsi, Latvian and Ukrainian). The range of languages can be expanded by adding dockerised models for ASR and MT. By integrating off-the-shelf tools, we can even cover virtually all major languages. This opens up great possibilities.
The current platform offers a fully automated monitoring system, ingesting content via API. After ingestion, it automatically transcribes all audio from video, turning speech into text. It also automatically translates all text (whether from original text articles or from transcribed speech) into English. It uses the English text to produce a cross-lingual overview of the content: clustering related items into stories, summarising stories and individual items, and adding topical keywords, named entities and sentiment analysis. The platform offers entity search as well as full-text search, and different visualisations, including a list view, tile view and heat map view.
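The order of the stages described above can be sketched roughly as follows. This is an illustration, not the actual SUMMA code: the `Item` fields, the function names and the `asr`, `mt` and `cluster` callables are all hypothetical stand-ins for the real dockerised components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Item:
    """One ingested media item (field names are illustrative assumptions)."""
    item_id: str
    language: str
    text: Optional[str] = None      # original article text, if any
    audio: Optional[bytes] = None   # audio track extracted from video, if any
    english: Optional[str] = None   # filled in by the translation stage


def transcribe(item: Item, asr: Callable[[bytes, str], str]) -> Item:
    # Speech to text: only needed when the item has audio but no text yet.
    if item.text is None and item.audio is not None:
        item.text = asr(item.audio, item.language)
    return item


def translate(item: Item, mt: Callable[[str, str], str]) -> Item:
    # English is the common target language; English items pass through.
    if item.text is not None:
        item.english = item.text if item.language == "en" else mt(item.text, item.language)
    return item


def run_pipeline(
    items: List[Item],
    asr: Callable[[bytes, str], str],
    mt: Callable[[str, str], str],
    cluster: Callable[[List[Item]], Dict[str, List[Item]]],
) -> Dict[str, List[Item]]:
    """Transcribe, translate, then group items into stories."""
    processed = [translate(transcribe(item, asr), mt) for item in items]
    # Clustering works on the English text, which is what makes the
    # resulting overview cross-lingual.
    return cluster(processed)
```

Summarisation, keyword and entity tagging, and sentiment analysis would then run on the English text of each item and story in the same way. Passing each stage in as a callable mirrors the idea that components can be swapped without touching the rest of the pipeline.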
Several use cases have been developed and tested during the project:
Internal monitoring: Deutsche Welle focused on monitoring internal content, i.e., articles and audiovisual material published within the organisation. As a global broadcaster, Deutsche Welle produces and distributes content in 30 languages, so keeping track of what has been produced in all these languages is a huge linguistic and logistic task. Comparing content topics across languages, determining which items could be of interest to other language departments, and deciding what could or should be translated or reused in some other way all require a lot of resources, and are therefore necessarily limited. The SUMMA internal monitoring platform facilitates this task and allows us to go beyond the basics. For now, it offers a fully automated content analysis system for eight DW languages (all SUMMA languages except Latvian, which is currently not covered by DW). We get a cross-lingual overview with added metadata and detailed data (including article or transcript text) at item and story (grouped content) level, both in the original language and in the common target language, English.
External monitoring: BBC Monitoring, on the other hand, targeted external sources (data published by other organisations), with a special focus on live channels. This requires a fast, highly scalable system that can process large volumes of data from a large number of external feeds. A customised UI was built for this use case.
On top of the generic internal and external monitoring platforms, the user partners developed additional applications catering for specific use cases. The BBC has developed three applications out of the SUMMA platform: Speedboat (external media monitoring), Sandcastle (internal media monitoring) and Monsoon (fact checking for external media monitoring).
DW's customised applications include the SUMMA Data Dashboard, the Slack Chat Bot Module and the Social Media Analyser.
Separate blog posts explaining those applications will appear on this channel.
Major efforts have been made, and are continuing, towards implementing the tools at the user partners, Deutsche Welle and the BBC. Other interested users can get access through the open-source version. Further support through SaaS is being considered by part of the consortium. Stay tuned: we will keep you posted through this channel.
We thank the European Commission and H2020 for funding and supporting us, and for giving us the chance to develop the prototype and arrive at the highly promising SUMMA platform.