Tackling information overload: Prototype Monsoon
In the modern newsroom, journalists are bombarded with information from an ever-growing number of sources – Twitter, Facebook and other social platforms, alongside more traditional sources of news, including news wires.
In a pressurised news environment, it can be hard to spot new stories, or developments in existing stories, in this sea of information. Finding “new news”, particularly in the stream of updates from the wires, is exactly the challenge that BBC News Labs’ prototype, Monsoon, is designed to address.
What did we do?
The Monsoon prototype works by automatically comparing incoming stories from news wires against a data set of over 500,000 BBC news articles in English – with the aim of identifying additional angles or “new news”.
The definition of “new news” for this prototype was broadly consistent with measures of “interestingness” in the data mining literature:
- A story that may be contradictory to the BBC’s existing knowledge (as encapsulated by the training data)
- A story that does not appear at all in the BBC dataset
- A piece of information that updates or adds to an already published BBC story (for example an updated death toll from a natural disaster, or the release of the names of people involved in an incident who were previously unnamed)
Stories identified as containing “new news” can then be flagged or highlighted to journalists.
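The decision logic above can be sketched as a simple filter over classification labels. This is an illustrative sketch only: the label names mirror the three criteria listed, and the classifier producing them is assumed, not part of the actual Monsoon code.

```python
# Hypothetical sketch of the "new news" decision. The labels are
# stand-ins for the three criteria described in the text; the
# classifier that assigns them is assumed.

NEW_NEWS_LABELS = {"contradictory", "unseen", "updated"}

def is_new_news(label: str) -> bool:
    """Return True if a fact's classification label suggests it is
    worth flagging to a journalist as 'new news'."""
    return label in NEW_NEWS_LABELS

# A fact contradicting existing BBC coverage would be flagged...
assert is_new_news("contradictory")
# ...while one consistent with the knowledgebase would not.
assert not is_new_news("consistent")
```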
How did the SUMMA technologies help?
Monsoon is based on SUMMA technologies and components and is designed around the knowledgebase generation and automated fact checking technologies developed by SUMMA partners Priberam and the University of Sheffield.
The Monsoon prototype picks up stories from a selection of wire feeds coming into BBC News. Monsoon then uses the SUMMA summarisation module, created by Priberam, to decompose each individual story into shorter “facts”, because the full-length wire items are too long to run through the system in its current form. Each “fact” is then passed to the fact checking system for testing.
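The pipeline described above can be sketched as follows. This is a minimal illustration, not the SUMMA code: the function names (`summarise_into_facts`, `check_fact`) are hypothetical stand-ins for the Priberam summarisation module and the fact checking component, and the toy rules inside them are placeholders.

```python
# Illustrative sketch of the Monsoon pipeline: wire story -> short
# "facts" -> per-fact classification. Function names and logic are
# assumptions, not the actual SUMMA APIs.

def summarise_into_facts(story_text: str) -> list[str]:
    """Stand-in for the SUMMA summarisation module: decompose a long
    wire item into short, individually checkable 'facts'."""
    # Toy version: treat each sentence as a candidate fact.
    return [s.strip() for s in story_text.split(".") if s.strip()]

def check_fact(fact: str) -> str:
    """Stand-in for the fact checking component: classify one fact
    against the knowledgebase."""
    # Toy rule: treat revised figures as contradicting prior coverage.
    return "contradictory" if "now" in fact.lower() else "consistent"

def process_wire_story(story_text: str) -> list[tuple[str, str]]:
    """Run a wire story through the two-stage pipeline."""
    return [(f, check_fact(f)) for f in summarise_into_facts(story_text)]

results = process_wire_story(
    "The death toll has now risen to 12. Rescue teams remain on site."
)
# -> [('The death toll has now risen to 12', 'contradictory'),
#     ('Rescue teams remain on site', 'consistent')]
```

Decomposing first matters because the classifier judges each short fact independently; a single long wire item may contain both material the BBC has already covered and one genuinely new detail.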
Given that the training data for the Monsoon system comprises BBC News stories, facts classified as “contradictory” are considered to be of interest and are highlighted.
The user interface mockup below shows how we envisage the feed of wire stories being presented to a journalist. The feed would automatically update and scroll as new stories come in, with the most recent story at the top.
Initial testing with users indicated that supporting research would be an important function of the tool. The individual stories from the knowledgebase that contributed to the classification of a wire story can therefore be interrogated by expanding the appropriate headline.
What did we learn?
We made a number of assumptions in the prototype. The automatically generated summaries were found to sometimes miss important details. Future iterations should use a specialised fact extraction algorithm to reduce this as much as possible.
As journalists already have established ways of having breaking news brought to their attention, we envisage that future iterations of Monsoon could exclude any incoming items tagged as breaking/alert, since these would be surfaced by other means, and instead focus on flagging stories and updates that are more likely to be missed.
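The envisaged filter might look like the sketch below. The wire item structure and the `tags` field are assumptions about the feed format, used purely for illustration.

```python
# Hypothetical pre-filter, as envisaged above: skip wire items already
# tagged breaking/alert, since those reach journalists by other means.
# The item dict and its "tags" field are assumed, not a real feed schema.

SKIP_TAGS = {"breaking", "alert"}

def should_consider(item: dict) -> bool:
    """Return True if an incoming wire item should go through Monsoon
    rather than being left to existing breaking-news alerting."""
    tags = {t.lower() for t in item.get("tags", [])}
    return not (tags & SKIP_TAGS)

incoming = [
    {"headline": "Quake death toll rises", "tags": ["ALERT"]},
    {"headline": "Report revises injury figures", "tags": []},
]
to_process = [i["headline"] for i in incoming if should_consider(i)]
# -> ["Report revises injury figures"]
```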
We did notice that the underlying fact checking system struggled with facts that change over time, sometimes incorrectly matching against facts that were once correct but had subsequently changed, or vice versa. Further research is needed into how the knowledgebase and fact checking algorithm could be made to perform better in these circumstances.