In our new series, Bleeding Edge, we look at breakthrough technologies in Text Analytics, Machine Learning and Natural Language Processing that are making the transition from academic lab rat to corporate cash cow.
We’ll take a deep dive into the different technical approaches and review some of the commercial applications and the companies starting to make headway in the field.
Skim Technologies is a company that built its foundations around summarization technology, so we felt it appropriate to kick off the series with a look at Multi-Doc Summarization.
What is Multi-Doc Summarization (MDS)?
Multi-document summarization is an automatic procedure for extracting information from multiple texts written about the same topic.
Main approaches to Multi-Document Summarization technology
Existing multi-document summarization (MDS) methods fall into three main categories, which are presented below in increasing order of complexity.
Extraction-based methods
Traditional extractive methods treat a set of related documents as a pool of sentences, which are then ranked by some form of salience score. The most salient sentences are selected to be part of the summary. Since the resulting summaries are made up of verbatim sentences from the input documents, these methods can yield overlapping information (redundancy). Typically, two types of unsupervised models are used for sentence selection: 1) models based on sentence ranking, and 2) models based on sparse reconstruction.
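To make the idea of ranking pooled sentences by a salience score concrete, here is a minimal sketch. It assumes a very crude salience measure (the average corpus-wide frequency of a sentence's words) and a tiny hand-picked stopword list; both are illustrative choices of ours, not taken from any of the systems discussed in this post.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "were", "at"}

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOPWORDS]

def rank_by_salience(documents, top_k=2):
    """Pool the sentences of all documents, score each one, return the top_k."""
    sentences = [s for doc in documents for s in doc]
    # Word frequency over the whole document set is our crude importance signal.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def salience(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return sorted(sentences, key=salience, reverse=True)[:top_k]

documents = [
    ["The earthquake struck the coast at dawn.", "Officials opened shelters."],
    ["The earthquake damaged roads along the coast.", "Schools will stay closed."],
]
print(rank_by_salience(documents, top_k=2))
```

In this toy run both selected sentences are about the earthquake on the coast, which is exactly the redundancy problem described above: a pure ranking has no notion of what the summary already says.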
Sentence ranking methods include: (a) centroid-based methods (Lin and Hovy, 2002; Radev et al., 2004), which score sentences based on features such as cluster centroids, sentence position and TF-IDF; (b) graph-based models, such as LexRank, TextRank, and DivRank (Erkan and Radev, 2004; Mihalcea and Tarau, 2005; Mei et al., 2010), which first measure the similarity between sentences and then run a ranking algorithm such as PageRank on the similarity graph to estimate the importance of each sentence; and (c) topic-based models (Hardy et al., 2002; Harabagiu and Lacatusu, 2005; Wang et al., 2008), which identify topics in the documents and rank sentences according to their topic membership. After the top-ranked sentences are identified, further strategies are applied to ensure the selected sentences do not convey redundant information. A widely used strategy is Maximum Marginal Relevance (MMR) (Goldstein et al., 1999), a greedy approach that selects sentences one at a time while trading off relevance against redundancy. Other strategies proposed in the literature include unified models that perform sentence selection and redundancy control simultaneously (see, for instance, Lin and Bilmes, 2012 and Sipos et al., 2012). However, this is a challenging problem, and it is usually hard to strike a good balance between relevance and redundancy.
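The graph-based ranking and MMR steps can be sketched in a few lines. This is a simplified illustration rather than a faithful reimplementation of LexRank or of the original MMR formulation: it assumes smoothed TF-IDF vectors, cosine similarity as edge weights, a fixed number of PageRank-style iterations, and relevance scores rescaled to a maximum of 1 so they are comparable with the similarity penalty. The function names and the `trade_off` parameter are our own.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (word -> weight) per sentence."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    df = Counter(w for bag in bags for w in bag)  # sentences containing each word
    return [{w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in bag.items()}
            for bag in bags]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(sentences):
    """Symmetric sentence similarity graph with a zero diagonal."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim

def graph_rank(sim, damping=0.85, iterations=50):
    """PageRank-style power iteration on the weighted similarity graph."""
    n = len(sim)
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out_weight[j] * scores[j]
                            for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores

def mmr_select(sim, relevance, k, trade_off=0.5):
    """Greedy MMR: reward relevance, penalise similarity to sentences already chosen."""
    chosen, candidates = [], list(range(len(relevance)))
    while candidates and len(chosen) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in chosen), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        best = max(candidates, key=mmr)
        chosen.append(best)
        candidates.remove(best)
    return chosen

def summarize(sentences, k=2, trade_off=0.5):
    if not sentences:
        return []
    sim = similarity_matrix(sentences)
    scores = graph_rank(sim)
    top = max(scores)
    relevance = [s / top for s in scores]  # rescale so the best sentence scores 1
    return [sentences[i] for i in mmr_select(sim, relevance, k, trade_off)]
```

With `trade_off=1.0` the selection reduces to pure ranking and will happily pick two near-duplicate sentences; lowering it makes the redundancy penalty bite. Tuning that single number is a small-scale version of the relevance versus redundancy balancing act mentioned above.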
There have also been a few attempts at using deep learning to rank sentences for MDS. One recent example is the MDS system proposed by Yasunaga et al. (2017), which applies a Graph Convolutional Network (GCN) to sentence relation graphs, with sentence embeddings obtained from Recurrent Neural Networks as input node features. Through multiple layer-wise propagation steps, the GCN generates high-level hidden sentence features for salience estimation. A greedy heuristic is then applied to extract salient sentences while avoiding redundancy.
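As a rough illustration of what "layer-wise propagation" means, the sketch below implements a single graph convolution step in plain Python. It assumes the commonly used propagation rule ReLU(Â H W), where Â is the adjacency matrix with self-loops added and symmetrically normalised; the exact architecture, the RNN sentence embeddings and the trained weights of the cited system are not reproduced here, and the toy matrices are made up for the example.

```python
import math

def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(a_hat, features, weights):
    """One propagation step: ReLU(a_hat @ features @ weights)."""
    mixed = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in mixed]

# Two sentences linked by one edge; features and weights are toy values.
a_hat = normalize_adjacency([[0.0, 1.0], [1.0, 0.0]])
hidden = gcn_layer(a_hat, [[1.0, 0.0], [0.0, 1.0]], [[1.0, -1.0], [1.0, -1.0]])
print(hidden)
```

Each call mixes a sentence's features with those of its neighbours in the relation graph, so stacking several layers lets information travel further across the graph before a salience score is read off the final features.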
Sparse reconstruction methods apply the idea of data reconstruction to the summarization task. Examples of such methods include: (a) DSDR (He et al., 2012), which reconstructs each sentence in the document set by a non-negative linear combination of summary sentences and then minimises the reconstruction error; (b) MDS-Sparse (Liu et al., 2015), which proposes a two-level sparse representation model to reconstruct the sentences in the document set while ensuring diversity among them; (c) SpOpt (Yao et al., 2015), which follows the sparse representation framework while simultaneously doing sentence selection and compression by alternately adjusting reconstruction and compression coefficients during optimization; and (d) DocRebuild (Ma et al., 2016), which reconstructs the documents with summary sentences through a neural document model and selects summary sentences so as to minimise the reconstruction error.
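The reconstruction idea can be illustrated with a deliberately simplified stand-in. The sketch below is not DSDR or any of the other cited methods: it assumes bag-of-words vectors, lets each sentence be reconstructed from just one summary sentence scaled by a non-negative coefficient, and picks summary sentences greedily so that the total squared reconstruction error is as small as possible. All function names are our own.

```python
import re
from collections import Counter

def bag(sentence):
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def residual(v, x):
    """Squared error left after approximating v by a * x with the best a >= 0."""
    dot = sum(count * x.get(w, 0) for w, count in v.items())
    xx = sum(count * count for count in x.values())
    a = max(0.0, dot / xx) if xx else 0.0
    return sum((v.get(w, 0) - a * x.get(w, 0)) ** 2 for w in set(v) | set(x))

def total_error(vectors, summary_idx):
    """Every sentence is rebuilt from whichever summary sentence fits it best."""
    return sum(min(residual(v, vectors[j]) for j in summary_idx) for v in vectors)

def greedy_reconstruction_summary(sentences, k):
    vectors = [bag(s) for s in sentences]
    chosen = []
    while len(chosen) < min(k, len(sentences)):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        # Add the sentence that most reduces the error over the whole set.
        best = min(remaining, key=lambda i: total_error(vectors, chosen + [i]))
        chosen.append(best)
    return [sentences[i] for i in chosen]

sentences = [
    "earthquake hits coast",
    "earthquake hits coast hard",
    "schools stay closed",
    "schools stay closed monday",
]
print(greedy_reconstruction_summary(sentences, k=2))
```

Because the objective rewards covering the whole document set, the two selected sentences come from different topical groups here, without a separate redundancy step. The real methods solve a joint optimisation over all coefficients rather than this greedy one-sentence approximation.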
A major drawback of all the methods mentioned above is that the document set is treated as a set of sentences and all operations are carried out on that sentence set, thus neglecting the global semantics of the documents. Note, however, that despite these limitations and the associated performance ceiling, most summarization systems adopt the extraction-based approach.
Compression-based methods
These methods are a natural extension of the extraction-based ones, adding an extra step that further compresses the selected sentences. Compression is achieved by removing unimportant or redundant words and phrases from the selected sentences, making them more concise and general (see, for instance, the methods proposed by Knight and Marcu, 2000, Harabagiu and Lacatusu, 2010, and Lin et al., 2015). Still, these methods cannot merge facts from different source sentences, because all the words and phrases in a given summary sentence come from a single source sentence.
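A toy version of the compression step might look like the sketch below. It assumes two hand-written deletion rules (drop parenthetical asides, drop words from a small filler list); the cited methods learn what to delete from data or rely on syntactic analysis, so this only illustrates the shape of the operation.

```python
import re

# Hypothetical filler list, chosen purely for the example.
FILLER_WORDS = {"very", "really", "basically", "actually", "quite"}

def compress(sentence):
    """Shorten an extracted sentence by deleting low-content material."""
    # Rule 1: remove parenthetical asides together with the space before them.
    text = re.sub(r"\s*\([^)]*\)", "", sentence)
    # Rule 2: remove filler words, ignoring attached punctuation when matching.
    kept = [w for w in text.split() if w.strip(",.;:").lower() not in FILLER_WORDS]
    return " ".join(kept)

print(compress("The storm (the third this year) was really very severe."))
# prints "The storm was severe."
```

Note that every surviving word still comes from the same source sentence, which is why compression alone cannot combine facts spread across several sentences.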
Abstraction-based methods
Unlike the previous methods, abstraction-based methods can generate new sentences whose fragments come from different source sentences via sentence aggregation and fusion. Existing methods fall into five main categories:
- Sentence fusion based methods (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Banerjee et al., 2015), which first cluster the sentences to compute the salience of topical themes and then generate a new sentence for each cluster by merging the common information units contained in all the sentences in the cluster;
- Information extraction based methods (e.g., Genest and Lapalme, 2011), which rely on extracting information units, such as Information Items or Basic Semantic Units, as components for generating sentences;
- Summary revision based methods (Nenkova, 2008; Siddharthan et al., 2011), which try to improve the quality of the summary by rewriting noun phrases and resolving co-references;
- Pattern-based sentence generation methods (Wang and Cardie, 2013; Pighin et al., 2014; Bing et al., 2015), which generate new sentences based on a set of sentence generation patterns learned from a corpus or designed templates;
- Deep learning methods (Rush et al., 2015; Chopra et al., 2016), which have more recently been proposed for abstractive summarization tasks such as headline generation and sentence summarization.
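To show what it means for a generated sentence to draw on several source sentences, here is a minimal sketch of the simplest variant in the list, generation from designed templates. The extraction patterns, slot names and template are hypothetical and hand-written for this example; the cited systems learn their patterns from a corpus and are considerably more sophisticated.

```python
import re

# Hypothetical hand-written extraction patterns, one per fact type.
SLOT_PATTERNS = {
    "place": re.compile(r"struck (?P<value>[A-Z][a-z]+)"),
    "magnitude": re.compile(r"magnitude (?P<value>\d+(?:\.\d+)?)"),
}
# Hypothetical generation template that combines the extracted facts.
TEMPLATE = "A magnitude {magnitude} earthquake struck {place}."

def fill_slots(sentences):
    """Collect the first value found for each slot across all sentences."""
    slots = {}
    for sentence in sentences:
        for name, pattern in SLOT_PATTERNS.items():
            match = pattern.search(sentence)
            if match and name not in slots:
                slots[name] = match.group("value")
    return slots

def generate(sentences):
    """Return a new sentence, or None if some slot could not be filled."""
    slots = fill_slots(sentences)
    if set(slots) != set(SLOT_PATTERNS):
        return None
    return TEMPLATE.format(**slots)

cluster = [
    "An earthquake struck Lisbon early on Monday.",
    "Seismologists measured a magnitude 6.1 tremor.",
]
print(generate(cluster))
# prints "A magnitude 6.1 earthquake struck Lisbon."
```

The output sentence appears in neither input, yet every fact in it is grounded in one of them. That is the property extraction and compression cannot offer, and also the reason these systems need far more linguistic machinery than this sketch suggests.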
Abstraction-based approaches gather information across sentence boundaries, and hence have the potential to cover more content in a more concise manner. However, these methods are much harder to implement than extraction-based and compression-based methods, due to their complexity and the expertise they require.
The future of Multi-Doc Summarization
There have been developments in recent years around more sophisticated methods, such as the abstraction-based ones, which tend to generate summaries that are closer to the ones created by humans. However, these methods are a lot more complex and far more computationally demanding. This brings commercial limitations, since the processing requirements will push costs up and slow down output, restricting the applications to organisations that already process a lot of data and therefore have economies of scale.
Most MDS systems we looked at were devised to generate a single summary that captures the main information contained in a specific set of documents. This document set usually covers a fixed time horizon, i.e. static content. However, given the dynamic nature of Web content generation and the fast-changing flow of news streams, new documents related to the same topic may arrive at a later time, providing updated information about an event or topic. Take, for example, news updates on a natural disaster or a criminal investigation. One area where we see potential for commercial development is Update Summarization, where a user would naturally prefer to read updated information at certain time intervals instead of relying on a static event summary.
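One simple way to think about Update Summarization is as a novelty filter on top of an existing summary. The sketch below assumes word-set Jaccard overlap as the similarity measure and an arbitrary threshold of 0.5; both are illustrative choices of ours, and a real system would also rank the new sentences by importance before filtering.

```python
import re

def words(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def update_summary(previous_summary, new_sentences, novelty_threshold=0.5, max_sentences=2):
    """Keep only new sentences that differ enough from what the reader has seen."""
    seen = [words(s) for s in previous_summary]
    update = []
    for sentence in new_sentences:
        if len(update) == max_sentences:
            break
        current = words(sentence)
        overlap = max((jaccard(current, s) for s in seen), default=0.0)
        if overlap < novelty_threshold:
            update.append(sentence)
            seen.append(current)  # later sentences must also differ from this one
    return update

previous = ["An earthquake struck the coast at dawn."]
incoming = [
    "An earthquake struck the coast at dawn today.",
    "Rescue teams reached the worst hit villages.",
]
print(update_summary(previous, incoming))
# prints ['Rescue teams reached the worst hit villages.']
```

The first incoming sentence repeats what the reader already knows and is dropped, while the second carries new information and makes it into the update.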
Only a handful of companies are bringing MDS technology to market, most notably Agolo and Fast Forward Labs (acquired by Cloudera), and it’s so early that there’s not much to say yet about their successes.
However, Agolo raised $3.5m earlier this year from Microsoft Ventures, amongst others, which goes to show there’s a certain level of interest in the field. It also suggests the field is so new that not even the tech giant is developing the technology for itself yet.
Agolo claims to be working with most publishers and is also offering its platform to banks for document analysis. However, on review of their API, it seems they’re only using simple extractive MDS methods, which have their limitations, so it will be interesting to see how the company progresses and whether the technology will move to a more abstractive approach.
Technology Use Cases
Multi-Doc Summarization could open up a wealth of applications, most probably in processing large amounts of textual information that a human needs to consume when making decisions within government or corporate organisations. But it could also be used effectively in consumer apps.
We’re big fans of the Jobs To Be Done (JTBD) framework, so for simplicity’s sake, I’ve written out some of our ideas on how this technology could be applied commercially, each as a Job To Be Done.
Government Official JTBD:
As a politician I need to know what is happening on a given subject at a given moment, so that I can make important decisions quickly. But I don’t have enough time to read all of the documents.
Investor JTBD:
As an analyst I want to get an overview of a company or person that my organisation is considering investing in, so that I can make a sound and educated decision.
Financial Trader JTBD:
As a trader I want to get an overview of all the risks contained within a document that I’m about to sign, but there are thousands of pages, and I might miss some of the risks contained within.
Consumer News Reader JTBD:
As a reader of news, I want a summary of things that have happened around a breaking news event, so that I can be completely informed on the subject without the need of further reading.
Most of these jobs to be done stem from the need to save time when consuming large volumes of information, which we’ve always found to be the strongest need for summarization, whether it’s multi- or single-doc.
I’m sure there are hundreds more use cases, so it would be great to hear your ideas for MDS. Just tweet us @skimit, #bleedingedge #MDS
Big thanks to Marcia Oliveira who helped with the research for this post.