Here at Skim.it we are passionate about summarisation. Our particular interest lies in how automatic summarisation can be used to improve the efficiency of information consumption and sharing, particularly in the workplace.
‘Summary’ is a term everyone should be familiar with, but summarisation as a discipline may well be new to you. So the purpose of this post is to offer up a light introduction to summarisation as a subject area, something perhaps best achieved by providing answers to the following four main questions:
- What exactly is a summary?
- Which types of summaries can be created?
- What are the goals of automatic summarisation?
- How do you know when an automatic summary (machine created) is a good one?
First up then, I’ll explain what a summary is.
What is a summary?
Summaries permeate our everyday lives: news headlines, tables of contents, weather forecast bulletins, music reviews, abstracts of scientific articles, film trailers, and sports highlights are just some of the many forms summaries can take.
As can be deduced from the above examples, ‘summary’ as a concept has subtleties that lend the word more than one definition. However, the most widely accepted definition comes from the American National Standards Institute (ANSI), which defines a ‘summary’ as:
an abbreviated, accurate representation of the content of a document
According to the ANSI definition, a summary must obey two requirements:
1. It must be short.
A summary should contain much less content than the original input, and is usually about a paragraph in length. It should provide the reader with the overarching topic and key ideas of the source, without expanding on specific details.
2. It must be faithful to the source.
A summary should capture the main structure and substance of the document, without adding new information. This way, by reading the summary alone, the reader gets access to a distilled, easy-to-consume version of the document that concisely describes its most important ideas.
What types of summaries can be created?
As we touched upon earlier, summaries can come in many different forms. But we can distinguish different types of summaries according to the:
- Input document(s)
- Purpose of use
- Produced output
Input document(s)
Summaries can be derived from a single document or from several related documents (multi-document summarisation). These documents can be monolingual (written in a single natural language) or multilingual, and can belong to different genres (e.g. news, scientific articles, books, etc.).
The size of the input document also has an impact on the summarisation task, and we can further distinguish between summarisation of short documents (1-2 pages long) and long documents. Things get a bit more complicated when the summarisation input is not text but other sorts of media, such as audio or video.
Purpose of use
Summaries also differ according to the purpose of use and intended readership. The summaries could therefore be:
Generic summaries. These are independent of the audience, of the document genre or domain, and of the summarisation goals – meaning they are suitable for a broad readership community.
Topic-focused summaries. These take into account the user’s information needs, usually stated as a query, in order to produce a personalised summary that contains only the information on the topic that is relevant to the user. Therefore, the same input document can give rise to different summaries, depending on the users’ goals.
Produced output
Finally, summaries can be distinguished in terms of the produced output, which can be an extract or an abstract.
An extractive summary is simply a collection of the most important sentences contained in the source document, created entirely from copied text. This may pose several problems, such as dangling anaphors, redundancies, gaps in the rhetorical structure of the summary and a lack of coherence. A dangling anaphor is a reference whose antecedent has been left out of the summary. For example, an extracted sentence may start by saying ‘and then he carried on this test for 25 years’. On its own, this sentence may not make much sense, because we are missing the previous sentence that tells us who ‘he’ is and which test is meant.
Abstractive summaries are concise versions of the source document that convey its main ideas mostly by using different words.
The key difference between an extract and an abstract is that the latter contains material (words, sentences) that is not present in the source document.
Regardless of the type of summary, summarising content by hand is time-consuming and labour-intensive, and so impractical for large-scale summarisation, such as that required by the Web. Luckily though, computers can perform this task for us in a faster, much more cost-effective way.
What are the goals of automatic summarisation?
The origins of automatic text summarisation can be traced back to 1958 when Hans Peter Luhn published a paper on “The automatic creation of literature abstracts” in the IBM Journal of Research and Development. In this seminal article, Luhn proposed a novel statistical approach for summarising text documents that relies on weighting sentences using frequencies of normalised terms, to then extract the most important ones to form a summary.
Luhn’s pioneering algorithm represented the first step in the development of automatic text summarisation, a field whose relevance and popularity gained momentum with the advent of the Web and the subsequent growth in the amount of text available online.
Automatic text summarisation is a field of computational linguistics whose major goal is to take one or several text documents, automatically extract their gist, and present it to the user in a condensed, digestible form.
The use of computers to produce summaries has enormous potential. It not only frees humans from the effort and resources required to manually produce summaries, but also allows end users to quickly browse through large quantities of content, saving them valuable time too.
However, the automatic creation of high-quality summaries, similar to those produced by humans, is in general a complex task due to the inability of computers to understand natural language.
In an attempt to simplify the problem, most research has followed a pragmatic approach: extractive text summarisation.
Extractive text summarisation uses natural language processing and machine learning to analyse statistical and linguistic features of a piece of text to determine the importance of the corresponding sentences. The typical procedure employed by the extractive approach is to assign scores to each sentence and then successively extract the sentences with the highest scores until the desired length of the summary is reached. The sentences chosen are rearranged and presented to the user in the same order in which they appear in the source so as to enhance coherence and legibility.
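To make the score, select and reorder procedure more concrete, here is a minimal sketch in Python. It is only an illustration, not Luhn’s exact algorithm or the method any particular system uses: the scoring is a plain average of word frequencies, and the names `STOP_WORDS`, `split_sentences`, `tokenize` and `summarise` are all hypothetical, as are the tiny stop-word list and the naive sentence splitter.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "it", "that",
              "this", "for", "on", "as", "are", "was", "with", "be", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep word tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarise(text, max_sentences=2):
    """Return the highest-scoring sentences, presented in source order."""
    sentences = split_sentences(text)
    # Term frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence positions by score, highest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top positions, then restore the order used in the source.
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    document = "The weather was mild today. Dogs chase cats. Cats chase mice."
    for sentence in summarise(document, max_sentences=2):
        print(sentence)
```

Sorting the chosen positions at the end is what puts the extracted sentences back in the order in which they appear in the source.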
Despite its simplicity, the cut-and-paste approach of extractive summarisation poses the readability problems I outlined earlier. To overcome them, a few attempts have been made at abstractive summarisation, where the goal, as discussed earlier, is to capture the meaning of the original text and rewrite it using different words, in order to create a concise, cohesive and coherent summary. Due to the inherent complexity of this task, more progress still has to be made before the abstractive approach becomes competitive.
Irrespective of whether we decide to use an extractive or abstractive approach to summarise text documents, it is of utmost importance to evaluate the results of the automatic summarisation system in order to ensure it is doing a good job.
That leads us nicely onto our fourth and final question…
How do you know when an automatic summary (machine created) is a good one?
One of the biggest challenges in automatic text summarisation is the evaluation of the system output. Evaluation is an important and necessary task since it allows the assessment of the performance of a summarisation system, as well as the comparison between results produced by different systems, enabling us to understand if computer-generated summaries are as good as human-generated ones.
The challenges of automatic evaluation of summaries
Evaluating summaries is a very difficult task for a number of reasons:
1. There is no clear notion of what constitutes a good summary. Although there are attempts to define criteria to guide the evaluation, these tend to be subjective and prone to criticism.
2. Human variability. The most common way to evaluate automatic summaries is to compare them with human-made model summaries (also known as reference summaries or gold standard summaries). But when creating extracts, different people (the annotators) do not usually agree on the importance of sentences and, thus, tend to choose different sentences when creating their summary. As a consequence, the inter-annotator agreement tends to be low. Besides, people are also not very consistent when producing summaries. The same person may choose a different set of sentences at different times, which results in low intra-annotator agreement.
3. Semantic equivalence. It is not uncommon to find two or more sentences in a document that express the same meaning using different words, a phenomenon known as paraphrasing. This makes evaluation even more difficult, because it implies that there may be more than one “good summary” for a given text document.
4. Heavy reliance on humans still. Humans are very good at identifying important content and producing well-formed summaries – meaning the evaluation process tends to rely on humans to create gold standard summaries or act as judges of the system’s output. This greatly increases the cost of an evaluation, making it a time-consuming and expensive task.
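To give a feel for what ‘low agreement’ means in practice, here is a small Python sketch that compares the sentences picked by two annotators. It uses the Jaccard overlap of the two selections, which is just one simple choice made here for illustration (chance-corrected measures are also used), and the function name `jaccard_agreement` and the example selections are made up.

```python
def jaccard_agreement(selection_a, selection_b):
    """Share of selected sentences that both annotators picked.

    Each selection is a collection of sentence positions in the source.
    Returns 1.0 for identical selections and 0.0 for disjoint ones.
    """
    a, b = set(selection_a), set(selection_b)
    if not a and not b:
        return 1.0  # Two empty selections agree trivially.
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical extracts: each annotator picked three sentences.
    annotator_1 = [0, 2, 5]
    annotator_2 = [0, 3, 5]
    print(jaccard_agreement(annotator_1, annotator_2))  # prints 0.5
```

The same function can be applied to two selections made by the same person at different times to get a rough view of intra-annotator agreement.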
Despite these challenges, extensive evaluations of automatic summarisation systems have been carried out from as far back as the 1960s, greatly helping to refine and compare existing systems. So now let’s touch upon the evaluation methods that have been used.
Methods for evaluating text summarisation systems
Methods for automatic summarisation evaluation can be broadly classified along two dimensions:
1. How humans interact with the evaluation process (online vs. offline evaluation).
2. What is measured (intrinsic vs. extrinsic evaluation).
Online evaluation requires the direct involvement of humans in the assessment of the system’s results according to set guidelines.
In contrast, offline evaluation does not require direct human intervention as it usually compares the system’s output with a previously defined set of gold standard summaries, using measures such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This makes offline evaluation more attractive than online evaluation, because it is not directly influenced by human subjectivity, it is repeatable and faster, and allows evaluators to quickly notice if a change in the system leads to an improvement or deterioration of its performance.
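As a rough illustration of this kind of offline comparison, the sketch below computes a simplified ROUGE-N recall: the fraction of the reference summary’s n-grams that also appear in the system summary. It assumes a single reference, whitespace tokenisation and no stemming, so it is not a substitute for the official ROUGE toolkit, and the function names `ngrams` and `rouge_n_recall` are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall against a single reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # Reference too short to contain any n-gram.
    # Clip each match so a repeated n-gram is not counted more often
    # than it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    system = "the cat lay on the mat"
    print(rouge_n_recall(system, reference, n=1))  # 5 of 6 unigrams match
    print(rouge_n_recall(system, reference, n=2))  # 3 of 5 bigrams match
```

Because the score is computed purely from the two texts, it can be re-run after every change to the system, which is what makes this style of evaluation repeatable.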
Intrinsic evaluation methods, as the name suggests, test the summarisation system in and of itself, either by asking humans to judge the linguistic quality, coherence and informativeness of the automatic summaries or by comparing them with human-authored model summaries.
Alternatively, extrinsic evaluation methods test the summarisation system based on the usefulness of the machine-generated summaries in helping to complete a certain task, such as relevance assessment or reading comprehension. This kind of task-oriented evaluation can be of extraordinary practical value to providers and users of summarisation technology, due to its focus on the application (e.g. producing an effective report or presentation based on summaries, finding relevant documents on a given topic from a large collection, correctly answering questions about the source document using summaries only).
If a human is able to perform such tasks in less time and without loss of accuracy, then the system is considered to perform well. However, carrying out extrinsic evaluation requires careful planning and is resource-intensive, so it is not suitable for monitoring the performance of a summarisation system during development. In such contexts, intrinsic evaluation is usually preferred.
Summarisation is not an easy task for computers, due to the subtleties and complexity of natural languages and the inability of computers to truly understand text as we humans do.
Still, much progress has been made in the field of automatic summarisation in the last 50 years and we are now able to successfully use automatic summarisation technology to help improve our information-rich lives.
Experience the benefits of using summarisation technology for yourself by requesting access to Skim.it’s beta testing group.
Thanks for reading!