One of the biggest challenges we’re trying to tackle at Skim Technologies is how to extract the most important information from a web page, in order to vastly improve the way people consume information in a mobile-first world. This post details some of the work we’ve been doing to automatically summarise lists.
Our research found that lists, or “listicles” as some call them, are one of the most popular content genres people read on mobile. We’ve all been suckers for the ‘top 10 of …’ or ‘50 ways to…’ headline, and content producers know that titles with lists have much higher click-through rates. But how do you summarise a list while keeping its structure and staying true to the content? Read on to find out more.
We learned a while ago that state-of-the-art text summarisation algorithms can be successfully applied to compress article-style web content. However, there is no one-size-fits-all approach to summarising web pages. For us, the sole purpose of our Skim API technology is to present a digestible snippet of the original page.
While most people are familiar with the function and layout of an article, listicles are less well understood. We call a web page a listicle if it contains a simple list or heading structure that is essential for understanding the text. Some listicles solely serve the function of presenting a list, for example the 10 best gadgets of 2017. Others contain itemised galleries or other structured content in a broader sense. Indeed, this page itself would be considered a listicle according to our classification.
Tailoring Summarisation to the Page Type
So what is the best way to summarise listicles as opposed to articles? Articles and listicles are often similar in their structure, but there are a few features that help us to distinguish them. Thus, looking at the structure can help us to decide which summary template and summarisation approach to use.
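To make the idea of a structural signal concrete, here is a minimal sketch of a page-type check. It is an illustration under stated assumptions, not a description of how the Skim API classifies pages: the class `StructureCounter`, the function `classify_page` and the `min_items` threshold are all hypothetical, and a real classifier would need far more features than a simple tag count.

```python
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Counts subheadings and list items in an HTML page."""

    # Assumption: h2-h4 are treated as subheadings; h1 is the page title.
    SUBHEADINGS = {"h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.subheadings = 0
        self.list_items = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SUBHEADINGS:
            self.subheadings += 1
        elif tag == "li":
            self.list_items += 1


def classify_page(html, min_items=3):
    """Return 'listicle' or 'article' using a naive structural rule.

    Hypothetical rule: a page with at least `min_items` subheadings or
    list items is treated as a listicle.
    """
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    structured = max(counter.subheadings, counter.list_items)
    return "listicle" if structured >= min_items else "article"
```

A rule this simple would misfire on pages with navigation menus or long articles with many section headings, which is one reason the distinction is harder in practice than it looks.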
Figure 1: Article skim vs. listicle skim
The Challenges of Skimming a Listicle
The subheadings of a listicle form an essential part of a listicle summary. Often they form a list that gives a perfect summary of the page content (see Figure 2).
Figure 2: Example taken from https://www.entrepreneur.com/article/238106
However, sometimes the subheadings carry too little meaning to appear on their own. In those cases, it is necessary to present the user with an introductory preamble. This preamble can contain the whole text preceding the list or a summary of it (cf. Figure 3).
Figure 3: Example skim of this post
At other times, subheadings that are essential to understanding the page are not expressive enough on their own and need to be presented together with some of the content that follows them (cf. Figure 4).
This raises various questions: How should we summarise the text contained in a list item? Should text summarisation be used and, if so, how many list items can feasibly be summarised? And, most importantly, how does the machine know when the title or subheadings do not provide enough context for a list-only summary?
Figure 4: Example taken from https://www.entrepreneur.com/article/238249
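The three summary shapes described above (subheadings only, preamble plus subheadings, and subheadings with some of their content) can be sketched as a single assembly function. This is only a naive baseline under loud assumptions: the function `build_skim`, its parameters and the word-count rule are hypothetical, and word count is a crude stand-in for the open question of when a subheading is expressive enough.

```python
def build_skim(subheadings, preamble="", bodies=None, min_words=3):
    """Assemble a listicle skim as a list of lines.

    Naive rule (an assumption, not Skim's method): a subheading with
    fewer than `min_words` words is treated as too terse to stand alone.
    """
    bodies = bodies or [""] * len(subheadings)
    terse = [len(heading.split()) < min_words for heading in subheadings]
    lines = []

    # If most subheadings are terse, lead with the preamble for context.
    if preamble and sum(terse) > len(subheadings) / 2:
        lines.append(preamble)

    for heading, body, is_terse in zip(subheadings, bodies, terse):
        if is_terse and body:
            # Attach the first sentence of the content that follows.
            first_sentence = body.split(". ")[0].rstrip(".") + "."
            lines.append(f"{heading}: {first_sentence}")
        else:
            lines.append(heading)
    return lines
```

For example, `build_skim(["Focus", "Rest"], preamble="Two habits that matter.", bodies=["Do one thing at a time. It helps.", "Sleep eight hours."])` yields the preamble followed by each short subheading paired with its first sentence, while longer, self-explanatory subheadings are returned as a plain list.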
At Skim Technologies we are constantly trying to improve our content extraction and summarisation models, and these are only a few of the questions we are currently facing. We have learned that there is no one-size-fits-all approach to summarising web pages, but we are looking forward to addressing these challenges to make information consumption and communication better for everyone.
Thank you for reading,