By Skim Team
10 minute read
OUR INNOVATION DAYS
Use cases that show how AI technology and Data Science are helping to support environmental and humanitarian causes have grown in popularity because of their increasingly positive impact.
At Skim we believe in a bright future, one that will see the widespread adoption of AI technologies and Data Science for social good.
Skim’s mission is to empower people with data and to demystify Artificial Intelligence. Our vision of the future for AI and Machine Learning is for these technologies to help humans to have easier lives and better businesses.
Our Innovation Days aim to use what we do best for social benefit. They give us one or two days a month to play with new technologies and think outside the box.
For this month’s Innovation Day, we dedicated two days to a project with Muscular Dystrophy UK.
MUSCULAR DYSTROPHY UK
Muscular Dystrophy UK (MDUK) is a charity that brings individuals, families, and professionals together to fight muscle-wasting conditions. Around 70,000 children and adults are currently affected by more than 60 rare progressive muscle-weakening and wasting conditions.
The charity supports families living with muscle-wasting conditions by providing vital information, advice, resources, and support for people with these conditions, their families and the professionals who work with them.
Suzannah, Head of Digital at Muscular Dystrophy UK, and her team needed help with the
The goal was to develop a system to automatically tag their website content, which is organised into different sections (e.g. news and blog posts), and classify it into disease categories, or condition types. This classification would help with their content audit and make it easier for their audience to find content that is up to date and relevant to their conditions.
In this blog, we will take you through our two-day Innovation Day project for MDUK, including the preparation, the tools we used, and the text analysis and Machine Learning solutions we adopted to overcome the challenge.
Preparation included receiving the content from MDUK’s website, along with a list of general and specific conditions (also referred to as disease categories or tags), which we examined. Our data experts came up with ideas on how to approach the challenge, set out alternative plans, and outlined the steps to take as well as the obstacles that could arise during the project.
One of the first steps in preparing for the project was to ensure we had consistent, good-quality data from each page of MDUK’s website. Because some important data, such as the title and main content, was available for only a few URLs, we used our Skim Engine, a powerful tool that extracted the missing content. The Skim Engine can also automatically enrich datasets with relevant keywords and extract publication dates for the content, which can then be used to filter old or outdated articles from the website.
Once the data was prepared, we were ready for the days ahead.
The day started with a cup of coffee for everyone and a team meeting to go through the
We discussed everyone’s role in the project and the plan of action for our approach:
1st step: Set up the UMLS (Unified Medical Language System) database.
2nd step: Select and implement an ensemble of normalised fuzzy string matching measures.
3rd step: Find MD-related disease terms in the main content of each web page using the Ontology.
4th step: Apply fuzzy string matching to identify the disease category (or tag) the page is referring to.
5th step: Create a system based on the above components that takes as input the main content of a web page and outputs the likeliest disease category or tag.
6th step: Apply the system to all web pages provided by MDUK.
7th step: Perform manual reviewing of the tags assigned by the system and identify any issues.
8th step: Prepare the delivery of the project results.
There were two main pain points: the large number of possible categories, and annotating the data by condition type.
There were a good number of potential disease types to tag a given document with, and some of these involved quite technical descriptions of diseases. Since none of us had medical expertise, let alone knowledge of muscular dystrophy specifically, it wasn’t going to be easy to infer all the ways these diseases and descriptions might be written in the source material. Without good synonym coverage, we would end up missing most cases, with poor performance overall. To close that gap, we turned to the UMLS (Unified Medical Language System), a meta-thesaurus which links a medical concept code (for example, Muscular Dystrophy has the code C0026850) to its entries in multiple formal medical ontologies (like ICD, SNOMED and so on). From there, we could find lists of synonyms (i.e. how medical professionals tend to express each disease type in papers and journals) to help tag the documents with the right condition.
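To make the synonym idea concrete, here is a minimal Python sketch of a synonym lookup. It is an illustration only, not the code used on the day: the dictionary is a hand-written stand-in rather than real UMLS output, and the names SYNONYMS_BY_CONCEPT, normalise, build_lookup and concept_for are hypothetical. Only the concept code C0026850 and the synonyms listed come from this post.

```python
# Toy stand-in for a UMLS extract (assumption: the real UMLS is a licensed
# database, and this dictionary only mimics its concept-to-synonym shape).
SYNONYMS_BY_CONCEPT = {
    "C0026850": ["Muscular Dystrophy", "MD", "Dystrophic muscle syndrome"],
}


def normalise(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def build_lookup(synonyms_by_concept: dict) -> dict:
    """Invert concept -> synonyms into normalised synonym -> concept code."""
    lookup = {}
    for code, terms in synonyms_by_concept.items():
        for term in terms:
            lookup[normalise(term)] = code
    return lookup


def concept_for(term: str, lookup: dict):
    """Return the concept code for a term, or None if it is unknown."""
    return lookup.get(normalise(term))


lookup = build_lookup(SYNONYMS_BY_CONCEPT)
code = concept_for("dystrophic  muscle syndrome", lookup)
```

An inverted index like this makes each lookup a single dictionary access, which matters when every phrase on every page has to be checked.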
WHY FUZZY STRING MATCHING?
Manual tagging is the process in which a human reads a piece of content and identifies the disease category (i.e., the muscular dystrophy condition) that is referred to in the text. Our first approach was therefore to devise an automatic process that relied on techniques able to mimic the steps a human takes when tagging content. One of the simplest solutions was to use fuzzy string matching, a technique to find strings that approximately (rather than exactly) match a given pattern. In this case, each of the disease categories or conditions represented the pattern we were trying to match.
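As a minimal sketch of fuzzy string matching, the Python below uses the standard library's difflib.SequenceMatcher to score how closely a phrase matches each tag. The choice of measure, the 0.8 threshold and the function names are assumptions made for illustration; the post does not say which measures or cut-offs were used.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_fuzzy_match(phrase: str, tags: list, threshold: float = 0.8):
    """Return (tag, score) for the closest tag, or None if below threshold."""
    score, tag = max((similarity(phrase, tag), tag) for tag in tags)
    return (tag, score) if score >= threshold else None


TAGS = ["Duchenne Muscular Dystrophy", "Spinal Muscular Atrophy"]

# A misspelt mention still resolves to the intended tag.
match = best_fuzzy_match("Duchene muscular distrophy", TAGS)

# An unrelated phrase falls below the threshold and yields None.
no_match = best_fuzzy_match("research grants", TAGS)
```

The threshold is the main tuning knob: set it too low and unrelated phrases get tagged, too high and genuine spelling variants are missed.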
Two of us started to work on the data annotation. Manually annotating a few web pages by assigning general and specific disease categories (the tags) was important for several reasons:
- It helped us familiarise ourselves with the problem at hand.
- It enabled us to properly assess how challenging and time-consuming the task was, especially given the high number of disease categories/conditions (around 80).
- It allowed us to identify obstacles that could hinder our progress with automatic methods.
- It provided us with a ground-truth that could be used to assess the output of the final system.
We developed an Ontology of the terms we would use to tag each document. It would find all MD-related disease terms in each page, but wouldn’t yet determine which of the results was the definitive match to use as the tag. We started with a simple keyword-like approach, searching pages for the tag names provided by MDUK, then generalised these into patterns based on what we were finding (and missing) in a subset of texts. Finally, we enabled dynamic fuzzy matching around our patterns, with the degree of fuzziness for each match depending on the type of term being assessed. After the parameters were fine-tuned, the matches from each page were ready to be evaluated and the appropriate disease type selected as the page tag.
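The pattern stage can be sketched roughly as follows, using Python regular expressions. This covers only the exact-pattern step, not the dynamic fuzziness, and every pattern here is a hypothetical example rather than an entry from the real Ontology; the abbreviations DMD and SMA are likewise assumed for illustration.

```python
import re

# Hypothetical ontology: each tag maps to patterns for ways it may be written.
ONTOLOGY = {
    "Duchenne Muscular Dystrophy": [
        r"duchenne(?:'s)?(?:\s+muscular\s+dystrophy)?",
        r"DMD",
    ],
    "Spinal Muscular Atrophy": [
        r"spinal\s+muscular\s+atrophy",
        r"SMA",
    ],
}


def compile_ontology(ontology: dict) -> dict:
    """Combine each tag's patterns into one case-insensitive, word-bounded regex."""
    return {
        tag: re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)
        for tag, patterns in ontology.items()
    }


def find_candidates(text: str, compiled: dict) -> list:
    """Return (tag, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for tag, pattern in compiled.items():
        for match in pattern.finditer(text):
            hits.append((tag, match.group(0)))
    return hits


compiled = compile_ontology(ONTOLOGY)
page = "New research into Duchenne muscular dystrophy (DMD) was funded."
candidates = find_candidates(page, compiled)
```

The output is deliberately a list of candidates rather than a single answer, since choosing the definitive tag is left to a later step.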
CHOOSING THE TAG
Once the Ontology had provided a manageable selection of candidate words and phrases, we could afford some more computationally intensive calculations to determine which of these was the most probable candidate for the definitive tag. This involved passing each candidate match through an ensemble of well-established distance metrics, to get a well-rounded measure of its similarity to every potential tag. Once combined with a second score representing the specificity of each candidate (more specific tags being preferred over more general ones), we had a system to automatically classify a page into one and only one of the conditions, and so assign each page a single, definitive tag.
The main challenge on Day 1 was that fuzzy string matching against the entire text of the page gave really poor results. We considered instead segmenting the text and testing the similarities between a) a given tag name and b) each word or phrase in the page. But this would need a very large number of comparisons (each of the 80 tag names against every word and phrase on every page). In the end, we decided on a two-phase solution: first, to develop the Ontology, using patterns to find the segments that looked most relevant, and second, to apply the much more thorough fuzzy ensemble-based analysis to just those resulting segments to select the most probable tag. Computationally, this was much more manageable and, once implemented, started to give really promising results.
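A rough Python sketch of this scoring step is shown below. The post does not name the metrics in the ensemble, so the three used here (normalised Levenshtein distance, token-level Jaccard overlap and difflib's sequence ratio), the simple averaging, the specificity values and the 0.2 weight are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct words the two strings have in common."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def sequence_ratio(a: str, b: str) -> float:
    """difflib's matching-blocks ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def ensemble_similarity(a: str, b: str) -> float:
    """Average of the three normalised measures, ignoring case."""
    a, b = a.lower(), b.lower()
    measures = (levenshtein_similarity, token_jaccard, sequence_ratio)
    return sum(measure(a, b) for measure in measures) / len(measures)


def choose_tag(candidates: list, specificity: dict, weight: float = 0.2):
    """Pick the tag with the best blend of similarity and specificity."""
    best_tag, best_score = None, -1.0
    for phrase in candidates:
        for tag, spec in specificity.items():
            score = (1 - weight) * ensemble_similarity(phrase, tag) + weight * spec
            if score > best_score:
                best_tag, best_score = tag, score
    return best_tag, best_score


# Hypothetical specificity scores: 0.0 for a general tag, 1.0 for a specific one.
SPECIFICITY = {"Muscular Dystrophy": 0.0, "Duchenne Muscular Dystrophy": 1.0}
tag, score = choose_tag(["muscular dystrophy", "Duchenne muscular dystrophy"], SPECIFICITY)
```

Averaging several measures smooths out the blind spots of any single one: character-level distances handle typos, while the token overlap is forgiving of word order. The specificity term breaks ties in favour of the narrower condition when a page mentions both.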
At 6 pm it was time to switch off and go for a nice meal together as a reward for a successful Day 1 of our project.
The day started with a quick standup to recap and discuss the tasks for the day. Suzannah came to visit us in the office halfway through the second day, so we had the opportunity to show her the progress we’d made on the project and to explain the likely outcome.
The highlights of the project on day 2 were:
- Implementing UMLS.
- Evaluating the Automatic Disease Tagger.
On the second day, we integrated the additional terms from the UMLS into our Ontology. This turned out to be really useful. Some of the synonyms (like “MD”) were common sense, but others (like “Dystrophic muscle syndrome”) we would never have guessed. Another benefit of the UMLS was that we got synonyms in multiple languages.
Did you know “福山型先天性筋ジストロフィー” means “Fukuyama type congenital muscular dystrophy” in Japanese? We didn’t either. Given there were only English documents in our data, this wasn’t of too much use. But a system that can formally disease-code collections of language-diverse webpages seemed like something with plenty of future potential.
EVALUATING THE AUTOMATIC DISEASE TAGGER
Once we finished putting together the pieces that make up our automatic tagger, we used it to tag the pages belonging to a random sample of sections from MDUK’s website (i.e., the grants, blogs, and news sections). We manually reviewed the output of the system and compared it with the ground-truth tags assigned by the team on Day 1. We felt a mix of surprise and excitement on realising how good the automatic tagger was at assigning the right disease categories/conditions to the pages. At the end of the Innovation Day, it was interesting to conclude that simple solutions can work surprisingly well in well-defined contexts (i.e., muscular dystrophy conditions). We had a plan B, which was to use semi-supervised learning to generate a disease tagger, but in the end we didn’t have to resort to more complex solutions since our plan A surpassed our expectations.
The outcome was surprisingly good for a two-day project. At the end of the second day, the web pages from MDUK’s website were all tagged. We assigned two tags to each web page: the general condition tag (e.g., Spinal Muscular Atrophy) and the specific condition tag (e.g., Duchenne Muscular Dystrophy). The tagging was performed automatically by the system developed during the Innovation Day: a mix of ontology creation and fuzzy string matching. An evaluation of the results revealed that the tags assigned by our system were very accurate, saving many human hours of manual tagging.
We really enjoyed working on this project but, most of all, it was a rewarding experience for all of us. By manually going through the stories and content published on their website for data annotation, we learned about the different names of the conditions, the struggles people face, and the support that society and organisations such as MDUK offer to people affected by these conditions and to their families.
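A comparison against ground-truth tags can be sketched in a few lines of Python. The page paths and tags below are invented purely to show the calculation and are not results from the project; the function names are hypothetical too.

```python
def tagging_accuracy(predicted: dict, ground_truth: dict) -> float:
    """Fraction of manually annotated pages where the system's tag agrees."""
    shared = predicted.keys() & ground_truth.keys()
    if not shared:
        raise ValueError("no pages in common between predictions and ground truth")
    correct = sum(predicted[page] == ground_truth[page] for page in shared)
    return correct / len(shared)


def disagreements(predicted: dict, ground_truth: dict) -> list:
    """Pages where the tags differ, sorted by page, for manual review."""
    shared = sorted(predicted.keys() & ground_truth.keys())
    return [page for page in shared if predicted[page] != ground_truth[page]]


# Invented example data: page path -> assigned tag.
GROUND_TRUTH = {
    "/news/a": "Duchenne Muscular Dystrophy",
    "/news/b": "Spinal Muscular Atrophy",
    "/blog/c": "Duchenne Muscular Dystrophy",
}
PREDICTED = {
    "/news/a": "Duchenne Muscular Dystrophy",
    "/news/b": "Spinal Muscular Atrophy",
    "/blog/c": "Spinal Muscular Atrophy",
    "/grants/d": "Spinal Muscular Atrophy",  # not annotated, so not scored
}

accuracy = tagging_accuracy(PREDICTED, GROUND_TRUTH)
to_review = disagreements(PREDICTED, GROUND_TRUTH)
```

Listing the disagreements alongside the headline figure is useful in practice, because reading the mismatched pages often reveals a missing synonym or an overly loose pattern.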
We would like to thank Suzannah and MDUK for giving us the opportunity to contribute our technology and Data Science skills to this really great cause.
A QUOTE FROM SUZANNAH
It was great to work with Skim on
and classifying the content on the MDUK site and this lays the foundation for our website redesign later in the year. We know that when people are diagnosed with a muscle-wasting condition, our website is the first thing they turn to for support and so we want to provide all the content they need to manage their condition.
We also know that our current website is not providing a good user experience and with a limited MDUK resource, to organise
and classify the content manually was not possible. Using the Skim engine was a brilliant way to automate the organisation of the content that will form the basis of our content audit for the new website.
We were very impressed with what could be achieved in a short space of time by using the Skim engine and we’re now ready to tackle the re-structure of those 8,000 pages!