Skim Engine Updates: Address Extraction

Updates with our customers in mind


The main advantage of our address model is its ability to accurately detect and extract addresses from any web page.
With this new feature of the Skim Engine, we aim to address the lack of address detection models available in either open-source or commercial form.

Technical approach


Our new address feature was developed by following a two-step approach: first, we developed an address detection model, i.e., a Machine Learning classification model that predicts which blocks in a web page are likely to contain an address; then, we combined it with an address parser in order to extract the components of the address (e.g., road, postcode, city, country).
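
The two-step approach can be sketched as a small pipeline. This is a minimal illustration only: `extract_addresses`, `toy_detector`, and `toy_parser` are hypothetical names, and the toy rules stand in for the trained classifier and the address parser, which are not shown in this post.

```python
from typing import Callable, Dict, Iterable, List


def extract_addresses(
    blocks: Iterable[str],
    is_address: Callable[[str], bool],
    parse: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Step 1: keep blocks flagged by the detector. Step 2: parse each kept block."""
    return [parse(block) for block in blocks if is_address(block)]


# Toy stand-ins for illustration only (not the Skim Engine's model or parser).
def toy_detector(block: str) -> bool:
    # Flags a block when it contains both a digit and a comma.
    return any(ch.isdigit() for ch in block) and "," in block


def toy_parser(block: str) -> Dict[str, str]:
    # Splits at the first comma; a real parser would label every component.
    road, _, rest = block.partition(",")
    return {"road": road.strip(), "rest": rest.strip()}


blocks = [
    "About us",
    "27 Finsbury Circus, London EC2M 5NT",
    "Read more",
]
results = extract_addresses(blocks, toy_detector, toy_parser)
```

Passing the detector and the parser in as functions keeps the two steps independent, so either one could be swapped out without touching the other.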

We framed the problem of detecting address blocks in web pages as a Machine Learning classification problem: the goal was to predict whether a given HTML block contains an address or not. To do so, we devised a broad set of features that gives the Machine Learning algorithm the information it needs to perform the task accurately. This set included HTML-based, NLP-based, and domain-knowledge-inspired features. We also had to account for the natural imbalance of the data (i.e., the proportion of address blocks to non-address blocks in web pages is typically low) and adopt strategies to balance the training dataset. The final address detection model performed well on new, unseen web pages: the F1 measure of the address class is above 90% on the test set.
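
To make the ingredients above concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the individual features, the `STREET_WORDS` list, and the function names are hypothetical, and random undersampling is just one common way to balance classes, not necessarily the strategy we used.

```python
import random
import re
from html.parser import HTMLParser
from typing import Dict, List, Sequence, Tuple


class _BlockStats(HTMLParser):
    """Collects the tag names and visible text of one HTML block."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []
        self.text: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


# Hypothetical domain-knowledge cue list; a real system would need a larger one.
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave", "circus", "lane"}


def block_features(html_block: str) -> Dict[str, float]:
    """Returns a few illustrative features for one HTML block."""
    parser = _BlockStats()
    parser.feed(html_block)
    parser.close()
    tokens = re.findall(r"[A-Za-z]+|\d+", " ".join(parser.text))
    n = max(len(tokens), 1)
    return {
        # HTML-based
        "has_address_tag": float("address" in parser.tags),
        "n_line_breaks": float(parser.tags.count("br")),
        # Text-based surface statistics
        "digit_token_ratio": sum(t.isdigit() for t in tokens) / n,
        "capitalised_ratio": sum(t[0].isupper() for t in tokens) / n,
        # Domain knowledge
        "has_street_word": float(any(t.lower() in STREET_WORDS for t in tokens)),
    }


Example = Tuple[Dict[str, float], int]  # (features, label), label 1 = address


def undersample(examples: Sequence[Example], seed: int = 0) -> List[Example]:
    """Randomly drops non-address examples until they do not outnumber addresses."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    return pos + rng.sample(neg, min(len(neg), len(pos)))


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


feats = block_features("<address>27 Finsbury Circus,<br>London EC2M 5NT</address>")
```

Reporting F1 for the address class only, rather than overall accuracy, matters here because a model that labels every block as "not an address" would still score high accuracy on imbalanced data.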

After using the address detection model to predict the blocks that are likely to contain an address, we apply an address parser to the corresponding strings of text to extract the address elements and output a normalised version of the address. The locations associated with each detected address are also geotagged and separately provided in the locations feature of the Skim Engine.
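
The parsing and normalisation step can be illustrated with a deliberately simple rule-based sketch. It is not the parser used in the Skim Engine: `parse_uk_address` is a hypothetical function that only handles text shaped like "road, city postcode", the postcode pattern is a simplified approximation of UK postcodes, and the country is an assumption inferred from that pattern.

```python
import re
from typing import Dict, Optional

# Simplified UK postcode pattern (an approximation, not the full official rule).
UK_POSTCODE = re.compile(
    r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE
)


def parse_uk_address(text: str) -> Optional[Dict[str, str]]:
    """Splits 'road, city postcode' text into components.

    Returns None when no postcode-like token is found.
    """
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    match = UK_POSTCODE.search(flat)
    if match is None:
        return None
    postcode = f"{match.group(1)} {match.group(2)}".upper()
    before = flat[: match.start()].strip(" ,")
    road, sep, city = before.rpartition(",")
    if not sep:  # no comma: treat everything before the postcode as the road
        road, city = before, ""
    return {
        "road": road.strip(),
        "city": city.strip(),
        "postcode": postcode,
        "country": "United Kingdom",  # assumed from the postcode format
    }


def normalise(parts: Dict[str, str]) -> str:
    """Joins the non-empty components into a single-line address."""
    order = ("road", "city", "postcode", "country")
    return ", ".join(parts[key] for key in order if parts.get(key))


parsed = parse_uk_address("27 Finsbury Circus,\nLondon EC2M 5NT")
one_line = normalise(parsed) if parsed else ""
```

Hand-written rules like these break quickly on international or free-form addresses, which is why the parser only runs on blocks that the detection model has already flagged.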

Solve a problem


The Postal Address extraction model recently released to the Skim Engine lets our clients extract an address from a website, PDF, or other unstructured source (including cells in an Excel sheet). This is a highly valuable feature for many sectors. We originally developed it for the Insurance industry, where premises location and distance to Emergency Services can be used to model risk further. Since then, we have found multiple other use cases in financial services and healthcare. If you have a use for our Address Extraction feature, please get in touch at sales@skimtechnologies.com.

Other recent releases: Geo Tagging and Entity Extraction.

Our mission

Skim’s mission is to empower people to use data more effectively and to demystify artificial intelligence. Rather than holding up the common narrative of machines replacing humans, we see how machines can help humans to have easier lives and better businesses.
