Real time fuzzy matching with Spark and Elastic Search

A session at Spark Summit 2015

Tuesday 16th June, 2015

4:00pm to 4:30pm

Near duplicates are a major concern in data analysis. Near-duplicate records hamper our ability to integrate data from multiple sources, perform credit checks, identify the best leads by matching internal data against external data, adhere to compliance rules and build a holistic view of systems. This talk discusses how we leverage Apache Spark, machine learning and Elastic Search to provide real-time fuzzy matching. Our application integrates Spark with Elastic Search so that a user can query with a record and find other records in the system that are the same as, or nearly identical to, it. I will discuss how we create and use labeled data to learn similarity models, and how we integrate Spark and Elastic Search to build indices that are queried in real time to find the best-matching records.
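As a rough illustration of the query-time flow the abstract describes (look up candidates in an index, then rank them by similarity), here is a minimal, self-contained Python sketch. It is not the speaker's implementation: the `TrigramIndex` class is a hypothetical in-memory stand-in for an Elastic Search index, `difflib.SequenceMatcher` stands in for a similarity model learned from labeled data, and the sample records and the 0.7 threshold are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def trigrams(text):
    """Return the set of character trigrams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """Tiny in-memory inverted index (hypothetical stand-in for a search index)."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of records containing it
        self.records = {}                 # record id -> original text

    def add(self, record_id, text):
        self.records[record_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(record_id)

    def candidates(self, query):
        """Ids of records that share at least one trigram with the query."""
        ids = set()
        for gram in trigrams(query):
            ids |= self.postings.get(gram, set())
        return ids

    def search(self, query, threshold=0.7):
        """Score each candidate and return (id, text, score) tuples, best first."""
        q = query.lower()
        hits = []
        for rid in self.candidates(query):
            text = self.records[rid]
            # Assumption: a generic string ratio replaces the learned similarity model.
            score = SequenceMatcher(None, q, text.lower()).ratio()
            if score >= threshold:
                hits.append((rid, text, score))
        return sorted(hits, key=lambda h: h[2], reverse=True)


# Invented sample records for illustration only.
index = TrigramIndex()
index.add(1, "Acme Corporation, 12 Main Street")
index.add(2, "ACME Corp., 12 Main St")
index.add(3, "Globex Industries, 99 Ocean Avenue")

results = index.search("Acme Corporation, 12 Main St.")
for rid, text, score in results:
    print(rid, text, round(score, 2))
```

With these sample records, the query returns the two Acme variants, closest first, and leaves out the unrelated Globex record. The two-stage shape matters because the index narrows the search to a few candidates, so the more expensive similarity scoring only runs on those.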

About the speaker

Sonal Goyal

Fuzzy Matching, Deduplication and Entity Resolution at scale (bio from LinkedIn)

Time 4:00pm to 4:30pm PST

Date Tue 16th June 2015
