Tuesday 16th June, 2015
4:00pm to 4:30pm
Near duplicates are a major concern in data analysis. Near-duplicate records hamper our ability to integrate data from multiple sources, perform credit checks, identify the best leads by matching internal data with external data, adhere to compliance rules and build a holistic view of our systems. This talk will discuss how we leverage Apache Spark, machine learning and Elasticsearch to provide real-time fuzzy matching. Our application integrates Spark with Elasticsearch so that a user can query a record and find other records in the system that are identical or nearly identical to it. In this talk, I will discuss how we create and use labeled data to learn similarity models. I will also discuss how we integrate Spark and Elasticsearch to build indices that are queried in real time to find the best-matching records.
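As a rough illustration of the index-then-score pattern described above, here is a minimal standard-library Python sketch. It is not the speaker's implementation: an in-memory inverted index of character trigrams stands in for Elasticsearch, a fixed Jaccard score stands in for the learned similarity model, and every name (`trigrams`, `build_index`, `find_matches`, the sample records, the 0.5 threshold) is hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of character trigrams of a lowercased, whitespace-normalized string."""
    norm = " ".join(text.lower().split())
    padded = f"  {norm} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(records):
    """Map each trigram to the ids of the records containing it (stand-in for a search index)."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for gram in trigrams(text):
            index[gram].add(rec_id)
    return index


def find_matches(query, records, index, threshold=0.5):
    """Return (id, score) pairs whose Jaccard score is >= threshold, best first."""
    q = trigrams(query)
    # Candidate retrieval: any record sharing at least one trigram with the query.
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    # Scoring: fixed Jaccard similarity (assumption; the talk uses a learned model).
    scored = []
    for rec_id in candidates:
        r = trigrams(records[rec_id])
        score = len(q & r) / len(q | r)
        if score >= threshold:
            scored.append((rec_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, for illustration only.
records = {
    1: "Acme Corporation, 12 Main Street",
    2: "ACME Corp., 12 Main St",
    3: "Globex Inc, 400 Ocean Avenue",
}
index = build_index(records)
matches = find_matches("Acme Corporation 12 Main Street", records, index)
```

The two-stage shape is the point of the sketch: a cheap index lookup narrows the field to a few candidates, and only those candidates are scored. A learned model trained on labeled pairs would replace the Jaccard score, and could weigh abbreviations such as "Corp." and "St" more sensibly than raw trigram overlap does.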