Data Ingest, Linking and Data Integration via Automatic Code Generation

A session at Strata 2012

Wednesday 29th February, 2012

1:30pm to 2:10pm (PST)

One of the most complex tasks in a data processing environment is record linkage: the data integration process of accurately matching or clustering records or documents from multiple data sources that refer to the same entity, such as a person or business.
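To make the idea of record linkage concrete, here is a minimal Python sketch, not taken from the talk. The matching rule (`same_entity`), the field names, and the sample records are all assumptions chosen for illustration. It compares every pair of records and groups matches with a union-find structure.

```python
from itertools import combinations

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def same_entity(a, b):
    """Hypothetical rule: normalized names and postal codes must agree."""
    return normalize(a["name"]) == normalize(b["name"]) and a["zip"] == b["zip"]

def link_records(records):
    """Cluster records with union-find; returns one cluster id per record."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_entity(records[i], records[j]):
            ri, rj = find(i), find(j)
            parent[max(ri, rj)] = min(ri, rj)  # smallest index becomes the root
    return [find(i) for i in range(len(records))]

# Illustrative records from two made-up sources, "A" and "B".
records = [
    {"source": "A", "name": "Acme Corp.", "zip": "95054"},
    {"source": "B", "name": "ACME  corp", "zip": "95054"},
    {"source": "B", "name": "Apex Ltd", "zip": "95054"},
]
cluster_ids = link_records(records)
print(cluster_ids)
```

The all-pairs comparison is quadratic in the number of records, so a sketch like this only works for small inputs. Avoiding that cost at scale is part of what makes the problem hard.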

The massive amount of data being collected at many organizations has led to what is now called the “Big Data” problem, which limits the ability of organizations to process and use their data effectively and makes record linkage even more challenging.

New high-performance data-intensive computing architectures that support scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial organizations, and research environments to process massive amounts of data and solve complex data processing problems, including record linkage.

A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open source HPCC scalable data-intensive computing platform. Working from a simple specification, it addresses the most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.

SALT is an ECL code generator for use with the open source HPCC platform for data-intensive computing. The input to the SALT tool is a small, user-defined specification stored as a text file, which contains declarative statements describing the user's input data and process parameters. The output is ECL code, which is then compiled into optimized C++ for execution on the HPCC platform.
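The general pattern of turning a short declarative specification into executable code can be sketched in a few lines of Python. This is a toy illustration only. The `field` and `threshold` keywords are invented for this example and are not SALT specification syntax, and the generator emits Python rather than ECL.

```python
# A made-up three-line specification; NOT real SALT syntax.
SPEC = """\
field name lower
field zip exact
threshold 2
"""

def parse_spec(text):
    """Read 'field <name> <mode>' and 'threshold <n>' statements."""
    fields, threshold = [], 1
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "field":
            fields.append((parts[1], parts[2]))
        elif parts[0] == "threshold":
            threshold = int(parts[1])
    return fields, threshold

def generate_matcher_source(fields, threshold):
    """Emit source for a matches(a, b) function that counts agreeing fields."""
    lines = ["def matches(a, b):", "    score = 0"]
    for name, mode in fields:
        if mode == "lower":
            lines.append(f"    score += a[{name!r}].lower() == b[{name!r}].lower()")
        else:
            lines.append(f"    score += a[{name!r}] == b[{name!r}]")
    lines.append(f"    return score >= {threshold}")
    return "\n".join(lines) + "\n"

fields, threshold = parse_spec(SPEC)
source = generate_matcher_source(fields, threshold)

# Compile and load the generated code, then use it like hand-written code.
namespace = {}
exec(source, namespace)
matches = namespace["matches"]
print(source)
```

The point of the sketch is the ratio between the two artifacts: the specification stays short and declarative, while the generated code carries the repetitive per-field logic.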

The SALT tool can be used to generate complete, ready-to-execute applications for data profiling, data hygiene (also called data cleansing, the process of cleaning data), data source consistency monitoring (checking the consistency of data value distributions among multiple sources of input), data file delta changes, data ingest, and record linking and clustering.

SALT record linking and clustering capabilities include internal linking and external linking. Internal linking is the batch process of linking records from multiple sources that refer to the same entity to a unique entity identifier. External linking, also called entity resolution, takes one of three forms: a batch process that links information from an external file to a previously linked base or authority file in order to assign entity identifiers to the external data; an online process in which information entered about an entity is resolved to a specific entity identifier; or an online process that searches an authority file for the records that best match the entered information about an entity.
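The difference between internal linking and batch external linking can be shown with a small Python sketch. It assumes a deliberately simple exact-match key (`match_key`) and made-up records. It is not how SALT scores matches, only an illustration of the two roles.

```python
def match_key(record):
    """Hypothetical match key: trimmed, lowercased name plus postal code."""
    return (record["name"].strip().lower(), record["zip"])

def internal_link(records):
    """Batch: give records that share a match key the same entity id."""
    ids = {}
    linked = []
    for record in records:
        # New keys get the next id; known keys reuse their existing id.
        entity_id = ids.setdefault(match_key(record), len(ids) + 1)
        linked.append({**record, "entity_id": entity_id})
    return linked

def external_link(external_records, authority):
    """Batch: resolve external records against an already linked authority file."""
    index = {match_key(r): r["entity_id"] for r in authority}
    # Records with no counterpart in the authority file resolve to None.
    return [{**r, "entity_id": index.get(match_key(r))} for r in external_records]

authority = internal_link([
    {"name": "Jane Doe", "zip": "95054"},
    {"name": "jane doe ", "zip": "95054"},
    {"name": "John Roe", "zip": "10001"},
])
resolved = external_link([
    {"name": "JOHN ROE", "zip": "10001"},
    {"name": "Nobody", "zip": "00000"},
], authority)
print([r["entity_id"] for r in authority], [r["entity_id"] for r in resolved])
```

Internal linking creates the identifiers, while external linking only looks them up, which is why the unmatched external record stays unresolved instead of receiving a new identifier.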

SALT Use Case – LexisNexis Risk Solutions Insurance Services used SALT to develop a new insurance header file and insurance ID that combine all the available LexisNexis person data with insurance data. The process combines 1.5 billion insurance records and 9 billion person records, and the linking process produces 290 million core clusters. Source code shrank from 20,000+ lines to a 48-line SALT specification, and linking time fell from 9 days to 55 hours. A precision of 99.9907 was achieved.
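The ratios implied by the use case figures can be worked out directly. The short Python calculation below uses only the numbers quoted above. Because the source gives the original code size as "20,000+", the code reduction is a lower bound. The records-per-cluster average further assumes that every input record ends up in one of the core clusters, which the abstract does not state.

```python
# Figures quoted in the use case.
hours_before = 9 * 24            # 9 days expressed in hours
hours_after = 55
lines_before = 20_000            # source says "20,000+", so this is a floor
lines_after = 48
total_records = 1_500_000_000 + 9_000_000_000
core_clusters = 290_000_000

speedup = hours_before / hours_after
code_reduction = lines_before / lines_after
# Assumption: all input records fall into the core clusters.
records_per_cluster = total_records / core_clusters

print(f"linking speedup: about {speedup:.1f}x")
print(f"code reduction: at least {code_reduction:.0f}x")
print(f"average records per cluster: about {records_per_cluster:.1f}")
```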

Summary and Conclusions – Using SALT in combination with the HPCC high-performance data-intensive computing platform can help organizations solve the complex data integration and processing issues resulting from the Big Data problem, helping organizations improve data quality, increase productivity, and enhance data analysis capabilities, timeliness, and effectiveness.

About the speaker

Tony Middleton

HPCC Systems from LexisNexis



Strata 2012

Santa Clara, United States

28th February to 1st March 2012

Mission City B1, Santa Clara Convention Center
