Creating Operational Redundancy for Effective Web Data Mining

A session at ConvergeSE 2013

Friday 26th April, 2013

10:00am to 11:00am (EST)

Web data mining is quite frequently unreliable and inefficient, and frankly it often turns into a painful series of hacks and patches just to perform the most basic scraping tasks. The root problem is that the web is full of semantically and structurally incorrect data. What we're dealing with is a junkyard of data that will lead us down the wrong path as we search for the hidden gems.

To create a truly efficient web data mining architecture, several key factors need to be taken into account:
* Creating data redundancy principles around weighted key content identifiers to ensure consistent data returns.
* Understanding the horrible practices that are standard in the industry and used at the peril of whoever has to touch that code.
* Using content caching at the domain level to improve performance, along with page-specific modifiers to preserve each page's unique qualities.
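The first point above — redundancy built around weighted content identifiers — could be sketched as several independent extractors voting on the same field, with weights reflecting how trustworthy each identifier is. This is a minimal illustration, not the speaker's implementation; the extractor names, weights, and regex-based matching are all assumptions made for the sketch.

```python
import re
from collections import defaultdict

# Hypothetical weighted extractors: each (weight, function) pair is one
# redundant way of identifying the same piece of content.  The names
# and weights below are illustrative assumptions, not from the session.

def og_title(html):
    """Open Graph meta tag -- usually the most reliable signal."""
    m = re.search(r'<meta\s+property="og:title"\s+content="([^"]*)"', html)
    return m.group(1) if m else None

def title_tag(html):
    """The <title> element -- often polluted with site branding."""
    m = re.search(r'<title>([^<]*)</title>', html)
    return m.group(1).strip() if m else None

def first_h1(html):
    """The first <h1> -- a weaker but still useful identifier."""
    m = re.search(r'<h1[^>]*>([^<]*)</h1>', html)
    return m.group(1).strip() if m else None

EXTRACTORS = [(3.0, og_title), (2.0, title_tag), (1.0, first_h1)]

def extract_title(html):
    """Run every extractor and return the candidate with the highest
    combined weight, so one malformed tag cannot derail the result."""
    votes = defaultdict(float)
    for weight, fn in EXTRACTORS:
        value = fn(html)
        if value:
            votes[value] += weight
    return max(votes, key=votes.get) if votes else None

page = """<html><head><title>Widgets | Example Shop</title>
<meta property="og:title" content="Blue Widget"></head>
<body><h1>Blue Widget</h1></body></html>"""
print(extract_title(page))  # og:title and <h1> agree -> "Blue Widget"
```

Because the og:title and h1 extractors agree, their combined weight (4.0) outvotes the branded title tag (2.0) — the "consistent data returns" the bullet describes, even when one identifier is noisy or missing.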

Once those principles are in place, we will have the basis for creating a highly scalable, efficient, and effective web data mining architecture, allowing us to create semantic value from any site with any content.
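The domain-level caching idea could be sketched as a cache that stores shared boilerplate (navigation, header, footer) once per domain, while a small per-URL record preserves each page's unique content. The class and field names here are hypothetical, invented for illustration under those assumptions.

```python
from urllib.parse import urlsplit

# A minimal sketch of domain-level content caching with page-specific
# modifiers.  DomainCache and its fields are hypothetical names; a real
# system would also handle expiry and invalidation.

class DomainCache:
    def __init__(self):
        self._shared = {}     # domain -> cached boilerplate fragments
        self._modifiers = {}  # full URL -> page-specific content

    def store(self, url, fragments, unique_content):
        """Cache the shared fragments once per domain and keep only the
        unique content for this particular URL."""
        domain = urlsplit(url).netloc
        self._shared.setdefault(domain, fragments)
        self._modifiers[url] = unique_content

    def fetch(self, url):
        """Reassemble a page from the shared domain cache plus its
        page-specific modifier, avoiding a refetch of the boilerplate."""
        domain = urlsplit(url).netloc
        shared = self._shared.get(domain)
        if shared is None or url not in self._modifiers:
            return None  # cache miss: the caller would scrape normally
        return {"shared": shared, "unique": self._modifiers[url]}

cache = DomainCache()
cache.store("https://example.com/a", {"nav": "<nav>...</nav>"}, "Page A body")
cache.store("https://example.com/b", {"nav": "<nav>...</nav>"}, "Page B body")
hit = cache.fetch("https://example.com/b")
print(hit["unique"])  # the page-specific qualities survive the shared cache
```

Only one copy of the navigation fragment is stored for `example.com`, while each URL keeps its own body — the performance win of domain-level caching without flattening page-specific qualities.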

About the speaker

Jonathan LeBlanc


Books by speaker

  • Identity and Data Security for Web Development: Best Practices
  • Programming Social Applications
