Large scale web mining

A session at Strata 2012

Tuesday 28th February, 2012

9:00am to 12:30pm (PST)

This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

1. Introduction

  • Why web data is valuable
  • Key challenges to web crawling
  • Realistic definitions for success

2. Focused Web Crawling

  • Reducing time & cost by focusing the crawl
  • Approaches to classifying and scoring pages
  • Solutions for scalable web crawling

3. Structured Data Extraction

  • Data mining essentials
  • Structured text extraction
  • Automated vs. manual extraction

4. Analyzing the Data

  • Making it searchable
  • Finding "interesting" text
  • Machine learning with Mahout

5. Barriers to Success

  • Polite crawling versus deep crawling
  • Spam, splog, honeypots and nasty webmasters
  • Ajax, robots.txt and Facebook

6. Examples and Summary

  • Hotel reviews
  • Music pages
  • SEO analysis

About the speaker

Ken Krugler

Scale Unlimited

Strata 2012

Santa Clara, United States

28th February to 1st March 2012

Time 9:00am to 12:30pm PST

Date Tue 28th February 2012


Ballroom E, Santa Clara Convention Center
