Sessions at Strata 2012 about Data Mining and Web Crawling on Tuesday 28th February

  • Large scale web mining

    by Ken Krugler

    This tutorial covers the key aspects of scalable web mining in six modules:

    1. Introduction

    • Why web data is valuable
    • Key challenges to web crawling
    • Realistic definitions for success

    2. Focused Web Crawling

    • Reducing time & cost by focusing the crawl
    • Approaches to classifying and scoring pages
    • Solutions for scalable web crawling

    3. Structured Data Extraction

    • Data mining essentials
    • Structured text extraction
    • Automated vs. manual extraction

    4. Analyzing the Data

    • Making it searchable
    • Finding "interesting" text
    • Machine learning with Mahout

    5. Barriers to Success

    • Polite crawling versus deep crawling
    • Spam, splogs, honeypots and nasty webmasters
    • Ajax, robots.txt and Facebook

    6. Examples and Summary

    • Hotel reviews
    • Music pages
    • SEO analysis

    From 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom E, Santa Clara Convention Center
