Sessions at Strata 2012 about Scaling

Tuesday 28th February 2012

  • Large scale web mining

    by Ken Krugler

    This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

    1. Introduction

    • Why web data is valuable
    • Key challenges to web crawling
    • Realistic definitions for success

    2. Focused Web Crawling

    • Reducing time & cost by focusing the crawl
    • Approaches to classifying and scoring pages
    • Solutions for scalable web crawling

    3. Structured Data Extraction

    • Data mining essentials
    • Structured text extraction
    • Automated vs. manual extraction

    4. Analyzing the Data

    • Making it searchable
    • Finding "interesting" text
    • Machine learning with Mahout

    5. Barriers to Success

    • Polite crawling versus deep crawling
    • Spam, splogs, honeypots and nasty webmasters
    • Ajax, robots.txt and Facebook

    6. Examples and Summary

    • Hotel reviews
    • Music pages
    • SEO analysis

    At 9:00am to 12:30pm, Tuesday 28th February

    In Ballroom E, Santa Clara Convention Center

Wednesday 29th February 2012

  • Building a Data Strategy: Data Enabling Toys at Leapfrog

    by Larry Murdock

    In 2007 Leapfrog embarked on the Learning Path project to enable its learning toys to upload play logs to Leapfrog, helping parents understand what and how their children learn from their toys. For Leapfrog, this would bolster its position as the educational toy leader and innovator, create opportunities to understand customers better, and provide valuable information about product use for product lifecycle planning.

    This talk will present the strategy and business opportunities as they were planned, then discuss the challenges of implementation. In 2007 we looked at MapReduce as a solution to potentially large data volumes but settled on Oracle RAC for its reporting flexibility and product maturity. We faced demand estimation issues, SLA challenges, metadata and data management issues, data quality issues, and then the killer… our data collection from our users was not passive.

    At 1:30pm to 2:10pm, Wednesday 29th February

    In Mission City B4, Santa Clara Convention Center