This tutorial teaches attendees the key aspects of scalable web mining in six modules:
1. Introduction
- Why web data is valuable
- Key challenges to web crawling
- Realistic definitions for success
2. Focused Web Crawling
- Reducing time & cost by focusing the crawl
- Approaches to classifying and scoring pages
- Solutions for scalable web crawling
3. Structured Data Extraction
- Data mining essentials
- Structured text extraction
- Automated vs. manual extraction
4. Analyzing the Data
- Making it searchable
- Finding "interesting" text
- Machine learning with Mahout
5. Barriers to Success
- Polite crawling vs. deep crawling
- Spam, splogs, honeypots and nasty webmasters
- Ajax, robots.txt and Facebook
6. Examples and Summary
- Hotel reviews
- Music pages
- SEO analysis