Sessions at PyCon US 2012 about Web Scraping

Wednesday 7th March 2012

  • Web scraping: Reliably and efficiently pull data from pages that don't expect it

    by Asheesh Laroia

    Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

    • Basics of parsing
      • The website is the API
      • HTML is a mess, but we can parse it anyway
      • Why regular expressions are a bad idea
      • Extracting information, using XPath, CSS selectors, and the BeautifulSoup API
      • Expect exceptions: How to handle errors
    • Basics of crawling
      • A quick review of HTTP
      • Why cookies are necessary for maintaining a session
      • How servers can track you
      • How to submit forms with mechanize
    • Debugging the web
      • Comparing FireBug and Chrome's DOM Inspector
      • The "Net" tab
      • Using a logging HTTP proxy to record traffic
    • Counter-measures, and how to circumvent them
      • JavaScript
      • Hidden form fields (e.g., Django CSRF)
      • CAPTCHAs
      • IP address limitations
    • How to cover your scraping code with tests
      • Why you should store snapshotted pages
      • Using mock objects to avoid network I/O
      • Using a fake getPage for Twisted
    • Parallelism
      • A quick tour of different models:
        • Twisted
        • gevent
        • celery
    • Handling JavaScript
      • Automating a full web browser with Selenium RC
      • Running JavaScript within Python using python-spidermonkey
    • Conclusion
      • Use your power for good, not evil.
    • Q&A
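
As a rough illustration of two outline items (parsing messy HTML with a real parser instead of regular expressions, and testing against a stored page snapshot instead of the network), here is a minimal sketch. It is not taken from the tutorial: it swaps in Python's standard-library `html.parser` in place of the BeautifulSoup, XPath, and CSS-selector tools the outline names, and `LinkExtractor`, `extract_links`, and `SNAPSHOT` are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so "<A>" and "<a>" look the same here.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return a list of (href, link text) pairs."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# A hypothetical stored snapshot of a page, deliberately malformed:
# an unclosed <p>, an anchor without href, mixed-case closing tag,
# and a missing </html>.
SNAPSHOT = """
<html><body>
<p>Unclosed paragraph
<a href="/sessions/1">Web scraping</a>
<a>no href here</a>
<a href='/sessions/2'>  Parallelism </A>
</body>
"""

if __name__ == "__main__":
    for href, text in extract_links(SNAPSHOT):
        print(href, text)
```

Because the parser runs on a saved string, a test for it needs no network access, which is the point of keeping snapshotted pages alongside scraping code.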

    From 1:20pm to 4:40pm, Wednesday 7th March

    In H3, Santa Clara Convention Center

    Coverage video
