Web scraping: Reliably and efficiently pull data from pages that don't expect it

A session at PyCon US 2012

Wednesday 7th March, 2012

1:20pm to 4:40pm (PST)

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.
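To give a flavor of the parsing half of that description, here is a minimal sketch of pulling links out of sloppy markup. It is written for Python 3 and uses only the standard library's `html.parser`, which is an assumption made to keep the example self-contained; the session itself names XPath, CSS selectors, and BeautifulSoup. The names `LinkExtractor` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # If an earlier <a> was never closed, record it before starting anew.
            self._flush()
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def close(self):
        super().close()
        # Record an anchor that was still open when the input ended.
        self._flush()

    def _flush(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []


def extract_links(html):
    """Return a list of (href, text) tuples found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser lowercases tag and attribute names and does not require balanced tags, so input such as `<A HREF='/sponsors'>` or an unclosed `<b>` inside a link is handled without a special case. Anchors that have no `href` are skipped.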

  • Basics of parsing
      • The website is the API
      • HTML is a mess, but we can parse it anyway
      • Why regular expressions are a bad idea
      • Extracting information using XPath, CSS selectors, and the BeautifulSoup API
      • Expect exceptions: How to handle errors
  • Basics of crawling
      • A quick review of HTTP
      • Why cookies are necessary for maintaining a session
      • How servers can track you
      • How to submit forms with mechanize
  • Debugging the web
      • Comparing Firebug and Chrome's DOM Inspector
      • The "Net" tab
      • Using a logging HTTP proxy to record traffic
  • Counter-measures, and how to circumvent them
      • JavaScript
      • Hidden form fields (e.g., Django CSRF)
      • IP address limitations
  • How to cover your scraping code with tests
      • Why you should store snapshotted pages
      • Using mock objects to avoid network I/O
      • Using a fake getPage for Twisted
  • Parallelism
      • A quick tour of different models:
          • Twisted
          • gevent
          • celery
  • Handling JavaScript
      • Automating a full web browser with Selenium RC
      • Running JavaScript within Python using python-spidermonkey
  • Conclusion
      • Use your power for good, not evil.
  • Q&A
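
The testing items in the outline (storing snapshotted pages, using mock objects to avoid network I/O) can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's material: it is written for Python 3 with the standard library's `urllib.request` and `unittest.mock`, the snapshot is inlined as a bytes literal where a real project would load a saved file, and `fetch_title`, `TitleParser`, and the `example.invalid` URL are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser
from unittest import mock


class TitleParser(HTMLParser):
    """Accumulate the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_title(url):
    """Download a page and return its <title> text (the code under test)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# A stored snapshot of the page; inlined here, normally read from a file.
SNAPSHOT = b"<html><head><title> Example schedule </title></head><body></body></html>"


def test_fetch_title_uses_snapshot():
    # Build a stand-in response that behaves like a context manager
    # and returns the snapshot bytes from read().
    fake_response = mock.MagicMock()
    fake_response.read.return_value = SNAPSHOT
    fake_response.__enter__.return_value = fake_response

    # Replace urlopen for the duration of the test so no network I/O happens.
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_open:
        assert fetch_title("http://example.invalid/schedule") == "Example schedule"

    fake_open.assert_called_once_with("http://example.invalid/schedule")
```

The test function can be called directly or collected by a test runner. Because the parsing code only ever sees the snapshot, a failure points at the scraper's logic instead of at a site that happened to change or be unreachable that day.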

About the speaker

Asheesh Laroia
