Sessions at PyCon US 2012 on Wednesday 7th March

  • Bayesian statistics made (as) simple (as possible)

    by Allen Downey

    This tutorial is an introduction to Bayesian statistics using Python. My goal is to help participants understand the concepts and solve real problems. We will use material from my book, Think Stats: Probability and Statistics for Programmers (O’Reilly Media).

    Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start.

    I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems.

    Students should have at least basic level Python and basic statistics. If you learned about Bayes’s Theorem and probability distributions at some time, that’s enough, even if you don’t remember it! Students should be comfortable with logarithms and plotting data on a log scale.

    Students should bring a laptop with Python 2.x and matplotlib. You can work in any environment; you just need to be able to download a Python program and run it.

    Outline:

    • Bayes’s theorem.
    • Representing probability distributions.
    • Bayesian estimation.
    • Biased coins and student test scores.
    • Censored data.
    • The locomotive / German tank problem.
    • Hierarchical models and the hidden species problem.
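
    To make the outline concrete, here is a minimal sketch of Bayesian estimation applied to the locomotive problem: a railroad numbers its locomotives 1 to N, one locomotive is observed, and we want a posterior distribution over N. It is not taken from the tutorial materials. The uniform prior over 1 to 1000, the observed number 60, and all function names are assumptions chosen for illustration.

```python
def normalize(dist):
    """Scale the values of a {hypothesis: weight} dict so they sum to 1."""
    total = sum(dist.values())
    return {h: w / total for h, w in dist.items()}


def update(prior, likelihood, data):
    """Bayes's theorem: the posterior is proportional to prior times likelihood."""
    unnormalized = {h: p * likelihood(data, h) for h, p in prior.items()}
    return normalize(unnormalized)


def locomotive_likelihood(observed, n):
    """P(seeing number `observed` | the railroad owns n locomotives)."""
    return 1.0 / n if observed <= n else 0.0


# Assumed prior: every fleet size from 1 to 1000 is equally likely.
prior = normalize({n: 1.0 for n in range(1, 1001)})
posterior = update(prior, locomotive_likelihood, 60)

posterior_mean = sum(n * p for n, p in posterior.items())
most_likely = max(posterior, key=posterior.get)
print(most_likely, round(posterior_mean))
```

    The most likely fleet size equals the observed number, while the posterior mean is considerably larger; both depend on the assumed upper bound of the prior.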

    At 9:00am to 12:20pm, Wednesday 7th March

    In D2, Santa Clara Convention Center

    Coverage video

  • Faster Python Programs through Optimization

    by Mike Müller

    This tutorial provides an overview of techniques to improve the performance of Python programs. The focus is on concepts such as profiling and the differences between data structures and algorithms, as well as a selection of tools and libraries that help to speed up Python.

    Objective

    This tutorial provides an overview of techniques to improve the performance of Python programs. The focus is on concepts such as profiling and the differences between data structures and algorithms, as well as a selection of tools and libraries that help to speed up Python.

    Intended Audience

    Python programmers who would like to learn concepts for improving performance.

    Audience Level

    Programmers with good Python knowledge.

    Prerequisites

    Please bring your laptop with the operating system of your choice (Linux, Mac OS X, Windows). In addition to Python 2.6 or 2.7, we need:

    Method

    This is a hands-on course. Students are strongly encouraged to work along with the trainer at the interactive prompt. There will be exercises the students need to do on their own. Experience shows that this active involvement is essential for effective learning.

    Outline

    • How fast is fast enough? (10 min)
    • Optimization guidelines (10 min)
      • Premature optimization
      • Optimization rules
      • Seven steps for incremental optimization
    • Optimization strategy (30 min)
      • Measuring in stones
      • Profiling CPU usage
      • Profiling memory usage
    • Algorithms and Anti-patterns (40 min)
      • String concatenation
      • List and generator comprehensions
      • The right data structure
      • Caching
    • The example (5 min)
    • Testing speed (5 min)
    • Pure Python (15 min)
    • Meet Psyco, the JIT (5 min)
    • Using PyPy (15 min)
    • NumPy for numeric arrays (10 min)
    • Using multiple CPUs with multiprocessing (20 min)
    • Combination of optimization strategies (10 min)
    • Results of different example implementations (5 min)
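
    Two of the outline topics, string concatenation and caching, can be illustrated with a short sketch. It is not the tutorial's own example: the function names and the test data are assumptions, and it is written for Python 3 (the tutorial itself targets Python 2.6 or 2.7, where functools.lru_cache is not available).

```python
import timeit
from functools import lru_cache


def concat_plus(words):
    """Build a string with repeated +=, which may copy the string each time."""
    result = ""
    for word in words:
        result += word
    return result


def concat_join(words):
    """Build the same string with a single str.join call."""
    return "".join(words)


@lru_cache(maxsize=None)
def fib(n):
    """Naive recursive Fibonacci; the cache removes the repeated work."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


words = ["spam"] * 10000
assert concat_plus(words) == concat_join(words)

# Measure before optimizing; absolute numbers depend on machine and interpreter.
for func in (concat_plus, concat_join):
    seconds = timeit.timeit(lambda: func(words), number=100)
    print(func.__name__, seconds)

print(fib(80))
```

    The timings are printed rather than asserted because they vary between interpreters; measuring on the target system is the point of the exercise.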

    At 9:00am to 12:20pm, Wednesday 7th March

    In D1, Santa Clara Convention Center

    Coverage video

  • Hands-on Beginning Python

    by Matt Harrison

    We're going to mesh TDD, a desire to learn Python and Brazilian BBQ. Bring your laptop with Python 2.x installed (3.x differences will be noted). This is hands-on! You will program! It is assumed that you know how to program but perhaps not in Python. You start hungry and leave stuffed. We assume you know nothing and will stuff you with enough Python to be dangerous.

    The tutorial works like this: There's a short presentation. A short testcase for you to complete. Rinse/repeat until we run out of time. Hopefully you'll walk away from this tutorial knowing how to write Python programs.

    Course will cover:

    • REPL
    • Types
    • Mutable/Immutable
    • Getting help
    • Lists
    • Dictionaries
    • Functions
    • Whitespace
    • Conditionals & booleans
    • Iteration
    • Slicing
    • I/O
    • Classes
    • Exceptions
    • Packaging and layout

    There are short testcases to allow participants to practice concepts.

    All participants will receive an ebook modeled on the tutorial, slides, a handout and assignments, as well as prizes for completion of assignments.

    At 9:00am to 12:20pm, Wednesday 7th March

    In F1, Santa Clara Convention Center

  • Introduction to Django

    by Chander Ganesan

    The Django framework is a fast, flexible, easy to learn, and easy to use framework for designing and deploying web sites and services using Python. In this session, we'll cover the fundamentals of development with Django, generate a Django data model, and put together a simple web site using the framework.

    Detailed Tutorial Outline:

    • Django Overview and Basic Introduction
    • Downloading & Installing Django
    • Creating a new project
    • Choosing a database
    • Creating a new application
    • Installing & Using Django contrib applications
    • Overview of Django flow (i.e., URLconf expression, view function, HTTPResponse object, etc.)
    • Generating Simple Django Views
    • Configuring a URLConf for basic views
    • Creating Django Templates (template syntax, common filters and tags, loops, etc)
    • Creating & using Template Context objects
    • Introduction to Django Models
    • Defining basic Django models
    • Understanding basic model fields & options
    • Generating & Reviewing Model SQL
    • Adding data to a model
    • Simple data retrieval using models
    • Working with QuerySets (filters, slicing, ordering, common methods)
    • Overview of Q objects
    • Using the Admin interface
    • Using Generic views
    • Access control with sessions & users

    At 9:00am to 12:20pm, Wednesday 7th March

    In H3, Santa Clara Convention Center

  • Introduction to Event Driven Programming Using Twisted

    by Jean-Paul Calderone

    This tutorial introduces programmers with basic Python skills to the concepts and techniques of event driven programming. The focus is on understanding an event loop and handling the events related to TCP connections. Twisted is introduced as a re-usable event loop implementation and the abstract concepts of event driven programming are related to specific uses of the Twisted library.

    • What is event driven programming
    • What is it an alternative to
    • What are its advantages
    • How does an event loop work
    • Build one step by step to demonstrate
    • Demonstrate a server which can handle many clients
    • Demonstrate a client which can run in the same event loop
    • Demonstrate timed events in the event loop
    • How are event handlers connected to form a program
    • Callback functions
    • Deferreds
    • Generator tricks - inlineCallbacks
    • Coroutines - stackless, corotwine
    • More
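
    As a rough illustration of how an event loop works, the sketch below builds a tiny loop from the standard library only. It is not Twisted's reactor and not the tutorial's code: the EventLoop class and its method names are hypothetical, and a socket pair stands in for a real TCP connection.

```python
import heapq
import selectors
import socket
import time


class EventLoop:
    """A tiny event loop: readable-socket callbacks plus timed callbacks."""

    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.timers = []  # heap of (deadline, sequence, callback)
        self.sequence = 0
        self.running = False

    def add_reader(self, sock, callback):
        self.selector.register(sock, selectors.EVENT_READ, callback)

    def call_later(self, delay, callback):
        self.sequence += 1
        deadline = time.monotonic() + delay
        heapq.heappush(self.timers, (deadline, self.sequence, callback))

    def stop(self):
        self.running = False

    def run(self):
        self.running = True
        while self.running:
            # Wait for I/O, but no longer than until the next timer is due.
            timeout = None
            if self.timers:
                timeout = max(0.0, self.timers[0][0] - time.monotonic())
            for key, _ in self.selector.select(timeout):
                key.data(key.fileobj)  # dispatch the I/O event
            while self.timers and self.timers[0][0] <= time.monotonic():
                _, _, callback = heapq.heappop(self.timers)
                callback()  # dispatch the timed event


loop = EventLoop()
received = []
left, right = socket.socketpair()


def on_readable(sock):
    received.append(sock.recv(1024))
    loop.stop()


loop.add_reader(right, on_readable)
loop.call_later(0.01, lambda: left.sendall(b"hello"))
loop.run()

loop.selector.close()
left.close()
right.close()
print(received)
```

    The program never blocks on a single connection: the loop waits for whichever event comes next and calls the handler registered for it, which is the idea that callbacks and Deferreds build on.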

    At 9:00am to 12:20pm, Wednesday 7th March

    In D3, Santa Clara Convention Center

    Coverage slide deck

  • IPython in-depth: high-productivity interactive and parallel python

    by Brian E. Granger, Min Ragan-Kelley and Fernando Pérez

    IPython provides tools for interactive and parallel computing that are widely used in scientific computing, but can benefit any Python developer. We will show how to use IPython in different ways, as: an interactive shell, an embedded shell, a graphical console, a network-aware VM in GUIs, a web-based notebook with code, graphics and rich HTML, and a high-level framework for parallel computing.

    IPython started as a better interactive Python interpreter in 2001, but over the last decade it has grown into a rich and powerful set of interlocking tools aimed at maximizing developer productivity with Python while using the language interactively.

    Today, IPython consists of a kernel that executes the user code and controls the user's namespace, and a collection of tools to control this kernel either in-process or out-of-process thanks to a well-specified communications protocol implemented over ZeroMQ. The kernel can do much more than execute user code, including introspection of objects in the user's namespace, detailed error reporting with rich tracebacks, history logging of inputs and outputs with an SQLite backend, a user-extensible system of commands for interactive control that don't collide with user variables, and much more.
    Our communications architecture allows these same features to be accessed via a variety of clients, each providing unique functionality tuned to a specific use case. We expose a number of directly usable applications:

    An interactive, terminal-based shell with many capabilities far beyond the default Python interactive interpreter (this is the default application opened by the ipython command that most users are familiar with).

    A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes (even segfaults) of the kernel process, and more. A user-based review of some of these features can be found here.

    A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook controls the same kernel as the other two applications, but instead of offering a linear, terminal-like workflow, it presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc. This model is a kind of literate programming environment popular in scientific computing and pioneered by the Mathematica system, that allows for the creation of rich documents that combine computational experimentation and results with other explanatory elements. A detailed review of this system can be found here.

    A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over ZeroMQ, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays). These engines can be controlled interactively while developing and doing exploratory work, or can run in batch mode either on a local machine or in a large cluster/supercomputing environment via a batch scheduler.

    In this hands-on, in-depth tutorial, we will briefly describe IPython's architecture and will then show how to use and configure each of the above components. We will also discuss how to use the underlying IPython libraries in your own application to provide interactive control.

    An outline of the tutorial follows:

    • Introductory description of the project and architecture.
    • IPython basics: the magic command system, shell aliases, full shell access, the history system, variable caching, object introspection tools.
    • Development workflow: combining the interpreter session with python files via the %run command.
    • Effective use of IPython at the command-line for typical development tasks: timing, profiling, debugging.
    • Embedding IPython in terminal applications.
    • The IPython Qt console: unique features beyond the terminal.
    • Embedding an IPython kernel in a GUI app to expose network-based interactive control.
    • Configuring IPython: the profile and configuration system for multiple applications.
    • The IPython notebook: interactive usage of the application, the IPython display protocol, defining custom display methods for your own objects, generating HTML and PDF output.
    • Parallelism with IPython: basic architecture, interactive control of a cluster, standalone execution of applications, integration with MPI, blocking and asynchronous parallelism, execution in batch-controlled environments, IPython engines in the cloud (illustrated with Amazon EC2 instances).

    A short listing of other features not covered in this tutorial, as guidance for users to later learn about on their own.
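
    The outline item on defining custom display methods can be sketched without IPython installed. Rich frontends such as the notebook look for a `_repr_html_` method on the displayed object; the Temperature class below is a hypothetical example, not part of the tutorial.

```python
class Temperature:
    """Hypothetical value object with a plain-text and an HTML representation."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __repr__(self):
        # Used by any Python shell, including the terminal.
        return "Temperature({!r})".format(self.celsius)

    def _repr_html_(self):
        # Rich frontends such as the notebook call this method when it exists.
        return "<b>{:.1f}</b> &deg;C".format(self.celsius)


reading = Temperature(21.456)
print(repr(reading))
print(reading._repr_html_())
```

    In a terminal session only the plain repr is shown; a frontend that understands HTML can pick the richer form from the same object.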

    For full details about IPython including documentation, previous presentations and videos of talks, please see the project website.

    At 9:00am to 12:20pm, Wednesday 7th March

    In F2, Santa Clara Convention Center

  • SQL for Python Developers

    by Brandon Rhodes

    Relational databases are often the bread-and-butter of large-scale data storage, yet they are often poorly understood by Python programmers. Organizations even split programmers into SQL and front-end teams, each of which jealously guards its turf. This tutorial will take what you already know about Python programming and advance into a new realm: SQL programming and database design.

    The class will consist of six 25-minute lessons, each of which features a 10-minute lecture, 10 minutes of interesting exercises, and a 5-minute wrap-up in which the instructor recaps the exercises by giving his own answers. The focus will be on keeping things simple so that each building block is grasped clearly. The six lessons will be laid out something like this:

    1. Tables, INSERT, and SELECT.

    • Create a simple sqlite3 table with the DB-API interface provided by the Python Standard Library.
    • Use INSERT to fill the table with data.
    • Concatenate INSERT statements to increase the speed and reduce the number of database round-trips required during a bulk data load.
    • Read back table rows with SELECT.
    • Add dynamic expressions to the rows returned by SELECT.
    • Quote values correctly to avoid SQL injection attacks.
    • Avoid “gotcha” differences between Python and SQL data types, with particular attention to Unicode, date-times, and the behavior of NULL versus None.
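
    A minimal sketch of the first lesson's topics, using the standard library's sqlite3 module with an in-memory database. The table and its rows are made up for illustration, and the bulk load uses executemany with parameters rather than concatenated INSERT statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speaker (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# executemany runs the same parameterized statement once per row.
rows = [("Ada", "London"), ("Grace", None), ("O'Brien", "Dublin")]
conn.executemany("INSERT INTO speaker (name, city) VALUES (?, ?)", rows)

# The ? placeholder binds the value, so the apostrophe cannot break the SQL.
found = conn.execute(
    "SELECT id, name FROM speaker WHERE name = ?", ("O'Brien",)
).fetchall()

# A dynamic expression in the SELECT list.
shouted = conn.execute("SELECT upper(name) FROM speaker ORDER BY id").fetchall()

# Python None is stored as NULL, and NULL never matches an "=" comparison.
with_equals = conn.execute(
    "SELECT count(*) FROM speaker WHERE city = NULL"
).fetchone()[0]
with_is = conn.execute(
    "SELECT count(*) FROM speaker WHERE city IS NULL"
).fetchone()[0]

print(found, shouted, with_equals, with_is)
conn.close()
```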

    2. WHERE and the importance of being indexed.

    • Run quick performance checks that demonstrate that WHERE usually requires the entire table to be read into memory and scanned.
    • Add a simple index to shortcut specific WHERE clauses and return their results more quickly.
    • Check whether an index is being used, and learn several reasons why apparently useful indexes get ignored by the database.
    • Add aggregate indexes that yield performance increases for very specific WHERE clauses.
    • Investigate how our data distribution (for example, whether a particular column has thousands of different values, or merely thousands of instances of a handful of values) can impact the wisdom and performance of various query plans.
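
    One way to check whether an index is being used is SQLite's EXPLAIN QUERY PLAN. The sketch below assumes SQLite through the sqlite3 module; the talk table, the index name and the plan helper are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE talk (id INTEGER PRIMARY KEY, room TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO talk (room, title) VALUES (?, ?)",
    [("D%d" % (i % 50), "Talk %d" % i) for i in range(1000)],
)

query = "SELECT title FROM talk WHERE room = ?"


def plan(sql, params):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the detail text


before = plan(query, ("D7",))
conn.execute("CREATE INDEX talk_room ON talk (room)")
after = plan(query, ("D7",))

print(before)  # a full scan of the table
print(after)   # a search that names the new index
conn.close()
```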

    3. FOREIGN KEY and JOIN

    • Use a foreign key to relate rows in one table with rows in another.
    • Add JOIN clauses to a SELECT statement to assemble query-result rows that are built from pieces of several tables.
    • Diagnose performance problems with JOIN by observing the cost of full N×M scans that compare every row from one table with every row from another.
    • Think about the indexes that a query plan could take advantage of behind the scenes.
    • Create indexes that let the database take shortcuts when doing common JOINs.
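
    The foreign key and JOIN topics can be sketched with two small tables. The schema is an assumption made for the example (the room and session names are borrowed from this schedule). Note that SQLite enforces foreign keys only after the PRAGMA shown below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE room (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE session (
        id INTEGER PRIMARY KEY,
        title TEXT,
        room_id INTEGER REFERENCES room (id)
    );
    CREATE INDEX session_room ON session (room_id);
""")
conn.executemany("INSERT INTO room (id, name) VALUES (?, ?)", [(1, "D1"), (2, "H2")])
conn.executemany(
    "INSERT INTO session (title, room_id) VALUES (?, ?)",
    [("Django in Depth", 1), ("SQL for Python Developers", 2)],
)

# Each result row is assembled from pieces of both tables.
joined = conn.execute("""
    SELECT session.title, room.name
    FROM session JOIN room ON room.id = session.room_id
    ORDER BY session.title
""").fetchall()

# A row that points at a missing room is rejected by the foreign key.
try:
    conn.execute("INSERT INTO session (title, room_id) VALUES (?, ?)", ("Ghost", 99))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(joined, rejected)
conn.close()
```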

    4. Post-processing.

    • Use ORDER BY to control the rows which are returned first by a given query.
    • Combine OFFSET and LIMIT to return "paged" results suitable for displaying on a limited display, like a web page or GUI window.
    • Observe how indexes affect the performance of ORDER BY / LIMIT.
    • Use GROUP BY to support aggregate operations such as sums, averages, maxima, and minima.
    • Filter aggregate results with the HAVING clause.

    The exercises will present small Python scripts that post-process data, and ask students to write the equivalent GROUP BY / HAVING expressions to remove the need for the Python post-processing.
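
    A hypothetical exercise of that kind might look like the following sketch: the same totals are computed first by post-processing rows in Python and then by an equivalent GROUP BY / HAVING query. The sale table and the threshold of 20 are invented for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [("east", 10), ("east", 25), ("west", 5), ("north", 40), ("west", 3)],
)

# Python post-processing: fetch every row, then total and filter in the client.
totals = defaultdict(int)
for region, amount in conn.execute("SELECT region, amount FROM sale"):
    totals[region] += amount
in_python = sorted((r, t) for r, t in totals.items() if t > 20)

# Equivalent GROUP BY / HAVING: the database totals and filters.
in_sql = conn.execute("""
    SELECT region, sum(amount) AS total
    FROM sale
    GROUP BY region
    HAVING sum(amount) > 20
    ORDER BY region
""").fetchall()

print(in_python, in_sql)
conn.close()
```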

    5. Modifying tables.

    • Write WHERE clauses for UPDATE and DELETE using the same patterns already learned for SELECT.
    • Use transactions in combination with UPDATE and DELETE to prevent inconsistent database states from becoming visible to other clients.
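
    A small sketch of the transaction idea, again assuming sqlite3: two UPDATE statements either both take effect or are both rolled back. The account table and the transfer function are hypothetical, and a single in-memory connection cannot show visibility to other clients, only the all-or-nothing behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(amount, source, target):
    """Move money between accounts; both UPDATEs commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30, "alice", "bob")
failed = transfer(500, "alice", "bob")  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
conn.close()
```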

    6. ORMs, Objects, and Tables.

    • Create tables of objects using the SQLAlchemy declarative schema in combination with classes.
    • Understand the main differences between SQLAlchemy and the Django ORM, including the idea of explicit saves versus a unit-of-work pattern.
    • See how ORM query syntaxes mix down to SQL statements.
    • Determine when an ORM will be helpful, versus when straight SQL might be a better solution for a particular problem.

    Of course, mastery of these topics cannot be conveyed in a single three-hour course! The tutorial will have succeeded if students learn the main moving parts that are involved in a relationally-backed Python application, if they have gotten some practice with SQL and the kind of tasks that it seeks to simplify, and if they have a foundation upon which to build when they are next faced with writing or modifying Python code that interfaces with a SQL database.

    At 9:00am to 12:20pm, Wednesday 7th March

    In H2, Santa Clara Convention Center

    Coverage video

  • Writing a Pyramid application

    by Carlos de la Guardia

    Pyramid is the web framework at the core of the Pylons Project. It's a "pay only for what you eat" framework. You can get started easily and learn new concepts as you go, and only if you need them. It's simple, well tested, well documented, and fast. This course will present Pyramid and lead you through the creation of an application as the concepts from the framework are introduced.

    Though it’s in part inspired by Zope and uses concepts and software that may be familiar to Zope programmers, no prior Zope experience is required to use it. Also, unlike Zope, you don’t need to understand many concepts and technologies fully before you can be truly productive.

    Pyramid is also inspired by Django and Pylons. It tries to learn valuable lessons from things that have gone well with different web frameworks and give the user great flexibility in applying them.

    This course will present Pyramid and lead you through the creation of an application as the concepts from the framework are introduced. The extensive Pyramid documentation will be used as the “textbook”.

    Proposed outline:

    • Installation
    • Scaffolds
    • Persistence options
    • URL dispatch
    • Views
    • View configuration
    • Renderers
    • Static views
    • Security
    • Declarative configuration
    • Testing
    • Deployment

    At 9:00am to 12:20pm, Wednesday 7th March

    In H1, Santa Clara Convention Center

    Coverage video

  • Data analysis in Python with pandas

    by Wes McKinney

    The tutorial will give a hands-on introduction to manipulating and analyzing large and small structured data sets in Python using the pandas library. While the focus will be on learning the nuts and bolts of the library's features, I also aim to demonstrate a different way of thinking regarding structuring data in memory for manipulation and analysis.

    The tutorial will teach the mechanics of the most important features of pandas. It will be focused on the nuts and bolts of the two main data structures, Series (1D) and DataFrame (2D), as they relate to a variety of common data handling problems in Python. The tutorial will be supplemented by a collection of scripts and example data sets for the users to run while following along with the material. As such, a significant part of the tutorial will be spent doing interactive data exploration and working through examples from within the IPython console.

    The tutorial will also teach participants best practices for structuring data in memory and the do's and don'ts of high performance computing with large data sets in Python. For participants who have never used IPython, this will also provide a gentle introduction to interactive scientific computing with IPython.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In D2, Santa Clara Convention Center

    Coverage video

  • Django in Depth

    by James Bennett

    A tutorial that goes beyond all other Django tutorials; we'll dive deep into the guts of the framework, and learn how each commonly-used component -- ORM, templates, HTTP handling, views and the admin -- works from the bottom up, covering both public and internal APIs in excruciating detail.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In D1, Santa Clara Convention Center

  • Graph Analysis from the Ground Up

    by Van Lindberg

    Graphs are a fundamental datatype - but typical developers don't get as much exposure to using and working with graphs as with other datatypes like tables and queues. This is a from-the-ground up working session; by the end, attendees should have the tools and experience to model and analyze problems with graphs.

    This tutorial is intended to bring somebody with Python experience but limited or no experience using graph-based algorithms to a place where they:

    • Understand the basics of graph theory and why it can be helpful;
    • Are familiar with the available tools for dealing with graphs;
    • Recognize how to model a problem in terms of a graph; and
    • Have a first hands-on experience applying the theory and the tools to solve an interesting real-world problem.

    To do this, the tutorial is divided into four sections, each corresponding to one of the objectives above. Each portion will have a hands-on exercise pertaining to the exact subject, with part 4 as a crowning workshop bringing together various skills and points raised throughout the session; after having a few minutes to work on their own code and ask questions, the class as a whole will walk through a solution.
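
    As a small taste of modeling a problem as a graph, the sketch below stores an undirected graph as an adjacency mapping and finds a shortest path with breadth-first search. The graph and the function name are invented for illustration and are not taken from the tutorial.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency mapping: node -> set of neighbours.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
    "F": set(),
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    previous = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):  # sorted for a repeatable result
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


print(shortest_path(graph, "A", "E"))
print(shortest_path(graph, "A", "F"))
```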

    At 1:20pm to 4:40pm, Wednesday 7th March

    In D3, Santa Clara Convention Center

  • Hands-on Intermediate Python

    by Matt Harrison

    Are you new to Python and want to learn how to step it up to the next level? Have you wondered about functional programming, closures, decorators, context managers, generators or list comprehensions and when you should use them and how to test them? This hands-on tutorial will cover these intermediate subjects in detail, by explaining the theory behind them then walking through examples.

    The tutorial will alternate short lectures with short assignments to practice the concepts.

    Attendees should come with a laptop and Python 2.x (3.x differences will be noted).

    Tutorial will cover:

    • Testing (unittest and doctest)
    • Functional Programming
    • Functions
    • Closures
    • Decorators
    • Class decorators
    • Properties
    • Context Managers
    • List comprehensions
    • Iterator pattern
    • Generators
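
    Several of the listed topics fit into one short sketch: a decorator whose wrapper is a closure, a generator, a context manager and a list comprehension. The names are made up for illustration, and the code is written for Python 3 rather than the 2.x the tutorial targets.

```python
import functools
from contextlib import contextmanager


def count_calls(func):
    """Decorator: the wrapper is a closure over `func` and keeps a call counter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def square(x):
    return x * x


def evens(limit):
    """Generator: yields even numbers below `limit` one at a time."""
    for n in range(0, limit, 2):
        yield n


@contextmanager
def recorded(log, label):
    """Context manager: records entry and exit, even if the body raises."""
    log.append("enter " + label)
    try:
        yield log
    finally:
        log.append("exit " + label)


log = []
with recorded(log, "demo"):
    squares = [square(n) for n in evens(7)]  # list comprehension over a generator

print(squares, square.calls, log)
```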

    Materials include an ebook covering the material, slides, handout and assignment code. Prizes to be awarded for completion of assignment.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In F1, Santa Clara Convention Center

  • How to get the most out of your PyPy

    by Armin Rigo, Alex Gaynor and Maciej Fijalkowski

    For many applications PyPy can provide performance benefits right out of the box. However, little details can push your application to perform much better. In this tutorial we'll give you insights on how to push PyPy to its limits. We'll focus on understanding the performance characteristics of PyPy, and learning the analysis tools in order to maximize your application's performance.

    We aim to teach people how to use performance tools available for PyPy as well as to understand PyPy's performance characteristics. We'll explain how different parts of PyPy interact (JIT, the garbage collector, the virtual machine runtime) and how to measure which part is eating your time. We'll give you a tour with jitviewer which is a crucial tool for understanding how your Python got compiled to assembler and whether it's performing well. We also plan to show potential pitfalls and usage patterns in the Python language that perform better or worse in the case of PyPy.

    We'll also briefly mention how to get your application running on PyPy and how to avoid common pitfalls there, like reference counting or relying on C modules.

    This tutorial is intended for people familiar with Python who have performance problems; no previous experience with PyPy is needed. We ask people to come with their own problems and we'll provide some example ones. Attendees should have the latest version of PyPy preinstalled on their laptops.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In H1, Santa Clara Convention Center

    Coverage video

  • MongoDB and Python

    by Rick Copeland and Bernie Hackett

    This intermediate-level class will teach you techniques using the popular NoSQL database MongoDB, its driver PyMongo, and the object-document mapper Ming to write maintainable, high-performance, and scalable applications. We will cover everything you need to become an effective Ming/MongoDB developer from basic PyMongo queries to high-level object-document mapping setups in Ming.

    The class will begin with a brief overview of MongoDB and its Python driver PyMongo. We will cover basic operations using PyMongo, including data manipulation, querying, and GridFS. Students will install MongoDB and PyMongo as part of this section.

    We will then describe the design philosophy and setup of Ming, a SQLAlchemy-inspired object-document mapper (ODM) for MongoDB developed at SourceForge.

    Next we will cover the base-level implementation of Ming, including schema design, the session and datastore, lazy migrations, data polymorphism, and GridFS support. We will also cover effective MongoDB index design, querying, and updating techniques, and how to use these with Ming. Students will install Ming as a part of this section, and have exercises covering schema design, lazy migrations, and GridFS.

    The final segment will cover the object-document mapper portion of Ming. We will cover the unit of work design pattern, object relations, ODM-level polymorphism, and how to drop down to the base layer (or even down to pymongo) when you really need to. This section will include exercises in designing your ODM model and effectively using the unit-of-work session.

    This talk targets Python 2.6-2.7 and MongoDB 2.0. Students should have Python 2.6 or 2.7 installed on their machines prior to the class and should be comfortable using virtualenv and pip or easy_install to install packages.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In H2, Santa Clara Convention Center

    Coverage video note

  • The real-time web with co-routines

    by John Anderson

    Learn how to build fast and interactive web applications using a WSGI-compliant web framework and co-routines. Utilizing Redis/ZeroMQ, Socket.IO, and GEvent you will learn how to build a responsive and concurrent web app while maintaining good test coverage.

    We will build a collaborative todo list system that will show you how to utilize Python, Redis/ZeroMQ, Socket.io, and GEvent for real-time communication.

    We will focus on the following topics:

    • Using a wsgi framework to build a RESTful interface and hooking up socket.io to it.
    • Using socketio and redis/zeromq for subscribing to named channels and communicating in real-time over websockets.
    • Gevent
    • Testing your real-time application
    • Deployment and Monitoring

    A rough outline:

    GEvent (30 min)

    Intro to GEvent and co-routines, what it is, how it works, the benefits of using it, and how simple it is to wrap your mind around a co-routine vs the callback methodology of threads.

    ZeroMQ / Redis (30 min)

    Intro to pub/sub, the communication model we'll use for our realtime communication and how to utilize ZeroMQ or Redis to achieve this.
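
    The pub/sub model itself can be shown without either server. The sketch below is an in-process stand-in, not Redis or ZeroMQ code: the Broker class and the channel names are hypothetical, and delivery is a plain function call instead of a network message.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-process stand-in for a pub/sub server."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel name -> callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver the message to every subscriber; return how many received it."""
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])


broker = Broker()
alice_inbox, bob_inbox = [], []
broker.subscribe("todo:groceries", alice_inbox.append)
broker.subscribe("todo:groceries", bob_inbox.append)
broker.subscribe("todo:work", bob_inbox.append)

delivered = broker.publish("todo:groceries", "added: milk")
broker.publish("todo:work", "done: slides")
unheard = broker.publish("todo:empty", "nobody listens")

print(delivered, unheard, alice_inbox, bob_inbox)
```

    Publishers only name a channel and never address a subscriber directly, which is what lets many browser clients follow the same todo list.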

    Socket.IO and Backbone.js (30 min)

    Discussion of Polling, Longpolling, Flash sockets, and web sockets and the benefits of each and how Socket.IO lets us not care. I will discuss how to get socket.io client side working with the server side.

    Testing (15 min)

    We will cover how to architect your co-routine-based applications to make them easily testable and how to utilize mock to make things easier. We will cover the gotchas of testing GEvent-based code.

    Deployment (15 min)

    In the final portion of the tutorial we discuss deployment and how to manage and monitor your WSGI server that is using GEvent-patched libraries. This will cover basics on which webservers to use as well as exception handling.

    We leave 30 minutes of leeway for further discussion of whichever area the students are most interested in, as well as for questions on specific things that they feel I have not covered.

    I will provide the non-realtime code base for the application we will be working on and we will start from there and slowly add in support for the realtime communication as we discuss the technologies needed.

    Most introductions to the web with gevent just show a basic chat application which leaves a lot of questions about where to go next once you are building a real application and I hope to cover all of those in this tutorial.

    At 1:20pm to 4:40pm, Wednesday 7th March

    In F2, Santa Clara Convention Center

  • Tutorial: MongoDB and Python

    by Bernie Hackett and Rick Copeland

    At 1:20pm to 4:40pm, Wednesday 7th March

    Coverage handout

  • Web scraping: Reliably and efficiently pull data from pages that don't expect it

    by Asheesh Laroia

    Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

    • Basics of parsing
    • The website is the API
    • HTML is a mess, but we can parse it anyway
    • Why regular expressions are a bad idea
    • Extracting information, using XPath, CSS selectors, and the BeautifulSoup API
    • Expect exceptions: How to handle errors
    • Basics of crawling
    • A quick review of HTTP
    • Why cookies are necessary for maintaining a session
    • How servers can track you
    • How to submit forms with mechanize
    • Debugging the web
    • Comparing FireBug and Chrome's DOM Inspector
    • The "Net" tab
    • Using a logging HTTP proxy to record traffic
    • Counter-measures, and how to circumvent them
    • JavaScript
    • Hidden form fields (e.g., Django CSRF)
    • CAPTCHAs
    • IP address limitations
    • How to cover your scraping code with tests
    • Why you should store snapshotted pages
    • Using mock objects to avoid network I/O
    • Using a fake getPage for Twisted
    • Parallelism
    • A quick tour of different models:
    • Twisted
    • gevent
    • celery
    • Handling JavaScript
    • Automating a full web browser with Selenium RC
    • Running JavaScript within Python using python-spidermonkey
    • Conclusion
    • Use your power for good, not evil.
    • Q&A
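
    Two ideas from the outline, parsing messy HTML and testing against stored snapshots instead of the network, can be combined in a short sketch. It uses the standard library's html.parser rather than the XPath, CSS selector or BeautifulSoup tools named above; the snapshot, the LinkExtractor class and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A stored snapshot stands in for a downloaded page, so no network I/O is needed.
SNAPSHOT = """
<html><body>
  <p>Schedule <a href="/talks/1">First talk</a>
  <a href='http://example.com/talks/2'>Second talk
  <a>No href here</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect absolute link targets, tolerating unclosed or odd markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links(SNAPSHOT, "http://example.com/index.html"))
```

    Because the function takes the HTML as a string, a test can feed it a saved page and compare the result, with no server involved.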

    At 1:20pm to 4:40pm, Wednesday 7th March

    In H3, Santa Clara Convention Center

    Coverage video