Surge 2010 schedule

Thursday 30th September 2010

  • Keynote

    by John Allspaw

    At 9:00am to 9:30am, Thursday 30th September

  • Keynote

    by Bryan Cantrill

    At 9:30am to 10:00am, Thursday 30th September

  • Libcloud: a unified interface into the cloud

    by Paul Querna

    What is possible when you can consume compute resources on various hosting providers with nothing more than a Python script? This talk will discuss Apache Libcloud, an Apache Incubator project dedicated to building standard interfaces into cloud computing.

    Apache Libcloud is a unified interface into many popular cloud providers such as Amazon EC2, GoGrid, and Rackspace. Libcloud was created to address the problem that each cloud hosting provider provides a proprietary, slightly different, implementation of their API for accessing resources. Libcloud brings interoperability, visibility, and more importantly scriptability to many standard system administrator tasks.

    In addition to providing APIs to manage servers, Libcloud is working on developing a portable Server Image format, so that identical machines can be created on any cloud provider, without a dependency on configuration management tools.

    At 10:00am to 11:00am, Thursday 30th September

    Coverage video

  • Scalable Design Patterns

    by Theo Schlossnagle

    Building scalable architecture is not rocket science; it's computer science. The tome "Design Patterns" shows us two things: (1) that there are many applicable approaches to solving common programming problems and (2) that people misapply them all the time. In this talk, we'll take a whirlwind tour through different patterns for scalable architecture design and focus on evaluating whether each is the right tool for the job. Topics include load balancing, networking, caching, operations management, code deployment, storage, service decoupling and data management (the RDBMS vs. NoSQL argument).

    At 10:00am to 11:00am, Thursday 30th September

    Coverage video

  • Embracing Concurrency at Scale

    by Justin Sheehy

    We're at Surge because we agree that scalability matters. However, words like "scaling" get thrown around sometimes without discussing the fundamental problems that come along with distributed systems. Some of these problems (such as the CAP theorem) are often referred to without understanding the context that makes them important.

    Justin will focus on methods for designing and building robust fundamentally-concurrent distributed systems. These approaches have been learned through building customer-facing web applications, data storage and processing systems, and server management tools. We will look at practices that are "common knowledge" but too often forgotten, at old lessons that the software industry at large has somehow missed, and at some general "good practices" and rules that must be thrown away when moving into a distributed and concurrent world.

    At 11:00am to 12:00pm, Thursday 30th September

    Coverage video

  • Federated Autonomous Services (FAS), because SOA is sooo Enterprise

    Web Services are often conflated with Service Oriented Architecture (SOA). This talk is about REST, JSON and HTTP, and not about SOAP, WSDL or even XML. It will describe the long path from a monolithic two-and-a-half-tier LAMP architecture to a decoupled, web-service-oriented approach.

    About 70 developers, product owners and scrum masters in 8 Scrum teams had to be coordinated on tasks dealing with daily business, new product features and the "open heart surgery" of transforming Germany's biggest website to this new model without downtime.

    Nowadays Federated Autonomous Services (FAS) have their own dedicated developer and operations teams and their own release cycles. They still serve over 18 billion dynamic HTTP requests a month. They expose their interface via HTTP and have a standardized way to deal with Access Control, Service Configuration and Event Handling. They have no real-time dependencies on any other service, except for infrastructural ones.

    The talk will show how Open Source Software like Nginx, HAProxy, Tornado, memcached or Jetty powers the backbone of the VZ infrastructure. It will also show how one can reduce complexity and cost by moving away from centralized, expensive HA components (e.g. NetApp, HDS) to commodity hardware.

    At 11:00am to 12:00pm, Thursday 30th September

  • The most common MySQL scalability mistakes, and how to avoid them.

    by Ronald Bradford

    The most common mistakes are easy to avoid, yet many startups continue to fall prey to them, with the impact including large redesign costs, delays in new feature releases, lower staff productivity and less-than-ideal ROI. All growing and successful sites need to achieve higher Availability, seamless Scalability and proven Resilience. Know the right MySQL environment to provide a suitable architecture and application design to support these essential needs.

    Details of the presentation include:

    • The different types of accessible data (e.g. R/W, R, none)
    • What limits MySQL availability (e.g. software upgrades, blocking statements, locking, etc.)
    • The three components of scalability - Read Scalability/Write Scalability/Caching
    • Design practices for increasing scalability and not physical resources
    • Disaster is inevitable: having a tested and functional failover strategy
    • When other products are better (e.g. Static files, Session management via Key/Value store)
    • What a lack of accurate monitoring causes
    • What a lack of breakability testing causes
    • What does "No Downtime" mean to your organization
    • Implementing a successful "failed whale" approach with pre-emptive analysis
    • Identifying when MySQL is not your bottleneck

    At 11:00am to 12:00pm, Thursday 30th September

  • Database Scalability Patterns

    by Robert Treat

    We often have clients approach us looking for help in scaling their systems, and all too often their long term vision is a mixed reality based on approaches they have read about on popular blogs that were trying to solve very different problems. Hey, scaling your database can be difficult enough by itself; you don't want to get tripped up by not understanding where you're going. In Database Scalability Patterns we will attempt to distill all of the information/hype/discussions around scaling databases, and break down the common patterns we've seen when scaling databases.

    "Buzzwords" we'll cover (and hopefully debuzz) include:

    • Vertical Scaling
    • Horizontal Partitioning
    • Horizontal Scaling
    • Read Slaves
    • Multi-Master
    • Monitoring
    • Vertical Partitioning
    • Federated Data Storage

    More important than just describing what these things are (although that's a good first step), we'll also discuss along the way the different points in the life-cycle of your database when you need to be thinking about the different options in front of you. We'll factor in the types of application that you're working on (think OLTP vs. OLAP, or Social Networking vs. Corporate Application), the environment you'll be working in (scaling "in the cloud" is very different from DIY in the datacenter), and we will talk about the types of tools you'll need to accomplish these goals (not all replication systems are the same, and some won't help at all).

    At 1:30pm to 2:30pm, Thursday 30th September

    Coverage video

  • Going 0 to 60: Scaling LinkedIn

    by Ruslan Belkin

    Scaling LinkedIn to be the largest professional network in the world. Have you ever wondered what architectures a site like LinkedIn may have used and what insights its teams have learned while growing the system from serving just a handful of users to close to a hundred million? Ruslan will share his experiences after facing many complex challenges through the years of hyper growth and will answer questions about LinkedIn architecture and the way LinkedIn engineers approach building innovative products of the future at scale.

    At 1:30pm to 2:30pm, Thursday 30th September

    Coverage video

  • A Day in the Life of Facebook Operations

    by Tom Cook

    Facebook is now the #2 global website, responsible for billions of photos, conversations, and interactions between people all around the world, running on top of tens of thousands of servers spread across multiple geographically-separated datacenters. When problems arise in the infrastructure behind the scenes, they directly impact the ability of people to connect and share with those they care about around the world.

    Facebook's Technical Operations team has to balance this need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.

    This talk will go into how Facebook is "run" day-to-day, with particular focus on actual tools in use (configuration management systems, monitoring, automation, etc.), how we detect anomalies and respond to them, and the processes we use internally for rapidly pushing out changes while still keeping a handle on site stability.

    At 2:30pm to 3:30pm, Thursday 30th September

  • Scaling and Loadbalancing Wikia Across The World

    by Artur Bergman

    Wikia hosts around 100,000 wikis using the open source MediaWiki software. In this talk I'll take a tour through the process of taking a legacy code base and turning it into a globally distributed system. Wikia runs across 6 datacenters in the US and Europe, with half of them being CDN nodes and half being full datacenters. Traffic is directed to the closest node depending on the traffic situation. In case of degradation, the system switches to a read-only mode. The multiple levels of redundancy and distribution contributed to 99.995% availability to end users.

    Specific issues involve:

    • Varnish - caching and loadbalancing
    • Memcache - implementing cache coherency across distributed datacenters
    • Session management -- using Riak to transparently fail over
    • MySQL replication
    • Filesystem
    • Monitoring
    • Small footprint -- high throughput using SSD based machines
    • MediaWiki
    • Dealing with load spikes like the Lost season finale
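
    As a rough illustration of what the 99.995% availability figure in the abstract implies, the sketch below converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example, not anything taken from the talk.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # 99.995% availability leaves a budget of roughly 26 minutes per year.
    print(round(downtime_minutes_per_year(99.995), 2))
```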

    At 2:30pm to 3:30pm, Thursday 30th September

    Coverage video

  • PHP Performance Checklist

    by Rasmus Lerdorf

    There has been a lot of interest in PHP performance lately, spurred by Facebook's HipHop PHP announcement in February. Most people don't know how fast their site is and will make uninformed architecture decisions or spend time optimizing the wrong things based mostly on myths and innuendo. This talk will try to get you started down the path of a systematic approach to benchmarking, profiling and optimizing your entire web site.

    At 4:00pm to 5:00pm, Thursday 30th September

    Coverage video

  • Working with Dimensional Data in a Distributed Hash Table

    by Mike Malone

    Recently a new class of database technologies has developed offering massively scalable distributed hash table functionality. Relative to more traditional relational database systems, these systems are simple to operate and capable of managing massive data sets. These characteristics come at a cost though: an impoverished query language that, in practice, can handle little more than exact-match lookups at scale.

    This talk will explore the real world technical challenges we faced at SimpleGeo while building a web-scale spatial database on top of Apache Cassandra. Cassandra is a distributed database that falls into the broad category of second-generation systems described above. We chose Cassandra after carefully considering desirable database characteristics based on our prior experiences building large scale web applications. Cassandra offers operational simplicity, decentralized operations, no single points of failure, online load balancing and re-balancing, and linear horizontal scalability.

    Unfortunately, Cassandra fell far short of providing the sort of sophisticated spatial queries we needed. We developed a short term solution that was good enough for most use cases, but far from optimal. Long term, our challenge was to bridge the gap without compromising any of the desirable qualities that led us to choose Cassandra in the first place.

    The result is a robust general purpose mechanism for overlaying sophisticated data structures on top of distributed hash tables. By overlaying a spatial tree, for example, we're able to durably persist massive amounts of spatial data and service complex nearest-neighbor and multidimensional range queries across billions of rows fast enough for an online consumer facing application. We continue to improve and evolve the system, but we're eager to share what we've learned so far.

    At 4:00pm to 5:00pm, Thursday 30th September

    Coverage video

Friday 1st October 2010

  • Enterprise solutions from commodity components: The Promise and the Peril

    by Bryan Cantrill

    The economics of commodity components are undeniable, but they also can suffer from acute reliability problems that introduce new (and often unanticipatable) failure modes. Even in a thoughtful architecture that is putatively designed around unreliable components, these failure modes can have dire consequences, potentially cascading into systemic failure. This talk will dissect some examples of these failures, exploring how the original failing component was able to induce broader failure, how the problem was ultimately understood, and what larger lessons can be drawn from the experience.

    At 9:00am to 10:00am, Friday 1st October

  • Go with the flow - Meditations on network infrastructure analysis

    by Benjamin Black

    Highly scaled distributed web applications are predicated on a functional network, yet organizations rarely have detailed information about the consumption and expense of network resources. This data is essential for effective denial of service detection, intrusion detection, troubleshooting, capacity planning, and traffic engineering, but the time, cost and knowledge required to acquire and analyze the data can be a prohibitive barrier. Most organizations default to reactively analyzing this information after the fact, if at all. The dynamic nature of modern infrastructures can make these challenges even more acute.

    This presentation will investigate representative scenarios that would benefit from detailed understanding of network traffic while outlining principles and tools for gathering and evaluating the data.

    At 9:00am to 10:00am, Friday 1st October

    Coverage video

  • Scaling myYearbook.com - Lessons Learned From Rapid Growth

    by Gavin M. Roy

    myYearbook.com is one of the top 25 most trafficked websites in the United States, experiencing large scale growth over a very short period of time. Employing technologies such as PHP, PostgreSQL, memcached as well as newer cutting edge technologies, myYearbook.com has been able to achieve operational stability in the face of large volumes of traffic. In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.

    At 10:00am to 11:00am, Friday 1st October

  • Don't bet the farm on your cache

    What happens when the one part of your infrastructure that should never go down misbehaves? This is a case study of all the chaos that ensued when the unthinkable happened -- the cache layer went down. CNN.com endures several DDoS attempts every day and, on that particular day, someone got lucky. We will discuss several key factors that inevitably caused an outage of one of the world's most popular news sites from a relatively minor DoS attack, including:

    • Bugs in code
    • Overzealous configurations
    • Lack of real world testing
    • Insufficient monitoring

    We will also discuss the immediate solutions we used to get the site back on the air, as well as the long term fixes to the underlying issues.

    At 11:00am to 12:00pm, Friday 1st October

  • The "Go or No-Go": Operability and Contingency at Etsy

    by John Allspaw

    You've been working on the wicked new feature for a long time. Design is done, the product people love it, and the code's about as polished as it can be. Launching new public-facing features is different than making small changes to existing functionality. I'll talk about the process we have at Etsy (influenced by Flickr's) for making sure that new awesome thing is *operable* and the right attention has been given to contingency planning, on both the technical and human sides.

    At 11:00am to 12:00pm, Friday 1st October

  • Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud

    by Rod Cope

    Hadoop, HBase, and friends are built from the ground up to support Big Data/NoSQL, but that doesn't make them easy. Just like with any other relatively new and complex technologies, there are some rough edges and growing pains to manage. I've learned some hard lessons while deploying HBase tables containing billions of rows and dozens of terabytes on OpenLogic's Hadoop infrastructure. Come to this session to learn about some of the "gotchas" you might run into when deploying Hadoop and HBase in your own private cloud and how to avoid them.

    Here are some general areas we'll explore:

    • Hard-to-find configuration problems and debugging techniques
    • Under-documented yet critical features
    • Deployment recommendations for particular use cases
    • Advice on how to import Big Data
    • Using JRuby/Ruby to make life with Hadoop and HBase easier

    At 11:00am to 12:00pm, Friday 1st October

    Coverage video

  • Design for Scale - Patterns, Anti-Patterns, Successes and Failures

    by Christopher Brown

    This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, and the experiences of others designing large-scale online services. From API to access control, to deployment and configuration, we'll explore the techniques that work, and some that don't, with a critical eye toward your next design.

    At 1:30pm to 2:30pm, Friday 1st October

    Coverage video

  • Quantifying Scalability FTW

    by Neil Gunther

    You probably already collect performance data, but data ain't information. Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decisions. In other words:

    information = measurement + method

    So, measurement alone is only half the story; you need a method to transform your data. In this presentation I will show you a method that I have developed and applied successfully to large-scale web sites and stack applications to quantify the benefits of proposed scaling strategies. To the degree that you don't quantify your scalability, you run the risk of ending up with WTF rather than FTW.
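
    The abstract does not name the method. Assuming it refers to the Universal Scalability Law (USL) that Gunther has published elsewhere, a minimal sketch of the model looks like the following. The parameter values in the usage lines are purely illustrative and not taken from the talk; in practice alpha (contention) and beta (coherency delay) would be fitted to measured throughput data.

```python
def usl_capacity(n, alpha, beta):
    """Relative capacity C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1)).

    n     -- number of processors, nodes or concurrent users (N >= 1)
    alpha -- contention (serialization) coefficient
    beta  -- coherency (crosstalk) coefficient
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def peak_load(alpha, beta):
    """Load N* at which capacity peaks: sqrt((1 - alpha) / beta)."""
    if beta <= 0:
        # Without a coherency penalty the model never turns downward.
        return float("inf")
    return ((1 - alpha) / beta) ** 0.5


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # hypothetical fitted coefficients
    for n in (1, 8, 31, 60):
        print(n, round(usl_capacity(n, alpha, beta), 2))
    print("peak near N =", round(peak_load(alpha, beta), 1))
```

    With a nonzero beta, capacity rises, peaks, and then declines as load grows, which is the kind of cost-benefit signal the abstract argues raw measurements alone do not provide.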

    At 1:30pm to 2:30pm, Friday 1st October

    Coverage video

  • Anycast Routing: Local Delivery

    by Tom Daly

    Anycast Routing is used on the Internet to provide many services, including NTP and DNS, but very few know that you can locally deliver websites and content over HTTP/TCP/Anycast. There are many factors that go into designing an anycasted network, including:

    • Site and Carrier Selection — why both are important
    • Routing Protocol Design and BGP Policy
    • Load Balancing in Datacenters without Load Balancers
    • Application Design, State Management, specifics for TCP applications
    • Statistics Collection, Reporting, and Monitoring (Internally and Externally)
    • Distributed Denial of Service Attacks and Anycast Benefits and Risks

    We'll discuss a real world event on Dyn Inc's network which caused a severe service degradation for one of our nameservers due to uncontrolled anycast route propagation, where global traffic landed in our Tokyo datacenter. (failure)

    We'll also depict how live DDoS attacks are contained to their source region based upon anycast routing. (success)

    At 2:30pm to 3:30pm, Friday 1st October

    Coverage video

  • From disaster to stability: scaling challenges of my.opera.com

    by Cosimo Streppone

    My Opera started around 2002 as a hacked version of phpBB. By 2007, it was slowly heading for disaster, with severely overloaded databases and backends. Our (back then) million users were just as frustrated as we were.

    Today, we have 5M+ users and growing, a lot more features, APIs, browser integration services, and the site is stable and fast. This talk tells the story of these last 3 years: our successes, our failures, and what remains to be done.

    At 2:30pm to 3:30pm, Friday 1st October

    Coverage video

  • Availability, the Cloud and Everything

    by Joe Williams

    The talk will focus on how I (with the help of the entire Cloudant team) built our database service based on CouchDB on top of EC2. Specifically, it covers how we use Erlang, Chef, EC2 and other tools to build highly available and performant database clusters. This includes using Chef and Erlang's hot code upgrades to automate cluster-wide upgrades without restarting any services.

    At 4:00pm to 5:00pm, Friday 1st October

    Coverage video

  • Why Some Architects Almost Never Shard Their Applications

    by Baron Schwartz

    "Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded. Sharding is a strategy that should be understood in its context: as one of the many legitimate choices. This session covers a spectrum of strategies for scaling an application. It gives special coverage to topics that typically force sharding, such as write workload, choice of database technology, and choice of deployment platform. You'll learn the pros and cons of various strategies, and how to avoid the pitfalls and capitalize on the upsides.

    At 4:00pm to 5:00pm, Friday 1st October

    Coverage video

  • Plenary Keynote - A Scalability Call to Action

    by Theo Schlossnagle

    At 5:00pm to 5:30pm, Friday 1st October

    Coverage video

Unscheduled