by John Allspaw
by Paul Querna
What is possible when you can consume compute resources on various hosting providers with nothing more than a Python script? This talk will discuss Apache Libcloud, an Apache Incubator project dedicated to building standard interfaces into cloud computing.
Apache Libcloud is a unified interface into many popular cloud providers such as Amazon EC2, GoGrid, and Rackspace. Libcloud was created to address the problem that each cloud hosting provider provides a proprietary, slightly different, implementation of their API for accessing resources. Libcloud brings interoperability, visibility, and more importantly scriptability to many standard system administrator tasks.
In addition to providing APIs to manage servers, Libcloud is working on developing a portable Server Image format, so that identical machines can be created on any cloud provider, without a dependency on configuration management tools.
Building scalable architecture is not rocket science; it's computer science. The tome "Design Patterns" shows us two things: (1) that there are many applicable approaches to solving common programming problems and (2) that people misapply them all the time. In this talk, we'll take a whirlwind tour through different patterns for scalable architecture design and focus on evaluating whether each is the right tool for the job. Topics include load balancing, networking, caching, operations management, code deployment, storage, service decoupling and data management (the RDBMS vs. noSQL argument).
We're at Surge because we agree that scalability matters. However, words like "scaling" get thrown around sometimes without discussing the fundamental problems that come along with distributed systems. Some of these problems (such as the CAP theorem) are often referred to without understanding the context that makes them important.
Justin will focus on methods for designing and building robust fundamentally-concurrent distributed systems. These approaches have been learned through building customer-facing web applications, data storage and processing systems, and server management tools. We will look at practices that are "common knowledge" but too often forgotten, at old lessons that the software industry at large has somehow missed, and at some general "good practices" and rules that must be thrown away when moving into a distributed and concurrent world.
Web Services are often conflated with Service Oriented Architecture (SOA). This talk is about REST, JSON and HTTP, not about SOAP, WSDL or even XML. It will describe the long path from a monolithic two-and-a-half-tier LAMP architecture to a decoupled, web-service-oriented approach.
About 70 developers, product owners and scrum masters in 8 Scrum teams had to be coordinated on tasks dealing with daily business, new product features and the "open heart surgery" of transforming Germany's biggest website to this new model without downtime.
Nowadays Federated Autonomous Services (FAS) each have their own dedicated developer and operations team and their own release cycle. They still serve over 18 billion dynamic HTTP requests a month. They expose their interface via HTTP and have a standardized way to deal with Access Control, Service Configuration and Event Handling. They have no real-time dependencies on any other service, except for infrastructural ones.
The talk will show how Open Source Software like Nginx, HAProxy, Tornado, memcached or Jetty powers the backbone of the VZ infrastructure. It will also show how one can reduce complexity and cost by moving away from centralized, expensive HA components (such as NetApp or HDS) to commodity hardware.
The most common mistakes are easy to avoid, yet many startups continue to fall prey to them, with the impact including large re-design costs, delays in new feature releases, lower staff productivity and less than ideal ROI. All growing and successful sites need to achieve higher Availability, seamless Scalability and proven Resilience. Know the right MySQL environment to provide a suitable architecture and application design to support these essential needs.
Some details of the presentation would include:
by Robert Treat
We often have clients approach us looking for help in scaling their systems, and all too often their long term vision is a mixed reality based on approaches they have read about on popular blogs that are trying to solve very different problems. Scaling your database can be difficult enough by itself; you don't want to get tripped up by not understanding where you're going. In Database Scalability Patterns we will attempt to distill all of the information/hype/discussions around scaling databases, and break down the common patterns we've seen in scaling databases.
"Buzzwords" we'll cover (and hopefully debuzz) include:
More important than just describing what these things are (although that's a good first step), we'll also discuss along the way different points in the life-cycle of your database when you need to be thinking about the different options in front of you. We'll factor in the types of application that you're working on (think OLTP vs OLAP, or Social Networking vs. Corporate Application), the environment you'll be working in (scaling "in the cloud" is very different from DIY in the datacenter), and we will talk about the types of tools you'll need to accomplish these goals (not all replication systems are the same, and some won't help at all).
Scaling LinkedIn to be the largest professional network in the world. Have you ever wondered what architectures a site like LinkedIn may have used and what insights its teams have gained while growing the system from serving just a handful of users to close to a hundred million? Ruslan will share his experiences after facing many complex challenges through the years of hyper growth and will answer questions about LinkedIn architecture and the way LinkedIn engineers approach building innovative products of the future with scale.
by Tom Cook
Facebook is now the #2 global website, responsible for billions of photos, conversations, and interactions between people all around the world running on top of tens of thousands of servers spread across multiple geographically-separated datacenters. When problems arise in the infrastructure behind the scenes, they directly impact the ability of people to connect and share with those they care about around the world.
Facebook's Technical Operations team has to balance this need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.
This talk will go into how Facebook is "run" day-to-day with particular focus on actual tools in use (configuration management systems, monitoring, automation, etc), how we detect anomalies and respond to them, and the processes we use internally for rapidly pushing out changes while still keeping a handle on site stability.
Wikia hosts around 100,000 wikis using the open source MediaWiki software. In this talk I'll take a tour through the process of taking a legacy code base and turning it into a globally distributed system. Wikia runs across 6 datacenters in the US and Europe, half of them CDN nodes and half full datacenters. Traffic is directed to the closest node depending on the traffic situation. In case of degradation the system switches to read-only mode. The multiple levels of redundancy and distribution have contributed to 99.995% availability for end users.
Specific issues involve:
There has been a lot of interest in PHP performance lately, spurred by Facebook's HipHop PHP announcement in February. Most people don't know how fast their site is and will make uninformed architecture decisions or spend time optimizing the wrong things based mostly on myths and innuendo. This talk will try to get you started down the path of a systematic approach to benchmarking, profiling and optimizing your entire web site.
by Mike Malone
Recently a new class of database technologies has developed offering massively scalable distributed hash table functionality. Relative to more traditional relational database systems, these systems are simple to operate and capable of managing massive data sets. These characteristics come at a cost though: an impoverished query language that, in practice, can handle little more than exact-match lookups at scale.
This talk will explore the real world technical challenges we faced at SimpleGeo while building a web-scale spatial database on top of Apache Cassandra. Cassandra is a distributed database that falls into the broad category of second-generation systems described above. We chose Cassandra after carefully considering desirable database characteristics based on our prior experiences building large scale web applications. Cassandra offers operational simplicity, decentralized operations, no single points of failure, online load balancing and re-balancing, and linear horizontal scalability.
Unfortunately, Cassandra fell far short of providing the sort of sophisticated spatial queries we needed. We developed a short term solution that was good enough for most use cases, but far from optimal. Long term, our challenge was to bridge the gap without compromising any of the desirable qualities that led us to choose Cassandra in the first place.
The result is a robust general purpose mechanism for overlaying sophisticated data structures on top of distributed hash tables. By overlaying a spatial tree, for example, we're able to durably persist massive amounts of spatial data and service complex nearest-neighbor and multidimensional range queries across billions of rows fast enough for an online consumer facing application. We continue to improve and evolve the system, but we're eager to share what we've learned so far.
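As a rough illustration of overlaying a spatial structure on a store that only supports exact-match lookups, the sketch below encodes each point as a quadtree cell path and uses that path as the key. This is not SimpleGeo's implementation, which the abstract does not describe in detail; the function names, the fixed depth, the sample points and the in-memory dict standing in for a distributed hash table are all assumptions.

```python
from collections import defaultdict


def quadtree_key(lat, lon, depth=8):
    """Return a string of quadrant digits (0-3) locating a point.

    Each digit halves the bounding box in both dimensions, so points
    that share a key lie in the same cell at that depth.
    """
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    digits = []
    for _ in range(depth):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:
            quadrant |= 2      # northern half
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:
            quadrant |= 1      # eastern half
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(quadrant))
    return "".join(digits)


# A plain dict stands in for the distributed hash table (assumption).
store = defaultdict(list)


def put(name, lat, lon, depth=8):
    """Store a point under the key of the cell that contains it."""
    store[quadtree_key(lat, lon, depth)].append((name, lat, lon))


def points_in_same_cell(lat, lon, depth=8):
    """Exact-match lookup of the cell containing the query point."""
    return store.get(quadtree_key(lat, lon, depth), [])


# Hypothetical sample points: two close together, one far away.
put("point_a", 39.2858, -76.6131)
put("point_b", 39.2820, -76.5930)
put("point_c", 35.6586, 139.7454)

print(points_in_same_cell(39.2850, -76.6100))
```

The sketch only finds points that share the query's cell. A real nearest-neighbor or range query would also have to visit neighboring cells and choose the depth to match the query radius.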
The economics of commodity components are undeniable, but they also can suffer from acute reliability problems that introduce new (and often unanticipatable) failure modes. Even in a thoughtful architecture that is putatively designed around unreliable components, these failure modes can have dire consequences, potentially cascading into systemic failure. This talk will dissect some examples of these failures, exploring how the original failing component was able to induce broader failure, how the problem was ultimately understood, and what larger lessons can be drawn from the experience.
Highly scaled distributed web applications are predicated on a functional network, yet organizations rarely have detailed information about the consumption and expense of network resources. This data is essential for effective denial of service detection, intrusion detection, troubleshooting, capacity planning, and traffic engineering, but the time, cost and knowledge required to acquire and analyze the data can be a prohibitive barrier. Most organizations default to reactively analyzing this information after the fact, if at all. The dynamic nature of modern infrastructures can make these challenges even more acute.
This presentation will investigate representative scenarios that would benefit from detailed understanding of network traffic while outlining principles and tools for gathering and evaluating the data.
by Gavin M. Roy
myYearbook.com is one of the top 25 most trafficked websites in the United States, experiencing large scale growth over a very short period of time. Employing technologies such as PHP, PostgreSQL, memcached as well as newer cutting edge technologies, myYearbook.com has been able to achieve operational stability in the face of large volumes of traffic. In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.
What happens when the one part of your infrastructure that should never go down misbehaves? This is a case-study of all the chaos that ensued when the unthinkable happened -- the cache layer went down. CNN.com endures several DDoS attempts every day and, on that particular day, someone got lucky. We will discuss several key factors that inevitably caused an outage of one of the world's most popular news sites from a relatively minor DoS attack, including:
We will also discuss the immediate solutions we used to get the site back on the air, as well as the long term fixes to the underlying issues.
by John Allspaw
You've been working on the wicked new feature for a long time. Design is done, the product people love it, and the code's about as polished as it can be. Launching new public-facing features is different than making small changes to existing functionality. I'll talk about the process we have at Etsy (influenced by Flickr's) for making sure that new awesome thing is *operable* and the right attention has been given to contingency planning, on both the technical and human sides.
by Rod Cope
Hadoop, HBase, and friends are built from the ground up to support Big Data/NoSQL, but that doesn't make them easy. Just like with any other relatively new and complex technologies, there are some rough edges and growing pains to manage. I've learned some hard lessons while deploying HBase tables containing billions of rows and dozens of terabytes on OpenLogic's Hadoop infrastructure. Come to this session to learn about some of the "gotchas" you might run into when deploying Hadoop and HBase in your own private cloud and how to avoid them.
Here are some general areas we'll explore:
This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, and the experiences of others designing large-scale online services. From API to access control, to deployment and configuration, we'll explore the techniques that work, and some that don't, with a critical eye toward your next design.
by Neil Gunther
You probably already collect performance data, but data ain't information. Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decisions. In other words:
information = measurement + method
So, measurement alone is only half the story; you need a method to transform your data. In this presentation I will show you a method that I have developed and applied successfully to large-scale web sites and stack applications to quantify the benefits of proposed scaling strategies. To the degree that you don't quantify your scalability, you run the risk of ending up with WTF rather than FTW.
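The abstract does not name the method. As one illustration of turning measurements into a scalability estimate, the sketch below evaluates the Universal Scalability Law, a model published by the same speaker: C(N) = N / (1 + α(N − 1) + βN(N − 1)). Whether the talk uses this exact model is an assumption, and the coefficient values are made up for illustration rather than taken from any measurement.

```python
import math


def usl_capacity(n, alpha, beta):
    """Relative capacity at load n under the Universal Scalability Law.

    alpha models contention (serialized work) and beta models coherency
    (crosstalk) delay. Both are hypothetical inputs here.
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha, beta):
    """Load at which capacity peaks; infinite if beta is zero."""
    if beta == 0:
        return math.inf
    return math.sqrt((1 - alpha) / beta)


# Illustrative coefficients only (assumed, not measured).
ALPHA, BETA = 0.05, 0.001

for n in (1, 8, 32, 64):
    print(n, round(usl_capacity(n, ALPHA, BETA), 2))
print("peak near", round(usl_peak(ALPHA, BETA), 1))
```

In practice the two coefficients would be fitted to measured throughput at several load levels; past the peak, adding more load or nodes reduces capacity instead of increasing it.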
by Tom Daly
Anycast Routing is used on the Internet to provide many services, including NTP and DNS, but very few know that you can locally deliver websites and content over HTTP/TCP/Anycast. There are many factors that go into designing an anycasted network, including:
We'll discuss a real world event on Dyn Inc's network which caused a severe service degradation for one of our nameservers due to uncontrolled anycast route propagation, where global traffic landed in our Tokyo datacenter. (failure)
We'll also depict how live DDoS attacks are contained to their source region based upon anycast routing. (success)
My Opera started around 2002 as a hacked version of phpBB. By 2007, it was slowly heading for disaster, with severely overloaded databases and backends. Our (back then) million users were just as frustrated as we were.
Today, we have 5M+ users and growing, a lot more features, APIs, browser integration services, and the site is stable and fast. This talk tells the story of these last 3 years. Our successes, our failures, and what remains to be done.
by Joe Williams
The talk will focus on how I (with the help of the entire Cloudant team) built our database service based on CouchDB on top of EC2, and specifically on how we use Erlang, Chef, EC2 and other tools to build highly available and performant database clusters. This includes using Chef and Erlang's hot code upgrades to automate cluster-wide upgrades without restarting any services.
"Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded. Sharding is a strategy that should be understood in its context: as one of the many legitimate choices. This session covers a spectrum of strategies for scaling an application. It gives special coverage to topics that typically force sharding, such as write workload, choice of database technology, and choice of deployment platform. You'll learn the pros and cons of various strategies, and how to avoid the pitfalls and capitalize on the upsides.
Baltimore, United States
30th September to 1st October 2010