The economics of commodity components are undeniable, but those components can also suffer from acute reliability problems that introduce new (and often unanticipatable) failure modes. Even in a thoughtful architecture that is putatively designed around unreliable components, these failure modes can have dire consequences, potentially cascading into systemic failure. This talk will dissect some examples of these failures, exploring how the original failing component was able to induce broader failure, how the problem was ultimately understood, and what larger lessons can be drawn from the experience.
Highly scaled distributed web applications are predicated on a functional network, yet organizations rarely have detailed information about the consumption and expense of network resources. This data is essential for effective denial of service detection, intrusion detection, troubleshooting, capacity planning, and traffic engineering, but the time, cost and knowledge required to acquire and analyze the data can be a prohibitive barrier. Most organizations default to reactively analyzing this information after the fact, if at all. The dynamic nature of modern infrastructures can make these challenges even more acute.
This presentation will investigate representative scenarios that would benefit from detailed understanding of network traffic while outlining principles and tools for gathering and evaluating the data.
by Gavin M. Roy
myYearbook.com is one of the top 25 most trafficked websites in the United States and experienced large-scale growth over a very short period of time. Employing technologies such as PHP, PostgreSQL, and memcached, as well as newer cutting-edge technologies, myYearbook.com has been able to achieve operational stability in the face of large volumes of traffic. In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.
What happens when the one part of your infrastructure that should never go down misbehaves? This is a case study of the chaos that ensued when the unthinkable happened: the cache layer went down. CNN.com endures several DDoS attempts every day and, on that particular day, someone got lucky. We will discuss several key factors that led to an outage of one of the world's most popular news sites from a relatively minor DoS attack, including:
We will also discuss the immediate solutions we used to get the site back on the air, as well as the long term fixes to the underlying issues.
by John Allspaw
You've been working on the wicked new feature for a long time. Design is done, the product people love it, and the code's about as polished as it can be. Launching new public-facing features is different from making small changes to existing functionality. I'll talk about the process we have at Etsy (influenced by Flickr's) for making sure that new awesome thing is *operable* and the right attention has been given to contingency planning, on both the technical and human sides.
by Rod Cope
Hadoop, HBase, and friends are built from the ground up to support Big Data/NoSQL, but that doesn't make them easy. As with any relatively new and complex technology, there are some rough edges and growing pains to manage. I've learned some hard lessons while deploying HBase tables containing billions of rows and dozens of terabytes on OpenLogic's Hadoop infrastructure. Come to this session to learn about some of the "gotchas" you might run into when deploying Hadoop and HBase in your own private cloud and how to avoid them.
Here are some general areas we'll explore:
This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, as well as the experiences of others designing large-scale online services. From API to access control, to deployment and configuration, we'll explore the techniques that work, and some that don't, with a critical eye toward your next design.
by Neil Gunther
You probably already collect performance data, but data ain't information. Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decision. In other words:
information = measurement + method
So, measurement alone is only half the story; you need a method to transform your data. In this presentation I will show you a method that I have developed and applied successfully to large-scale web sites and stack applications to quantify the benefits of proposed scaling strategies. To the degree that you don't quantify your scalability, you run the risk of ending up with WTF rather than FTW.
by Tom Daly
Anycast routing is used on the Internet to provide many services, including NTP and DNS, but few people know that you can also deliver websites and content locally over HTTP/TCP using anycast. There are many factors that go into designing an anycasted network, including:
We'll discuss a real-world event on Dyn Inc.'s network in which uncontrolled anycast route propagation sent global traffic to our Tokyo datacenter, severely degrading service for one of our nameservers. (failure)
We'll also show how anycast routing contains live DDoS attacks within their source region. (success)
My Opera started around 2002 as a hacked version of phpBB. By 2007, it was slowly heading for disaster, with severely overloaded databases and backends. Our million users (at the time) were just as frustrated as we were.
Today, we have more than 5M users and growing, many more features, APIs, and browser integration services, and the site is stable and fast. This talk tells the story of the last three years: our successes, our failures, and what remains to be done.
by Joe Williams
The talk will focus on how I (with the help of the entire Cloudant team) built our database service, based on CouchDB, on top of EC2. Specifically, it covers how we use Erlang, Chef, EC2, and other tools to build highly available and performant database clusters. This includes using Chef and Erlang's hot code upgrades to automate cluster-wide upgrades without restarting any services.
"Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded. Sharding is a strategy that should be understood in its context: as one of the many legitimate choices. This session covers a spectrum of strategies for scaling an application. It gives special coverage to topics that typically force sharding, such as write workload, choice of database technology, and choice of deployment platform. You'll learn the pros and cons of various strategies, and how to avoid the pitfalls and capitalize on the upsides.
Baltimore, United States
30th September to 1st October 2010