This application serves as a tutorial on using Big Data in the cloud. We start with the commoncrawl.org crawl of approximately 5 billion web pages. We use a MapReduce program to scour the Common Crawl corpus for web pages that mention a brand or keyword of interest, say `Citibank`, and that additionally contain a `Follow me on Twitter` link. We harvest each such Twitter handle and store it in HBase. Once we have harvested about 5,000 Twitter handles, we write and run a program that subscribes to the Twitter Streaming API for these users' public status updates. As the status updates pour in, we use a natural language processing library to evaluate the sentiment of each tweet and store the sentiment score back in HBase. Finally, we use a program written in R, together with the rhbase connector, to perform a real-time statistical evaluation of the sentiment the twitterverse expresses towards this brand or keyword. This presentation includes full details on installing and operating all necessary software in the cloud.
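The per-page filtering step of the map phase described above could be sketched roughly as follows. This is an illustrative sketch only, not the talk's actual code: the function name, the handle-matching regex, and the simple substring keyword check are all assumptions, and a real mapper would additionally parse the HTML and emit its results to the MapReduce framework rather than return them.

```python
import re

# Matches a twitter.com profile link and captures the handle.
# Twitter handles are 1-15 characters: letters, digits, underscore.
# The optional "#!/" accounts for the hash-bang URLs Twitter used in 2012.
TWITTER_LINK = re.compile(r'twitter\.com/(?:#!/)?([A-Za-z0-9_]{1,15})')

def extract_handle(html, keyword):
    """Return the Twitter handle found on a page that mentions `keyword`,
    or None if the page does not qualify (no mention, or no profile link)."""
    # Case-insensitive check that the page mentions the brand at all.
    if keyword.lower() not in html.lower():
        return None
    # Take the first twitter.com profile link on the page as the handle.
    match = TWITTER_LINK.search(html)
    return match.group(1) if match else None
```

A qualifying page such as `'<p>Citibank news.</p><a href="http://twitter.com/citibank">Follow me on twitter</a>'` would yield `'citibank'`, which the mapper would then emit for storage in HBase; pages missing either the keyword or the link yield `None` and are skipped.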
by Richard Cole
Amazon Elastic MapReduce is one of the largest operators of Hadoop in the world, with over one million Hadoop clusters run on the Amazon Web Services infrastructure over the last year. Since its launch over three years ago, the Amazon Elastic MapReduce team has helped users of all sizes manage the wide variety of Hadoop failure conditions, hardware and network issues, and data errors that make operating Hadoop clusters so challenging. These failures and performance degradations have led the team to develop a number of tools and best practices to help its customers operate and troubleshoot their Hadoop clusters more efficiently. This talk will detail the team's findings, including a number of Hadoop best practices and general troubleshooting tips.
13th–14th June 2012