This application serves as a tutorial on using Big Data in the cloud. We start with the commoncrawl.org crawl of approximately 5 billion web pages. We use a MapReduce program to scour the Common Crawl corpus for web pages that mention a brand or keyword of interest, say, `Citibank`, and that also have a `Follow me on twitter` link. We harvest the Twitter handle from each such link and store it in HBase. Once we have harvested about 5000 Twitter handles, we write and run a program that subscribes to the Twitter streaming API for the public status updates of these users. As the status updates pour in, we use a natural language processing library to evaluate the sentiment of each tweet, and store the sentiment score back in HBase. Finally, we use a program written in R, together with the rhbase connector, to do a real-time statistical evaluation of the sentiment expressed by the Twitterverse towards this brand or keyword. This presentation includes full details on installing and operating all necessary software in the cloud.
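The harvesting step described above can be sketched in a few lines of Python. This is only an illustration, not the program used in the tutorial: the function names `map_page` and `reduce_handles`, the `TWITTER_LINK` pattern, and the `RESERVED` list are all hypothetical, and the sketch treats any link to a Twitter profile as a `Follow me on twitter` link, which is a simplifying assumption. The map step emits a handle for each page that mentions the keyword, and the reduce step counts how many pages pointed at each handle.

```python
import re

# Assumed pattern: an href pointing at twitter.com/<handle>, with the handle
# limited to 15 characters of letters, digits, and underscores.
TWITTER_LINK = re.compile(
    r'href=["\']https?://(?:www\.)?twitter\.com/(?:#!/)?@?([A-Za-z0-9_]{1,15})/?["\']',
    re.IGNORECASE,
)

# Assumed list of twitter.com paths that are not user handles.
RESERVED = {"share", "home", "search", "intent"}


def map_page(url, html, keyword):
    """Map step: yield (handle, url) for a page that mentions the keyword."""
    if keyword.lower() not in html.lower():
        return
    seen = set()
    for match in TWITTER_LINK.finditer(html):
        handle = match.group(1).lower()
        if handle in RESERVED or handle in seen:
            continue
        seen.add(handle)
        yield handle, url


def reduce_handles(pairs):
    """Reduce step: count how many pages linked to each handle."""
    counts = {}
    for handle, _url in pairs:
        counts[handle] = counts.get(handle, 0) + 1
    return counts
```

In a real job the map and reduce functions would run under the MapReduce framework over the crawl files, and the resulting handles would be written to HBase rather than returned in a dictionary.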
13th–14th June 2012