by Doug Cutting
Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.
by Dave Campbell
In a world where data is increasing 10x every 5 years and 85% of that information comes from new data sources, how do our existing technologies for managing and analyzing data stack up? This talk discusses some of the key implications that Big Data will have for our existing technology infrastructure, and where we need to go as a community and ecosystem to make the most of the opportunity that lies ahead.
How big data tools and technologies give us back our individual identity … because if you didn’t know you were unique and special, well, you are. Big data can be applied to solving socio-economic problems whose scale and importance rival those of building ad optimization models.
by Mike Olson
Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I’ll present a brief summary of some of the critical social and business problems that we’re attacking with the open source Apache Hadoop platform.
by Josh Green
Entrepreneurs and industry executives know that Big Data has the potential to transform the way business is done. But thus far, Big Data has not lived up to the hype. Why? Because a lot of us are making a rookie mistake – a mistake that Panjiva CEO Josh Green made the first time he came up against Big Data: we’re becoming smitten with data sets and trying to find problems that our data sets of choice can help solve.
In the past, entrepreneurs and executives had limited data at their disposal. Coming across a new data set was a rare and precious moment – a bit like striking gold. In this world, an intensive focus on a new data set made sense, because you never knew when you would come across additional data. These instincts, honed in a world of scarce data, are downright dangerous in a world of virtually limitless data.
Working with data is hard. It takes time and money – and, in today’s world, there’s opportunity cost associated with it. When you’re playing around with Data Set A, you’re missing out on an opportunity to play with Data Set B.
To succeed in the Big Data world, entrepreneurs and executives need to be ruthless in prioritizing which data sets they’re going to dig into – and which they’re going to steer clear of. How best to prioritize? In the same way that businesses have always prioritized – by focusing our time, our money, and our energy on our toughest problems.
It all starts with the identification of a problem worth solving. This is a decision that can be made without ever touching Big Data. Once the problem has been identified, the hunt for data is on. And that’s where the real fun begins. Because in today’s Big Data world, the hunt is almost always successful.
Josh will discuss his experiences working with some of the world’s largest companies (Panjiva currently counts over 35 Fortune 500 companies as clients) to track down data to solve a real-world problem – and help you avoid many of the mistakes that he’s made along the way.
by Sanjay Mehta and Eddie Satterly
Organizations today are generating data at an ever-increasing velocity, but how can they leverage this Big Data? In this session, Expedia, one of the world’s leading online travel companies, describes how they tapped into their massive machine data to deliver unprecedented insights across key IT and business areas – from ad metrics and risk analysis to capacity planning, security, and availability analysis. By using Splunk to harness their data, Expedia saved tens of millions of dollars and freed up key resources who can now focus on innovation instead of just operations.
by Dave Rubin
Management Strategies for Big Data will demystify the concepts of Big Data, bring real-world examples, and give a practical guide on how to apply Big Data within an organization. The session is based on the upcoming O’Reilly book, “Management Strategies for Big Data”. We will bring a practical approach to mapping emerging Big Data technologies to real business value.
The target audience is business and technical decision makers. The session objective is to give the listener a high-level understanding of this emerging technology and how to apply it to their business strategy. There will be a focus on supporting technology, target architectures, and case studies of how to find new revenue, improve efficiencies, and uncover new innovation.
by Jen Zeralli and Jeff Sternberg
Some examples of ideas we think the Strata crowd may be interested in include:
Creating user-centric, data-driven products – including a recommendation engine and a Facebook/LinkedIn-style “newsfeed” – in a highly regulated industry where our clients (primarily investment banks, private equity firms, and asset managers) fiercely guard their privacy due to the secretive nature of their businesses. Chinese walls, insider trading rules, and concerns over “private and material” data require us to take a careful and somewhat modified approach compared to traditional consumer applications of these algorithms. This project was also our first foray into collective intelligence as well as Hadoop.
Entity management, which is critical to our data accuracy and quality. This can be quite a beast in a world where companies are constantly forming, merging, and going out of business. We employ a variety of methods to maintain the accuracy of our data, from algorithmic checks to manual review and user-facing linking/workflow applications.
Document (SEC filings, transcripts, etc.) parsing and processing. The timeliness and accuracy of this data is critical to our business. Our implementation of Solr has significantly improved our process and turnaround time.
Ingesting proprietary client data including their portfolios and running advanced analytics (attribution, risk analytics, etc) on this data.
The vast permutations of data available in our company, person, key development, and transaction screening engine, another tool where speed is vital for our clients.
Operating as the data arm of a larger enterprise that moves the market (for example, by downgrading the United States’ credit rating this year).
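As a concrete illustration of the algorithmic checks mentioned above for entity management, a first line of defense is normalizing company names before comparing them, so that trivial variations in case, punctuation, and legal suffixes do not spawn duplicate entities. This is only a minimal sketch; the suffix list and example names are invented assumptions, not the actual production implementation.

```python
import re

# Legal-form suffixes to strip before comparison (illustrative list).
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "llc", "co"}

def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def same_entity(a: str, b: str) -> bool:
    """Flag two records as candidates for the same company entity."""
    return normalize_name(a) == normalize_name(b)
```

In practice a check like this only proposes candidate matches; ambiguous cases would still go to the manual review and linking workflows the session describes.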
This session will teach participants how to architect big data systems that leverage virtualization and platform as a service.
We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service. We will show how virtualization can be used to simplify deployment and provisioning of Hadoop, SQL and NoSQL databases. We will describe the workload patterns of Hadoop and the infrastructure design implications. We will discuss the current and future role of PaaS to make it easy to deploy Java, SQL, R, and Python jobs against big-data sets.
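One way to picture the layered platform described above is a thin dispatch layer that routes a job description to the appropriate runtime (Hadoop, SQL, or a NoSQL store) without the caller knowing which provisioned machines run it. Everything here – the `JobSpec` fields, the job kinds, and the runtime messages – is a hypothetical sketch of the layering, not a real PaaS API.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    name: str
    kind: str        # "hadoop", "sql", or "nosql" (hypothetical categories)
    payload: str     # jar path, SQL text, or query document

# Map each job kind to the layer that should run it (all names illustrative).
RUNTIMES = {
    "hadoop": lambda job: f"submitted {job.payload} to the Hadoop cluster",
    "sql":    lambda job: f"ran query against the SQL tier: {job.payload}",
    "nosql":  lambda job: f"executed {job.payload} on the NoSQL store",
}

def submit(job: JobSpec) -> str:
    """Route a job to its runtime; the caller never sees which VMs ran it."""
    if job.kind not in RUNTIMES:
        raise ValueError(f"no runtime for job kind {job.kind!r}")
    return RUNTIMES[job.kind](job)
```

The design point is that virtualization and provisioning tools sit beneath this interface, so adding a new runtime means registering one more entry rather than changing every client.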
by Ian White
The era of big geodata has arrived. Federal transparency initiatives have spawned millions of rows of data, while state and local programs engage developers and wonks with APIs, contests, and data galore. Sub-meter imagery ensures unparalleled accuracy, and ongoing collection efforts mean timely updates. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how and (maybe) for what.
With opportunity comes challenge – the expertise in sourcing, identifying, collecting, normalizing and maintaining geographic data is often overlooked in the mad rush to analyze. Curation, the human side of extract, transform and load (ETL), has grown in scope, scale, and importance as data proliferation translates into a deluge of non-standardized data types that lack sufficient documentation or validation, calling their underlying value into question. Big Data calls for expertise in curation: acquiring, validating, and arranging data in collections that are relevant to the right audience at the right time.
The CEO of Urban Mapping, Ian White, will demonstrate why your maps are only as good as your data, examine the issues around big data curation, and illustrate how data acquisition can be addressed from the outset of any geospatial intelligence project or program planning.
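A tiny slice of the normalization work described above might look like the following: coercing coordinates that arrive in inconsistent shapes from different sources into one validated form. The field-name aliases and the decision to reject out-of-range points are assumptions made for this sketch, not a statement of Urban Mapping's pipeline.

```python
from typing import Optional, Tuple

def normalize_point(record: dict) -> Optional[Tuple[float, float]]:
    """Coerce a heterogeneous source record into a validated (lat, lon) pair."""
    # Different feeds label coordinates differently (illustrative aliases).
    lat = next((record[k] for k in ("lat", "latitude") if k in record), None)
    lon = next((record[k] for k in ("lon", "lng", "longitude") if k in record), None)
    if lat is None or lon is None:
        return None
    lat, lon = float(lat), float(lon)
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None  # fails validation; route to manual review
    return (lat, lon)
```

Even this toy version shows why curation is human work: deciding which aliases are equivalent and what counts as invalid is judgment, not just code.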
by Leigh Dodds
There are many different approaches to putting data on the web, ranging from bulk downloads through to rich APIs. These styles suit a range of different data processing and integration patterns. But the history of the web has shown that value and network effects follow from making things addressable.
Facebook’s Open Graph, Schema.org, and a recent scramble towards a “Rosetta Stone” for geodata are all examples of a trend towards linking data across the web. Weaving data into the web simplifies data integration. Big Data offers ways to mine huge datasets for insight. Linked Data creates massively inter-connected datasets that can be mined or drawn upon to enrich queries and analysis.
This talk will look at the concept of Linked Data and how a rapidly growing number of inter-connected databases, from a diverse range of sources, can be used to contextualise Big Data.
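A toy illustration of the enrichment idea above: when two datasets share identifiers (here, URI-like strings), records from an external linked dataset can be joined in to add context to a local query. The URIs and fields are invented for the example; real Linked Data would use RDF vocabularies and SPARQL rather than Python dicts.

```python
# Local dataset: figures keyed by a shared company URI (invented data).
local_sales = {
    "http://example.org/company/acme": {"revenue": 120},
    "http://example.org/company/globex": {"revenue": 75},
}

# External linked dataset: extra attributes published under the same URIs.
linked_facts = {
    "http://example.org/company/acme": {"hq": "Springfield", "sector": "manufacturing"},
    "http://example.org/company/globex": {"hq": "Cypress Creek", "sector": "energy"},
}

def enrich(uri: str) -> dict:
    """Merge local and linked records for one entity, keyed by the shared URI."""
    merged = dict(local_sales.get(uri, {}))
    merged.update(linked_facts.get(uri, {}))
    return merged
```

The addressability point from the talk is visible even here: because both publishers agreed on the same identifiers, the join needs no bespoke matching logic at all.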
28th February to 1st March 2012