by Chris Deptula and James Dixon
The big data world is chaotic, built on technology still in its infancy. Learn how to tame this chaos, integrate it within your existing data environments (RDBMS, analytic databases, applications), manage the workflow, orchestrate jobs, improve productivity and make big data technologies accessible to a much wider spectrum of developers, analysts and data scientists. Learn how you can leverage Hadoop and NoSQL stores via an intuitive, graphical big data IDE – eliminating the need for deep developer skills such as Hadoop MapReduce, Pig scripting, or NoSQL queries.
The effect of big data on all business models cannot be denied. This panel of SCM experts looks at how businesses are using, or should be using, big data to drive supply chain management, focusing on the broader manufacturing issues that must be addressed as well as practical tips for dealing with supply chains that now span the globe.
Despite the hype, enterprises are still wrapping their arms around the large amounts of data they’re sitting on, and how to leverage it. In this session, we’ll look at a snapshot of how enterprises are thinking about their big data strategies, and what it means to their top line.
by Richard Taylor
Do you want to write less code and get more done? This tutorial will demonstrate a natural language parsing technology to extract entities from all kinds of text using massively parallel clusters. Attendees will gain hands-on experience with the newly released, data-centric cluster programming technology from HPCC Systems to extract entities from semi-structured and free-form text data. Students will leave with all the data and code used in the class along with the latest HPCC Client Tools installation, HPCC documentation, and HPCC’s VMware installation. Prizes, give-aways and a raffle are included.
This session is sponsored by HPCC
by Nate McCall
The database industry has been abuzz over the past year about NoSQL databases and their applications in the realm of solutions commonly placed under the ‘big data’ heading.
This interest has filtered down to software development organizations, which have had to scramble to make sense of terminology, concepts and patterns, particularly in areas of distributed computing that were previously limited to academia and a very small number of special-case applications.
Like all of these systems, Apache Cassandra, which has quickly emerged as a best-of-breed solution in the NoSQL/Big Data space, still presents a substantial learning curve to implement correctly.
This tutorial will walk attendees through the fundamentals of Apache Cassandra: installing a small working cluster either locally or via a cloud provider, and configuring and managing that cluster with the tools provided in the open source distribution.
Attendees will then use this cluster to build a simple Java web application, gaining practical, hands-on experience in designing applications that take advantage of the massive performance gains and operational efficiency of a correctly architected Apache Cassandra cluster.
Attendees should leave the tutorial with hands-on knowledge of building a real, working distributed database.
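None of this code appears in the tutorial description itself, but the data-distribution idea behind a “correctly architected” Cassandra cluster can be sketched in a few lines. The ring below is a toy model: it assumes one token per node (real clusters use many virtual nodes per host) and uses MD5 in place of Cassandra’s Murmur3 partitioner; the node names are made up.

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Hash a key onto the ring (Cassandra uses Murmur3; MD5 here for brevity)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy token ring: a key belongs to the first node whose token follows the key's token."""
    def __init__(self, nodes):
        # One token per node for simplicity; sorted so we can binary-search.
        self.tokens = sorted((token(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        t = token(key)
        keys = [tok for tok, _ in self.tokens]
        i = bisect.bisect_right(keys, t) % len(self.tokens)  # wrap around the ring
        return self.tokens[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic routing, no coordination needed
```

The point of the scheme is that any client can compute a key’s owner locally, which is what lets a cluster scale writes without a central master.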
by Mark Madsen
Mark Madsen talks about how regular businesses will eventually embrace a data-driven mindset, with some trademark ‘Madsen’ historical background to put it in context. People throw around ‘industrial revolution of data’ and ‘new oil’ a lot without really thinking about what things like the scientific method, steam power, or petrochemicals actually brought about.
“Big data” provides the opportunity to combine new, rich data sources in novel ways to discover business insights. How do you use analytics to exploit this data so that it will yield real business value? Learn a proven technique that ensures you identify where and how big data analytics can be successfully deployed within your organization. Case study examples will demonstrate its use.
In this session, business agility expert Michael Hugos will present examples from his work in applying immersive animation techniques and gaming dynamics, and discuss how they can address the challenges of consuming – and responding to – the data deluge, turning information overload into business advantage.
by Marcia Tal
In this session, Marcia Tal will demonstrate how significant business value is being realized through sophisticated understanding of intent and interconnectedness, at scale.
by Marti Hearst
Search user interfaces are slow to change; ideas for new search interfaces rarely take hold. This talk will forecast how search is likely to change and what will stay the same in the coming years.
by Jon Bruner
Jon Bruner leads a panel discussion with a few of the day’s presenters and takes final questions from the audience.
by Doug Cutting
Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.
by Dave Campbell
In a world where data is increasing 10x every 5 years and 85% of that information comes from new data sources, how do our existing technologies for managing and analyzing data stack up? This talk discusses some of the key implications that Big Data will have on our existing technology infrastructure and where we need to go as a community and ecosystem to make the most of the opportunity that lies ahead.
How big data tools and technologies give us back our individual identity … because if you didn’t know you were unique and special, well, you are. Big data can be applied to solving socio-economic problems that rival the scale and importance of building ad optimization models.
by Mike Olson
Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I’ll present a brief summary of some of the critical social and business problems that we’re attacking with the open source Apache Hadoop platform.
by Josh Green
Entrepreneurs and industry executives know that Big Data has the potential to transform the way business is done. But thus far, Big Data has not lived up to the hype. Why? Because a lot of us are making a rookie mistake – a mistake that Panjiva CEO Josh Green made the first time he came up against Big Data – we’re becoming smitten with data sets and trying to find problems that our data sets of choice can help solve.
In the past, entrepreneurs and executives had limited data at their disposal. Coming across a new data set was a rare and precious moment – a bit like striking gold. In this world, an intensive focus on a new data set made sense, because you never knew when you would come across additional data. These instincts, honed in a world of scarce data, are downright dangerous in a world of virtually limitless data.
Working with data is hard. It takes time and money – and, in today’s world, there’s opportunity cost associated with it. When you’re playing around with Data Set A, you’re missing out on an opportunity to play with Data Set B.
To succeed in the Big Data world, entrepreneurs and executives need to be ruthless in prioritizing which data sets they’re going to dig into – and which they’re going to steer clear of. How best to prioritize? In the same way that businesses have always prioritized – by focusing our time, our money, and our energy on our toughest problems.
It all starts with the identification of a problem worth solving. This is a decision that can be made without ever touching Big Data. Once the problem has been identified, the hunt for data is on. And that’s where the real fun begins. Because in today’s Big Data world, the hunt is almost always successful.
Josh will discuss his experiences working with some of the world’s largest companies (Panjiva currently counts over 35 Fortune 500 companies as clients) to track down data to solve a real-world problem – and help you avoid many of the mistakes that he’s made along the way.
by Sanjay Mehta and Eddie Satterly
Organizations today are generating data at an ever-increasing velocity, but how can they leverage this Big Data? In this session, Expedia, one of the world’s leading online travel companies, describes how they tapped into their massive machine data to deliver unprecedented insights across key IT and business areas – from ad metrics and risk analysis, to capacity planning, security, and availability analysis. By using Splunk to harness their data, Expedia saved tens of millions of dollars and freed up key resources who can now focus on innovation instead of just operations.
by Dave Rubin
Management Strategies for Big Data will demystify the concepts of Big Data, bring real-world examples and give a practical guide on how to apply Big Data within your organization. This is based on the upcoming O’Reilly book, “Management Strategies for Big Data”. We will bring a practical approach to mapping emerging Big Data technologies to real business value.
The target audience is business and technical decision makers. The session objective is to give the listener a high-level understanding of this new emerging technology and how to apply it to their business strategy. There will be a focus on supporting technology, target architectures and case studies of how to find new revenue, improve efficiency and uncover new innovation.
by Jen Zeralli and Jeff Sternberg
Some examples of ideas we think the Strata crowd may be interested in include:
Creating user-centric, data-driven products including a recommendation engine and Facebook/Linkedin-style “newsfeed” product in a highly regulated industry where our clients (primarily Investment Banks, Private Equity Firms and Asset Managers) fiercely guard their privacy due to the secretive nature of their businesses. Chinese Walls, Insider Trading and concerns over “Private and Material” data require us to take a careful and somewhat modified approach when compared to traditional consumer applications of these types of algorithms. This project was also the first time we got our feet wet in collective intelligence as well as Hadoop.
Entity management which is critical to our data accuracy and quality. This can be quite a beast in a world where companies are constantly forming, merging and going out of business. We employ a variety of methods to maintain the accuracy of our data from algorithmic checks to manual review and user-facing linking/workflow applications.
Document (SEC filings, transcripts, etc.) parsing and processing. Timeliness and accuracy of this data are critical to our business. Our implementation of SOLR has significantly improved our process and turnaround time.
Ingesting proprietary client data including their portfolios and running advanced analytics (attribution, risk analytics, etc) on this data.
The vast permutations of data available in our company, person, key development, and transaction screening engine, another tool where speed is vital for our clients.
Operating as the data arm of a larger enterprise that moves the market (for example by downgrading the United States’ credit rating this year).
This session will teach participants how to architect big data systems that leverage virtualization and platform as a service.
We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service. We will show how virtualization can be used to simplify deployment and provisioning of Hadoop, SQL and NoSQL databases. We will describe the workload patterns of Hadoop and the infrastructure design implications. We will discuss the current and future role of PaaS to make it easy to deploy Java, SQL, R, and Python jobs against big-data sets.
by Ian White
The era of big geodata has arrived. Federal transparency initiatives have spawned millions of rows of data, state and local programs engage developers and wonks with APIs, contests and data galore. Sub meter imagery ensures unparalleled accuracy and collection efforts mean timely updates. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how and (maybe) for what.
With opportunity comes challenge—the expertise in sourcing, identifying, collecting, normalizing and maintaining geographic data is often overlooked in the mad rush to analyze. Curation, or the human side of extract, transform and load (ETL), has increased in scope, scale and importance as data proliferation translates into a deluge of non-standardized data types that lack sufficient documentation or validation, calling their underlying value into question. Big Data calls for expertise in curating: acquiring, validating and arranging data in collections that are relevant to the right audience at the right time.
Ian White, CEO of Urban Mapping, will demonstrate why your maps are only as good as your data, examine the issues around big data curation, and illustrate how data acquisition can be addressed from the outset of any geospatial intelligence project or program planning.
by Leigh Dodds
There are many different approaches to putting data on the web, ranging from bulk downloads through to rich APIs. These styles suit a range of different data processing and integration patterns. But the history of the web has shown that value and network effects follow from making things addressable.
Facebook’s Open Graph, Schema.org, and a recent scramble towards a “Rosetta Stone” for geodata are all examples of a trend towards linking data across the web. Weaving data into the web simplifies data integration. Big Data offers ways to mine huge datasets for insight. Linked Data creates massively inter-connected datasets that can be mined or drawn upon to enrich queries and analysis.
This talk will look at the concept of Linked Data and how a rapidly growing number of inter-connected databases, from a diverse range of sources, can be used to contextualise Big Data.
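As a toy illustration of the Linked Data idea sketched above (all of the identifiers and facts below are invented for the example), two independently published datasets that share URIs can be merged and queried together without any schema negotiation:

```python
# Two datasets from different publishers, expressed as (subject, predicate, object)
# triples. They share the identifier "geo:london", which is what links them.
company_data = [
    ("ex:acme", "ex:headquarteredIn", "geo:london"),
    ("ex:acme", "ex:industry", "ex:manufacturing"),
]
geo_data = [
    ("geo:london", "geo:population", 8_000_000),
    ("geo:london", "geo:country", "geo:uk"),
]

triples = company_data + geo_data  # integration is just concatenation

def query(subject=None, predicate=None, obj=None):
    """Match triples against a simple (s, p, o) pattern; None acts as a wildcard."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]

# Enrich a company record with context drawn from the linked geo dataset.
hq = query("ex:acme", "ex:headquarteredIn")[0][2]
population = query(hq, "geo:population")[0][2]
```

Real Linked Data uses RDF and globally dereferenceable URIs rather than in-memory tuples, but the enrichment step works the same way: follow a shared identifier from one dataset into another.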
by Jon Gosier
Big data isn’t just an abstract problem for corporations, financial firms, and tech companies. To your mother, a ‘big data’ problem might simply be too much email, or a lost file on her computer.
We need to democratize access to the tools used for understanding information by taking the hard work out of drawing insight from excessive quantities of information, helping humans process content more efficiently and capture more of their world.
Tools to effectively do this need to be visual, intuitive, and quick. This talk looks at some of the data visualization platforms that are helping to solve big data problems for normal people.
How are businesses using big data to connect with their customers, deliver new products or services faster and create a competitive advantage? Luke Lonergan, co-founder & CTO, Greenplum, a division of EMC, gives insight into the changing nature of customer intimacy and how the technologies and techniques around big data analysis provide business advantage in today’s social, mobile environment – and why it is imperative to adopt a big data analytics strategy.
This session is sponsored by Greenplum, a division of EMC²
Big Data provides big banks with the means to monetize the transaction data stream in ways that are both pro-consumer and pro-merchant. By utilizing data-driven personalization services, financial institutions can offer a better customer experience and boost customer loyalty. For example, integrating rewards and analysis within a consumer’s online banking statement can save a consumer on average $1,000 per year just by comparing plans, pricing, and usage habits within wireless, cable, and gas categories. Financial institutions benefit by increasing their relationship value with customers. Merchants benefit from increased analytics and are able to reward loyal customers with deals that matter most based upon their purchasing habits.
These data-driven services increase a bank’s relationship value with customers. 94% of consumers indicate that they’d use a card tied to money-saving discounts over one that was not, and 3 in 4 admit that they’d switch banks if their bank did not offer loyalty rewards.
Big data is not just big stakes for loyalty—it can be used to drive customer acquisition and increase market share (or credit card ‘share of wallet’), which drives other banking revenue streams.
Furthermore, data-driven offerings help promote the conversion of non-online customers to online banking and billpay, a cost-reduction potential of $167 per account per year, or $8.3 billion annually, according to Javelin.
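As a back-of-the-envelope check of the two Javelin figures cited above, the implied number of convertible accounts (a derived number, not one stated in the source) works out to roughly 50 million:

```python
# Figures as cited: $167 saved per account per year, $8.3 billion annually.
saving_per_account = 167       # dollars per account per year (Javelin)
total_saving = 8.3e9           # dollars per year (Javelin)

# Derived, not from the source: how many offline accounts those two
# numbers jointly imply could be converted to online banking/billpay.
implied_accounts = total_saving / saving_per_account  # ~49.7 million
```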
by Gary Lang
Gary Lang, Senior VP Engineering, MarkLogic, will discuss the concept of Big Data Applications and walk through three in-production implementations of Big Data Applications in action. These applications include how LexisNexis built a next-generation search application, how a major financial institution simplified its technology infrastructure for managing complex derivative trades, and how a major movie studio implemented an Enterprise Data Layer for access to all of their content across multiple silos.
This session is sponsored by MarkLogic
by Vineet Tyagi
Enterprises today are well on their way to putting Big Data to work. Many are experimenting with Big Data, if not already in production. The data deluge is forcing everyone to ask the key question: what is the cost of big data analytics? This session will address some of the key concerns in creating a Big Data solution that will provide for a lower cost “per TB Data Managed and Analyzed”.
The session will talk about why nobody wants to discuss the costs involved with Hadoop, NoSQL and other options. It will also show how to reduce costs, choose the right technology options and address some of the unspoken issues in dealing with Big Data.
This session is sponsored by Impetus Technologies
by Siraj Khaliq
One doesn’t normally think about Big Data when the rain falls, but we’ve been measuring and analyzing Big Weather for years. Thanks to recent advancements in Big Data, cloud computing, and network maturity, it’s now possible to work with extremely large weather-related data sets.
The Climate Corporation combines Big Data, climatology and agronomics to protect the $3 trillion global agriculture industry with automated full-season weather insurance. Every day, The Climate Corporation utilizes 2.5 million daily weather measurements, 150 billion soil observations, and 10 trillion scenario data points to build and price their products. At any given time, more than 50 terabytes of data are stored in their systems, the equivalent of 100,000 full-length movies or 10,000,000 music tracks. All of this provides the intelligence and analysis necessary to reduce the risk to U.S. farmers from adverse weather, which causes more than 90% of crop loss.
The Climate Corporation’s generation system uses thousands of servers to periodically process decades of historical data and generate 10,000 weather scenarios at each location and measurement, going out several years. This results in over 10 trillion scenario data points (e.g. an expected rainfall value at a specific place and time in the future) for use in an insurance premium pricing and risk analysis system, amounting to over fifty terabytes of data in live systems at any given time. Weather-related data is ingested multiple times a day directly from major climate models and incorporated into The Climate Corporation’s system. Under the hood, The Climate Corporation’s Web site runs complex algorithms against a huge dataset in real time, returning a premium price within seconds. The size of this data set has grown an average of 10x every year as the company adds more granular geographic data. Hear The Climate Corporation CEO David Friedberg discuss how to apply big data principles to the real-world challenge of protecting people and businesses from the financial impact of adverse weather.
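The scenario-based pricing described above can be sketched in miniature. The rainfall distribution, drought trigger, payout, and loading factor below are all invented for illustration; the real system prices against thousands of generated scenarios per location rather than a toy Gaussian.

```python
import random

random.seed(0)  # make the sketch reproducible

def simulate_season_rainfall() -> float:
    """Stand-in for one generated weather scenario: total season rainfall in inches.
    The distribution here is invented for illustration."""
    return max(0.0, random.gauss(20.0, 6.0))

def price_premium(n_scenarios: int, trigger: float, payout: float,
                  load: float = 1.2) -> float:
    """Premium = expected payout across simulated scenarios, times a loading
    factor covering risk margin and expenses. Pays out in drought scenarios,
    i.e. whenever rainfall falls below the trigger."""
    total = sum(payout for _ in range(n_scenarios)
                if simulate_season_rainfall() < trigger)
    expected_loss = total / n_scenarios
    return expected_loss * load

premium = price_premium(n_scenarios=10_000, trigger=12.0, payout=100.0)
```

The expensive part at production scale is not the arithmetic but generating and storing the scenarios themselves, which is why the abstract emphasizes precomputing them on thousands of servers.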
by Peter Kuhn
Personalized Cancer Care: How to predict and monitor the response of cancer drugs in individual patients.
1. Biology: Cancer spreads through the body when cancer cells leave the primary site, travel through the blood to find a new site where they can settle, colonize, expand and eventually kill the patient.
2. Challenge: the concentration of cancer cells is about 1 in 1 million normal white blood cells, or 1 in 2 billion cells if you include the red blood cells. That makes for about a handful of these cells in a tube of blood (if you have given blood before, you can picture this pretty easily). A cell is about 10 microns in diameter.
3. Opportunity: if we could find these cells, we could take a tube of blood at any time and characterize the disease in that patient at that point in time to make treatment decisions. We have significant numbers of drugs going through the development pipeline but no good way of deciding which drug to take at which time.
4. Solution: create a large monolayer of 10 million cells, stain the cells, image them, and then find the cancer cells computationally through an iterative process. It is a simple data-driven solution to a very large challenge. It is simple in the world of algorithms, HPC and the cloud, and is set up to revolutionize cancer care.
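The detection step can be sketched with simulated stain intensities. The single threshold below is a stand-in for the real iterative classifier, and every number (cell counts, intensity distributions, cutoff) is illustrative, not drawn from the actual assay:

```python
import random

random.seed(1)

# Simulate a monolayer: a million normal cells plus "a handful" of rare
# cancer cells whose stain intensity is markedly higher. All values invented.
N_NORMAL = 1_000_000
N_CANCER = 5

normal = [random.gauss(100.0, 10.0) for _ in range(N_NORMAL)]
cancer = [random.gauss(200.0, 10.0) for _ in range(N_CANCER)]
intensities = normal + cancer  # cancer cells occupy the last N_CANCER slots

# One thresholding pass: flag cells whose stain intensity stands far above
# the normal population (a placeholder for the iterative process described).
THRESHOLD = 160.0
candidates = [i for i, v in enumerate(intensities) if v > THRESHOLD]
```

The needle-in-a-haystack ratio is the whole computational problem: at 1 in a million, a naive scan is easy, but the real system must also reject staining artifacts and debris, which is what makes the process iterative.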
28th February to 1st March 2012