Your current filters are…
by Jon Gosier
Big data isn’t just an abstract problem for corporations, financial firms, and tech companies. To your mother, a ‘big data’ problem might simply be too much email, or a lost file on her computer.
We need to democratize access to the tools used for understanding information by taking the hard work out of drawing insight from excessive quantities of information, helping humans process content more efficiently and capture more of their world.
Tools to effectively do this need to be visual, intuitive, and quick. This talk looks at some of the data visualization platforms that are helping to solve big data problems for normal people.
How are businesses using big data to connect with their customers, deliver new products or services faster and create a competitive advantage? Luke Lonergan, co-founder & CTO, Greenplum, a division of EMC, gives insight into the changing nature of customer intimacy and how the technologies and techniques around big data analysis provide business advantage in today’s social, mobile environment – and why it is imperative to adopt a big data analytics strategy.
This session is sponsored by Greenplum, a division of EMC²
by Coco Krumme
Why data can tell us only so much about food, flavor, and our preferences.
by Pete Warden
Why unstructured data beats structured.
by Usman Haque
The expected massive growth of connected device, appliance and sensor markets in the coming years – often called ‘The Internet of Things’ – will need a richer concept of ‘open data’ than is currently common. When data is generated through the activities of people doing things inside their homes and outside in public in their cities, the question of who owns the data becomes almost irrelevant next to the questions of who has access to the data, what they do with it, and how citizens manage and make sense of their data while retaining the ‘openness’ that has driven creativity and business on the web over the last few years.
by Gary Lang
Big Data is about extracting value from fast, huge, varied, complex data sets. But simply crunching data is only the first step. As adoption of MapReduce and data analytics technologies increases, forward-thinking companies are starting to build applications on their core data assets. In this keynote, MarkLogic’s Gary Lang will explore what these Big Data Applications look like, offering some tantalizing real-world glimpses at what data wrapped in applications makes possible.
This keynote is sponsored by MarkLogic
by Hal Varian
Google Insights for Search provides an index of search activity for millions of queries. These queries can sometimes help us understand consumer behavior. Hal describes some of the issues that arise in trying to use this data for short-term economic forecasts and provides examples.
This session will shed light on real-world use cases for NoSQL databases by providing case studies from enterprise production users taking advantage of the massively scalable and highly-available architecture of Apache Cassandra.
At the end of this session you will have a good understanding of the types of requirements Cassandra can satisfy through a carefully thought-out architecture designed to manage all forms of modern data: one that scales to meet the requirements of “big data” management, offers linear performance scale-out, and delivers the type of high availability that almost every online, 24×7 application needs.
Attendees of this session will learn, among other things, how to handle component failures in a complex distributed system. The cloud monitoring team uses geographic redundancy and isolation along with an innovative build process to create a system where failures can be quickly detected and addressed by the team.
They can also expect to learn how to understand and cope with the relational/non-relational impedance mismatch. There are data modeling anti-patterns that are easy to fall prey to when coming from a relational background, and the right approaches are often not intuitive.
Finally, attendees will hear how to make open source work. Many open source projects suffer from poor documentation and support, or from companies that offer support at exorbitant prices. Understanding how open source communities work and employing engineers familiar with open source software makes it easier to leverage these projects.
by Kirkland Barrett
Learn how Microsoft manages a 10,000-person IT organization, utilizing Business Intelligence capabilities to drive communication of strategy, performance monitoring of key analytics, employee self-service BI, and leadership decision-making throughout the global Microsoft IT organization. The session will focus on the high-level BI challenges and needs of IT executives, Microsoft IT’s BI strategy, and the capabilities that helped drive internal BI use from 300 users to over 40,000 users (and growing) through self-service BI methodologies.
So much of the privacy discussion is about data collection and access, fears of a future dystopia, and the complexities of law. There seems to be a real vacuum around how societal norms should be mapped to rapidly growing capabilities of big data. What’s difficult about some of these big data use-cases is that even the intended and approved uses of data can lead to decisions or actions that negatively affect specific individuals or groups. These can range from effects on safety (by making a person more easily identifiable or locatable), to fairness (because the purpose of the application is some form of discrimination), to autonomy (by limiting individual choice or through subtle manipulation).
Regrettably, data professionals (e.g., scientists, engineers, designers, analysts) are left in a “don’t ask don’t tell” privacy conundrum where no framework exists to assess the societal impact of their work. Such a framework would need to go beyond default “procedural protections” (e.g., the Fair Information Practice Principles) to “substantive protections” that evaluate possible product impact at design-time and track actual impact as the product moves into the market.
This conversation will address, from academic and industrial perspectives, specific use-cases within people search, background checks, online advertising, and voter targeting. Through these use-cases, we’ll explore the feasibility of a “responsible innovation” framework that might guide data professionals.
There are many modern techniques for identifying anomalies in datasets. There are fewer that work as online algorithms suitable for application to real-time streaming data. What’s worse? Most of these methodologies require a deep understanding of the data itself. In this talk, we tour the options for identifying anomalies in real-time data and discuss how much we really need to know beforehand to answer the ever-useful question: is this normal?
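By way of illustration, one common option for streaming data is a rolling z-score: keep the running mean and variance updated online (Welford's algorithm), so no history needs to be stored, and flag points far from the mean. This is a minimal sketch with an illustrative threshold and toy stream, not the specific methods covered in the talk:

```python
import math

class OnlineAnomalyDetector:
    """Flags values far from the running mean, using Welford's
    online algorithm so no history needs to be stored."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations
        self.threshold = threshold

    def is_anomaly(self, x):
        # Score against the statistics seen so far, then fold x in.
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = OnlineAnomalyDetector(threshold=3.0)
stream = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 55.0, 10.0]
flags = [detector.is_anomaly(x) for x in stream]
```

Note the caveat the talk alludes to: even this simple detector bakes in an assumption (roughly stationary, unimodal data) that you need to know holds before trusting its answer.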
by Vipul Sharma
Recommendation systems have become critical for delivering relevant and personalized content to your users. Such systems not only drive revenue and generate significant user engagement for web companies but also serve as a great discovery tool for users. Facebook’s news feed, LinkedIn’s People You May Know, and Eventbrite’s event recommendations are some great examples of recommendation systems.
During this talk we will share the architecture and design of Eventbrite’s data platform and recommendation engine. We will describe how we mined a massive social graph of 18M users and 6B first-degree connections to provide relevant event recommendations. We will provide details of our data platform, which supports processing more than 2 TB of social graph data daily. We intend to describe how Hadoop is becoming the most important tool for data mining, and also to discuss how machine learning is changing in the presence of Hadoop and big data.
We hope to provide enough details that folks can learn from our experiences while building their data platform and recommendation systems.
by Max Gadney
Videographics meet the two most important criteria of the visualizer: they engage attention and they inform.
I am currently working with the BBC to define a new format – that of the ‘Video Data Graphic’. Some of these exist online, to varying degrees of success, but we are codifying best practice, auditing current activity, and can show our work in its market context.
I will discuss how video is an information-rich medium, drawing on a survey of data resolution across media, and how these videos can complement the BBC online offering as a whole.
Some subjects to cover will be:
- storytelling principles – what actually works in 2 minutes
- scripting and storyboarding – drafting a plan
- timescales, costs and resources
- designing for cognition – how video needs to understand how we perceive
I’ll be showing many examples in addition to our work.
This is a fast-paced session, with lots to look at and an excellent mix of storytelling and information design ideas. It strikes a fine balance between theory and practical advice.
Apache Hadoop is the leading platform for storing, processing and managing “big data”. Please join Arun C. Murthy, Hortonworks co-founder and VP of Apache Hadoop for the Apache Software Foundation, for a discussion about the next generation of Apache Hadoop, known as hadoop-0.23. Attend this session to learn how MapReduce has been re-architected by the community to improve reliability, availability and scalability as well as the ability to support alternate programming paradigms. You will also learn about HDFS Federation, which allows for significant scalability improvements, as well as other important advancements. Arun will also share details about the roadmap and answer questions from the audience about future enhancements to Apache Hadoop.
This session is sponsored by Hortonworks
Social data is growing: Twitter produces 250+ million tweets per day and 27 million links to news and media. Big Data can give insights into these large datasets, but first the data must be curated, cleaned and quantified before it has value. We will cover how we move from unstructured to structured data, and how we take simple data and apply complex processes to give it context.
We will cover how we developed a platform that can deal with billions of items per day and perform complex analysis before handing the data onto thousands of customers in real-time. We will also walk through our platform architecture looking at our use of Hadoop, HBase, 0MQ, Kafka and many other cutting edge technologies. You will learn some of the pitfalls of running a production Hadoop cluster and the value when you make it work.
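At its simplest, the unstructured-to-structured step means pulling structure (links, hashtags, mentions) out of raw tweet text. The sketch below is illustrative only; the field names and patterns are invented, not the platform's actual schema:

```python
import re

LINK_RE = re.compile(r'https?://\S+')
TAG_RE = re.compile(r'#(\w+)')
MENTION_RE = re.compile(r'@(\w+)')

def structure_tweet(raw):
    """Turn a raw tweet string into a structured record:
    pull out links, hashtags and mentions, and keep the
    cleaned text for downstream analysis."""
    links = LINK_RE.findall(raw)
    tags = [t.lower() for t in TAG_RE.findall(raw)]
    mentions = MENTION_RE.findall(raw)
    text = LINK_RE.sub('', raw).strip()
    return {
        'text': text,
        'links': links,
        'hashtags': tags,
        'mentions': mentions,
        'has_media_link': len(links) > 0,
    }

record = structure_tweet(
    "Loving the #BigData keynote by @acmurthy http://example.com/talk")
```

At billions of items per day, each such extraction would run inside a distributed pipeline (Hadoop, Kafka, etc.) rather than a single process, but the per-item transformation looks much like this.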
In “The Evolution of Data Products”, O’Reilly Media’s Mike Loukides notes: “the question of how we take the next step — where data recedes into the background — is surprisingly tough. Do we want products that deliver data? Or do we want products that deliver results based on data? We’re evolving toward the latter, though we’re not there yet.” In this talk, Jeremy Howard will show why taking this step is tough, and will lay out what needs to be done to deliver results based on data. He will particularly draw on his experience in building Optimal Decisions Group, where he developed a new approach to insurance pricing which focused on delivering results (i.e., determining the optimal price for a customer) instead of delivering data (i.e., calculating a customer’s risk, which had previously been the standard approach used by actuaries).
Delivering results based on data requires 3 steps:
1) Creating predictive models for each component of the system
2) Combining these predictive models into a simulation
3) Using an optimization algorithm to optimize the inputs to the simulation based on the desired outcomes and the system constraints
Unfortunately, many data scientists today are not sufficiently familiar with steps 2) and 3) of this process. Although many data scientists have been developing skills in predictive modelling, simulation and optimization skills are still rare. Jeremy will show how these 3 steps fit together, give examples of their use in real world situations, and will introduce some of the key algorithms and methods that can be used.
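A toy end-to-end example of the three steps, using hypothetical insurance-pricing functions (the demand model, numbers, and names here are invented purely for illustration):

```python
def predict_purchase_prob(price, risk_cost):
    """Step 1: a toy predictive model -- demand falls as the
    quoted price rises above the expected claims cost."""
    margin = price - risk_cost
    return max(0.0, min(1.0, 0.9 - 0.015 * margin))

def simulate_profit(price, risk_cost, n_customers=1000):
    """Step 2: combine the models into a simulation of expected profit."""
    p = predict_purchase_prob(price, risk_cost)
    return n_customers * p * (price - risk_cost)

def optimize_price(risk_cost, candidates):
    """Step 3: search the simulation's inputs for the best outcome."""
    return max(candidates, key=lambda price: simulate_profit(price, risk_cost))

risk_cost = 100.0
best = optimize_price(risk_cost, candidates=range(100, 160))
```

In a real system each step is far richer (fitted models, stochastic simulation, constrained optimizers rather than grid search), but the shape of the pipeline is the same: predict, simulate, optimize.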
Once social media and web companies discovered Hadoop as a good-enough solution for any data analytics problem that did not fit into MySQL, Hadoop began a rapid rise in the financial industry. The reasons the financial industry is adopting Hadoop so quickly are very different from those in other industries. Banks are typically not engineering-driven organizations, and practices like agile development, shared root keys, or crontab scheduling are no-gos in a bank but standard around Hadoop.
This entertaining talk for bankers and other financial services managers with technical experience or engineers discusses four business intelligence platform deployments on Hadoop:
1. Long-term storage and analytics of transactions, and the huge cost savings Hadoop can provide;
2. Identifying cross-sell and up-sell opportunities by analyzing web log files in combination with customer profiles;
3. Value-at-risk analytics; and
4. Understanding the SLA issues and identifying problems in a big, thousands-of-nodes service-oriented architecture.
This session discusses the different use cases and the challenges to overcome in building and using BI on Hadoop.
Big Data provides big banks with the means to monetize the transaction data stream in ways that are both pro-consumer and pro-merchant. By utilizing data-driven personalization services, financial institutions can offer a better customer experience and boost customer loyalty. For example, integrating rewards and analysis within a consumer’s online banking statement can save a consumer on average $1,000 per year just by comparing plans, pricing, and usage habits within wireless, cable, and gas categories. Financial institutions benefit by increasing their relationship value with customers. Merchants benefit from increased analytics and are able to reward loyal customers with deals that matter most based upon their purchasing habits.
These data-driven services increase a bank’s relationship value with customers. 94% of consumers indicate that they’d use a specific card that was tied to money-saving discounts over one that was not, and 3 in 4 admitted that they’d switch banks if their bank did not offer loyalty rewards.
Big data is not just big stakes for loyalty: it can be used to drive customer acquisition and increase market share (or credit card ‘share of wallet’), which drives other banking revenue streams.
Furthermore, data-driven offerings help promote the conversion of non-online customers to online banking and billpay, a cost reduction potential of $167 per account per year, or $8.3 billion annually, according to Javelin.
Pretty Simple Data Privacy isn’t a company or a project. Rather, it is the idea that we’ve made personal data privacy too complicated and granular. Instead of getting deeper and deeper into algorithmic approaches, we should be providing users a very simple set of choices about their data and an easy interface to mark their data as usable, off limits, or negotiable.
Most privacy choices come down to Yes, No, and Maybe. Many users are willing to let their personal data be used in a research context, or if they get something back in return for their data. Many want the right to say no if they don’t like or understand the terms. And many are willing to negotiate for the vast majority of questions that fall in between. PSDP ties together privacy and policy issues explored in existing projects to standardize informed consent, create iconic representations of privacy policies, and move towards a world where users manage the way their vendors use their data.
The session specifically builds on the experience of the Consent to Research project in personalized genomics, quantified self, and other personally identifiable data projects.
by Ryan Ismert
Our presentation will cover the nascent fusion of automatically-collected live Digital Records of sports Events (DREs) with Augmented Reality (AR), primarily for television broadcast.
AR has long been used in broadcast sports to show elements of the event that are otherwise difficult to see – the canonical examples are the virtual yellow “1st and 10” line for American football and ESPN’s KZone™ strike-zone graphics. Similarly, sports leagues and teams have historically collected large amounts of data on events, often expending huge amounts of manual effort to do so. Our talk will discuss the evolution of data-driven AR graphics and the systems that make them possible. We’ll focus on systems for automating the collection of huge amounts of event data/metadata, such as the race-car tracking technology used by NASCAR and MLB’s PitchFX™ ball-tracking system. We provide a rubric for thinking about classes of sports event data that encompasses scoring, event and action semantics metadata, and participant motion.
We’ll briefly discuss the history of these sports data collection technologies, and then take a deeper look at how the current first generation of automated systems are being leveraged for increasingly sophisticated analyses and visualizations, often via AR, but also through virtual worlds renderings from viewpoints unavailable or impossible from broadcast cameras. The remainder of the talk will examine two case studies highlighting the interplay between rich, live sports data and augmented reality visualization.
The first case study will describe one of the first of the next-gen digital records systems to come online and track players – Sportvision’s FieldFX™ system for baseball. Although exceedingly difficult to collect, robust player motion data promises to revolutionize areas such as coaching and scouting performance analysis, fantasy sports and wagering, broadcast TV graphics and commentary, and sports medicine. We’ll show examples of some potential applications, and also cover data quality challenges in some detail, in order to examine the impact these challenges have on the applications using the data.
The second case study will examine the rise of automated DRE collection as an answer to that nagging question about AR – ‘what sort of things do people want to see that way?’ Many of the latest wave of AR startups are betting huge amounts of venture money that the answer lies in user-generated or crowd-sourced content. While this may end up being true for some consumer-focused mobile applications, our experience in the notoriously tight-fisted rights and monetization environment of sports has led directly to the requirement to create owned, curated data sources. This came about from four realizations that we think are more generally applicable to AR businesses…
Cool looking isn’t a business, even in sports.
It must be best shown in context, over video, or it won’t be shown at all.
The ability to technically execute AR is no longer a barrier to entry. Cutting edge visualization will only seem amazing for the next six seconds.
We established impossibly high quality expectations, and now the whole industry has to live with them.
by David Miller
Have the demographics of inventors changed over the last 40 years, as the Midwest’s industrial center became the Rust Belt while Silicon Valley grew to dominate the high-tech industry? Has the number of inventors declined in Michigan and grown in California? Which states had the most then versus now? These simple questions are used to illustrate the simplicity of processing big data using the HPCC Systems open-source platform. The focus of this session will be on the Enterprise Control Language (ECL) and its use for ETL, data analytics, and query processing.
by Nathan Marz
Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it’s fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language. Twitter relies upon Storm for much of its analytics.
After being open-sourced, Storm instantly attracted a large community. It is by far the most watched JVM project on GitHub and the mailing list is active with over 300 users.
Storm has a wide range of use cases, from stream processing to continuous computation to distributed RPC. In this talk I’ll introduce Storm and show how easy it is to use for realtime computation.
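Conceptually, a Storm topology chains spouts (tuple sources) to bolts (transformations and aggregations). The sketch below mimics that flow with plain Python generators purely for illustration; it is not Storm's actual API, which wires spouts and bolts into a topology that runs distributed across a cluster:

```python
from collections import Counter

# A spout emits a stream of tuples; bolts transform or aggregate them.
# Here the "topology" is just function composition on a single machine.

def sentence_spout():
    """Spout: emit raw sentences into the stream."""
    for sentence in ["the cow jumped over the moon",
                     "the man went to the store"]:
        yield sentence

def split_bolt(stream):
    """Bolt: split each sentence into individual word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: aggregate a running count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm, each spout and bolt runs as many parallel tasks across machines, and the framework handles partitioning the stream between them and replaying tuples on failure, which is where the "every message will be processed" guarantee comes from.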
by Sean Byrnes
Flurry’s analytics and advertising platform for mobile tracks data on over 330 million devices per month. We operate a 500 node Hadoop and HBase cluster to mine and manage all this data. This talk will go over some of the lessons learned, architecture choices, and advantages of running this big data platform. Some of the covered topics include:
by Alexander Gray
Machine learning holds the key for massive waves that are already starting to fundamentally change business, from targeted advertising, to personalization, to real-time data-driven business processes. But is ML really possible on big data with state-of-the-art methods (which yield the highest predictive accuracies), or just simple ones (such as linear models)? Can ML really be done in real time today? Is MapReduce really the best technical solution to large-scale ML? Does it really make sense to send data to the cloud and do ML there? In this talk I will review the current state of machine learning technology both at the research level and the industry-readiness level, and current best solution options.
This session is sponsored by Skytree, Inc
by Gary Lang
Gary Lang, Senior VP Engineering, MarkLogic, will discuss the concept of Big Data Applications and walk through three in-production implementations of Big Data Applications in action. These applications include how LexisNexis built a next-generation search application, how a major financial institution simplified its technology infrastructure for managing complex derivative trades, and how a major movie studio implemented an Enterprise Data Layer for access to all of its content across multiple silos.
This session is sponsored by MarkLogic
by DJ Patil
What does it really take to build a data product? Recall and relevancy are only part of the challenge. In fact, an entirely new approach is required to build consistently great data products. This includes new paradigms of design, web development, engineering, and testing that allow a team to prototype for 1x, build for 10x, and engineer for 100x. I’ll explain Data Jujitsu, an agile approach that supports all these scales.
The challenge of unstructured data is a top priority for organizations looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Lately, text mining and analytics tools have become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were utilized to solve key business challenges.
Most organizations dream of a paperless office, but still generate and receive millions of print documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and categorizes them on the fly, generating meaningful metadata.
In the areas of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers, and other entities is key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with a recent legislative act.
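As a rough illustration of automatic entity identification, a naive version can be written with regular expressions. These patterns are deliberately simplistic and invented for this sketch; production systems use trained models plus validation such as Luhn checks:

```python
import re

# Simplistic patterns for illustration only.
CARD_RE = re.compile(r'\b(?:\d[ -]?){13,16}\b')          # card-like digit runs
IBAN_RE = re.compile(r'\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b')  # IBAN-like account ids

def find_sensitive_entities(text):
    """Return (label, span) pairs that look like card or account
    numbers, so a reviewer or a redaction step can act on them."""
    hits = []
    for label, pattern in [('card', CARD_RE), ('iban', IBAN_RE)]:
        for m in pattern.finditer(text):
            hits.append((label, m.group().strip()))
    return hits

hits = find_sensitive_entities(
    "Payment from card 4111 1111 1111 1111 to DE44500105175407324931.")
```

The point the session makes stands regardless of technique: once detection is exposed behind an API, compliance monitoring can run over document volumes no human team could read.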
In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to utilizing the incredibly valuable information they contain for medical research. Several approaches to assigning unique encrypted identifiers to patient IDs have been reported, but each comes with drawbacks. For many medical studies, consistent uniform ID mapping is not necessary, and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
by JC Herz
This talk uses Mike Loukides’ essay on the evolution of data products (from “overt” to “covert”) as a springboard. The dichotomy between power-user interfaces and “just tell me which app to buy” is rhetorically powerful and conceptually seductive. But it’s more useful to think about data services in the context of an OODA loop: is the system augmenting the user’s ability to observe, orient, decide, or act?
In video surveillance, hundreds of hours of video recordings are culled from multiple cameras. Within this video are hours of recordings that do not change from one minute to the next, one hour to the next, and in some cases, one day to the next. Identifying information in this video that is interesting and can be shared, analyzed and viewed by a larger community is a time-consuming task that often requires human intervention assisted by digital processing tools.
Using MapReduce, we can harness parallel processing and clusters of graphics processors to identify and tag useful periods of time for faster analysis. The result is an aggregate video file containing metadata tags that link back to the start of those scenes in the original file. In essence, this creates an index into hundreds of thousands of hours of recording that can be reviewed, shared and analyzed by a much larger group of individuals.
This session will review examples where this is being done in the real world and discuss the process for developing a Hadoop job that breaks a video down into scenes, analyzes them in map tasks to determine interest, and then reduces them into a single index file that contains 30 seconds of recording around each scene. Moreover, the file will contain the necessary metadata to jump back to the start point in the original and allow the viewer to see the scene in the context of the entire recording.
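The map/analyze/reduce flow described above might be sketched like this, with a hypothetical per-frame motion score standing in for the real interest analysis and the field names invented for illustration:

```python
def map_frames(frames, threshold=0.2):
    """Map step: emit (video_id, timestamp) for frames whose motion
    score exceeds an interest threshold (a stand-in metric)."""
    for video_id, timestamp, motion in frames:
        if motion > threshold:
            yield video_id, timestamp

def reduce_scenes(tagged, clip_seconds=30):
    """Reduce step: collapse nearby interesting timestamps into scene
    entries of ~30 seconds, each pointing back into the original file."""
    index = {}
    for video_id, t in sorted(tagged):
        scenes = index.setdefault(video_id, [])
        if scenes and t - scenes[-1]['start'] < clip_seconds:
            continue  # already covered by the previous clip
        scenes.append({'start': t, 'end': t + clip_seconds})
    return index

# Toy input: (camera id, timestamp in seconds, motion score).
frames = [('cam1', 5, 0.05), ('cam1', 12, 0.90),
          ('cam1', 20, 0.85), ('cam1', 300, 0.70)]
index = reduce_scenes(map_frames(frames))
```

In the real pipeline the map tasks would decode video and run GPU-accelerated scene analysis across a Hadoop cluster, but the output has the same shape: a compact index of start points back into the full recording.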
28th February to 1st March 2012