Data isn’t just for supporting decisions and creating actionable interfaces. Data can create nuance, giving new understandings that lead to further questioning rather than just actionable decisions. In particular, curiosity and creative thinking can be driven by combining different data sets and techniques to develop a narrative that tells the story of a place: the emotions, history, and change embedded in the experience of that place.
In this session, we’ll see how far we can go in exploring one street in San Francisco, Haight Street, and see how well we can understand its geography, ebbs and flows, and behavior by combining as many data sources as possible. We’ll integrate basic public data from the city, street and mapping data from OpenStreetMap, real estate and rental listings data, and data from social services like Foursquare, Yelp and Instagram, and we’ll analyze photographs of streets from mapping services to create a holistic view of one street and see what we can understand from this. We’ll show how you can summarize this data numerically, textually, and visually, using a number of simple techniques.
We’ll cover how traditional data analysis tools like R and NumPy can be combined with tools more often associated with robotics like OpenCV (computer-vision) to create a more complete data set. We’ll also cover how traditional data visualization techniques can be combined with mapping and augmented reality to present a more complete picture of any place, including Haight Street.
by Larry Murdock
In 2007 Leapfrog embarked on the Learning Path project to enable their learning toys to upload play logs to Leapfrog as an aid to parents in understanding what and how their children learn from their toys. For Leapfrog this would bolster their position as the educational toy leader and innovator, create opportunities to understand customers better and provide valuable information about the use of products for product lifecycle planning.
This talk will present the strategy and business opportunities as they were planned, then discuss the challenges of implementation. In 2007 we looked at MapReduce as a solution to potentially large data volumes but settled on Oracle RAC for reporting flexibility and product maturity. We faced demand estimation issues, SLA challenges, metadata and data management issues, data quality issues and then the killer…our data collection from our users was not passive.
by John Mulholland
Business and operational management of data content has become a top priority for the most vital American financial institutions. From big banks to mortgage lenders, there is an effort currently sweeping the industry to overhaul and – just as importantly – align data standards in a way that is efficient and understandable nationwide. For the first time, the financial industry is being transparent regarding data quality, both internally and externally. Firms are developing a 360-degree view of issues such as customer data, loan life cycle and security data, allowing business partners to develop stronger products, maintain better customer relationships and have a deeper knowledge of overall risk position. Reduction in overall data interfaces and databases as well as the complexity of the overall environment opens the door for tens of millions in cost savings for financial firms embracing data management as an agent of change. Simply put, improved data management standards and techniques are mitigating risk for financial firms and increasing stability for an industry that is obviously critical to our nation’s economic structure.
by Sam Shah
Collaborative filtering is a method of making predictions about a user’s interests based on the preferences of many other users. It’s used to make recommendations on many Internet sites, including LinkedIn. For instance, there’s a “Viewers of this profile also viewed” module on a user’s profile that shows other covisited pages. This “wisdom of the crowd” recommendation platform, built atop Hadoop, exists across many entities on LinkedIn, including jobs, companies, etc., and is a significant driver of engagement.
During this talk, I will build a complete, scalable item-to-item collaborative filtering MapReduce flow in front of the audience. We’ll then get into some performance optimizations, model improvements, and practical considerations: a few simple tweaks can result in an order of magnitude performance improvement and a substantial increase in clickthroughs from the naive approach. This simple covisitation method gets us more than 80% of the way to the more sophisticated algorithms we have tried.
This is a practical talk that is accessible to all.
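As a rough illustration of the covisitation idea described in this abstract, the sketch below counts how often two items are viewed by the same user and ranks neighbours by that count. It runs in memory rather than on Hadoop, and the function names (`covisit_counts`, `also_viewed`) and the sample view logs are assumptions made for this example, not the speaker’s implementation.

```python
from collections import defaultdict
from itertools import combinations


def covisit_counts(sessions):
    """Count how often each pair of items is viewed by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in sessions.values():
        # "Map" step: emit every unordered pair of distinct items per user.
        for a, b in combinations(sorted(set(items)), 2):
            # "Reduce" step: sum the pair counts, stored in both directions.
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_viewed(counts, item, k=3):
    """Return up to k items most often co-viewed with `item`."""
    neighbours = counts.get(item, {})
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]


# Hypothetical view logs: user -> profiles viewed.
views = {
    "u1": ["alice", "bob", "carol"],
    "u2": ["alice", "bob"],
    "u3": ["bob", "carol"],
    "u4": ["alice", "bob", "dave"],
}
counts = covisit_counts(views)
print(also_viewed(counts, "alice"))
```

In a real MapReduce flow the pair emission and the summation would be separate mapper and reducer stages; raw counts also favour popular items, which is one reason the abstract mentions model improvements beyond the naive approach.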
One of the most complex tasks in a data processing environment is record linkage, the data integration process of accurately matching or clustering records or documents from multiple data sources containing information which refer to the same entity such as a person or business.
The massive amount of data being collected at many organizations has led to what is now being called the “Big Data” problem, which limits the capability of organizations to process and use their data effectively and makes the record linkage process even more challenging.
New high-performance data-intensive computing architectures supporting scalable parallel processing such as Hadoop MapReduce and HPCC allow government, commercial organizations, and research environments to process massive amounts of data and solve complex data processing problems including record linkage.
A fundamental challenge of data-intensive computing is developing new algorithms which can scale to search and process big data. SALT (Scalable Automated Linking Technology) is a new tool which automatically generates code in the ECL language for the open source HPCC scalable data-intensive computing platform based on a simple specification to address most common data integration tasks including data profiling, data cleansing, data ingest, and record linkage.
SALT is an ECL code generator for use with the open source HPCC platform for data-intensive computing. The input to the SALT tool is a small, user-defined specification stored as a text file which includes declarative statements describing the user input data and process parameters. The output is ECL code, which is then compiled into optimized C++ for execution on the HPCC platform.
The SALT tool can be used to generate complete, ready-to-execute applications for data profiling, data hygiene (also called data cleansing, the process of cleaning data), data source consistency monitoring (checking consistency of data value distributions among multiple sources of input), data file delta changes, data ingest, and record linking and clustering.
SALT record linking and clustering capabilities include internal linking – the batch process of linking records from multiple sources which refer to the same entity to a unique entity identifier; and external linking – also called entity resolution, the batch process of linking information from an external file to a previously linked base or authority file in order to assign entity identifiers to the external data, or an online process where information entered about an entity is resolved to a specific entity identifier, or an online process for searching for records in an authority file which best match entered information about an entity.
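To make the notion of internal linking concrete, here is a minimal Python sketch that assigns a shared entity identifier to records that appear to describe the same person. It is not SALT or ECL: the blocking key (`zip`), the 0.85 threshold, the use of `difflib.SequenceMatcher`, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(records, threshold=0.85):
    """Assign an entity id to each record via blocking and pairwise matching."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Blocking: only compare records that share a postcode.
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(rec["zip"], []).append(idx)

    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                    parent[find(j)] = find(i)  # merge the two clusters

    return [find(i) for i in range(len(records))]


# Hypothetical records drawn from several sources.
records = [
    {"name": "Jonathan Smith", "zip": "94117"},
    {"name": "jonathan  smith", "zip": "94117"},
    {"name": "Jonathon Smith", "zip": "94117"},
    {"name": "Mary Jones", "zip": "94117"},
    {"name": "Jonathan Smith", "zip": "10001"},
]
entity_ids = link_records(records)
print(entity_ids)
```

Blocking keeps the number of pairwise comparisons manageable, at the cost of missing matches that fall in different blocks (the last record here is never compared with the first three).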
SALT Use Case – LexisNexis Risk Solutions Insurance Services used SALT to develop a new insurance header file and insurance ID to combine all the available LexisNexis person data with insurance data. Process combines 1.5 billion insurance records and 9 billion person records. 290 million core clusters are produced by the linking process. Reduced source lines of code from 20,000+ to a 48 line SALT specification. Reduced linking time from 9 days to 55 hours. Precision of 99.9907 was achieved.
Summary and Conclusions – Using SALT in combination with the HPCC high-performance data-intensive computing platform can help organizations solve the complex data integration and processing issues resulting from the Big Data problem, helping organizations improve data quality, increase productivity, and enhance data analysis capabilities, timeliness, and effectiveness.
These days users won’t tolerate slow applications. More often than not, the database is the bottleneck in the application. To solve this many people add a caching tier like memcache on top of their database. This has been extremely successful but also creates some difficult challenges for developers such as mapping SQL data to key-value pairs, consistency problems and transactional integrity. When you reach a certain size you may also need to shard your database, leading to even more complexity.
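The caching-tier pattern and its consistency problem can be sketched in a few lines. In this example a plain dict stands in for a memcached client and an in-memory SQLite database stands in for the application database; both stand-ins, the `user:<id>` key convention, and the function names are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")

cache = {}  # stand-in for a memcached client; a plain dict is an assumption


def cache_key(user_id):
    # Mapping a SQL row to a key-value pair requires a key convention.
    return f"user:{user_id}"


def get_user_name(user_id):
    """Read through the cache, falling back to the database on a miss."""
    key = cache_key(user_id)
    if key in cache:
        return cache[key]
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    name = row[0] if row else None
    cache[key] = name
    return name


def rename_user(user_id, name, invalidate=True):
    """Write to the database; skipping invalidation leaves a stale entry."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    if invalidate:
        cache.pop(cache_key(user_id), None)


first = get_user_name(1)                    # miss, loads from the database
rename_user(1, "Grace", invalidate=False)   # write that forgets the cache
stale = get_user_name(1)                    # served from the cache, out of date
rename_user(1, "Grace")                     # same write, with invalidation
fresh = get_user_name(1)                    # miss again, reloads current value
print(first, stale, fresh)
```

The stale read in the middle is the kind of consistency problem the paragraph above refers to: every write path has to remember to invalidate or update the cache.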
VMware vFabric SQLFire gives you the speed and scale you need in a substantially simpler way. SQLFire is a memory-optimized and horizontally-scalable distributed SQL database. Because SQLFire is memory oriented you get the speed and low latency that users demand, while using a real SQL interface. SQLFire is horizontally scalable, so if you need more capacity you just add more nodes and data is automatically rebalanced. Instead of sharding, SQLFire automatically partitions data across nodes in the distributed database. SQLFire even supports replication across datacenters, so users anywhere on the globe can enjoy the same fast experience.
Stop by to learn more about how SQLFire gives high performance without all the complexity.
This session is sponsored by VMware
by Jim Tommaney and Fernanda Foertter
Demands for real-time analytics to derive information, patterns, and revenue from Big Data have left legacy DBMS technologies in the dust. Two foundational technologies have proven to be critical to handling today’s data scale problems – 1) the tremendous parallelism delivered by today’s multi-core/distributed servers, and 2) column storage to solve the I/O bottleneck when analyzing large data sets.
In this session, Jim Tommaney will provide an overview of column store databases, their benefits, and where companies are implementing them in their organizations. He will also discuss specifics of the InfiniDB Map Reduce style distribution framework and how it helps provide linear scalability for SQL operations. Together, these have tremendous synergies to provide companies a new level of performance to attack big data analytics in a simple and scalable manner.
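The I/O argument for column storage can be illustrated with a toy comparison. The sketch below is not InfiniDB; it lays out the same hypothetical table row-wise and column-wise and estimates how many bytes an aggregate over a single column has to scan in each layout. The table shape and the byte accounting are simplifying assumptions.

```python
from array import array

# Hypothetical table of 1,000 orders with three numeric columns.
n = 1000
rows = [(i, i % 50, float(i) * 1.5) for i in range(n)]  # row-oriented layout

# Column-oriented layout: one contiguous typed array per column.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "store_id": array("q", (r[1] for r in rows)),
    "amount": array("d", (r[2] for r in rows)),
}

# SELECT SUM(amount): a row store walks every full row,
# while a column store scans only the one column it needs.
row_total = sum(r[2] for r in rows)
col_total = sum(columns["amount"])

# Rough byte accounting, assuming fixed-width values and no compression.
bytes_per_row = sum(col.itemsize for col in columns.values())
row_bytes_scanned = n * bytes_per_row
col_bytes_scanned = n * columns["amount"].itemsize
print(row_total == col_total, row_bytes_scanned, col_bytes_scanned)
```

With three equally wide columns the columnar scan touches a third of the bytes; the gap widens as tables gain columns that a given query does not read.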
This session is sponsored by Calpont Corporation
by Swaminathan Sivasubramanian
The reliability and scalability of your application depend on how its application state is managed. Running applications at massive scale requires operating datastores that scale seamlessly across thousands of servers and can deal with various failure modes such as server failures, datacenter failures and network partitions. The goal of Amazon DynamoDB is to eliminate this complexity and operational overhead for our customers by offering a seamlessly scalable database service. In this talk, we will talk about how developers can build applications on DynamoDB without having to deal with the complexity of operating a large scale database.
This session is sponsored by Amazon
by Jen Zeralli and Jeff Sternberg
Some examples of ideas we think the Strata crowd may be interested in include:
Creating user-centric, data-driven products including a recommendation engine and Facebook/LinkedIn-style “newsfeed” product in a highly regulated industry where our clients (primarily Investment Banks, Private Equity Firms and Asset Managers) fiercely guard their privacy due to the secretive nature of their businesses. Chinese Walls, Insider Trading and concerns over “Private and Material” data require us to take a careful and somewhat modified approach when compared to traditional consumer applications of these types of algorithms. This project was also the first time we got our feet wet in collective intelligence as well as Hadoop.
Entity management which is critical to our data accuracy and quality. This can be quite a beast in a world where companies are constantly forming, merging and going out of business. We employ a variety of methods to maintain the accuracy of our data from algorithmic checks to manual review and user-facing linking/workflow applications.
Document (SEC filings, transcripts, etc.) parsing and processing. Timeliness and accuracy of this data are critical to our business. Our implementation of Solr has significantly improved our process and turnaround time.
Ingesting proprietary client data including their portfolios and running advanced analytics (attribution, risk analytics, etc) on this data.
The vast permutations of data available in our company, person, key development, and transaction screening engine which is another tool where speed is vital for our clients.
Operating as a data arm of a larger enterprise that moves the market (for example by downgrading the United States’ credit rating this year).
by Bitsy Hansen
I am frequently asked for advice about using data visualization to solve communication problems that are better served through improved information architecture. A nicely formatted bar chart won’t rescue you from a poorly planned user interface. When designing meaningful data experiences it’s essential to understand the problems your users are trying to solve.
In this case, I was asked to take a look at a global data-delivery platform with a number of issues. How do we appeal to a broad cross-section of business users? How do we surface information to our clients in a useful way? How do we facilitate action, beyond information sharing? How do we measure success?
A user-centered approach allowed us to weave together a more meaningful experience for our business users, and usability testing revealed helpful insights about how information sharing and data analysis flow within large organizations.
Data visualization is a powerful tool for revealing simple answers to complex questions, but context is key. User-centered design methods ensure that your audience receives the information they need in a usable and actionable way. Data visualization and user experience practices are not mutually exclusive. They work best when they work together.
Where are all the coffee shops in my neighborhood?
Seemingly easy questions can become complex when you consider ambiguity. This one sounds simple until you consider that folks may define “coffee shop” differently and the boundaries of your “neighborhood” differently. One person’s Central Austin may be someone else’s South Dallas.
How about instead of working too hard to define the parameters in an attempt to completely remove the ambiguity, we instead look at what people do, interact with and talk about. We can watch what people do and decide from there what a coffee shop is and where the boundaries of your neighborhood are. It might not be the “truth”, but it can be darn close.
When we learn to embrace ambiguity, not only can we still find the answers to our questions, but we can also find answers to questions we hadn’t even thought to ask.
by Asad Khan
The second one is how to enable simple experiences directly through an HTML5-based interface. The lightweight Web interface gives developers the same experience as they would get on the server. The web interface provides a zero-installation experience to the developer across all client platforms. This also allowed us to use HTML5 support in the browsers to give some basic data visualization support for quick data analysis and charting.
by Josh Wills
Tools like Pig, Hive, and Cascading ease the burden of writing MapReduce pipelines by defining Tuple-oriented data models and providing support for filtering, joining and aggregating those records. However, there are many data sets that do not naturally fit into the Tuple model, such as images, time series, audio files and seismograms. To process data in these binary formats, developers often go back to writing MapReduce jobs using the low-level Java APIs.
In this session, Cloudera Data Scientist Josh Wills will share insights and “how to” tricks about Crunch, a Java library that aims to make writing, testing and running MapReduce pipelines that run over any type of data easy, efficient and even fun. Crunch’s design is modeled after Google’s FlumeJava library and focuses on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution on the Hadoop cluster.
by Mark Pollack
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies.
A Hadoop-focused data pipeline needs not only to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also to encompass real-time data acquisition and the analysis of reduced data sets extracted into relational/NoSQL databases or dedicated analytical engines.
Using an example of real-time weblog processing, in this session we will demonstrate how the open source Spring Batch and Spring Integration projects can be used to build manageable and robust pipeline solutions around Hadoop.
MapReduce, Hadoop, and other “NoSQL” big data approaches open opportunities for data scientists in every industry to develop new data-driven applications for digital marketing optimization and social network analysis through the power of iterative, big data analysis. But what about the business user or analyst? How can they unlock insights through standard business intelligence (BI) tools or SQL access? The challenge with emerging big data technologies is finding staff with the specialized skill sets of the data scientist to implement and use these solutions. Business leaders and enterprise architects struggle to understand, implement, and integrate these big data technologies with their existing business processes and IT investments and provide value to the business. This session will explore a new class of analytic platforms and technologies such as SQL-MapReduce® which bring the science of data to the art of business. By fusing standard business intelligence and analytics with next-generation data processing techniques such as MapReduce, big data analysis is no longer just in the hands of the few data science or MapReduce specialists in an organization! You’ll learn how business users can easily access, explore, and iterate their analysis of big data to unlock deeper insights. See example applications with digital marketing optimization, fraud detection and prevention, social network and relationship analysis, and more.
This session is sponsored by Teradata Aster
by Kuntal Malia and Kate Zimmerman
ModCloth.com is an online clothing, accessories, and decor retailer with a focus on independent and vintage-inspired fashion. The fashion industry has slowly become more accessible to the end customer beyond its traditional enclave of celebrities, magazine editors and the fashionistas, and companies such as ModCloth are using technology to push the trend further.
ModCloth creates an engaging platform and leverages social channels to cultivate a community driven shopping experience. ModCloth is also letting users curate the types of pieces they’d like to see featured on the site through its program Be the Buyer. If the item receives enough votes, the site will produce and sell the item.
In this session we will cover how ModCloth has leveraged this rich user interaction to cater to their demands.
This session will teach participants how to architect big data systems that leverage virtualization and platform as a service.
We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service. We will show how virtualization can be used to simplify deployment and provisioning of Hadoop, SQL and NoSQL databases. We will describe the workload patterns of Hadoop and the infrastructure design implications. We will discuss the current and future role of PaaS to make it easy to deploy Java, SQL, R, and Python jobs against big-data sets.
by Tim Estes
Data Scientists deal with a complex world of Big Data – increasing volume, velocity and variety of data – demanding an evolution in the solutions for analytics. Analytics today are not just about statistics but about really understanding the meaning of content regardless of the source or the structure. This is even more the case with unstructured data. While unstructured data has been a major issue in the area of Intelligence and National Security, it’s now a mainstream problem, given the overwhelming amount of information that users and businesses must face every day from Social Media and Online content. We can’t just search or count anymore; it is vital to create and make sense of the valuable interconnections of entities and relationships that are key to our daily decisions. Tim will introduce Automated Understanding for Big Data and explain how this new evolution is the fundamental step in the next wave of software. Tim will show the power of this new capability on a large and valuable dataset that has never been deeply understood by software before.
This session is sponsored by Digital Reasoning
by Ian White
The era of big geodata has arrived. Federal transparency initiatives have spawned millions of rows of data, and state and local programs engage developers and wonks with APIs, contests and data galore. Sub-meter imagery ensures unparalleled accuracy and collection efforts mean timely updates. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how and (maybe) for what.
With opportunity comes challenge: the expertise in sourcing, identifying, collecting, normalizing and maintaining geographic data is often overlooked in the mad rush to analyze. Curation, or the human side of extract, transform and load (ETL), has increased in scope, scale and importance as data proliferation translates to a deluge of non-standardized data types that lack sufficient documentation or validation, calling their underlying value into question. Big Data calls for expertise in curating: acquiring, validating and arranging data in collections that are relevant to the right audience at the right time.
The CEO of Urban Mapping, Ian White, will demonstrate why your maps are only as good as your data, discuss the issues around big data curating, and illustrate how data acquisition can be addressed from the get-go of any geospatial intelligence project or program planning.
by Rohit Valia
The Hadoop framework is an established solution for big data management and analysis. In practice, Hadoop applications vary significantly. Your data center infrastructure is used by multiple lines of business and multiple differing workloads.
This session looks at the requirements for a multi-tenant big data cluster: one where different lines of business, different projects, and multiple applications can be run with assured SLAs, resulting in higher utilization and ROI for these clusters.
This session is sponsored by Platform Computing
Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming you will learn how to do analytics and ETL on large datasets with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby Programmers can rejoice as well in a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.
Netflix is known for pushing the envelope of recommendation technologies. In particular, the Netflix Prize put a focus on using explicit user feedback to predict ratings. This kind of recommendation showed its value in the time when Netflix’s business was primarily mailing DVDs. Nowadays Netflix has moved into the streaming world and this has spurred numerous changes in the way people use the service. The service is now available on dozens of devices and in more than 40 countries.
Instead of spending time deciding what to add to a DVD queue to watch later, people now access the service and watch whatever appeals to them at that moment. Also, Netflix now has richer contextual information such as the time and day when people are watching content, or the device they are using.
In this talk I will describe some of the ways we use implicit and contextualized information to create a personalized experience for Netflix users.
Many options exist when choosing a framework to build a custom data explorer on top of your company’s stack. With a brief nod to out-of-the-box business intelligence solutions, the presenters will offer an overview of the creative coding frameworks that lend themselves to data visualization on and across web browsers and native apps written for Mac OS X, iOS, Windows, and Android. The strengths and weaknesses of libraries such as Processing, OpenFrameworks, Cinder, Polycode, Nodebox, d3.js, PhiloGL, Raphael.js, Protovis, and WebGL will be explored through visual examples and code. The audience should come away with a sense of what investments into education will return a high value product that serves unique design goals.
by Ana Martinez and Kin Lane
Join us for an in-depth architectural review of the latest infrastructure built by Citygrid to process and serve the local places data available via Citygrid APIs.
We will present how Hadoop is used to process large amounts of inbound data from disparate sources and to solve the complex problem of matching for places.
We will also discuss how Hadoop is used to generate the Solr and MongoDB indexes used for serving.
We will describe the function of the places, content and ad APIs and SDKs, and the characteristics of their underlying data, in the context of real world use cases.
We will focus on some of the limitations of Lucene and Solr for geographic search, and discuss some of the most recent developments we are exploring for our next generation APIs.
Finally, we will give a preview of Citygrid’s next generation real time event processing system, inspired by Twitter’s Rainbird and built on top of Cassandra.
Using Hadoop based business intelligence analytics, this session looks at the Hadoop source code and its development over time and illustrates some interesting and fun facts we will share with the audience. This talk will illustrate text and related analytics with Hadoop on Hadoop to reveal the true hidden secrets of the elephant.
This entertaining session highlights the value of data correlation across multiple datasets and the visualization of those correlations to reveal hidden data relationships.
This is the story about how a developer and two marketing scientists used design thinking, LEAN methods, and marketing science to generate a meaningful data driven experience for marketers.
Obstacles include unstable data sources, changing requirements, fluid markets, the Gartner Hype Cycle, changing definitions, changing terminologies, seemingly constant feature creep, and dissent.
Solutions include continuous feedback, iterating violently, rapid design labs, aggressive inquiry, design thinking, seemingly constant feature combat and beer.
Attendees can expect to learn how not to replicate the mistakes we made, and to make up their own minds if the solution space we discovered is suitable to them.
by Joris Poort
Big data science and cloud computing are changing how engineering-driven companies develop highly complex products. Using a novel cloud platform based on Hadoop, big data analytics, and applied mathematics tools, the traditional product development cycle can be drastically sped up and used to provide new, unique insights into highly complex products, improving their final designs. Data science on the cloud can serve as a platform for collaboration between disciplinary silos within engineering organizations, providing new opportunities for applications of advanced machine learning and optimization tools. These tools are demonstrating drastic improvements in aerospace, automotive, and other high-tech industries.
An airplane wing case study will be shown to illustrate the ideas and methods presented. The case study will show how complex engineering disciplines such as aerodynamics and structural analysis can be simultaneously run on the cloud and coupled to not only increase the speed of product development but also used to develop better final product designs. Several tools described in the case study will be shown through a live demonstration.
by Leigh Dodds
There are many different approaches to putting data on the web, ranging from bulk downloads through to rich APIs. These styles suit a range of different data processing and integration patterns. But the history of the web has shown that value and network effects follow from making things addressable.
Facebook’s Open Graph, Schema.org, and a recent scramble towards a “Rosetta Stone” for geodata, are all examples of a trend towards linking data across the web. Weaving data into the web simplifies data integration. Big Data offers ways to mine huge datasets for insight. Linked Data creates massively inter-connected datasets that can be mined or drawn upon to enrich queries and analysis.
This talk will look at the concept of Linked Data and how a rapidly growing number of inter-connected databases, from a diverse range of sources, can be used to contextualise Big Data.
Since the early days of the data deluge, Lift Lab has been helping many actors of the ‘smart city’ in transforming the accumulation of network data (e.g. cellular network activity, aggregated credit card transactions, real-time traffic information, user-generated content) into products or services. Because they are innovative and cross-disciplinary by nature, our projects generally involve a wide variety of professionals, from physicists and engineers to lawyers, decision makers and strategists.
Our innovation methods bring these different stakeholders on board with rapidly prototyped tools that promote the processing, recompilation, interpretation, and reinterpretation of insights. For instance, our experience shows that the multiple perspectives extracted from the use of exploratory data visualizations are crucial to quickly answer some basic questions and provoke many better ones. Moreover, the ability to quickly sketch an interactive system or dashboard is a way to develop a common language amongst varied and different stakeholders. It allows them to focus on tangible opportunities for products or services that are hidden within their data. In this form of rapid visual business intelligence, an analysis and its visualization are not the results, but rather the supporting elements of a co-creation process to extract value from data.
We will exemplify our methods with tools that help engage a wide spectrum of professionals in the innovation path in data science. These tools are based on a flexible data platform and visual programming environment that make it possible to go beyond the limited design possibilities of industry standards. Additionally, they reduce the prototyping time necessary to sketch interactive visualizations that allow the different stakeholders of an organization to take an active part in the design of services or products.
by Jon Gosier
Big data isn’t just an abstract problem for corporations, financial firms, and tech companies. To your mother, a ‘big data’ problem might simply be too much email, or a lost file on her computer.
We need to democratize access to the tools used for understanding information by taking the hard work out of drawing insight from excessive quantities of information, so that people can process content more efficiently and capture more of their world.
Tools to effectively do this need to be visual, intuitive, and quick. This talk looks at some of the data visualization platforms that are helping to solve big data problems for normal people.
28th February to 1st March 2012