by Mano Marks and Chris Broadfoot
Beautiful, useful, and scalable techniques for analyzing and displaying spatial information are key to unlocking important trends in geospatial and geotemporal data. Recent developments in HTML5 enable rendering of complex visualizations within the browser, facilitating fast, dynamic user interfaces built around web maps. Client-side visualization allows developers to forgo expensive server-side rendering tasks. These new interfaces have enabled a new class of application, empowering any user to explore large, enterprise-scale spatial data without requiring specialized geographic information system software. This session will examine existing enterprise-scale, server-side visualization technologies and demonstrate how cutting-edge technologies can supplement and replace them while enabling additional capabilities.
by Ed Kohlwey
While Map/Reduce is an excellent environment for some parallel computing tasks, there are many ways to use a cluster beyond Map/Reduce. Within the last year, YARN (NextGen Map/Reduce) has been contributed to the Hadoop trunk, Mesos has been released as an open source project, and a variety of new parallel programming environments have emerged, such as Spark, Giraph, Golden Orb, Accumulo, and others.
We will discuss the features of YARN and Mesos, and talk about obvious yet relatively unexplored uses of these cluster schedulers as simple work queues. Examples will be provided in the context of machine learning. Next, we will provide an overview of the Bulk-Synchronous-Parallel model of computation, and compare and contrast the implementations that have emerged over the last year. We will also discuss two other alternative environments: Spark, an in-memory version of Map/Reduce which features a Scala-based interpreter; and Accumulo, a BigTable-style database that implements a novel model for parallel computation and was recently released by the NSA.
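The Bulk-Synchronous-Parallel model mentioned above is easy to see in miniature: computation proceeds in supersteps, each consisting of local computation, message exchange, and a global barrier. The toy below (illustrative only, not the API of Giraph, Golden Orb, or any other framework) uses BSP supersteps to propagate the maximum value through a graph, the classic Pregel-style example.

```python
# Toy Bulk-Synchronous-Parallel (BSP) run: each vertex repeatedly
# exchanges its current value with its neighbours until nothing changes.
# Names and structure here are illustrative, not any framework's API.

def bsp_max(graph, values):
    """graph: {vertex: [neighbours]}, values: {vertex: int}."""
    active = set(graph)
    while active:                      # one loop iteration == one superstep
        inbox = {v: [] for v in graph}
        for v in active:               # "communication" phase
            for n in graph[v]:
                inbox[n].append(values[v])
        active = set()
        for v, msgs in inbox.items():  # "local computation" phase
            best = max(msgs, default=values[v])
            if best > values[v]:
                values[v] = best
                active.add(v)
        # the end of the loop body acts as the global barrier
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(bsp_max(graph, {"a": 3, "b": 1, "c": 7}))  # all vertices converge to 7
```

The same superstep/barrier shape underlies the BSP implementations compared in the talk; they differ mainly in how messages are routed and where vertex state lives.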
by Vineet Tyagi
Enterprises today are well on their way to putting Big Data to work. Many are experimenting with Big Data, if not already in production. The data deluge is forcing everyone to ask a key question: what is the cost of big data analytics? This session will address some of the key concerns in creating a Big Data solution that will provide a lower cost “per TB of data managed and analyzed”.
The session will discuss why nobody wants to talk about the costs involved with Hadoop, NoSQL, and other options. It will also show how to reduce costs, choose the right technology options, and address some of the unspoken issues in dealing with Big Data.
This session is sponsored by Impetus Technologies
Mobile devices are ideal data capture and presentation points. They offer boundless opportunities for data collection and the presentation of temporally- and spatially-relevant data. The most compelling mobile applications will require aggregation, analysis and transformation of data from many devices and users. But intermittent network connectivity and constrained processing, storage, bandwidth and battery resources present significant obstacles. Highlighted with real-world applications, this session will cover challenges and approaches to device data collection; device-device and device-cloud data synchronization; and cloud-based data aggregation, analysis and transformation.
One of the most significant challenges faced by individuals and organizations is how to discover and collaborate on data within and across their organizations; such data often stays trapped in application and organizational silos. We believe that internal data marketplaces, or data hubs, will emerge as a solution to this problem, letting data scientists and other professionals work together in a friction-free manner on data inside corporations and between corporations, and unleashing significant value for all.
This session will cover this concept in two dimensions.
Piyush from Microsoft will walk through the concept of internal data markets: an IT-managed solution that allows organizations to efficiently and securely discover, publish, and collaborate on data from various sub-groups within an organization, and from partners and vendors across the extended organization.
Francis, from ScraperWiki, will talk through stories of how people have already used data hubs, and stories that give signs of what is to come. For example: how Australian activists use collaborative web scraping to gather a national picture of planning applications, and how Nike are releasing open corporate data to create disruptive innovation. There will be a section where audience members can briefly describe how they use the Internet to collaborate on working with data, and the session ends with a challenge to use open data as a weapon.
In a research environment, under the current operating system, most data and figures collected or generated during your work are lost, intentionally tossed aside, classified as “junk”, or at worst trapped in silos or locked behind embargo periods. This stifles and limits scientific research at its core, making it much more difficult to validate experiments, reproduce experiments, or even stumble upon new breakthroughs that may be buried in your null results.
Changing this reality not only takes the right tools and technology to store, sift, and publish data, but also a shift in the way we think of and value data as a scientific contribution in the research process. In the digital age, we’re not bound by the physical limitations of analog media such as the traditional scientific journal or research paper, nor should our data be locked into understandings based on those media.
This session will look at the socio-cultural context of data science in the research environment, specifically at the importance of publishing negative results through tools like FigShare – an open data project that fosters data publication, not only for supplementary information tied to publication, but all of the back end information needed to reproduce and validate the work, as well as the negative results. We’ll hear about the broader cultural shift needed in how we incentivise better practices in the lab and how companies like Digital Science are working to use technology to push those levers to address the social issue. The session will also include a look at the real-world implications in clinical research and medicine from Ben Goldacre, an epidemiologist who has been looking at not only the ethical consequences but issues in efficacy and validation.
Big data isn’t just about multi-terabyte data sets hidden inside eventually-consistent distributed databases in the cloud, or enterprise-scale data warehousing, or even the emerging market in data. It’s also about the hidden data you carry with you all the time, about the slowly growing data sets on your movements, contacts, and social interactions.
Until recently most people’s understanding of what can actually be done with the data collected about us by our own cell phones was theoretical; there were few real-world examples. But over the last couple of years this has changed dramatically.
This talk will discuss the data that you carry with you: the data on your cell phone and other mobile devices, along with the possibilities for making use of that hidden data to reveal things about our lives that we might not realise ourselves. We will explore the types of data that are collected, and the online data sources that could be usefully cross-correlated with them.
by Max Yankelevich
The founder and CEO of CrowdControl Software, Max Yankelevich is going to explore new ways to solve big data problems involving crowdsourcing. He will define crowdsourcing and the common barriers to applying it to Big Data. Everyone knows managing the crowd can be a nightmare given the complexity involved and the quality issues that arise. Many companies focus on the quantity of data when oftentimes it’s the quality that really matters. Through years of research at MIT and in collaboration with Amazon Mechanical Turk, Yankelevich has created an artificial intelligence application that maximizes the quality of crowdsourced work at scale. He will cover specific company use cases for collecting, controlling, correcting, and enriching data. From startups to Fortune 500 companies, this new methodology is transforming data-driven businesses. Its applications range from human sentiment analysis to keeping business listings up to date.
This session is sponsored by CrowdControl Software
As data becomes less costly and technology breaks down barriers to acquisition and analysis, the opportunity to deliver actionable information for civic purposes grows. This might be termed the “common good” challenge for Big Data. But actionable data has always been the challenge for nonprofits and civic organizations. The needs haven’t changed on the civic side; data intermediary organizations live this every day. Our community and civic clients have struggled with obtaining data, whether from the public realm or from their own systems, that will inform their decision-making, help tell their stories to funders, mobilize support, and direct their efforts.
This presentation will draw from experiences, old and new, deployed by common-good data intermediaries in order to spotlight challenges moving ahead. We’ll draw on experiences such as that in Chicago where MCIC is endeavoring to run an Apps Competition that’s centered on community/hacker collaboration. We’ll explore the history of the national neighborhood indicator movement and efforts by voters’ groups to build mapping platforms in order to have a voice in political redistricting plans. We’ll also talk about new efforts such as Data without Borders and their success in bringing coders to the community. By touching on these stories, we’ll highlight the potential for disruptive approaches to break through barriers: resources, communication styles, data availability.
by Siraj Khaliq
One doesn’t normally think about Big Data when the rain falls, but we’ve been measuring and analyzing Big Weather for years. Due to recent advancements in Big Data, cloud computing, and network maturity it’s now possible to work with extremely large weather-related data sets.
The Climate Corporation combines Big Data, climatology, and agronomics to protect the $3 trillion global agriculture industry with automated full-season weather insurance. Every day, The Climate Corporation utilizes 2.5 million daily weather measurements, 150 billion soil observations, and 10 trillion scenario data points to build and price their products. At any given time, more than 50 terabytes of data is stored in their systems, the equivalent of 100,000 full-length movies or 10,000,000 music tracks. All of this provides the intelligence and analysis necessary to reduce the risk to U.S. farmers of adverse weather, which causes more than 90% of crop loss.
The Climate Corporation’s generation system uses thousands of servers to periodically process decades of historical data and generate 10,000 weather scenarios at each location and measurement, going out several years. This results in over 10 trillion scenario data points (e.g. an expected rainfall value at a specific place and time in the future), for use in an insurance premium pricing and risk analysis system amounting to over fifty terabytes of data in our live systems at any given time. Weather-related data is ingested multiple times a day directly from major climate models and incorporated into The Climate Corporation’s system. Under the hood, The Climate Corporation’s Web site is running complex algorithms against a huge dataset in real-time, returning a premium price within seconds. The size of this data set has grown an average of 10x every year as the company adds more granular geographic data. Hear The Climate Corporation CEO David Friedberg discuss how to apply big data principles to the real-world challenge of protecting people and businesses from the financial impact of adverse weather.
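The pricing approach described above, generating many weather scenarios and charging a premium based on the payouts they imply, is classic Monte Carlo. The sketch below illustrates the idea for a single location with entirely invented numbers; it is not The Climate Corporation's actual model.

```python
import random

random.seed(0)

def premium(n_scenarios=10_000, trigger_mm=300, payout_per_mm=10.0, loading=1.2):
    """Toy Monte Carlo pricing of a rainfall-shortfall policy: simulate
    many seasons, pay out when rain falls short of the trigger, and set
    the premium at expected payout times a loading factor. All parameter
    values here are hypothetical."""
    total = 0.0
    for _ in range(n_scenarios):
        season_rain = random.gauss(350, 60)           # one simulated season (mm)
        shortfall = max(0.0, trigger_mm - season_rain)
        total += shortfall * payout_per_mm
    return loading * total / n_scenarios

print(round(premium(), 2))   # expected payout plus loading, in dollars
```

The real system differs in scale rather than in kind: the scenarios come from historical data and climate models rather than a single Gaussian, and there are 10,000 of them at every location and measurement.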
by Robbie Allen
With recent advances in linguistic algorithms, data processing capabilities and the availability of large structured data sets, it is now possible for software to create long form narratives that rival humans in quality and depth. This means content development can take advantage of many of the positive attributes of software, namely, continuous improvement, collaborative development and significant computational processing.
Robbie Allen, the CEO of Automated Insights, and his team have done this to great effect by automatically creating over 100,000 articles covering College Basketball, College Football, NBA, MLB, NFL in a 10 month period. Automated Insights is now branching out beyond sports into finance, real-estate, government, and healthcare.
In this talk, Robbie will share the lessons his company has learned about the viability of automated content and where the technology is headed. It all started with short sentences of uniform content and has expanded to the point where software can generate several paragraphs of unique prose highlighting the important aspects of an event or story.
The advent of crowdsourcing has wildly expanded the ways we think of incorporating human judgments into computational workflows. Computer scientists, economists, and sociologists have explored how to effectively and efficiently distribute microwork tasks to crowds and use their work as inputs to create or improve data products. Simultaneously, crowdsourcing providers are exploring the bounds of mechanical QA flows, worker interfaces, and workforce management systems.
But what tasks should be performed by humans rather than algorithms? And what makes a set of human judgments robust? Quantity? Consensus? Quality or trustworthiness of the workers? Moreover, the robustness of judgments depends not only on the workers, but on the task design. Effective crowdsourcing is a cooperative endeavor.
In this talk, we will analyze various dimensions of microwork that characterize applications, tasks, and crowds. Drawing on our experience at companies that have pioneered the use of microwork (Samasource) and data science (LinkedIn), we will offer practical advice to help you design crowdsourcing workflows to meet your data product needs.
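One simple answer to the quantity-versus-consensus question raised above is redundant labelling with majority vote: give the same task to several workers and keep the label most of them agree on, with the agreement rate as a crude confidence signal. The sketch below uses hypothetical data and is not the actual Samasource or LinkedIn pipeline.

```python
from collections import Counter

def majority_label(judgments):
    """Collapse redundant worker judgments for one item into a single
    label plus an agreement score (a crude proxy for confidence)."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)

# Three workers label the same image; two of them agree.
label, agreement = majority_label(["cat", "cat", "dog"])
print(label, agreement)   # 'cat' wins with 2/3 agreement
```

Production systems typically go further, weighting votes by per-worker accuracy estimated from gold-standard tasks, but the aggregation step has this same shape.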
by Sage Weil
As the size and performance requirements of storage systems have increased, file system designers have looked to new architectures to facilitate system scalability.
Ceph’s architecture comprises object storage, block storage, and a POSIX-compliant file system. It is the most significant storage system to have been accepted into the Linux kernel. Ceph has both kernel and userland implementations. The CRUSH algorithm provides controlled, scalable, decentralized placement of replicated data. In addition, Ceph has a fully leveraged, highly scalable metadata layer. Ceph offers compatibility with S3, Swift, and Google Storage, and is a drop-in replacement for HDFS (and other file systems).
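The core idea behind CRUSH-style placement, computing where an object lives from its name rather than consulting a central lookup table, can be sketched in a few lines. The toy below uses rendezvous hashing, a simpler cousin of CRUSH, purely to illustrate decentralized placement; it is not the real CRUSH algorithm, which also encodes failure domains and device weights.

```python
import hashlib

def place(obj_name, nodes, replicas=3):
    """Rank all nodes by a hash of (object, node) and keep the top
    `replicas`. Any client can recompute the same placement with no
    central directory, which is the decentralization CRUSH provides."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha1(f"{obj_name}:{n}".encode()).hexdigest(),
    )
    return ranked[:replicas]

nodes = [f"osd{i}" for i in range(8)]
print(place("my-object", nodes))   # deterministic: same answer on every client
```

Because placement is a pure function of the object name and the node set, there is no placement-table server to become a single point of failure, matching the "no single point of failure" property described below.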
Ceph is unique because it is massively scalable to the exabyte level. The storage system is self-managing and self-healing, which means limited system administrator involvement. It runs on commodity hardware, has no single point of failure, leverages an intelligent storage node system, and is open source.
This talk will describe the Ceph architecture and then focus on the current status and future of the project. This will include a discussion of Ceph’s integration with Openstack, the file system, RBD clients in the Linux kernel, RBD support for virtual block devices in Qemu/KVM and libvirt, and current engineering challenges.
NetApp is a fast-growing provider of storage technology. Its devices “phone home” regularly, sending unstructured auto-support log and configuration data back to centralized data centers. This data is used to provide timely support, to improve sales, and to plan product improvements. To allow this, data is collected, organized, and analyzed. The system currently ingests 5 TB of compressed data per week, which is growing 40% per year. NetApp was previously storing flat files on disk volumes and keeping summary data in relational databases. Now NetApp is working with Think Big Analytics, deploying Hadoop, HBase, and related technologies to ingest, organize, transform, and present auto-support data. This will enable business users to make decisions and provide timely responses, and will enable automated responses based on predictive models. Key requirements include:
In this session we look at the lessons learned while designing and implementing a system to:
by Paul Brown
Scientists dealt with big data and big analytics for at least a decade before the business world precipitated buzzwords like ‘Big Data’, ‘Data Tsunami’, and ‘the Industrial Revolution of Data’ from the strange broth of their marketing, and came to realize they had the same problems. Both the scientific world and the commercial world share requirements for a high-performance informatics platform supporting the collection, curation, collaboration, exploration, and analysis of massive datasets.
In this talk we will sketch the design of SciDB and explain how it differs from Hadoop-based systems, SQL DBMS products, and NoSQL platforms, and why that matters. We will present benchmarking data and a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.
SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:
• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data
• Massive-scale math – non-embarrassingly parallel operations like linear algebra on matrices too large to fit in memory, as well as transparently scalable R, MatLab, and SAS style analytics without requiring code for data distribution or parallel computation
• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis
• Uncertainty support – data carry error bars, probability distributions, or confidence metrics that can be propagated through calculations
• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
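The first and last bullets above, an array data model with compact storage for sparse data, can be illustrated with a toy chunked store. This is a minimal sketch of the general idea (cells grouped into fixed-size chunks, empty chunks costing nothing), not SciDB's actual storage engine or API.

```python
def chunk_key(coords, chunk=(100, 100)):
    """Map an (x, y) cell to the fixed-size chunk that stores it."""
    return tuple(c // s for c, s in zip(coords, chunk))

class SparseArray:
    """Toy chunked sparse array: only chunks that contain data exist,
    which is what makes sparse instrument or time-series data compact."""
    def __init__(self, chunk=(100, 100)):
        self.chunk = chunk
        self.chunks = {}   # chunk key -> {cell coords: value}

    def set(self, coords, value):
        self.chunks.setdefault(chunk_key(coords, self.chunk), {})[coords] = value

    def get(self, coords, default=None):
        return self.chunks.get(chunk_key(coords, self.chunk), {}).get(coords, default)

a = SparseArray()
a.set((5, 7), 1.25)
print(a.get((5, 7)), len(a.chunks))   # 1.25 1
```

Chunking also explains how array operations parallelize: each chunk is an independent unit of work that can be shipped to a different node of the grid.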
Come learn how the Mendeley team built the largest crowdsourced database of research literature, scaled to handle 120M uploaded documents, and how they’re using technologies such as Hadoop, Apache Mahout and Thrift to generate daily statistics and recommendations on over 7 TB of academic research data. Jan Reichelt, Mendeley co-founder, will talk about the lessons learned in building the service and how this is shaking up the stodgy old field of academic publishing.
In addition to the technical story, Jan will also show how Mendeley’s real-time data on content usage provides never-before-seen insight into how academics collect, read, share, and annotate academic research. Why should you care about academic publishing? It’s a fascinating story… while you’re using Github and Google+ to share information, the best that all the world’s big brains can come up with is swapping PDFs!
Academic publishing is facing many of the same stressors as other kinds of publishing as their content moves online, but since academic publishing has typically derived revenue from institutional purchases as opposed to individual ones and ad sales don’t contribute as much to revenues, the business models have diverged to where academic publishing has had until now very little end-user focus. Academic content is also read more intensively, curated more carefully by end users, and managed with specialized tools, which gives us a unique opportunity to look at content usage at a level of detail not possible in any other industry and distill some insights that are relevant across all of publishing.
A story on the U.S. Census will convey the broad themes behind the data and use people to exemplify those themes. But every reader also wants answers to more specific questions: How did my community change? What happened where I live, in my neighborhood? Being able to provide those answers through an interactive visualization is what storytelling through data is all about. A story or report on a subject by its very nature summarizes the underlying data, but readers may have questions specific to a time, date, or place. Visualizing the data and providing effective, targeted ways to drill deeper is key to giving the reader more than just the story; the visualization can enhance and deepen the experience. Cheryl Phillips will discuss data visualization strategies to do just that, providing examples from The Seattle Times and other journalism organizations.
by Jacomo Corbo
Measuring productivity remains a notoriously difficult problem, nowhere more so perhaps than in innovation. Feedback on the progress of projects and the performance of workers is scant, highly uncertain, and collected either too infrequently or too slowly. Yet such information is indispensable to the efficient allocation of resources to innovation projects. These challenges are all the more acute for companies involved in complex product development, where performance hinges critically on an organization’s capacity to constantly and consistently innovate. At the same time, information captured by enterprises has generally gone from scarce to superabundant, affording them an unprecedented opportunity to monitor information flows, observe worker interactions and organizational structures, and estimate individual and organizational performance.
We will discuss how companies are using data to obtain sharper, more timely insights. Specifically, we will present how real-time information about engineering collaborations is being leveraged to measure, model, and ultimately forecast organizational productivity and project performance with a level of accuracy and timeliness heretofore impossible. Over the past couple of years, QuantumBlack has developed and deployed an analytics tool to help companies in a variety of industries, from aerospace and automotive to software and semiconductor manufacturing, improve the yield of their project investments. The software tracks and analyses real-time communication and collaboration data, as well as data on performance metrics related to tasks and projects under assessment, to forecast organizational productivity, predict the success or failure of projects, identify performance bottlenecks and drivers, and ultimately help optimize resource and work allocation strategies.
The talk will center on case studies involving successful deployments at several Formula One (F1) teams. We will show how we were able to forecast the productivity of innovation teams, improve investment yields by as much as 15%, and raise productivity by nearly 20%. Certainly, this is no free lunch and we will dwell on some of the more important difficulties: the technological and computing challenges associated with machine-learning and real-time analysis of a transient data set that can grow at the rate of several terabytes per day, some of the privacy issues associated with trawling employee communications even if by machine-only readers, and finally some of the cultural and management challenges that we and our clients faced in deploying a capability that forecasts individual and organizational performance. By the same token, there is a great deal that enterprises can do to help build and facilitate the adoption of analytical capabilities within their ranks. After all, and as we will show, the returns certainly warrant the effort.
by Marc Smith
Networks are a data structure commonly found across all social media services that allow populations to author collections of connections. The Social Media Research Foundation’s (http://www.smrfoundation.org) free and open NodeXL project (http://nodexl.codeplex.com) makes analysis of social media networks accessible to most users of the Excel spreadsheet application. With NodeXL, networks become as easy to create as pie charts. Applying the tool to a range of social media networks has already revealed the variations present in online social spaces. A review of the tool and images of Twitter, flickr, YouTube, and email networks will be presented.
We now live in a sea of tweets, posts, blogs, and updates coming from a significant fraction of the people in the connected world. Our personal and professional relationships are now made up as much of texts, emails, phone calls, photos, videos, documents, slides, and game play as by face-to-face interactions. Social media can be a bewildering stream of comments, a daunting fire hose of content. With better tools and a few key concepts from the social sciences, the social media swarm of favorites, comments, tags, likes, ratings, and links can be brought into clearer focus to reveal key people, topics and sub-communities. As more social interactions move through machine-readable data sets new insights and illustrations of human relationships and organizations become possible. But new forms of data require new tools to collect, analyze, and communicate insights.
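Much of the "key people" analysis described above reduces to simple graph metrics over interaction edges. NodeXL computes these inside Excel; the sketch below shows the same idea in plain Python over a made-up reply network (the edge list is hypothetical).

```python
from collections import Counter

# Hypothetical "who replied to whom" edges pulled from a social stream.
edges = [("ann", "bob"), ("cal", "bob"), ("dee", "bob"),
         ("bob", "ann"), ("cal", "ann")]

# In-degree: how many interactions point at each account. High in-degree
# is the simplest signal for "key people" in a conversation network.
in_degree = Counter(target for _, target in edges)
print(in_degree.most_common(2))   # [('bob', 3), ('ann', 2)]
```

Richer measures such as betweenness centrality or community clustering build on this same edge-list representation, which is exactly what NodeXL imports from Twitter, flickr, YouTube, and email.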
by Peter Kuhn
Personalized Cancer Care: How to predict and monitor the response of cancer drugs in individual patients.
1. Biology: Cancer spreads through the body when cancer cells leave the primary site of cancer, traveling through the blood to find a new site where they can settle, colonize, expand, and eventually kill the patient.
2. Challenge: the concentration of the cancer cells is about 1 in 1 million normal white blood cells, or 1 in 2 billion cells if you include the red blood cells. This makes for about a handful of these cells in a tube of blood (assuming that you have given blood before, you can picture this pretty easily). A cell is about 10 microns in diameter.
3. Opportunity: if we can find these cells, we could always just take a tube of blood and characterize the disease in that patient at that point in time to make treatment decisions. We have significant numbers of drugs going through the development pipeline but no good way of deciding which drug to take at which time.
4. Solution: create a large monolayer of 10 million cells, stain the cells, image them, and then find the cancer cells computationally by an iterative process. It is a simple data-driven solution to a very large challenge. It is simple in the world of algorithms, HPC, and cloud, and it is set up to revolutionize cancer care.
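Computationally, step 4 is a rare-event filter over millions of cell measurements. The sketch below illustrates the first pass of such an iterative process on fabricated stain intensities (the distributions, threshold, and counts are invented for illustration, not the actual assay).

```python
import random

random.seed(42)
# Simulate a stain-intensity value for 1,000,000 ordinary cells plus
# 5 rare cancer-like cells with a distinctly higher marker signal.
cells = [random.gauss(10, 2) for _ in range(1_000_000)] + \
        [random.gauss(40, 2) for _ in range(5)]

def find_candidates(intensities, threshold=30.0):
    """First pass of the iterative search: keep every cell whose stain
    signal clears the threshold. Later passes would re-image and
    re-score only these candidates at higher resolution."""
    return [i for i, v in enumerate(intensities) if v >= threshold]

hits = find_candidates(cells)
print(len(hits))   # a handful of candidates out of a million cells
```

The needle-in-a-haystack ratio (a few cells per million) is what makes this a data problem: each pass cheaply discards the overwhelming majority of cells so that expensive analysis is spent only on plausible candidates.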
28th February to 1st March 2012