by Jen Zeralli and Jeff Sternberg
Some examples of ideas we think the Strata crowd may be interested in include:
Creating user-centric, data-driven products, including a recommendation engine and a Facebook/LinkedIn-style “newsfeed,” in a highly regulated industry where our clients (primarily investment banks, private equity firms, and asset managers) fiercely guard their privacy due to the secretive nature of their businesses. Chinese walls, insider trading, and concerns over “private and material” data require us to take a careful and somewhat modified approach compared to traditional consumer applications of these types of algorithms. This project was also the first time we got our feet wet with collective intelligence and Hadoop.
Entity management, which is critical to our data accuracy and quality. This can be quite a beast in a world where companies are constantly forming, merging, and going out of business. We employ a variety of methods to maintain the accuracy of our data, from algorithmic checks to manual review and user-facing linking/workflow applications (see the sketch after this list).
Document parsing and processing (SEC filings, transcripts, etc.). The timeliness and accuracy of this data are critical to our business, and our implementation of SOLR has significantly improved our process and turnaround time.
Ingesting proprietary client data, including portfolios, and running advanced analytics (attribution, risk analytics, etc.) on that data.
The vast permutations of data available in our company, person, key development, and transaction screening engine, another tool where speed is vital for our clients.
Operating as the data arm of a larger enterprise that moves markets (for example, by downgrading the United States’ credit rating this year).
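Entity matching of this kind usually comes down to normalizing names and then scoring candidate pairs for review. The sketch below is a minimal illustration of that idea using only the Python standard library; the suffix list, threshold, and sample names are assumptions for the example, not the authors' actual pipeline.

```python
import difflib
import re

# Common legal suffixes to strip before comparison (an illustrative,
# far-from-exhaustive list).
SUFFIXES = re.compile(r"\b(inc|corp|co|ltd|llc|plc|holdings?)\b\.?", re.I)

def normalize(name: str) -> str:
    """Lowercase, drop punctuation and legal suffixes, collapse whitespace."""
    name = SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())

def candidate_matches(new_name, known_names, threshold=0.85):
    """Yield known entities whose normalized names score close to the new
    one; borderline scores would route to manual review, not auto-linking."""
    target = normalize(new_name)
    for known in known_names:
        score = difflib.SequenceMatcher(None, target, normalize(known)).ratio()
        if score >= threshold:
            yield known, round(score, 3)

known = ["Acme Holdings Inc.", "Acme Industrial Corp", "Beta Capital LLC"]
print(list(candidate_matches("ACME Holdings", known)))
```

In practice an algorithmic pass like this only proposes links; merges, renames, and dissolutions still end up in the manual-review and workflow applications described above.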
by Ian White
The era of big geodata has arrived. Federal transparency initiatives have spawned millions of rows of data, and state and local programs engage developers and wonks with APIs, contests, and data galore. Sub-meter imagery ensures unparalleled accuracy, and ongoing collection efforts mean timely updates. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how, and (maybe) for what.
With opportunity comes challenge: the expertise in sourcing, identifying, collecting, normalizing, and maintaining geographic data is often overlooked in the mad rush to analyze. Curation, the human side of extract, transform, and load (ETL), has increased in scope, scale, and importance as data proliferation translates into a deluge of non-standardized data types that lack sufficient documentation or validation, calling their underlying value into question. Big Data calls for expertise in curation: acquiring, validating, and arranging data in collections that are relevant to the right audience at the right time.
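To make the validation side of that curation work concrete, here is a minimal sketch of a sanity pass over incoming point records. The field names, bounds, and sample data are assumptions for illustration, not Urban Mapping's pipeline; a real pass would add projection checks, geocode verification, deduplication, and provenance tracking.

```python
def validate_point(record):
    """Return a list of problems found in one {'name', 'lat', 'lon'} record."""
    problems = []
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        problems.append("missing coordinates")
    else:
        if not -90 <= lat <= 90:
            problems.append(f"latitude out of range: {lat}")
        if not -180 <= lon <= 180:
            problems.append(f"longitude out of range: {lon}")
        if (lat, lon) == (0.0, 0.0):
            problems.append("(0, 0) coordinates: likely a failed geocode")
    if not record.get("name", "").strip():
        problems.append("missing name attribute")
    return problems

records = [
    {"name": "Ferry Building", "lat": 37.7955, "lon": -122.3937},
    {"name": "", "lat": 0.0, "lon": 0.0},             # a classic bad record
    {"name": "Typo City", "lat": 137.7955, "lon": -122.3937},
]
for r in records:
    print(r.get("name") or "<unnamed>", validate_point(r))
```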
Ian White, CEO of Urban Mapping, will demonstrate why your maps are only as good as your data, examine the issues around big data curation, and illustrate how data acquisition can be addressed from the get-go of any geospatial intelligence project or program planning.
by Leigh Dodds
There are many different approaches to putting data on the web, ranging from bulk downloads through to rich APIs. These styles suit a range of different data processing and integration patterns. But the history of the web has shown that value and network effects follow from making things addressable.
Facebook’s Open Graph, Schema.org, and a recent scramble towards a “Rosetta Stone” for geodata are all examples of a trend towards linking data across the web. Weaving data into the web simplifies data integration. Big Data offers ways to mine huge datasets for insight. Linked Data creates massively inter-connected datasets that can be mined or drawn upon to enrich queries and analysis.
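As a concrete taste of drawing on one such inter-connected dataset, the sketch below queries DBpedia's public SPARQL endpoint using the SPARQLWrapper Python library. The query itself is an illustrative assumption, not an example taken from the talk.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Ask DBpedia for companies located in Dublin, following links from
# company resources to their location resources.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?company ?name WHERE {
        ?company a dbo:Company ;
                 dbo:location dbr:Dublin ;
                 rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    # Each ?company binding is itself an addressable URI that links onward
    # to other datasets: the network effect described above.
    print(row["name"]["value"], row["company"]["value"])
```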
This talk will look at the concept of Linked Data and how a rapidly growing number of inter-connected databases, from a diverse range of sources, can be used to contextualise Big Data.
by Peter Kuhn
Personalized Cancer Care: How to predict and monitor individual patients’ responses to cancer drugs.
1. Biology: Cancer spreads through the body when cancer cells leave the primary site, travel through the blood, and find a new site where they can settle, colonize, expand, and eventually kill the patient.
2. Challenge: The concentration of these cancer cells is about 1 per 1 million normal white blood cells, or about 1 per 2 billion cells if you include the red blood cells. That works out to roughly a handful of these cells in a tube of blood (if you have ever given blood, you can picture this pretty easily). A cell is about 10 microns in diameter.
3. Opportunity: If we can find these cells, we could simply take a tube of blood and characterize the disease in that patient at that point in time to make treatment decisions. We have significant numbers of drugs going through the development pipeline but no good way of deciding which drug to take at which time.
4. Solution: Create a large monolayer of 10 million cells, stain the cells, image them, and then find the cancer cells computationally through an iterative process. It is a simple data-driven solution to a very large challenge: simple in the world of algorithms, HPC, and cloud, and set up to revolutionize cancer care.
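The "find the cells computationally" step is, at heart, rare-event detection over a very large image. Below is a minimal threshold-and-label sketch using numpy and scipy; the synthetic image, threshold schedule, and size cutoffs are illustrative assumptions, not the actual assay.

```python
import numpy as np
from scipy import ndimage

def find_candidate_cells(stain, threshold, min_pixels=10, max_pixels=400):
    """Label connected bright regions in a stained-monolayer image and keep
    those whose pixel area is plausible for a roughly 10-micron cell."""
    mask = stain > threshold
    labeled, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if min_pixels <= s <= max_pixels]
    return ndimage.center_of_mass(stain, labeled, keep)

# Iterate the threshold downward until the candidate count stabilizes,
# then hand candidates to a classifier or human review.
rng = np.random.default_rng(0)
image = rng.random((512, 512)) * 0.3      # stand-in for a monolayer scan
image[100:106, 200:206] += 0.7            # one synthetic bright "cell"
for t in (0.8, 0.7, 0.6):
    centers = find_candidate_cells(image, threshold=t)
    print(f"threshold={t}: {len(centers)} candidate cell(s)")
```

At the real scale of 10 million cells per slide, repeating this labeling pass is what pushes the problem into HPC and cloud territory.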
28th February to 1st March 2012