The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents in a similar way a person would do by reading them. Lately, text mining and analytics tools became available via APIs, meaning that organizations can take immediate advantage these tools. We discuss three examples of how such APIs were utilized to solve key business challenges.
Most organizations dream of paperless office, but still generate and receive millions of print documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype build for the legal vertical that scans stacks of paper documents and on the fly categorizes and generates meaningful metadata.
In the area of forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability of automatically identify people’s names, addresses, credit card and bank account numbers and other entities is the key. We will briefly describe a case study of how a major international financial institution is taking advantage of text mining APIs in order to comply with a recent legislation act.
In healthcare, although Electronic Health Records (EHRs) have been increasingly becoming available over the past two decades, patient confidentiality and privacy concerns have been acting as obstacles from utilizing the incredibly valuable information they contain to further medical research. Several approaches have been reported in assigning unique encrypted identifiers to patients’ ID but each comes with drawbacks. For a number of medical studies consistent uniform ID mapping is not necessary and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
by Peter Kuhn
Personalized Cancer Care: How to predict and monitor the response of cancer drugs in individual patients.
1. Biology: Cancer spreads through the body by cancer cells leaving the primary site of cancer, traveling through the blood to find a new site where it can settle, colonize, expand and eventually kill the patient.
2. Challenge: the concentration of the cancer cells is about 1 to 1 million normal white blood cells or 1 to 2 billion cells if you include the red blood cells. This makes for about a handful of these cells in a tube of blood (assuming that you have given blood before, you can picture this pretty easily). A cell is about 10 microns in diameter
3. Opportunity: if can find these cells, we could always just take a tube of blood and characterize the disease in that patient at that point in time to make treatment decisions. We have significant numbers of drugs going through the development pipeline but no good way of making decisions about which drug to take at which time.
4. Solution: create a large monolayer of 10 million cells, stain the cells, then image them and then find the cells computationally by an iterative process. It is a simple data driven solution to very large challenge. It is simple in the world of algorithms, HPC and cloud, and setup to revolutionize cancer care.
28th February to 1st March 2012