by Abe Taha
Hadoop is fast becoming an integral part of BI infrastructures to empower everyone in the business to affordably analyze large amounts of multi-structured data not previously possible with traditional OLAP or RDBMS technologies. This session will cover best practices for implementing Big Data Analytics directly on Hadoop. Topics will include architecture, integration and data exchange; self-service data exploration and analysis; openness and extensibility; the roles of Hive and the Hive metastore; integration with BI dashboards, reporting and visualization tools; using existing SAS and SPSS models and advanced analytics; ingesting data with Sqoop, Flume, and ETL.
NetApp collects 500 TB per year of unstructured data from devices that phone home, sending unstructured auto-support log and configuration data back to centralized data centers. NetApp needed to scale their architecture, to provide timely support data, to extend their data warehouse, and to be able to do ad hoc analysis and build predictive models for device support,product analysis and cross-sales. This presentation presents the challenges and a solution architecture based on Apache Hadoop, Apache HBase, Apache Flume, Apache AVRO, HIVE as well as PIG and Solr. The presentation reviews lessons from implementation.
by Bilung Lee and Kathleen Ting
Apache Sqoop (incubating) was created to efficiently transfer big data between Hadoop related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. In the meantime, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases as well as become easier to manage and operate, a new generation of Sqoop, also known as Sqoop 2, is currently undergoing development to address several key areas, including ease of use, ease of extension, and security. This session will talk about Sqoop 2 from both the development and operations perspectives.