by Sanjay Radia
Scalability of the NameNode has been a key struggle. Because the NameNode keeps the entire namespace and all block locations in memory, the size of the NameNode heap limits the number of files and blocks it can address. This in turn limits the total cluster storage that a single NameNode can support.
Federated HDFS allows multiple independent namespaces (and NameNodes) to share the physical storage within a cluster. This is enabled by introducing the notion of block pools, which are analogous to LUNs in a SAN storage system.
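As a rough illustration of the block pool idea, the toy model below has two NameNodes, each owning its own namespace and block pool, while both place blocks on the same DataNodes. The class names, method names, pool IDs, and round-robin placement are all hypothetical and chosen for this sketch; none of them are HDFS APIs.

```python
from collections import defaultdict


class DataNode:
    """Shared physical storage: holds blocks for every block pool."""

    def __init__(self, name):
        self.name = name
        # block pool id -> set of block ids stored on this node
        self.blocks = defaultdict(set)

    def store(self, pool_id, block_id):
        self.blocks[pool_id].add(block_id)


class NameNode:
    """Owns one independent namespace and exactly one block pool."""

    def __init__(self, pool_id, datanodes):
        self.pool_id = pool_id
        self.datanodes = datanodes
        self.namespace = {}  # path -> list of block ids
        self._next_block = 0

    def create_file(self, path, num_blocks):
        block_ids = []
        for _ in range(num_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Toy placement policy (an assumption): round-robin over DataNodes.
            target = self.datanodes[block_id % len(self.datanodes)]
            target.store(self.pool_id, block_id)
            block_ids.append(block_id)
        self.namespace[path] = block_ids
        return block_ids


datanodes = [DataNode("dn1"), DataNode("dn2")]
nn_a = NameNode("BP-A", datanodes)  # hypothetical pool ids
nn_b = NameNode("BP-B", datanodes)
nn_a.create_file("/user/a/data", 3)
nn_b.create_file("/user/b/logs", 2)
```

In this sketch both NameNodes hand out block ID 0, yet the blocks never collide on a DataNode because each one is stored under its block pool ID. Neither NameNode knows anything about the other's namespace.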
by Jakob Homan
Kafka is a distributed pub-sub system that handles streaming data and provides the ability to load data directly into Apache Hadoop. It combines a high-performance messaging system with a simple, extensible API. Kafka is currently in production at LinkedIn and was recently open-sourced. Learn more at http://sna-projects.com/kafka/
The Apache Hadoop MapReduce framework has hit a scalability limit of around 4,000 machines. We are developing the next generation of Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. High availability, security, and improved multi-tenancy are fundamental to the new architecture, which is also intended to enable faster innovation, greater agility, and better hardware utilization.
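The split described above can be sketched in a few lines, assuming a much simplified model: a scheduler that only hands out generic slots, and a per-job manager that decides what runs in them. `ResourceScheduler`, `JobManager`, and the slot-counting scheme are hypothetical names invented for this illustration, not the actual Hadoop interfaces.

```python
class ResourceScheduler:
    """Generic scheduler: tracks free slots and knows nothing about jobs' logic."""

    def __init__(self, total_slots):
        self.free_slots = total_slots

    def allocate(self, requested):
        # Grant as many slots as are available, up to the request.
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, count):
        self.free_slots += count


class JobManager:
    """Per-job component: owns the task list and drives the job's execution."""

    def __init__(self, job_name, tasks, scheduler):
        self.job_name = job_name
        self.tasks = tasks  # list of zero-argument callables
        self.scheduler = scheduler

    def run(self):
        results = []
        pending = list(self.tasks)
        while pending:
            granted = self.scheduler.allocate(len(pending))
            if granted == 0:
                raise RuntimeError("no free slots available")
            batch, pending = pending[:granted], pending[granted:]
            # Tasks run sequentially here; a real system would run them remotely.
            results.extend(task() for task in batch)
            self.scheduler.release(granted)
        return results


scheduler = ResourceScheduler(total_slots=2)
words = ["map", "reduce", "shuffle"]
job = JobManager("wordlen", [lambda w=w: len(w) for w in words], scheduler)
results = job.run()
```

Because the scheduler deals only in slots, a different kind of application could plug in its own manager without any change to the scheduler, which is the separation the abstract describes.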
23rd March 2011