The Yahoo! Developer Network and the London Hadoop User Group are excited to announce a Hadoop evening at our Yahoo! offices in Central London on 23rd March 2011 at 6:30pm.
Visiting speakers Owen O'Malley and Sanjay Radia from Yahoo!'s head office in Silicon Valley will be talking about Federated HDFS and the Next Generation of Hadoop MapReduce. Jakob Homan from LinkedIn will be talking about Kafka.
This free-to-attend event has limited space, so make sure you sign up early to secure your place. The talks will start at 6:30pm, with registration from 6:00pm.
Next Generation of Hadoop MapReduce by Owen O'Malley
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. High availability, security, and improved multi-tenancy are fundamental to the new architecture. The new architecture also increases innovation, agility and hardware utilization.
Federated HDFS by Sanjay Radia
Scalability of the NameNode has been a key challenge. Because the NameNode keeps the entire namespace and all block locations in memory, the size of the NameNode heap limits the number of files and blocks it can address, and therefore the total cluster storage the NameNode can support.
Federated HDFS allows multiple independent namespaces (and NameNodes) to share the physical storage within a cluster. This is enabled by the introduction of block pools, which are analogous to LUNs in a SAN storage system.
This approach offers a number of advantages besides scalability: it isolates the namespaces of different applications, improving the overall availability of the cluster. The block pool abstraction also allows other services (such as HBase) to use the block storage, perhaps with a different namespace structure.
Applications prefer to continue to use a single namespace. Namespaces can be mounted to create such a unified view. A client-side mount table provides an efficient way to do that, compared to a server-side mount table: it avoids an RPC to the central mount table and is also tolerant of the mount table's failure. The simplest approach is to have a shared cluster-wide namespace; this can be achieved by giving the same client-side mount table to each client of the cluster. Client-side mount tables also allow applications to create a private namespace view. This is analogous to the per-process namespaces that are used to deal with remote execution in distributed systems.
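As a rough sketch of what such a client-side mount table looks like in practice, the ViewFs-style configuration below maps two directories onto two different NameNodes. The hostnames and paths are hypothetical, and the exact property names may differ between Hadoop releases:

```xml
<!-- Client-side mount table sketch (ViewFs-style); hostnames are hypothetical. -->
<configuration>
  <!-- Clients see a single unified namespace named "clusterA". -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterA</value>
  </property>
  <!-- /user is served by one NameNode... -->
  <property>
    <name>fs.viewfs.mounttable.clusterA.link./user</name>
    <value>hdfs://nn1.example.com/user</value>
  </property>
  <!-- ...while /data is served by another. -->
  <property>
    <name>fs.viewfs.mounttable.clusterA.link./data</name>
    <value>hdfs://nn2.example.com/data</value>
  </property>
</configuration>
```

Because the table ships with each client's configuration, path resolution happens locally: no RPC to a central mount service, and no single point of failure for name resolution.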
Kafka by Jakob Homan
Kafka is a distributed pub-sub system that handles streaming data and provides the ability to load data directly into Apache Hadoop. It provides a highly performant messaging system combined with a simple, extensible API. Kafka is currently in production at LinkedIn and was recently open-sourced. Learn more at http://sna-projects.com/kafka/
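For those new to the publish-subscribe model Kafka is built around, the sketch below illustrates the core idea in a few lines of Python. This is emphatically not Kafka's API, just the semantics: producers publish messages to named topics, and every subscriber of a topic receives each message.

```python
from collections import defaultdict

# Minimal in-memory publish-subscribe sketch. Kafka implements this model
# as a distributed, persistent, high-throughput system; this toy version
# only shows the topic/producer/subscriber relationship.

class PubSub:
    def __init__(self):
        # topic name -> list of handler callables
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a handler to receive every message on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver `message` to all handlers subscribed to `topic`."""
        for handler in self._subscribers[topic]:
            handler(message)

bus = PubSub()
received = []
bus.subscribe("clicks", received.append)
bus.publish("clicks", "user42 clicked /home")
print(received)  # ['user42 clicked /home']
```

In Kafka the "handlers" are independent consumer processes (for example, jobs loading the stream into Hadoop), and topics are partitioned and persisted across brokers rather than held in one process's memory.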
Owen O'Malley is a software architect on Apache Hadoop, working on Yahoo!'s Hadoop development team. He was contributing to Hadoop before it was factored out of Nutch, is the single largest Hadoop contributor, a winner of the Gray Sort and Minute Sort benchmarks, and was the original chair of the Hadoop Project Management Committee. Before working on Hadoop, he worked on Yahoo! Search's WebMap project, which builds a graph of the known web and applies many heuristics to the entire graph that influence search. Prior to Yahoo!, he wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He received his PhD in Software Engineering from the University of California, Irvine.
Sanjay is the architect of the Hadoop project at Yahoo!. Previously he held senior engineering positions at Cassatt, Sun Microsystems and INRIA, where he developed software for distributed systems and grid/utility computing infrastructures. Sanjay holds a PhD in Computer Science from the University of Waterloo, Canada. He is a Hadoop committer and PMC member.
Jakob Homan, a member of the Search, Network and Analytics (SNA) team at LinkedIn, is currently focused on extending projects within the Apache Hadoop ecosystem, including HDFS, Hive and Howl. He is a Hadoop committer and PMC member.