MapReduce itself constitutes a model rather than an implementation.

Following the reduce phase, the file system assembles the data into a coherent form and makes it available for the user to view, or for subsequent MapReduce processing. In one of its most central roles, HDFS underpins the Hadoop MapReduce fault-tolerance mechanism: it maintains integrity and block condition reports, and triggers the replication of blocks rendered faulty or missing by node failure.

As highlighted above, a successful MapReduce implementation requires parallelization, synchronization abstraction from the user, fault tolerance, and an effective data management scheme. These features must, however, remain hidden from the user's responsibility while forming the fabric of the framework. With today's use of MapReduce on petabyte- and exabyte-scale data, other important properties such as scalability also stem from the correct application of the tenets outlined above. With Apache Hadoop, all but synchronization abstraction rests outside of HDFS's authority. The Hadoop Distributed File System is, however, not the storage system of choice within many HPC environments. In such settings, including but not limited to NERSC, the NY State Grid, the Open Science Grid, and TeraGrid, where the Message Passing Interface is widely adopted, shared-disk file systems that expose a single storage solution to all nodes are favored. The use of HDFS, and by extension of MapReduce not only through Hadoop but also through Twister and Dryad, requires either isolating a limited set of resources for that purpose, or revamping the cluster solely for MapReduce's benefit.
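The block-replication role described above can be illustrated with a short conceptual sketch. This is not HDFS source code: the data layout, the function name, and the node identifiers are hypothetical, and HDFS's real mechanism (block reports, heartbeats, replication queues) is far more elaborate. The sketch only shows the decision being described: when a node fails, blocks whose live replica count drops below the replication factor are scheduled for re-replication.

```python
# Conceptual sketch (not HDFS code) of the re-replication trigger: given block
# reports, find blocks whose live replica count fell below the target and
# schedule additional copies. Names and data layout are hypothetical.
TARGET_REPLICATION = 3  # HDFS's default replication factor

def under_replicated(block_reports, live_nodes):
    """block_reports: {block_id: set of node ids holding a replica}."""
    pending = []
    for block_id, holders in block_reports.items():
        live_replicas = holders & live_nodes
        if len(live_replicas) < TARGET_REPLICATION:
            pending.append((block_id, TARGET_REPLICATION - len(live_replicas)))
    return pending  # (block, number of extra copies to schedule)

# Example: node "n3" has failed, so its replicas no longer count as live.
reports = {"blk_1": {"n1", "n2", "n3"}, "blk_2": {"n2", "n3", "n4"}}
print(under_replicated(reports, live_nodes={"n1", "n2", "n4"}))
# -> [('blk_1', 1), ('blk_2', 1)]
```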

For example, at NERSC the system administrators have set aside 400 nodes out of 17,000 for Hadoop. Access via ssh and the queuing policies for these nodes differ from those of the rest of the cluster. The latter option being impractical, MapReduce users in such settings find themselves unable to make full use of existing resources. More often, this means that cluster administrators install the model on top of an already dedicated system. This approach requires traversing additional layers of indirection, as HDFS must be accommodated on the host cluster. Such additional layers of indirection have been shown to degrade performance: Hadoop performs poorly in virtualized settings precisely because of the many layers of abstraction sitting between the native system and the Hadoop Distributed File System. In the case of Twister and Dryad, the issue is not the use of HDFS, as these frameworks do not make use of it, but rather the use of a similar approach requiring that each node be granted storage independent from all other nodes in the cluster. In our experience, such a requirement, as in Hadoop's case, leads either to resource isolation or to resource limitation. This is because MPI-friendly clusters such as TeraGrid and NERSC offer immense computing power, but also favor directly networked, POSIX file systems, as such a scheme suits their operations.

This model, aimed at the distributed processing of large datasets, is compatible in its tenets, rather than in its implementations, with most widely available HPC platforms. The three pillars of the model can be found in the majority of HPC platforms in use today; where not present, the latter two can be engineered. Nonetheless, the tight coupling of the MapReduce model to its implementation, in the various cases outlined, introduces inefficiencies not suitable for traditional, batch, and legacy systems.

These inefficiencies, we believe, should not prevent the model from evolving, nor should they restrict applications from using the full force of its capacity in HPC environments. To this effect, we designed our implementation of MapReduce on such systems for adaptability first, and subsequently analyzed the performance characteristics of the approach. In this work we focus on NFS and GPFS. Even though we limit ourselves here to NFS and GPFS, their common high-level structure and disk presentation mean that MARIANE can be installed and evaluated on a myriad of shared-disk and clustered file systems. We will benchmark a sizable number of such file systems with MARIANE in future work, as this paper's scope is focused on quantifying the HDFS bottleneck along with the performance implications of MARIANE's design choices.

The National Energy Research Scientific Computing Center hosts over 7 central clusters, as well as a myriad of specialized "sub-clusters" hosting various energy research projects. NERSC totals approximately 17,000 available nodes set up for MPI use and devoted to research purposes, housing over 200,000 processing cores. NERSC also offers over 2,000 petabytes of storage space for compute- and data-intensive applications. Apache Hadoop is installed on its Magellan cluster and benefits from 400 processing cores and 785 TB of data space. This stems from Hadoop's requirement to operate under HDFS or similarly structured storage systems such as S3 with Amazon MapReduce. Hadoop and HDFS, however, similarly to other MapReduce implementations such as Twister and LEMO-MR, require dedicated disks, or at least dedicated disk partitions, when operating under global file systems as in NERSC's case. Each node also needs dedicated ports and active MapReduce daemons running on it.

In Hadoop's case, HDFS needs to be large enough to accommodate the minimum file replication requirements of an effective fault-tolerance mechanism. By default this means 3 times the input size for each file present on the file system, and as such, 3 TB of space for a 1 TB set of input files. These conditions can leave the MapReduce cluster with only minimal resources, as the overall resources cannot be dedicated solely to MapReduce. In the same environment, MARIANE can make use of the existing global file system as it is already configured. The framework only uses the secure shell; it requires no dedicated ports or daemons, nor does it require file replication space for its input. In operation, MARIANE tolerates constant node removal from and addition to its host file without requiring the cluster to be reconfigured. The framework does not need to be started or stopped, and only requires installation on the "Master" node, rather than on all participating computers as is the case with Hadoop and Twister. The MARIANE framework is thus designed to be seamlessly set up in HPC environments with no disruption to the environment's integrity or structure.

Input distribution is handled directly through the shared file system. As the input is deposited by the user, the file system performs caching and pre-fetching to make the data visible to all nodes on demand. This frees the MapReduce framework from accounting for, and transferring, input to the various nodes. Another benefit of shared-disk file systems with MapReduce, one which became apparent as the application was implemented, is the following: current MapReduce implementations, because their input storage model is tightly coupled to the framework, require cluster reconfiguration whenever the cluster size increases or decreases. This does not allow for an elastic cluster approach such as that displayed by the Amazon EC2 cloud computing framework or Microsoft's Azure. Although cloud computing is conceptually separate from MapReduce, we believe that the adoption of some of its features, more specifically "elasticity", can positively affect the turnaround time of MapReduce applications. With the input medium isolated from the processing nodes, as MARIANE features, more nodes can be instantly added to the cluster without incurring the cost of data redistribution or cluster re-balancing. Such operations can be highly time-consuming and only allow the job to start upon their completion, once all data has settled on the cluster nodes involved. With MARIANE, input storage is independent of the processing nodes. Separating the I/O structure from the nodes allows for swift reconfiguration and a faster application turnaround time.
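MARIANE's source is not reproduced in this paper, so the following is only a minimal sketch, under our own assumptions, of the input-distribution scheme described above: the master computes byte-range file markers over a single shared copy of the input and starts map tasks over ssh. The shared paths, host names, and the map_task.py worker script are hypothetical placeholders; a real implementation would align markers to record boundaries and add error handling.

```python
# Minimal sketch (not MARIANE's actual code) of input distribution over a
# shared file system: the master computes byte-range "file markers" and each
# worker reads its own range straight from the shared mount.
# Paths, host names, and map_task.py are hypothetical placeholders.
import os
import subprocess

SHARED_INPUT = "/shared/input/data.txt"   # same path visible on every node (NFS/GPFS)
WORKERS = ["node01", "node02", "node03"]  # hypothetical host names

def assign_markers(path, workers):
    """Split the input into contiguous byte ranges, one per worker.
    A real implementation would align ranges to record boundaries."""
    size = os.path.getsize(path)
    chunk = size // len(workers)
    markers = []
    for i, host in enumerate(workers):
        start = i * chunk
        end = size if i == len(workers) - 1 else (i + 1) * chunk
        markers.append((host, start, end))
    return markers

def launch_map_task(host, start, end):
    """Start a map task on `host` over ssh; no data is shipped, only two offsets."""
    cmd = f"python /shared/mariane/map_task.py {SHARED_INPUT} {start} {end}"
    return subprocess.Popen(["ssh", host, cmd])

if __name__ == "__main__":
    tasks = [launch_map_task(h, s, e) for h, s, e in assign_markers(SHARED_INPUT, WORKERS)]
    for t in tasks:
        t.wait()  # growing or shrinking the cluster only changes the WORKERS list
```

Because the data itself never moves, adding or removing a node amounts to editing the worker list and recomputing the markers, which is the elasticity argument made above.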

In Hadoop's case, removing a node holding a crucial input chunk means finding a node holding a duplicate of the chunk held by the exiting node and copying it to the arriving node, or simply re-balancing the cluster so as to redistribute the data evenly across all nodes. With large-scale input datasets, such an operation can be time-consuming. Instead, based on the number of participating workers in the cluster, nodes can be assigned file markers indicating which part of the input to process. Should a node drop out or be replaced, the arriving machine simply inherits its file markers, in the form of simple programming variables. This mechanism, as our performance evaluation will show, also makes for an efficient and lightweight fault-tolerant framework.

While Hadoop uses task and input chunk replication for fault tolerance, we opted for a node-specific fault-tolerance mechanism rather than an input-specific one. With this approach, node failure does not impact data availability, and new nodes can be assigned failed work with no need for expensive data relocation. Upon failure, an exception handler notifies the master before the worker terminates; in the case of a sudden worker death, the failure manifests as the rupture of a communication pipe. Furthermore, the master receives the completion status of the "map" and "reduce" functions from all its workers. Should a worker fail, the master receives notification of the event through a return code, or through a broken pipe signal upon the sudden death of the worker. The master then updates its node availability and job completion data structures to indicate that a job was not completed and that a node has failed. We later evaluate this low-overhead fault-tolerance component against Hadoop's data replication and node heartbeat detection capability, and assess job completion times in the face of node failures.

In this section, we test the applicability of MARIANE to traditional MapReduce problems alongside Hadoop MapReduce. We not only test the framework under various node addition schemes with constant input sizes, to account for scalability and speed-up as the cluster grows, but also test it with increasing input sizes against static cluster sizes. Our experiments were conducted on the National Energy Research Scientific Computing Center's cluster and in the Binghamton University Grid and Cloud Computing Research Lab. On NERSC, we performed our tests on the Magellan cluster, where MARIANE was installed on top of GPFS and tested alongside the local Apache Hadoop v0.20 installation running on the same test bed. The Binghamton University Grid and Cloud Computing Research Lab runs the same Hadoop version and hosts MARIANE using NFS. In all the experiments showcased, we ran MARIANE alongside Hadoop using identical nodes, identical node counts, identical input data, and similar user source code. For these experiments we chose traditional MapReduce problems available from Apache Hadoop's example repository. The first is UrlRank, a popular application similar to Google's PageRank, used at Yahoo! with Hadoop MapReduce to categorize web pages by request frequency. Similar versions of this application are also used for data mining and user statistical analysis at Facebook.
The second application we tested was "Distributed Grep", in which documents containing word or sequence patterns are searched for and returned from terabytes of raw data, or potentially hundreds of millions of individual documents, making processing on a single computer an unacceptable option. Variations of this application report pattern appearance frequencies and counts. As a final application group, we tested a compute-intensive application of our own: the distributed parsing and processing of XML elements using the AxisJava parser. AxisJava is a web-services toolkit consisting of a SOAP engine capable of creating SOAP processors that parse data content. The tokens resulting from this parsing can subsequently be streamed into application-friendly content and returned to the user. Our input data for testing purposes is composed of arrays of floating-point numbers. In prior work, we have shown that AxisJava scales poorly for applications that require processing arrays of floating-point numbers. The computation is intensive and demanding on the systems involved, regardless of overall data size. With our XML data parsing experiments, we test the performance of both frameworks' fault-tolerance mechanisms in the face of node failures.
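Before turning to the results, the node-specific fault-tolerance mechanism described earlier can be illustrated with a short sketch. This is not MARIANE's actual code: the host names, task command, and helper functions are hypothetical, the loop runs tasks sequentially only to keep the sketch short, and it assumes a spare node is available whenever a worker fails. It shows how a failure surfaces to the master as a non-zero ssh return code (a broken pipe produces the same effect once the session dies), and how the failed node's file markers are simply handed to a replacement without any data movement.

```python
# Minimal sketch (not MARIANE's actual code) of node-specific fault tolerance:
# a failed worker is detected through its return code and its file markers are
# reassigned to a spare node. Host names, paths, and helpers are hypothetical.
import subprocess

def run_map_task(host, start, end):
    """Run one map task on `host` over ssh and return its exit code."""
    cmd = f"python /shared/mariane/map_task.py /shared/input/data.txt {start} {end}"
    return subprocess.call(["ssh", host, cmd])

def run_with_failover(assignments, spares):
    """assignments: {host: (start, end)} file markers; spares: idle hosts.
    Assumes a spare is available whenever a worker fails."""
    completed = {}
    pending = dict(assignments)
    while pending:
        host, (start, end) = pending.popitem()
        if run_map_task(host, start, end) == 0:
            completed[host] = (start, end)          # update job completion records
        else:
            print(f"{host} failed; reassigning markers [{start}, {end})")
            replacement = spares.pop(0)             # mark node unavailable, pick a spare
            pending[replacement] = (start, end)     # markers are plain variables: no data moves
    return completed
```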

