Hadoop MapReduce is inherently aware of HDFS and can use the namenode during the scheduling of tasks to decide the best placement of map tasks with respect to machines where there is a local copy of the data. This avoids a significant amount of network overhead during processing, as workers do not need to copy data over the network to access it, and it removes one of the primary bottlenecks when processing huge amounts of data.

Hadoop MapReduce is similar to traditional distributed computing systems in that there is a framework and there is the user's application or job. A master node coordinates cluster resources while workers simply do what they're told, which in this case is to run a map or reduce task on behalf of a user. Client applications written against the Hadoop APIs can submit jobs either synchronously and block for the result, or asynchronously and poll the master for job status (a minimal sketch of both approaches appears at the end of this section). Cluster daemons are long-lived while user tasks are executed in ephemeral child processes. Although executing a separate process incurs the overhead of launching a separate JVM, it isolates the framework from untrusted user code that could, and in many cases does, fail in destructive ways. Since MapReduce specifically targets batch processing tasks, the additional overhead, while undesirable, is not necessarily a showstopper.

One of the ingredients in the secret sauce of MapReduce is the notion of data locality, by which we mean the ability to execute computation on the same machine where the data being processed is stored. Many traditional high-performance computing (HPC) systems have a similar master/worker model, but computation is generally distinct from data storage. In the classic HPC model, data is usually stored on a large shared centralized storage system such as a SAN or NAS. When a job executes, workers fetch the data from the central storage system, process it, and write the result back to the storage device. The problem is that this can lead to a storm effect when a large number of workers attempt to fetch the same data at the same time and, for large datasets, quickly causes bandwidth contention.

MapReduce flips this model on its head. Instead of using a central storage system, a distributed filesystem is used where each worker is usually both a storage node and a compute node. Blocks that make up files are distributed to nodes when they are initially written, and when computation is performed, the user-supplied code is pushed to, and executed on, a machine where the block is stored locally. Remember that HDFS stores multiple replicas of each block. This is not just for data availability in the face of failures, but also to increase the chance that a machine with a copy of the data has available capacity to run a task.
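To make block placement concrete, here is a minimal client-side sketch that asks the namenode, through the standard FileSystem API, which datanodes hold each block of a file; this is the same kind of placement information the scheduler consults when trying to run map tasks close to the data. The file path is illustrative, and the snippet assumes a client whose configuration points at an existing HDFS cluster.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.txt");  // illustrative path; any file in HDFS works
            FileStatus status = fs.getFileStatus(file);

            // Ask the namenode which hosts store each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
            }
        }
    }

Every host listed for a block is a candidate for a data-local map task, which is why replication helps scheduling as well as durability.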
Daemons

There are two major daemons in Hadoop MapReduce: the jobtracker and the tasktracker.

Jobtracker

The jobtracker is the master process, responsible for accepting job submissions from clients, scheduling tasks to run on worker nodes, and providing administrative functions such as worker health and task progress monitoring to the cluster. There is one jobtracker per MapReduce cluster and it usually runs on reliable hardware, since a failure of the master will result in the failure of all running jobs. Clients and tasktrackers (see "Tasktracker" on page 35) communicate with the jobtracker by way of remote procedure calls (RPC).

Just like the relationship between datanodes and the namenode in HDFS, tasktrackers inform the jobtracker of their current health and status by way of regular heartbeats. Each heartbeat contains the total number of map and reduce task slots available (see "Tasktracker" on page 35), the number occupied, and detailed information about any currently executing tasks. After a configurable period of no heartbeats, a tasktracker is assumed dead. The jobtracker uses a thread pool to process heartbeats and client requests in parallel.

When a job is submitted, information about each task that makes up the job is stored in memory. This task information updates with each tasktracker heartbeat while the tasks are running, providing a near real-time view of task progress and health. After the job completes, this information is retained for a configurable window of time or until a specified number of jobs have been executed. On an active cluster where many jobs, each with many tasks, are running, this information can consume a considerable amount of RAM. It's difficult to estimate memory consumption without knowing how big each job will be (measured by the number of tasks it contains) or how many jobs will run within the retention window.
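To make that bookkeeping concrete, the classes below are a deliberately simplified model, not Hadoop's actual types, of roughly what a tasktracker heartbeat reports and what the jobtracker retains for each task; the real implementation carries considerably more detail.

    import java.util.List;

    // Illustrative model only: roughly the state carried by a tasktracker
    // heartbeat and retained in the jobtracker's memory for each task.
    class TaskReport {
        String taskAttemptId;   // identifier of a running task attempt
        float progress;         // 0.0f to 1.0f
        String state;           // e.g. RUNNING, SUCCEEDED, FAILED
    }

    class Heartbeat {
        String trackerName;            // which tasktracker sent the heartbeat
        int mapSlots;                  // total map task slots on the worker
        int reduceSlots;               // total reduce task slots on the worker
        int occupiedMapSlots;          // map slots currently in use
        int occupiedReduceSlots;       // reduce slots currently in use
        List<TaskReport> running;      // detailed status of each executing task
    }

Because the jobtracker keeps a record along these lines for every task of every retained job, its memory footprint grows roughly with the number of retained jobs times the number of tasks per job, which is exactly the estimation problem described above.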

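Finally, here is the job submission sketch promised earlier in this section. It uses the org.apache.hadoop.mapreduce.Job client API with a deliberately minimal configuration; the input and output paths are illustrative, and a real job would also set its mapper, reducer, and key/value classes. The commented-out block shows the asynchronous submit-and-poll alternative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "submit-example");  // newer releases prefer Job.getInstance(conf)
            job.setJarByClass(SubmitExample.class);
            // A real job would also configure its mapper, reducer, and key/value classes here.
            FileInputFormat.addInputPath(job, new Path("/input"));     // illustrative paths
            FileOutputFormat.setOutputPath(job, new Path("/output"));

            // Synchronous submission: block until the job finishes, printing progress as it runs.
            boolean success = job.waitForCompletion(true);

            // Asynchronous alternative: submit, then poll the jobtracker for status.
            // job.submit();
            // while (!job.isComplete()) {
            //     System.out.printf("map %.0f%% reduce %.0f%%%n",
            //         job.mapProgress() * 100, job.reduceProgress() * 100);
            //     Thread.sleep(5000);
            // }

            System.exit(success ? 0 : 1);
        }
    }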