Hadoop Source Code Analysis (Hadoop源码分析) V1.1


This document analyzes the Apache Hadoop source, focusing on HDFS and MapReduce, with Hadoop: The Definitive Guide as a companion reference. [email protected], 2013-08-14.

Contents: 1 Overview, 2 Serialization (Writable), 3 RPC, 4 HDFS storage, 5 HDFS data flow, 6 DataNode, 7 NameNode, 8 Lease, 9 Heartbeat, 10 HDFS, 11 MapReduce, and chapters 12-16.

2013-08-24

2013-09-26

1 Overview

Google's cluster computing stack rests on five core technologies: the Google cluster architecture, Chubby, GFS, MapReduce and BigTable.

Google described them in a series of papers starting around 2003: GFS at SOSP 2003, MapReduce at OSDI 2004 and BigTable at OSDI 2006 (SOSP and OSDI being two premier systems conferences). The Apache community then built open-source counterparts, and the mapping is direct: Chubby --> ZooKeeper, GFS --> HDFS, BigTable --> HBase, MapReduce --> Hadoop MapReduce. On top of this stack Facebook contributed Hive and Yahoo contributed Pig. The heart of Hadoop is HDFS plus MapReduce, and understanding those two subsystems is the key to understanding the rest, so they are what this document analyzes. The code studied here is the Hadoop 1.x line ("12 October, 2012: Release 1.0.4 available"), and the analysis concentrates on the core, mapred, tools and hdfs source trees.

Before diving into the code, a word about package structure. Because Hadoop hides HDFS behind an abstract file-system API (the same API that also fronts the local file system, Amazon S3 and others), the package dependencies are more tangled than one might expect: conf, which reads the system configuration, depends on fs, since configuration files are accessed through the file-system abstraction, while fs in turn needs its settings from conf.

(Figure: Hadoop package dependencies)

The key top-level packages and their roles:

tools: command-line tools such as DistCp and the archive tool.

mapreduce: the Hadoop Map/Reduce implementation.

filecache: the distributed file cache, which copies HDFS files to local disk so Map/Reduce tasks can read them quickly.

fs: the file-system abstraction layer.

hdfs: HDFS, Hadoop's distributed file system.

ipc: a simple inter-process call layer built on the codecs provided by io. (czhangmz: Hadoop IPC plays the role of RPC, but unlike Sun RPC or Java RMI it uses no generated stubs or skeletons; parameters and return values must be Java Strings or Writables, and remote methods may only throw IOException. The server side is created with RPC.getServer(); the client side obtains a proxy through the RPC class. HDFS traffic between clients, the NameNode and the DataNodes, including heartbeats, runs over this layer.)

io: the representation layer, encoding and decoding data for transmission and storage.

net: network helpers such as DNS lookups and socket utilities.

security: user and group information.

conf: system configuration.

metrics: collection of runtime statistics for monitoring.

util: miscellaneous utilities.

record: record I/O generated from a DDL description; C++ and Java codecs can be produced.

http: an embedded Jetty HTTP server and servlets exposing status pages.

log: an HTTP servlet for viewing logs over HTTP.

2 Serialization (Writable)

Serialization sits at the centre of Hadoop: objects travel between processes as remote procedure call (RPC) parameters and results, and HDFS and MapReduce constantly move keys, values and metadata between machines, so an efficient format matters. Hadoop deliberately does not use Java object serialization (czhangmz: see "Why Not Use Java Object Serialization?" in Hadoop: The Definitive Guide, 2nd edition, Chapter 4: Hadoop I/O). The core abstraction is the Writable interface in org.apache.hadoop.io, a simple serialization protocol expressed in terms of DataOutput and DataInput:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

/**
 * A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
 *
 * Any key or value type in the Hadoop Map-Reduce framework implements this
 * interface.
 *
 * Implementations typically implement a static read(DataInput) method which
 * constructs a new instance, calls {@link #readFields(DataInput)} and returns
 * the instance.
 *
 * Example:
 *
 * public class MyWritable implements Writable {
 *   // Some data
 *   private int counter;
 *   private long timestamp;
 *
 *   public void write(DataOutput out) throws IOException {
 *     out.writeInt(counter);
 *     out.writeLong(timestamp);
 *   }
 *
 *   public void readFields(DataInput in) throws IOException {
 *     counter = in.readInt();
 *     timestamp = in.readLong();
 *   }
 *
 *   public static MyWritable read(DataInput in) throws IOException {
 *     MyWritable w = new MyWritable();
 *     w.readFields(in);
 *     return w;
 *   }
 * }
 */
public interface Writable {
  /**
   * Serialize the fields of this object to out.
   *
   * @param out DataOuput to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from in.
   *
   * For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.
   *
   * @param in DataInput to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}

So a Writable only has to provide two methods, write and readFields, and org.apache.hadoop.io supplies a large family of Writable implementations.

WritableComparable combines Writable with java.lang.Comparable; types such as IntWritable implement WritableComparable. Comparison matters a great deal in MapReduce, where keys are sorted during the shuffle, so Hadoop supplements the ordinary Java Comparator with RawComparator, which can compare records directly on their byte representations without deserializing them:

package org.apache.hadoop.io;

import java.util.Comparator;
import org.apache.hadoop.io.serializer.DeserializerComparator;

/**
 * A {@link Comparator} that operates directly on byte representations of
 * objects.
 *
 * @param <T>
 * @see DeserializerComparator
 */
public interface RawComparator<T> extends Comparator<T> {

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}

WritableComparator is a general-purpose RawComparator for WritableComparable classes: its default compare() deserializes the two byte ranges and compares the resulting objects, and a subclass can override it with a genuine byte-level comparison for speed. With Writable, WritableComparable and RawComparator defined, org.apache.hadoop.io builds up the family of concrete Writable types that the rest of Hadoop relies on.
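To make the pattern concrete, here is a sketch of a custom WritableComparable; the class name and fields are invented for the example, but the write/readFields/compareTo structure follows the interfaces quoted above.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: (userId, timestamp), sorted by userId then timestamp.
public class UserTimeKey implements WritableComparable<UserTimeKey> {
  private long userId;
  private long timestamp;

  public UserTimeKey() {}                        // required: the framework calls readFields() on an empty instance

  public UserTimeKey(long userId, long timestamp) {
    this.userId = userId;
    this.timestamp = timestamp;
  }

  public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
    out.writeLong(userId);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException { // deserialize in the same order
    userId = in.readLong();
    timestamp = in.readLong();
  }

  public int compareTo(UserTimeKey other) {                 // defines the sort order used by MapReduce
    if (userId != other.userId) {
      return userId < other.userId ? -1 : 1;
    }
    if (timestamp != other.timestamp) {
      return timestamp < other.timestamp ? -1 : 1;
    }
    return 0;
  }
}

In a real job such a key would typically also register a raw comparator through WritableComparator.define so the shuffle can compare serialized keys without deserializing them.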

Hadoop supplies Writable wrappers for every Java primitive type except char and short (both of which can be stored in an IntWritable); each wrapper exposes the boxed value through get() and set(). The mapping and the serialized sizes are:

boolean: BooleanWritable, 1 byte

byte: ByteWritable, 1 byte

int: IntWritable, 4 bytes; VIntWritable, 1 to 5 bytes

float: FloatWritable, 4 bytes

long: LongWritable, 8 bytes; VLongWritable, 1 to 9 bytes

double: DoubleWritable, 8 bytes

The difference between IntWritable/LongWritable and VIntWritable/VLongWritable is the encoding: the fixed-length forms always occupy 4 or 8 bytes, while the variable-length forms store a small value (between -112 and 127 in Hadoop's encoding) in a single byte and use more bytes only as the magnitude grows, which saves space when most values are small. Text is the Writable for UTF-8 strings and plays the role of java.lang.String (czhangmz: the differences between Text and String are discussed in Hadoop: The Definitive Guide, 2nd edition, Chapter 4: Hadoop I/O). ObjectWritable is a general-purpose wrapper used by Hadoop RPC to marshal method parameters and return values whose types vary from call to call: it writes the class of the wrapped object followed by the object itself, so the receiving side can reconstruct a value of the right type (czhangmz: when a MyWritable is wrapped in an ObjectWritable, the ObjectWritable records that its payload is a MyWritable). To instantiate a Writable it only knows by class, ObjectWritable uses WritableFactories: a Writable implementation such as MyWritable can register a factory with WritableFactories.setFactory. Finally, although the MapReduce API is built around Writable, Hadoop also has a pluggable serialization framework: the io.serializations configuration property lists Serialization implementations, each of which supplies a Serializer and a Deserializer for the types it accepts, and org.apache.hadoop.io.serializer.WritableSerialization is the implementation that plugs Writable into this framework. (czhangmz: Apache Thrift and Google Protocol Buffers are both popular serialization frameworks, and they are commonly used as a format for persistent binary data.)
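To make the Writable protocol concrete, here is a small, self-contained sketch (not from the Hadoop source) that round-trips an IntWritable and a VIntWritable through a byte array; it also shows the size difference the table above describes.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {

  // Serialize any Writable to a byte array using its write() method.
  static byte[] serialize(Writable w) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    w.write(out);
    out.close();
    return bytes.toByteArray();
  }

  // Rehydrate a Writable from a byte array using readFields().
  static void deserialize(Writable w, byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    w.readFields(in);
    in.close();
  }

  public static void main(String[] args) throws IOException {
    byte[] fixed = serialize(new IntWritable(5));      // always 4 bytes
    byte[] variable = serialize(new VIntWritable(5));  // 1 byte for a small value

    System.out.println("IntWritable(5)  -> " + fixed.length + " bytes");
    System.out.println("VIntWritable(5) -> " + variable.length + " bytes");

    IntWritable copy = new IntWritable();
    deserialize(copy, fixed);
    System.out.println("round-tripped value = " + copy.get());
  }
}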

3 RPC

The previous chapter looked at how individual objects are serialized; this one looks at how Hadoop processes call each other. Hadoop implements its own remote procedure call mechanism on top of the Writable machinery: parameters and return values are Java Strings, primitives or Writables, remote methods may only throw IOException, and there is no IDL and no generated stub or skeleton code of the CORBA kind, because Java reflection and dynamic proxies take their place. Everything lives in org.apache.hadoop.ipc: a generic Client, a generic Server, and the RPC class that maps Java method calls onto them. We look at the client first, then the server, then the RPC layer that ties them together.
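Before walking through Client and Server, here is a hedged sketch of how this layer is typically used in Hadoop 1.x: a protocol is just a Java interface extending VersionedProtocol, the server side is built with RPC.getServer() (quoted later in this chapter), and the client obtains a dynamic proxy with RPC.getProxy(). The EchoProtocol interface, its implementation and the port number are invented for the example, and the exact getServer/getProxy overloads vary between Hadoop versions.

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical protocol: parameters and results are Strings/Writables, methods throw IOException.
interface EchoProtocol extends VersionedProtocol {
  long VERSION = 1L;
  String echo(String message) throws java.io.IOException;
}

public class EchoServer implements EchoProtocol {
  public String echo(String message) { return "echo: " + message; }

  public long getProtocolVersion(String protocol, long clientVersion) { return VERSION; }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Server side: an ipc.Server bound to a host/port, dispatching calls to this instance.
    Server server = RPC.getServer(new EchoServer(), "0.0.0.0", 9000, conf);
    server.start();

    // Client side: a dynamic proxy whose Invoker ships an Invocation through ipc.Client.
    EchoProtocol proxy = (EchoProtocol) RPC.getProxy(
        EchoProtocol.class, EchoProtocol.VERSION,
        new InetSocketAddress("localhost", 9000), conf);
    System.out.println(proxy.echo("hello"));

    RPC.stopProxy(proxy);
    server.stop();
  }
}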

We start with the IPC client, org.apache.hadoop.ipc.Client. Its structure is as follows:

CallConnectionConnectionIdCall InvocationVOConnection ThreadConnectionId ClientServerHDFSNameNode/DataNodeClientClientConnectionIdConnectionIDConnectionIdInetSocketAddressIP++InetSocketAddressClient.ConnectionRPCConnectionRPCConnectionidClient.CallConnectionHashCall//callsprivate Hashtable calls = new Hashtable();RPCaddCallConnectionJavaStringWritableCallObjectWritableClient.Connectionsocket/Client. writeHeader()Writable/Client.ConnectionsocketCallCallCallObejctwaitnotifyRPCClientClient.ConnectionClientcall()/** Make a call, passing param, to the IPC server defined by * remoteId, returning the value. * Throws exceptions if there are network problems or if the remote code * threw an exception. */ public Writable call(Writable param, ConnectionId remoteId) throws InterruptedException, IOException { Call call = new Call(param); Connection connection = getConnection(remoteId, call);Comment by czhangmz: connection.sendParam(call); // send the parameter boolean interrupted = false; synchronized (call) { while (!call.done) { try { call.wait(); // wait for the result } catch (InterruptedException ie) { // save the fact that we were interrupted interrupted = true; } }

if (interrupted) { // set the interrupt flag now that we are done waiting Thread.currentThread().interrupt(); }

if (call.error != null) { if (call.error instanceof RemoteException) { call.error.fillInStackTrace(); throw call.error; } else { // local exception // use the connection because it will reflect an ip change, unlike // the remoteId throw wrapException(connection.getRemoteAddress(), call.error); } } else { return call.value; } } }Client.getConnection() --> Client.Connection.setupIOstreams() --> Client.Connection.setupConnection()socketClient.Connection.sendParam()java iosocketClient.Connection.setupIOstreams()Client.Connection.run()public void run() { if (LOG.isDebugEnabled()) LOG.debug(getName() + ": starting, having connections " + connections.size());

while (waitForWork()) {//wait here for work - read or close connection receiveResponse(); // } close(); if (LOG.isDebugEnabled()) LOG.debug(getName() + ": stopped, remaining connections " + connections.size()); }Client.Connection. receiveResponse ()/* Receive a response. * Because only one receiver, so no synchronization on in. */ private void receiveResponse() { if (shouldCloseConnection.get()) { return; } touch(); try { int id = in.readInt(); // try to read an id

if (LOG.isDebugEnabled()) LOG.debug(getName() + " got value #" + id);

Call call = calls.get(id);

int state = in.readInt(); // read call status if (state == Status.SUCCESS.state) { Writable value = ReflectionUtils.newInstance(valueClass, conf); value.readFields(in); // read value call.setValue(value); calls.remove(id); } else if (state == Status.ERROR.state) { call.setException(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); calls.remove(id); } else if (state == Status.FATAL.state) { // Close the connection markClosed(new RemoteException(WritableUtils.readString(in), WritableUtils.readString(in))); } } catch (IOException e) { markClosed(e); } }callcallclientorg.apache.hadoop.ipc. Server

CallListenerResponderConnectionHandlerCall Listener ListenerListener.ReaderReaderResponder RPCResponderConnection Handler callQueuecallServer/** Called for each call. */ public abstract Writable call(Class protocol, Writable param, long receiveTime) throws IOException;ServerServercallComment by czhangmz: HadoopRPCServerRPCServer.CallClient.CallServer.CallidparamClient.CallconnectionCallconnectiontimestampresponseWritableServer.ConnectionsocketHadoopServerJavaNIOsocketsocketServeracceptsocketListenerServer.HandlerunServer.CallServer.callResponderNIOResponderServercallcallServerListenerListenerrun()public void run() { LOG.info(getName() + ": starting"); SERVER.set(Server.this); while (running) { SelectionKey key = null; try { selector.select(); Iterator iter = selector.selectedKeys().iterator(); while (iter.hasNext()) { key = iter.next(); iter.remove(); try { if (key.isValid()) { if (key.isAcceptable()) doAccept(key);Comment by czhangmz: } } catch (IOException e) { } key = null; } } catch (OutOfMemoryError e) { // we can run out of memory if we have too many threads // log the event and sleep for a minute and give // some thread(s) a chance to finish LOG.warn("Out of Memory in server select", e); closeCurrentConnection(key, e); cleanupConnections(true); try { Thread.sleep(60000); } catch (Exception ie) {} } catch (Exception e) { closeCurrentConnection(key, e); } cleanupConnections(false); } LOG.info("Stopping " + this.getName());

synchronized (this) { try { acceptChannel.close(); selector.close(); } catch (IOException e) { }

selector= null; acceptChannel= null; // clean up all connections while (!connectionList.isEmpty()) { closeConnection(connectionList.remove(0)); } } }ListenerdoAccept ()void doAccept(SelectionKey key) throws IOException, OutOfMemoryError { Connection c = null; ServerSocketChannel server = (ServerSocketChannel) key.channel(); SocketChannel channel; while ((channel = server.accept()) != null) { channel.configureBlocking(false); channel.socket().setTcpNoDelay(tcpNoDelay); Reader reader = getReader(); //readersreader try { reader.startAdd(); //readSelectoraddingtrue SelectionKey readKey = reader.registerChannel(channel);// c = new Connection(readKey, channel, System.currentTimeMillis());// readKey.attach(c); //connectionreadKey synchronized (connectionList) { connectionList.add(numConnections, c); numConnections++; } if (LOG.isDebugEnabled()) LOG.debug("Server connection from " + c.toString() + "; # active connections: " + numConnections + "; # queued calls: " + callQueue.size()); } finally { //addingfalsenotify()reader,Listenerreaderwait() reader.finishAdd(); }

} }readerreaderdoRead()doRead()Server.ConnectionreadAndProcess()readAndProcess()Server.ConnectionprocessOneRpc()processData()processData()callcallcallQueuerpccallServerHandlercallrun()final Call call = callQueue.take(); // pop the queue; maybe blocked herevalue = call(call.connection.protocol, call.param, Comment by czhangmz: ipc.Servercall()call()RPC.Server call.timestamp);synchronized (call.connection.responseQueue) { // setupResponse() needs to be sync'ed together with // responder.doResponse() since setupResponse may use // SASL to encrypt response data and SASL enforces // its own message ordering. setupResponse(buf, call, (error == null) ? Status.SUCCESS : Status.ERROR, value, errorClass, error); // Discard the large buf and reset it back to // smaller size to freeup heap if (buf.size() > maxRespSize) { LOG.warn("Large response size " + buf.size() + " for call " + call.toString()); buf = new ByteArrayOutputStream(INITIAL_RESP_BUF_SIZE); } responder.doRespond(call);Comment by czhangmz: }Server.ResponderdoRespond()void doRespond(Call call) throws IOException { synchronized (call.connection.responseQueue) { call.connection.responseQueue.addLast(call); if (call.connection.responseQueue.size() == 1) { //writeSelector processResponse(call.connection.responseQueue, true); } } }ClientServerRPC.javastubskeletonCORBAIDLstubskeletonorg.apache.hadoop.rpc:

The RPC class defines a handful of inner classes. Invocation is a value object, itself a Writable, that packages a call as methodName, parameterClasses and parameters. ClientCache caches Client instances in a hash map keyed by socket factory, so connections can be shared. Invoker implements java.lang.reflect.InvocationHandler: it is the handler behind the dynamic proxy handed to callers, and its invoke() turns a method call into an Invocation and ships it through Client. RPC.Server extends org.apache.hadoop.ipc.Server and, on the server side, unwraps the Invocation and applies it to the target object by reflection. RPC.Invoker.invoke() looks like this:

public Object invoke(Object proxy, Method method, Object[] args)
    throws Throwable {
  final boolean logDebug = LOG.isDebugEnabled();
  long startTime = 0;
  if (logDebug) {
    startTime = System.currentTimeMillis();
  }

ObjectWritable value = (ObjectWritable) client.call(new Invocation(method, args), remoteId); if (logDebug) { long callTime = System.currentTimeMillis() - startTime; LOG.debug("Call: " + method.getName() + " " + callTime); } return value.get(); }invoke() method.invoke(ac, arg); method.invoke(ac, arg); JVMHadoopinvoke()ObjectWritable value = (ObjectWritable) client.call(new Invocation(method, args), remoteId);InvocationVOClientcall()RPC.Serverorg.apache.hadoop.ipc.ServerRPCipc.RPCgetServer()/** Construct a server for a protocol implementation instance listening on a * port and address, with a secret manager. */ public static Server getServer(final Object instance, final String bindAddress, final int port, final int numHandlers, final boolean verbose, Configuration conf, SecretManager previous RECOVER_UPGRADEmv previous.tmp -> current COMPLETE_FINALIZErm finalized.tmp COMPLETE_ROLLBACKrm removed.tmp RECOVER_ROLLBACKmv removed.tmp -> current COMPLETE_CHECKPOINTmv lastcheckpoint.tmp -> previous.checkpoint RECOVER_CHECKPOINTmv lastcheckpoint.tmp -> current RECOVER_UPGRADE1. current->previous.tmp2. current3. previous.tmp->previous previous.tmpcurrentprevious.tmpcurrentStorageDirectoryStorageInfoStorageDirectoryVERSIONStorageDirectoryread/write/DataNodeVERSION#FriJun1415:36:32CST2013namespaceID=1584403768storageID=DS-1617068520-127.0.1.1-50010-1364969464023cTime=0storageType=DATA_NODElayoutVersion=-32StorageDirectoryin_use.lock/StorageDirectorylockunlockStorageStorageDirectoryStorageDataStorageStorageDataNodeDataNode//DataStoragedoUpgrade/doRollback/doFinalizeDataStorageformatDataNodeStorageStorageDirectoryDataStorageStorageFSDatasetStorageBlockFSDataset

BlockBlockblk_3148782637964391313blk_3148782637964391313_242812.metablockId3148782637964391313242812numBytesBlockDatanodeBlockInfoBlockBlockFSVolumedetachdetachsnapshotsnapshotcurrentcurrentdetachsnapshotcurrentcurrentsnapshotdetachcopy-on-writeDatanodeBlockInfodetachBlockBlockdetachBlockDatanodeBlockInfoFSVolumeSetFSVolumeFSDirDataNodeStorageHDFSBlockStorageFSDatasetFSVolumeStorageFSDirFSVolumeFSVolumeSetFSDatasetFSVolumeSetFSDirHDFSFSDirBlockStorageFSDirFSDirFSDirgetBlockInfoBlockgetVolumeMapBlockDatanodeBlockInfoFSVolumeStoragedetachFSVolumeFSVolumerecoverDetachedBlocksdetachStoragedetachdetachFSVolumeFSVolumeBlockFSVolumeBlockFSVolumeSetFSVolumeHDFSchunkFSDatasetActiveFileActiveFileActiveFileFSDatasetFSDatasetFSDatasetInterfaceFSDatasetInterfaceDataNodeFSDatasetFSVolumeSet volumes;private HashMap ongoingCreates = new HashMap();private HashMap volumeMap = new HashMap();;volumesFSDatasetStorageongoingCreatesBlockActiveFileBlockongoingCreatesFSDatasetpublic long getMetaDataLength(Block b) throws IOException;blockblockID public MetaDataInputStream getMetaDataInputStream(Block b) throws IOException;blockblockID public boolean metaFileExists(Block b) throws IOException;block public long getLength(Block b) throws IOException;block public Block getStoredBlock(long blkid) throws IOException;BlockIDBlock public InputStream getBlockInputStream(Block b) throws IOException;public InputStream getBlockInputStream(Block b, long seekOffset) throws IOException;Block public BlockInputStreams getTmpInputStreams(Block b, long blkoff, long ckoff) throws IOException;Blocktmptmpcurrentcurrent public BlockWriteStreams writeToBlock(Block b, boolean isRecovery) throws IOException;blockBlockWriteStreamsisRecoveryblockblockwriteToBlockActiveFileongoingCreatesBlockWriteStreamsActiveFileActiveFilethreadsblk_3148782637964391313DataNodeBlock ID3148782637964391313DataNodetmp/blk_3148782637964391313metatmp/blk_3148782637964391313_XXXXXX.metaXXXXXXisRecoverytruefinalizeBlockdetachedwriteToBlockinterruptongoingCreates/ActiveFileongoingCreates public void updateBlock(Block oldblock, Block newblock) throws IOException;blockupdateBlockupdateBlocktryUpdateBlocktryUpdateBlockvolumeMaptryUpdateBlockupdateBlockjoin public void finalizeBlock(Block b) throws IOException;finalizewriteToBlockblockBlocktmpcurrentFSDatasetfinalizeBlockongoingCreatesblockblockDatanodeBlockInfovolumeMapblk_3148782637964391313DataNodeBlock ID3148782637964391313DataNodetmp/blk_3148782637964391313currentsubdir12tmp/blk_3148782637964391313current/subdir12/blk_3148782637964391313metacurrent/subdir12

public void unfinalizeBlock(Block b) throws IOException;
Discards a block that writeToBlock opened but that has not yet been finalized with finalizeBlock.

public boolean isValidBlock(Block b);
Is this a valid block known to the dataset?

public void invalidate(Block invalidBlks[]) throws IOException;
Deletes the given blocks from local storage.

public void validateBlockMetadata(Block b) throws IOException;
Sanity-checks the block's metadata against what is actually stored on this DataNode.
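The on-disk naming convention described earlier (a data file blk_<blockId> plus a metadata file blk_<blockId>_<generationStamp>.meta, e.g. blk_3148782637964391313 and blk_3148782637964391313_242812.meta) can be illustrated with a small parsing sketch; this is illustrative code, not part of FSDataset.

// Illustrative only: recover blockId and generation stamp from HDFS block file names.
public class BlockFileNames {

  // e.g. "blk_3148782637964391313" -> 3148782637964391313
  static long blockIdOf(String dataFileName) {
    if (!dataFileName.startsWith("blk_")) {
      throw new IllegalArgumentException("not a block file: " + dataFileName);
    }
    return Long.parseLong(dataFileName.substring("blk_".length()));
  }

  // e.g. "blk_3148782637964391313_242812.meta" -> 242812
  static long generationStampOf(String metaFileName) {
    int underscore = metaFileName.lastIndexOf('_');
    int dot = metaFileName.lastIndexOf(".meta");
    return Long.parseLong(metaFileName.substring(underscore + 1, dot));
  }

  public static void main(String[] args) {
    System.out.println(blockIdOf("blk_3148782637964391313"));                      // 3148782637964391313
    System.out.println(generationStampOf("blk_3148782637964391313_242812.meta"));  // 242812
  }
}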

5 HDFS data flow

Reading and writing block contents does not go through RPC: bulk data would swamp the RPC layer, so the DataNode serves it over a dedicated streaming connection. DataXceiverServer listens on the data-transfer port and hands every accepted socket to a new DataXceiver thread; the DataXceiver in turn uses BlockSender to stream blocks out and BlockReceiver to absorb blocks coming in.

DataXceiverServer's run() loop does little more than accept sockets and spawn a DataXceiver per connection. The DataXceiver reads an operation code from the stream and dispatches on it; the operations are:

OP_WRITE_BLOCK (80): write a block
OP_READ_BLOCK (81): read a block
OP_READ_METADATA (82): read a block's metadata
OP_REPLACE_BLOCK (83): replace a block
OP_COPY_BLOCK (84): copy a block
OP_BLOCK_CHECKSUM (85): get a block checksum

We follow the write path, OP_WRITE_BLOCK (80), as triggered by $HADOOP_HOME/bin/hadoop fs -put or $HADOOP_HOME/bin/hadoop fs -copyFromLocal. In outline: the client asks the NameNode to create the file (this code line also supports append); the NameNode records the file but allocates no blocks yet; the client copies the local data with IOUtils.copyBytes(), slicing it into packets; for each block it asks the NameNode for a list of target DataNodes (three by default); it streams packets to the first DataNode, which forwards them to the second, which forwards them to the third; and ACKs flow back up the pipeline to the client.
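A hedged sketch of the dispatch structure implied by the operation codes above; the constant names and values follow the list, but the method body is a simplification, not the actual DataXceiver code.

import java.io.DataInputStream;
import java.io.IOException;

// Simplified illustration of how a DataXceiver-style handler dispatches on the op code
// read from the data-transfer stream.
public class OpDispatch {
  public static final byte OP_WRITE_BLOCK    = 80;
  public static final byte OP_READ_BLOCK     = 81;
  public static final byte OP_READ_METADATA  = 82;
  public static final byte OP_REPLACE_BLOCK  = 83;
  public static final byte OP_COPY_BLOCK     = 84;
  public static final byte OP_BLOCK_CHECKSUM = 85;

  static void dispatch(DataInputStream in) throws IOException {
    short version = in.readShort();   // DATA_TRANSFER_VERSION comes first on the wire
    byte op = in.readByte();          // then the operation code
    switch (op) {
      case OP_WRITE_BLOCK:    /* receive a block and mirror it downstream */ break;
      case OP_READ_BLOCK:     /* stream a block back to the caller        */ break;
      case OP_READ_METADATA:  /* return the block's .meta contents        */ break;
      case OP_REPLACE_BLOCK:  /* replace a replica                        */ break;
      case OP_COPY_BLOCK:     /* copy a replica to another datanode       */ break;
      case OP_BLOCK_CHECKSUM: /* compute and return a block checksum      */ break;
      default: throw new IOException("Unknown opcode " + op);
    }
  }
}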

Following the write step by step:

1. The client calls create() on DistributedFileSystem.
2. DistributedFileSystem issues an RPC to the namenode to create the file in the namespace; no blocks are associated with it yet.
3. DistributedFileSystem returns an FSDataOutputStream wrapping a DFSOutputStream. As the client writes, DFSOutputStream splits the data into packets and appends them to an internal data queue.
4. The DataStreamer thread drains the data queue: it asks the namenode to allocate a block and a set of datanodes, arranges them into a pipeline, and writes each packet to the first datanode, which forwards it to the next.
5. DFSOutputStream also keeps an ack queue; a packet leaves it only after every datanode in the pipeline has acknowledged it.
6. When the client calls close(), the remaining packets are flushed and acknowledged.
7. Finally the client notifies the namenode that the file is complete.

On the client the command line is handled by org.apache.hadoop.fs.FsShell:

public int run(String argv[]) throws Exception {
  ...
  if ("-put".equals(cmd) || "-copyFromLocal".equals(cmd)) {
    Path[] srcs = new Path[argv.length-2];
    for (int j=0 ; i < argv.length-1 ;)
      srcs[j++] = new Path(argv[i++]);
    copyFromLocal(srcs, argv[i++]);
  }
  ...
}

org.apache.hadoop.fs. FsShell:void copyFromLocal(Path[] srcs, String dstf) throws IOException { Path dstPath = new Path(dstf); FileSystem dstFs = dstPath.getFileSystem(getConf()); if (srcs.length == 1 && srcs[0].toString().equals("-")) copyFromStdin(dstPath, dstFs); else dstFs.copyFromLocalFile(false, false, srcs, dstPath); }

org.apache.hadoop.fs. FileSystem:public void copyFromLocalFile(boolean delSrc, boolean overwrite, Path[] srcs, Path dst) throws IOException { Configuration conf = getConf(); FileUtil.copy(getLocal(conf), srcs, this, dst, delSrc, overwrite, conf); }

org.apache.hadoop.fs. FileUtil:public static boolean copy(FileSystem srcFS, Path[] srcs, FileSystem dstFS, Path dst, boolean deleteSource, boolean overwrite, Configuration conf) throws IOException {if (srcs.length == 1) return copy(srcFS, srcs[0], dstFS, dst, deleteSource, overwrite, conf);for (Path src : srcs) { try { if (!copy(srcFS, src, dstFS, dst, deleteSource, overwrite, conf)) returnVal = false; } catch (IOException e) { gotException = true; exceptions.append(e.getMessage()); exceptions.append("\n"); } }return returnVal;}FsShellhadooprun()hadoop shellshell-put-copyFromLocalcopyFromLocal()shellcopyFromLocalFile()FileUtil.copy()copy()org.apache.hadoop.fs. FileUtil:public static boolean copy(FileSystem srcFS, Path src, FileSystem dstFS, Path dst, boolean deleteSource, boolean overwrite, Configuration conf) throws IOException {dst = checkDest(src.getName(), dstFS, dst, overwrite);

if (srcFS.getFileStatus(src).isDir()) { checkDependencies(srcFS, src, dstFS, dst); if (!dstFS.mkdirs(dst)) { return false; } FileStatus contents[] = srcFS.listStatus(src); for (int i = 0; i < contents.length; i++) { copy(srcFS, contents[i].getPath(), dstFS, new Path(dst, contents[i].getPath().getName()), deleteSource, overwrite, conf); } } else if (srcFS.isFile(src)) { InputStream in=null; OutputStream out = null; try { in = srcFS.open(src); out = dstFS.create(dst, overwrite);Comment by czhangmz: IOUtils.copyBytes(in, out, conf, true); } catch (IOException e) { IOUtils.closeStream(out); IOUtils.closeStream(in); throw e; } } else { throw new IOException(src.toString() + ": No such file or directory"); } if (deleteSource) { return srcFS.delete(src, true); } else { return true; }}copy()confinoutIOUtils.copyBytes()dstFS.create()Comment by czhangmz:

The interesting call here is create(). On the abstract FileSystem, create() returns an FSDataOutputStream, and every concrete file system supplies the underlying OutputStream it wraps. For HDFS the concrete class is DistributedFileSystem: its create() calls dfs.create(), which produces a DFSOutputStream, and wraps it in the FSDataOutputStream handed back to the caller. A usage sketch follows before we descend into DFSClient.
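For orientation, this is how application code typically reaches the path just described through the public FileSystem API; a minimal sketch, with an invented path and payload and no error handling.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // loads the cluster configuration
    FileSystem fs = FileSystem.get(conf);            // DistributedFileSystem when fs.default.name is hdfs://

    Path file = new Path("/tmp/write-example.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(file, true);  // RPC to the NameNode, returns a stream over DFSOutputStream
    try {
      out.write("hello hdfs\n".getBytes("UTF-8"));   // buffered into chunks and packets by DFSOutputStream
    } finally {
      out.close();                                   // flushes remaining packets, completes the file at the NameNode
    }
    fs.close();
  }
}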

dfs.create()DFSClientcreate()OutputStreamDFSOutputStreamDFSOutputStreamnamenodestreamer.start()pipelineDataStreamerdata queueblock64M64Kpacket1000packets/blockDataStreamernamenodeorg.apache.hadoop.hdfs.server.namenode. NameNode:public void create(String src, FsPermission masked, String clientName, boolean overwrite, boolean createParent, short replication, long blockSize ) throws IOException { String clientMachine = getClientMachine(); if (stateChangeLog.isDebugEnabled()) { stateChangeLog.debug("*DIR* NameNode.create: file " +src+" for "+clientName+" at "+clientMachine); } if (!checkPathLength(src)) { throw new IOException("create: Pathname too long. Limit " + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels."); } namesystem.startFile(src, new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(), null, masked), clientName, clientMachine, overwrite, createParent, replication, blockSize); myMetrics.incrNumFilesCreated(); myMetrics.incrNumCreateFileOps(); }

org.apache.hadoop.hdfs.server.namenode. FSNamesystemvoid startFile(String src, PermissionStatus permissions, String holder, String clientMachine, boolean overwrite, boolean createParent, short replication, long blockSize ) throws IOException { startFileInternal(src, permissions, holder, clientMachine, overwrite, false, createParent, replication, blockSize); getEditLog().logSync(); if (auditLog.isInfoEnabled() && isExternalInvocation()) { final HdfsFileStatus stat = dir.getFileInfo(src); logAuditEvent(UserGroupInformation.getCurrentUser(), Server.getRemoteIp(), "create", src, null, stat); } }

org.apache.hadoop.hdfs.server.namenode. FSNamesystemprivate synchronized void startFileInternal(String src, PermissionStatus permissions, String holder, String clientMachine, boolean overwrite, boolean append, boolean createParent, short replication, long blockSize ) throws IOException {DatanodeDescriptor clientNode = host2DataNodeMap.getDatanodeByHost(clientMachine);

if (append) { // // Replace current node with a INodeUnderConstruction. // Recreate in-memory lease record. // INodeFile node = (INodeFile) myFile; INodeFileUnderConstruction cons = new INodeFileUnderConstruction( node.getLocalNameBytes(), node.getReplication(), node.getModificationTime(), node.getPreferredBlockSize(), node.getBlocks(), node.getPermissionStatus(), holder, clientMachine, clientNode); dir.replaceNode(src, node, cons); leaseManager.addLease(cons.clientName, src);

} else { // Now we can add the name to the filesystem. This file has no // blocks associated with it. // checkFsObjectLimit();

// increment global generation stamp long genstamp = nextGenerationStamp(); INodeFileUnderConstruction newNode = dir.addFile(src, permissions, replication, blockSize, holder, clientMachine, clientNode, genstamp); if (newNode == null) { throw new IOException("DIR* NameSystem.startFile: " + "Unable to add file to namespace."); } leaseManager.addLease(newNode.clientName, src); if (NameNode.stateChangeLog.isDebugEnabled()) { NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: " +"add "+src+" to namespace for "+holder); } }}namenodecreate()FSNameSystemstartFileInternale()hadoopappendINodenodeunder constructionblocksstampclientIOUtils.copyBytes()client&block

IOUtils.copyBytes() ultimately writes through FSOutputSummer, which computes a checksum for every chunk of data and calls writeChecksumChunk(); for HDFS that lands in DFSClient.DFSOutputStream.writeChunk().

DFSOutputStream extends FSOutputSummer, so the chunk-by-chunk writes issued by the client end up in its writeChunk() (recall that DistributedFileSystem.create() handed the client an FSDataOutputStream wrapping this DFSOutputStream). The pipelining then works roughly as follows for a block replicated on datanode1, datanode2 and datanode3: the client sends packet 1 to datanode1; while datanode1 forwards packet 1 to datanode2, the client sends packet 2 to datanode1; while datanode2 forwards packet 1 to datanode3, datanode1 forwards packet 2 to datanode2 and the client sends packet 3, and so on. When datanode3 has stored packet 1 it acks it to datanode2, which acks to datanode1, which acks to the client, at which point the packet can finally leave the ack queue.
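The sizes quoted earlier in this chapter (64 MB blocks, 64 KB packets, roughly 1000 packets per block) follow from a little arithmetic over the checksum chunk size. A sketch, assuming the usual 1.x defaults of io.bytes.per.checksum = 512 bytes and a 4-byte CRC per chunk; the exact packet header overhead is ignored here.

// Back-of-the-envelope sizing for the DFS write path, using assumed 1.x defaults.
public class PacketSizing {
  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;  // dfs.block.size: 64 MB
    int packetSize = 64 * 1024;          // dfs.write.packet.size: 64 KB
    int bytesPerChecksum = 512;          // io.bytes.per.checksum
    int checksumSize = 4;                // one CRC32 checksum per chunk

    // Each chunk travels with its checksum, so a packet holds packetSize / (512 + 4) chunks.
    int chunkOnWire = bytesPerChecksum + checksumSize;          // 516 bytes
    int chunksPerPacket = packetSize / chunkOnWire;             // 127 chunks
    int payloadPerPacket = chunksPerPacket * bytesPerChecksum;  // 65024 bytes of user data

    long packetsPerBlock = blockSize / payloadPerPacket;        // about 1032, i.e. roughly 1000

    System.out.println("chunks per packet : " + chunksPerPacket);
    System.out.println("payload per packet: " + payloadPerPacket + " bytes");
    System.out.println("packets per block : " + packetsPerBlock);
  }
}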

org.apache.hadoop.hdfs.DFSClient.DFSOutputStream// @see FSOutputSummer#writeChunk() @Override protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException { checkOpen(); isClosed(); int cklen = checksum.length; int bytesPerChecksum = this.checksum.getBytesPerChecksum(); if (len > bytesPerChecksum) { throw new IOException("writeChunk() buffer size is " + len + " is larger than supported bytesPerChecksum " + bytesPerChecksum); } if (checksum.length != this.checksum.getChecksumSize()) { throw new IOException("writeChunk() checksum size is supposed to be " + this.checksum.getChecksumSize() + " but found to be " + checksum.length); }

synchronized (dataQueue) { // If queue is full, then wait till we can create enough space while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) { try { dataQueue.wait(); } catch (InterruptedException e) { } } isClosed(); if (currentPacket == null) { currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock); if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno + ", src=" + src + ", packetSize=" + packetSize + ", chunksPerPacket=" + chunksPerPacket + ", bytesCurBlock=" + bytesCurBlock); } }

currentPacket.writeChecksum(checksum, 0, cklen); currentPacket.writeData(b, offset, len); currentPacket.numChunks++; bytesCurBlock += len;

// If packet is full, enqueue it for transmission // if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) { if (LOG.isDebugEnabled()) { LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno + ", src=" + src + ", bytesCurBlock=" + bytesCurBlock + ", blockSize=" + blockSize + ", appendChunk=" + appendChunk); } // // if we allocated a new packet because we encountered a block // boundary, reset bytesCurBlock. // if (bytesCurBlock == blockSize) { currentPacket.lastPacketInBlock = true; bytesCurBlock = 0; lastFlushOffset = 0; } enqueueCurrentPacket(); // If this was the first write after reopening a file, then the above // write filled up any partial chunk. Tell the summer to generate full // crc chunks from now on. if (appendChunk) { appendChunk = false; resetChecksumChunk(bytesPerChecksum); } int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize); computePacketChunkSize(psize, bytesPerChecksum); } } //LOG.debug("DFSClient writeChunk done length " + len + // " checksum length " + cklen); }

org.apache.hadoop.hdfs.DFSClient.DFSOutputStreamprivate synchronized void enqueueCurrentPacket() { synchronized (dataQueue) { if (currentPacket == null) return; dataQueue.addLast(currentPacket); dataQueue.notifyAll(); lastQueuedSeqno = currentPacket.seqno; currentPacket = null; } }DFSOutputStreamorg.apache.hadoop.hdfs.DFSClient.DFSOutputStreamprivate LinkedList dataQueue = new LinkedList();private LinkedList ackQueue = new LinkedList();writeChunk()data queuepacketcurrentPacketnew Packetpacketchecksumpacketdata queuedata queueDataStreamerorg.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamerpublic void run() { long lastPacket = 0;

while (!closed && clientRunning) {

// if the Responder encountered an error, shutdown Responder if (hasError && response != null) { try { response.close(); response.join(); response = null; } catch (InterruptedException e) { } }

Packet one = null; synchronized (dataQueue) {

// process IO errors if any boolean doSleep = processDatanodeError(hasError, false);

// wait for a packet to be sent. long now = System.currentTimeMillis(); while ((!closed && !hasError && clientRunning && dataQueue.size() == 0 && (blockStream == null || ( blockStream != null && now - lastPacket < timeoutValue/2))) || doSleep) { long timeout = timeoutValue/2 - (now-lastPacket); timeout = timeout = blockSize) { throw new IOException("BlockSize " + blockSize + " is smaller than data size. " + " Offset of packet in block " + offsetInBlock + " Aborting file " + src); }

ByteBuffer buf = one.getBuffer(); // move packet from dataQueue to ackQueue if (!one.isHeartbeatPacket()) { dataQueue.removeFirst(); dataQueue.notifyAll(); synchronized (ackQueue) { ackQueue.addLast(one); ackQueue.notifyAll(); } } // write out data to remote datanode blockStream.write(buf.array(), buf.position(), buf.remaining()); if (one.lastPacketInBlock) { blockStream.writeInt(0); // indicate end-of-block } blockStream.flush(); lastPacket = System.currentTimeMillis();

if (LOG.isDebugEnabled()) { LOG.debug("DataStreamer block " + block + " wrote packet seqno:" + one.seqno + " size:" + buf.remaining() + " offsetInBlock:" + one.offsetInBlock + " lastPacketInBlock:" + one.lastPacketInBlock); } } catch (Throwable e) { LOG.warn("DataStreamer Exception: " + StringUtils.stringifyException(e)); if (e instanceof IOException) { setLastException((IOException)e); } hasError = true; } }

if (closed || hasError || !clientRunning) { continue; }

// Is this block full? if (one.lastPacketInBlock) { synchronized (ackQueue) { while (!hasError && ackQueue.size() != 0 && clientRunning) { try { ackQueue.wait(); // wait for acks to arrive from datanodes } catch (InterruptedException e) { } } } LOG.debug("Closing old block " + block); this.setName("DataStreamer for file " + src);

response.close(); // ignore all errors in Response try { response.join(); response = null; } catch (InterruptedException e) { }

if (closed || hasError || !clientRunning) { continue; }

synchronized (dataQueue) { IOUtils.cleanup(LOG, blockStream, blockReplyStream); nodes = null; response = null; blockStream = null; blockReplyStream = null; } } if (progress != null) { progress.progress(); }

// This is used by unit test to trigger race conditions. if (artificialSlowdown != 0 && clientRunning) { LOG.debug("Sleeping for artificial slowdown of " + artificialSlowdown + "ms"); try { Thread.sleep(artificialSlowdown); } catch (InterruptedException e) {} } } }DataStreamerrun()packet1spacketnextBlockOutPutStream()namenodedatanodesblocksorg.apache.hadoop.hdfs.DFSClient.DFSOutputStreamprivate DatanodeInfo[] nextBlockOutputStream(String client) throws IOException { LocatedBlock lb = null; boolean retry = false; DatanodeInfo[] nodes; int count = conf.getInt("dfs.client.block.write.retries", 3); boolean success; do { hasError = false; lastException = null; errorIndex = 0; retry = false; nodes = null; success = false; long startTime = System.currentTimeMillis();

DatanodeInfo[] excluded = excludedNodes.toArray(new DatanodeInfo[0]); lb = locateFollowingBlock(startTime, excluded.length > 0 ? excluded : null); block = lb.getBlock(); accessToken = lb.getBlockToken(); nodes = lb.getLocations(); // // Connect to first DataNode in the list. // success = createBlockOutputStream(nodes, clientName, false);

if (!success) { LOG.info("Abandoning block " + block); namenode.abandonBlock(block, src, clientName);

if (errorIndex < nodes.length) { LOG.info("Excluding datanode " + nodes[errorIndex]); excludedNodes.add(nodes[errorIndex]); }

// Connection failed. Let's wait a little bit and retry retry = true; } } while (retry && --count >= 0);

if (!success) { throw new IOException("Unable to create new block."); } return nodes; }nextBlockOutputStream()3datanodesblockslocateFollowingBlock()datanodecreateBlockOutputStream()org.apache.hadoop.hdfs.DFSClient.DFSOutputStreamprivate LocatedBlock locateFollowingBlock(long start, DatanodeInfo[] excludedNodes ) throws IOException { int retries = conf.getInt("dfs.client.block.write.locateFollowingBlock.retries", 5); long sleeptime = 400; while (true) { long localstart = System.currentTimeMillis(); while (true) { try { if (serverSupportsHdfs630) { return namenode.addBlock(src, clientName, excludedNodes); } else { return namenode.addBlock(src, clientName); } } catch (RemoteException e) { } } }locateFollowingBlock()5namenodedatanodesblocksnamenodedatanodesblocksnamenodeclientaddBlock()FSNamesystem.getAdditionalBlock()DatanodeDescriptor targets[]blockdatanodesInode[] pathINodespathINodeINodependingFileunder constructionINodenewBlockblockLocatedBlock()org.apache.hadoop.hdfs.DFSClient.DFSOutputStream. nextBlockOutputStream()lbclientorg.apache.hadoop.hdfs.DFSClient.DFSOutputStreamcreateBlockOutputStream()clientdatanodeorg.apache.hadoop.hdfs.DFSClient.DFSOutputStream// connects to the first datanode in the pipeline // Returns true if success, otherwise return failure. // private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) { short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS; String firstBadLink = ""; if (LOG.isDebugEnabled()) { for (int i = 0; i < nodes.length; i++) { LOG.debug("pipeline = " + nodes[i].getName()); } }

// persist blocks on namenode on next flush persistBlocks = true;

boolean result = false; try { LOG.debug("Connecting to " + nodes[0].getName()); InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName()); s = socketFactory.createSocket(); timeoutValue = 3000 * nodes.length + socketTimeout; NetUtils.connect(s, target, timeoutValue); s.setSoTimeout(timeoutValue); s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE); LOG.debug("Send buf size " + s.getSendBufferSize()); long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length + datanodeWriteTimeout;

// // Xmit header info to datanode // DataOutputStream out = new DataOutputStream( new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), DataNode.SMALL_BUFFER_SIZE)); blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION ); out.write( DataTransferProtocol.OP_WRITE_BLOCK ); out.writeLong( block.getBlockId() ); out.writeLong( block.getGenerationStamp() ); out.writeInt( nodes.length ); out.writeBoolean( recoveryFlag ); // recovery flag Text.writeString( out, client ); out.writeBoolean(false); // Not sending src node information out.writeInt( nodes.length - 1 ); for (int i = 1; i < nodes.length; i++) { nodes[i].write(out); } accessToken.write(out); checksum.writeHeader( out ); out.flush();

// receive ack for connect pipelineStatus = blockReplyStream.readShort(); firstBadLink = Text.readString(blockReplyStream); if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) { if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) { throw new InvalidBlockTokenException( "Got access token error for connect ack with firstBadLink as " + firstBadLink); } else { throw new IOException("Bad connect ack with firstBadLink as " + firstBadLink); } }

blockStream = out; result = true; // success

} catch (IOException ie) { } finally {} return result; }nodes[0]pipelinedatanodestampdatanodedatanodesdatanodefori1datanodepipelineorg.apache.hadoop.hdfs.DFSClient.DFSOutputStream.DataStreamer.run()datanodesblocksdatanodeonedata queueack queueackOKack queuedatanodeack queuedata queueblockStream.write()datanodeblockpacketblockpacket0blockdatanode&datanodeDataTransferProtocol.OP_WRITE_BLOCKdatanodeDataXceiverwriteBlock()org.apache.hadoop.hdfs.server.datanode.DataXceiverprivate void writeBlock(DataInputStream in) throws IOException { DatanodeInfo srcDataNode = null; LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() + " tcp no delay " + s.getTcpNoDelay()); // // Read in the header // Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong()); LOG.info("Receiving block " + block + " src: " + remoteAddress + " dest: " + localAddress); int pipelineSize = in.readInt(); // num of datanodes in entire pipeline boolean isRecovery = in.readBoolean(); // is this part of recovery? String client = Text.readString(in); // working on behalf of this client boolean hasSrcDataNode = in.readBoolean(); // is src node info present if (hasSrcDataNode) { srcDataNode = new DatanodeInfo(); srcDataNode.readFields(in); } int numTargets = in.readInt(); if (numTargets < 0) { throw new IOException("Mislabelled incoming datastream."); } DatanodeInfo targets[] = new DatanodeInfo[numTargets]; for (int i = 0; i < targets.length; i++) { DatanodeInfo tmp = new DatanodeInfo(); tmp.readFields(in); targets[i] = tmp; } Token accessToken = new Token(); accessToken.readFields(in); DataOutputStream replyOut = null; // stream to prev target replyOut = new DataOutputStream( NetUtils.getOutputStream(s, datanode.socketWriteTimeout)); if (datanode.isBlockTokenEnabled) { try { datanode.blockTokenSecretManager.checkAccess(accessToken, null, block, BlockTokenSecretManager.AccessMode.WRITE); } catch (InvalidToken e) { try { if (client.length() != 0) { replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN); Text.writeString(replyOut, datanode.dnRegistration.getName()); replyOut.flush(); } throw new IOException("Access token verification failed, for client " + remoteAddress + " for OP_WRITE_BLOCK for block " + block); } finally { IOUtils.closeStream(replyOut); } } }

DataOutputStream mirrorOut = null; // stream to next target DataInputStream mirrorIn = null; // reply from next target Socket mirrorSock = null; // socket to next target BlockReceiver blockReceiver = null; // responsible for data handling String mirrorNode = null; // the name:port of next target String firstBadLink = ""; // first datanode that failed in connection setup short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS; try { // open a block receiver and check if the block does not exist blockReceiver = new BlockReceiver(block, in, s.getRemoteSocketAddress().toString(), s.getLocalSocketAddress().toString(), isRecovery, client, srcDataNode, datanode);

// // Open network conn to backup machine, if // appropriate // if (targets.length > 0) { InetSocketAddress mirrorTarget = null; // Connect to backup machine mirrorNode = targets[0].getName(); mirrorTarget = NetUtils.createSocketAddr(mirrorNode); mirrorSock = datanode.newSocket(); try { int timeoutValue = datanode.socketTimeout + (HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets); int writeTimeout = datanode.socketWriteTimeout + (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets); NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue); mirrorSock.setSoTimeout(timeoutValue); mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE); mirrorOut = new DataOutputStream( new BufferedOutputStream( NetUtils.getOutputStream(mirrorSock, writeTimeout), SMALL_BUFFER_SIZE)); mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

// Write header: Copied from DFSClient.java! mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION ); mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK ); mirrorOut.writeLong( block.getBlockId() ); mirrorOut.writeLong( block.getGenerationStamp() ); mirrorOut.writeInt( pipelineSize ); mirrorOut.writeBoolean( isRecovery ); Text.writeString( mirrorOut, client ); mirrorOut.writeBoolean(hasSrcDataNode); if (hasSrcDataNode) { // pass src node information srcDataNode.write(mirrorOut); } mirrorOut.writeInt( targets.length - 1 ); for ( int i = 1; i < targets.length; i++ ) { targets[i].write( mirrorOut ); } accessToken.write(mirrorOut);

blockReceiver.writeChecksumHeader(mirrorOut); mirrorOut.flush();

// read connect ack (only for clients, not for replication req) if (client.length() != 0) { mirrorInStatus = mirrorIn.readShort(); firstBadLink = Text.readString(mirrorIn); if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) { LOG.info("Datanode " + targets.length + " got response for connect ack " + " from downstream datanode with firstbadlink as " + firstBadLink); } }

} catch (IOException e) { } }

// send connect ack back to source (only for clients) if (client.length() != 0) { if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) { LOG.info("Datanode " + targets.length + " forwarding connect ack to upstream firstbadlink is " + firstBadLink); } replyOut.writeShort(mirrorInStatus); Text.writeString(replyOut, firstBadLink); replyOut.flush(); }

// receive the block and mirror to the next target String mirrorAddr = (mirrorSock == null) ? null : mirrorNode; blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length);

// if this write is for a replication request (and not // from a client), then confirm block. For client-writes, // the block is finalized in the PacketResponder. if (client.length() == 0) { datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT); LOG.info("Received block " + block + " src: " + remoteAddress + " dest: " + localAddress + " of size " + block.getNumBytes()); }

if (datanode.blockScanner != null) { datanode.blockScanner.addBlock(block); } } catch (IOException ioe) { } finally { } }datanodedatanodesDatanodeInfo targets[]clientdatanodereplyOutdatanodedatanodeBlockReceiver, DataInputStream inDataNodeDataOutputStream mirrorOutDataNodeOutputStream outdatanodetargets.length>0datanode1datanodedatanodesdatanodereceiveBlock()datanodeorg.apache.hadoop.hdfs.server.datanode.BlockReceivervoid receiveBlock( DataOutputStream mirrOut, // output to next datanode DataInputStream mirrIn, // input from next datanode DataOutputStream replyOut, // output to previous datanode String mirrAddr, BlockTransferThrottler throttlerArg, int numTargets) throws IOException {

mirrorOut = mirrOut; mirrorAddr = mirrAddr; throttler = throttlerArg;

try { // write data chunk header if (!finalized) { BlockMetadataHeader.writeHeader(checksumOut, checksum); } if (clientName.length() > 0) { responder = new Daemon(datanode.threadGroup, new PacketResponder(this, block, mirrIn, replyOut, numTargets, Thread.currentThread())); responder.start(); // start thread to processes reponses }

/* * Receive until packet length is zero. */ while (receivePacket() > 0) {}

// flush the mirror out if (mirrorOut != null) { try { mirrorOut.writeInt(0); // mark the end of the block mirrorOut.flush(); } catch (IOException e) { handleMirrorOutError(e); } }

// wait for all outstanding packet responses. And then // indicate responder to gracefully shutdown. if (responder != null) { ((PacketResponder)responder.getRunnable()).close(); }

// if this write is for a replication request (and not // from a client), then finalize block. For client-writes, // the block is finalized in the PacketResponder. if (clientName.length() == 0) { // close the block/crc files close();

// Finalize the block. Does this fsync()? block.setNumBytes(offsetInBlock); datanode.data.finalizeBlock(block); datanode.myMetrics.incrBlocksWritten(); }

} catch (IOException ioe) { } finally { } }

org.apache.hadoop.hdfs.server.datanode.BlockReceiverprivate int receivePacket() throws IOException { int payloadLen = readNextPacket(); if (payloadLen bytesPerChecksum) { throw new IOException("Got wrong length during writeBlock(" + block + ") from " + inAddr + " " + "A packet can have only one partial chunk."+ " len = " + len + " bytesPerChecksum " + bytesPerChecksum); } partialCrc.update(pktBuf, dataOff, len); byte[] buf = FSOutputSummer.convertToByteStream(partialCrc, checksumSize); checksumOut.write(buf); LOG.debug("Writing out partial crc for data len " + len); partialCrc = null; } else { checksumOut.write(pktBuf, checksumOff, checksumLen); } datanode.myMetrics.incrBytesWritten(len);

/// flush entire packet before sending ack flush(); // update length only after flush to disk datanode.data.setVisibleLength(block, offsetInBlock); } } catch (IOException iex) { datanode.checkDiskError(iex); throw iex; } }

// put in queue for pending acks if (responder != null) { ((PacketResponder)responder.getRunnable()).enqueue(seqno, lastPacketInBlock); } if (throttler != null) { // throttle I/O throttler.throttle(payloadLen); } return payloadLen; }receiveBlock()receivePacket()packet0clientqueuedatanodeackdatanodeclientreceivePacket()packetpacketdatanodeclientackorg.apache.hadoop.hdfs.DFSClient.DFSOutputStream.ResponseProcessor.run()packetack queueOP_READ_BLOCK (81)

Reading is symmetrical:

1. The client calls open() on the FileSystem, a DistributedFileSystem for HDFS.
2. DistributedFileSystem asks the namenode over RPC for the locations of the first blocks; for each block the namenode returns the datanodes holding it, ordered by distance from the client.
3. DistributedFileSystem returns an FSDataInputStream wrapping a DFSInputStream, which manages the datanode and namenode I/O.
4. On read() the DFSInputStream connects to the closest datanode holding the current block and streams data back to the client.
5. When a block ends, DFSInputStream closes that connection and opens one to the best datanode for the next block, going back to the namenode for more block locations as needed.
6. When the client is done it calls close() on the FSDataInputStream.
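A minimal client-side sketch of this read path through the public API; the path is invented, mirroring the write sketch earlier.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/write-example.txt");   // hypothetical path
    FSDataInputStream in = fs.open(file);             // block locations are fetched from the namenode behind this
    try {
      in.seek(0);                                     // FSDataInputStream is seekable, unlike a plain InputStream
      BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
      String line;
      while ((line = reader.readLine()) != null) {    // bytes are streamed from the closest datanode per block
        System.out.println(line);
      }
    } finally {
      in.close();
    }
    fs.close();
  }
}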

6 DataNode

The DataNode is the workhorse of HDFS: it keeps block data on local disks and serves it to clients and to other DataNodes.

The class declaration already says a lot:

public class DataNode extends Configured
    implements InterDatanodeProtocol, ClientDatanodeProtocol,
               FSConstants, Runnable, DataNodeMXBean

The DataNode serves two RPC interfaces through its ipcServer: ClientDatanodeProtocol, called by clients, and InterDatanodeProtocol, called by other DataNodes. A static initializer loads the configuration:

static {
  Configuration.addDefaultResource("hdfs-default.xml");
  Configuration.addDefaultResource("hdfs-site.xml");
}

so values in hdfs-site.xml override the defaults shipped in hdfs-default.xml (found under src/hdfs in the source tree). Startup runs through main():

1. main() calls secureMain(), which calls createDataNode() to create and start the datanode.

2. createDataNode() calls instantiateDataNode() to construct the DataNode and then runDatanodeDaemon() to start its service threads.

3. instantiateDataNode() reads ${dfs.network.script} and ${dfs.data.dir} from the configuration and calls makeInstance().

4. makeInstance() checks that the data directories are usable and then constructs the DataNode.

5. The DataNode constructor calls startDataNode(); if that fails, shutdown() is called to clean up.

6.startDataNodenamenodenamenodedatanodemachineName:port namenodeversionidDataNodeJMXJava Management ExtensionsJavadatanodess50010DataXceiverServerssDataBlockScannerFSDatasetdatanodeinfoServerhttp://0.0.0.0:50075httpshttps50475infoServerDataBlockScannerServlethttp://0.0.0.0:50075/blockScannerReport ipcRPC50020mainsecureMain secureMain createDataNodeDataNodecreateDataNodeinstantiateDataNodeDataNoderunDatanodeDaemonrunDatanodeDaemonNameNodeDataNodeDataNodeDataNodeinstantiateDataNodeDataNodestoragemakeInstancemakeInstancenew DataNode(conf, dirs);Comment by czhangmz: HDFSstoragestartDataNodeDataNodeDataNodeNameNodesocketNameNodeDatanodeProtocol.versionRequestNamespaceInfoFSDatasetstoragedataDataXceiverServerrunDataBlockScannerofferServiceDataNodeHttpServeripcServerDataNodeDataNodeDataNodeNameNodeDataXceiverServeripcServerDataNodeDataNoderunstartDistributedUpgradeIfNeeded()/offerService()offerServiceofferServiceofferServiceNameNodeBlockDataNodeBlockBlockNameNodeDataNodeheartBeatIntervalsendHeartbeatBlockreceivedBlockListdelHintsreceivedBlockListDataNodedelHintsDataXceiverreplaceBlockdatanode.notifyNamenodeReceivedBlock(block, sourceID)DataNodesourceIDBlocksourceIDBlockBlockDataNodeBlockNameNode.blockReceivedBlockblockReportIntervalBlockNameNodeDataNode DNA_TRANSFERDataNode DNA_INVALIDATE DNA_SHUTDOWNDataNode DNA_REGISTERDataNode DNA_FINALIZE DNA_RECOVERBLOCKDataNodetransferBlockstransferBlocksBlockDataTransferDataTransferDataNodeOP_WRITE_BLOCKNameNodeleaseDataNodeFSDataset:FSDataset http://caibinbupt.iteye.com/blog/284365DataXceiverServer:,DataXceiver http://caibinbupt.iteye.com/blog/284979DataXceiver: http://caibinbupt.iteye.com/blog/284979 http://caibinbupt.iteye.com/blog/286533BlockReceiver: http://caibinbupt.iteye.com/blog/286259BlockSender:DataBlockScanner:http://caibinbupt.iteye.com/blog/286650
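The offerService() loop described above can be summarized with a simplified sketch; this is illustrative Java-shaped pseudocode, not the real DataNode code. The NamenodeStub interface stands in for the DatanodeProtocol proxy, and the numeric command values are placeholders; only the command names come from the text.

// Simplified illustration of the DataNode service loop described above.
public class OfferServiceSketch {

  interface NamenodeStub {                       // stand-in for the DatanodeProtocol RPC proxy
    int[] sendHeartbeat();                       // returns commands for this datanode
    void blockReceived(long[] newBlocks);
    int[] blockReport(long[] allBlocks);
  }

  // Command names follow the text; the numeric values here are placeholders for the sketch.
  static final int DNA_TRANSFER = 1, DNA_INVALIDATE = 2, DNA_SHUTDOWN = 3,
                   DNA_REGISTER = 4, DNA_FINALIZE = 5, DNA_RECOVERBLOCK = 6;

  void offerService(NamenodeStub namenode, long heartBeatInterval, long blockReportInterval)
      throws InterruptedException {
    long lastHeartbeat = 0, lastBlockReport = 0;
    while (true) {
      long now = System.currentTimeMillis();

      if (now - lastHeartbeat >= heartBeatInterval) {
        handle(namenode.sendHeartbeat());              // heartbeat carries capacity/load, returns commands
        lastHeartbeat = now;
      }

      long[] received = drainReceivedBlockList();      // blocks finished since the last notification
      if (received.length > 0) {
        namenode.blockReceived(received);
      }

      if (now - lastBlockReport >= blockReportInterval) {
        handle(namenode.blockReport(allLocalBlocks())); // full report of every block on local disks
        lastBlockReport = now;
      }

      Thread.sleep(1000);
    }
  }

  void handle(int[] commands) {
    for (int cmd : commands) {
      switch (cmd) {
        case DNA_TRANSFER:     /* copy blocks to other datanodes (transferBlocks) */ break;
        case DNA_INVALIDATE:   /* delete the listed blocks                        */ break;
        case DNA_SHUTDOWN:     /* stop the datanode                               */ break;
        case DNA_REGISTER:     /* re-register with the namenode                   */ break;
        case DNA_FINALIZE:     /* finalize a pending upgrade                      */ break;
        case DNA_RECOVERBLOCK: /* start block recovery                            */ break;
      }
    }
  }

  long[] drainReceivedBlockList() { return new long[0]; }  // placeholder
  long[] allLocalBlocks()        { return new long[0]; }   // placeholder
}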

7 NameNode

HDFS keeps all of its metadata on the NameNode and all of its data on the DataNodes. The NameNode manages the namespace as a tree of inodes, much as a UNIX file system does, and is periodically checkpointed by the SecondaryNameNode. The earlier chapters covered the data path between client and DataNode and between DataNodes, together with the DataNode-side interfaces InterDatanodeProtocol and ClientDatanodeProtocol; this chapter turns to the NameNode itself.

The NameNode serves three RPC interfaces. ClientProtocol is the client-facing API of HDFS; like GFS, HDFS does not aim for full POSIX semantics, and clients normally reach it through org.apache.hadoop.fs.FileSystem rather than calling it directly. DatanodeProtocol is what DataNodes call: a DataNode first registers itself and then, from offerService, calls sendHeartbeat / blockReport / blockReceived, plus errorReport for problems such as corrupt blocks found by BlockReceiver or DataBlockScanner, and nextGenerationStamp / commitBlockSynchronization during lease and block recovery. NamenodeProtocol is used by the SecondaryNameNode when talking to the NameNode.

The NameNode is started with bin/hadoop namenode, which runs the Java class org.apache.hadoop.hdfs.server.namenode.NameNode; the call chain is main --> createNameNode --> NameNode constructor --> initialize. The class declaration lists the protocols it serves (czhangmz: all exposed over the IPC layer):

public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol {

  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
  }

  public static final int DEFAULT_PORT = 8020;   // default RPC port

public static final Log LOG = LogFactory.getLog(NameNode.class.getName()); public static final Log stateChangeLog = LogFactory.getLog("org.apache.hadoop.hdfs.StateChange"); public FSNamesystem namesystem; // TODO: This should private. Use getNamesystem() instead. // Datanode /** RPC server */ private Server server; /** RPC server for HDFS Services communication. BackupNode, Datanodes and all other services should be connecting to this server if it is configured. Clients should only go to NameNode#server */ private Server serviceRpcServer;

/** RPC server address */ private InetSocketAddress serverAddress = null; /** RPC server for DN address */ protected InetSocketAddress serviceRPCAddress = null; /** httpServer */ private HttpServer httpServer; /** HTTP server address */ private InetSocketAddress httpAddress = null; private Thread emptier; /** only used for testing purposes */ private boolean stopRequested = false; /** Is service level authorization enabled? */ private boolean serviceAuthEnabled = false; static NameNodeInstrumentation myMetrics;//

FSNamesystem, also in org.apache.hadoop.hdfs.server.namenode, holds the NameNode's in-memory state, and HttpServer (in org.apache.hadoop.http, built on Jetty) provides the NameNode's web interface. createNameNode() parses the startup options before constructing the NameNode:

// create a NameNode according to the startup options
public static NameNode createNameNode(String argv[], Configuration conf)
    throws IOException {
  ...
  StartupOption startOpt = parseArguments(argv);
  ...
  switch (startOpt) {
    case FORMAT:    // format the name directories and exit
      boolean aborted = format(conf, true);
      System.exit(aborted ? 1 : 0);
    case FINALIZE:  // finalize a pending upgrade and exit
      aborted = finalize(conf, true);
      System.exit(aborted ? 1 : 0);
    default:
  }
  ...
  // otherwise construct the NameNode; the constructor calls initialize()
  NameNode namenode = new NameNode(conf);
  return namenode;
}

private void initialize(Configuration conf) throws IOException { ... //fsimageedits log this.namesystem = new FSNamesystem(this, conf); .... //RPCServerrpc108020 this.server = RPC.getServer(this, socAddr.getHostName(), socAddr.getPort(), handlerCount, false, conf, namesystem .getDelegationTokenSecretManager()); startHttpServer(conf);//httphttp://namenode:50070 hdfs .... this.server.start(); //RPC server ....//fs.trash.interval60 startTrashEmptier(conf); }

public static void main(String argv[]) throws Exception { try { ... NameNode namenode = createNameNode(argv, null); if (namenode != null) namenode.join(); } ... }}org.apache.hadoop.hdfs.server.namenode.FSNamesystemNamenodeNameNodeFSNamesystemNameNodeFSNamesystem =>FSImage =>DataNodeDataNode DataNode DataNodeLRUFSNamesystempublic class FSNamesystem implements FSConstants, FSNamesystemMBean, NameNodeMXBean, MetricsSource {

  // the namespace tree
  public FSDirectory dir;

  // BlocksMap: maps a Block to its metadata, i.e. the owning INode and the DataNodes holding replicas
  final BlocksMap blocksMap = new BlocksMap(DEFAULT_INITIAL_MAP_CAPACITY, DEFAULT_MAP_LOAD_FACTOR);

  // replicas reported as corrupt
  public CorruptReplicasMap corruptReplicas = new CorruptReplicasMap();

  // storageID -> DatanodeDescriptor, for every datanode that has registered
  NavigableMap<String, DatanodeDescriptor> datanodeMap = new TreeMap<String, DatanodeDescriptor>();

  // the subset of datanodeMap watched by the HeartbeatMonitor
  ArrayList<DatanodeDescriptor> heartbeats = new ArrayList<DatanodeDescriptor>();

  // blocks with fewer replicas than required
  private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();

  // replications that have been scheduled but not yet confirmed
  private PendingReplicationBlocks pendingReplications;

  // write leases: one writer per file
  public LeaseManager leaseManager = new LeaseManager(this);

  Daemon hbthread = null;           // HeartbeatMonitor: calls FSNamesystem.heartbeatCheck to expire dead datanodes
  public Daemon lmthread = null;    // LeaseMonitor thread
  Daemon smmthread = null;          // safe-mode monitor: leaves safe mode once the block threshold is reached
  public Daemon replthread = null;  // ReplicationMonitor: computes replication and deletion work for datanodes

  private ReplicationMonitor replmon = null; // Replication metrics

  // host name -> DatanodeDescriptor
  private Host2NodesMap host2DataNodeMap = new Host2NodesMap();

  // network topology: data centers and racks
  NetworkTopology clusterMap = new NetworkTopology();

  // resolves a DNS name / IP address to a rack ID
  private DNSToSwitchMapping dnsToSwitchMapping;

// ReplicationTargetChooser replicator; //DatanodeDatanodeNamenodeNamenode private HostsFileReader hostsReader; }FSNamesystemprivate void initialize(NameNode nn, Configuration conf) throws IOException { this.systemStart = now(); setConfigurationParameters(conf); dtSecretManager = createDelegationTokenSecretManager(conf);

this.nameNodeAddress = nn.getNameNodeAddress(); this.registerMBean(conf); // register the MBean for the FSNamesystemStutus this.dir = new FSDirectory(this, conf);StartupOption startOpt = NameNode.getStartupOption(conf);//fsimageedits this.dir.loadFSImage(getNamespaceDirs(conf), getNamespaceEditsDirs(conf), startOpt); long timeTakenToLoadFSImage = now() - systemStart; LOG.info("Finished loading FSImage in " + timeTakenToLoadFSImage + " msecs"); NameNode.getNameNodeMetrics().setFsImageLoadTime(timeTakenToLoadFSImage); this.safeMode = new SafeModeInfo(conf); setBlockTotal(); pendingReplications = new PendingReplicationBlocks( conf.getInt("dfs.replication.pending.timeout.sec", -1) * 1000L); if (isAccessTokenEnabled) { accessTokenHandler = new BlockTokenSecretManager(true, accessKeyUpdateInterval, accessTokenLifetime); } this.hbthread = new Daemon(new HeartbeatMonitor());//Datanode this.lmthread = new Daemon(leaseManager.new Monitor());// this.replmon = new ReplicationMonitor(); this.replthread = new Daemon(replmon); // hbthread.start(); lmthread.start(); replthread.start();//datanode this.hostsReader = new HostsFileReader(conf.get("dfs.hosts",""), conf.get("dfs.hosts.exclude",""));//, this.dnthread = new Daemon(new DecommissionManager(this).new Monitor( conf.getInt("dfs.namenode.decommission.interval", 30), conf.getInt("dfs.namenode.decommission.nodes.per.interval", 5))); dnthread.start();

this.dnsToSwitchMapping = ReflectionUtils.newInstance( conf.getClass("topology.node.switch.mapping.impl", ScriptBasedMapping.class, DNSToSwitchMapping.class), conf); /* If the dns to swith mapping supports cache, resolve network * locations of those hosts in the include list, * and store the mapping in the cache; so future calls to resolve * will be fast. */ if (dnsToSwitchMapping instanceof CachedDNSToSwitchMapping) { dnsToSwitchMapping.resolve(new ArrayList(hostsReader.getHosts())); } InetSocketAddress socAddr = NameNode.getAddress(conf); this.nameNodeHostName = socAddr.getHostName(); registerWith(DefaultMetricsSystem.INSTANCE); }FSDirectoryFSNamesystemFSDirectoryhdfsINodefile/blockINodeinodeField Comment by czhangmz: INodeINodeDirectoryINodeDirectoryINodeINodeINodeFileINodeFileINodeINodeDirectoryINodeFileDatanodeINodeFileUnderConstructionHDFSNamenodeINodeFileINodeFileHadoopINodeFileUnderConstructionINodeFileINodeFileINodeFileUnderConstructionINodeFileUnderConstructionHDFSDatanodeFSDirectoryFSDirectoryfilename->blocksetFSImage fsImageclass FSDirectory implements FSConstants, Closeable { final INodeDirectoryWithQuota rootDir;// INodeDirectoryhdfs, FSImage fsImage; // FSImage,}FSDirectory(FSNamesystem ns, Configuration conf) { this(new FSImage(), ns, conf); ... }

FSDirectory(FSImage fsImage, FSNamesystem ns, Configuration conf) { rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME, ns.createFsOwnerPermissions(new FsPermission((short)0755)), Integer.MAX_VALUE, -1); this.fsImage = fsImage; .... namesystem = ns; .... }

//FSNamesystemFSDirectory dirloadFSImagefsimageedits void loadFSImage(Collection dataDirs,Collection editsDirs,StartupOption startOpt) throws IOException { // format before starting up if requested if (startOpt == StartupOption.FORMAT) {// FORMAT fsImage.setStorageDirectories(dataDirs, editsDirs);// FSImage${dfs.name.dir},/tmp/hadoop/dfs/name, fsImage.format();// FSImage startOpt = StartupOption.REGULAR; } try { if (fsImage.recoverTransitionRead(dataDirs, editsDirs, startOpt)) { // (${dfs.name.dir}) fsImage.saveNamespace(true); } FSEditLog editLog = fsImage.getEditLog(); assert editLog != null : "editLog must be initialized"; if (!editLog.isOpen()) editLog.open(); fsImage.setCheckpointDirectories(null, null); } ... }loadFSImageFSImageFSImageEditLogFSImageEditLogEditLogFSImageFSImageEditLogFSImagenamenodenamenodehdfsrpcnamenodenamenodeFSNamesystem namesystemnamesystemnamesystemFSDirectory dirdirdirFSImage fsImagefsImagehdfsEditLogSecondrary Namenoe()namenodeEditLogfsimagefsimageEditLog

The INode* classes are the NameNode's in-memory representation of the namespace, one object per file or directory. The name is borrowed from the inode of UNIX file systems, where an inode describes a file, a directory or another object such as a socket (see http://zh.wikipedia.org/wiki/Inode).

The hierarchy consists of the abstract base class INode and its subclasses INodeDirectory, INodeFile, INodeDirectoryWithQuota and INodeFileUnderConstruction. INode carries the attributes shared by every entry in the HDFS namespace: the name, modificationTime and accessTime, a reference to the parent directory, and the permission. HDFS permissions follow the UNIX/Linux model of user, group and mode bits; INode packs the user, group and mode into a single long field. Besides the usual getters and setters, INode declares collectSubtreeBlocksAndClear, which gathers the blocks of the INode's subtree so that they can be deleted, and computeContentSummary, which summarizes the size and quota usage of the subtree.

INodeDirectory represents an HDFS directory. Its key field is private List<INode> children;, the entries of the directory, together with methods for adding, removing and looking up children. INodeDirectoryWithQuota extends INodeDirectory with a namespace/disk-space quota.

INodeFile represents an ordinary HDFS file. Its key field is protected BlockInfo blocks[] = null;, the blocks of the file; BlockInfo extends Block with the bookkeeping the NameNode needs (which DataNodes hold the block, and links into the block map).

INodeFileUnderConstruction represents a file that is currently being written. In addition to the INodeFile state it records the writer: clientName (the lease holder), clientMachine, clientNode (the DataNode the client runs on, if any) and targets (the DataNodes of the block being written).
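A simplified sketch of that hierarchy is shown below. It is an abridged illustration of the shape just described, not the actual Hadoop source; Block, BlockInfo, ContentSummary and DatanodeDescriptor are stubs standing in for the real classes.

import java.util.List;

// Stubs standing in for the real Hadoop classes.
class Block { }
class BlockInfo extends Block { }
class ContentSummary { }
class DatanodeDescriptor { }

abstract class INode {
    byte[] name;
    INodeDirectory parent;
    long modificationTime;
    long accessTime;
    long permission;                           // user, group and mode packed into one long

    abstract int collectSubtreeBlocksAndClear(List<Block> v);
    abstract ContentSummary computeContentSummary();
}

class INodeDirectory extends INode {
    private List<INode> children;              // entries of this directory

    int collectSubtreeBlocksAndClear(List<Block> v) { return 0; }
    ContentSummary computeContentSummary() { return new ContentSummary(); }
}

class INodeDirectoryWithQuota extends INodeDirectory {
    long nsQuota;                              // namespace quota
    long dsQuota;                              // disk-space quota
}

class INodeFile extends INode {
    protected BlockInfo[] blocks;              // blocks that make up the file
    // the real class also records the replication factor and preferred block size,
    // exposed through getReplication() and getPreferredBlockSize()

    int collectSubtreeBlocksAndClear(List<Block> v) { return 0; }
    ContentSummary computeContentSummary() { return new ContentSummary(); }
}

class INodeFileUnderConstruction extends INodeFile {
    String clientName;                         // lease holder
    String clientMachine;
    DatanodeDescriptor clientNode;             // set when the writer runs on a DataNode
    DatanodeDescriptor[] targets;              // DataNodes of the block being written
}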

8 Lease

HDFS allows only one writer per file. The NameNode enforces this with leases: when a client opens a file for writing it is granted a lease on that file, the client must keep renewing the lease while it writes, and if the lease stops being renewed the NameNode eventually releases it and recovers the file so that other clients can use it. This chapter follows the lease-related code in the NameNode: the LeaseManager data structures, how leases are added and removed on the write path, and how expired leases are recovered.

Leases are managed by the LeaseManager, which contains the Lease class itself and a Monitor thread. A Lease records its holder (the client name), lastUpdate (the time of the most recent renewal) and paths (the set of files the holder is currently writing). LeaseManager exposes the corresponding operations: addLease adds (or re-adds) a lease for a path, renewLease refreshes lastUpdate, and removeLease drops a path from a lease (and drops the lease itself once it covers no paths). The Monitor thread periodically looks for leases whose hard limit has expired and asks FSNamesystem.internalReleaseLease to release them, which in turn updates the LeaseManager.
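A minimal sketch of the expiry idea, assuming the usual HDFS defaults of a one-minute soft limit (after which another client may take over the file) and a one-hour hard limit (after which the NameNode itself recovers it). LeaseSketch is an illustration, not the real LeaseManager.Lease class, which lives inside LeaseManager and carries more state.

import java.util.Set;
import java.util.TreeSet;

// Minimal sketch of a lease with soft/hard expiry, assuming HDFS-like default limits.
class LeaseSketch {
    static final long SOFT_LIMIT = 60 * 1000L;        // 1 minute: another client may recover
    static final long HARD_LIMIT = 60 * 60 * 1000L;   // 1 hour: the NameNode itself recovers

    private final String holder;                      // client name holding the lease
    private long lastUpdate;                          // time of the last renewal
    private final Set<String> paths = new TreeSet<String>(); // files being written

    LeaseSketch(String holder) {
        this.holder = holder;
        renew();
    }

    void renew()               { lastUpdate = System.currentTimeMillis(); }
    void addPath(String src)   { paths.add(src); }

    boolean expiredSoftLimit() { return System.currentTimeMillis() - lastUpdate > SOFT_LIMIT; }
    boolean expiredHardLimit() { return System.currentTimeMillis() - lastUpdate > HARD_LIMIT; }
}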

Hadoop models file permissions after UNIX. The relevant classes live in org.apache.hadoop.fs.permission: FsAction enumerates the read/write/execute actions, FsPermission combines the user/group/other actions and can apply a umask through applyUMask, and PermissionStatus bundles an FsPermission with the owner and group names. An INode does not store a PermissionStatus object directly; the owner and group strings are interned through a SerialNumberManager (also consulted when the FSImage is written), so each INode packs the MODE bits together with the USER and GROUP serial numbers into a single long. PermissionChecker performs the actual access checks against this information.

Lease Management

The term lease is borrowed from GFS, but it is used differently here. In GFS the master grants a lease to one of the chunkservers holding a replica (the primary); in Hadoop, which only allows a single writer per file (write or append), the NameNode grants the lease to the writing client. Lease management therefore shows up in two places: (1) on the write path, where create adds a lease and complete removes it; and (2) in the background, where expired leases are detected and recovered. Following the write path first: the client calls ClientProtocol.create, the NameNode creates the INode for the new file and records a lease held by that client, and when the client finishes writing it calls complete, which removes the lease and makes the INode permanent. If the writer disappears without completing the file, the lease eventually expires and the NameNode recovers it. NameNode.create is thin:

public void create(String src,
                   FsPermission masked,
                   String clientName,
                   boolean overwrite,
                   boolean createParent,
                   short replication,
                   long blockSize) throws IOException {
    String clientMachine = getClientMachine();
    if (stateChangeLog.isDebugEnabled()) {
      stateChangeLog.debug("*DIR* NameNode.create: file "
          + src + " for " + clientName + " at " + clientMachine);
    }
    if (!checkPathLength(src)) {
      throw new IOException("create: Pathname too long.  Limit "
          + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
    }
    namesystem.startFile(src,
        new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(),
            null, masked),
        clientName, clientMachine, overwrite, createParent, replication, blockSize);
    myMetrics.incrNumFilesCreated();
    myMetrics.incrNumCreateFileOps();
}

FSNamesystem.startFile delegates to startFileInternal, which serves both create and append:

private synchronized void startFileInternal(String src,
                                            PermissionStatus permissions,
                                            String holder,
                                            String clientMachine,
                                            boolean overwrite,
                                            boolean append,
                                            boolean createParent,
                                            short replication,
                                            long blockSize) throws IOException {
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: src=" + src
          + ", holder=" + holder
          + ", clientMachine=" + clientMachine
          + ", createParent=" + createParent
          + ", replication=" + replication
          + ", overwrite=" + overwrite
          + ", append=" + append);
    }

    if (isInSafeMode())
      throw new SafeModeException("Cannot create file" + src, safeMode);
    if (!DFSUtil.isValidName(src)) {
      throw new IOException("Invalid file name: " + src);
    }

    // Verify that the destination does not exist as a directory already.
    boolean pathExists = dir.exists(src);
    if (pathExists && dir.isDir(src)) {
      throw new IOException("Cannot create file " + src
          + "; already exists as a directory.");
    }

    if (isPermissionEnabled) {
      if (append || (overwrite && pathExists)) {
        checkPathAccess(src, FsAction.WRITE);
      } else {
        checkAncestorAccess(src, FsAction.WRITE);
      }
    }

    if (!createParent) {
      verifyParentDir(src);
    }

    try {
      INode myFile = dir.getFileINode(src);
      recoverLeaseInternal(myFile, src, holder, clientMachine, false);

      try {
        verifyReplication(src, replication, clientMachine);
      } catch (IOException e) {
        throw new IOException("failed to create " + e.getMessage());
      }
      if (append) {
        if (myFile == null) {
          throw new FileNotFoundException("failed to append to non-existent file "
              + src + " on client " + clientMachine);
        } else if (myFile.isDirectory()) {
          throw new IOException("failed to append to directory " + src
              + " on client " + clientMachine);
        }
      } else if (!dir.isValidToCreate(src)) {
        if (overwrite) {
          delete(src, true);
        } else {
          throw new IOException("failed to create file " + src
              + " on client " + clientMachine
              + " either because the filename is invalid or the file exists");
        }
      }

      DatanodeDescriptor clientNode =
          host2DataNodeMap.getDatanodeByHost(clientMachine);

      if (append) {
        //
        // Replace current node with a INodeUnderConstruction.
        // Recreate in-memory lease record.
        //
        INodeFile node = (INodeFile) myFile;
        INodeFileUnderConstruction cons = new INodeFileUnderConstruction(
            node.getLocalNameBytes(),
            node.getReplication(),
            node.getModificationTime(),
            node.getPreferredBlockSize(),
            node.getBlocks(),
            node.getPermissionStatus(),
            holder,
            clientMachine,
            clientNode);
        dir.replaceNode(src, node, cons);
        leaseManager.addLease(cons.clientName, src);

      } else {
        // Now we can add the name to the filesystem. This file has no
        // blocks associated with it.
        //
        checkFsObjectLimit();

        // increment global generation stamp
        long genstamp = nextGenerationStamp();
        INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
            replication, blockSize, holder, clientMachine, clientNode, genstamp);
        if (newNode == null) {
          throw new IOException("DIR* NameSystem.startFile: "
              + "Unable to add file to namespace.");
        }
        leaseManager.addLease(newNode.clientName, src);
        if (NameNode.stateChangeLog.isDebugEnabled()) {
          NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: "
              + "add " + src + " to namespace for " + holder);
        }
      }
    } catch (IOException ie) {
      NameNode.stateChangeLog.warn("DIR* NameSystem.startFile: " + ie.getMessage());
      throw ie;
    }
}

Both branches end by registering the new (or replaced) INodeFileUnderConstruction with the lease manager via leaseManager.addLease(newNode.clientName, src):

/**
 * Adds (or re-adds) the lease for the specified file.
 */
synchronized Lease addLease(String holder, String src) {
    Lease lease = getLease(holder);
    if (lease == null) {
      lease = new Lease(holder);
      leases.put(holder, lease);
      sortedLeases.add(lease);
    } else {
      renewLease(lease);
    }
    sortedLeasesByPath.put(src, lease);
    lease.paths.add(src);
    return lease;
}

If the holder already has a lease it is simply renewed; otherwise a new Lease is created and indexed. Either way the path is recorded under the lease. This is the "add" half of lease management on the write path; the "remove" half happens when the client finishes writing and calls complete, which ends up in finalizeINodeFileUnderConstruction:

private void finalizeINodeFileUnderConstruction(String src,
    INodeFileUnderConstruction pendingFile) throws IOException {
    NameNode.stateChangeLog.info("Removing lease on  file " + src
        + " from client " + pendingFile.clientName);
    leaseManager.removeLease(pendingFile.clientName, src);

    // The file is no longer pending.
    // Create permanent INode, update blockmap
    INodeFile newFile = pendingFile.convertToInodeFile();
    dir.replaceNode(src, pendingFile, newFile);

    // close file and persist block allocations for this file
    dir.closeFile(src, newFile);

    checkReplicationFactor(newFile);
}

finalizeINodeFileUnderConstruction removes the client's lease on the file, converts the pending INodeFileUnderConstruction into a permanent INodeFile, and persists the block allocations: the counterpart of the lease added by create. The background half of lease management is the Monitor thread that FSNamesystem starts during initialization with this.lmthread = new Daemon(leaseManager.new Monitor()); followed by lmthread.start();. The Monitor is a small inner class of LeaseManager:

/******************************************************
 * Monitor checks for leases that have expired,
 * and disposes of them.
 ******************************************************/
class Monitor implements Runnable {
    final String name = getClass().getSimpleName();

    /** Check leases periodically. */
    public void run() {
      for (; fsnamesystem.isRunning(); ) {
        synchronized (fsnamesystem) {
          checkLeases();
        }

        try {
          Thread.sleep(2000);
        } catch (InterruptedException ie) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(name + " is interrupted", ie);
          }
        }
      }
    }
}

/** Check the leases beginning from the oldest. */
synchronized void checkLeases() {
    for (; sortedLeases.size() > 0; ) {
      final Lease oldest = sortedLeases.first();
      if (!oldest.expiredHardLimit()) {
        return; // the oldest lease has not hit the hard limit, so neither has any other
      }

LOG.info("Lease " + oldest + " has expired hard limit");

      final List<String> removing = new ArrayList<String>();
      // need to create a copy of the oldest lease paths, because
      // internalReleaseLease() removes paths corresponding to empty files,
      // i.e. it needs to modify the collection being iterated over
      // causing ConcurrentModificationException
      String[] leasePaths = new String[oldest.getPaths().size()];
      oldest.getPaths().toArray(leasePaths);
      for (String p : leasePaths) {
        try {
          fsnamesystem.internalReleaseLeaseOne(oldest, p);
        } catch (IOException e) {
          LOG.error("Cannot release the path " + p + " in the lease " + oldest, e);
          removing.add(p);
        }
      }

      for (String p : removing) {
        removeLease(oldest, p);
      }
    }
}

For each path of the expired lease the monitor calls fsnamesystem.internalReleaseLeaseOne(oldest, p), which starts lease recovery for that file; paths that cannot be released are collected and then dropped from the lease directly with removeLease(oldest, p). The LeaseManager keeps three views of its leases:

private SortedMap<String, Lease> leases = new TreeMap<String, Lease>();              // holder -> lease
private SortedSet<Lease> sortedLeases = new TreeSet<Lease>();                        // leases ordered by last renewal
private SortedMap<String, Lease> sortedLeasesByPath = new TreeMap<String, Lease>();  // path -> lease

addLease keeps all three in step: leases maps a client (the holder) to its lease, sortedLeases orders the leases by how long ago they were renewed so that checkLeases only needs to look at the oldest one, and sortedLeasesByPath maps each file path back to the lease covering it. Finally, the comment at the top of the lease recovery code summarizes the recovery algorithm itself:

 * Lease Recovery Algorithm
 * 1)  Namenode retrieves lease information
 * 2)  For each file f in the lease, consider the last block b of f
 * 2.1) Get the datanodes which contains b
 * 2.2) Assign one of the datanodes as the primary datanode p
 * 2.3) p obtains a new generation stamp from the namenode
 * 2.4) p gets the block info from each datanode
 * 2.5) p computes the minimum block length
 * 2.6) p updates the datanodes, which have a valid generation stamp,
 *      with the new generation stamp and the minimum block length
 * 2.7) p acknowledges the namenode the update results
 * 2.8) Namenode updates the BlockInfo
 * 2.9) Namenode removes f from the lease
 *      and removes the lease once all files have been removed
 * 2.10) Namenode commit changes to edit log
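The reason checkLeases can return as soon as it sees one unexpired lease is that sortedLeases orders leases by the time of their last renewal, oldest first. The sketch below illustrates only that scan; the L class, its comparator and the method names are made up for the illustration and are not the Hadoop classes.

import java.util.SortedSet;

// Illustration of why checkLeases() can stop at the first unexpired lease:
// keep leases sorted by lastUpdate, oldest first. Not the actual Hadoop comparator.
class ExpiryScan {
    static class L implements Comparable<L> {
        final String holder;
        long lastUpdate;
        L(String holder, long lastUpdate) { this.holder = holder; this.lastUpdate = lastUpdate; }
        public int compareTo(L o) {
            // oldest renewal first; fall back to holder so the ordering is total
            if (lastUpdate != o.lastUpdate) return lastUpdate < o.lastUpdate ? -1 : 1;
            return holder.compareTo(o.holder);
        }
    }

    static void check(SortedSet<L> sortedLeases, long hardLimitMs) {
        long now = System.currentTimeMillis();
        while (!sortedLeases.isEmpty()) {
            L oldest = sortedLeases.first();
            if (now - oldest.lastUpdate <= hardLimitMs) {
                return;                      // the oldest is still valid, so every later lease is too
            }
            // ... trigger lease recovery for the oldest lease's paths, then:
            sortedLeases.remove(oldest);
        }
    }
}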

9 Heartbeat

Heartbeats in Hadoop are carried over the RPC mechanism described earlier. The overall picture:
1. Hadoop is a master/slave architecture: on the HDFS side the master is the NameNode and the slaves are the DataNodes; on the MapReduce side the master is the JobTracker and the slaves are the TaskTrackers.
2. The master runs an IPC server and waits for the slaves to call in.
3. Each slave periodically calls the master (every 3 seconds by default for the DataNode); if the master has not heard from a slave within the timeout derived from heartbeat.recheck.interval, it considers that slave dead.
4. The same pattern is used between DataNode and NameNode and between TaskTracker and JobTracker; here we follow the DataNode-to-NameNode heartbeat, which is driven from DataNode.offerService:

/**
 * Main loop for the DataNode.  Runs until shutdown,
 * forever calling remote NameNode functions.
 */
public void offerService() throws Exception {
    LOG.info("using BLOCKREPORT_INTERVAL of " + blockReportInterval + "msec"
        + " Initial delay: " + initialBlockReportDelay + "msec");

    //
    // Now loop for a long time....
    //

    while (shouldRun) {
      try {
        long startTime = now();

        //
        // Every so often, send heartbeat or block-report
        //
        if (startTime - lastHeartbeat > heartBeatInterval) {
          // heartBeatInterval is the DataNode's configured heartbeat period
          //
          // All heartbeat messages include following info:
          // -- Datanode name
          // -- data transfer port
          // -- Total capacity
          // -- Bytes remaining
          //
          lastHeartbeat = startTime;
          DatanodeCommand[] cmds = namenode.sendHeartbeat(dnRegistration,
                                                          data.getCapacity(),
                                                          data.getDfsUsed(),
                                                          data.getRemaining(),
                                                          xmitsInProgress.get(),
                                                          getXceiverCount());
          myMetrics.addHeartBeat(now() - startTime);
          //LOG.info("Just sent heartbeat, with name " + localName);
          // act on the commands the NameNode sent back; if that fails, start over
          if (!processCommand(cmds))
            continue;
        }
        ... // (block reports and the rest of the loop body, including the catch, elided)
      } // while (shouldRun)
  } // offerService

The namenode object used above is not the NameNode itself: DataNode and NameNode run in different JVMs (normally on different machines), so control traffic between them goes over Hadoop RPC. The DataNode declares

public DatanodeProtocol namenode = null;

and the NameNode is the server-side implementation of that interface (among several others):

public class NameNode implements ClientProtocol, DatanodeProtocol,
                                 NamenodeProtocol, FSConstants,
                                 RefreshAuthorizationPolicyProtocol,
                                 RefreshUserMappingsProtocol

So the DataNode talks to the NameNode through the DatanodeProtocol interface over Hadoop RPC, and sendHeartbeat() is one of its methods. The proxy is obtained in DataNode.startDataNode:

    // connect to name node
    this.namenode = (DatanodeProtocol)
                    RPC.waitForProxy(DatanodeProtocol.class,
                                     DatanodeProtocol.versionID,
                                     nameNodeAddr,
                                     conf);

Putting the pieces together, one heartbeat goes through these steps:
1) the DataNode obtains the RPC proxy for the NameNode (above);
2) the DataNode calls sendHeartbeat on that proxy;
3) the proxy's invocation handler packs the method name and arguments into an Invocation and hands it to Client.call;
4) Client.call wraps the Invocation in a Call and sends it to the NameNode;
5) the NameNode's IPC server receives the Call and invokes the real sendHeartbeat;
6) the resulting DatanodeCommand[] travels back the same way and becomes the return value of the proxy call on the DataNode.

On the NameNode side, sendHeartbeat only validates the request and delegates to FSNamesystem:

/**
 * Data node notify the name node that it is alive
 * Return an array of block-oriented commands for the datanode to execute.
 * This will be either a transfer or a delete operation.
 */
public DatanodeCommand[] sendHeartbeat(DatanodeRegistration nodeReg,
                                       long capacity,
                                       long dfsUsed,
                                       long remaining,
                                       int xmitsInProgress,
                                       int xceiverCount) throws IOException {
    verifyRequest(nodeReg);
    return namesystem.handleHeartbeat(nodeReg, capacity, dfsUsed, remaining,
                                      xceiverCount, xmitsInProgress);
}

The DataNode identifies itself with its DatanodeRegistration and gets back an array of DatanodeCommand objects describing the work the NameNode wants it to do.

DatanodeProtocol defines the actions a DatanodeCommand can ask the DataNode to perform:

    /**
     * Determines actions that data node should perform
     * when receiving a datanode command.
     */
    final static int DNA_UNKNOWN = 0;                  // unknown action
    final static int DNA_TRANSFER = 1;                 // transfer blocks to another datanode
    final static int DNA_INVALIDATE = 2;               // invalidate blocks
    final static int DNA_SHUTDOWN = 3;                 // shutdown node
    final static int DNA_REGISTER = 4;                 // re-register
    final static int DNA_FINALIZE = 5;                 // finalize previous upgrade
    final static int DNA_RECOVERBLOCK = 6;             // request a block recovery
    final static int DNA_ACCESSKEYUPDATE = 7;          // update access key
    final static int DNA_BALANCERBANDWIDTHUPDATE = 8;  // update balancer bandwidth

FSNamesystem.handleHeartbeat does the real work:
1) It looks up the DatanodeDescriptor of the reporting node with getDatanode; if the node is unknown (for example its storage ID does not match anything the NameNode knows), it returns DatanodeCommand.REGISTER so the DataNode registers again.
2) It checks whether the node has been decommissioned or is otherwise disallowed, in which case it throws DisallowedDatanodeException.
3) If nodeinfo is null or the node is not marked alive, it again returns DatanodeCommand.REGISTER.
4) Otherwise it updates the cluster statistics (capacityTotal, capacityUsed, capacityRemaining, totalLoad) with the values carried in the heartbeat.
5) Finally it collects any pending work for this DataNode (lease recovery, replication, block invalidation, key updates, balancer bandwidth) into an array of DatanodeCommand and returns it.

/**
 * The given node has reported in.  This method should:
 * 1) Record the heartbeat, so the datanode isn't timed out
 * 2) Adjust usage stats for future block allocation
 *
 * If a substantial amount of time passed since the last datanode
 * heartbeat then request an immediate block report.
 *
 * @return an array of datanode commands
 * @throws IOException
 */
DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
    long capacity, long dfsUsed, long remaining,
    int xceiverCount, int xmitsInProgress) throws IOException {
    DatanodeCommand cmd = null;
    synchronized (heartbeats) {
      synchronized (datanodeMap) {
        DatanodeDescriptor nodeinfo = null;
        try {
          nodeinfo = getDatanode(nodeReg);
        } catch (UnregisteredDatanodeException e) {
          return new DatanodeCommand[]{DatanodeCommand.REGISTER};
        }
        // Check if this datanode should actually be shutdown instead.
        if (nodeinfo != null && shouldNodeShutdown(nodeinfo)) {
          setDatanodeDead(nodeinfo);
          throw new DisallowedDatanodeException(nodeinfo);
        }

        if (nodeinfo == null || !nodeinfo.isAlive) {
          return new DatanodeCommand[]{DatanodeCommand.REGISTER};
        }

        updateStats(nodeinfo, false);
        nodeinfo.updateHeartbeat(capacity, dfsUsed, remaining, xceiverCount);
        updateStats(nodeinfo, true);

        //check lease recovery
        cmd = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
        if (cmd != null) {
          return new DatanodeCommand[] {cmd};
        }

        ArrayList<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
        //check pending replication
        cmd = nodeinfo.getReplicationCommand(
            maxReplicationStreams - xmitsInProgress);
        if (cmd != null) {
          cmds.add(cmd);
        }
        //check block invalidation
        cmd = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
        if (cmd != null) {
          cmds.add(cmd);
        }
        // check access key update
        if (isAccessTokenEnabled && nodeinfo.needKeyUpdate) {
          cmds.add(new KeyUpdateCommand(accessTokenHandler.exportKeys()));
          nodeinfo.needKeyUpdate = false;
        }
        // check for balancer bandwidth update
        if (nodeinfo.getBalancerBandwidth() > 0) {
          cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
          // set back to 0 to indicate that datanode has been sent the new value
          nodeinfo.setBalancerBandwidth(0);
        }
        if (!cmds.isEmpty()) {
          return cmds.toArray(new DatanodeCommand[cmds.size()]);
        }
      }
    }

    //check distributed upgrade
    cmd = getDistributedUpgradeCommand();
    if (cmd != null) {
      return new DatanodeCommand[] {cmd};
    }
    return null;
}
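Back on the DataNode, offerService hands the returned array to processCommand, which dispatches on the DNA_* action codes listed above. The sketch below shows only that dispatch structure under simplified assumptions; the Command interface, the helper methods and the return-value convention here are made up for illustration and are not the actual DataNode.processCommand source.

// Simplified sketch of how a datanode could react to the DNA_* codes above.
class CommandDispatchSketch {
    static final int DNA_TRANSFER = 1, DNA_INVALIDATE = 2, DNA_SHUTDOWN = 3,
                     DNA_REGISTER = 4, DNA_FINALIZE = 5, DNA_RECOVERBLOCK = 6,
                     DNA_ACCESSKEYUPDATE = 7, DNA_BALANCERBANDWIDTHUPDATE = 8;

    interface Command { int getAction(); }

    /** In this sketch, false means the datanode should go back and re-register. */
    boolean process(Command cmd) {
        switch (cmd.getAction()) {
        case DNA_TRANSFER:                 // replicate blocks to other datanodes
        case DNA_INVALIDATE:               // delete blocks no longer needed
        case DNA_RECOVERBLOCK:             // take part in block/lease recovery
        case DNA_FINALIZE:                 // finalize a previous upgrade
        case DNA_ACCESSKEYUPDATE:          // refresh block access keys
        case DNA_BALANCERBANDWIDTHUPDATE:  // adjust balancer bandwidth
            doWork(cmd);
            return true;
        case DNA_SHUTDOWN:                 // stop this datanode
            shutdown();
            return true;
        case DNA_REGISTER:                 // namenode lost track of us: register again
            return false;
        default:
            return true;
        }
    }

    private void doWork(Command cmd) { /* ... */ }
    private void shutdown()          { /* ... */ }
}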

10 HDFS file read and write

Before diving into the internals of reading and writing, here are two small client programs built on the HDFS API. The first copies a local file into HDFS through the abstract FileSystem class, printing a dot as data is written:

// Copy a local file to HDFS through the FileSystem API
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        // FileSystem.get returns the concrete file system for the URI, here HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            // called back periodically as data is written; MapReduce uses the same hook
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}

Run it with, for example:

hadoop FileCopyWithProgress input/1.txt hdfs://localhost/user/hadoop/1.txt

The second reads a file from HDFS and copies it to standard output:

// Read a file from HDFS using the FileSystem API
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // FileSystem.get returns the concrete file system for the URI, here HDFS
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
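In the same spirit as the two examples above, here is a small sketch that lists a directory through the FileSystem API; the class name and the default path are made up for the example.

// Minimal sketch: list the entries of an HDFS directory with the FileSystem API.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemLs {
    public static void main(String[] args) throws Exception {
        String uri = args.length > 0 ? args[0] : "hdfs://localhost/user/hadoop";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // listStatus returns one FileStatus per entry in the directory
        for (FileStatus status : fs.listStatus(new Path(uri))) {
            System.out.println((status.isDir() ? "d " : "- ") + status.getPath());
        }
    }
}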

Both examples talk to HDFS only through the abstract org.apache.hadoop.fs.FileSystem class; HDFS is one concrete file system behind that interface. The fields and methods of FileSystem that matter most here are summarized below.

CACHE: a cache of already-created FileSystem instances, so repeated FileSystem.get calls for the same URI and configuration return the same object.
statisticsTable: a table of Statistics objects (bytes read and written), keyed by FileSystem implementation class.
deleteOnExit: the set of paths to be deleted when the JVM exits.
clientFinalizer: the shutdown hook thread that closes all cached FileSystem instances.
getFileBlockLocations: returns the locations of the blocks of a file.
exists: tests whether a path exists.
isFile: tests whether a path is an ordinary file.
getContentSummary: returns size and quota information for a path.
listStatus: lists the entries of a directory.
globStatus: expands a glob pattern, much like shell wildcards on Linux.
getHomeDirec