Zombie simulated datanode #81

Open
fengnanli opened this issue Feb 27, 2019 · 1 comment

@fengnanli
Contributor

After running start-dynamometer-cluster.sh and replaying the prod audit log for some time, some simulated DataNodes (containers) lost their connection to the RM, and when the YARN application is killed these containers keep running and continue sending their block reports to the NameNode.
Since the DataNodes have gone through changes during the replay while the NameNode started from a fresh fsimage, the errors below show up on the WebHDFS page after the NameNode starts up.

Safe mode is ON. The reported blocks 1526116 needs additional 395902425 blocks to reach the threshold 0.9990 of total blocks 397826363. The number of live datanodes 3 has reached the minimum number 0. Name node detected blocks with generation stamps in future. This means that Name node metadata is inconsistent. This can happen if Name node metadata files have been manually replaced. Exiting safe mode will cause loss of 7141 byte(s). Please restart name node with right metadata or use "hdfs dfsadmin -safemode forceExit" if you are certain that the NameNode was started with the correct FsImage and edit logs. If you encountered this during a rollback, it is safe to exit with -safemode forceExit.

Checking the Datanodes tab on the WebHDFS page, a list of a couple of these zombie DataNodes shows up.

@xkrogen
Collaborator

xkrogen commented Feb 27, 2019

Thanks for reporting this @fengnanli! I think I asked before but I don't remember your answer: was this running within a secure environment using LinuxContainerExecutor / cgroups? I think that is what prevents such things from occurring in our environment.
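
For reference, enabling LinuxContainerExecutor with cgroups is typically done through yarn-site.xml on the NodeManagers. The snippet below is a minimal sketch of the relevant properties; the group name and cgroups hierarchy are example values, and the setuid container-executor binary plus container-executor.cfg also need to be configured, which is not shown here.

<!-- Use the LinuxContainerExecutor so YARN can reliably signal/kill container processes -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- Unix group of the NodeManager; example value, must match container-executor.cfg -->
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
<!-- Place container processes under cgroups so their resources and lifetimes are bounded -->
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<!-- Example cgroups hierarchy used for YARN containers -->
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>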
