Can a Hadoop cluster have multiple name nodes?

14 answers

Anonymous users 2024-02-15

One of Docker's core features is the ability to package any application, including Hadoop, into a Docker image. This tutorial describes in detail how to quickly set up a multi-node Hadoop cluster on a single machine using Docker. After finding problems with the existing Hadoop-on-Docker projects, the author built a near-minimal Hadoop image that supports quickly creating a Hadoop cluster with any number of nodes.

GitHub: kiwanlau/hadoop-cluster-docker

Setting up a Hadoop cluster directly on physical machines is quite a painful process, especially for beginners. Before they have even run a wordcount job, they may already be worn down by the setup.

And not everyone has several machines, right? You can try building the cluster with multiple virtual machines, provided you have one machine with good performance. My goal is to run Hadoop clusters in Docker containers, so that Hadoop developers can set up a multi-node Hadoop cluster locally, quickly and easily.

In fact, this idea has been implemented many times, but none of the implementations is ideal: the images are too large, or they are too slow to use, or they rely on third-party tools that make them too complicated. The following table shows some of the known Hadoop-on-Docker projects and their problems. In addition, to add nodes in the Alvinhenrick Hadoop-Mutino project, you need to manually modify the Hadoop configuration file, rebuild the Hadoop-nn-dn image, and then modify the container startup script.

I automated this with a shell script, which rebuilds the hadoop-master image in less than 1 minute so that it can be run immediately. This project starts a 3-node Hadoop cluster by default and supports clusters with any number of nodes. In addition, starting Hadoop, running wordcount, and rebuilding images are all automated with shell scripts.

This makes using and developing the whole project convenient and fast. Development and test environment: Operating system: and kernel version:

Version: Note for readers: insufficient disk space, insufficient memory, and especially a kernel version that is too low can cause the operation to fail.
Anonymous users 2024-02-14

Yes, there should be more than one under normal circumstances, so that the whole cluster can continue to run if something goes wrong.
Anonymous users 2024-02-13

Of course it will. If a datanode fails to start, you do not need to reformat the namenode, because the namenode is not what is broken. Reformatting the namenode is harmless only when HDFS holds no data; otherwise look for another solution, and there will certainly be one. If a datanode will not start, I suggest finding a way to add it back to the cluster, because reformatting the namenode means the data already in HDFS is no longer available.
Anonymous users 2024-02-12

ZooKeeper is a stand-alone component that works with HDFS, but the two do not have to be deployed on the same machines, as long as they can reach each other over the network.

In addition, it is recommended to install ZooKeeper on at least 3 nodes and to use an odd number of nodes.
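The reason for the odd number can be sketched with a little arithmetic. The snippet below assumes ZooKeeper's usual majority quorum (more than half of the servers must be up); the function names are illustrative only and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes: a 4th server adds no extra fault tolerance.
for n in (3, 4, 5):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Under this assumption, 3 and 4 servers both tolerate a single failure, so the even-sized ensemble costs one more machine without surviving any more failures.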
Anonymous users 2024-02-11

Computation time on Hadoop = time spent by the Hadoop framework itself + data processing time / computing concurrency.

The Hadoop framework itself takes about 10 s. If the parameters are poorly set it may take longer, but probably no more than about half a minute.

The computing concurrency depends on two factors:

1. The number of blocks the data occupies. This depends on the block size set when your file was stored on Hadoop; the default is 64 MB, so check whether yours is that size. Number of blocks = file size / block size.

2. The maximum number of parallel tasks configured in Hadoop. This is the number of tasks in the running state at any given time as shown on the jobtracker, and the value is usually fairly stable.
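As a rough illustration of the estimate, here is a minimal sketch. It assumes the effective concurrency is the smaller of the two factors (block count and maximum parallel tasks) and that every block takes the same time to process; both are simplifying assumptions, and `seconds_per_block` is a hypothetical input you would have to measure yourself.

```python
import math


def block_count(file_size_mb: float, block_size_mb: float = 64) -> int:
    """Number of blocks = file size / block size, rounded up."""
    return math.ceil(file_size_mb / block_size_mb)


def estimated_seconds(file_size_mb, seconds_per_block, max_parallel_tasks,
                      block_size_mb=64, framework_overhead_s=10):
    """Framework overhead + data processing time / concurrency."""
    blocks = block_count(file_size_mb, block_size_mb)
    # Assumption: concurrency is capped by both the block count and the task limit.
    concurrency = min(blocks, max_parallel_tasks)
    data_processing_time = blocks * seconds_per_block
    return framework_overhead_s + data_processing_time / concurrency


# A 640 MB file in 64 MB blocks, 30 s per block, at most 5 tasks at once.
print(block_count(640))
print(estimated_seconds(640, seconds_per_block=30, max_parallel_tasks=5))
```

Treat the result as an order-of-magnitude guide only; real jobs also spend time on shuffling, scheduling, and stragglers.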

The distributed environment is complex, and if the above information is not enough, you need to consult the administrator.
Anonymous users 2024-02-10

The default block size is 64 MB; a file larger than 64 MB is split into blocks. The namenode handles data redundancy and distribution, and there you can see how your data blocks are distributed and how many blocks are on each machine. Also, when you went from a single machine to 7, did you reset the number of data replicas?
Anonymous users 2024-02-09

It acts as the leader and is responsible for scheduling. For example, if you need to save a 640 MB file and it is split into 64 MB blocks, the namenode keeps track of the resulting 10 blocks (replicas are not considered here). It mainly maintains two maps: one from files to blocks, and one from blocks to nodes. I hope this explanation makes sense!
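The two maps can be pictured with a toy model. This is only a sketch of the idea, not how HDFS is implemented: the class name, the round-robin placement, and the datanode names are all made up for illustration, and replicas are ignored as in the example above.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of the two mappings a namenode keeps in memory."""

    def __init__(self, block_size_mb=64):
        self.block_size_mb = block_size_mb
        self.file_to_blocks = {}                  # file path -> ordered block ids
        self.block_to_nodes = defaultdict(list)   # block id -> datanode names
        self._next_block_id = 0

    def add_file(self, path, size_mb, datanodes):
        n_blocks = -(-size_mb // self.block_size_mb)  # ceiling division
        block_ids = []
        for i in range(n_blocks):
            block_id = self._next_block_id
            self._next_block_id += 1
            # Hypothetical round-robin placement, one copy per block.
            self.block_to_nodes[block_id].append(datanodes[i % len(datanodes)])
            block_ids.append(block_id)
        self.file_to_blocks[path] = block_ids

    def locate(self, path):
        """Return (block id, datanodes) pairs for every block of a file."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[path]]


nn = ToyNameNode()
nn.add_file("/data/big.bin", 640, ["dn1", "dn2", "dn3"])
print(len(nn.file_to_blocks["/data/big.bin"]))  # 10 blocks for 640 MB
print(nn.locate("/data/big.bin")[:3])
```

A read then becomes two lookups: file to block list, then each block to the nodes that hold it.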
Anonymous users 2024-02-08

1. It is the directory of the cluster and the entry point to the system.

2. It is the cluster's task dispatch center.
Anonymous users 2024-02-07

HDFS has two core components: the NameNode (one master node) and the DataNodes (multiple slave nodes). The DataNodes mainly store data. The NameNode has three jobs: first, it manages the metadata of the files in the file system (including file name, size, location, attributes, creation time, modification time, etc.); second, it maintains the mapping from files to blocks and from blocks to DataNodes; third, it records the user's operations on files (adding, deleting, modifying, and querying files).
Anonymous users 2024-02-06

Bottom line: scheduling tasks in Hadoop.
Anonymous users 2024-02-05

It is best to set up two.

About hard disks: for 6 TB of data, the space needed depends on the number of replicas you set. The default is generally 3, so this data alone needs 18 TB of disk, or 20 TB to leave some margin. This covers HDFS storage only. (I am talking about one month of data here; if you keep data for several months, multiply accordingly.)

If you want to run computations on the cluster, the data produced by MR also has to be stored in HDFS, so you still have to estimate based on your result data, whose size depends on your computing task.

This is how the hard disk size is calculated:

(Original data + intermediate data + result data) × number of replicas = total disk size.
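The formula translates directly into a small helper. The intermediate and result sizes in the second call are placeholder values chosen for illustration; they are not figures from this thread.

```python
def total_disk_tb(original_tb, intermediate_tb, result_tb, replicas=3):
    """(original + intermediate + result data) x number of replicas."""
    return (original_tb + intermediate_tb + result_tb) * replicas


# 6 TB of raw data only, 3 replicas: the 18 TB mentioned above.
print(total_disk_tb(6, 0, 0))  # 18

# Hypothetical job that adds 1 TB of intermediate and 1 TB of result data.
print(total_disk_tb(6, 1, 1))  # 24
```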

About memory: the namenode mainly uses memory to hold the mapping between blocks and nodes, so its memory is also calculated from the size of the data: 6 TB / block size (default 128 MB) = number of blocks -> M.

How much memory does a block occupy? Conservatively allow 1000 MB of memory per million blocks.

Total memory of the namenode (MB) = M × 1000 MB / 1 million
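Worked through for the 6 TB example, the rule of thumb looks like this. The sketch assumes binary units (1 TB = 1024 × 1024 MB) and counts only the blocks themselves, so it ignores file and directory entries and any headroom you would add in practice.

```python
import math

MB_PER_TB = 1024 * 1024  # assumption: binary units


def namenode_memory_mb(data_tb, block_size_mb=128, mb_per_million_blocks=1000):
    """Return (number of blocks, estimated namenode memory in MB)."""
    blocks = math.ceil(data_tb * MB_PER_TB / block_size_mb)
    memory_mb = blocks * mb_per_million_blocks / 1_000_000
    return blocks, memory_mb


blocks, memory_mb = namenode_memory_mb(6)
print(blocks, "blocks need about", memory_mb, "MB")
```

For 6 TB at 128 MB per block this gives 49152 blocks and roughly 49 MB, which shows that block metadata only becomes a memory concern at much larger data volumes or with many small files.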

Datanode memory: generally not a big problem. It is mostly used for MR computation, so set it according to your performance needs.

How many machines?

Decide based on the number of tasks and your performance targets.

Run a real test: measure how a given amount of data performs on x machines, then work out how many machines you need to meet your targets.

The performance of a Hadoop cluster is roughly positively correlated with the number of nodes.
Anonymous users 2024-02-04

Make sure that the three machines can ping each other, and then make sure that the hosts file of each machine has a mapping of the IP and hostnames of the three machines.
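The hosts-file half of that check can be scripted. The sketch below parses hosts-file text and reports which required hostnames have no mapping; the IP addresses and hostnames in the sample are invented for illustration, and on a real machine you would read the text from the hosts file instead.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


def missing_hosts(text, required):
    """Return the required hostnames that the hosts text does not map."""
    known = parse_hosts(text)
    return [h for h in required if h not in known]


# Hypothetical hosts content for a three-machine cluster with one entry missing.
sample = """
127.0.0.1    localhost
192.168.1.10 master
192.168.1.11 slave1   # first worker
"""
print(missing_hosts(sample, ["master", "slave1", "slave2"]))  # ['slave2']
```

Run the same check on each of the three machines, since every one of them needs all three mappings.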
Anonymous users 2024-02-03

Check whether you can log in to master without a password from both machines B and C:

ssh test86

If you can't log in, everything else is in vain.

Check this one thing first!
Anonymous users 2024-02-02

I ran the command to start the Hadoop cluster, but the namenode did not show up. While I was solving this problem, many new problems kept arising.

I had already solved these problems when I first learned Hadoop. Precisely because I had solved them before, I did not want to work through them again now; I just wanted to start a cluster as quickly as possible.

Finally, I came up with a bad idea: format the namenode directly.

Let's get to the point: reformatting the namenode in Hadoop.

Because the Hadoop cluster had started normally before, the corresponding Hadoop data directory already contains many related folders. These folders must be deleted before formatting.

1. Perform operations on the master.

   1. Delete the data, name, and namesecondary folders.

   2. Delete the four folders in the mrlocal directory.

   3. Delete all files in the logs folder.

2. Perform operations on the nodes.

   1. Delete all files in the HDFS data folder.

   2. Delete all files in mrlocal.

   3. Delete all files in logs.
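The node-side steps amount to emptying three directories. A minimal sketch of a helper for that is shown below; it defaults to a dry run that only lists what would be removed, because this operation destroys HDFS data. The directory locations are not given in this answer, so any path you pass in is your own assumption about your installation.

```python
import shutil
from pathlib import Path


def clear_directory(directory, dry_run=True):
    """Remove everything inside `directory` but keep the directory itself.

    Returns the entries that were (or, in a dry run, would be) removed.
    """
    removed = []
    directory = Path(directory)
    if not directory.is_dir():
        return removed
    for entry in sorted(directory.iterdir()):
        removed.append(entry)
        if dry_run:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # delete a sub-folder and its contents
        else:
            entry.unlink()         # delete a file or symlink
    return removed
```

A cautious way to use it is to call `clear_directory(path)` first, inspect the returned list, and only then repeat the call with `dry_run=False`.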

Once the deletion is complete, reformat the namenode.

3. Format the namenode

[hadoop@master hadoop]$ bin/hadoop namenode -format

After formatting succeeds, restart the cluster.

4. Restart the cluster.

[hadoop@master hadoop]$ bin/

5. Check the startup status of the cluster.

[hadoop@master hadoop]$ jps

3851 Jps

3744 TaskTracker

3622 JobTracker

3279 NameNode

3533 SecondaryNameNode

3395 DataNode

The startup is successful.
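Checking the jps listing by eye is easy to get wrong, so the comparison can be scripted. This sketch parses jps-style text and reports which of the five daemons from the listing are absent; it compares names case-insensitively, and the sample text simply repeats the output shown in this answer.

```python
EXPECTED = {"NameNode", "SecondaryNameNode", "DataNode", "JobTracker", "TaskTracker"}


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:          # lines look like "<pid> <process name>"
            running.add(parts[1].lower())
    return sorted(d for d in expected if d.lower() not in running)


sample = """3851 Jps
3744 TaskTracker
3622 JobTracker
3279 NameNode
3533 SecondaryNameNode
3395 DataNode"""

print(missing_daemons(sample))  # [] means every expected daemon is running
```

An empty list corresponds to the successful startup described above; any name in the list points to the daemon whose log should be checked first.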