-
One of the core features of Docker is the ability to package almost any application, including Hadoop, into a Docker image. This tutorial walks through the steps to quickly set up a multi-node Hadoop cluster on a single machine using Docker. After running into the problems of existing Hadoop-on-Docker projects, the author built a near-minimal Hadoop image with support for quickly launching a Hadoop cluster with any number of nodes.
GitHub: kiwanlau/hadoop-cluster-docker. Setting up a Hadoop cluster directly on physical machines is quite a painful process, especially for beginners; they may be worn down by it before they even get to run their first WordCount.
And not everyone has several machines at hand, right? You can try building with multiple virtual machines, as long as you have a machine with decent specs. My goal is to run Hadoop clusters in Docker containers, so that Hadoop developers can quickly and easily set up multi-node clusters on their own machines.
In fact, there have already been many implementations of this idea, but none of them is ideal: the images are too large, they run too slowly, or they rely on third-party tools that make them overly complicated to use. The known Hadoop-on-Docker projects each suffer from one or more of these problems. In addition, to add nodes to the Alvinhenrick Hadoop-Mutino project, you need to manually modify the Hadoop configuration files, rebuild the hadoop-nn-dn image, and then modify the container startup script.
I automated this with shell scripts instead: the hadoop-master image can be rebuilt in less than a minute and run immediately afterwards. This project starts a 3-node Hadoop cluster by default and supports clusters with any number of nodes. In addition, starting Hadoop, running WordCount, and rebuilding images are all automated with shell scripts.
This makes using and developing the whole project very convenient and fast. Dev/test environment: operating system, kernel version, and Docker version:
A reminder for readers: insufficient disk space, insufficient memory, and especially a kernel version that is too old will cause the run to fail.
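As a rough sketch of the intended workflow (the repository URL is derived from the GitHub reference above, and the script names start-container.sh, start-hadoop.sh, run-wordcount.sh, and resize-cluster.sh are illustrative assumptions, not confirmed by this text):

# Clone the project (URL derived from the GitHub reference above)
git clone https://github.com/kiwanlau/hadoop-cluster-docker.git
cd hadoop-cluster-docker

# Start the default 3-node cluster (1 master + 2 slaves)
sudo ./start-container.sh

# Inside the master container: start Hadoop and run the sample WordCount job
./start-hadoop.sh
./run-wordcount.sh

# Rebuild the image and relaunch with a different number of nodes, e.g. 5
sudo ./resize-cluster.sh 5
sudo ./start-container.sh 5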
-
Yes, under normal circumstances there should be more than one, so that the whole cluster can keep running if something goes wrong.
-
Of course it will. If a datanode won't start, you don't need to reformat the namenode; the datanode failing doesn't mean the namenode is the problem. Reformatting a namenode when HDFS holds no data is harmless, but otherwise look for another solution first, and there will certainly be one. If a datanode won't start, I suggest you find a way to bring it back into the cluster; you should not reformat the namenode just because of it, or the data already on HDFS will no longer be usable.
-
ZooKeeper is a stand-alone component that works together with HDFS, but it does not have to be deployed on the same machines, as long as the nodes can reach each other over the network.
In addition, ZooKeeper should be installed on at least 3 nodes, and the total number should be odd.
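Where a 3-node ensemble is wanted, a minimal sketch of the ZooKeeper configuration might look like the following; the hostnames zk1, zk2, zk3 and the dataDir are hypothetical placeholders:

# conf/zoo.cfg -- identical on all three ZooKeeper nodes
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

Each node also needs a myid file in its dataDir containing its own server number (1, 2, or 3).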
-
Computation time on Hadoop ≈ time spent by the Hadoop framework itself + data processing time / degree of parallelism.
Of these, the Hadoop framework's own overhead is roughly 10 s; if the parameters are badly tuned it may be longer, but probably no more than about half a minute.
The degree of parallelism depends on two factors:
1. The number of blocks the data occupies. This depends on the block size configured when the file was stored in HDFS (the default is 64 MB; check whether yours is actually that size). Number of blocks = file size / block size.
2. The maximum number of parallel tasks configured in Hadoop, i.e. the number of tasks in the running state on the JobTracker at any given moment; this value is usually fairly stable.
The distributed environment is complex, and if the above information is not enough, you need to consult the administrator.
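As a tiny worked example (the 10 GB file size is purely illustrative), the block count can be estimated like this:

# Number of blocks (and hence map tasks) for a 10 GB file with 64 MB blocks
FILE_SIZE_MB=10240
BLOCK_SIZE_MB=64
echo $(( (FILE_SIZE_MB + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))   # prints 160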
-
The default block size is 64 MB; a file larger than 64 MB is split into blocks, and the namenode handles the redundancy and distribution of those blocks. You can check how your data blocks are distributed, i.e. how many blocks sit on each machine. Also, now that you have gone from a single machine to 7, have you reset the number of data replicas?
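To inspect this on a running cluster, the classic Hadoop 1.x commands below can be used; the path / and the replication factor 3 are just examples:

# List files, their blocks, and the datanodes each block is stored on
hadoop fsck / -files -blocks -locations

# Change the replication factor of existing files (dfs.replication in
# hdfs-site.xml only affects files written afterwards)
hadoop fs -setrep -R 3 /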
-
As the leader, the namenode is responsible for scheduling. For example, if you need to store a 640 MB file and it is split at 64 MB, the namenode keeps track of those 10 blocks (not considering replicas here). It mainly maintains two maps: one from files to blocks, and one from blocks to nodes. I hope this explanation makes sense!
-
1. It is the cluster's directory and the entry point to the system.
2. It is the cluster's task dispatch center.
-
HDFS has two core components: the NameNode (a single master node) and the DataNodes (multiple slave nodes). DataNodes mainly store data. The NameNode, first, manages the metadata of the files in the file system (file name, size, location, attributes, creation time, modification time, and so on); second, it maintains the mappings from files to blocks and from blocks to nodes; and third, it tracks users' operations on files (create, delete, modify, query).
-
Bottom line: scheduling tasks in Hadoop.
-
It is best to consider this in two parts.
About hard disks: with 6 TB of data, it depends on the replication factor you set; the default is generally 3, so this alone needs 18 TB of disk, or a bit more, say 20 TB. That is just HDFS storage. (I am talking about one month of data here; if you keep data for several months, multiply accordingly.)
If you want to run computations on the cluster, the data produced by MapReduce also has to be stored in HDFS, so you still have to allow for your result data; its size depends on your computing tasks.
This is how the hard drive size is calculated:
(Original data + intermediate data + result data) × number of replicas = total disk size.
About memory: for the namenode, the main use of memory is to hold the block-to-node mapping, so it is also sized from the data volume: 6 TB / block size (default 128 MB) = number of blocks, call it M.
How much memory does a block occupy? Conservatively budget 1000 MB of memory per million blocks.
Total namenode memory (MB) = M / 1,000,000 × 1000 MB.
Datanode memory is generally not a big problem; it is mostly consumed by MapReduce computation, and you size it according to your performance needs.
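A back-of-the-envelope sketch of the figures above (6 TB of raw data, 3 replicas, 128 MB blocks); all numbers are illustrative and ignore intermediate and result data:

RAW_TB=6; REPLICAS=3; BLOCK_MB=128
echo "disk   : $(( RAW_TB * REPLICAS )) TB for HDFS alone"      # 18 TB
BLOCKS=$(( RAW_TB * 1024 * 1024 / BLOCK_MB ))                   # 49152 blocks
echo "blocks : $BLOCKS"
echo "nn mem : $(( BLOCKS * 1000 / 1000000 )) MB (round up)"    # about 49 MB at this scale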
About how many machines?
Make the decision based on the number of tasks and your performance targets.
In practice, run a benchmark: process a given amount of data on x machines, check whether it meets your targets, and scale from there.
The performance of a Hadoop cluster is roughly proportional to the number of nodes.
-
Make sure that the three machines can ping each other, and then make sure that the hosts file on each machine maps the IPs and hostnames of all three machines.
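For example, the hosts file might contain entries like the following; the IP addresses and hostnames are hypothetical and must match your actual machines:

# /etc/hosts -- the same three lines on every machine
192.168.1.101   master
192.168.1.102   slave1
192.168.1.103   slave2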
-
Try whether you can log in to master without a password from both machines B and C:
ssh test86
If you can't log in without a password, everything else is in vain.
Go back and check step 1 first!
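A common way to set up the passwordless login, sketched with standard OpenSSH tools; the user name hadoop is an assumption, and test86 is the master hostname used in the example above:

# Run on each slave machine (B and C)
ssh-keygen -t rsa                 # accept the defaults; creates ~/.ssh/id_rsa(.pub)
ssh-copy-id hadoop@test86         # append the public key to the master's authorized_keys
ssh test86                        # should now log in without prompting for a password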
-
I ran the command to start the Hadoop cluster, but the namenode did not show up. While solving this problem, many new problems kept appearing.
These problems had all been solved during my earlier study of Hadoop, but precisely because they had been solved before, what mattered now was getting past them quickly: I just wanted to start the cluster as fast as possible.
In the end I resorted to a blunt fix: reformat the namenode directly.
So, to the point: reformatting the namenode in Hadoop.
Because the Hadoop cluster had been started normally before, the corresponding Hadoop data directories already contain a number of related folders, and these need to be deleted before formatting.
1. Operations on the master node.
1. Delete the data, name, and namesecondary folders.
2. Delete the four folders in the mrlocal directory.
3. Delete all files in the logs folder.
2. Operations on the slave nodes.
1. Delete all files in HDFS Data.
2. Delete all files in mrlocal.
3. Delete all files in logs.
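A sketch of the cleanup above as shell commands; the base directory /opt/hadoop and the sub-paths are assumptions, so substitute the dfs.name.dir, dfs.data.dir, mapred.local.dir, and log locations from your own configuration files:

# On the master (paths are illustrative)
rm -rf /opt/hadoop/tmp/dfs/data /opt/hadoop/tmp/dfs/name /opt/hadoop/tmp/dfs/namesecondary
rm -rf /opt/hadoop/tmp/mrlocal/*
rm -rf /opt/hadoop/logs/*

# On every slave node
rm -rf /opt/hadoop/tmp/dfs/data/*
rm -rf /opt/hadoop/tmp/mrlocal/*
rm -rf /opt/hadoop/logs/*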
After the basic deletion is complete, start reformatting the namenode.
3. Format the namenode
[hadoop@master hadoop]$ bin/hadoop namenode -format
After formatting succeeds, restart the cluster.
4. Restart the cluster.
[hadoop@master hadoop]$ bin/start-all.sh
5. Check the startup status of the cluster.
[hadoop@master hadoop]$ jps
3851 Jps
3744 TaskTracker
3622 JobTracker
3279 NameNode
3533 SecondaryNameNode
3395 DataNode
As the jps output shows, all the daemons are running and the startup is successful.