Apache Hadoop
Set up Apache Hadoop on Windows, Linux & macOS.
Download the prebuilt Apache Hadoop image from the Docker Hub repository:
These images are about 14 GB, so please be patient!
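docker run will pull the image automatically the first time it runs, but you can also pull it explicitly beforehand using the same tag that appears in the run command below:

docker pull jporeilly/apache-hadoop:amd    # pre-download the ~14 GB image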

Once the download completes, deploy the Apache Hadoop container:
docker run -it -p 9870:9870 -p 8095:8088 -p 9864:9864 --name AHW jporeilly/apache-hadoop:amd

The YARN ResourceManager port is mapped to host port 8095 to prevent a conflict with the Cedalo Management Center (MQTT).
Once the container is deployed, a shell will open:
root@955b8d17f170:/#

Enter: init
init

This stops all running processes, formats the HDFS NameNode, and starts all processes.
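The init helper is specific to this image and is not reproduced here; in terms of the stock Hadoop scripts it corresponds roughly to the sketch below (assuming HADOOP_HOME is set inside the container - the image's actual script may differ):

$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh      # stop YARN and HDFS daemons
hdfs namenode -format -force                                         # re-format the NameNode (wipes existing HDFS metadata)
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh    # start HDFS and YARN again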
Enter: jps
jps
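If you prefer a scripted check over reading the jps listing by eye, grep for the daemons you expect. The set below is an assumption for a single-node setup; the exact daemons depend on what the image starts:

for d in NameNode DataNode ResourceManager NodeManager; do
  jps | grep -q "$d" && echo "$d: running" || echo "$d: NOT running"
done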

This completes the installation of all tools required for the Big Data course.
Type exit to leave the container - this will also stop the container.
To start the container again, enter the following:
docker start -ai AHW

Once the Docker shell opens, just type restart to restart all processes.
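If you are unsure whether the container is currently running or stopped, check its state from the host:

docker ps -a --filter name=AHW    # STATUS shows whether AHW is Up or Exited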

The following sections use a multi-container Hadoop cluster deployed with Docker Compose. Before you begin, ensure Docker & Docker Compose have been installed & configured.
docker-compose --version

Run the Docker containers using docker-compose:
cd
cd ~/Hadoop
docker-compose up -d

[+] Running 28/5
✔ datanode Pulled 32.7s
✔ namenode Pulled 32.6s
✔ nodemanager1 Pulled 32.5s
✔ resourcemanager Pulled 32.3s
✔ historyserver Pulled 32.5s
[+] Running 9/9
✔ Network hadoop_default Creat... 0.5s
✔ Volume "hadoop_hadoop_datanode" Created 0.0s
✔ Volume "hadoop_hadoop_historyserver" Created 0.0s
✔ Volume "hadoop_hadoop_namenode" Created 0.0s
✔ Container datanode Started 3.8s
✔ Container namenode Started 3.9s
✔ Container nodemanager Starte... 3.9s
✔ Container historyserver Star... 3.8s
✔ Container resourcemanager St... 3.9s
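Before moving on, you can confirm that all of the compose services are up:

docker-compose ps    # each service should be listed with a running state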
Access the Cluster
You can log in to any node by specifying its container:
docker exec -it datanode /bin/bash

Navigate to the mapped data volume:
cd hadoop/dfs/

Accessing the UI
The NameNode UI can be accessed at:
The ResourceManager UI can be accessed at:
The History Server UI can be accessed at:
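The exact addresses depend on the port mappings in your docker-compose.yml; assuming it exposes the stock Hadoop web ports, the UIs are typically reachable on the host at 9870 (NameNode), 8088 (ResourceManager) and 8188 (History Server). A quick reachability check from the host, with those ports as assumptions:

curl -s -o /dev/null -w "NameNode UI:        %{http_code}\n" http://localhost:9870
curl -s -o /dev/null -w "ResourceManager UI: %{http_code}\n" http://localhost:8088
curl -s -o /dev/null -w "History Server UI:  %{http_code}\n" http://localhost:8188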
Shutdown Cluster
To shut down the cluster:
docker-compose down
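A plain docker-compose down keeps the named hadoop_* volumes, so the HDFS data survives a restart; add -v only if you also want to discard that data:

docker-compose down -v    # also removes the named volumes and everything stored in HDFS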
Test - Word Count Algorithm
List all the files in the HDFS filesystem:
hdfs dfs -ls /

Create a /user/root/ directory:
hdfs dfs -mkdir -p /user/root

Verify the directory:
hdfs dfs -ls /user/

Found 1 items
drwxr-xr-x - root supergroup 0 2024-08-10 13:59 /user/root

Download the hadoop-mapreduce-examples-3.2.1-sources.jar file.
Save hadoop-mapreduce-examples-3.2.1-sources.jar to: ~/Hadoop
Download & save a text file - Canterbury Tales or Ulysses.
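pg2383.txt, used in the next steps, is Project Gutenberg ebook #2383 (The Canterbury Tales); one way to fetch it, assuming the usual Gutenberg cache URL layout, is:

cd ~/Hadoop/assets
wget https://www.gutenberg.org/cache/epub/2383/pg2383.txt    # Canterbury Tales (URL pattern assumed)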
Copy the files into the namenode container.
cd
cd ~/Hadoop/assets
docker cp hadoop-mapreduce-examples-3.2.1-sources.jar namenode:/tmp
docker cp pg2383.txt namenode:/tmp

Create the input folder:
docker exec -it namenode bash
hdfs dfs -mkdir /user/root/input

Copy /tmp/pg2383.txt over to /user/root/input:
cd
cd /tmp
hdfs dfs -put pg2383.txt /user/root/input
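Optionally confirm the file landed in HDFS before launching the job:

hdfs dfs -ls /user/root/input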
Run MapReduce

hadoop jar hadoop-mapreduce-examples-3.2.1-sources.jar org.apache.hadoop.examples.WordCount input output

2024-08-10 14:10:15,533 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.6:8032
2024-08-10 14:10:15,702 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.3:10200
2024-08-10 14:10:15,879 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1723287966223_0001
2024-08-10 14:10:15,969 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,068 INFO input.FileInputFormat: Total input files to process : 1
2024-08-10 14:10:16,088 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,101 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,107 INFO mapreduce.JobSubmitter: number of splits:1
2024-08-10 14:10:16,189 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1723287966223_0001
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-08-10 14:10:16,345 INFO conf.Configuration: resource-types.xml not found
2024-08-10 14:10:16,346 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-08-10 14:10:16,813 INFO impl.YarnClientImpl: Submitted application application_1723287966223_0001
2024-08-10 14:10:16,867 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1723287966223_0001/
2024-08-10 14:10:16,868 INFO mapreduce.Job: Running job: job_1723287966223_0001
2024-08-10 14:10:23,970 INFO mapreduce.Job: Job job_1723287966223_0001 running in uber mode : false
2024-08-10 14:10:23,971 INFO mapreduce.Job: map 0% reduce 0%
2024-08-10 14:10:30,048 INFO mapreduce.Job: map 100% reduce 0%
2024-08-10 14:10:34,065 INFO mapreduce.Job: map 100% reduce 100%
2024-08-10 14:10:35,074 INFO mapreduce.Job: Job job_1723287966223_0001 completed successfully
2024-08-10 14:10:35,163 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=187024
FILE: Number of bytes written=832593
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1692663
HDFS: Number of bytes written=438623
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=10968
Total time spent by all reduces in occupied slots (ms)=16448
Total time spent by all map tasks (ms)=2742
Total time spent by all reduce tasks (ms)=2056
Total vcore-milliseconds taken by all map tasks=2742
Total vcore-milliseconds taken by all reduce tasks=2056
Total megabyte-milliseconds taken by all map tasks=11231232
Total megabyte-milliseconds taken by all reduce tasks=16842752
Map-Reduce Framework
Map input records=36758
Map output records=282822
Map output bytes=2691784
Map output materialized bytes=187016
Input split bytes=112
Combine input records=282822
Combine output records=41330
Reduce input groups=41330
Reduce shuffle bytes=187016
Reduce input records=41330
Reduce output records=41330
Spilled Records=82660
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=237
CPU time spent (ms)=4180
Physical memory (bytes) snapshot=862277632
Virtual memory (bytes) snapshot=13577064448
Total committed heap usage (bytes)=1277165568
Peak Map Physical memory (bytes)=608587776
Peak Map Virtual memory (bytes)=5115801600
Peak Reduce Physical memory (bytes)=253689856
Peak Reduce Virtual memory (bytes)=8461262848
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1692551
File Output Format Counters
Bytes Written=438623

View the output:
hdfs dfs -cat /user/root/output/*

...
reserved, 1
reserved. 1
reserves, 1
resided 1
residence, 1
resign 1
resign, 1
resign. 1
resignation, 1
resist 1
resisted; 1
resistence, 1
resolution 4
resolved 8
resolves 1
resolving 2
resort, 3
resort; 1
resounded 1
resources 1
respect 2
respect. 1
respective 3
...

Check the results by listing the output folder:
hdfs dfs -ls /user/root/output

Output the text file:
hdfs dfs -cat /user/root/output/part-r-00000 > /tmp/pg2383_wc.txt
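The listing above is alphabetical by word; to see the 20 most frequent words instead, sort on the tab-separated count column while still inside the namenode container:

hdfs dfs -cat /user/root/output/part-r-00000 | sort -k2,2nr | head -n 20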
exit
Copy the results file from the container to the host:
docker cp namenode:/tmp/pg2383_wc.txt .
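pg2383_wc.txt has one word and its count per line, so a quick sanity check on the host (from the directory you copied it to) is to compare the line count with the job's Reduce output records counter (41330 in the run above):

wc -l pg2383_wc.txt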