Apache Hadoop

Set up Apache Hadoop on Windows, Linux and macOS.

The following steps set up a Pentaho Lab environment and must be completed before you start the Workshops.

Ensure you have downloaded the Workshop--Installation repository:

cd
git clone https://github.com/jporeilly/Workshop--Installation

To install git:

sudo apt install git

Prerequisites

Ubuntu 24.04 LTS system (physical or virtual machine)
User account with sudo privileges
Internet connection
Basic familiarity with Linux command line

In pseudo-distributed mode, all Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) run on a single machine. This setup is configured to mimic a multi-node cluster, allowing you to test HDFS operations and run MapReduce or YARN applications as if you had a small cluster. It's the ideal starting point for learning Hadoop.

Update system packages.

sudo apt update && sudo apt upgrade -y

Hadoop is built on Java, so you need a Java Development Kit (JDK) installed. Confirm the installation.

java --version

Before you begin, ensure Docker and Docker Compose have been installed and configured.

docker-compose --version
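
The individual version checks above can be combined into one pass. The following is a minimal sketch, not part of the workshop scripts: the function name `missing_tools` is hypothetical, and the tool list is an assumption based on the commands used in this guide.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this guide relies on; adjust the list to your setup.
missing="$(missing_tools git java docker docker-compose)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All required tools found."
fi
```

Install anything reported as missing before continuing.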

Create Directories

Create directory structure.

// Some code

Run the docker containers using docker-compose

cd
cd ~/Hadoop
docker-compose up -d

[+] Running 28/5
 ✔ datanode Pulled                                                        32.7s 
 ✔ namenode Pulled                                                        32.6s 
 ✔ nodemanager1 Pulled                                                    32.5s 
 ✔ resourcemanager Pulled                                                 32.3s 
 ✔ historyserver Pulled                                                   32.5s 
[+] Running 9/9
 ✔ Network hadoop_default                Creat...                          0.5s 
 ✔ Volume "hadoop_hadoop_datanode"       Created                           0.0s 
 ✔ Volume "hadoop_hadoop_historyserver"  Created                           0.0s 
 ✔ Volume "hadoop_hadoop_namenode"       Created                           0.0s 
 ✔ Container datanode                    Started                           3.8s 
 ✔ Container namenode                    Started                           3.9s 
 ✔ Container nodemanager                 Starte...                         3.9s 
 ✔ Container historyserver               Star...                           3.8s 
 ✔ Container resourcemanager             St...                             3.9s 
...

Access the Cluster

You can log in to any node by specifying its container name.

docker exec -it datanode /bin/bash

Navigate to mapped data volume.

cd hadoop/dfs/

Accessing the UI

The Namenode UI can be accessed at:

http://localhost:9870/dfshealth.html#tab-overview

ResourceManager UI can be accessed at:

http://localhost:8088/

History Server UI can be accessed at:

http://localhost:8188/applicationhistory
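
To confirm from the command line that the three web UIs respond, a small probe such as the one below may help. It is a sketch that assumes `curl` is installed and that the default ports 9870, 8088 and 8188 listed above are in use. The function names `ui_urls` and `probe_uis` are hypothetical.

```shell
# Web UI addresses listed above, one per line.
ui_urls() {
  cat <<'EOF'
http://localhost:9870/
http://localhost:8088/
http://localhost:8188/applicationhistory
EOF
}

# Report whether each UI answers within 5 seconds.
probe_uis() {
  ui_urls | while read -r url; do
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "up    $url"
    else
      echo "down  $url"
    fi
  done
}

probe_uis
```

A UI reported as down shortly after startup may only need a few more seconds to come up.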

Shutdown Cluster

To shut down the cluster.

docker-compose down

With the cluster running, check that it can run Hadoop jobs.

We will run a job that counts how many times each word appears in the Canterbury Tales.

Test - Word Count Algorithm

List all the files in our HDFS system.

hdfs dfs -ls /

Create the /user/root directory.

hdfs dfs -mkdir -p /user/root

Verify directory.

hdfs dfs -ls /user/

Found 1 items
drwxr-xr-x   - root supergroup          0 2024-08-10 13:59 /user/root

Download the hadoop-mapreduce-examples-3.2.1-sources.jar file

We will use a .jar file containing the classes needed to execute the MapReduce algorithm.

Central Repository (repo1.maven.org): org/apache/hadoop/hadoop-mapreduce-examples/3.2.1
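
The repository path above follows the standard Maven layout (group path, artifact, version, file name), so the download URL can be assembled in a script. This is a sketch only: the `https://repo1.maven.org/maven2` base URL is an assumption not stated on this page, and `maven_url` is a hypothetical helper.

```shell
# Build a Maven Central download URL from group, artifact, version and an
# optional classifier. The repo1.maven.org/maven2 base is an assumption.
maven_url() {
  group_path="$(printf '%s' "$1" | tr '.' '/')"
  file="$2-$3${4:+-$4}.jar"
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s\n' "$group_path" "$2" "$3" "$file"
}

url="$(maven_url org.apache.hadoop hadoop-mapreduce-examples 3.2.1 sources)"
echo "$url"
# To download into ~/Hadoop (requires network access):
#   curl -fL -o ~/Hadoop/hadoop-mapreduce-examples-3.2.1-sources.jar "$url"
```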

Save - hadoop-mapreduce-examples-3.2.1-sources.jar to: ~/Hadoop
Download & Save text file - Canterbury Tales or Ulysses
Copy the files into the namenode container.

cd
cd ~/Hadoop/assets
docker cp hadoop-mapreduce-examples-3.2.1-sources.jar namenode:/tmp
docker cp pg2383.txt namenode:/tmp

Create the Input folder.

docker exec -it namenode bash
hdfs dfs -mkdir /user/root/input

Copy over /tmp/pg2383.txt to /user/root/input.

cd
cd /tmp
hdfs dfs -put pg2383.txt /user/root/input

Run MapReduce

hadoop jar hadoop-mapreduce-examples-3.2.1-sources.jar org.apache.hadoop.examples.WordCount input output

2024-08-10 14:10:15,533 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.6:8032
2024-08-10 14:10:15,702 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.3:10200
2024-08-10 14:10:15,879 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1723287966223_0001
2024-08-10 14:10:15,969 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,068 INFO input.FileInputFormat: Total input files to process : 1
2024-08-10 14:10:16,088 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,101 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,107 INFO mapreduce.JobSubmitter: number of splits:1
2024-08-10 14:10:16,189 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1723287966223_0001
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-08-10 14:10:16,345 INFO conf.Configuration: resource-types.xml not found
2024-08-10 14:10:16,346 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-08-10 14:10:16,813 INFO impl.YarnClientImpl: Submitted application application_1723287966223_0001
2024-08-10 14:10:16,867 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1723287966223_0001/
2024-08-10 14:10:16,868 INFO mapreduce.Job: Running job: job_1723287966223_0001
2024-08-10 14:10:23,970 INFO mapreduce.Job: Job job_1723287966223_0001 running in uber mode : false
2024-08-10 14:10:23,971 INFO mapreduce.Job:  map 0% reduce 0%
2024-08-10 14:10:30,048 INFO mapreduce.Job:  map 100% reduce 0%
2024-08-10 14:10:34,065 INFO mapreduce.Job:  map 100% reduce 100%
2024-08-10 14:10:35,074 INFO mapreduce.Job: Job job_1723287966223_0001 completed successfully
2024-08-10 14:10:35,163 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=187024
		FILE: Number of bytes written=832593
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1692663
		HDFS: Number of bytes written=438623
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=10968
		Total time spent by all reduces in occupied slots (ms)=16448
		Total time spent by all map tasks (ms)=2742
		Total time spent by all reduce tasks (ms)=2056
		Total vcore-milliseconds taken by all map tasks=2742
		Total vcore-milliseconds taken by all reduce tasks=2056
		Total megabyte-milliseconds taken by all map tasks=11231232
		Total megabyte-milliseconds taken by all reduce tasks=16842752
	Map-Reduce Framework
		Map input records=36758
		Map output records=282822
		Map output bytes=2691784
		Map output materialized bytes=187016
		Input split bytes=112
		Combine input records=282822
		Combine output records=41330
		Reduce input groups=41330
		Reduce shuffle bytes=187016
		Reduce input records=41330
		Reduce output records=41330
		Spilled Records=82660
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=237
		CPU time spent (ms)=4180
		Physical memory (bytes) snapshot=862277632
		Virtual memory (bytes) snapshot=13577064448
		Total committed heap usage (bytes)=1277165568
		Peak Map Physical memory (bytes)=608587776
		Peak Map Virtual memory (bytes)=5115801600
		Peak Reduce Physical memory (bytes)=253689856
		Peak Reduce Virtual memory (bytes)=8461262848
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1692551
	File Output Format Counters 
		Bytes Written=438623

View the output.

hdfs dfs -cat /user/root/output/*

...
reserved,	1
reserved.	1
reserves,	1
resided	1
residence,	1
resign	1
resign,	1
resign.	1
resignation,	1
resist	1
resisted;	1
resistence,	1
resolution	4
resolved	8
resolves	1
resolving	2
resort,	3
resort;	1
resounded	1
resources	1
respect	2
respect.	1
respective	3
...
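
For intuition, the counting shown above can be approximated on a local file with standard tools. This is a rough sketch, not the Hadoop implementation: it assumes tokens are split on whitespace only, which is why punctuation stays attached to words in the listing (for example `resign,` and `resign.`). The function name `wordcount` is hypothetical.

```shell
# Approximate WordCount: split on whitespace, count identical tokens,
# and print "token<TAB>count" sorted in byte order.
wordcount() {
  tr -s '[:space:]' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | LC_ALL=C uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

printf 'resolved to resign, resolved\n' | wordcount
```

On the full text this runs in a single process, whereas the Hadoop job splits the same work into map, combine and reduce phases.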

Check the results by listing the output folder.

hdfs dfs -ls /user/root/output

Export the results to a text file and copy it to the current directory on the host.

hdfs dfs -cat /user/root/output/part-r-00000 > /tmp/pg2383_wc.txt
exit
docker cp namenode:/tmp/pg2383_wc.txt .
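
Once the exported file is on the host, it can be checked against the job counters. The sketch below assumes the file is in the current directory and uses the tab-separated layout shown in the output listing. `summarize_counts` is a hypothetical helper. The number of lines should correspond to "Reduce output records" and the sum of the counts to "Map output records", since the mapper emits one record per token.

```shell
# Print the number of distinct tokens and the sum of all counts found in a
# "token<TAB>count" file such as the exported WordCount output.
summarize_counts() {
  awk -F '\t' '{ total += $2 } END { printf "distinct=%d total=%d\n", NR, total }' "$1"
}

# The file name assumes the export step above was run from this directory.
if [ -f pg2383_wc.txt ]; then
  summarize_counts pg2383_wc.txt
fi
```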

Download the prebuilt Apache-Hadoop image from the Docker Hub repository:

Windows

AMD64 (x86-64) processors, including Intel:

docker pull jporeilly/apache-hadoop:amd

macOS

ARM-based (Mac M series):

docker pull jporeilly/apache-hadoop:arm

These images are about 14 GB, so please be patient.
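
If you are unsure which tag applies to your machine, the architecture reported by `uname -m` can be mapped to the two tags above. This is a sketch under the assumption that `x86_64`/`amd64` corresponds to the `amd` tag and `arm64`/`aarch64` to the `arm` tag. `image_tag_for` is a hypothetical helper, and the script only prints the pull command rather than running it.

```shell
# Map a machine architecture string (as printed by `uname -m`) to the image
# tag used in this guide. Unknown architectures yield a non-zero status.
image_tag_for() {
  case "$1" in
    x86_64|amd64)  echo amd ;;
    arm64|aarch64) echo arm ;;
    *)             return 1 ;;
  esac
}

if tag="$(image_tag_for "$(uname -m)")"; then
  echo "docker pull jporeilly/apache-hadoop:$tag"
else
  echo "No prebuilt image listed for $(uname -m)" >&2
fi
```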

Once completed, deploy the Apache-Hadoop containers.

docker run -it -p 9870:9870 -p 8095:8088 -p 9864:9864 --name AHW jporeilly/apache-hadoop:amd

The YARN ResourceManager port (8088 in the container) is mapped to host port 8095 to prevent a conflict with Cedalo Management Center (MQTT).

Once completed a shell will open:

root@955b8d17f170:/#

Enter: init

init

This stops all running processes, formats the HDFS namenodes, and starts all processes.

Enter: jps

jps
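
The `jps` listing can also be checked automatically. The sketch below assumes the container runs the four daemons named in the pseudo-distributed description earlier in this guide, and that `jps` prints one `<pid> <name>` pair per line. `missing_daemons` is a hypothetical helper.

```shell
# Daemons named in the pseudo-distributed description earlier in this guide.
EXPECTED_DAEMONS="NameNode DataNode ResourceManager NodeManager"

# Read `jps` output on stdin ("<pid> <name>" per line) and print any
# expected daemon that is not listed.
missing_daemons() {
  names="$(awk '{ print $2 }')"
  for d in $EXPECTED_DAEMONS; do
    printf '%s\n' "$names" | grep -qx "$d" || printf '%s\n' "$d"
  done
}

# Run inside the container, where jps is available.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

No output means every expected daemon was found.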

Now you can access the Hadoop services at:

NameNode Web UI: http://localhost:9870
YARN ResourceManager: http://localhost:8095 (instead of 8088)
DataNode Web UI: http://localhost:9864

This completes the installation of all tools required for the Big Data course.

Just type exit to exit from the container - this will stop the container.

To start the container enter the following:

docker start -ai AHW

Once the Docker shell opens, just type restart to restart all processes.

NameNode

The NameNode is the master node and central component of Hadoop's Distributed File System (HDFS). It acts as the "brain" of the file system.

Log into NameNode:

http://localhost:9870

From the NameNode UI you can upload files to the root directory.


