# Apache Hadoop

{% hint style="danger" %}
The following steps set up a Pentaho Lab environment and must be completed before you start the Workshops.

Ensure you have downloaded the Workshop--Installation repository:

```bash
cd
git clone https://github.com/jporeilly/Workshop--Installation
```

To install git:

```bash
sudo apt install git
```

{% endhint %}

{% hint style="info" %}
**Prerequisites**

* Ubuntu 24.04 LTS system (physical or virtual machine)
* User account with sudo privileges
* Internet connection
* Basic familiarity with Linux command line
  {% endhint %}


{% tabs %}
{% tab title="Linux" %}
{% hint style="info" %}
In pseudo-distributed mode, all Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) run on a single machine. This setup is configured to mimic a multi-node cluster, allowing you to test HDFS operations and run MapReduce or YARN applications as if you had a small cluster. It's the ideal starting point for learning Hadoop.
{% endhint %}

1. Update system packages.

```bash
sudo apt update && sudo apt upgrade -y
```

2. Hadoop is built on Java, so you need a Java Development Kit (JDK) installed. Confirm the installation.

```bash
java --version
```

3. Before you begin, ensure Docker and Docker Compose are installed and configured.

```bash
docker-compose --version
```
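Newer Docker installations ship Compose as a plugin that is invoked as `docker compose`, so the standalone `docker-compose` command may be missing even when Compose is installed. The following is a minimal sketch of a check that reports which form is available; the helper name `compose_cmd` is hypothetical and not part of the workshop files.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print whichever Compose invocation is available.
compose_cmd() {
  if command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"          # standalone binary
  elif docker compose version >/dev/null 2>&1; then
    echo "docker compose"          # Compose plugin for the docker CLI
  else
    return 1                       # neither form was found
  fi
}

if cmd="$(compose_cmd)"; then
  echo "Using: $cmd"
else
  echo "Docker Compose not found." >&2
fi
```

If only the plugin form is reported, substitute `docker compose` wherever this page uses `docker-compose`.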

***

**Create Directories**

1. Create directory structure.

```
// Some code
```

2. Start the Docker containers with docker-compose.

```bash
cd
cd ~/Hadoop
docker-compose up -d
```

```bash
[+] Running 28/5
 ✔ datanode Pulled                                                        32.7s 
 ✔ namenode Pulled                                                        32.6s 
 ✔ nodemanager1 Pulled                                                    32.5s 
 ✔ resourcemanager Pulled                                                 32.3s 
 ✔ historyserver Pulled                                                   32.5s 
[+] Running 9/9
 ✔ Network hadoop_default                Creat...                          0.5s 
 ✔ Volume "hadoop_hadoop_datanode"       Created                           0.0s 
 ✔ Volume "hadoop_hadoop_historyserver"  Created                           0.0s 
 ✔ Volume "hadoop_hadoop_namenode"       Created                           0.0s 
 ✔ Container datanode                    Started                           3.8s 
 ✔ Container namenode                    Started                           3.9s 
 ✔ Container nodemanager                 Starte...                         3.9s 
 ✔ Container historyserver               Star...                           3.8s 
 ✔ Container resourcemanager             St...                             3.9s 
... 
```

***

**Access the Cluster**

1. You can log in to any node by specifying its container name.

```bash
docker exec -it datanode /bin/bash 
```

2. Navigate to the mapped data volume.

```bash
cd hadoop/dfs/
```

***

**Accessing the UI**

The NameNode UI can be accessed at:

{% embed url="<http://localhost:9870/dfshealth.html#tab-overview>" %}

ResourceManager UI can be accessed at:

{% embed url="<http://localhost:8088/>" %}

History Server UI can be accessed at:

{% embed url="<http://localhost:8188/applicationhistory>" %}
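The containers can take a little while to start serving their web UIs. As a rough sketch, the three URLs above can be probed from the host with `curl`; the helper name `ui_status` is hypothetical, and the check assumes `curl` is installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code returned by a URL.
# curl prints 000 when no connection could be made.
ui_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# The URLs are the ones listed on this page.
for url in \
  http://localhost:9870 \
  http://localhost:8088 \
  http://localhost:8188/applicationhistory
do
  echo "$url -> $(ui_status "$url")"
done
```

A status of `000` usually means the container is not running yet or the port is not published.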

***

**Shutdown Cluster**

To shut down the cluster, run:

```bash
docker-compose down
```

{% hint style="info" %}
Time to check that we can run some Hadoop jobs. If you shut the cluster down, start it again with `docker-compose up -d` first.

We're going to run a job that counts how many times each word appears in the Canterbury Tales.
{% endhint %}
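To make the goal concrete, here is a small local illustration of word counting with standard shell tools. It is not the Hadoop job itself, only a sketch of the same idea; the function name `word_count` is hypothetical. Tokens are split on whitespace only, which is why the sample job output further down lists `resign`, `resign,` and `resign.` as separate words.

```bash
#!/usr/bin/env bash
# Local illustration only: count whitespace-separated tokens on stdin
# and print "word<TAB>count" lines.
word_count() {
  tr -s '[:space:]' '\n' |       # put one token on each line
    sed '/^$/d' |                # drop empty lines
    sort |                       # group identical tokens together
    uniq -c |                    # count each group
    awk '{ print $2 "\t" $1 }'   # reorder to word<TAB>count
}

printf 'the cat sat\nthe cat, the mat\n' | word_count
```

The sample input yields five distinct tokens, with `the` counted three times and `cat` and `cat,` counted separately.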

**Test - Word Count Algorithm**

1. From a shell inside a Hadoop container (see **Access the Cluster**), list all the files in HDFS.

```bash
hdfs dfs -ls /
```

2. Create the /user/root directory.

```bash
hdfs dfs -mkdir -p /user/root
```

3. Verify directory.

```
hdfs dfs -ls /user/
```

```
Found 1 items
drwxr-xr-x   - root supergroup          0 2024-08-10 13:59 /user/root
```

4. Download the **hadoop-mapreduce-examples-3.2.1-sources.jar** file

{% hint style="info" %}
We will use a .jar file containing the classes needed to run the MapReduce job.
{% endhint %}

{% embed url="<https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/3.2.1/>" %}

5. Save hadoop-mapreduce-examples-3.2.1-sources.jar to: \~/Hadoop
6. Download and save the text file: [**Canterbury Tales**](https://www.gutenberg.org/cache/epub/2383/pg2383.txt) **or** [**Ulysses**](https://www.gutenberg.org/cache/epub/4300/pg4300.txt)
7. Copy the files into the namenode container.

```bash
cd
cd ~/Hadoop/assets
docker cp hadoop-mapreduce-examples-3.2.1-sources.jar namenode:/tmp
docker cp pg2383.txt namenode:/tmp
```

8. Create the Input folder.

```bash
docker exec -it namenode bash
hdfs dfs -mkdir /user/root/input
```

9. Copy over /tmp/pg2383.txt to /user/root/input.

```bash
cd
cd /tmp
hdfs dfs -put pg2383.txt /user/root/input
```

10. Run MapReduce.

```
hadoop jar hadoop-mapreduce-examples-3.2.1-sources.jar org.apache.hadoop.examples.WordCount input output
```

```
2024-08-10 14:10:15,533 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.18.0.6:8032
2024-08-10 14:10:15,702 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.18.0.3:10200
2024-08-10 14:10:15,879 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1723287966223_0001
2024-08-10 14:10:15,969 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,068 INFO input.FileInputFormat: Total input files to process : 1
2024-08-10 14:10:16,088 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,101 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,107 INFO mapreduce.JobSubmitter: number of splits:1
2024-08-10 14:10:16,189 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1723287966223_0001
2024-08-10 14:10:16,200 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-08-10 14:10:16,345 INFO conf.Configuration: resource-types.xml not found
2024-08-10 14:10:16,346 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-08-10 14:10:16,813 INFO impl.YarnClientImpl: Submitted application application_1723287966223_0001
2024-08-10 14:10:16,867 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1723287966223_0001/
2024-08-10 14:10:16,868 INFO mapreduce.Job: Running job: job_1723287966223_0001
2024-08-10 14:10:23,970 INFO mapreduce.Job: Job job_1723287966223_0001 running in uber mode : false
2024-08-10 14:10:23,971 INFO mapreduce.Job:  map 0% reduce 0%
2024-08-10 14:10:30,048 INFO mapreduce.Job:  map 100% reduce 0%
2024-08-10 14:10:34,065 INFO mapreduce.Job:  map 100% reduce 100%
2024-08-10 14:10:35,074 INFO mapreduce.Job: Job job_1723287966223_0001 completed successfully
2024-08-10 14:10:35,163 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=187024
		FILE: Number of bytes written=832593
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1692663
		HDFS: Number of bytes written=438623
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=10968
		Total time spent by all reduces in occupied slots (ms)=16448
		Total time spent by all map tasks (ms)=2742
		Total time spent by all reduce tasks (ms)=2056
		Total vcore-milliseconds taken by all map tasks=2742
		Total vcore-milliseconds taken by all reduce tasks=2056
		Total megabyte-milliseconds taken by all map tasks=11231232
		Total megabyte-milliseconds taken by all reduce tasks=16842752
	Map-Reduce Framework
		Map input records=36758
		Map output records=282822
		Map output bytes=2691784
		Map output materialized bytes=187016
		Input split bytes=112
		Combine input records=282822
		Combine output records=41330
		Reduce input groups=41330
		Reduce shuffle bytes=187016
		Reduce input records=41330
		Reduce output records=41330
		Spilled Records=82660
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=237
		CPU time spent (ms)=4180
		Physical memory (bytes) snapshot=862277632
		Virtual memory (bytes) snapshot=13577064448
		Total committed heap usage (bytes)=1277165568
		Peak Map Physical memory (bytes)=608587776
		Peak Map Virtual memory (bytes)=5115801600
		Peak Reduce Physical memory (bytes)=253689856
		Peak Reduce Virtual memory (bytes)=8461262848
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1692551
	File Output Format Counters 
		Bytes Written=438623
```
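A MapReduce job refuses to start if its output directory already exists, so running the command in step 10 a second time fails until `/user/root/output` is removed. A minimal sketch of a cleanup helper follows; the name `hdfs_rm_if_exists` is hypothetical, and it should only be used before a re-run, after you have finished with the previous results.

```bash
#!/usr/bin/env bash
# Hypothetical helper: remove an HDFS directory only if it exists.
hdfs_rm_if_exists() {
  if hdfs dfs -test -d "$1"; then
    hdfs dfs -rm -r "$1"
  fi
}

# Usage, inside the namenode container, before re-running the job:
#   hdfs_rm_if_exists /user/root/output
```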

11. View the output.

```bash
hdfs dfs -cat /user/root/output/*
```

```
...
reserved,	1
reserved.	1
reserves,	1
resided	1
residence,	1
resign	1
resign,	1
resign.	1
resignation,	1
resist	1
resisted;	1
resistence,	1
resolution	4
resolved	8
resolves	1
resolving	2
resort,	3
resort;	1
resounded	1
resources	1
respect	2
respect.	1
respective	3
...
```

12. Check the results by listing the output folder.

```bash
hdfs dfs -ls /user/root/output
```

13. Export the results to a text file and copy it to the host.

```
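# Inside the namenode container: write the job output to a local file.
hdfs dfs -cat /user/root/output/part-r-00000 > /tmp/pg2383_wc.txt
# Leave the container.
exit
# On the host: copy the file into the current directory.
docker cp namenode:/tmp/pg2383_wc.txt .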
```
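The exported file is sorted by word, not by frequency. As a sketch, the most frequent words can be listed on the host with `sort`, assuming the file keeps the tab-separated `word<TAB>count` layout shown in step 11; the helper name `top_words` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the N most frequent words (default 10)
# from a tab-separated "word<TAB>count" file.
top_words() {
  file="$1"
  n="${2:-10}"
  # Sort numerically on the second (count) column, largest first.
  sort -t "$(printf '\t')" -k2,2nr "$file" | head -n "$n"
}

# Only run the example if the exported file is in the current directory.
if [ -f pg2383_wc.txt ]; then
  top_words pg2383_wc.txt 10
fi
```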

{% endtab %}

{% tab title="Windows & MacOS" %}

1. Download the Apache-Hadoop prebuilt image from Docker Hub repository:

{% hint style="info" %}
**Windows**

AMD64 (x86-64) systems, with Intel or AMD processors:

```bash
docker pull jporeilly/apache-hadoop:amd
```

{% endhint %}

{% hint style="info" %}
**MacOS**

ARM-based (Mac M series):

```
docker pull jporeilly/apache-hadoop:arm
```

{% endhint %}

{% hint style="warning" %}
These images are about 14 GB, so please be patient.
{% endhint %}

<figure><img src="/files/piFnfnGXEij9bFaHo472" alt=""><figcaption></figcaption></figure>

2. Once the download completes, start the Apache Hadoop container. On MacOS, use the `:arm` tag in place of `:amd`.

```bash
docker run -it -p 9870:9870 -p 8095:8088 -p 9864:9864 --name AHW jporeilly/apache-hadoop:amd
```

{% hint style="danger" %}
The YARN ResourceManager UI (container port 8088) is mapped to host port 8095 to avoid a conflict with the Cedalo Management Center (MQTT).
{% endhint %}

3. Once the container starts, a shell opens:

```
root@955b8d17f170:/#
```

4. Enter: init

```
init
```

{% hint style="warning" %}
This stops all running processes, formats the HDFS namenodes, and starts all processes.
{% endhint %}

5. Enter: jps

```
jps
```

<figure><img src="/files/iXXWzPfIpC1Llx6FNLHi" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/1lJAM0duFkJvpWrSbTCs" alt=""><figcaption></figcaption></figure>
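Reading the `jps` listing by eye is easy to get wrong. The sketch below checks the listing for the four daemons this page names for pseudo-distributed mode (NameNode, DataNode, ResourceManager, NodeManager). The helper name `missing_daemons` is hypothetical, and the image may run additional processes that this check ignores.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output on stdin and print the name of
# each expected daemon that is not listed.
missing_daemons() {
  out="$(cat)"
  for d in NameNode DataNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>"; require an exact match on the name so
    # that, for example, SecondaryNameNode does not count as NameNode.
    printf '%s\n' "$out" |
      awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$d"
  done
}

# Only run the check where jps is available (inside the container).
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

No output means all four daemons were found.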

{% hint style="info" %}
Now you can access the Hadoop services at:

* **NameNode Web UI**: <http://localhost:9870>
* **YARN ResourceManager**: <http://localhost:8095> (instead of 8088)
* **DataNode Web UI**: <http://localhost:9864>
  {% endhint %}

This completes the installation of all tools required for the Big Data course.

Just type `exit` to exit from the container - this will stop the container.

To start the container again, enter the following:

```bash
docker start -ai AHW
```
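`docker start -ai` is meant for a stopped container. If `AHW` is still running, a new shell can be opened with `docker exec` instead. The following sketch picks between the two; the helper names `container_running` and `open_shell` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named container is running.
container_running() {
  [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

# Hypothetical helper: open a shell in the container, starting it first
# if it is stopped.
open_shell() {
  if container_running "$1"; then
    docker exec -it "$1" bash    # already running: open another shell
  else
    docker start -ai "$1"        # stopped: start it and attach
  fi
}

# Usage:
#   open_shell AHW
```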

Once the Docker shell opens, just type `restart` to restart all processes.

<figure><img src="/files/0n9qW3xFOmtx74hIKvwu" alt=""><figcaption></figcaption></figure>

{% tabs %}
{% tab title="NameNode" %}
{% hint style="info" %}
**NameNode**

The **NameNode** is the master node and central component of Hadoop's Distributed File System (HDFS). It acts as the "brain" of the file system.
{% endhint %}

1. Log into NameNode:

{% embed url="<http://localhost:9870>" %}

2. You can upload files to the root directory:

<figure><img src="/files/BB7vjjkL3oA7GgZbNfw6" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="YARN" %}
{% hint style="info" %}
**YARN**

YARN acts as the operating system for Hadoop clusters by separating resource management from job scheduling and monitoring, allowing multiple data processing engines like MapReduce, Spark, Hive, and others to run simultaneously on the same cluster.
{% endhint %}

1. Log into YARN:

{% embed url="<http://localhost:8095>" %}

<figure><img src="/files/Vvl5OfpFajDuGy0n0GZi" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="DataNode" %}
{% hint style="info" %}
**DataNode**

A DataNode in Hadoop is a worker node in the Hadoop Distributed File System (HDFS) that stores the actual data blocks and serves read/write requests from clients. DataNodes communicate regularly with the NameNode through heartbeat messages to report their health status and the blocks they're storing.

They handle data replication by creating multiple copies of blocks across different nodes to ensure fault tolerance, and they perform block verification to detect corruption. DataNodes also participate in data pipeline operations during file writes and coordinate with other DataNodes to maintain data integrity and availability across the distributed cluster.
{% endhint %}

1. Log into the DataNode:

{% embed url="<http://localhost:9864>" %}

2. The DataNode UI is useful for troubleshooting the node.

<figure><img src="/files/REFmCREQLYg8n2oki3w4" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}
{% endtab %}
{% endtabs %}


