Apache Hadoop

Set up Apache Hadoop on Windows, Linux, and macOS.


Prerequisites

  • Ubuntu 24.04 LTS system (physical or virtual machine)

  • User account with sudo privileges

  • Internet connection

  • Basic familiarity with Linux command line



In pseudo-distributed mode, all Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) run on a single machine. This setup is configured to mimic a multi-node cluster, allowing you to test HDFS operations and run MapReduce or YARN applications as if you had a small cluster. It's the ideal starting point for learning Hadoop.

  1. Update system packages.

sudo apt update && sudo apt upgrade -y
  2. Hadoop is built on Java, so you need a Java Development Kit (JDK) installed. Confirm the installation.

java --version
  3. Before you begin, ensure Docker and Docker Compose are installed and configured.

docker-compose --version
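If either prerequisite is missing, it can typically be installed from Ubuntu's repositories. This is a sketch: the package names `default-jdk` and `docker-compose` are assumptions for Ubuntu 24.04 and may differ on other releases.

```shell
# Install a JDK and Docker Compose from the Ubuntu repositories (assumed package names)
sudo apt install -y default-jdk docker-compose

# Optional: allow the current user to run docker without sudo (takes effect on next login)
sudo usermod -aG docker "$USER"
```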

Create Directories

  1. Create the directory structure. This is where the docker-compose.yml for the Hadoop stack is assumed to live.

mkdir -p ~/Hadoop

  2. Run the Docker containers using docker-compose.

cd ~/Hadoop
docker-compose up -d
[+] Running 28/5
  datanode Pulled                                                        32.7s 
  namenode Pulled                                                        32.6s 
  nodemanager1 Pulled                                                    32.5s 
  resourcemanager Pulled                                                 32.3s 
  historyserver Pulled                                                   32.5s 
[+] Running 9/9
  Network hadoop_default                Created                           0.5s 
  Volume "hadoop_hadoop_datanode"       Created                           0.0s 
  Volume "hadoop_hadoop_historyserver"  Created                           0.0s 
  Volume "hadoop_hadoop_namenode"       Created                           0.0s 
  Container datanode                    Started                           3.8s 
  Container namenode                    Started                           3.9s 
  Container nodemanager                 Started                           3.9s 
  Container historyserver               Started                           3.8s 
  Container resourcemanager             Started                           3.9s 
... 

Access the Cluster

  1. You can log in to any node by specifying its container.

  2. Navigate to the mapped data volume.

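As a sketch, the two steps above might look like this. The container name comes from the docker-compose output earlier; the data path inside the container is an assumption that depends on the image used, so check your image's documentation or volume mounts.

```shell
# Open a shell inside the namenode container (name taken from the compose output above)
docker exec -it namenode bash

# Inside the container: /hadoop/dfs/name is an assumed mount point for the
# mapped data volume -- verify with `mount` or your docker-compose.yml
cd /hadoop/dfs/name
ls
```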

Accessing the UI

The Namenode UI can be accessed at:

ResourceManager UI can be accessed at:

History Server UI can be accessed at:
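The exact URLs depend on the port mappings in your docker-compose.yml. The ports below are the conventional Hadoop 3.x defaults and are assumptions; verify the actual published ports with `docker ps`.

```shell
# Conventional Hadoop 3.x web UI ports (assumptions -- check your compose file):
#   NameNode UI:        http://localhost:9870
#   ResourceManager UI: http://localhost:8088
#   History Server UI:  http://localhost:8188
docker ps --format '{{.Names}}: {{.Ports}}'   # shows the actual published ports
```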


Shutdown Cluster

To shut down the cluster:
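A minimal sketch, assuming you are still in the directory containing docker-compose.yml:

```shell
cd ~/Hadoop
# Stop and remove the containers (add -v to also remove the named volumes)
docker-compose down
```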


Time to check that we can run some Hadoop jobs.

We're going to run a job that counts the number of times each word appears in the Canterbury Tales.

Test - Word Count Algorithm

  1. List all the files in our HDFS system.

  2. Create a /user/root/ directory.

  3. Verify the directory.

  4. Download the hadoop-mapreduce-examples-3.2.1-sources.jar file.


We will use a .jar file containing the classes needed to execute the MapReduce algorithm.

  5. Save hadoop-mapreduce-examples-3.2.1-sources.jar to ~/Hadoop.

  6. Copy the files into the namenode container.

  7. Create the input folder.

  8. Copy /tmp/pg2383.txt to /user/root/input.

  9. Run the MapReduce job.

  10. View the output.

  11. Check the results by accessing the output folder.

  12. Output the text file.
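Putting the steps above together, the full sequence might look like the following sketch. The container name, the pg2383.txt file, and the jar name are taken from this guide; the examples class path (org.apache.hadoop.examples.WordCount) and the host location of pg2383.txt are assumptions, so adjust them if your setup differs.

```shell
# On the host: copy the text and the examples jar into the namenode container
# (assumes both files have been downloaded to ~/Hadoop)
cd ~/Hadoop
docker cp pg2383.txt namenode:/tmp/pg2383.txt
docker cp hadoop-mapreduce-examples-3.2.1-sources.jar namenode:/tmp/
docker exec -it namenode bash

# Inside the container:
hdfs dfs -ls /                                   # list files in HDFS
hdfs dfs -mkdir -p /user/root                    # create /user/root/
hdfs dfs -ls /user                               # verify the directory
hdfs dfs -mkdir /user/root/input                 # create the input folder
hdfs dfs -put /tmp/pg2383.txt /user/root/input   # copy the text into HDFS
# Run the word-count job (the class name is an assumption)
hadoop jar /tmp/hadoop-mapreduce-examples-3.2.1-sources.jar \
    org.apache.hadoop.examples.WordCount input output
hdfs dfs -ls /user/root/output                   # view the output folder
hdfs dfs -cat /user/root/output/part-r-00000     # print the word counts
```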
