Sep 19, 2024

Introduction to HDFS Server Configuration in GBase 8a MPP Cluster

From DEV Community (dev.to)

This article walks through configuring an HDFS server for use with a GBase 8a MPP Cluster. For background, see the previous articles in this series.

Setting Up an HDFS Server Using Apache Hadoop 2.6.0

  1. Preparing the Hadoop Cluster Environment

Example Configuration:

IP              Hostname   Role
192.168.10.114  ch-10-114  NameNode, DataNode
192.168.10.115  ch-10-115  DataNode
192.168.10.116  ch-10-116  DataNode

  2. Configuring Hostnames

Each node needs correct hostname mappings in its /etc/hosts file. For example, on node 192.168.10.114 the file should look as follows. The other nodes can copy this configuration directly.

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1     localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.10.114 ch-10-114
192.168.10.115 ch-10-115
192.168.10.116 ch-10-116

Note: If the node's own hostname is mapped to 127.0.0.1 on the first line, as shown below, the Hadoop DataNodes will be unable to connect to the NameNode after installation.

127.0.0.1   ch-10-114 localhost localhost.localdomain localhost4 localhost4.localdomain4

If the cluster does not have a DNS server to resolve the hostnames of Hadoop's NameNode and DataNode, you need to configure the /etc/hosts file on every coordinator node executing the load task and every data node in the cluster. Add the mappings of the IP addresses and hostnames of the NameNode and DataNode as shown above. If the /etc/hosts file is not configured, an error like “Couldn't resolve hostname” will be reported when loading files from the HDFS server.
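
The two /etc/hosts pitfalls described above (a cluster hostname mapped to the loopback address, or a hostname missing altogether) can be checked before starting Hadoop. The following is a minimal sketch, not part of GBase or Hadoop: check_hosts is a hypothetical helper name, and it assumes the usual hosts format of one address followed by its names on each line.

```shell
# check_hosts FILE HOSTNAME...
# Hypothetical helper: reports how each hostname is mapped in a hosts file.
# Returns non-zero if any hostname is missing or mapped to a loopback address.
check_hosts() {
    hosts_file=$1
    shift
    rc=0
    for host in "$@"; do
        # First address that lists this hostname, ignoring comments.
        addr=$(awk -v h="$host" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == h) { print $1; exit } }
        ' "$hosts_file")
        case $addr in
            "")        echo "MISSING $host"; rc=1 ;;
            127.*|::1) echo "LOOPBACK $host -> $addr"; rc=1 ;;
            *)         echo "OK $host -> $addr" ;;
        esac
    done
    return $rc
}
```

For the example cluster, a call such as `check_hosts /etc/hosts ch-10-114 ch-10-115 ch-10-116` could be run on every node that takes part in loading.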

Check Method:

Use the jps command to confirm that the DataNode process is running. If the DataNode has started but its log shows repeated attempts to connect to port 9000 on the NameNode (the HDFS RPC port), run netstat -an on the NameNode. In the faulty state, the output looks like this:

$ netstat -an | grep 9000
tcp  0  0 127.0.0.1:9000    0.0.0.0:*        LISTEN  

Error Reason: The TCP listener is bound to 127.0.0.1, so only the local machine can connect to port 9000. This is caused by an incorrect /etc/hosts file on the NameNode.

Solution: Remove the hostname (ch-10-114) from the first line (the 127.0.0.1 entry), or move the loopback lines below the cluster host mappings.

Correct configuration:

192.168.10.114 ch-10-114
192.168.10.115 ch-10-115
192.168.10.116 ch-10-116
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1     localhost localhost.localdomain localhost6 localhost6.localdomain6

Restart HDFS and check again with netstat -an | grep 9000. The port and IP should now be correct:

$ netstat -an | grep 9000
tcp  0  0 192.168.10.114:9000    0.0.0.0:*        LISTEN  
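
The before/after comparison above can also be automated. The sketch below is an illustration only: listener_scope is a hypothetical name, and it assumes the Linux netstat -an layout shown in this article (local address in the fourth column, state in the last). It matches the port exactly, whereas grep 9000 would also match ports such as 19000.

```shell
# listener_scope PORT
# Hypothetical helper: reads `netstat -an` output on stdin and reports
# whether PORT is bound to a loopback address or to a reachable one.
listener_scope() {
    awk -v port="$1" '
        $NF == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] != port) next
            # Strip ":PORT" from the local address column.
            addr = substr($4, 1, length($4) - length(parts[n]) - 1)
            found = 1
            if (addr ~ /^127\./ || addr == "::1") print "loopback-only " addr
            else print "reachable " addr
        }
        END { if (!found) print "not-listening" }
    '
}
```

On the NameNode this could be used as `netstat -an | listener_scope 9000`.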

  3. Directory Planning

Directory         Purpose
/home/gbase/bin   Stores the Hadoop ecosystem, including Hadoop itself
/home/gbase/hdfs  Stores HDFS files, including tmp, name, and data
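
The planned layout can be created up front on each node. This is a small sketch under the assumption that the gbase user owns the base directory. make_layout is a hypothetical helper name, and the tmp, name, and data subdirectories follow the table above.

```shell
# make_layout BASE
# Hypothetical helper: creates BASE/bin and BASE/hdfs/{tmp,name,data}.
make_layout() {
    base=$1
    mkdir -p "$base/bin" "$base/hdfs" || return 1
    for sub in tmp name data; do
        mkdir -p "$base/hdfs/$sub" || return 1
    done
}
```

For this article's layout the call would be `make_layout /home/gbase`.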

Add the environment variable ${HADOOP_HOME}:

$ echo "export HADOOP_HOME=/home/gbase/bin/hadoop-2.6.0" >> ~/.bashrc
$ . ~/.bashrc

Note: ${HADOOP_HOME} refers to /home/gbase/bin/hadoop-2.6.0 below.
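
Running the echo command above more than once appends the same export line to ~/.bashrc each time. One way to avoid that is sketched below. append_once is a hypothetical helper name, not a standard command.

```shell
# append_once LINE FILE
# Hypothetical helper: appends LINE to FILE only if FILE does not
# already contain exactly that line.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Usage would be `append_once 'export HADOOP_HOME=/home/gbase/bin/hadoop-2.6.0' ~/.bashrc`, followed by `. ~/.bashrc` as before.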

  4. Preparing Hadoop 2.6.0

Extract hadoop-2.6.0.tar.gz to /home/gbase/bin on each node (the target directory must already exist).

$ tar xfz hadoop-2.6.0.tar.gz -C /home/gbase/bin

  5. Configuring hadoop-env.sh

File path: ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh

$ cd ${HADOOP_HOME}
$ vi etc/hadoop/hadoop-env.sh

Configure both NameNode and DataNode as follows.

Change export JAVA_HOME=$JAVA_HOME to:

export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64

Change export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"} to:

export HADOOP_CONF_DIR=/home/gbase/bin/hadoop-2.6.0/etc/hadoop
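
When the same change has to be made on several nodes, editing with vi can be replaced by a scripted edit. The sketch below is one possible approach. set_export is a hypothetical helper name, and it assumes the value contains no `|` or `&` characters, since those are special in the sed expression used here.

```shell
# set_export VAR VALUE FILE
# Hypothetical helper: rewrites the "export VAR=..." line in FILE,
# or appends one if none exists. A backup is kept as FILE.bak.
set_export() {
    var=$1
    value=$2
    file=$3
    if grep -q "^export ${var}=" "$file"; then
        sed -i.bak "s|^export ${var}=.*|export ${var}=${value}|" "$file"
    else
        printf 'export %s=%s\n' "$var" "$value" >> "$file"
    fi
}
```

For the two settings above, the calls would be `set_export JAVA_HOME /usr/lib/jvm/jre-1.6.0-openjdk.x86_64 etc/hadoop/hadoop-env.sh` and `set_export HADOOP_CONF_DIR /home/gbase/bin/hadoop-2.6.0/etc/hadoop etc/hadoop/hadoop-env.sh`, run from ${HADOOP_HOME}.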

  6. Configuring core-site.xml

File path: ${HADOOP_HOME}/etc/hadoop/core-site.xml

$ cd ${HADOOP_HOME}
$ vi etc/hadoop/core-site.xml

Configure both NameNode and DataNode as follows:

<configuration>
   <property>
       <name>fs.default.name</name>
       <value>hdfs://ch-10-114:9000</value>
   </property>
   <property>
       <name>hadoop.tmp.dir</name>
       <value>file:/home/gbase/hdfs/tmp</value>
   </property>
</configuration>
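
After editing, it can be useful to confirm which value a node will actually read. The sketch below pulls a property out of a Hadoop XML file with awk. get_prop is a hypothetical helper name, and it is not a real XML parser: it assumes each <name> and <value> element sits on its own line, as in the listing above.

```shell
# get_prop FILE NAME
# Hypothetical helper: prints the <value> that follows <name>NAME</name>.
# Assumes one element per line, as in the configuration listings here.
get_prop() {
    awk -v want="$2" '
        /<name>/ { name = $0; gsub(/.*<name>|<\/name>.*/, "", name) }
        /<value>/ && name == want {
            v = $0
            gsub(/.*<value>|<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$1"
}
```

For example, `get_prop etc/hadoop/core-site.xml fs.default.name` should print the NameNode URI configured above. On a working installation, `bin/hdfs getconf -confKey fs.default.name` reports the value Hadoop itself resolves.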

  7. Configuring hdfs-site.xml

File path: ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml

$ cd ${HADOOP_HOME}
$ vi etc/hadoop/hdfs-site.xml

NameNode Configuration:

<configuration>
   <property>
       <name>dfs.replication</name>
       <value>2</value>
   </property>
   <property>
       <name>dfs.name.dir</name>
       <value>file:/home/gbase/hdfs/name</value>
       <description>name node dir </description>
   </property>
   <property>
       <name>dfs.permissions</name>
       <value>false</value>
   </property>
</configuration>

DataNode Configuration:

<configuration>
   <property>
       <name>dfs.data.dir</name>
       <value>file:/home/gbase/hdfs/data</value>
       <description>data node dir</description>
   </property>
</configuration>

  8. Configuring Masters and Slaves

File paths: ${HADOOP_HOME}/etc/hadoop/masters and ${HADOOP_HOME}/etc/hadoop/slaves

These two files only need to be configured on the NameNode.

$ cd ${HADOOP_HOME}
$ vi etc/hadoop/masters

Contents of ${HADOOP_HOME}/etc/hadoop/masters:

ch-10-114

$ cd ${HADOOP_HOME}
$ vi etc/hadoop/slaves

Contents of ${HADOOP_HOME}/etc/hadoop/slaves:

ch-10-114
ch-10-115
ch-10-116
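
The slaves file also determines how many DataNodes exist, which bounds the useful value of dfs.replication: HDFS cannot place more replicas of a block than there are DataNodes. The sketch below compares the two. check_replication is a hypothetical helper name, and it assumes one hostname per non-blank line in the slaves file.

```shell
# check_replication REPLICATION SLAVES_FILE
# Hypothetical helper: compares a replication factor with the number of
# hosts listed in a slaves file (one hostname per non-blank line).
check_replication() {
    replication=$1
    nodes=$(grep -c '[^[:space:]]' "$2" || true)
    if [ "$replication" -gt "$nodes" ]; then
        echo "replication $replication exceeds $nodes DataNode(s)"
        return 1
    fi
    echo "replication $replication fits $nodes DataNode(s)"
}
```

With the values used in this article, `check_replication 2 etc/hadoop/slaves` reports that a replication factor of 2 fits the three listed hosts.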

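  9. Formatting the NameNode

The NameNode must be formatted before HDFS is started for the first time. The first command below deletes everything under /home/gbase/hdfs on every node (cexec runs a command across the cluster, as its output in the next step shows), so only run it on a new installation or when the existing HDFS data can be discarded.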

$ cexec rm -fr /home/gbase/hdfs/*
$ cd ${HADOOP_HOME}
$ bin/hdfs namenode -format
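
Because formatting discards the existing namespace, a guard that detects an already formatted NameNode directory can prevent accidents. The sketch below relies on the current/VERSION file that a format writes into the NameNode storage directory. is_formatted is a hypothetical helper name, and /home/gbase/hdfs/name is the directory configured earlier in hdfs-site.xml.

```shell
# is_formatted NAME_DIR
# Hypothetical helper: succeeds if NAME_DIR already holds a formatted
# NameNode image, detected via its current/VERSION file.
is_formatted() {
    [ -f "$1/current/VERSION" ]
}
```

A cautious wrapper might then read `is_formatted /home/gbase/hdfs/name && echo "NameNode already formatted, skipping" || bin/hdfs namenode -format`.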

  10. Starting HDFS

$ cd ${HADOOP_HOME}
$ sbin/start-dfs.sh

After starting, use the jps command to check the processes on each node. The following output indicates successful startup:

$ cexec jps
************************* test *************************
--------- 192.168.10.114---------
31318 SecondaryNameNode
31133 NameNode
31554 Jps
--------- 192.168.10.115---------
10835 DataNode
11000 Jps
--------- 192.168.10.116---------
10145 DataNode
10317 Jps
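
Reading this output by eye gets tedious on larger clusters. The sketch below flattens it into one "node daemon" pair per line. daemons_by_node is a hypothetical helper name, and it assumes the cexec layout shown above: a dashed header line per node, followed by jps lines of the form "PID Name".

```shell
# daemons_by_node
# Hypothetical helper: reads `cexec jps` output on stdin and prints one
# "NODE DAEMON" line per Java process, skipping the Jps tool itself.
daemons_by_node() {
    awk '
        /^-+ / {
            node = $0
            sub(/^-+ */, "", node)   # strip leading dashes
            sub(/ *-+$/, "", node)   # strip trailing dashes
            next
        }
        node != "" && NF == 2 && $1 ~ /^[0-9]+$/ && $2 != "Jps" {
            print node, $2
        }
    '
}
```

For example, `cexec jps | daemons_by_node | grep -c DataNode` counts the running DataNodes. Applied to the sample output above, the result lists no DataNode for 192.168.10.114, although the role table and the slaves file both include that host, so a gap of this kind is worth checking on a real cluster.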

  11. Stopping HDFS

$ cd ${HADOOP_HOME}
$ sbin/stop-dfs.sh