Organizations that have embarked on a Big Data journey know how vital it is to keep their Hadoop clusters up and running. Keeping the clusters healthy is essential to meeting expected SLAs/OLAs and serving internal and external customers.
Hence, even a small, seemingly simple issue like clock offset must be caught early: its ripple effect can have dire consequences and disrupt all services.
Emergys provides Big Data managed services to multiple clients. A few months back, while working for one of our clients, we faced exactly such a clock offset issue on a large Cloudera cluster running on CentOS. It was impacting the health of the cluster and, consequently, several of the client's projects were also affected.
For those unfamiliar with the topic, let's first understand why we require time synchronization.
Why do we require time synchronization?
Hadoop uses a master-worker architecture in which each worker node sends regular heartbeat signals to the master (controller) node to report its health. For those heartbeats to be interpreted correctly, all the machines in the cluster must be time synchronized and refer to the same time.
Time synchronization is the bridge that carries these health updates from the workers to the master; if synchronization breaks, the master stops receiving reliable health information.
The most effective way to do this is to synchronize the clocks of all the cluster machines against a shared NTP server.
Before going further, let’s first understand what NTP is.
What is NTP?
Network Time Protocol (NTP) is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP synchronizes all participating computers to within a few milliseconds of Coordinated Universal Time (UTC). It can usually maintain time to within tens of milliseconds over the public internet, and can achieve better than one-millisecond accuracy on a LAN under ideal conditions. Many reference NTP servers are available on the internet, which you can use for synchronization.
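As a quick illustration, you can check how far a machine's clock drifts from a public reference server. The ntpdate -q command only queries and does not change the local clock; the pool host name below is just an example:
# Query a public pool server without adjusting the local clock
ntpdate -q 0.pool.ntp.org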
The most common way to resolve a clock offset issue is to sync with reference NTP servers on the internet. However, our client's security policies did not allow us to connect to an external server for time synchronization, so we came up with a workaround.
We decided to use one of our controller servers as a reference server for time synchronization for all the other machines in the cluster.
How did we resolve the clock offset issue?
Below, in brief, are the steps we followed:
1. Configure the NTP service on all machines using the yum command (a sample of the commands is shown below).
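For example, on CentOS 7 with systemd and sudo access, the installation boils down to something like this (an illustrative sketch, not the exact commands we ran):
# Install the NTP daemon and make sure it runs now and at every boot
sudo yum install -y ntp
sudo systemctl enable ntpd
sudo systemctl start ntpd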
2. Select the name node as the reference NTP server for the other machines.
3. Edit the /etc/ntp.conf file and comment out the lines below:
server 0.ubuntu.pool.ntp.org
server 1.ubuntu.pool.ntp.org
server 2.ubuntu.pool.ntp.org
server 3.ubuntu.pool.ntp.org
4. Repeat the same edit on all the machines in the cluster (a scripted way to do this is sketched below).
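If the cluster is large, a small loop can apply the same edit on every node. This is only a rough sketch, assuming passwordless SSH and sudo from the name node; the worker host names are hypothetical:
# Comment out the default pool server lines in /etc/ntp.conf on each worker
for host in worker01 worker02 worker03; do
  ssh "$host" "sudo sed -i 's/^\(server .*pool\.ntp\.org\)/# \1/' /etc/ntp.conf"
done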
5. Edit the /etc/ntp.conf file on the machine selected as the reference NTP server (the name node, in our case) and add the lines below:
# Use our own NTP server, which is our name node server.
server <name-node-hostname> iburst
# server 127.127.1.0 # local clock
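The list above does not spell it out, but after editing /etc/ntp.conf the ntpd service usually needs to be restarted for the new configuration to take effect (again assuming CentOS 7 with systemd):
# Restart the NTP daemon so it picks up the edited configuration
sudo systemctl restart ntpd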
Note: iburst is a configurable option. If the NTP server is unresponsive, iburst mode sends a burst of queries until the server responds and time synchronization starts.
6. Verify the status of NTP using the command below on the name node:
ntpq -p
The output should look similar to this:
# ntpq -p
     remote           refid      st t  when  poll reach   delay   offset  jitter
*elserver1       19.168.1.1       3 u   300  1024   377    1.225   -0.071   4.606
7. Repeat steps 5 and 6 on all the remaining machines and verify the output (a quick check across the whole cluster is sketched below).
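To confirm that every node is now tracking the name node, the same check can be run across all workers in one loop. Another rough sketch, assuming passwordless SSH and the same hypothetical host names as above:
# Show each worker's NTP peer status; the name node should appear as its peer
for host in worker01 worker02 worker03; do
  echo "==== $host ===="
  ssh "$host" ntpq -p
done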
Using this simple approach, we achieved several objectives in one go! We did not have to depend on an external NTP server, which can be risky since external servers are not under our control. Above all, this approach avoided any breach of our client's security policy.
There may be other ways to resolve a clock offset error, but with this approach we fixed the issue once and for all!