Installing a Storm cluster on CentOS hosts

Storm is a distributed, realtime computation system to reliably process unbounded streams of data. The following picture shows how data is processed in Storm:

storm-processing

This tutorial will show you how to install Storm on a cluster of CentOS hosts. A Storm cluster contains the following components:

storm-cluster

Nimbus is the name for the master node. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. The nodes that perform the work contain a supervisor and each supervisor is in control of one or more workers on that node. ZooKeeper is used for coordination between nimbus and the supervisors.

All nodes

We start with disabling SELinux and iptables on every host. This is a bad idea if you are running your cluster on publicly accessible machines, but makes it a lot easier to debug network problems. SELinux is enabled by default on CentOS. To disable it, we need to edit /etc/selinux/config:

SELINUX=disabled

We need to reboot the machine for this to take effect.

The firewall has some default rules we want to get rid of:

iptables --flush
iptables --table nat --flush
iptables --delete-chain
iptables --table nat --delete-chain
/etc/init.d/iptables save

Storm and ZooKeeper are both fail-fast systems, which means that a Storm or ZooKeeper process will kill itself as soon as an error is detected. It is therefore necessary to put the Storm and ZooKeeper processes under supervision. This will make sure that each process is restarted when needed. For supervision we will use supervisord. Installation is performed like this:

rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
yum install supervisor

ZooKeeper node

We will now create a single ZooKeeper node. Take a look at the ZooKeeper documentation to install a cluster.

yum -y install java-1.7.0-openjdk-devel wget
cd /opt
wget http://apache.xl-mirror.nl/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
mkdir /var/zookeeper
cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg

Now edit the zookeeper-3.4.5/conf/zoo.cfg file:

dataDir=/var/zookeeper

Edit the /etc/supervisord.conf file and add a section about ZooKeeper to it:

[program:zookeeper]
command=/opt/zookeeper-3.4.5/bin/zkServer.sh start-foreground
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/zookeeper-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/zookeeper-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true

Start the supervision and thereby the ZooKeeper service:

chkconfig supervisord on
service supervisord start

Running the supervisorctl command should result in something like this:

zookeeper      RUNNING    pid 1115, uptime 1 day, 0:07:33

Nimbus and Supervisor nodes

Every Storm node has a set of dependencies that need to be satisfied. We start with ZeroMQ and JZMQ:

yum -y install gcc gcc-c++ libuuid-devel make wget
cd /opt
wget http://download.zeromq.org/zeromq-2.2.0.tar.gz
tar zxvf zeromq-2.2.0.tar.gz
cd zeromq-2.2.0
./configure
make install
ldconfig

yum install java-1.7.0-openjdk-devel unzip libtool
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64
cd /opt
wget https://github.com/nathanmarz/jzmq/archive/master.zip
mv master master.zip
unzip master.zip
cd jzmq-master
./autogen.sh
./configure
make install

Then we move onto Storm itself:

cd /opt
wget https://github.com/downloads/nathanmarz/storm/storm-0.8.1.zip
unzip storm-0.8.1.zip
mkdir /var/storm

Now edit the storm-0.8.1/conf/storm.yaml file, replacing the IP addresses as needed:

storm.zookeeper.servers:
 - "10.20.30.40"
nimbus.host: "10.20.30.41"
storm.local.dir: "/var/storm"

Finally we edit the supervision configuration file /etc/supervisord.conf:

[program:storm_nimbus]
command=/opt/storm-0.8.1/bin/storm nimbus
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/storm-nimbus-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/storm-nimbus-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true

[program:storm_ui]
command=/opt/storm-0.8.1/bin/storm ui
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/storm-ui-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/storm-ui-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true

And start the supervision:

chkconfig supervisord on
service supervisord start

Running the supervisorctl command should result in something like this:

storm_nimbus   RUNNING    pid 1119, uptime 1 day, 0:20:14
storm_ui       RUNNING    pid 1121, uptime 1 day, 0:20:14

The Storm UI should now be accessible. Point a webbrowser at port 8080 on the Nimbus host, and you should get something like this:

storm-ui

Note that the screenshot also shows an active topology, which will not be available if you just followed the steps in this tutorial and haven’t deployed a topology to the cluster yet.

Advertisement
This entry was posted in Cloud computing, programming and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s