Storm is a distributed, real-time computation system for reliably processing unbounded streams of data. The following picture shows how data is processed in Storm:
This tutorial will show you how to install Storm on a cluster of CentOS hosts. A Storm cluster contains the following components:
Nimbus is the name of the master node. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. The nodes that perform the actual work each run a supervisor daemon, and each supervisor is in control of one or more workers on that node. ZooKeeper is used for coordination between Nimbus and the supervisors.
All nodes
We start by disabling SELinux and iptables on every host. This is a bad idea if your cluster runs on publicly accessible machines, but it makes network problems a lot easier to debug. SELinux is enabled by default on CentOS. To disable it, edit /etc/selinux/config:
SELINUX=disabled
We need to reboot the machine for this to take effect.
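If you prefer to script the change rather than open an editor, a sed one-liner does the same edit. The sketch below demonstrates it on a temporary copy of the file so it is safe to run anywhere; on a real host you would target /etc/selinux/config directly. (Running setenforce 0 additionally switches SELinux to permissive mode right away, though the config change still needs the reboot to fully disable it.)

```shell
# Sketch: flip SELINUX= to disabled with sed.
# Demonstrated on a temporary copy; on a real host, target /etc/selinux/config.
config=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$config"

# Replace whatever value SELINUX currently has with "disabled".
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$config"

grep '^SELINUX=' "$config"   # SELINUX=disabled
rm -f "$config"
```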
The firewall has some default rules we want to get rid of:
iptables --flush
iptables --table nat --flush
iptables --delete-chain
iptables --table nat --delete-chain
/etc/init.d/iptables save
Storm and ZooKeeper are both fail-fast systems, which means that a Storm or ZooKeeper process will kill itself as soon as an error is detected. It is therefore necessary to put the Storm and ZooKeeper processes under supervision, which makes sure that each process is restarted when needed. For supervision we will use supervisord, which is available from the EPEL repository. Installation is performed like this:
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
yum install supervisor
ZooKeeper node
We will now set up a single ZooKeeper node. Take a look at the ZooKeeper documentation if you want to install a full cluster instead.
yum -y install java-1.7.0-openjdk-devel wget
cd /opt
wget http://apache.xl-mirror.nl/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
tar zxvf zookeeper-3.4.5.tar.gz
mkdir /var/zookeeper
cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg
Now edit the zookeeper-3.4.5/conf/zoo.cfg file:
dataDir=/var/zookeeper
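For reference, the resulting zoo.cfg should then look roughly like the sketch below; every value other than dataDir matches the defaults shipped in zoo_sample.cfg. The commented server.N lines illustrate what a multi-node ensemble would add (the hostnames are placeholders); in that case each node also needs its own id written to /var/zookeeper/myid.

```properties
# Minimal standalone configuration; values other than dataDir
# are the defaults from zoo_sample.cfg.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181

# For a three-node ensemble you would add lines like these
# (placeholder hostnames) and write each node's id, e.g.
# "echo 1 > /var/zookeeper/myid" on server 1:
# server.1=zk1.example.com:2888:3888
# server.2=zk2.example.com:2888:3888
# server.3=zk3.example.com:2888:3888
```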
Edit the /etc/supervisord.conf file and add a section about ZooKeeper to it:
[program:zookeeper]
command=/opt/zookeeper-3.4.5/bin/zkServer.sh start-foreground
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/zookeeper-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/zookeeper-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true
Start the supervision and thereby the ZooKeeper service:
chkconfig supervisord on
service supervisord start
Running the supervisorctl command should result in something like this:
zookeeper RUNNING pid 1115, uptime 1 day, 0:07:33
Nimbus and Supervisor nodes
Every Storm node has a set of dependencies that need to be satisfied. We start with ZeroMQ and JZMQ:
yum -y install gcc gcc-c++ libuuid-devel make wget
cd /opt
wget http://download.zeromq.org/zeromq-2.2.0.tar.gz
tar zxvf zeromq-2.2.0.tar.gz
cd zeromq-2.2.0
./configure
make install
ldconfig
yum install java-1.7.0-openjdk-devel unzip libtool
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.9.x86_64
cd /opt
wget https://github.com/nathanmarz/jzmq/archive/master.zip
mv master master.zip
unzip master.zip
cd jzmq-master
./autogen.sh
./configure
make install
Then we move on to Storm itself:
cd /opt
wget https://github.com/downloads/nathanmarz/storm/storm-0.8.1.zip
unzip storm-0.8.1.zip
mkdir /var/storm
Now edit the storm-0.8.1/conf/storm.yaml file, replacing the IP addresses as needed:
storm.zookeeper.servers:
    - "10.20.30.40"
nimbus.host: "10.20.30.41"
storm.local.dir: "/var/storm"
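storm.yaml accepts more options than the ones above. One you may want to tune on the worker nodes is supervisor.slots.ports, which controls how many worker slots each supervisor offers and on which ports; the sketch below shows the values Storm ships as defaults, so omitting it gives you the same behavior.

```yaml
# Each listed port is one worker slot, i.e. one worker JVM
# the supervisor on this node may launch (these are the defaults).
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
```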
Finally we edit the supervision configuration file /etc/supervisord.conf:
[program:storm_nimbus]
command=/opt/storm-0.8.1/bin/storm nimbus
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/storm-nimbus-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/storm-nimbus-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true

[program:storm_ui]
command=/opt/storm-0.8.1/bin/storm ui
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/storm-ui-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/storm-ui-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true
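The sections above cover the Nimbus host. On the worker nodes you run the supervisor daemon instead (storm supervisor); mirroring the pattern of the blocks above, its supervisord section would look like this sketch:

```ini
[program:storm_supervisor]
command=/opt/storm-0.8.1/bin/storm supervisor
autostart=true
autorestart=true
startsecs=1
startretries=999
redirect_stderr=false
stdout_logfile=/var/log/storm-supervisor-out
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_events_enabled=true
stderr_logfile=/var/log/storm-supervisor-err
stderr_logfile_maxbytes=100MB
stderr_logfile_backups=10
stderr_events_enabled=true
```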
And start the supervision:
chkconfig supervisord on
service supervisord start
Running the supervisorctl command should result in something like this:
storm_nimbus RUNNING pid 1119, uptime 1 day, 0:20:14
storm_ui RUNNING pid 1121, uptime 1 day, 0:20:14
The Storm UI should now be accessible. Point a web browser at port 8080 on the Nimbus host, and you should get something like this:
Note that the screenshot also shows an active topology, which will not be available if you just followed the steps in this tutorial and haven’t deployed a topology to the cluster yet.