Apache Storm Tutorial - Architecture
In this tutorial we will be looking at the Apache Storm architecture, the internal workings of Apache Storm, and its components.
We will cover the following topics:
What is Apache Storm?
Apache Storm Use Cases
Apache Storm Features
Apache Storm Components
Apache Storm Architecture / Apache Storm Data Model
Apache Storm Topology
Apache Storm Operation modes
What is Apache Storm?
Apache Storm is a free and open source distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter.
Storm can be used for the following use cases:
- Stream processing
Apache Storm is used to process a stream of data in real time and update several databases. The processing takes place in real time, and the processing speed must match the speed of the incoming data.
- Continuous computation
Apache Storm can process data streams continuously and deliver the results to clients in real time. This could require processing each message as it arrives or processing messages in small batches over a short period of time. Streaming trending topics from Twitter into browsers is an example of continuous computation.
- Distributed RPC
Apache Storm can parallelize a complex query, allowing it to be computed in real time.
- Real-time analytics
Apache Storm analyses and reacts to data as it comes in from various data sources, in real time.
Features of Storm
The following are some of Storm's features that make it an ideal solution for processing real-time data streams:
- Fast: Storm can process up to 1 million tuples/records per second per node, according to reports.
- Horizontally scalable: Although speed is an important feature for building a high-volume/high-speed data processing platform, each node has a limit on the number of events it can process per second. A node in your setup is a single machine that runs Storm applications. Since Storm is a distributed platform, you can expand your Storm cluster and increase your application's processing capacity. It is also linearly scalable, which means that by doubling the nodes, you can double the processing capacity.
- Fault tolerant: Worker processes in a Storm cluster perform units of work. Storm will restart a worker when it dies, and if the node on which it is running dies, Storm will restart the worker on another node in the cluster.
- Guaranteed data processing: Storm gives you the assurance that any message you send to Storm will be processed at least once. Storm can replay the missing tuples/records if there are any failures. It can also be configured such that each message is only processed once.
- Easy to operate: Storm is easy to set up and operate. Once the cluster is deployed, it requires very little maintenance.
- Programming language agnostic: Although the Storm platform runs on the Java virtual machine (JVM), the applications that run on top of it can be written in any programming language that can read from and write to standard input and output streams.
Apache Storm Architecture
A Storm cluster uses a master-slave model, with ZooKeeper coordinating the master and slave processes. A Storm cluster is made up of the following components:
- Nimbus
In a Storm cluster, the Nimbus node is the master. It is responsible for distributing the application code across the worker nodes, assigning tasks to the various machines, monitoring tasks for failures, and restarting them as required.
Nimbus is stateless and stores all of its state in ZooKeeper. There is a single active Nimbus node in a Storm cluster; if the active node fails, a passive (standby) node takes over as the active node. Nimbus is designed to be fail-fast: when the active Nimbus dies, the passive node becomes active, or the failed node can simply be restarted, without affecting the tasks already running on the worker nodes. This is in contrast to Hadoop, where if the JobTracker fails, all running jobs are left in an inconsistent state and must be re-run.
- Supervisor nodes
In a Storm cluster, the supervisor nodes are the worker nodes. Each supervisor node runs a supervisor daemon, which is responsible for creating, starting, and stopping worker processes to execute the tasks assigned to that node. Like Nimbus, a supervisor daemon is fail-fast and stores all of its state in ZooKeeper, so it can be restarted without losing any data. A single supervisor daemon normally manages multiple worker processes running on one machine.
- The ZooKeeper cluster
The various processes in a distributed application need to coordinate with one another and share some configuration information. ZooKeeper is an application that provides all of these services reliably. As a distributed application, Storm uses a ZooKeeper cluster to coordinate its various processes. All of the cluster's state, as well as the various tasks submitted to Storm, is stored in ZooKeeper. Nimbus and the supervisor nodes do not communicate with each other directly; they communicate through ZooKeeper. Since all data is stored in ZooKeeper, both Nimbus and the supervisor daemons can be killed without causing the cluster to fail.
Apache Storm Components
The following are the components of a Storm topology:
In Storm terminology, a topology is an abstraction that defines the graph of the computation. You create a Storm topology and deploy it on a Storm cluster to process data. A topology can be represented as a directed acyclic graph, where each node does some kind of processing and forwards the data to the next node(s) in the flow. The following diagram shows a sample Storm topology:
[Diagram: A Storm topology]
- Spout
A spout receives data from a source and emits it into the rest of the topology.
A spout is the source of tuples in a Storm topology. It is responsible for reading or listening to data from an external source, for example, by reading from a log file or listening for new messages in a queue, and publishing them (emitting, in Storm terminology) into streams. A spout can emit multiple streams, each with a different schema. For example, it can read ten-field records from a log file and emit them as a stream of seven-field tuples and a stream of four-field tuples.
The org.apache.storm.spout.ISpout interface is the interface used to define spouts. If you are writing your topology in Java, then you should use org.apache.storm.topology.IRichSpout as it declares methods to use with the TopologyBuilder API. Whenever a spout emits a tuple, Storm tracks all the tuples generated while processing this tuple, and when the execution of all the tuples in the graph of this source tuple is complete, it will send an acknowledgement back to the spout. This tracking happens only if a message ID was provided when emitting the tuple. If null was used as the message ID, this tracking will not happen.
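As an illustration, the following is a minimal spout sketch (assuming Storm 2.x; the class name, field name, and data are made up for this example). It extends org.apache.storm.topology.base.BaseRichSpout, declares a one-field schema, and emits each tuple with a message ID so that Storm can track the tuple tree and replay the tuple if processing fails:
import java.util.Map;
import java.util.UUID;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout that emits a fixed set of sentences; a real spout would
// read from a log file, a message queue, or another external source.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the cow jumped over the moon", "an apple a day"};
    private int index = 0;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once when the spout instance is initialized on a worker.
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call.
        String sentence = sentences[index];
        index = (index + 1) % sentences.length;
        // Passing a message ID as the second argument enables tuple tracking and replay.
        collector.emit(new Values(sentence), UUID.randomUUID().toString());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The schema of the emitted stream: a single field named "sentence".
        declarer.declare(new Fields("sentence"));
    }
}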
- Bolt
A bolt represents the processing logic unit in Storm; it processes the data. A bolt is the processing powerhouse of a Storm topology and is responsible for transforming a stream. Ideally, each bolt in the topology should perform a simple transformation of the tuples, and many such bolts can be combined to perform a complex transformation.
The org.apache.storm.task.IBolt interface is preferably used to define bolts, and if a topology is written in Java, you should use the org.apache.storm.topology.IRichBolt interface. A bolt can subscribe to multiple streams of other components--either spouts or other bolts--in the topology and similarly can emit output to multiple streams. Output streams can be declared using the declareStream method of org.apache.storm.topology.OutputFieldsDeclarer.
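Here is a matching bolt sketch (again assuming Storm 2.x; the class name and stream contents are hypothetical). It reads the "sentence" field emitted by the spout sketch above, splits it into words, emits each word anchored to the input tuple, and then acknowledges the input:
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt that splits incoming sentences into words.
public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            // Anchoring the new tuple to the input tuple keeps the tuple tree intact,
            // so Storm can replay the source tuple if any downstream processing fails.
            collector.emit(input, new Values(word));
        }
        // Acknowledge the input tuple once this bolt has fully processed it.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}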
- Tuple
The basic unit of data that can be processed by a Storm application is called a tuple. Each tuple consists of a predefined list of fields. The value of each field can be a byte, char, integer, long, float, double, Boolean, or byte array. Storm also provides an API to define your own datatypes, which can be serialized as fields in a tuple.
A tuple is dynamically typed, that is, you just need to define the names of the fields in a tuple and not their datatype. The choice of dynamic typing helps to simplify the API and makes it easy to use. Also, since a processing unit in Storm can process multiple types of tuples, it's not practical to declare field types.
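As a small, self-contained sketch of this split between names and values (the field names and data here are only illustrative), the Fields object carries just the field names, while the Values object carries the dynamically typed data matched to those names by position:
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSchemaExample {
    public static void main(String[] args) {
        // The schema lists field names only; no datatypes are declared.
        Fields schema = new Fields("Symbol", "Date", "Count");
        // The values can be of any (serializable) type, matched to the fields by position.
        Values values = new Values("AAA", "29-03-21", 80);
        for (int i = 0; i < schema.size(); i++) {
            System.out.println(schema.get(i) + " = " + values.get(i));
        }
    }
}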
Representing Data in Apache Storm Topologies
An Apache Storm topology is a chain of stream transformations, with each node representing a spout or a bolt. In a Storm topology, each node runs in parallel. You can decide how much parallelism you want for each node in your topology, and Storm will spawn that many threads across the cluster to complete the task.
- Tuple
A single message/record that flows between the different instances of a topology is called a tuple.
- Stream
The key abstraction in Storm is that of a stream. A stream is an unbounded sequence of tuples that can be processed in parallel by Storm. Each stream can be processed by a single or multiple types of bolts. Storm can also be viewed as a platform to transform streams. In the preceding diagram, streams are represented by arrows. Each stream in a Storm application is given an ID and the bolts can produce and consume tuples from these streams on the basis of their ID. Each stream also has an associated schema for the tuples that will flow through it.
When data passes through a Storm topology, it is represented in a specific way. Let's say you have a Storm topology with a spout and two bolts. When data arrives at the Storm topology, it is first received by the spout, which then emits it to the first bolt. Bolt one then emits it to bolt two, and bolt two produces the final output. Here, each component inside the Storm topology emits something called a tuple.
A tuple represents a single element or a data point that has been processed and emitted by a Storm component. A Storm tuple is a named tuple where each element in the tuple is given a specific name. When data first comes to the Storm topology, the spout takes that data and emits it in the form of a tuple. A bolt will take that tuple, process it, and emit a new tuple, and so on.
Every Storm tuple consists of two parts. The names of the elements in the tuple are called fields; the fields represent the schema of the Storm tuple. The actual elements of the tuple are the values; they represent the data contained inside the tuple. The values in a tuple can contain any type of data. A given Storm component always emits tuples with a fixed schema.
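To make this flow concrete, here is a minimal sketch of wiring such a topology together, reusing the hypothetical spout and bolt sketches from earlier (the component IDs and the WordCountBolt mentioned in the comment are made up). The parallelism hint passed to setSpout/setBolt controls how many executors Storm spawns for that node across the cluster:
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;

public class WordTopology {
    // Builds the topology graph: spout -> bolt one -> (optionally) bolt two.
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout is the source of tuples; the last argument is the parallelism hint.
        builder.setSpout("sentence-spout", new SentenceSpout(), 1);
        // Bolt one subscribes to the spout's stream; shuffleGrouping distributes
        // tuples randomly across the bolt's executors.
        builder.setBolt("split-bolt", new SplitSentenceBolt(), 2)
               .shuffleGrouping("sentence-spout");
        // Bolt two would subscribe to bolt one in the same way, for example:
        // builder.setBolt("count-bolt", new WordCountBolt(), 2)
        //        .fieldsGrouping("split-bolt", new Fields("word"));
        return builder.createTopology();
    }
}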
Setting up a Storm Component
- Declare Fields
Every Storm component implements this method to declare the schema of the tuples it emits:
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("Symbol", "Date", "Count"));
}
- Emit Values
All components use a collector object to emit values:
collector.emit(new Values("AAA", "29-03-21", 80));
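Putting the two snippets together, a single hypothetical bolt that declares the Symbol/Date/Count schema and emits matching values could look like the sketch below; the important point is that the Values emitted in execute must line up, by position, with the Fields declared in declareOutputFields:
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical component combining the two snippets above; a real bolt would
// derive the emitted values from the input tuple rather than using constants.
public class StockCountBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // The values are matched by position to the fields declared below.
        collector.emit(input, new Values("AAA", "29-03-21", 80));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("Symbol", "Date", "Count"));
    }
}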
Operation modes in Storm
Operation modes indicate how the topology is deployed in Storm. Storm supports two operation modes for executing a topology:
Apache Storm Local Mode
In local mode, Storm topologies run on the local machine in a single JVM. This mode simulates a Storm cluster in a single JVM and is used for the testing and debugging of a topology.
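A sketch of running a topology in local mode might look like this (assuming Storm 2.x and the hypothetical WordTopology builder from earlier); LocalCluster simulates Nimbus, the supervisors, and ZooKeeper inside the current JVM:
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;

public class LocalRunner {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setDebug(true); // log every emitted tuple, which is useful while testing

        // LocalCluster simulates a full Storm cluster in a single JVM.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", conf, WordTopology.build());

        // Let the topology run for a while, then shut the simulated cluster down.
        Thread.sleep(10000);
        cluster.shutdown();
    }
}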
Apache Storm Remote Mode
In remote mode, we will use the Storm client to submit the topology to the master along with all the necessary code required to execute the topology. Nimbus will then take care of distributing your code.
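A sketch of submitting the same topology in remote mode might look like this (again assuming Storm 2.x and the hypothetical WordTopology builder from earlier):
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;

public class RemoteRunner {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        // Request two worker processes (JVMs) on the supervisor nodes for this topology.
        conf.setNumWorkers(2);

        // StormSubmitter uploads the topology code to Nimbus, which distributes it
        // to the supervisor nodes and schedules the tasks.
        StormSubmitter.submitTopology("word-topology", conf, WordTopology.build());
    }
}
This class would typically be packaged into a jar and submitted through the Storm client, for example with storm jar word-topology.jar RemoteRunner (the jar name here is illustrative).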