Flume – Data Collection Framework

Introduction

This is the first part of a multi-part series on Apache Flume, a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Flume versions:

  • Flume OG (Old Generation or 0.X)
  • Flume NG (New Generation or 1.X)

Flume OG was the first available Flume distribution; it was later replaced by a complete refactoring called Flume NG.

Flume runs as an agent. An agent deals with the following components:

  • Event: the payload of data (or part of the data) transported by Flume
  • Source: writes events to one or more channels
  • Channel: the holding area where events wait as they are passed from a source to a sink
  • Sink: receives events from a single channel


Flume channels data between sources and sinks, which can be either predefined or custom.

Using Flume, we can aggregate multiple log files across the network and store them centrally. Aggregation and storage can be achieved by chaining agents together: the sink of one agent sends data to the source of another. The standard way of sending data across the network with Flume is Avro.

The flow of data across an agent is as follows:

[Figure: data flows through a Flume agent from source to channel to sink]

The focus of this series is to understand how Flume can be used for log data aggregation and for transporting massive quantities of social media-generated data, or any similar data source. This series is broken down into the following multi-part blogs:

  • Flume – Data Collection Framework (this blog)
  • Flume with Cassandra
  • Flume with Kafka

Use Cases

We have two use cases to help us understand the process involved in Flume. Let’s start with the basics and build on them in the second use case.

First use case

Let’s take the first use case as a “Hello World” example to describe a single-node Flume deployment. In this use case, Flume receives events and logs them to the console.

What we need to do:

  • Configure the Flume agent
  • Start an agent
  • Send events into Flume agent’s source

Second use case

In the second use case, we will see how to collect three log files and store them in HDFS using Flume. The same procedure can be followed to collect any number of log files.

What we need to do:

  • Ensure Hadoop is set up
  • Configure the Flume agent
  • Start an agent

Solutions

Solution for the first use case: “Hello World”

Before solving our use case, let’s get some pre-requisites satisfied.

Pre-requisites:

  • JDK 1.6 +
  • This blog series uses Flume 1.4.0

Flume Downloads Page: https://flume.apache.org/download.html

  • Verify the installation:
    • Download apache-flume-1.4.0-bin.tar.gz from the link above and extract it into a directory.
    • Run the help command from the bin folder of the extracted Flume directory, as shown below.
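A representative check, assuming the archive was extracted to a directory named apache-flume-1.4.0-bin (the exact path will vary):

    cd apache-flume-1.4.0-bin
    bin/flume-ng help

If the script prints its usage information, the distribution is extracted correctly.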

Configure the Flume Agent

  • The configuration file defines the Flume agent properties: its source, sink, and channel. This agent hosts the flow of data.
  • This configuration defines a single agent named hello_agent.
  • hello_agent consists of a source that listens for data on port 12345, a channel that buffers event data in memory, and a sink that logs event data to the console (a sketch of such a configuration follows this list).
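A minimal sketch of such a configuration, assuming a netcat source (consistent with the agent log output discussed later) and illustrative component names s1, c1, and k1, saved as conf/helloworld.conf:

    # helloworld.conf - single agent named hello_agent
    hello_agent.sources  = s1
    hello_agent.channels = c1
    hello_agent.sinks    = k1

    # netcat source listening for lines of text on port 12345
    hello_agent.sources.s1.type = netcat
    hello_agent.sources.s1.bind = localhost
    hello_agent.sources.s1.port = 12345
    hello_agent.sources.s1.channels = c1

    # memory channel buffering events in memory
    hello_agent.channels.c1.type = memory

    # logger sink writing events to the Flume log (the console when the root logger is overridden)
    hello_agent.sinks.k1.type = logger
    hello_agent.sinks.k1.channel = c1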

Start an agent

  • Start the agent using the flume-ng shell script located in the bin directory of the Flume distribution. Specify the agent name, the config directory, and the config file on the command line.

To start the agent, run the following command:
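A representative invocation, assuming the configuration above is saved as conf/helloworld.conf:

    bin/flume-ng agent -n hello_agent -c conf -f conf/helloworld.conf -Dflume.root.logger=INFO,console

Here -n names the agent, -c points to the config directory, and -f points to the configuration file.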

The -Dflume.root.logger property overrides the root logger in the conf/log4j.properties file to use the console appender. If we don’t override the root logger, everything still works, but the output goes to the file log/flume.log instead. Of course, we can also edit the conf/log4j.properties file and change the flume.root.logger property.

  • Once the entire configuration is parsed, we can see a log message listing the configured components.
    Now, we see that our source is listening for input on port 12345. Let’s send some data.

Send events into Flume agent’s source

  • Open a second terminal and send the string Hello World as an event (see the sketch after this list).
    The output indicates that the agent has accepted our input as a single Flume event. The agent log shows that the Flume event contains no headers (the netcat source doesn’t add any itself); the body is shown in hexadecimal along with a string representation.
  • If we send another, longer line, the event in the agent’s log appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to avoid cluttering the screen. If we want to see the full content, we should use a different sink, perhaps the file_roll sink, which writes to the local file system.
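With a netcat source, nc (or telnet) can be used to send the data; the text of the second, longer line below is illustrative:

    # send the first event to the source listening on port 12345
    echo "Hello World" | nc localhost 12345
    # send a longer line; the logger sink truncates the printed body to 16 bytes
    echo "This line is longer than sixteen bytes" | nc localhost 12345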

Solution for the second use case: log data aggregation

Using the HDFS sink requires a Hadoop installation, since Flume uses the Hadoop JARs to communicate with the HDFS cluster.

Ensure Hadoop is set up

There are many blogs and articles explaining how to install Hadoop; two are linked below. For this use case, we use Ubuntu and Hadoop 2.2.0.

http://tecadmin.net/steps-to-install-hadoop-on-centosrhel-6/#

http://codesfusion.blogspot.in/2013/10/setup-hadoop-2x-220-on-ubuntu.html
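Before wiring up Flume, it is worth confirming that the Hadoop client is on the PATH and that HDFS is reachable; a quick check (assuming a running single-node cluster) is:

    hadoop version
    hdfs dfs -ls /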

Configure the Flume agent

  • This configuration aggregates 3 log files and stores them in an HDFS sink.
  • In this configuration, we use an Avro source and an Avro sink to send the data across the network.
  • To use the Avro source, we specify the type property with the value avro and provide a bind address and a port number to listen on. The same is done for the Avro sink.
  • The HDFS sink writes events into the Hadoop Distributed File System (HDFS). A sketch of such a configuration follows this list.
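One possible layout consistent with the description is a first-tier agent that reads the three log files (shown here with exec sources, one way to read them) and forwards them over Avro, plus a collector agent with an Avro source and an HDFS sink. All agent and component names, hosts, ports, and the HDFS path below are illustrative; in a real deployment the two agents usually run on different machines.

    # --- first-tier agent: reads the three log files and forwards them over Avro ---
    log_agent.sources  = r1 r2 r3
    log_agent.channels = c1
    log_agent.sinks    = avro_sink

    # exec sources tailing each log file
    log_agent.sources.r1.type = exec
    log_agent.sources.r1.command = tail -F /var/log/input1.log
    log_agent.sources.r1.channels = c1
    log_agent.sources.r2.type = exec
    log_agent.sources.r2.command = tail -F /var/log/input2.log
    log_agent.sources.r2.channels = c1
    log_agent.sources.r3.type = exec
    log_agent.sources.r3.command = tail -F /var/log/input3.log
    log_agent.sources.r3.channels = c1

    log_agent.channels.c1.type = memory

    # Avro sink pointing at the collector's Avro source
    log_agent.sinks.avro_sink.type = avro
    log_agent.sinks.avro_sink.hostname = localhost
    log_agent.sinks.avro_sink.port = 41414
    log_agent.sinks.avro_sink.channel = c1

    # --- collector agent: Avro source in, HDFS sink out ---
    collector.sources  = avro_src
    collector.channels = c1
    collector.sinks    = hdfs_sink

    # Avro source bound to the same host and port as the Avro sink above
    collector.sources.avro_src.type = avro
    collector.sources.avro_src.bind = localhost
    collector.sources.avro_src.port = 41414
    collector.sources.avro_src.channels = c1

    collector.channels.c1.type = memory

    # HDFS sink writing aggregated events into HDFS
    collector.sinks.hdfs_sink.type = hdfs
    collector.sinks.hdfs_sink.hdfs.path = hdfs://localhost:9000/flume/logs
    collector.sinks.hdfs_sink.hdfs.fileType = DataStream
    collector.sinks.hdfs_sink.channel = c1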

Start an agent

Before starting the agents, make sure that the log files are present in the specified location.
In this use case, we aggregate the input1.log, input2.log, and input3.log files from the var/log directory.

To start an agent, run the following command:
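With the illustrative two-agent configuration above saved as conf/logagg.conf, the command is run once per agent, starting the collector first so that its Avro source is listening:

    bin/flume-ng agent -n collector -c conf -f conf/logagg.conf -Dflume.root.logger=INFO,console
    bin/flume-ng agent -n log_agent -c conf -f conf/logagg.conf -Dflume.root.logger=INFO,console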

Once the entire configuration is parsed, we can see a log message confirming the configured components.


Log aggregation with Flume stores many files in HDFS, and how these files are split (rolled) can be controlled through the HDFS sink properties.
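For example, the following roll settings (illustrative values, added to the collector configuration sketched earlier) start a new file every 30 seconds, when a file reaches roughly 1 MB, or after 1000 events, whichever comes first:

    collector.sinks.hdfs_sink.hdfs.rollInterval = 30
    collector.sinks.hdfs_sink.hdfs.rollSize = 1048576
    collector.sinks.hdfs_sink.hdfs.rollCount = 1000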

Use the following command to examine the created files in HDFS:
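Assuming the illustrative HDFS path used above:

    hadoop fs -ls /flume/logs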


Challenges

  • Flume expects the correct component names and configuration file name when running an agent; otherwise we will get an org.apache.commons.cli.ParseException. For example, if we create a helloworld.conf file in the conf folder but try to run it as helloword.conf, this exception is thrown.
  • If we miss the -c parameter of the command used to start an agent, we will also get an error.
  • To avoid connection exceptions, use the same host name and port for both the Avro source and the Avro sink in the configuration.

Conclusion

  • We have successfully configured and deployed a Flume agent.
  • Using Flume, we can aggregate log data from many sources (hundreds to thousands) at high velocity and volume.
  • We can aggregate hundreds of logs by configuring a number of first-tier agents with an Avro sink, each pointing to the Avro source of a single agent.
  • The next blog will show how to create a custom source and a custom sink in Flume.

References

Book:

  • Apache Flume: Distributed Log Collection for Hadoop by Steve Hoffman
