Weekend with Flume - Part 1
I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I'll share some observations and tricks from our group's work in configuring Flume.
[Image: Apache Flume]
Flume Primer
Quoting the definition from Apache's Flume wiki:
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
When you're looking for resources on Flume, most of the time you will find two types of resources:
- Flume 0.9.x. This version of Flume is sometimes referred to as Flume OG (old generation, maybe :p). I have some introductory slides on this version of Flume in this post.
- Flume 1.x. This one is referred to as Flume NG (new generation). We are using this version in our project, so the rest of this post refers to Flume NG.
We found some comprehensive references for configuring Flume NG:
- Flume NG - Getting Started.
- User Guide, which can be downloaded from here.
Flume Installation
In our project, we use the Flume 1.x shipped with Cloudera's Distribution including Apache Hadoop (CDH) version 3. Installation instructions can be found here. They are pretty comprehensive and we followed all the steps there.
The other alternative is to download the source from here and build it using Maven.
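For reference, on our Ubuntu nodes the two routes boil down to something like the commands below. Treat the package names as an assumption taken from the CDH docs; check the installation page for your exact distribution.

```
# Route 1: install the CDH packages (assumed names: flume-ng, flume-ng-agent)
sudo apt-get install flume-ng flume-ng-agent

# Route 2: build from source with Maven
mvn install -DskipTests
```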
We are using Cloudera Manager (the Free Edition, of course :p) to set up our Flume and Hadoop cluster on Amazon Web Services. But in the middle of the project we found that *maybe* we can simplify cluster setup by using Amazon Virtual Private Cloud (VPC) and Elastic MapReduce.
Important files and folders after installation using CDH 3:
/etc/flume-ng/conf/flume.conf
-> contains the configuration of our Flume agent. Here we configure our source, sink, and channel. This is the file that we will change all the time! A default configuration file can be copied from /etc/flume-ng/conf/flume-conf.properties.template.

/var/log/flume-ng/
-> contains the Flume log files. It is very useful to clear the old log files from this folder before you run your Flume agent, so you can easily follow your Flume agent's execution log using the tail command.

/etc/init.d/flume-ng-agent
-> shell script to start the Flume agent easily. But be careful! The name of your Flume agent should be agent in order to use this default script. This name is set in the flume.conf file. If you change the name in flume.conf, you need to tweak this script.
Executing Flume for the First Time
Using the default configuration file (without modifying it), start the Flume agent with the flume-ng-agent script. The default configuration sets up your source as a Sequence Generator source and the sink as a Logger. You can observe the output of the Sequence Generator in flume.log. Check that there is no exception in your log files.
Refer to the snapshot below for the Flume default configuration file.
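It wires a Sequence Generator source to a Logger sink through an in-memory channel. Here is a sketch of the shipped template; check your own flume-conf.properties.template for the exact contents:

```
# Agent named "agent": sequence-generator source -> memory channel -> logger sink
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

# Sequence Generator source: emits an incrementing counter as events
agent.sources.seqGenSrc.type = seq
agent.sources.seqGenSrc.channels = memoryChannel

# Logger sink: writes the events to the Flume log (flume.log)
agent.sinks.loggerSink.type = logger
agent.sinks.loggerSink.channel = memoryChannel

# In-memory channel between source and sink
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
```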
Start running Flume using this command.
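With the packaged init script that is just the start action; you can also run the agent in the foreground with the flume-ng launcher (the agent name and paths below match the CDH defaults described above):

```
# via the init script installed by the CDH package
sudo /etc/init.d/flume-ng-agent start

# or in the foreground, pointing at the default config
flume-ng agent -n agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/flume.conf
```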
You should see the output in flume.log as shown in this link.
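Since the logs land under /var/log/flume-ng/, these two commands are handy for following the agent and double-checking for exceptions (flume.log is the file name we saw on our CDH setup):

```
# follow the agent log as events arrive
tail -f /var/log/flume-ng/flume.log

# make sure nothing blew up
grep -i exception /var/log/flume-ng/flume.log
```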
Configuring an Executable as Flume Agent Source
After successfully running Flume with the default configuration, our next step was to set our own executable as the Flume agent's source. We created a simple C program that prints line #runningNumber every interval, shown below. Note that \r is used so the output does not pile up and exhaust memory on stdout, and the executable can keep running for a very long time.
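A minimal sketch of that generator (the two-second interval and the exact output format are assumptions):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    unsigned long running_number = 0;
    while (1) {
        /* \r rewrites the same line instead of appending a new one,
           so the output does not pile up while the program runs */
        printf("line #%lu\r", running_number++);
        fflush(stdout);  /* flush so Flume's exec source sees each line promptly */
        sleep(2);        /* assumed interval */
    }
    return 0;
}
```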
We modified the Flume configuration file (flume.conf) as shown below. Note that the name of our agent is exec-agent, which means you need to modify the flume-ng-agent script to use exec-agent as the Flume agent name.
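Here is a sketch of that configuration; the path to the executable is only an example, so point command at wherever you put your binary:

```
# Agent named exec-agent: exec source -> memory channel -> logger sink
exec-agent.sources = execSrc
exec-agent.channels = memoryChannel
exec-agent.sinks = loggerSink

# Exec source: run our C program and turn its stdout into events
# (use an absolute path to the executable -- see the lesson below)
exec-agent.sources.execSrc.type = exec
exec-agent.sources.execSrc.command = /home/ubuntu/print-running-number
exec-agent.sources.execSrc.channels = memoryChannel

# Logger sink: write the events to flume.log
exec-agent.sinks.loggerSink.type = logger
exec-agent.sinks.loggerSink.channel = memoryChannel

# In-memory channel between source and sink
exec-agent.channels.memoryChannel.type = memory
exec-agent.channels.memoryChannel.capacity = 100
```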
A lesson learned in configuring the exec source: provide the absolute path to the executable, so that you avoid PATH settings and issues.
The resulting output in flume.log can be found in the following link. Note that the logger sink displays line #runningNumber as printed by the C code.
I think that is enough for today.. Hehe.. In the next post I plan to cover how to configure a Flume agent's source and sink using the Avro plugin.