Sending Logs Down the Flume
I've been looking at Flume in depth and it is a very powerful and useful platform that solves several operational problems at once. And the best part is that it is surprisingly simple to use and understand.
Application logs are a source of stress and contention in companies. The site operations team usually views them as a painful resource to manage. They consume a lot of space, they are rarely accessible to the people who can actually use them, the developers, and they are of nominal use to operations. Most companies wind up building tools and processes to gather them off application servers, push them to some repository, and try to keep the lifecycle under control so they don't consume limitless disk space.
Developers are equally frustrated: the logs are often stored where they cannot view them, they have limited ability to manage the log lifecycle intelligently, and they have few tools for processing their logs (if you want something more than grep and Perl).
Flume addresses the frustration of both groups by providing a simple yet powerful framework for pushing logs from application servers to repositories through a highly configurable agent. The flexibility of Flume allows it to scale from environments with as few as 5 machines to environments with thousands of machines.
Before I go into how we will be using Flume at Rearden, I want to cover another useful aspect of Flume. Flume defines logs as a series of events. A Flume event will look familiar to Java developers: it consists of a priority and a string. The Flume event priority maps very easily onto the Java log level, so events are an obvious candidate for collecting Java logs. Flume events, though, also include the concept of a host, which Java logs don't directly represent, although it could be added through the use of a custom formatter. Flume events can also contain fields, which are arbitrary key/value pairs carrying structured context associated with the event. This is where Flume events provide capabilities that go beyond the standard Java logging framework (although something similar is part of slf4j's extensions).
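To make the shape concrete, here is a minimal Java sketch of such an event. This is a hypothetical model for illustration only, not the actual Flume event class or API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the event model described above -- not the real
// Flume API. A Flume-style event pairs a priority and a message body with
// host metadata and arbitrary key/value fields for structured context.
class LogEvent {
    enum Priority { TRACE, DEBUG, INFO, WARN, ERROR, FATAL }

    private final Priority priority;   // maps naturally onto the Java log level
    private final String body;         // the log message itself
    private final String host;         // originating machine, absent from plain Java logs
    private final Map<String, String> fields = new HashMap<>(); // structured context

    LogEvent(Priority priority, String body, String host) {
        this.priority = priority;
        this.body = body;
        this.host = host;
    }

    // Attach a structured key/value pair rather than formatting it into the body.
    LogEvent withField(String key, String value) {
        fields.put(key, value);
        return this;
    }

    Priority getPriority() { return priority; }
    String getBody() { return body; }
    String getHost() { return host; }
    Map<String, String> getFields() { return fields; }
}
```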
One frustration I've always had with Java logging (and most language logging) is that I often find myself formatting metadata into a string for logging and then later using regular expressions to get it back out of the string into metadata. Beyond being an inefficient use of resources (including my limited skills with regex), it's error-prone and frankly feels pointless. Wouldn't it be so much easier if I could send the metadata as structured content and have it saved in a system that made efficient processing of such data simple? Something like BigTable, for example. This is precisely what Flume offers.
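The round trip I'm complaining about looks something like this in Java. The method and field names are purely illustrative, not part of any real logging API:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The usual round trip: metadata is formatted into the message string at log
// time, then parsed back out with a regex at analysis time. Names illustrative.
class StructuredVsFlat {
    // Log time: flatten metadata into the message string.
    static String flatten(String user, long orderId) {
        return String.format("checkout failed user=%s order=%d", user, orderId);
    }

    // Analysis time: recover the metadata with a regex (brittle, wasteful).
    static long extractOrderId(String line) {
        Matcher m = Pattern.compile("order=(\\d+)").matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // The structured alternative: keep the metadata as key/value pairs end to
    // end, the way Flume event fields allow. No regex ever needed.
    static Map<String, String> structured(String user, long orderId) {
        return Map.of("user", user, "order", Long.toString(orderId));
    }
}
```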
How does Flume accomplish this? Flume has two basic concepts. The Flume master acts as a reliable configuration service that nodes use to retrieve their configuration. The master will dynamically update a node if the configuration for the node is changed on the master. A Flume node is simply an event pipe: it reads from a source and writes to a sink. The behavior of the source and sink determines the role and characteristics of the node. Flume is delivered with many source and sink options, but if none serve your needs, you can write your own. A Flume node may also be configured with a sink decorator that can annotate and transform events as they pass through. With these basic primitives, a variety of topologies can be constructed to collect data on an application server and route it to a log repository.
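A node definition in the Flume configuration language is essentially just that source-to-sink mapping. As a rough sketch (syntax from memory of the configuration language of that era, with illustrative source, sink, and file names):

```
node1 : tail("/var/log/app/app.log") | console ;
node2 : tail("/var/log/app/app.log") | { intervalSampler(10) => console } ;
```

The first line pipes a tailed log file straight to a sink; the second wraps the sink in a decorator that transforms events on the way through.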
Flume provides two patterns for nodes that most organizations will use. The first is the agent. Agents are endpoints that applications send events to. An agent will most often run on the same application server as the application. This creates a simple availability model, as the application and agent will have approximately the same availability (assuming the Flume agent itself doesn't crash more often than the application). Agents send their events to collectors, the second pattern for a Flume node. Collectors often deliver events directly to the log repository, although larger installations may have tiers of collectors for scale and routing reasons.
At Rearden, we are following a simple Flume architecture as illustrated below. Each application server will run an agent. The agents will send events to a pool of collectors which push the events into Hadoop HDFS. The agents and collectors are deployed in our production environments while HDFS is deployed in our QA environment. Developers are able to access HDFS directly in the QA environment. We have developed a log4j adapter that sends all Java logs directly to the agent. Additionally, a simplified Java Flume API is exposed that wraps the Flume Thrift API and manages configuration and common metadata consistently (timestamps, host name, etc.). Applications may leverage both interfaces to generate Flume events.
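A configuration sketch of that topology might look like the following. Again, this is illustrative rather than our actual configuration; the node names, ports, and paths are hypothetical:

```
app01 : tail("/var/log/app/app.log") | agentSink("collector01", 35853) ;
app02 : tail("/var/log/app/app.log") | agentSink("collector01", 35853) ;

collector01 : collectorSource(35853) | collectorSink("hdfs://qa-namenode/logs/%Y%m%d/", "app-") ;
```

Each application server's agent forwards events to a collector, and the collector batches them into date-partitioned HDFS directories.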
One challenge that companies face when passing logs from production to development environments is that sensitive information may inappropriately wind up in a log, which would leak it out of production. This is where a sink decorator on the collector can be employed. Our collector uses a decorator that looks for suspect patterns and replaces them with an HMAC of the original string. The events are also logged in a secure production repository for review by information security, allowing tickets to be filed against application code that is leaking sensitive data.
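The core of such a decorator can be sketched in a few lines of Java. This is the scrubbing idea only, not our production code; the suspect pattern and key handling are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the scrubbing idea: find suspect patterns in an event body and
// replace each match with its keyed HMAC, so the raw value never leaves
// production but security can still correlate occurrences of the same secret.
class Scrubber {
    // Illustrative pattern: long digit runs that look like card numbers.
    private final Pattern suspect = Pattern.compile("\\b\\d{13,16}\\b");
    private final Mac mac;

    Scrubber(byte[] key) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
    }

    String scrub(String body) {
        Matcher m = suspect.matcher(body);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // doFinal resets the Mac, so it is safe to reuse per match.
            byte[] digest = mac.doFinal(m.group().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("hmac:");
            for (byte b : digest) hex.append(String.format("%02x", b));
            m.appendReplacement(out, hex.toString());
        }
        m.appendTail(out);
        return out.toString();
    }
}
```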
Space on the HDFS cluster is managed by developers. The lifecycle of logs can be managed any way teams feel is appropriate. The tension over log space is greatly reduced because engineers are now in control of what to keep and how to manage the space appropriately.
Storing logs in HDFS opens up a richer toolset to engineers for analyzing and reporting on them. Obviously they can write their own map/reduce tools, but additionally, Hive and Pig can be used directly. What might take hours of development in Perl can be expressed in a single Pig query. The hope is that this will allow developers to uncover patterns in application behavior that were previously difficult to see, leading to better application quality.