Using Ambari Blueprints to automatically provision and install the Lambda ...


In this blog post we give a tutorial on the brand new Ambari Blueprints. Blueprints let you automate the configuration of Hadoop clusters. Together with Vagrant, Foreman and Puppet, they are the last missing component needed to describe a Hadoop cluster completely in code and have it run automatically on both virtual machines and bare metal. This makes it possible to quickly create development and test clusters that are (possibly with the exception of size) identical to the production environment.

As an example for this tutorial we use a realization of the Lambda Architecture. You can read about the Lambda Architecture here, but in the end you need to know nothing more than that we want a Hadoop 2 cluster with HDFS (to store files), HBase (to store precomputed views, both those created in batch processes and realtime views), Storm (to process data in realtime), Map Reduce (to process data in batches) and finally Pig to make it easier to create views. Of course we also want to use Tez to speed up our processing.

Prerequisites

In previous blog posts we described how to provision virtual or bare metal machines automatically to build your own Hadoop cluster. In both we provided and configured Ambari for you, so if you followed along you already meet these requirements. Otherwise, check the following points. You will need:

  • an installed Ambari server on one node
  • an installed Ambari agent on all nodes
  • a working DNS environment including a domain name for every node
  • disabled or properly configured services that could interfere with or block Ambari
  • the NTP service installed on all nodes

Ambari

When installing a Hadoop cluster, Ambari provides an easy way to install a customizable stack of Hadoop services without needing to worry about the details of the installation. You simply install Ambari and click your way through the user interface.

The blueprint feature was introduced with Ambari 1.5.0 (but not publicized widely). It lets you programmatically set most of the configuration options supported by the Ambari UI. You can define the combination of Hadoop services for each host plus a set of cluster-scoped configurations. Host-scoped configurations are not yet available.

The feature consists of two main components, the blueprint itself and a host mapping. Both are JSON objects and can be exchanged with the Ambari REST API.

Blueprints

A blueprint defines the logical structure of a cluster without needing any information about the actual infrastructure. You can therefore use the same blueprint for different numbers of nodes, different IPs and different domain names.

The base structure is defined by the top-level JSON elements “configurations”, “host_groups” and “Blueprints” (the “configurations” element is optional):

{
  "configurations" : [{ ... }, { ... }, ... ],
  "host_groups" : [{ ... }, { ... }, ... ],
  "Blueprints" : { ... }
}
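
Before posting a blueprint it can help to check that this skeleton is complete. The following is a minimal sketch in Python, based only on the three elements named above; the function name check_top_level is made up for this illustration, and treating any other top-level key as a problem is our assumption, not documented Ambari behaviour.

```python
import json

# Top-level elements as described in the text above.
REQUIRED_KEYS = {"host_groups", "Blueprints"}
OPTIONAL_KEYS = {"configurations"}


def check_top_level(blueprint):
    """Return a list of problems with the top-level keys of a blueprint dict."""
    problems = []
    keys = set(blueprint)
    for missing in sorted(REQUIRED_KEYS - keys):
        problems.append("missing required element: " + missing)
    # Assumption: anything else at the top level is most likely a typo.
    for unknown in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append("unexpected element: " + unknown)
    return problems


skeleton = json.loads('{"host_groups": [], "Blueprints": {}}')
print(check_top_level(skeleton))  # an empty list means nothing is missing
```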

Configurations

This element sets the configuration options for the services running within the Hadoop cluster. It is structured as an array of configuration types. Every type is identified by a unique name and contains key-value pairs of specific settings. For example, “global” is a configuration type that contains, among others, the setting “namenode_heapsize”:

"configurations" : [ { "global" : { "namenode_heapsize" : "1536m", ...  } }, { ... } ]

You can read the internal types and keys for a setting from the user interface. We also collected the most common settings here (look for the “type” and “properties” attributes). Keep in mind that some of these values are specific to the cluster we used. Also, some dynamically generated property names (e.g. those that contain a user name) might not work yet. We recommend specifying only the values that you really want to differ from the defaults.

If you want to read the existing configuration of a running Ambari cluster, you can do so with the following HTTP calls. (When using a browser, make sure to log into Ambari first. All HTTP calls here are relative to the Ambari server’s base HTTP address.)

GET   /api/v1/clusters/c1/configurations 
	- shows you which configuration types and tags you are using
 
GET   /api/v1/clusters/c1/configurations?type=INSERT_TYPE&tag=INSERT_TAG
	- then shows you the setting values (tag normally equals 1)
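
If you script these lookups, the two paths can be built from the cluster name, type and tag. This is only a small sketch; the helper name configurations_path is ours, and the result still has to be appended to your Ambari server's base address and requested with your Ambari credentials.

```python
from urllib.parse import urlencode


def configurations_path(cluster, config_type=None, tag=None):
    """Build the path of the two GET calls shown above.

    Without a type the overview path is returned; with a type (and
    optionally a tag) the path for the concrete setting values is returned.
    """
    path = "/api/v1/clusters/%s/configurations" % cluster
    if config_type is None:
        return path
    query = {"type": config_type}
    if tag is not None:
        query["tag"] = tag
    return path + "?" + urlencode(query)


print(configurations_path("c1"))
print(configurations_path("c1", "global", "1"))
```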

Host Groups

A host group has a specified name (unique in the same blueprint) plus a cardinality and contains a combination of Hadoop service components. So a host group defines a server type in a cluster: every server in one host group will get the same service components installed. The cardinality is the number of servers that should be in a specific host group. It seems that this attribute is not restrictive. You can set it to a higher value or even to “*”.

For a one-node HDFS setup, this could look as follows:

"host_groups":[
  { "name":"host_group_1",
    "components":[
      { "name":"ZOOKEEPER_SERVER" },
      { "name":"ZOOKEEPER_CLIENT" },
      { "name":"AMBARI_SERVER" },
      { "name":"NAMENODE" },
      { "name":"HDFS_CLIENT" },
      { "name":"SECONDARY_NAMENODE" },
      { "name":"DATANODE" }, ... ],
    "cardinality":"1" }, ... ]

The component names are Ambari-specific. For convenience, the HDP-2.1 services and their components are listed below.

HDFS		DATANODE, HDFS_CLIENT, JOURNALNODE, NAMENODE, SECONDARY_NAMENODE, ZKFC
YARN		APP_TIMELINE_SERVER, NODEMANAGER, RESOURCEMANAGER, YARN_CLIENT
MAPREDUCE2	HISTORYSERVER, MAPREDUCE2_CLIENT
GANGLIA		GANGLIA_MONITOR, GANGLIA_SERVER
HBASE		HBASE_CLIENT, HBASE_MASTER, HBASE_REGIONSERVER
HIVE		HIVE_CLIENT, HIVE_METASTORE, HIVE_SERVER, MYSQL_SERVER
HCATALOG	HCAT
WEBHCAT		WEBHCAT_SERVER
NAGIOS		NAGIOS_SERVER
OOZIE		OOZIE_CLIENT, OOZIE_SERVER
PIG		PIG
SQOOP		SQOOP
STORM		DRPC_SERVER, NIMBUS, STORM_REST_API, STORM_UI_SERVER, SUPERVISOR
TEZ		TEZ_CLIENT
FALCON		FALCON_CLIENT, FALCON_SERVER
ZOOKEEPER	ZOOKEEPER_CLIENT, ZOOKEEPER_SERVER

Now you could craft yourself a few host groups and try to provision them. But keep in mind that there is no validation of your component combinations when using a blueprint. Therefore you should take the requirements of each component into account.
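
Since Ambari does not validate the blueprint for you, a small local check can at least catch misspelled component names and duplicate host group names. The sketch below uses the component list from the table above plus AMBARI_SERVER, which appears in our examples; the function name check_host_groups is hypothetical. It does not check whether a combination of components makes sense, which remains your job.

```python
# Component names from the HDP-2.1 table above, plus AMBARI_SERVER,
# which is used in the host group examples of this post.
KNOWN_COMPONENTS = set("""
DATANODE HDFS_CLIENT JOURNALNODE NAMENODE SECONDARY_NAMENODE ZKFC
APP_TIMELINE_SERVER NODEMANAGER RESOURCEMANAGER YARN_CLIENT
HISTORYSERVER MAPREDUCE2_CLIENT GANGLIA_MONITOR GANGLIA_SERVER
HBASE_CLIENT HBASE_MASTER HBASE_REGIONSERVER
HIVE_CLIENT HIVE_METASTORE HIVE_SERVER MYSQL_SERVER HCAT WEBHCAT_SERVER
NAGIOS_SERVER OOZIE_CLIENT OOZIE_SERVER PIG SQOOP
DRPC_SERVER NIMBUS STORM_REST_API STORM_UI_SERVER SUPERVISOR
TEZ_CLIENT FALCON_CLIENT FALCON_SERVER ZOOKEEPER_CLIENT ZOOKEEPER_SERVER
AMBARI_SERVER
""".split())


def check_host_groups(host_groups):
    """Return a list of naming problems found in a "host_groups" array."""
    problems = []
    seen = set()
    for group in host_groups:
        name = group.get("name")
        if name in seen:
            problems.append("duplicate host group name: %s" % name)
        seen.add(name)
        for component in group.get("components", []):
            if component["name"] not in KNOWN_COMPONENTS:
                problems.append("%s: unknown component %s" % (name, component["name"]))
    return problems


groups = [{"name": "host_group_1",
           "components": [{"name": "NAMENODE"}, {"name": "DATANOD"}],
           "cardinality": "1"}]
print(check_host_groups(groups))  # reports the misspelled DATANOD
```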

To be safe it is possible to retrieve the blueprint of an existing cluster with the following HTTP call:

GET   /api/v1/clusters/YOUR_CLUSTER_NAME?format=blueprint
	- gives you the exact component combination of a cluster as raw blueprint
		(without the cluster configuration!)

You could therefore also configure the components in the user interface, where a bit of validation logic supports you, and then export the result. The blueprint can be retrieved from the moment the installation begins.

Caution: In version 1.5.1 of Ambari there is an issue with using any HBase component in blueprints! However, you can still install HBase manually afterwards, and even automate this by intercepting and reusing the HTTP calls of the user interface.

Other

The final missing JSON element “Blueprints” only contains the blueprint name, the stack (HDP) and the stack version. The name will be important later when mapping a blueprint to an actual cluster.

"Blueprints" : {
  "blueprint_name" : "blueprint-c1",
  "stack_name" : "HDP",
  "stack_version" : "2.1" }

Host Mapping

For the actual cluster creation you also need a second JSON file. The remaining work is to tell Ambari which blueprint it should use and which host belongs to which host group. The attribute “blueprint” names the blueprint. After that you list the hosts of each host group. For example, here we define the host “one.cluster” to be in “host_group_1” of “blueprint-c1” (the ip is optional):

{ "blueprint":"blueprint-c1",
  "host-groups":[
    { "name":"host_group_1",
      "hosts":[
        { "fqdn":"one.cluster",
          "ip":"192.168.0.101" }, ... ] }, ... ] }

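When the host names come from a provisioning tool, the host mapping can be generated instead of written by hand. The following Python sketch builds the structure shown above from a dictionary of host group names and FQDNs. The function name build_host_mapping is made up for this illustration, the key "host-groups" is written exactly as in the example above, and the optional ip is left out.

```python
import json


def build_host_mapping(blueprint_name, hosts_by_group):
    """Build a host mapping dict from {host group name: [fqdn, ...]}."""
    return {
        "blueprint": blueprint_name,
        "host-groups": [
            {"name": group, "hosts": [{"fqdn": fqdn} for fqdn in fqdns]}
            for group, fqdns in hosts_by_group.items()
        ],
    }


mapping = build_host_mapping("blueprint-c1", {"host_group_1": ["one.cluster"]})
print(json.dumps(mapping, indent=2))
```
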
Now there is only one question left: How do you create the cluster? It’s as simple as two REST calls: (Ambari requires you to include the header: ‘X-Requested-By:MY_COMPANY’. See our example for how to trigger the requests.)

POST  /api/v1/blueprints/BLUEPRINT_NAME    blueprint.json
	- makes the blueprint available to ambari
 
POST  /api/v1/clusters/CLUSTER_NAME        hostmapping.json
	- merges the blueprint with the host mapping into a cluster
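
If you prefer a script over a command line client, the two requests can be prepared with Python's standard library. This is a sketch under assumptions: the helper name build_post is ours, the base address and the admin:admin credentials are placeholders for your own setup, and the payloads are shortened stand-ins for the real JSON documents. The requests are only built here; passing one to urllib.request.urlopen would actually send it.

```python
import base64
import json
import urllib.request


def build_post(base_url, path, payload, user, password, requested_by):
    """Prepare (but do not send) a POST request with the headers Ambari expects."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(base_url + path, data=body, method="POST")
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    # Ambari rejects modifying requests that lack this header.
    request.add_header("X-Requested-By", requested_by)
    return request


base = "http://192.168.0.101:8080"           # placeholder Ambari address
blueprint = {"Blueprints": {"blueprint_name": "blueprint-c1"}}  # shortened stand-in
host_mapping = {"blueprint": "blueprint-c1"}                    # shortened stand-in

first = build_post(base, "/api/v1/blueprints/blueprint-c1", blueprint,
                   "admin", "admin", "mycompany")
second = build_post(base, "/api/v1/clusters/c1", host_mapping,
                    "admin", "admin", "mycompany")
for request in (first, second):
    print(request.get_method(), request.full_url)
# To send: urllib.request.urlopen(first), then urllib.request.urlopen(second)
```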

After these calls, your cluster begins to install. You can log into Ambari and watch the installation or do other stuff. To see this in action, follow us through our example:

Target Cluster

For demonstration purposes we continue the example with three virtual machines from our first blog post. We provide a Lambda Architecture blueprint and a host mapping that fit these virtual machines. You can also follow our example with your own infrastructure; simply adapt it where needed.

Our target cluster will therefore consist of three (virtual) machines. With so few machines we need a host group for each one and distribute the “heavy” services equally among them. Every machine also gets the standard and client services.

To recall, we want our cluster to provide the functionality of the Lambda Architecture. In general this means combining realtime and batch computation into one consistent realtime big data context. The realtime results (e.g. from Storm) and the batch results (e.g. from Map Reduce 2 + Pig) can be combined and stored in HBase. Every other service that we specify in the blueprint provides the base for Storm, Map Reduce 2, Pig and HBase: distributed file storage (HDFS), resource management (YARN), execution engine (Tez), coordination service (ZooKeeper) and monitoring + metrics (Nagios + Ganglia).

The first VM is the monitoring and resource management node. The second node contains the Storm service components and the third node should handle the HBase master component. (The HBase components are omitted from the example blueprint, because they lead to a failure while installing in Ambari version 1.5.1.)

Example

The result of this consideration is the following blueprint:

{ "host_groups" : [
    { "name" : "host_group_1",
      "components" : [
        { "name" : "ZOOKEEPER_SERVER" },
        { "name" : "ZOOKEEPER_CLIENT" },
        { "name" : "PIG" },
        { "name" : "HISTORYSERVER" },
        { "name" : "SUPERVISOR" },
        { "name" : "NAGIOS_SERVER" },
        { "name" : "TEZ_CLIENT" },
        { "name" : "AMBARI_SERVER" },
        { "name" : "APP_TIMELINE_SERVER" },
        { "name" : "GANGLIA_SERVER" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "MAPREDUCE2_CLIENT" },
        { "name" : "DATANODE" },
        { "name" : "GANGLIA_MONITOR" },
        { "name" : "RESOURCEMANAGER" } ],
      "cardinality" : "1" },
    { "name" : "host_group_2",
      "components" : [
        { "name" : "ZOOKEEPER_SERVER" },
        { "name" : "ZOOKEEPER_CLIENT" },
        { "name" : "PIG" },
        { "name" : "STORM_REST_API" },
        { "name" : "STORM_UI_SERVER" },
        { "name" : "SUPERVISOR" },
        { "name" : "SECONDARY_NAMENODE" },
        { "name" : "TEZ_CLIENT" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "MAPREDUCE2_CLIENT" },
        { "name" : "DATANODE" },
        { "name" : "GANGLIA_MONITOR" },
        { "name" : "DRPC_SERVER" },
        { "name" : "NIMBUS" } ],
      "cardinality" : "1" },
    { "name" : "host_group_3",
      "components" : [
        { "name" : "ZOOKEEPER_SERVER" },
        { "name" : "ZOOKEEPER_CLIENT" },
        { "name" : "PIG" },
        { "name" : "NAMENODE" },
        { "name" : "SUPERVISOR" },
        { "name" : "TEZ_CLIENT" },
        { "name" : "HDFS_CLIENT" },
        { "name" : "NODEMANAGER" },
        { "name" : "YARN_CLIENT" },
        { "name" : "MAPREDUCE2_CLIENT" },
        { "name" : "DATANODE" },
        { "name" : "GANGLIA_MONITOR" } ],
      "cardinality" : "1" } ],
  "Blueprints" : {
    "blueprint_name" : "blueprint-c1",
    "stack_name" : "HDP",
    "stack_version" : "2.1" } }

Mapping this to the three virtual machines is then quite easy to describe:

{ "blueprint":"blueprint-c1",
  "host-groups":[
    { "name":"host_group_1",
      "hosts":[ { "fqdn":"one.cluster" } ] },
    { "name":"host_group_2",
      "hosts":[ { "fqdn":"two.cluster" } ] },
    { "name":"host_group_3",
      "hosts":[ { "fqdn":"three.cluster" } ] } ] }

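A blueprint and its host mapping must agree on the host group names. The following sketch compares the two documents before they are sent. The function name compare_groups is hypothetical, and the assumption that every host group of the blueprint needs at least one host in the mapping is ours.

```python
def compare_groups(blueprint, host_mapping):
    """Return (groups without hosts, mapped groups missing from the blueprint)."""
    defined = {group["name"] for group in blueprint["host_groups"]}
    mapped = {group["name"] for group in host_mapping["host-groups"]
              if group.get("hosts")}
    return sorted(defined - mapped), sorted(mapped - defined)


blueprint = {"host_groups": [{"name": "host_group_1"},
                             {"name": "host_group_2"},
                             {"name": "host_group_3"}]}
host_mapping = {"blueprint": "blueprint-c1",
                "host-groups": [
                    {"name": "host_group_1", "hosts": [{"fqdn": "one.cluster"}]},
                    {"name": "host_group_2", "hosts": [{"fqdn": "two.cluster"}]},
                    {"name": "host_group_3", "hosts": [{"fqdn": "three.cluster"}]}]}
print(compare_groups(blueprint, host_mapping))  # two empty lists: names agree
```
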
Procedure

If you want to see this in action:

  1. Start up three virtual machines managed by Ambari and wait for them to start up (to do this you can use the resources provided here).
  2. Choose your favorite way to trigger the already described POST requests (including the given JSON objects and needed header) or simply execute the following commands: (You can also run them in one of the virtual machines.)
curl http://vzach.de/data/lamba-blueprint.json -o lamba-blueprint.json
 
curl --user admin:admin -H 'X-Requested-By:mycompany' -X POST http://192.168.0.101:8080/api/v1/blueprints/blueprint-c1 -d @lamba-blueprint.json
 
curl http://vzach.de/data/lamda-hostmapping.json -o lamda-hostmapping.json
 
curl --user admin:admin -H 'X-Requested-By:mycompany' -X POST http://192.168.0.101:8080/api/v1/clusters/c1 -d @lamda-hostmapping.json

To automate this further, you could wrap the commands in a script and execute it once every required machine is provisioned. You can query whether a machine is ready to be used by Ambari with: GET /api/v1/hosts
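
Such a script needs to decide when all machines have registered. The sketch below works on the decoded JSON answer of GET /api/v1/hosts. It assumes that the answer has the shape {"items": [{"Hosts": {"host_name": ...}}, ...]}; please verify this against the output of your own Ambari version. The function name missing_hosts is made up for this example.

```python
def missing_hosts(hosts_response, expected_fqdns):
    """Return the expected hosts that Ambari does not list yet.

    hosts_response is the decoded JSON of GET /api/v1/hosts; the
    "items" / "Hosts" / "host_name" layout is an assumption.
    """
    registered = {item["Hosts"]["host_name"]
                  for item in hosts_response.get("items", [])}
    return sorted(set(expected_fqdns) - registered)


# Hand-written sample answer, not real Ambari output.
sample = {"items": [{"Hosts": {"host_name": "one.cluster"}},
                    {"Hosts": {"host_name": "two.cluster"}}]}
print(missing_hosts(sample, ["one.cluster", "two.cluster", "three.cluster"]))
```

A provisioning script could poll until the returned list is empty and only then post the blueprint and the host mapping.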

Conclusion

We have seen how the configuration of a Hadoop cluster can be described in blueprints and how this makes it possible to manage this configuration together with the rest of the codebase. Together with Foreman and Puppet it is possible to go from bare metal to an installed cluster without manual intervention. However, we have to admit that, although you can try it out and experiment with it right now, you had best wait for Ambari version 1.6 before you fully integrate it into your work environment: it is still experimental and some features, such as the provisioning of HBase, do not work yet.

The presented solution is particularly well suited for cases where you already have a cluster and only need to represent changes to its configuration. In such cases it is easy to generate an initial blueprint from the existing configuration and to apply changes.

Authors

Valentin Zacharias and Malte Nottmeyer