The infrastructure that powers a product and all of the services that it provides can be huge and complex because the product is scaled to serve millions of users. In most cases, each service might depend on various components for seamless functioning. With a product that houses a variety of features with critical infrastructure components and services powering these features, it becomes vital to monitor these components and services and keep them running at any cost.
This monitoring system has to handle the following:
Initially when we began facing problems because of some of our machines and services going down, we wrote ad hoc monitoring scripts that ran as crons to send email alerts. We also set up AWS Cloudwatch alarms to send notifications via email. There came a time when we had a very high number of components to be monitored and we realised that we were not getting enough insight into the load and usage of our machines. This is when we decided to put a system in place to collect data from these monitors and services. We also added a monitoring component to send alerts in more reliable ways (through phone calls) to the product owners and our on-call team, in case of any downtime.
This system consists of the following components that work together,
There are a lot of tools out there for each of these components that can be put together to build this system rather than writing it from the scratch. We were very happy with what Cabot had to offer as a monitoring and alert system. Since it used Graphite APIs for the monitoring checks, we decided to use Graphite to record and serve all our metrics.
Graphite is usually used with the following components at its core:
While these are useful, there are many issues that surface as you try to scale this system, Some of the major limitations seen with this approach include:
InfluxDB is a database that is written in Go, primarily for time-series metric data. This database is by far the best option for a database to be used with Graphite. We also found alternative components for:
The components that we found to be better include:
Collectd is used for the metrics-collection component. You can get collectd from here. Collectd provides a lot of plugins for collecting various metrics. Some important plugins to look at are CPU, memory, and disk. It also provides plugins for other tools like MySQL, Redis, PostgreSQL and amqp.
Custom metrics can be reported to carbon-relay-ng using the Python plugin that runs a Python script to collect metrics. This is an example of writing custom-service plugins (in Python) that are used by collectd. Most importantly the write_graphite plugin is used to report metrics to carbon-relay-ng in the graphite line format.
This is the format of the write_graphite plugin that should be added in the collectd config file
<Plugin write_graphite> <Node "graphite"> Host "<carbon-server-ip>" Port "<carbon-server-port>" Protocol "tcp" LogSendErrors true Prefix "random_prefix." StoreRates true AlwaysAppendDS false EscapeCharacter "_" </Node> </Plugin> # Important: You must use the right prefix. If it does not match with # any of the carbon routes that have been configured, then carbon-relay-ng will # drop the metrics.
The intervals in which collectd sends metrics is configured a the global level in the config file for all the plugins by using the Interval parameter. It can be specified inside each plugin to override the interval for that plugin.
Collectd is one of the tools that is used to report metrics. Many other tools can be used to report metrics to graphite with this architecture. Some alternatives can be found here.
You can install carbon-relay-ng from here. The following points have to be kept in mind while using it:
Format : addRoute <type> <key> [opts] <dest> [<dest>[...]] 'addRoute sendAllMatch collectd 127.0.0.1:2005 prefix=hemetrics'
Multiple such routes can be added to the init list in the config file. Here the command addRoute adds a new route of the type sendAllMatch, which sends metrics to all destinations that match! Metrics can be matched by using prefixes, substring, or regex matches.
In this case the route has the key collectd and sends metric data with the prefix hemetrics to all matching destinations where the destination is expected to be a TCP endpoint (IP:port combination).
Carbon or carbon-relay-ng in our case, can be fed externally too by using the line-text protocol. For more information, read the relevant documentation here.
carbon-relay-ng provides a lot more options for matching and transforming data. For more information, read the documentation. :)
InlfuxDB is a very good choice for storing our metrics data as it is a time-series database and has a very high performance (written in Go). Some of the other features that make InfluxDB awesome:
# Config for the graphite listener [[graphite]] # Determines whether the graphite endpoint is enabled. enabled = true database = "graphite" # This is what the carbon-relay-ng route should use as the destination bind-address = "localhost:2005" protocol = "tcp"
Grafana as a visualization component, provides a plethora of options like histograms, heat maps, and graphs. Grafana provides alerting options and notifications on alerts to PagerDuty, Slack, and a few other services. However, since we wanted something more than that we decided to deploy Cabot as the monitoring and alerting component.
Grafana enables us to create beautiful dashboards and renders it seamlessly for the on-call team to monitor various components in a single dashboard. Installing grafana and creating dashboards is a cakewalk. Start creating your dashboards here.
Graphite-API is grahpite-web with just the HTTP APIs and not the web interface. It implements most of the APIs. To make Graphite-API use InfluxDB as the datasource, install Influx Graph (storage plugin for Graphite-API).
The configuration for Graphite-API resides in /etc/graphite-api.yaml. An example for it can be found here
Installing Influx Graph installs Graphite-API and a lot more dependencies including InfluxDB. Graphite-API provides a lot of configuration options for using standard templates for queries, caching queries in memcache, aggregation functions, and grouping data bases on intervals combined with query-retention policies to cache queries of certain intervals for specific amounts of time.
When templates are added, the API expects queries for the metrics of the format that is specified in these templates only. Any other queries will return zero results.
Graphite-API can be deployed in various ways. At HackerEarth we have deployed it using NGINX and uWSGI.
Cabot is an open-source monitoring and alert system written in Python and Django. It provides the best features of PagerDuty free of cost.
The documentation is pretty good for setting it up and getting it work without any hassles. Some of the best features of Cabot include:
All of the cabot configuration options can be found here.
Note: Cabot uses Celery to run all the service check tasks. The configuration parameter CELERY_BROKER_URL requires a redis host with the right authentication. Also set the WWW_HTTP_HOST to the FQDN of the server.
Check if all the checks are running by visiting /status/ endpoint.
It has been two months since we started using this system at HackerEarth. It has been very stable and completely reliable. We collect hundreds of metrics from various machines and services every minute. This has given us a better understanding of the load on our critical machines and services, thus helping us manage our infrastruture more efficiently and minimising our downtime.
Posted by Karthik Srivatsa