No time for monitoring? - codecentric AG Blog


Monitoring large, distributed Java landscapes helps tremendously to keep complex applications under control. But many administrators skip the effort of setting up monitoring: no time. Now a time-saving solution is in sight.

“We are maxed out anyway. We need a solution that makes our work more effective, not something that, if we are lucky, saves about as much time as it takes to set up and maintain.”
I hear statements like this again and again from IT administrators. The result is that APM solutions are mainly used by experts for firefighting.

So, what is needed? A solution that can monitor a large number of applications with a minimum of configuration effort and that identifies the root cause of problems quickly.

I have indeed found and tested a tool that fulfills those requirements. AppDynamics has developed a product that is convincing not least because of its ease of use. I was sceptical at first, but I have not been disappointed in a couple of evaluations. It is almost as easy as an iPhone or Android app: simply use it.

The 3 steps towards 24×7 monitoring

Let’s take a look at the steps needed to establish application monitoring and how the AppDynamics solution adds value and saves time.

1. What to measure? – Measuring Points

Defining the measuring points (or sensors, or probes) is the first challenge. Most APM solutions for Java or .NET use BCI (bytecode instrumentation) to collect performance data. Because every measuring point executes additional code, the points need to be chosen very carefully to keep the overhead from distorting the results. This usually requires the assistance of an expert, an architect or a developer. For every application that needs to be monitored.

With agile development processes this is exhausting, as classes can change daily and new frameworks are added. A “trial-and-error” approach in production is out of the question because the application servers need to be restarted most of the time. In addition, the overhead can inadvertently rise to a level that is unbearable for the users.

AppDynamics uses a patent-pending technology which needs only a minimum of BCI and is still capable of delivering information at method level to identify “loitering” components. And that without any configuration effort. The architect or developer can do their day job without being bothered by the admin.

2. How to get an overview? – Visualization

Dashboards are commonly used to provide an overview of the architecture (which component talks to which, and how often?) and of the business transactions (which transaction is behaving badly, and who is affected?) for all applications involved.

Most vendors offer “customizable dashboards” for visualization as a kind of panacea, where every view can be adjusted for every type of user. And that is exactly what then has to be done for every detail and every application: “mustomizable dashboards”, so to speak. Any change in the environment or the business functionality requires additional effort.

AppDynamics dashboards are created automatically and determine business transactions based on the “inner” values of an application (e.g. Struts actions, URI patterns or HTTP parameters). If the default settings do not fit, they can be changed with a few clicks and the system is ready for action.
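To make the idea of deriving business transactions from URI patterns more concrete, here is a minimal sketch in Python. It is not how AppDynamics does it internally. The rule list, the transaction names and the function names are all hypothetical and only illustrate the principle of grouping requests by pattern, with a fallback for anything that no rule covers.

```python
import re
from collections import Counter

# Hypothetical rules: (regex on the request path, business transaction name).
RULES = [
    (re.compile(r"^/login(\.do)?$"), "Login"),
    (re.compile(r"^/cart/checkout"), "Checkout"),
    (re.compile(r"^/policy/\d+/quote$"), "Policy quote"),
]


def business_transaction(path: str) -> str:
    """Return the name of the first matching rule.

    Paths that match no rule are grouped by their first path segment,
    so unknown URLs still end up in a sensible bucket.
    """
    for pattern, name in RULES:
        if pattern.search(path):
            return name
    segment = path.strip("/").split("/")[0]
    return f"/{segment}" if segment else "/"


def count_transactions(paths):
    """Count how many requests fall into each business transaction."""
    return Counter(business_transaction(p) for p in paths)
```

Calling `count_transactions` on a list of request paths gives the call volume per business transaction, which is roughly the kind of figure a dashboard would show.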

AppDynamics - Application Overview


AppDynamics Application Flow Map


3. Red Alert! Something is going wrong. – Thresholds

What defines a problem in production? Usually something out of the norm, e.g. a user login takes three times as long as is normal for that time of day, or a JVM uses excessive amounts of CPU. Such anomalies become visible with the help of predefined thresholds; violating one raises an incident or alert.

Now, what I see in the real world are 100 or more applications with a multitude of different business transactions that have very diverse “normal” response times. Sometimes 2 seconds is very good (cost calculation for an insurance policy), sometimes 200 ms is a catastrophe (placing a bet on an online betting platform). Or worse: there are no non-functional requirements defined at all, so the thresholds have to be set initially by rolling dice and adjusted later. With only 50 applications with 50 transactions each we have a stunning 2500 thresholds that need to be set and checked. On a regular basis. And we have only looked at response times so far…

With AppDynamics this is not needed. Slick baselining and statistical methods such as standard deviation are used to automate this work. You can adjust each value individually if needed, but 95% of all thresholds are already covered by the default rules. This includes time-of-day and day-of-week differences: on Monday mornings, for example, the login process takes longer because of the load and will not raise an alert, whereas the same response time causes an incident two hours later or on Tuesday morning, because it is above the norm for that time frame.
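The following Python sketch shows what a time-aware baseline with a standard-deviation threshold could look like. It is an illustration under my own assumptions, not the AppDynamics algorithm: the class name `Baseline`, the bucketing by weekday and hour, the factor `k = 3` and the minimum of five samples are all chosen for this example.

```python
from collections import defaultdict
from statistics import mean, stdev


class Baseline:
    """Learns a 'normal' response time per (weekday, hour) bucket."""

    def __init__(self, k=3.0, min_samples=5):
        self.k = k                      # allowed standard deviations above the mean
        self.min_samples = min_samples  # no alerting before this many samples exist
        self.samples = defaultdict(list)

    @staticmethod
    def _bucket(ts):
        # Monday 09:00 and Tuesday 09:00 get separate baselines.
        return (ts.weekday(), ts.hour)

    def record(self, ts, response_ms):
        """Store one observed response time for the bucket of ts (a datetime)."""
        self.samples[self._bucket(ts)].append(response_ms)

    def threshold(self, ts):
        """Return mean + k * stdev for the bucket, or None if data is too sparse."""
        values = self.samples.get(self._bucket(ts), [])
        if len(values) < max(2, self.min_samples):
            return None
        return mean(values) + self.k * stdev(values)

    def is_abnormal(self, ts, response_ms):
        """True if the response time exceeds the learned threshold for that time."""
        limit = self.threshold(ts)
        return limit is not None and response_ms > limit
```

Fed with login times of around 2000 ms on Monday at 09:00 and around 500 ms on Monday at 11:00, the same 2100 ms response is accepted at 09:00 but flagged at 11:00. Nobody had to type in either threshold.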

4. And what about root cause analysis? (Bonus step)

Alerting in case of problems is nice and necessary: the admin knows that something went wrong, or learns in advance that something is about to go wrong. But who should be notified to fix it? Triage and root cause analysis capabilities complete the monitoring. This means identifying the person responsible for resolving the problem and giving them the details they need to get back to normal quickly.

I stated before that AppDynamics instruments very little bytecode. How are the necessary details retrieved, then? AppDynamics uses so-called snapshots, which include a call stack with timings and details about the transaction itself. Snapshots are taken automatically for abnormal transactions (too slow, erroneous, etc.), on demand, and periodically (for example every 10 minutes or every 100th occurrence). With this technology an administrator is spared a tsunami of data but has exactly the necessary information when he or she needs it.
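The three snapshot triggers can be pictured as a small decision function. The Python sketch below is only my illustration of such a sampling policy, not AppDynamics code. The class `SnapshotPolicy`, its parameters and the returned reason strings are hypothetical, and it applies both periodic variants (every nth occurrence and a time interval) side by side for demonstration purposes.

```python
class SnapshotPolicy:
    """Decides per transaction whether a detailed snapshot should be kept."""

    def __init__(self, every_nth=100, every_seconds=600):
        self.every_nth = every_nth          # e.g. every 100th occurrence
        self.every_seconds = every_seconds  # e.g. every 10 minutes
        self.count = 0
        self.last_periodic = None           # time of the last time-based snapshot
        self.on_demand = False

    def request_snapshot(self):
        """An administrator asks for a snapshot of the next transaction."""
        self.on_demand = True

    def decide(self, now, slow=False, error=False):
        """Return the reason for taking a snapshot, or None to skip.

        `now` is a timestamp in seconds; `slow` and `error` mark a
        transaction that has already been judged abnormal.
        """
        self.count += 1
        if slow or error:
            return "abnormal"
        if self.on_demand:
            self.on_demand = False
            return "on_demand"
        if self.count % self.every_nth == 0:
            return "nth_occurrence"
        if self.last_periodic is None or now - self.last_periodic >= self.every_seconds:
            self.last_periodic = now
            return "time_based"
        return None
```

Most healthy transactions fall through to `None`, which is the whole point: the detailed data is only collected where it is likely to be useful.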

In the coming weeks we will publish a series of blog posts on how to diagnose different kinds of performance problems in detail.

Simple and effective

In summary: AppDynamics has created an easy-to-use and effective solution in which I see the promises of the last seven years kept. A simple-to-use system developed specifically for monitoring highly distributed, business-critical Java applications.

Revolutionary? No, rather evolutionary. AppDynamics paid attention to the shortcomings of existing solutions and put a lot of thought into automation. “2-3-100” is the goal: 2 administrators take 3 days to set up 100 applications for monitoring.

The first providers of APM solutions for Java and .NET aimed to open the black box and get any data at all. The second generation extended this to transactions in order to be able to x-ray modern SOA/SBA-based applications. What was missing was usability and automation. How can I effortlessly sort my data and turn it into valuable information?

Let’s take a look into the next generation of APM!

Put an agent into an application (see AppDynamics Lite Screencast by Fabian), let it send data to the central controller and simply wait for the first results to reveal themselves.