Still playing the Blame Game or already solving problems? - codecentric AG Blog
There’s a fire. The organisation's most important Java application has crashed or is unbearably slow; management demands a fast resolution. Time to call the firefighters, often external troubleshooters. They can tackle the problems uninfluenced by politics or prejudice and bring in a wealth of experience and/or tools.
As a consulting house we at codecentric know those situations all too well. We bring in the experience and the tools, if they are not already there: production monitors like AppDynamics, profilers like JProfiler and other little helpers such as Eclipse Memory Analyzer. More important, though, is that as externals we are allowed to take an unobstructed view and ask tons of questions.
These are questions that, for many reasons, are no longer asked by those who have known the application for a long time, or that have never been asked at all. Together with the customer we usually find the root cause pretty fast, and then sometimes something surprising happens: nothing.
Performance Troubleshooting – a symptomatic example
One advantage of Java applications is automatic memory management, though a multitude of problems can come along with it. Take, for example, a rich client application that communicates statelessly with a server and processes a lot of data. For simplicity (!), the data is fetched from a database via Hibernate. That causes a lot of object creation on the server, usually small objects that are discarded again almost immediately (memory thrashing). With the production monitor we can watch this easily (see the three example graphs; this is not an actual case!) and spot the critical component almost instantly.
Together with the overview graph of the memory of the production system we can visually confirm that Hibernate uses a lot of transient memory for simple objects. That means a lot of CPU usage for garbage collection and loads of SQL statements.
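The memory thrashing described above can be made visible even without a production monitor. The following is a minimal sketch, not the setup from the example: it does not use Hibernate at all, but simulates "rows" with short-lived `String` arrays and reads the JVM's standard `GarbageCollectorMXBean` counters before and after. The class name `GcChurnProbe`, its helper methods and the row count are hypothetical and chosen only for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of any tool mentioned in the article.
final class GcChurnProbe {

    // Sums the collection counts of all collectors the JVM reports.
    // A collector returns -1 when the count is undefined; those are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Sums the approximate accumulated collection time in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Stand-in for mapping database rows to objects: every iteration creates
    // a few small objects that become garbage immediately afterwards.
    // The checksum keeps the JIT compiler from removing the work entirely.
    static long churn(int rows) {
        long checksum = 0;
        for (int i = 0; i < rows; i++) {
            String[] row = { "id-" + i, "name-" + i, "value-" + i };
            checksum += row[0].length() + row[1].length() + row[2].length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();

        long checksum = churn(5_000_000); // assumed workload size

        System.out.println("checksum:        " + checksum);
        System.out.println("collections run: " + (totalGcCount() - countBefore));
        System.out.println("GC time (ms):    " + (totalGcTimeMillis() - timeBefore));
    }
}
```

The numbers depend heavily on the JVM, the collector and the heap settings, so they are only useful for comparing runs on the same system. A small heap (for example `-Xmx64m`) should make the effect more pronounced.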
Excellent: we found the performance bottleneck. And we know that this system has some natural limits to its scalability. Imagine a coop that was originally designed for chickens but should now be populated by ostriches.
Of course we can get bigger cages and optimize the waste management but we will never fit as many ostriches in the coop as chickens before. To achieve that you either have to shrink the ostriches or completely redesign the coop. A simple configuration change will definitely not work.
Sounds logical? Yes! But not everyone is a Vulcan.
Problem found – measures taken
We frequently see the following measures (in that order):
- Add more memory (the options are limited with 32-bit JVMs)
- Scheduled restarts (cron job)
- Add even more memory
- More frequent restarts (though once per hour seems to annoy the users even more)
- Get an APM tool for monitoring and make the restarts more targeted
- Get a tuning specialist – often from one of the software vendors involved, who then shows that it is not their software that causes the trouble and that there is nothing more to tune
- Curse, measure, curse…
Did you notice something? Exactly: the underlying problem was not tackled; the root cause was not removed.
The problem is cunningly circumvented, but why is that so?
There is a variety of reasons, so I list only the most common ones:
- The application was developed by external consultants and documentation was rudimentary or not done at all. But as the development clearly followed the specifications and was signed off, this cannot be the root cause.
- The different departments (development, test, operations) are strictly separated and the interfaces clearly defined – there is no communication
- Frameworks were selected by business need and the performance aspects were neglected. That usually works fine on the developer's workstation but often not under production load
- Open source components were extended, ideally by external consultants, so Google doesn’t help – a lot of effort and expertise is needed to understand what happens
- It is a completely home-grown solution and the responsible developer has left the company – the documentation consists mainly of TODO tags
- The architecture was developed and optimized by experts over a long time and it simply cannot be wrong. A chicken coop was planned and successfully implemented, it works and was turned into a blueprint. Ostriches are birds too, hence the architecture must fit
In addition to these reasons there are various combinations of them, and many more besides.
What now? Alternatives to the traditional Blame Game
A monitoring tool helps, but it gathers only raw data; it takes a human to draw conclusions. The monitoring and profiling tools may be really good, but in my view only communication and openness can overcome the issues. Operations and the business have a problem, but only the technical people (developers, architects, etc.) can solve it, so it is essential to get them into a dialog. Often that is prevented by political decisions or older conflicts.
Instead of a dialog there are tedious war room meetings where everyone is trying to identify THE culprit who is responsible for the problem (the blame game). The focus is on the person or department, not on the real root cause of the problem. That is counterproductive. The participants (developers, architects, software vendors, sysadmins, business, etc.) are forced to defend themselves and explain why the problem is not in their domain or why the issue was an unavoidable constraint. That does not solve the problem.
An objective dialog can be started using the measurements and facts as a catalyst. This also depends very much on the instinct of whoever moderates such a meeting. An external consultant is able to put a finger directly on the critical issues, uninfluenced by any internal history, and to ask the crucial questions: Who can best solve the one issue, and who the other? Can the other departments add anything to the solution? If it turns out that the solution that really helps will need time and money, another discussion on a different level can be started. But as a result we do have the real reason for the issues, and the assurance that one can either live with those problems or spend some money and get them fixed. That is a decision that, in the end, management has to take for the company.