Error Solving In Java And The JVM
I want to say up front that this is not an application code-level article; there are many excellent articles out there relating to try-catch use, etc, therefore I am not giving this blog piece the title of "Error Handling" (which I almost did). Rather I will go through exactly what I do when a client has reached out to me for assistance and when that is troubleshooting rather than capacity planning or migrations. Typically the main troubleshooting needs relate to dramatically degrading responses, at certain times. There is our first clue "at certain times", in other words not all the time; here is what I typically do.
Firstly as we are talking here of Java, I always enable garbage collection (GC) logging of the JVM. In my opinion, if the JVM is not optimized it is akin to starting out on a 1,000 drive in a 4 cylinder car with only 3 cylinders firing. Another critical point here, no JVM on any system is correctly optimized "out-of-the-box", that would be impossible; "why is that?" you may ask. Because JVM's can be running on a set-top box (even on a raspbery pi which I encountered recently) right up to massive multi-processor fintech systems. When we enable GC logging we also want to make sure the logs rotate and do not get lost on a system restart; here are the arguments I use.
Here is an important note, the resulting log output will vary depending on the garbage collector in use, here are the main ones I come across.
- The Parallel Collector: Enabled by using this argument -XX:+UseParallelGC.
- The Concurrent Mark Sweep (CMS) collector: Enabled by using this argument -XX:+UseConcMarkSweepGC (I also recommend using -XX:+UseParNewGC with the CMS collector).
- The Garbage First Collector (G1) collector: -XX:+UseG1GC
In my experience, and for web applications in particular, the order above is how I found them to most beneficial, -XX:+UseG1GC being the best selection for current JavaEE applications.
So, using the logging settings above we get a GC log called cfwhispererGC.log, we can certainly look in there in a typical text editor, I prefer to use a great little utility called "gcviewer" which can be downloaded here. What I home in on, in that tool , firstly is how long has the system been running since last start up, in the case of this screenshot it is just over 21 hours which is a reasonable time from which to glean good information. We can see here the Summary View that the first trouble peaked about 7 hours and 18 minutes in also I look at the number of full garbage collections, Full GC's, as each is a "stop the world"event; here there have been 27...
The next thing I look at is how the garbage collection "generations" are under the Memory tab in gcviewer, I realize this screenshot might be hard to read, basically what it is telling us is that the total heap, the tenured and young generations are all at less more or 100% used up. So my first recommendation here would be to increase the overall heap size and change from the parallel collector to CMS, to reduce Full GC's...
After making those changes we get a much more stable JVM, now maximum total heap use is 68.2% and the tenured generation is at 41% used up. The tenured generation should kept with a good amount of free space whenever possible as once objects are in there they have nowhere else to go and a failure to collect them there will result in a memory leak...
These are the steps I go through to tune the JVM, optimally. Ideally any major code changes should involve at least double-checking that the JVM is performing well with those changes and I recommend leaving GC logging running.