Within our OpenShift cluster, we run one main Java-based application. The Splunk Connect for Kubernetes integration worked well out of the box: all our pods now log to Splunk. There is just one problem: the Fluentd log forwarder simply reads your logs and forwards them to Splunk without any interpretation. In this blog post I’ll explain how we had to tune the Fluentd configuration to handle stack traces, and how to configure this in the Helm chart you use to install Splunk Connect in the first place.
By default, the Splunk Connect for Kubernetes integration grabs every pod log and forwards it to Splunk. Splunk then interprets these messages and generates an ‘event’ for each line. This is fine when every line in your log has its own meaning. For example, look at the webserver access.log below, which states on each line which web page was requested, how it was handled, and so on:
192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395
127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET / HTTP/1.0" 200 2216
127.0.0.1 - - [28/Jul/2006:10:27:32 -0300] "GET /hidden/ HTTP/1.0" 404 7218
In this example each line is an event on its own, as they represent different requests to different web pages. These events show up in Splunk as separate ‘events’ as well.
However, when you have a Java stack trace, the message is spread across multiple lines:
2021-08-30T09:45:08,525 ERROR [org.springframework.web.context.ContextLoader] (ServerService Thread Pool -- 74) Context initialization failed: [...]
    at org.springframework.beans.factory.support....
    at org.springframework.beans.factory.support.ConstructorResolver...
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory...
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory...
And this results in each line becoming an individual ‘event’ in Splunk.
As you can imagine, this makes it very difficult to read the logs and to search for errors. To solve this, you have to use a feature called ‘multiline‘. This feature recognises the start of an event and stitches all lines together (using the concat plugin) until the beginning of a new event is found. For this, Fluentd needs to recognise the start of an event. In the example above, we know that each event in the Java (EAP 7) log starts with a timestamp like this:
2021-08-30T09:45:08,525
To figure out the proper filter (which can be a trial-and-error process), you can add the following filter to the configmap splunk-kubernetes-logging in your splunk-logging namespace:
<filter tail.containers.var.log.containers.my-application-name-*.log>
  @type concat
  key log
  timeout_label @SPLUNK
  stream_identity_key stream
  multiline_start_regexp /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/
  flush_interval 5
  separator "\n"
  use_first_timestamp true
</filter>
The filter starts with the tail.containers.var.log.containers. prefix, after which you specify the name of the log file generated by your pod. This way the multiline filter only applies to the output of your pod.
Inside the filter you can see the type concat, which means that Fluentd will stitch lines together. The start of a multiline event is recognised by a regular expression, which matches the timestamp at the start of the stack trace. In the example above, this is the format 2021-08-30T09:45:08,525.
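If you want to verify the regular expression before putting it in the configmap, a quick sketch in Python can help. The pattern below is the same one used in the filter; the sample log lines are made up for illustration:

```python
import re

# Same pattern as multiline_start_regexp in the filter above: a line starts
# a new event only if it begins with an EAP7-style timestamp such as
# "2021-08-30T09:45:08,525 " (note the comma before the milliseconds).
start_of_event = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s")

samples = [
    "2021-08-30T09:45:08,525 ERROR [org.springframework.web.context.ContextLoader] Context initialization failed",
    "\tat org.springframework.beans.factory.support.ConstructorResolver.resolve(...)",
    "Caused by: java.lang.NullPointerException",
]

for line in samples:
    # Only the first sample matches; the "at ..." and "Caused by" lines
    # will therefore be glued onto the previous event.
    print(bool(start_of_event.match(line)))
```

Only lines that match the pattern start a new event; everything else is appended to the event in progress.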
In our case we needed the timeout_label @SPLUNK to prevent Fluentd from hitting an error:
#0 dump an error event: error_class=ThreadError error="deadlock; recursive locking"
That error prevented Fluentd from uploading the stack trace entry to Splunk. When the multiline filter works, all lines of the stack trace are stitched together and appear in Splunk as a single event.
Even though it’s much easier to search for errors this way, the log entry is still hard to read. To fix this, use separator "\n" to join the lines with line breaks, so Splunk can display the event with its original formatting.
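To make the behaviour concrete, here is a minimal Python sketch of what the concat filter does with the settings above. This is simplified: the real fluent-plugin-concat also handles flush timeouts and per-stream buffering, and the function name here is my own:

```python
import re

# Same event-start pattern as in the Fluentd filter.
START_OF_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s")

def concat_multiline(lines, separator="\n"):
    """Buffer lines until the next event start, then join them with separator."""
    events, buffer = [], []
    for line in lines:
        if START_OF_EVENT.match(line) and buffer:
            events.append(separator.join(buffer))  # flush the previous event
            buffer = []
        buffer.append(line)
    if buffer:
        events.append(separator.join(buffer))  # flush whatever is left
    return events

log = [
    "2021-08-30T09:45:08,525 ERROR Context initialization failed",
    "\tat org.springframework.beans.factory.support.ConstructorResolver.resolve(...)",
    "\tat org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.create(...)",
    "2021-08-30T09:45:09,104 INFO Next, unrelated log line",
]

# The three stack trace lines become one event; the INFO line stays separate.
for event in concat_multiline(log):
    print(repr(event))
```

With separator "\n", the stitched event keeps one line per stack frame, which is what makes the single Splunk event readable again.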
Great, your Java stack traces appear as normal stack traces in Splunk, so you’re done, right?
Unfortunately, Helm doesn’t know that you changed the configmap inside OpenShift, so whenever you install a newer version of the Helm chart, all your progress is overwritten.
To fix this, simply head over to your Helm chart values and replace
customFilters: {}
with:
customFilters:
  My-first-StacktraceFilter:
    tag: tail.containers.var.log.containers.my-application-name-*.log
    type: concat
    body: |-
      key log
      timeout_label @SPLUNK
      stream_identity_key stream
      multiline_start_regexp /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/
      flush_interval 5
      separator "\n"
      use_first_timestamp true
And that’s it! Whenever you install your Helm chart, the custom filter is automatically added to the configmap.