Splunk Connect for Kubernetes: ‘reading’ stacktraces in your pod.

Within our Openshift cluster, we use one main application based on Java. The Splunk Connect for Kubernetes integration worked pretty well out of the box, as all our pods are now logging to Splunk. There is just one problem: the Fluentd log forwarder simply reads your logs and forwards them to Splunk without interpretation. In this blog post I’ll explain how we tuned the Fluentd configuration to handle stacktraces, and how to make that change persistent in the Helm chart you use to install Splunk Connect in the first place.

By default, the Splunk Connect for Kubernetes integration grabs every pod log and forwards it to Splunk. Splunk then interprets these messages and generates ‘events’ for each line. This is fine when every line in your log has its own meaning. For example, look at the webserver access.log below, which states on each line which webpage was requested, how it was handled, and so on:

192.168.2.20 - - [28/Jul/2006:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395
127.0.0.1 - - [28/Jul/2006:10:22:04 -0300] "GET / HTTP/1.0" 200 2216
127.0.0.1 - - [28/Jul/2006:10:27:32 -0300] "GET /hidden/ HTTP/1.0" 404 7218

In this example each line is an event on its own, as they represent different requests to different web pages. These events show up in Splunk as separate ‘events’ as well.

multiple Splunk events, one for each access.log line

However, when you have a Java stacktrace, the message is spread over multiple lines:

2021-08-30T09:45:08,525 ERROR [org.springframework.web.context.ContextLoader] (ServerService Thread Pool -- 74) Context initialization failed: [...]
	at org.springframework.beans.factory.support....
	at org.springframework.beans.factory.support.ConstructorResolver...
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory...
	at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory...

And this results in each line becoming an individual ‘event’ in Splunk:

As you can imagine, this makes it very difficult to read the logs and to search for any errors. To solve this, you have to use a feature called ‘multiline’. This feature recognises the start of an event and stitches all lines together (using concat) until the beginning of a new event is found. To do this, Fluentd needs to recognise the start of an event. In the example above, we know that the Java (EAP7) log lines start with a timestamp like this:

2021-08-30T09:45:08,525

To figure out the proper filter (which can be a trial-and-error process) you can add the following filter to the configmap splunk-kubernetes-logging in your splunk-logging namespace:

<filter tail.containers.var.log.containers.my-application-name-*.log>
  @type concat
  key log
  timeout_label @SPLUNK
  stream_identity_key stream
  multiline_start_regexp /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/
  flush_interval 5
  separator "\n"
  use_first_timestamp true
</filter>

The filter tag starts with tail.containers.var.log.containers., followed by the name of the log file generated by your pod. This way the multiline filter will only work on the output of your pod.
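If you are not sure what the log file for your pod is actually called, you can list the CRI-O log files on a worker node; for example (the node name below is a placeholder for one of your own workers):

oc debug node/<worker-node> -- chroot /host ls /var/log/containers/ | grep my-application-name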

Inside the filter you can see @type concat, which means that Fluentd will stitch lines together. The start of a multiline event is recognised by a regular expression, which matches the timestamp at the beginning of the stacktrace; in the example above this is the format 2021-08-30T09:45:08,525.
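Fluentd regular expressions are Ruby regular expressions, so a quick way to sanity-check the pattern before touching the configmap is a one-liner per log line (assuming ruby is available on your workstation). The first command should report the start of an event, the second a continuation line:

ruby -e 'puts(("2021-08-30T09:45:08,525 ERROR Context initialization failed" =~ /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/) ? "start of event" : "continuation")'
ruby -e 'puts(("\tat org.springframework.beans.factory.support.ConstructorResolver" =~ /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/) ? "start of event" : "continuation")'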

In our case we needed the timeout_label @SPLUNK to prevent Fluentd from hitting an error:

#0 dump an error event: error_class=ThreadError error="deadlock; recursive locking"

This error prevented Fluentd from uploading the stacktrace entry to Splunk. When the multiline filter works, all lines of the stacktrace are stitched together and they appear in Splunk as a single event:

stacktrace on a single line, some sensitive log entries are removed

Even though it’s much easier to search for errors this way, the log entry is still hard to read. To fix this, use the separator “\n” to tell Splunk to reconstruct the line breaks. The result looks like this:

properly formatted stacktrace, some sensitive log entries are removed

Great, your Java stacktraces appear as normal stacktraces in Splunk, so you’re done, right?
Unfortunately, Helm doesn’t know that you changed the configmap inside Openshift, so whenever you install a newer version of the Helm chart, all your progress is overwritten.


To fix this, simply head over to your Helm values file (values.yaml) and replace

customFilters: {}

with:

customFilters:
  My-first-StacktraceFilter:
    tag: tail.containers.var.log.containers.my-application-name-*.log
    type: concat
    body: |-
      key log
      timeout_label @SPLUNK
      stream_identity_key stream
      multiline_start_regexp /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}\s/
      flush_interval 5
      separator "\n"
      use_first_timestamp true

And that’s it! Whenever you install or upgrade your Helm chart, the custom filter is automatically added to the configmap.
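If you want to double-check that the filter really ended up in the generated Fluentd configuration, you can inspect the configmap (the one mentioned earlier in this post) after installing or upgrading the chart, for example:

oc get configmap splunk-kubernetes-logging -n splunk-logging -o yaml | grep -A 8 "@type concat"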

Setting up Splunk Connect for Kubernetes on Openshift 4.x with Helm

As part of our daily operations, our team and our customers use the company-wide Splunk application. Splunk is used to search through application logs and check the status of file transfers. So naturally, when we were moving our application to Openshift, one of the main prerequisites was to forward all container logs to Splunk. This required a bit of digging, as the Splunk team usually installed an agent on our Linux hosts and configured the agents to pick up the logs we wanted to add. In the world of pods, where containers can live very briefly, installing such agents would never work. Luckily for us we could rely on Splunk Connect for Kubernetes.

The concept of this project is rather simple: a Helm chart is provided in which you can tweak any Splunk-specific forwarders you wish. Out of the box there are 3 components available:

  • splunk-kubernetes-logging – this chart configures forwarding of all container (stdout) logs; usually this is the only one you’ll really need. Installing this chart will result in a daemonset of forwarders (each node gets one) and all CRI-O logs from that node are forwarded to Splunk.
  • splunk-kubernetes-objects – this chart will upload all Kubernetes objects, such as the creation of projects, deployments, etc. Installing this chart will result in a single objects pod which talks to the API.
  • splunk-kubernetes-metrics – a dedicated metrics chart, just in case you’d rather use Splunk metrics instead of the built-in Grafana. Installing this chart will also create a daemonset.

For each of these charts you can set the Splunk host, port, certificate, HEC token and index. This means you can use different indexes for each component (e.g. a logging index for users and an objects index for the OPS team). This blog assumes that your Splunk team has created the required HEC tokens and indexes, which will be used in the Helm chart.

To start, create a new project where all Splunk forwarders live. This is quite simple:

oc new-project splunk-logging --description="log forwarder to Company-wide Splunk"

If you work with infranodes on your cluster and adjusted your default cluster scheduler to ignore infranodes, by default no Splunk forwarders will be installed there. This might be exactly what you’d want, but if you also want Splunk forwarders on these nodes, type:

oc edit project splunk-logging 

and add

openshift.io/node-selector: ""
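For reference, the annotation goes under metadata.annotations of the project object, so after saving the edit it should look roughly like this:

apiVersion: project.openshift.io/v1
kind: Project
metadata:
  name: splunk-logging
  annotations:
    openshift.io/node-selector: ""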

The next part is a bit scary: Splunk logging forwarders simply look at the filesystem of the Openshift worker (specifically at /var/log/containers/), as this is the default location where CRI-O logs are stored in Openshift. There is no ‘sidecar’ approach here (an additional container on each of your pods) to push logs to Splunk.

It is a straightforward approach, but of course pods are not allowed to access the worker filesystem out of the box. We’ll need to create a security context constraint (SCC) to allow this inside the splunk-logging namespace.

--- contents of scc.yaml ---
kind: SecurityContextConstraints
apiVersion: security.openshift.io/v1
metadata:
  name: scc-splunk-logging
allowPrivilegedContainer: true
allowHostDirVolumePlugin: true
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
volumes:
- "*"

Note the ‘allowPrivilegedContainer: true’ and ‘allowHostDirVolumePlugin: true’, which allow the (privileged) Splunk pods to look at the worker filesystem. Setting up the SCC is only half the puzzle though: you’ll need to create a service account and map it to the SCC.

oc apply -f ./scc.yaml
oc create sa splunk-logging
oc adm policy add-scc-to-user scc-splunk-logging -z splunk-logging

Next, get the helm binary and run

helm repo add splunk https://splunk.github.io/splunk-connect-for-kubernetes

If your bastion host cannot reach splunk.github.io, for example due to a firewall policy, you can download the splunk-connect-for-kubernetes repository here in .tar.gz format and use:

helm install my-first-splunk-repo splunk-connect-for-kubernetes-1.4.9.tgz

Great, now you’ll need a values file to tweak. To generate it, type:

helm show values splunk/splunk-connect-for-kubernetes > values.yaml

Note: the values.yaml is generated from the repository. You can tweak this file as much as you want, but please know that the values are based on the repository version you are currently using. This means that the accepted values in the Helm chart might change over time. Always generate a vanilla values.yaml after upgrading the splunk-connect-for-kubernetes repository and compare your own values file to the new template.
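A quick way to spot such changes is to diff a freshly generated values file against the one you are actually using; for example (the file names are just examples):

helm repo update
helm show values splunk/splunk-connect-for-kubernetes > values-vanilla.yaml
diff -u values-vanilla.yaml your-values-file.yaml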

Your values.yaml will have 4 sections: a ‘global’ section and the 3 component sections listed above. In the global section you can set generic values such as the Splunk host, Splunk port, caFile and Openshift cluster name. In each specific section you can set the appropriate HEC token and Splunk index. The Helm chart is too large to discuss here (a minimal sketch of the global section follows after the list below), but some words of advice:

  • To disable a section, simply set enabled: false, e.g.
splunk-kubernetes-objects:
  enabled: false
  • Your pods will need to run privileged. Set this in each of the components; it doesn’t work in the ‘global’ section of the Helm chart:
# this used to be: openshift: true 
securityContext: true 
  • You’ve already created the serviceaccount with the mapped SCC, so make sure Helm uses it:
serviceAccount:
  create: false
  name: splunk-logging
  • The default log location of Openshift is /var/log/containers, so you’ll need to set this in each section:
fluentd:
  path: /var/log/containers/*.log
  containers:
    path: /var/log
    pathDest: /var/log/containers
    logFormatType: cri
    logFormat: "%Y-%m-%dT%H:%M:%S.%N%:z"
  • If you don’t want all the openshift-pod stdout logs (which can be a HUGE amount of logs), exclude them like this:
exclude_path:
  - /var/log/containers/*-splunk-kubernetes-logging*
  - /var/log/containers/downloads-*openshift-console*
  - /var/log/containers/tekton-pipelines-webhook*
  - /var/log/containers/node-ca-*openshift-image-registry*
  - /var/log/containers/ovs-*openshift-sdn*
  - /var/log/containers/network-metrics-daemon-*
  - /var/log/containers/sdn*openshift-sdn*
  • If you don’t want all the etcd and apiserver logs, remove these lines so no forwarder pods are installed on the master nodes:
  tolerations:
#    - key: node-role.kubernetes.io/master
#      effect: NoSchedule
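Coming back to the global section mentioned above, a minimal sketch could look like the snippet below. Treat the host, token, index and file names as placeholders, and note that the exact key names can differ between chart versions, so always cross-check against your generated values.yaml:

global:
  splunk:
    hec:
      host: splunk-hec.example.com
      port: 8088
      protocol: https
      caFile: /path/to/splunk-ca.pem
  kubernetes:
    clusterName: my-openshift-cluster

splunk-kubernetes-logging:
  enabled: true
  splunk:
    hec:
      token: <logging-HEC-token>
      indexName: <logging-index>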

If you’re happy with your values file (getting it right can be very trial-and-error), simply type:

helm install my-first-splunk-repo -f your-values-file.yaml splunk/splunk-connect-for-kubernetes

or, in an offline environment:

helm install my-first-splunk-repo -f your-values-file.yaml splunk-connect-for-kubernetes-1.4.9.tgz

If your chart is valid, you’ll see something like:

Splunk Connect for Kubernetes is spinning up in your cluster.
After a few minutes, you should see data being indexed in your Splunk

To see the daemonset pods spinning up, simply type:

watch oc get pods -n splunk-logging

In case you want to make more changes to the helm chart (for example to add more filters), you can always modify your values.yaml and then hit:

helm upgrade my-first-splunk-repo -f your-values-file.yaml splunk/splunk-connect-for-kubernetes

Helm will detect the change and only modify the affected parts. For example, if you’ve added more logs to the exclude_path, Helm will update the configmap containing the Fluentd config and then replace the daemonset pods one by one.
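If you want to keep track of which revision of the release is running, or follow the rollout while the pods are replaced, the usual commands work fine:

helm history my-first-splunk-repo
watch oc get pods -n splunk-logging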

That’s it for now. In the next blog I’ll show you how to add a filter that prevents Java stacktraces from becoming multiple Splunk events!

Adding USB devices to your containers

While this seems an uncommon scenario, it’s very useful to be able to add USB devices to your containers. Imagine running Home Assistant in a container and having a Z-Wave or Zigbee USB stick in the container host. There are multiple ways to add this, and it can be a bit confusing if you don’t know the difference between the various methods. In this blog I’ll describe two ways of adding USB devices to your container.

The first method is by far the easiest: simply add ‘privileged: true’ to your docker-compose file or add --privileged as a flag to the docker run command. This allows the container to access all components of the host which runs your Docker runtime engine. Obviously this is not preferred, as your container can reach every block device on your system and could even reach passwords kept in memory. If you don’t care about this level of security, this method is by far the easiest solution.
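As a minimal docker-compose sketch of this first method (the image and service name are just examples), it boils down to:

services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    privileged: true          # grants access to all host devices, including USB sticks
    restart: unless-stopped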

The second method is much more secure, but it can be a bit confusing. Simply add the device you need (e.g. /dev/ttyUSB1) as a virtual device inside your container. In docker-compose this looks like this:

devices:
  - /dev/ttyUSB1:/dev/ttyUSB1
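For reference, the equivalent with plain docker run (the image name is a placeholder) would be something like:

docker run --device /dev/ttyUSB1:/dev/ttyUSB1 my-container-image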

Note that this is a simple mapping, using <actual_path_on_host>:<virtual_path_in_container>. To keep things simple here, we used the same virtual path as the actual path. Easy, right? There is a drawback here: Linux may assign a different device name to your USB stick after a reboot or after adding another USB stick. This is why you should never map a /dev/<device> directly, but rather use the symlinks that contain a unique identifier for your USB device.

Let’s take a look at my ZZH! stick on the docker host. Using ‘dmesg’ you can see it’s mapped to ‘ttyUSB1’

[1199309.256461] usb 2-1.2: New USB device strings: Mfr=0, Product=2, SerialNumber=0
[1199309.256463] usb 2-1.2: Product: USB Serial
[1199309.256893] ch341 2-1.2:1.0: ch341-uart converter detected
[1199309.257777] usb 2-1.2: ch341-uart converter now attached to ttyUSB1

So you can find it at /dev/ttyUSB1. However, as stated above, this can change, so let’s look at /dev/serial/by-id instead:

ls -lthra /dev/serial/by-id
total 0
drwxr-xr-x 4 root root 80 Dec 20 16:46 ..
lrwxrwxrwx 1 root root 13 Dec 20 16:46 usb-0658_0200-if00 -> ../../ttyACM1
lrwxrwxrwx 1 root root 13 Dec 20 16:46 usb-RFXCOM_RFXtrx433_A1YUV98W-if00-port0 -> ../../ttyUSB0
lrwxrwxrwx 1 root root 13 Jan 3 13:54 usb-1a86_USB_Serial-if00-port0 -> ../../ttyUSB1
drwxr-xr-x 2 root root 100 Jan 3 13:54 .

Here you can see I have multiple USB adapters connected, and

/dev/serial/by-id/usb-1a86_USB_Serial-if00-port0

is the device which always points to my ZZH! stick.
So let’s add this symlink to the docker-compose file and, to prevent confusion, rename the target virtual mapping inside the container:

devices:
  - /dev/serial/by-id/usb-1a86_USB_Serial-if00-port0:/dev/Virtual_ZZH_stick

Now the container will have a “/dev/Virtual_ZZH_stick” device which automatically maps to the actual USB stick, even after reboot or swapping USB ports. Note that any configuration you might have in the container, must now point to /dev/Virtual_ZZH_stick. For example, my zigbee2mqtt container will now have:

serial:
  port: /dev/Virtual_ZZH_stick

Note that you could combine both methods 1 and 2 above, but it would make little sense: with privileged: true you don’t need any device mappings, and in my many debugging hours the mapping appeared to be ignored anyway. As such, I’d recommend the following:

  1. After plugging in your USB stick, run ‘dmesg’ to see which device name it was given.
  2. Add privileged: true to see if the container can reach the device. If this is not the case, check whether the permissions are set properly on the /dev/<mapping> location. You could temporarily run ‘chmod 777 /dev/<mapping>’ to see if this fixes your issue.
  3. If the container can reach the device, remove privileged: true from your container configuration and use the device mapping described in method 2 (see the combined example below).
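Putting it all together, a compose file for a zigbee2mqtt container using the stable symlink could look like this. The image tag and data volume path are assumptions; the device path is the one from this post:

services:
  zigbee2mqtt:
    image: koenkk/zigbee2mqtt:latest
    restart: unless-stopped
    devices:
      # stable by-id symlink on the host, fixed virtual name inside the container
      - /dev/serial/by-id/usb-1a86_USB_Serial-if00-port0:/dev/Virtual_ZZH_stick
    volumes:
      - ./zigbee2mqtt-data:/app/data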