Karan Sharma

Analyzing CoreDNS logs with Clickhouse and Vector

9 minutes (2331 words)

I’ve been toying around Clickhouse and Vector at my day job and find both of these tools pretty interesting at what they do. A short summary for those unaware of these tools:

Now, for the context, I use coredns on my personal dev machine as it supports Split DNS (routing certain zones to a particular resolver) which I need for accessing internal domains at work. Yep, systemd-resolved can also do this, but I find coredns easier to configure and manage with OpenVPN as well.

Anyway, so one random evening, I got the idea of dumping CoreDNS Logs to Clickhouse. Maaaybe I was still hungover from the Vector/Clickhouse work I was doing at work but nevertheless I was interested in it.

πŸ”—Overview

Here’s how the pipeline looks:

image

πŸ”—Collecting Logs

First, let’s look at how the raw logs are structured by CoreDNS:

[INFO] 127.0.0.1:55678 - 21963 "A IN xpui.app.spotify.com. udp 38 false 512" NXDOMAIN qr,rd,ra 121 0.061416978s
[INFO] 127.0.0.1:59333 - 22742 "AAAA IN incoming.telemetry.mozilla.org. udp 48 false 512" NOERROR qr,aa,rd,ra 106 0.049235139s
[INFO] 127.0.0.1:39609 - 47247 "AAAA IN ping.archlinux.org. udp 36 false 512" NOERROR qr,rd,ra 140 0.056721154s

Vector provides a variety of sources to collect these logs. Since I am running coredns as a Docker container, the following config shows how to collect logs from a particular container:

# Filter coredns logs from Docker logs
[sources.coredns_logs]
  type = "docker_logs" # required
  docker_host = "unix:///var/run/docker.sock"
  include_images = ["coredns/coredns"] # optional, no default

The above config basically specifies a Docker Host variable and an image name filter. Vector talks to the Docker API over a unix socket and gathers metadata about the container (like container_created_at, container_name, label etc).

After collecting logs from Vector, it enriches with the following metadata.

{
    "container_created_at": "2021-06-04T14:18:03.967143133Z",
    "container_id": "00c5c4d36ea5b4772b517d3cca7d397c92f72be2a2bf45bb8c430f717fbd331e",
    "container_name": "coredns_coredns_1",
    "host": "iris",
    "image": "coredns/coredns",
    "label": {
        "com.docker.compose.config-hash": "928d71143c2af6553d551dbbf14140304d53f92378746454fbfeb0382a896d5b",
        "com.docker.compose.container-number": "1",
        "com.docker.compose.oneoff": "False",
        "com.docker.compose.project": "coredns",
        "com.docker.compose.project.config_files": "/home/karan/Code/Infra/coredns/hydra-vpn-compose.yml",
        "com.docker.compose.project.working_dir": "/home/karan/Code/Infra/coredns",
        "com.docker.compose.service": "coredns",
        "com.docker.compose.version": "1.29.2"
    },
    "message": "[INFO] 127.0.0.1:38266 - 20341 \"A IN open.spotify.com. udp 34 false 512\" NOERROR qr,rd,ra 160 0.300268123s",
    "source_type": "docker",
    "stream": "stdout",
    "timestamp": "2021-06-04T16:13:07.454601872Z"
}

(NOTE: I am using console sink to dump these logs to STDOUT. It’s pretty handy for inspecting logs).

# Print parsed logs to stdout
[sinks.print]
type = "console"
inputs = ["coredns_logs"]
encoding.codec = "json"

As you can see from the above JSON object, Vector has transformed the log with its own Data Model. The log that we care about now is inside .message key. It’s nice to have other metadata as well.

πŸ”—Transforming the logs

Our objectives at this step:

Now, CoreDNS can emit two kinds of logs (INFO and ERROR). The error usually happens when the upstream resolver is unreachable or there’s an issue with any of the CoreDNS plugins.

We need to write regex for both cases:

  1. INFO logs:
\[(?P<level>[^]]+)]\s(?P<server_addr>[^:]+):(?P<server_port>\S+)\s+-\s+(?P<id>\S+)\s+(?P<type>\S+)\s+(?P<class>\S+)\s+(?P<name>\S+)\s+(?P<proto>\S+)\s+(?P<size>\S+)\s+(?P<do>\S+)\s+(?P<bufsize>[^"]+)"\s+(?P<rcode>\S+)\s+(?P<rflags>\S+)\s+(?P<rsize>\S+)\s+(?P<duration>[\d\.]+).*

image

  1. ERROR logs:
\[(?P<level>ERROR)]\s+(?P<component>plugin\/errors):\s+(?P<code>\S)+\s+(?P<name>\S+)\s+(?P<type>[^:]*):\s+(?P<error_msg>.*)

image

Combining a bunch of other things to remove some fields and constructing the final payload, the config looks like this:


# Parse coredns logs
[transforms.parse_logs]
type = "remap"
inputs = ["coredns_logs"]
source = '''
# parse the log event.
ts = .timestamp
log,err = parse_regex(.message,r'\[(?P<level>[^]]+)]\s(?P<server_addr>[^:]+):(?P<server_port>\S+)\s+-\s+(?P<id>\S+)\s+"(?P<type>\S+)\s+(?P<class>\S+)\s+(?P<name>\S+)\s+(?P<proto>\S+)\s+(?P<size>\S+)\s+(?P<do>\S+)\s+(?P<bufsize>[^"]+)"\s+(?P<rcode>\S+)\s+(?P<rflags>\S+)\s+(?P<rsize>\S+)\s+(?P<duration>[\d\.]+).*')
if err !=null {
    # capture the error log. If the error log also fails to get parsed, the log event is dropped.
  log = parse_regex!(.message,r'\[(?P<level>ERROR)]\s+(?P<component>plugin/errors):\s+(?P<code>\S)+\s+(?P<name>\S+)\s+(?P<type>[^:]*):\s+(?P<error_msg>.*)')
}
. = log
# add timestamp
.timestamp = ts
# remove fields we dont care about
del(.do)
'''
drop_on_error = true

Apart from regex matching, we store the timestamp as received from Vector (since CoreDNS logs don’t contain any timestamp information). We delete some fields that we don’t care about.

Vector uses a powerful DSL (called VRL) to do such kind of transformations on the fly. It has a lot of functions to do almost any kind of transformation to your original event payload. You can invoke vector vrl from the terminal and get a shell to write the above transformations and debug quickly. It proved to be really useful when dealing with such a long regex pattern.

πŸ”—Storing in Clickhouse

Finally we get to the part where we need to dump these logs to our Clickhouse DB. Here’s the schema for the table where we will be storing these records:

CREATE DATABASE IF NOT EXISTS `coredns`;

CREATE TABLE IF NOT EXISTS `coredns`.`logs` (
    `timestamp` DateTime('Asia/Kolkata'),
    `bufsize` Int32,
    `class` LowCardinality(String),
    `duration` Float64,
    `id` Int32,
    `level` LowCardinality(String),
    `name` String,
    `proto` LowCardinality(String),
    `rcode` LowCardinality(String),
    `rflags` String,
    `server_addr` String,
    `server_port` Int32,
    `rsize` Int32,
    `size`  Int32,
    `type` LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY toYYYYMMDD(timestamp)
TTL timestamp + INTERVAL 1 WEEK;

Key things to note:

UPDATE:

Clickhouse on Twitter clarified that ORDER BY timestamp will have better performance in this context. Usually if your queries are β€œlast 1h”, β€œlast 5m” based, it is better to not store the the sort key as YYYYMMDD format.

image

Now, we need to instruct Vector to send these logs to Clickhouse:


[sinks.coredns_logs_clickhouse_output]
  type = "clickhouse"
  inputs = ["parse_logs"]
  compression = "gzip"
  database = "coredns"
  endpoint = "http://localhost:8123"
  table = "logs"
  encoding.timestamp_format = "unix"
  batch.timeout_secs = 10

Clickhouse offers an HTTP API (which is running on port 8123 by default). Vector takes the input from the previous step (parse_logs transformation) and sends it to Clickhouse over the HTTP interface. Clickhouse stores datetimes in UNIX, so before sending the data, Vector can encode certain fields in the payload to a different data type as well (isn’t that cool 😎)

πŸ”—Query Examples

I’ve been running this pipeline for 3-4 days, so I have a decent amount of data collected to show for the blog post.

Well, that’s all I could think of really. If you’ve some more interesting analysis to get from this data, let me know!

πŸ”—Summary

The intention behind writing this post is to give an overview of how the entire log collection and processing pipeline works. Using Vector has been an amazing experience however the sad bit is that I don’t know Rust and I cannot contribute to some of the issues I’ve opened (even though they are presumably trivial). Maybe I should pick up Rust, finally? 🀭

Thanks for reading!

Tags: #Devops #Vector #Clickhouse #coredns