Karan Sharma

Bridge Networking in Nomad


To set the stage, it’s crucial to understand what we mean by “bridge networking”. In a nutshell, it is a type of network connection in Linux that allows virtual interfaces, like the ones used by virtual machines and containers, to share a physical network interface.

When Nomad places an allocation with bridge networking, it creates a network namespace with its own network stack for that allocation. Within this, a virtual ethernet (veth) pair is established: one end is moved into the allocation’s network namespace, and the other remains in the host namespace.
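Nomad wires this up through CNI plugins, but the building blocks are plain Linux networking. As a rough sketch of the same idea using iproute2 (simplified, and not literally what Nomad runs; every name and address below is made up for illustration):

# A network namespace standing in for an allocation's netns.
sudo ip netns add demo-alloc

# A veth pair: one end stays on the host, the other moves into the namespace.
sudo ip link add veth-host type veth peer name veth-alloc
sudo ip link set veth-alloc netns demo-alloc

# A bridge on the host (Nomad's real bridge is simply called "nomad").
sudo ip link add demo-bridge type bridge
sudo ip link set veth-host master demo-bridge
sudo ip link set demo-bridge up
sudo ip link set veth-host up

# Address the namespaced end and bring it up.
sudo ip netns exec demo-alloc ip addr add 172.26.64.50/20 dev veth-alloc
sudo ip netns exec demo-alloc ip link set veth-alloc up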

The Network Journey

To illustrate this practically, let’s assume a packet is sent from a task within an allocation. The packet is first received by the local end of the veth pair and then traverses to the other end residing in the host’s namespace. From there, it is handed to the bridge on the host (in this case, the “nomad” bridge), which finally sends the packet out to the world via the host’s physical network interface (typically “eth0” or equivalent on your machine).

The journey of a packet from the outside world to a task inside an allocation is the exact mirror image. The packet reaches “eth0” first, then the nomad bridge, which forwards it to the appropriate veth interface in the host’s namespace. From there, it crosses over to the other end of the veth pair in the allocation’s network namespace and finally gets routed to the destination task.
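Both of those middle hops are visible from the host with iproute2; the exact interface names will differ on your machine:

$ ip link show type bridge      # the "nomad" bridge (alongside docker0, if present)
$ ip link show master nomad     # every host-side veth currently attached to the nomad bridge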

To bridge or not to

Let’s take a look at the following jobspec, which deploys my tiny side project, Cloak, on Nomad:

job "cloak" {
  datacenters = ["dc1"]
  type        = "service"

  group "redis" {
    network {
      mode = "host"
      port "redis" {
        to = 6379
      }
    }

    service {
      name     = "cloak-redis"
      port     = "redis"
      provider = "nomad"
    }


    task "redis" {
      driver = "docker"


      config {
        image                  = "redis:7"
        advertise_ipv6_address = false

        ports = [
          "redis",
        ]

        volumes = [
          "/data/cloak/redis:/data",
        ]
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }

  group "cloak" {
    network {
      mode = "host"
      port "cloak" {
        static = 7000
        to     = 7000
      }
    }

    task "cloak" {
      driver = "docker"

      config {
        image   = "ghcr.io/mr-karan/cloak:v0.2.0"
        command = "--config=config.toml"
        ports = [
          "cloak",
        ]
      }

      template {
        data        = <<EOH
# Configuration for 1 redis instance, as assigned via rendezvous hashing.
{{$allocID := env "NOMAD_ALLOC_ID" -}}
{{range nomadService 1 $allocID "cloak-redis"}}
CLOAK_REDIS__address={{ .Address }}:{{ .Port }}
{{- end}}
EOH
        destination = "secrets/file.env"
        env         = true
      }


      resources {
        cpu    = 500 # MHz
        memory = 700 # MB
      }
    }
  }
}

Our focus should be on the network.mode setting. To illustrate what happens behind the scenes when an alloc runs with network.mode = "host" (host networking), we can run the above job.
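Running it is the usual routine; something along these lines (the alloc ID is a placeholder):

$ nomad job run cloak.nomad
$ nomad job status cloak        # lists the allocations for both groups
$ nomad alloc status <alloc-id> # shows the host ports mapped for that alloc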

On the machine, we can see that port 7000 (static) and port 27042 (dynamic) are allocated on the host network interface (eth0).

We can also see the port and process details using ss:

sudo ss -ltpn 'sport = :7000'
State    Recv-Q   Send-Q   Local Address:Port        Peer Address:Port   Process
LISTEN   0        4096     95.216.165.210:7000       0.0.0.0:*           users:(("docker-proxy",pid=67068,fd=4))

This config is more suitable for specific workloads, like load balancers or similar deployments where you want to expose the workload directly on the host’s network interface. It’s also helpful when applications running outside of Nomad on that host need to connect over the host network.
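Because the listener is bound directly on the host interface, anything that can reach the host IP can talk to it. A quick reachability check against the address from the ss output above:

$ nc -vz 95.216.165.210 7000   # plain TCP connect test against the host's public interface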

However, in a typical job where you want multiple allocs to connect to each other, you’d want to set up a bridge network. This avoids exposing the workload on the host network directly and is the usual setup when you want to put applications behind a reverse proxy (NGINX/Caddy).

Let’s change the job spec to network.mode = "bridge" and see what changes:

$ nomad job plan cloak.nomad

+/- Job: "cloak"
+/- Task Group: "cloak" (1 create/destroy update)
  + Network {
      Hostname: ""
    + MBits:    "0"
    + Mode:     "bridge"
    + Static Port {
      + HostNetwork: "default"
      + Label:       "cloak"
      + To:          "7000"
      + Value:       "7000"
      }
    }
  - Network {
      Hostname: ""
    - MBits:    "0"
    - Mode:     "host"
    - Static Port {
      - HostNetwork: "default"
      - Label:       "cloak"
      - To:          "7000"
      - Value:       "7000"
      }
    }
    Task: "cloak"

+/- Task Group: "redis" (1 create/destroy update)
  + Network {
      Hostname: ""
    + MBits:    "0"
    + Mode:     "bridge"
    + Dynamic Port {
      + HostNetwork: "default"
      + Label:       "redis"
      + To:          "6379"
      }
    }
  - Network {
      Hostname: ""
    - MBits:    "0"
    - Mode:     "host"
    - Dynamic Port {
      - HostNetwork: "default"
      - Label:       "redis"
      - To:          "6379"
      }
    }
    Task: "redis"

Now we don’t see the ports forwarded on the host network interface.

Similarly, ss also shows no process listening on the host network.
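Re-running the earlier check, this is roughly what you’d get now. The static port is still reachable from outside, but the mapping is handled by iptables rules rather than a listening process, which is exactly what the next section digs into:

$ sudo ss -ltpn 'sport = :7000'   # prints only the header now; no socket is listening on :7000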

IPTables and Routing

To understand what happened when we switched the networking mode to bridge, we need to take a look at the iptables magic Nomad sets up when using bridge networking.

I pulled up the iptables rules and found specific chains named CNI-FORWARD and NOMAD-ADMIN. These rules, in essence, allow all traffic to and from the allocation’s network namespace.

$ sudo iptables -L CNI-FORWARD
Chain CNI-FORWARD (1 references)
target     prot opt source               destination         
NOMAD-ADMIN  all  --  anywhere             anywhere             /* CNI firewall plugin admin overrides */
ACCEPT     all  --  anywhere             172.26.64.5          ctstate RELATED,ESTABLISHED
ACCEPT     all  --  172.26.64.5          anywhere            
ACCEPT     all  --  anywhere             172.26.64.6          ctstate RELATED,ESTABLISHED
ACCEPT     all  --  172.26.64.6          anywhere

sudo iptables -L NOMAD-ADMIN
Chain NOMAD-ADMIN (1 references)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             172.26.64.0/20

Nomad uses 172.26.64.0/20 as the default subnet for the bridge network. The IPs 172.26.64.5 and 172.26.64.6 are assigned to two different allocs in this CIDR. The iptables rules allow all traffic to flow within this subnet.
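There’s also NAT in play so that allocations can reach the outside world and so that published ports land inside the right namespace. The exact chain names are generated by the CNI plugins and vary between setups, but grepping the NAT table for the subnet is a quick way to find them:

$ sudo iptables -t nat -S | grep 172.26.64
# expect MASQUERADE rules for traffic leaving the allocation IPs,
# and DNAT rules forwarding published host ports (like :7000) to 172.26.64.x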

To check the routing, the ip route command can be used:

$ ip route show 172.26.64.0/20
172.26.64.0/20 dev nomad proto kernel scope link src 172.26.64.1

The host routes packets for the default bridge subnet via the nomad network interface.
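The nomad bridge device itself carries the gateway address for this subnet, which is also why 172.26.64.1 shows up as a peer in the packet capture further down:

$ ip addr show nomad
# the bridge holds 172.26.64.1/20, the default gateway for every alloc in the subnet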

Using nsenter, we can find more details about the network namespace created for an alloc. Let’s look at the redis alloc:

sudo nsenter -t $(pgrep redis) --net ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if113: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 76:47:6d:49:00:c0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.26.64.5/20 brd 172.26.79.255 scope global eth0
       valid_lft forever preferred_lft forever

We can see that one end of the pair is the container’s eth0, and its @if113 suffix tells us that its peer is the interface with index 113. For this to work, the other end of the veth pair must exist on the host:

$ ip a
113: veth3402deda@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master nomad state UP group default 
    link/ether 3a:85:1b:37:75:17 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::3885:1bff:fe37:7517/64 scope link 
       valid_lft forever preferred_lft forever

So, when we see veth3402deda@if2 in the host’s network namespace (with index 113), and eth0@if113 inside the Redis container, we can infer that these two interfaces form a veth pair: veth3402deda on the host side and eth0 inside the container. This link is what lets the container communicate with the external network through the host’s network stack.
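If you’d rather not eyeball the @ifNN suffixes, the pairing can be resolved directly: a veth interface exposes its peer’s index via sysfs. Using the same redis alloc as above (the index 113 is from this example and will differ for you):

sudo nsenter -t $(pgrep redis) --net cat /sys/class/net/eth0/iflink
113

$ ip link | grep '^113:'
113: veth3402deda@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master nomad state UP ...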

Capturing packets

We can capture TCP packets on a veth interface to see this routing in action. The interface here belongs to the cloak alloc, whose IP is 172.26.64.6:

sudo tcpdump -i veth971858d5 -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on veth971858d5, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:51:27.801319 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [S], seq 1331933249, win 65495, options [mss 65495,sackOK,TS val 248300785 ecr 0,nop,wscale 7], length 0
10:51:27.801549 IP 172.26.64.6.7000 > 172.26.64.1.35826: Flags [S.], seq 107697826, ack 1331933250, win 65160, options [mss 1460,sackOK,TS val 3965422857 ecr 248300785,nop,wscale 7], length 0
10:51:27.801616 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [.], ack 1, win 512, options [nop,nop,TS val 248300785 ecr 3965422857], length 0
10:51:27.801737 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [P.], seq 1:79, ack 1, win 512, options [nop,nop,TS val 248300786 ecr 3965422857], length 78
10:51:27.801751 IP 172.26.64.6.7000 > 172.26.64.1.35826: Flags [.], ack 79, win 509, options [nop,nop,TS val 3965422858 ecr 248300786], length 0
10:51:27.802022 IP 172.26.64.6.7000 > 172.26.64.1.35826: Flags [P.], seq 1:4097, ack 79, win 509, options [nop,nop,TS val 3965422858 ecr 248300786], length 4096
10:51:27.802059 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [.], ack 4097, win 491, options [nop,nop,TS val 248300786 ecr 3965422858], length 0
10:51:27.802120 IP 172.26.64.6.7000 > 172.26.64.1.35826: Flags [P.], seq 4097:5396, ack 79, win 509, options [nop,nop,TS val 3965422858 ecr 248300786], length 1299
10:51:27.802135 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [.], ack 5396, win 502, options [nop,nop,TS val 248300786 ecr 3965422858], length 0
10:51:27.803484 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [F.], seq 79, ack 5396, win 512, options [nop,nop,TS val 248300787 ecr 3965422858], length 0
10:51:27.803567 IP 172.26.64.6.7000 > 172.26.64.1.35826: Flags [F.], seq 5396, ack 80, win 509, options [nop,nop,TS val 3965422859 ecr 248300787], length 0
10:51:27.803597 IP 172.26.64.1.35826 > 172.26.64.6.7000: Flags [.], ack 5397, win 512, options [nop,nop,TS val 248300787 ecr 3965422859], length 0
10:53:08.523431 IP 172.26.64.6.53042 > 95.216.165.210.27372: Flags [.], ack 2169295212, win 501, options [nop,nop,TS val 735542538 ecr 4133067854], length 0
10:53:08.523551 IP 95.216.165.210.27372 > 172.26.64.6.53042: Flags [.], ack 1, win 509, options [nop,nop,TS val 4133379150 ecr 735231242], length 0
10:53:08.523554 IP 95.216.165.210.27372 > 172.26.64.6.53042: Flags [.], ack 1, win 509, options [nop,nop,TS val 4133379150 ecr 735231242], length 0
10:53:08.523562 IP 172.26.64.6.53042 > 95.216.165.210.27372: Flags [.], ack 1, win 501, options [nop,nop,TS val 735542538 ecr 4133379150], length 0

To summarize the output: we can see a TCP connection between 172.26.64.1 (source) and 172.26.64.6 (destination) on port 7000. 172.26.64.1 happens to be the gateway address of the nomad bridge subnet.
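The same exchange can also be captured one hop further along, on the bridge itself, which is handy when you want to watch traffic for all allocs at once instead of a single veth:

sudo tcpdump -i nomad -n 'port 7000'
# shows the same SYN / SYN-ACK / data segments, this time as seen by the nomad bridge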

Summary

Hope this post clarified some of the networking internals and behind-the-scenes magic of Nomad bridge networking. Refer to my other post, Nomad networking explained, for a practical breakdown of all the different ways to expose and connect applications in a Nomad cluster.

Fin!

Tags: #nomad #networking