Fabio running using Nomad system scheduler breaks Docker.  #192

@michaelmcguinness

I realise how unlikely the title of this issue seems, but if there is an obvious error in my setup I can't spot it. I want to run Fabio as a Nomad-managed service using the Nomad system scheduler (type = "system"). When I do, any subsequent pull from our private Docker registry fails with the error:
failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory
From that point on I am not able to recover Docker.

Some details about our setup:
Ubuntu 14.04
Kernel = 3.13.0-53-generic
Docker = 1.12.2
Nomad = 0.5.0
Fabio = 1.3.4

I have 3 servers and 2 clients. I am trying to run Fabio using the exec driver and the system scheduler. I am running Nomad as the root user, which I believe is required for the exec driver.

I do not see the issue if I run Fabio using the service scheduler.
I do not see the issue if I run a Docker container using the system scheduler.
I do not see the issue if I run another job (sleep binary) using the system scheduler.
I do not see the issue if I run Fabio using the system scheduler but using the raw_exec driver.

Docker is using the LVM storage option but I see the same issue if I drop back to the devicemapper storage option.
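
To record which storage driver is actually in effect for a given run of the test case, a small helper along these lines may be useful. It is only a sketch: `storage_driver` is a name made up for illustration, and it assumes the `Storage Driver: <name>` line that `docker info` prints.

```shell
#!/bin/sh
# Hypothetical helper: print the storage driver named in `docker info`
# output read from stdin. Assumes a "Storage Driver: <name>" line,
# with or without leading whitespace.
storage_driver() {
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p' | head -n 1
}

# Usage on a client node (not executed here):
#   docker info 2>/dev/null | storage_driver
```

Running it before and after starting the Fabio job would show whether the driver itself changes or only the state of its devices.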

Below is a repeatable test case. After that are copies of the job specs used in the test case.

  1. Switch to the nomad user
    ubuntu@ip-10-75-70-27:~$ sudo su - nomad

  2. Software versions

$ uname -a
Linux ip-10-75-70-27 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 1.12.2, build bb80604
$ nomad version
Nomad v0.5.0
  3. Nomad running as root with no running jobs
$ ps -ef | grep nomad
root     17416     1  0 12:57 ?        00:00:00 /usr/local/bin/nomad agent -config /etc/nomad.d/config.json -rejoin -node=nomad_client_poc1
consul   17540     1  0 12:57 ?        00:00:00 /usr/local/bin/consul agent -config-file /etc/consul.d/config.json -rejoin -node nomad_client_poc1
root     17617  2602  0 12:58 pts/1    00:00:00 sudo su - nomad
  4. Demonstrate Docker pull
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Pull complete
fd2731e4c50c: Pull complete
28a2f68d1120: Pull complete
a3ed95caeb02: Pull complete
87f9029820c8: Pull complete
7582f6d126ab: Pull complete
Digest: sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Status: Downloaded newer image for dockerregistry.adm.myprivatecloud.net/bti1003:latest
  5. Remove pulled image
$ docker rmi dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003@sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Deleted: sha256:3fee2600d434e469b6d4ac0e468bd988ebc105536886d6624dc9566577fcafbe
Deleted: sha256:e5fcd939dd4a2a9b9543dea61ca90d2def7c92cd983108916895a39f239799b8
Deleted: sha256:57bd7c9432ae86d63f2342e442eebd0f4dfc340ca61c6a4c7d702b17a315865f
Deleted: sha256:0aaccda2aadfc70ab2248437568fd17f4e8860cf612cc4b7e154b97222dccf91
Deleted: sha256:9dcfe19e941956c63860afee1bec2e2318f6fbd336bc523094ed609a9c437a01
Deleted: sha256:6ff1ee6fc8a0358aeb92f947fb7125cd9e3d68c05be45f5375cb59b98c850b4d
Deleted: sha256:56abdd66ba312859b30b5629268c30d44a6bbef6e2f0ebe923655092855106e8
  6. Run 'sleep' test job
$ ps -ef | grep nomad.*executor
root     17897 17416  6 13:00 ?        00:00:02 /usr/local/bin/nomad executor /var/nomad/alloc/87a48081-5d8e-b1b1-1538-a1e79a3f4152/sleep-task/sleep-task-executor.out
nomad    18299 17619  0 13:07 pts/1    00:00:00 grep nomad.*executor
  7. Pull Docker image
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Pull complete
fd2731e4c50c: Pull complete
28a2f68d1120: Pull complete
a3ed95caeb02: Pull complete
87f9029820c8: Pull complete
7582f6d126ab: Pull complete
Digest: sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Status: Downloaded newer image for dockerregistry.adm.myprivatecloud.net/bti1003:latest
  8. Remove pulled image
$ docker rmi dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003@sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Deleted: sha256:3fee2600d434e469b6d4ac0e468bd988ebc105536886d6624dc9566577fcafbe
Deleted: sha256:e5fcd939dd4a2a9b9543dea61ca90d2def7c92cd983108916895a39f239799b8
Deleted: sha256:57bd7c9432ae86d63f2342e442eebd0f4dfc340ca61c6a4c7d702b17a315865f
Deleted: sha256:0aaccda2aadfc70ab2248437568fd17f4e8860cf612cc4b7e154b97222dccf91
Deleted: sha256:9dcfe19e941956c63860afee1bec2e2318f6fbd336bc523094ed609a9c437a01
Deleted: sha256:6ff1ee6fc8a0358aeb92f947fb7125cd9e3d68c05be45f5375cb59b98c850b4d
Deleted: sha256:56abdd66ba312859b30b5629268c30d44a6bbef6e2f0ebe923655092855106e8
  9. Stop 'sleep' job
$ ps -ef | grep nomad.*executor
nomad    18239 17619  0 13:06 pts/1    00:00:00 grep nomad.*executor
  10. Start Fabio job
$ ps -ef | grep nomad.*executor
root     18262 17416 33 13:07 ?        00:00:04 /usr/local/bin/nomad executor /var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/fabio-task-executor.out
nomad    18299 17619  0 13:07 pts/1    00:00:00 grep nomad.*executor
  11. Pull Docker image
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Extracting [==================================================>] 65.69 MB/65.69 MB
fd2731e4c50c: Download complete
28a2f68d1120: Download complete
a3ed95caeb02: Download complete
87f9029820c8: Download complete
7582f6d126ab: Download complete
failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory
  12. Fabio job dies (10 minutes later), from syslog
Nov 23 13:17:56 ip-10-75-70-27 nomad[17416]: driver.exec: error destroying executor: 1 error(s) occurred:#012#012* 1 error(s) occurred:#012#012* failed to unmount shared alloc dir "/var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/alloc": invalid argument
Nov 23 13:17:57 ip-10-75-70-27 nomad[17416]: client: failed to destroy context for alloc '5729f45b-185c-fa7b-7b05-866a774b8c73': 2 error(s) occurred:#012#012* 1 error(s) occurred:#012#012* failed to remove the secret dir "/var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/secrets": unmount: invalid argument#012* remove /var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task: directory not empty
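
Since the syslog errors are about unmounting directories under the allocation directory, it may help to list what is still mounted there while the Fabio job is running and again after it dies. The following is a minimal sketch, not part of the original test case: `mounts_under` is a hypothetical helper that filters `/proc/mounts`-style lines (device, mount point, type, ...) read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: print mount points at or below the directory
# given as $1, reading /proc/mounts-style lines from stdin.
# The mount point is the second whitespace-separated field.
mounts_under() {
    awk -v dir="$1" '$2 == dir || index($2, dir "/") == 1 { print $2 }'
}

# Usage on a client node (not executed here):
#   mounts_under /var/nomad/alloc < /proc/mounts
```

The prefix match requires a trailing `/`, so a sibling path such as `/var/nomad/allocx` would not be reported.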

From Docker log

time="2016-11-23T13:07:59.287783575Z" level=error msg="Error trying v2 registry: failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory"
time="2016-11-23T13:07:59.287830271Z" level=error msg="Attempting next endpoint for pull after error: failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory"
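
To check whether the device named in these errors exists at all, the device paths can be pulled out of the log text and tested one by one. This is a sketch under assumptions: `dm_devices_in_log` is a hypothetical helper, it relies on `grep -o` (available in GNU and BSD grep), and the Docker log location on the node is not specified here.

```shell
#!/bin/sh
# Hypothetical helper: print each /dev/mapper/docker-* path mentioned in
# log text on stdin, once each. The character class also matches ":",
# which appears inside the device name, so the colon that follows the
# path in the error message is stripped afterwards.
dm_devices_in_log() {
    grep -o '/dev/mapper/docker-[0-9A-Za-z:_-]*' | sed 's/:$//' | sort -u
}

# Usage on a client node (not executed here; DOCKER_LOG is a placeholder):
#   dm_devices_in_log < "$DOCKER_LOG" | while read -r dev; do
#       [ -e "$dev" ] || echo "missing: $dev"
#   done
```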

Fabio Job Spec

job "fabio-job" {
  region = "eu"
  datacenters = ["vpc-poc"]
  type = "system"
  update {
    stagger = "5s"
    max_parallel = 1
  }

  group "fabio-group" {
    ephemeral_disk {
      size = "500"
    }
    task "fabio-task" {
      driver = "exec"
      config {
        command = "fabio-1.3.4-go1.7.3-linux_amd64"
      }

      artifact {
        source = "https://github.com/eBay/fabio/releases/download/v1.3.4/fabio-1.3.4-go1.7.3-linux_amd64"
      }
      logs {
        max_files = 2
        max_file_size = 5
      }
      resources {
        cpu = 500
        memory = 64
        network {
          mbits = 1

          port "http" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

Sleep Job Spec

job "sleep-job" {
  region = "eu"
  datacenters = ["vpc-poc"]
  type = "system"
  update {
    stagger = "5s"
    max_parallel = 1
  }

  group "sleep-group" {
    ephemeral_disk {
      size = "500"
    }
    task "sleep-task" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args = ["1000"]
      }

      logs {
        max_files = 2
        max_file_size = 5
      }
      resources {
        cpu = 500
        memory = 64
        network {
          mbits = 1

        }
      }
    }
  }
}
