Last Updated: May 25, 2016
· aeas44

How to investigate "Failed Units" in CoreOS

When I created a Kubernetes cluster on CoreOS, one of the nodes reported "Failed Units" in its login banner:

$ ssh -i ~/.ssh/key.pem core@xxx.xxx.xxx.xxx
Last login: Mon May 23 04:43:57 2016 from yyy.yyy.yyy.yyy
CoreOS beta (1010.3.0)
Update Strategy: No Reboots
Failed Units: 5
  var-lib-kubelet-plugins-kubernetes.io-aws\x2debs-mounts-aws-ap\x2dnortheast\x2d1c-vol\x2d8c3b1734.mount
  var-lib-rkt-pods-run-bf6f1c19\x2d7bc0\x2d4931\x2d885a\x2d811cc236973a-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-kubelet-plugins-kubernetes.io-aws\x2debs-mounts-aws-ap\x2dnortheast\x2d1c-vol\x2d8c3b1734.mount
  docker-0bd19e353b40194e4bcc35172fa5b954ef2ba366121de2163f07f557bbcd170a.scope
  locksmithd.service
  polkit.service

You can get more info with systemctl --failed:

$ systemctl --failed
  UNIT                                                                                                        LOAD   ACTIVE SUB    DESCRIPTION
● var-lib-kubelet-plugins-kubernetes.io-aws\x2debs-mounts-aws-ap\x2dnortheast\x2d1c-vol\x2d8c3b1734.mount     loaded failed failed /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/ap-northeast-1c/vol-8c3b1734
● var-lib-rkt-pods-run-bf6f1c19\x2d7bc0\x2d4931\x2d885a\x2d811cc236973a-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-kubelet-plugins-kubernetes.io-aws\x2debs-mounts-aws-ap\x2dnortheast\x2d1c-vol\x2d8c3b1734.mount loaded failed failed
● docker-0bd19e353b40194e4bcc35172fa5b954ef2ba366121de2163f07f557bbcd170a.scope                               loaded failed failed docker container 0bd19e353b40194e4bcc35172fa5b954ef2ba366121de2163f07f557bbcd170a
● locksmithd.service                                                                                          masked failed failed locksmithd.service
● polkit.service                                                                                              loaded failed failed Authorization Manager

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

5 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
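If you want only the unit names, for example to feed them to another command, the table above can be reduced to its first column. The following is a minimal sketch, not part of the original setup: failed_unit_names is a hypothetical helper name, and it assumes the input has already had the header and footer removed, which systemctl's --no-legend option does.

```shell
# Hypothetical helper: print only the unit names from
# `systemctl --failed --no-legend` style output read on stdin.
failed_unit_names() {
  # Some systemctl versions prefix each row with a "●" bullet;
  # when it is present the unit name is the second field.
  awk 'NF { print ($1 == "●" ? $2 : $1) }'
}
```

On a systemd host it could be used as: systemctl --failed --no-legend | failed_unit_names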

You can get the status of a specific unit with systemctl status <unit>:

$ systemctl status locksmithd.service
● locksmithd.service
   Loaded: masked (/dev/null)
   Active: failed (Result: resources) since Tue 2016-05-17 06:34:35 UTC; 6 days ago
 Main PID: 758 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
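The status output is meant for reading. For scripts, systemctl show prints the same state as KEY=VALUE lines. The sketch below condenses four of those properties into one line; summarize_unit_state is a hypothetical helper name, and it assumes the unit exposes the LoadState, ActiveState, SubState and Result properties.

```shell
# Hypothetical helper: condense `systemctl show` KEY=VALUE output
# (read from stdin) into a single summary line.
summarize_unit_state() {
  awk -F= '
    $1 == "LoadState"   { load_state = $2 }
    $1 == "ActiveState" { active_state = $2 }
    $1 == "SubState"    { sub_state = $2 }
    $1 == "Result"      { result = $2 }
    END {
      printf "load=%s active=%s sub=%s result=%s\n",
             load_state, active_state, sub_state, result
    }
  '
}
```

It could be used as: systemctl show -p LoadState -p ActiveState -p SubState -p Result locksmithd.service | summarize_unit_state. For whatever log entries remain after rotation, journalctl -u locksmithd.service is the place to look.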

To list the units systemd currently has loaded, use systemctl list-units (add --all to include inactive ones):

$ systemctl list-units

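If you think the failure was only a temporary glitch, clear the failed state with reset-failed. This only resets the state systemd has recorded; it does not restart the units or fix the underlying cause: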

$ sudo systemctl reset-failed

Then check that no failed units remain:

$ systemctl --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
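Running reset-failed without arguments clears every failed unit at once. It also accepts unit names, so you can clear only the units you have already investigated. The following is a sketch under that assumption; reset_failed_units and the DRY_RUN variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: clear the failed state of each unit name
# read from stdin, one name per line.
# With DRY_RUN=1 the commands are printed instead of executed.
reset_failed_units() {
  # read -r keeps backslashes, so escaped names such as
  # "...aws\x2debs..." are passed through unchanged.
  while IFS= read -r unit; do
    [ -n "$unit" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf "sudo systemctl reset-failed '%s'\n" "$unit"
    else
      sudo systemctl reset-failed "$unit"
    fi
  done
}
```

Setting DRY_RUN=1 first and piping in a list of names, for example with printf '%s\n' polkit.service | reset_failed_units, shows what would be run before anything is changed.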

There's much more you can do with systemctl. See systemctl --help for details.