Contents
English version: Understanding Systemd Timer: Troubleshooting D-Bus connection terminated Error – Frank’s Weblog

我们的服务是一个SaaS服务,每当客户创建一个实例时,我们会使用terraform在云环境中创建一个EC2实例以及相关的基础设施。这个EC2使用一个预先构建的AMI创建,这个AMI当中已经安装了我们所维护的软件,这个软件在本文中将使用代号sinatra指代。在EC2被创建后,会自动运行一个脚本setup.sh来配置sinatra,期间会使用Systemctl重启sinatra几次。
我们发现一位客户在创建实例时有一定几率会失败,错误信息如下(一些无关信息已经被省略):
...Warning! D-Bus connection Terminated. Failed to wait for response: Connection reset by peer...
排查
这个错误信息可以说是没头没尾,我们只能先从D-Bus入手。
什么是D-Bus?
D-Bus是一个进程间通信(IPC)机制,让不同的程序能够互相传递消息或下达指令。Systemd本身也使用D-Bus wire protocol用于Systemctl和Systemd之间的通讯。
什么导致了D-Bus connection terminated?
我们审查了系统日志,重点部分如下:
$ journalctl -b
Jan 01 00:23:01 vm systemd[1]: Starting sinatra Service...
Jan 01 00:23:01 vm sinatractl[33072]: [ Info] Starting XXX
Jan 01 00:23:01 vm systemd[1]: Reloading.
Jan 01 00:23:01 vm CRON[33195]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 01 00:23:01 vm CRON[33196]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 01 00:23:01 vm CRON[33197]: (root) CMD (/usr/bin/tsar --cron > /dev/null 2>&1)
Jan 01 00:23:01 vm CRON[33198]: (root) CMD (root /usr/bin/tsar --cron &>/dev/null)
Jan 01 00:23:01 vm CRON[33196]: (CRON) info (No MTA installed, discarding output)
Jan 01 00:23:01 vm CRON[33196]: pam_unix(cron:session): session closed for user root
Jan 01 00:23:01 vm CRON[33195]: pam_unix(cron:session): session closed for user root
Jan 01 00:23:02 vm sinatractl[33072]: [ Info] Starting XXX
Jan 01 00:23:03 vm sinatractl[33072]: [ Info] Starting XXX XXX XXX
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm systemd[1]: Reexecuting.
Jan 01 00:23:11 vm systemd[1]: systemd 245.4-4ubuntu3.20 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Jan 01 00:23:11 vm systemd[1]: Detected virtualization kvm.
Jan 01 00:23:11 vm systemd[1]: Detected architecture x86-64.
Jan 01 00:23:11 vm sudo[33067]: pam_unix(sudo:session): session closed for user root
Jan 01 00:23:11 vm CRON[699]: pam_unix(cron:session): session closed for user ubuntu
Jan 01 00:23:11 vm google_guest_agent[858]: GCE Agent Stopped
Jan 01 00:23:11 vm systemd[1]: Stopping Google Compute Engine Guest Agent...
Jan 01 00:23:11 vm systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Jan 01 00:23:11 vm systemd[1]: systemd-networkd-wait-online.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Wait for Network to be Configured.
Jan 01 00:23:11 vm systemd[1]: Stopping Wait for Network to be Configured...
Jan 01 00:23:11 vm systemd[1]: google-guest-agent.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Google Compute Engine Guest Agent.
Jan 01 00:23:11 vm systemd[1]: Stopping Network Service...
Jan 01 00:23:11 vm systemd[1]: systemd-networkd.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Network Service.
Jan 01 00:23:11 vm systemd[1]: Starting Network Service...
其中有几个关键时间点
- 00:23:01 Systemd启动Sinatra
- 00:23:11 Systemd Reexecute
- 00:23:11 Systemd 重新启动
通常情况下Sinatra的启动需要10-20秒的时间,而在第10秒时发生了Systemd Reexecution。我们可以初步推测Reexecution打断了Sinatra的启动过程。
什么是Systemd Reexecution?
Reexecute the systemd manager. This will serialize the manager state, reexecute the process and deserialize the state again. This command is of little use except for debugging and package upgrades. Sometimes, it might be helpful as a heavy-weight daemon-reload. While the daemon is being reexecuted, all sockets systemd listening on behalf of user configuration will stay accessible.
systemctl
Systemd Reexecution相当于Systemd自身的重启,可以通过systemctl daemon-reexec命令手动触发。那么我们来试一下上面的错误信息是否可以手动复现。
我们在一个机器中同时开两个terminal,在第一个terminal中启动sinatra,同时在第二个terminal中执行systemctl daemon-reexec。
# terminal 1
$ sudo systemctl start sinatra
Warning! D-Bus connection terminated.
Failed to wait for response: Connection reset by peer
# terminal 2
$ sudo systemctl daemon-reexec
我们完美地复现出了原始的错误信息。
什么造成了Systemd Reexecution?
那么问题来了,为什么Systemd会莫名其妙地自己Reexecute?再次检查日志之后发现了如下的内容,
$ journalctl -b
Jan 01 00:18:16 vm systemd[1]: Started Daily apt upgrade and clean activities.
......
Jan 01 00:28:16 vm systemd[1]: apt-daily-upgrade.service: Succeeded.
Jan 01 00:28:16 vm systemd[1]: Finished Daily apt upgrade and clean activities.
检查这个sevice之后发现这是一个由Systemd Timer控制的定时任务
user@vm:~$ cat /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer
[Unit]
Description=Daily apt upgrade and clean activities
After=apt-daily.timer
[Timer]
OnCalendar=*-*-* 6:00
RandomizedDelaySec=60m
Persistent=true
[Install]
WantedBy=timers.target
使用systemctl list-timers,可以证实这个timer的上次执行时间与Reexecution的时间相吻合。
user@vm:~$ sudo systemctl list-timers
NEXT LEFT LAST PASSED UNIT ACTIVATES
...
Tue 2025-01-02 06:28:09 UTC 4h 51min left Mon 2025-01-01 00:18:40 UTC 3h 18min ago apt-daily-upgrade.timer apt-daily-upgrade.service
...
我们进一步检查APT日志,发现Systemd在这个任务中被更新过。
$ cat /var/log/apt/history.log
Start-Date: 2025-01-01 00:23:10
Commandline: /usr/bin/unattended-upgrade
Upgrade: libsystemd0:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), systemd-sysv:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), libpam-systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), libnss-systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20)
End-Date: 2025-01-01 00:23:14
$ cat /var/log/apt/term.log
Log started: 2025-01-01 00:23:10
(Reading database ...
(Reading database ... 5%
(Reading database ... 10%
(Reading database ... 15%
(Reading database ... 20%
(Reading database ... 25%
(Reading database ... 30%
(Reading database ... 35%
(Reading database ... 40%
(Reading database ... 45%
(Reading database ... 50%
(Reading database ... 55%
(Reading database ... 60%
(Reading database ... 65%
(Reading database ... 70%
(Reading database ... 75%
(Reading database ... 80%
(Reading database ... 85%
(Reading database ... 90%
(Reading database ... 95%
(Reading database ... 100%
(Reading database ... 121809 files and directories currently installed.)
Preparing to unpack .../systemd-sysv_245.4-4ubuntu3.20_amd64.deb ...
Unpacking systemd-sysv (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libnss-systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libnss-systemd:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libpam-systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libpam-systemd:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking systemd (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libsystemd0_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libsystemd0:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Setting up libsystemd0:amd64 (245.4-4ubuntu3.20) ...
Setting up systemd (245.4-4ubuntu3.20) ...
Setting up systemd-sysv (245.4-4ubuntu3.20) ...
Setting up libnss-systemd:amd64 (245.4-4ubuntu3.20) ...
Setting up libpam-systemd:amd64 (245.4-4ubuntu3.20) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
Processing triggers for libc-bin (2.31-0ubuntu9.18) ...
Log ended: 2025-01-01 00:23:14
我们这里可以确定,这个自动更新任务更新了Systemd,导致其Reexecute。
什么触发了自动更新?
我们进一步研究了日志中出现的Unit apt-daily-upgrade.service 。这是一个用于自动升级apt软件包的服务,由 /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer 负责调度。
$ cat /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer
[Unit]
Description=Daily apt upgrade and clean activities
After=apt-daily.timer
[Timer]
OnCalendar=--* 6:00
RandomizedDelaySec=60m
Persistent=true
[Install]
WantedBy=timers.target
我们可以看出这个任务被安排在每天的早上6点,但是我们创建这个实例的时间并不是早上6点,无论在哪个相关的时区。
Persistent=
Takes a boolean argument. If true, the time when the service unit was last triggered is stored on
disk. When the timer is activated, the service unit is triggered immediately if it would have been
triggered at least once during the time when the timer was inactive. This is useful to catch up on
missed runs of the service when the machine was off. Note that this setting only has an effect on
timers configured with OnCalendar=. Defaults to false.
Ubuntu Manpage: systemd.timer – Timer unit configuration
因为这个Timer被配置为Persistent,它会在机器启动后补上在关机期间没能执行的运行。从这个Timer的角度,从AMI被创建,到首次启动的这段时间都会被视为“关机”。因此在实例被创建之后,这个Timer会自动在一个“合适的时间”运行。
结论
灾难不会凭空发生,而是关键事件的连锁效应。
Disasters don’t just happen. They’re a chain of critical events.
——国家地理频道「重返危机现场」
综上所述,这个问题来自于一系列原因组成的巧合。
- EC2 实例从AMI创建并启动 。
setup.sh开始配置sinatra服务,期间通过systemctl若干次重启sinatra服务,每次重启需耗时约10-20秒 。- 由于
Persistent=true,系统同时启动了了之前错过的apt-daily-upgrade任务 。 apt更新了systemd包并触发daemon-reexec。- 当
setup.sh正在重启服务时,Systemd因Reexecution而断开D-Bus连接,导致重启命令失败。 - 实例创建失败。
解决方法
我们的解决办法是将Persistent设置为False,这样可以确保自动更新不会在实例刚创建的时候执行,从而将发生冲突的的可能性降低到几乎没有。

发表回复/Leave a Reply