Understanding Systemd Timer: Troubleshooting `D-Bus connection terminated` Error

中文版:理解Systemd Timer:一次D-Bus connection terminated错误排查 – Frank’s Weblog

We run a SaaS service, when a customer creates an instance, we use terraform to create an EC2 instance and relevant cloud infrastructures. This EC2 is created with a prebuilt AMI, with our software installed. This software will be referred to as codename sinatra. After the EC2 is created, a script setup.sh will run to configure sinatra, and it will restart sinatra a few times with Systemctl.

We noticed one of the customers has a possibility to fail when creating an instance, with the error message below(non-relevant messages are omitted):

...Warning! D-Bus connection Terminated. Failed to wait for response: Connection reset by peer...

Troubleshooting

This error message provides absolutely no context. We had to start the investigation with D-Bus.

What is D-Bus?

D-Bus is a Inter-process Communication(IPC) mechanism that allows different programs to communicate and send commands with each other. Systemd itself also uses D-Bus wire protocol for the communication between Systemctl and Systemd.

What caused `D-Bus connection terminated`?

We inspected the system journal, here are the important parts:

$ journalctl -b
Jan 01 00:23:01 vm systemd[1]: Starting sinatra Service...
Jan 01 00:23:01 vm sinatractl[33072]: [   Info] Starting XXX
Jan 01 00:23:01 vm systemd[1]: Reloading.
Jan 01 00:23:01 vm CRON[33195]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 01 00:23:01 vm CRON[33196]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 01 00:23:01 vm CRON[33197]: (root) CMD (/usr/bin/tsar --cron > /dev/null 2>&1)
Jan 01 00:23:01 vm CRON[33198]: (root) CMD (root /usr/bin/tsar --cron &>/dev/null)
Jan 01 00:23:01 vm CRON[33196]: (CRON) info (No MTA installed, discarding output)
Jan 01 00:23:01 vm CRON[33196]: pam_unix(cron:session): session closed for user root
Jan 01 00:23:01 vm CRON[33195]: pam_unix(cron:session): session closed for user root
Jan 01 00:23:02 vm sinatractl[33072]: [   Info] Starting XXX
Jan 01 00:23:03 vm sinatractl[33072]: [   Info] Starting XXX XXX XXX
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm dbus-daemon[701]: [system] Reloaded configuration
Jan 01 00:23:11 vm systemd[1]: Reexecuting.
Jan 01 00:23:11 vm systemd[1]: systemd 245.4-4ubuntu3.20 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Jan 01 00:23:11 vm systemd[1]: Detected virtualization kvm.
Jan 01 00:23:11 vm systemd[1]: Detected architecture x86-64.
Jan 01 00:23:11 vm sudo[33067]: pam_unix(sudo:session): session closed for user root
Jan 01 00:23:11 vm CRON[699]: pam_unix(cron:session): session closed for user ubuntu
Jan 01 00:23:11 vm google_guest_agent[858]: GCE Agent Stopped
Jan 01 00:23:11 vm systemd[1]: Stopping Google Compute Engine Guest Agent...
Jan 01 00:23:11 vm systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Jan 01 00:23:11 vm systemd[1]: systemd-networkd-wait-online.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Wait for Network to be Configured.
Jan 01 00:23:11 vm systemd[1]: Stopping Wait for Network to be Configured...
Jan 01 00:23:11 vm systemd[1]: google-guest-agent.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Google Compute Engine Guest Agent.
Jan 01 00:23:11 vm systemd[1]: Stopping Network Service...
Jan 01 00:23:11 vm systemd[1]: systemd-networkd.service: Succeeded.
Jan 01 00:23:11 vm systemd[1]: Stopped Network Service.
Jan 01 00:23:11 vm systemd[1]: Starting Network Service...

Here are the key timestamps.

  • 00:23:01 Systemd started Sinatra
  • 00:23:11 Systemd Reexecute
  • 00:23:11 Systemd restarted

Normally, Sinatra takes 10-20 seconds to start, and Systemd Reexecution happened at 10th second. Our preliminary hypothesis was that the Reexecution interrupted Sinatra’s startup process.

What is Systemd Reexecution?

Reexecute the systemd manager. This will serialize the manager state, reexecute the process and deserialize the state again. This command is of little use except for debugging and package upgrades. Sometimes, it might be helpful as a heavy-weight daemon-reload. While the daemon is being reexecuted, all sockets systemd listening on behalf of user configuration will stay accessible.

systemctl

Systemd Reexecution is a restart of Systemd itself. It can be triggered manually by running systemctl daemon-reexec. Let’s try if we can reproduce the error message above.

Open two terminals in one machine. Start sinatra in the first terminal, then run systemctl daemon-reexec immediately in the second terminal.

# terminal 1
$ sudo systemctl start sinatra
Warning! D-Bus connection terminated.
Failed to wait for response: Connection reset by peer
# terminal 2
$ sudo systemctl daemon-reexec

We perfectly reproduced the error message.

What caused Systemd Reexecution?

Here is the question: Why would Systemd Reexecute? After examining the system journal again, we discovered the following entries.

$ journalctl -b
Jan 01 00:18:16 vm systemd[1]: Started Daily apt upgrade and clean activities.
......
Jan 01 00:28:16 vm systemd[1]: apt-daily-upgrade.service: Succeeded.
Jan 01 00:28:16 vm systemd[1]: Finished Daily apt upgrade and clean activities.

Upon checking this Service, we found out it’s a cron job managed by Systemd Timer.

user@vm:~$ cat /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer
[Unit]
Description=Daily apt upgrade and clean activities
After=apt-daily.timer

[Timer]
OnCalendar=*-*-* 6:00
RandomizedDelaySec=60m
Persistent=true

[Install]
WantedBy=timers.target

Using systemctl list-timers, we can confirm the last time this timer was executed matched the time of the Reexecution.

user@vm:~$ sudo systemctl list-timers
NEXT                         LEFT          LAST                         PASSED       UNIT                                 ACTIVATES
...
Tue 2025-01-02 06:28:09 UTC  4h 51min left Mon 2025-01-01 00:18:40 UTC  3h 18min ago apt-daily-upgrade.timer              apt-daily-upgrade.service
...

We further examined the APT logs and discovered that Systemd was updated during this job.

$ cat /var/log/apt/history.log
Start-Date: 2025-01-01  00:23:10
Commandline: /usr/bin/unattended-upgrade
Upgrade: libsystemd0:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), systemd-sysv:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), libpam-systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20), libnss-systemd:amd64 (245.4-4ubuntu3.19, 245.4-4ubuntu3.20)
End-Date: 2025-01-01  00:23:14

$ cat /var/log/apt/term.log
Log started: 2025-01-01  00:23:10
(Reading database ... 
(Reading database ... 5%
(Reading database ... 10%
(Reading database ... 15%
(Reading database ... 20%
(Reading database ... 25%
(Reading database ... 30%
(Reading database ... 35%
(Reading database ... 40%
(Reading database ... 45%
(Reading database ... 50%
(Reading database ... 55%
(Reading database ... 60%
(Reading database ... 65%
(Reading database ... 70%
(Reading database ... 75%
(Reading database ... 80%
(Reading database ... 85%
(Reading database ... 90%
(Reading database ... 95%
(Reading database ... 100%
(Reading database ... 121809 files and directories currently installed.)
Preparing to unpack .../systemd-sysv_245.4-4ubuntu3.20_amd64.deb ...
Unpacking systemd-sysv (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libnss-systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libnss-systemd:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libpam-systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libpam-systemd:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../systemd_245.4-4ubuntu3.20_amd64.deb ...
Unpacking systemd (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Preparing to unpack .../libsystemd0_245.4-4ubuntu3.20_amd64.deb ...
Unpacking libsystemd0:amd64 (245.4-4ubuntu3.20) over (245.4-4ubuntu3.19) ...
Setting up libsystemd0:amd64 (245.4-4ubuntu3.20) ...
Setting up systemd (245.4-4ubuntu3.20) ...
Setting up systemd-sysv (245.4-4ubuntu3.20) ...
Setting up libnss-systemd:amd64 (245.4-4ubuntu3.20) ...
Setting up libpam-systemd:amd64 (245.4-4ubuntu3.20) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
Processing triggers for libc-bin (2.31-0ubuntu9.18) ...
Log ended: 2025-01-01  00:23:14

Now we can confirm that this automatic upgrade upgraded Systemd, and caused it to Reexecute.

What Triggered the Automatic Upgrade?

We further inspected the Unit apt-daily-upgrade.service that showed up in the log. This is a service that automatically upgrades the apt packages, and is scheduled by /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer.

$ cat /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer
[Unit]
Description=Daily apt upgrade and clean activities
After=apt-daily.timer

[Timer]
OnCalendar=--* 6:00
RandomizedDelaySec=60m
Persistent=true

[Install]
WantedBy=timers.target

We can see this job is scheduled at 6AM everyday, but the time we created this instance was not 6AM, no matter in which relevant timezone.

 Persistent=
           Takes a boolean argument. If true, the time when the service unit was last triggered is stored on
           disk. When the timer is activated, the service unit is triggered immediately if it would have been
           triggered at least once during the time when the timer was inactive. This is useful to catch up on
           missed runs of the service when the machine was off. Note that this setting only has an effect on
           timers configured with OnCalendar=. Defaults to false.

Ubuntu Manpage: systemd.timer – Timer unit configuration

Since this Timer is configured to be “Persistent”, it catches up with the missed runs while the machine was “powered off”. From the Timer’s perspective, the time between the build of the AMI to the creation of the instance is considered “powered off”. Therefore, after the instance was created, the Timer runs at a “appropriate time”.

Conclusion

Disasters don’t just happen. They’re a chain of critical events.

Seconds from Disaster, National Geographic Channel

In conclusion, this issue stemmed from a coincidence of a series of events:

  1. EC2 instance launched from AMI.
  2. setup.sh started configuring Sinatra, triggering multiple systemctl restarts that takes 0-20s each.
  3. At the same time, apt-daily-upgrade kicked in because Persistent is set to true and there were previously missed runs.
  4. apt upgraded Systemd, and caused a daemon-reexec.
  5. The Systemd Reexecution disconnected the D-Bus while setup.sh was restarting the service.
  6. The instance failed to provision.

Solution

Our solutions was to set Persistent to False, to ensure unattended upgrade won’t happen right after the instance creation. Therefore lower the possibility of such race condition to minimum.

References

D-Bus – Wikipedia

Stop using cron! Systemd Timers Explained – CoadyTech

PeriodicUpdates – Debian Wiki


Posted

in

by

评论/Comments

发表回复/Leave a Reply

您的电子邮箱地址不会被公开。/Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.