Trying out the VMware Virtual Watchdog Timer

A request came up to force a reboot when the OS hangs, so I tried out the Virtual Watchdog Timer (VWDT).

Environment

vCenter Server 7.0 Update 3l
ESXi 7.0 Update 3l
CentOS 7.9
- The virtual machine in the request was running CentOS 7

Adding the VWDT

First, check that the requirements are met.

The virtual machine hardware is version 17 or later
- Version 19 (ESXi 7.0 U2 or later)
The guest OS supports VWDT
- Windows Server 2008 or later
- Linux distributions based on kernel 4.9 or later
  - Ubuntu 18.04 or later
  - Red Hat Enterprise Linux 7.6 or later
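The hardware version can also be read from the VM's .vmx file, where it is stored under the virtualHW.version key. The following is a minimal sketch, assuming shell access to a copy of that file; the function names vmx_hw_version and vmx_supports_vwdt are hypothetical helpers invented for this example, not part of any VMware tool.

```shell
#!/bin/sh
# Print the hardware version stored in a .vmx file (virtualHW.version key).
# vmx_hw_version is a hypothetical helper name used only in this sketch.
vmx_hw_version() {
    # Matching lines look like: virtualHW.version = "19"
    sed -n 's/^virtualHW\.version[[:space:]]*=[[:space:]]*"\([0-9][0-9]*\)".*$/\1/p' "$1" | head -n 1
}

# Succeed when the hardware version meets the VWDT minimum of 17.
vmx_supports_vwdt() {
    ver=$(vmx_hw_version "$1")
    [ -n "$ver" ] && [ "$ver" -ge 17 ]
}
```

Calling vmx_supports_vwdt with the path to the .vmx file returns success when the version is 17 or higher. It only covers the hardware requirement; guest OS support still has to be checked separately.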

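If everything checks out, power off the virtual machine.

Select "Actions" → "Edit Settings", then choose "Watchdog Timer" under "Add New Device".

After adding the device, check "Start with BIOS/EFI boot" (with this checked, the watchdog starts as soon as the VM is powered on).

Boot the virtual machine and confirm that the device has been created and that the watchdog module is loaded.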

[ ~]$ ls -la /dev/watchdog

crw-------. 1 root root 10, 130  8月  8 20:50 /dev/watchdog

[ ~]$ lsmod | grep wdat

wdat_wdt               13590  1

[ ~]$
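The two manual checks above can be combined into one function. This is a minimal sketch: check_vwdt_guest is a hypothetical name, and the device path and module list are taken as arguments (defaulting to /dev/watchdog and /proc/modules) so that the logic can be exercised without a real watchdog device.

```shell
#!/bin/sh
# check_vwdt_guest [DEVICE] [MODULES_FILE]
# Hypothetical helper: succeeds when DEVICE is a character device and
# the wdat_wdt driver is listed in MODULES_FILE (normally /proc/modules).
check_vwdt_guest() {
    dev=${1:-/dev/watchdog}
    modules=${2:-/proc/modules}
    if [ ! -c "$dev" ]; then
        echo "missing character device: $dev" >&2
        return 1
    fi
    # Each /proc/modules line starts with the module name and a space.
    if ! grep -q '^wdat_wdt ' "$modules"; then
        echo "wdat_wdt is not loaded" >&2
        return 1
    fi
    echo "ok: $dev present, wdat_wdt loaded"
}
```

Running check_vwdt_guest with no arguments on the guest checks the same two things as the ls and lsmod commands, and the exit status makes it usable in a provisioning script.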

Installing watchdog

Install the watchdog package with yum (this is CentOS 7). The version installed this time was watchdog-5.13-12.el7.x86_64.

[~]# env LANG=C sudo yum info watchdog

Loaded plugins: fastestmirror, langpacks

Repodata is over 2 weeks old. Install yum-cron? Or run: yum makecache fast

Loading mirror speeds from cached hostfile

 * base: ftp.iij.ad.jp

 * extras: ftp.iij.ad.jp

 * updates: ftp.iij.ad.jp

Available Packages

Name        : watchdog

Arch        : x86_64

Version     : 5.13

Release     : 12.el7

Size        : 79 k

Repo        : base/7/x86_64

Summary     : Software and/or Hardware watchdog daemon

URL         : http://sourceforge.net/projects/watchdog/

License     : GPLv2+

Description : The watchdog program can be used as a powerful software watchdog daemon

            : or may be alternately used with a hardware watchdog device such as the

            : IPMI hardware watchdog driver interface to a resident Baseboard

            : Management Controller (BMC).  watchdog periodically writes to /dev/watchdog;

            : the interval between writes to /dev/watchdog is configurable through settings

            : in the watchdog sysconfig file.  This configuration file is also used to

            : set the watchdog to be used as a hardware watchdog instead of its default

            : software watchdog operation.  In either case, if the device is open but not

            : written to within the configured time period, the watchdog timer expiration

            : will trigger a machine reboot. When operating as a software watchdog, the

            : ability to reboot will depend on the state of the machine and interrupts.

            : When operating as a hardware watchdog, the machine will experience a hard

            : reset (or whatever action was configured to be taken upon watchdog timer

            : expiration) initiated by the BMC.

[~]# env LANG=C yum install watchdog

(snip)

Resolving Dependencies

--> Running transaction check

---> Package watchdog.x86_64 0:5.13-12.el7 will be installed

--> Finished Dependency Resolution

Dependencies Resolved

============================================================================================================================================

 Package                           Arch                            Version                              Repository                     Size

============================================================================================================================================

Installing:

 watchdog                          x86_64                          5.13-12.el7                          base                           79 k

Transaction Summary

============================================================================================================================================

Install  1 Package

Total download size: 79 k

Installed size: 160 k

Is this ok [y/d/N]: y

Downloading packages:

watchdog-5.13-12.el7.x86_64.rpm                                                                                      |  79 kB  00:00:00

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

Warning: RPMDB altered outside of yum.

  Installing : watchdog-5.13-12.el7.x86_64                                                                                              1/1

  Verifying  : watchdog-5.13-12.el7.x86_64                                                                                              1/1

Installed:

  watchdog.x86_64 0:5.13-12.el7

Complete!

[~]# rpm -qa | grep watchdog

watchdog-5.13-12.el7.x86_64

[~]#

After installation, enable and start the watchdog service.

[~]# env LANG=C systemctl status watchdog

* watchdog.service - watchdog daemon

   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled; vendor preset: disabled)

   Active: inactive (dead)

[~]# env LANG=C systemctl enable watchdog

Created symlink from /etc/systemd/system/multi-user.target.wants/watchdog.service to /usr/lib/systemd/system/watchdog.service.

[~]# env LANG=C systemctl start watchdog

[~]# env LANG=C systemctl status watchdog

* watchdog.service - watchdog daemon

   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)

   Active: active (running) since Tue 2023-08-08 20:54:49 JST; 2s ago

  Process: 2748 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)

 Main PID: 2750 (watchdog)

   CGroup: /system.slice/watchdog.service

           `-2750 /usr/sbin/watchdog

Aug 08 20:54:49 XXX systemd[1]: Starting watchdog daemon...

Aug 08 20:54:49 XXX watchdog[2750]: starting daemon (5.13):

Aug 08 20:54:49 XXX watchdog[2750]: int=1s realtime=yes sync=no soft=no mla=0 mem=0

Aug 08 20:54:49 XXX watchdog[2750]: ping: no machine to check

Aug 08 20:54:49 XXX watchdog[2750]: file: no file to check

Aug 08 20:54:49 XXX watchdog[2750]: pidfile: no server process to check

Aug 08 20:54:49 XXX watchdog[2750]: interface: no interface to check

Aug 08 20:54:49 XXX watchdog[2750]: test=none(0) repair=none(0) alive=none heartbeat=none temp=none to=root no_act=no

Aug 08 20:54:49 XXX systemd[1]: Started watchdog daemon.

[~]#

Next, adjust the watchdog configuration file, which is located at /etc/watchdog.conf. These are the settings I changed.

watchdog-device        = /dev/watchdog

watchdog-timer  = 10

After making the changes, restart the watchdog service.
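Before restarting, it can help to read back the values that are actually active in the file, since commented-out defaults are easy to mistake for live settings. The sketch below assumes the simple "key = value" layout shown above; conf_get is a hypothetical helper name. As far as I know, watchdog.conf(5) documents the device timeout option under the name watchdog-timeout, so it may be worth confirming which key the daemon honors.

```shell
#!/bin/sh
# conf_get KEY FILE
# Hypothetical helper: print the value of "KEY = value" from a
# watchdog.conf-style file, ignoring commented-out lines.
conf_get() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }                 # skip comment lines
        {
            eq = index($0, "=")
            if (eq == 0) next               # not a key = value line
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)  # trim whitespace around key
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim whitespace around value
            if (k == key) { print v; exit }
        }
    ' "$2"
}
```

For example, conf_get watchdog-device /etc/watchdog.conf prints the configured device path, and an empty result means the key is unset or commented out. The change is then applied with systemctl restart watchdog.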

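Also confirm from vCenter that the watchdog is running.

Testing

Now hang the virtual machine for real and confirm that the VWDT resets it.

On RHEL/CentOS, running echo c > /proc/sysrq-trigger makes the system reboot immediately, so that method cannot be used here (see the Red Hat Customer Portal article 「kernel.panic = 0」であるにもかかわらず、カーネルパニック後にシステムが再起動する, roughly "The system reboots after a kernel panic even though kernel.panic = 0").

Instead, I induce the hang with GitHub - Noppy/HangAndPanicKernelModule: Hang_and_panic module causes a hang condition or a panic condition., which was introduced in the のぴぴのメモ post 「Linuxのテスト用に、ハングアップやパニック状態にするカーネルモジュールを作ってみた」 (roughly "I wrote a kernel module that hangs or panics Linux for testing").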

[~]# cat /proc/hang_panic

<<Hang&Panic module>>

'echo c > /proc/hang_panic' >>> panic

'echo h > /proc/hang_panic' >>> hang(disable local irq and preempt)

'echo H > /proc/hang_panic' >>> hang(disable only local irq)

[~]# date;echo h > /proc/hang_panic

2023年  8月  9日 水曜日 13:36:19 JST

After a while, the virtual machine had rebooted.

[~]$ env LANG=C date

Wed Aug  9 13:48:28 JST 2023

[~]$ uptime

 13:48:29 up 11 min,  1 user,  load average: 0.00, 0.01, 0.03

[~]$

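Check vmware.log. It shows that the reset was triggered by the VWDT: the watchdog fired at 04:37:19Z (13:37:19 JST), about 60 seconds after the hang was triggered at 13:36:19 JST.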

2023-08-09T04:37:05.022Z In(05) vmx - VigorTransportProcessClientPayload: opID=lro-1925252931-10ca8c3a-01-01-69-503d seq=6902: Receiving GuestStats.SetNotificationTime request.

2023-08-09T04:37:05.022Z In(05) vmx - VigorTransport_ServerSendResponse opID=lro-1925252931-10ca8c3a-01-01-69-503d seq=6902: Completed GuestStats request.

2023-08-09T04:37:19.716Z No(00) vmx - ConfigDB: Setting vwdt.watchdogFired = "TRUE"

2023-08-09T04:37:19.723Z In(05) vmx - VWDT: Resetting VM since watchdog fired.

2023-08-09T04:37:19.725Z In(05) vcpu-0 - Destroying virtual dev for scsi0:0 vscsi=450399446371412041

2023-08-09T04:37:19.725Z In(05) vcpu-0 - VMMon_VSCSIStopVports: No such target on adapter

2023-08-09T04:37:19.735Z In(05) vcpu-0 - DEVICE: Resetting device 'ALL'.

2023-08-09T04:37:19.735Z In(05) vcpu-0 - Tools: ToolsRunningStatus_Reset, delayedRequest is 0x0

2023-08-09T04:37:19.736Z In(05) vcpu-0 - Tools: Changing running status: 1 => 0.

2023-08-09T04:37:19.736Z In(05) vcpu-0 - Tools: [RunningStatus] Last heartbeat value 9091 (last received 0s ago)

2023-08-09T04:37:19.736Z In(05) vcpu-0 - GuestLib Generated SessionId 7805150419715780647
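To pick the watchdog entries out of a long vmware.log, a small filter is enough. This is a minimal sketch that assumes the line layout shown in the excerpt above (timestamp, level, thread, " - ", message); vwdt_events is a hypothetical helper name, and the two match patterns are simply the strings seen in this log.

```shell
#!/bin/sh
# vwdt_events LOGFILE
# Hypothetical helper: print "timestamp message" for every vmware.log
# line that mentions the virtual watchdog.
vwdt_events() {
    grep -E 'VWDT:|vwdt\.watchdogFired' "$1" |
        # Drop the level and thread fields, keep timestamp and message.
        sed 's/^\([^ ]*\) [^ ]* [^ ]* - \(.*\)$/\1 \2/'
}
```

Run it against the vmware.log in the VM's directory on the datastore. Each output line gives the UTC time at which the watchdog fired or the reset was issued, which can then be compared with the time the hang was triggered in the guest.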

That's all.

しろぺん日記