The 630-GPU server crashed and rebooted automatically. The log recorded: A fatal error was detected on a component at bus 128 device 3 function 0
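Cause of the failure:
The machine crashed because, with multiple GPUs running under high load, the GPU temperature reached its threshold (95 °C) and triggered a bus fatal error, which led to the crash and reboot.
The root cause was an abnormal iDRAC thermal-control process: it could not report the GPU's actual operating temperature accurately and in real time, so the GPU overheated and the machine went down.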
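Because the iDRAC was not reporting GPU temperature reliably, it can help to cross-check temperatures from the operating system. The following is a minimal sketch, not part of the original fix: `hot_gpus` and `GPU_TEMP_LIMIT` are hypothetical names, and the 95-degree limit is the threshold quoted above. The function reads one temperature per line and prints every reading at or above the limit.

```shell
#!/bin/sh
# Threshold taken from the incident description (95 degrees).
GPU_TEMP_LIMIT=95

# Hypothetical helper: reads one integer temperature per line on stdin and
# prints the zero-based index and value of each reading at or above the limit.
# An optional first argument overrides the default limit.
hot_gpus() {
    limit="${1:-$GPU_TEMP_LIMIT}"
    awk -v limit="$limit" '
        /^[ \t]*[0-9]+[ \t]*$/ {
            if ($1 + 0 >= limit + 0) printf "GPU %d: %d\n", n, $1
            n++
        }'
}
```

Assuming NVIDIA GPUs with `nvidia-smi` installed (the original does not name the GPU vendor), one way to feed it is `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | hot_gpus`.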
Adjusting the fan speed directly with racadm:
Check the current value:
[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset
Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=System.Embedded.1#ThermalSettings.1]
FanSpeedOffset=Off
Set the fan speed offset to 3 [0: low fan speed, 1: medium fan speed, 2: high fan speed, 3: max fan speed]:
[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx set System.ThermalSettings.FanSpeedoffset 3
Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=System.Embedded.1#ThermalSettings.1]
Object value modified successfully
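When the same change has to be applied to several machines, a small wrapper can reject a mistyped offset before anything is sent to the BMC. This is only a sketch: `set_fan_offset`, `RACADM`, `BMC_HOST`, `BMC_USER`, and `BMC_PASS` are hypothetical names, and the accepted range 0-3 is simply the list of values given above.

```shell
#!/bin/sh
# Hypothetical wrapper around the racadm "set" command shown above.
# BMC_HOST, BMC_USER and BMC_PASS are assumed to be set by the caller.
# RACADM can be overridden (for example RACADM=echo) for a dry run.
set_fan_offset() {
    case "$1" in
        0|1|2|3) ;;  # the four values listed in this article
        *)
            echo "invalid fan speed offset: $1 (expected 0-3)" >&2
            return 2
            ;;
    esac
    ${RACADM:-racadm} -r "$BMC_HOST" -u "$BMC_USER" -p "$BMC_PASS" \
        set System.ThermalSettings.FanSpeedoffset "$1"
}
```

With `RACADM=echo`, calling `set_fan_offset 3` only prints the arguments that would be passed, which is a cheap way to review the command before running it for real.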
Check again after the change:
[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset
Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
[Key=System.Embedded.1#ThermalSettings.1]
FanSpeedOffset=Max Fan Speed
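The before-and-after check can also be scripted. The sketch below assumes the `get` output has the shape shown in the transcripts above (certificate warnings, a `[Key=...]` line, then `FanSpeedOffset=<value>`); `fan_offset_value` and `fan_offset_is` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: reads "racadm get" output on stdin and prints only the
# text after "FanSpeedOffset=", skipping certificate warnings and the key line.
fan_offset_value() {
    sed -n 's/^ *FanSpeedOffset=//p' | tr -d '\r'
}

# Hypothetical helper: succeeds only if the value read from stdin equals $1.
fan_offset_is() {
    [ "$(fan_offset_value)" = "$1" ]
}
```

For example, `racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedoffset | fan_offset_is "Max Fan Speed" || echo "offset not applied"` would flag a host where the setting did not take effect.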
After the fan speed was adjusted, the server ran normally.
Source: https://www.cnaaa.net. When reposting, please credit the original: https://www.cnaaa.net/archives/8190