Message boards : Graphics cards (GPUs) : GPUGRID NVIDIA related crashing (looks like random appearances)

Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 639
Message 58885 - Posted: 6 Jun 2022 | 10:55:14 UTC
Last modified: 6 Jun 2022 | 10:59:30 UTC

The BOINC application GPUGRID seems to crash the system.
I think it is NVIDIA related, as no other application causes it.
In some cases the system was overheating, and in other cases it crashed without
overheating at all.


System info:


System: Kernel: 5.10.0-14-amd64 x86_64 bits: 64 compiler: gcc v: 10.2.1
parameters: BOOT_IMAGE=/vmlinuz-5.10.0-14-amd64 root=UUID=<filter> ro quiet
splash
Desktop: Xfce 4.16.0 tk: Gtk 3.24.24 info: xfce4-panel wm: xfwm 4.16.1 vt: 7 dm: LightDM 1.26.0
Distro: MX-21.1_ahs_x64 Wildflower April 9 2022 base: Debian GNU/Linux 11 (bullseye)
Machine: Type: Laptop System: HP product: HP ENVY Laptop 17-ce0xxx v: Type1ProductConfigId serial: <filter>
Chassis: type: 10 serial: <filter>
Mobo: HP model: 85E5 v: 30.32 serial: <filter> UEFI: Insyde v: F.13 date: 08/09/2021
Battery: ID-1: BAT0 charge: 0% condition: 53.2/53.2 Wh (100.0%) volts: 10.5 min: 11.6 model: 333-54-2C-A LK03055XL
type: Li-ion serial: <filter> status: Unknown
Device-1: hidpp_battery_0 model: Logitech Wireless Mouse serial: <filter> charge: 55% (should be ignored)
rechargeable: yes status: Discharging
Device-2: hidpp_battery_1 model: Logitech Wireless Keyboard serial: <filter>
charge: 55% (should be ignored) rechargeable: yes status: Discharging
CPU: Info: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP arch: Kaby Lake note: check family: 6
model-id: 8E (142) stepping: C (12) microcode: EC cache: L2: 8 MiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 31999
Speed: 2200 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 2200
7: 2200 8: 2200
Vulnerabilities: Type: itlb_multihit status: KVM: VMX disabled
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2 mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
Type: srbds mitigation: TSX disabled
Type: tsx_async_abort status: Not affected
Graphics: Device-1: Intel WhiskeyLake-U GT2 [UHD Graphics 620] vendor: Hewlett-Packard driver: i915 v: kernel
bus-ID: 00:02.0 chip-ID: 8086:3ea0 class-ID: 0300
Device-2: NVIDIA GP108M [GeForce MX250] vendor: Hewlett-Packard driver: nvidia v: 515.48.07
alternate: nouveau,nvidia_drm bus-ID: 02:00.0 chip-ID: 10de:1d13 class-ID: 0302
Display: x11 server: X.Org 1.20.13 compositor: xfwm4 v: 4.16.1 driver: loaded: modesetting,nvidia
unloaded: fbdev,nouveau,vesa alternate: nv display-ID: :0.0 screens: 1
Screen-1: 0 s-res: 3840x1080 s-dpi: 96 s-size: 1016x286mm (40.0x11.3") s-diag: 1055mm (41.6")
Monitor-1: eDP-1 res: 1920x1080 hz: 60 dpi: 128 size: 382x215mm (15.0x8.5") diag: 438mm (17.3")
Monitor-2: DP-1 res: 1920x1080 hz: 60 dpi: 85 size: 575x323mm (22.6x12.7") diag: 660mm (26")
OpenGL: renderer: Mesa Intel UHD Graphics 620 (WHL GT2) v: 4.6 Mesa 21.2.5 direct render: Yes
Audio: Device-1: Intel Cannon Point-LP High Definition Audio vendor: Hewlett-Packard driver: sof-audio-pci
alternate: snd_hda_intel,snd_soc_skl,snd_sof_pci bus-ID: 00:1f.3 chip-ID: 8086:9dc8 class-ID: 0401
Sound Server-1: ALSA v: k5.10.0-14-amd64 running: yes
Sound Server-2: PulseAudio v: 14.2 running: yes
Network: Device-1: Intel Cannon Point-LP CNVi [Wireless-AC] driver: iwlwifi v: kernel modules: wl port: 5000
bus-ID: 00:14.3 chip-ID: 8086:9df0 class-ID: 0280
IF: wlan0 state: down mac: <filter>
Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Hewlett-Packard driver: r8169
v: kernel port: 3000 bus-ID: 03:00.0 chip-ID: 10ec:8168 class-ID: 0200
IF: eth0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth: Device-1: Intel Bluetooth 9460/9560 Jefferson Peak (JfP) type: USB driver: btusb v: 0.8 bus-ID: 1-10:4
chip-ID: 8087:0aaa class-ID: e001
Report: hciconfig ID: hci0 rfk-id: 1 state: up address: <filter> bt-v: 3.0 lmp-v: 5.1 sub-v: 100
hci-v: 5.1 rev: 100
Info: acl-mtu: 1021:4 sco-mtu: 96:6 link-policy: rswitch sniff link-mode: slave accept
service-classes: rendering, capturing, object transfer, audio
RAID: Hardware-1: Intel 82801 Mobile SATA Controller [RAID mode] driver: ahci v: 3.0 port: 5060 bus-ID: 00:17.0
chip-ID: 8086.282a rev: 30 class-ID: 0104
Drives: Local Storage: total: 1.59 TiB used: 357.94 GiB (21.9%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Western Digital model: PC SN520 SDAPNUW-256G-1006
size: 238.47 GiB block-size: physical: 512 B logical: 512 B speed: 15.8 Gb/s lanes: 2 type: SSD
serial: <filter> rev: 20110006 temp: 36.9 C scheme: GPT
ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST1000LM049-2GH172 size: 931.51 GiB block-size:
physical: 4096 B logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 7200 serial: <filter> rev: RXM3
scheme: GPT
ID-3: /dev/sdb maj-min: 8:16 type: USB vendor: HP model: x796w size: 462.32 GiB block-size:
physical: 512 B logical: 512 B type: N/A serial: <filter> rev: PMAP scheme: MBR
SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure?
Partition: ID-1: / raw-size: 47.6 GiB size: 46.55 GiB (97.80%) used: 11.56 GiB (24.8%) fs: ext4 dev: /dev/nvme0n1p8
maj-min: 259:8
ID-2: /boot raw-size: 1024 MiB size: 989.4 MiB (96.62%) used: 267.7 MiB (27.1%) fs: ext4
dev: /dev/nvme0n1p7 maj-min: 259:7
ID-3: /boot/efi raw-size: 260 MiB size: 256 MiB (98.46%) used: 97.7 MiB (38.2%) fs: vfat
dev: /dev/nvme0n1p1 maj-min: 259:1
ID-4: /home raw-size: 488.28 GiB size: 479.54 GiB (98.21%) used: 346.03 GiB (72.2%) fs: ext4
dev: /dev/sda2 maj-min: 8:2
Swap: Alert: No swap data was found.
Sensors: System Temperatures: cpu: 70.0 C mobo: 57.0 C
Fan Speeds (RPM): N/A
Repos: Packages: note: see --pkg apt: 2309 lib: 1231 flatpak: 0
No active apt repos in: /etc/apt/sources.list
Active apt repos in: /etc/apt/sources.list.d/debian-stable-updates.list
1: deb http://deb.debian.org/debian bullseye-updates main contrib non-free
Active apt repos in: /etc/apt/sources.list.d/debian.list
1: deb http://deb.debian.org/debian bullseye main contrib non-free
2: deb http://security.debian.org/debian-security bullseye-security main contrib non-free
Active apt repos in: /etc/apt/sources.list.d/mx.list
1: deb http://mirror.rise.ph/mxlinux-pkg/mx/repo/ bullseye main non-free
2: deb http://mirror.rise.ph/mxlinux-pkg/mx/repo/ bullseye ahs
Active apt repos in: /etc/apt/sources.list.d/teams.list
1: deb [arch=amd64] https://packages.microsoft.com/repos/ms-teams stable main
Info: Processes: 330 Uptime: 23m wakeups: 10 Memory: 15.37 GiB used: 3.4 GiB (22.1%) Init: SysVinit v: 2.96
runlevel: 5 default: 5 tool: systemctl Compilers: gcc: 10.2.1 alt: 10 Shell: bash
default: Bash v: 5.1.4 running-in: quick-system-info-mx inxi: 3.3.06
Boot Mode: UEFI



NVIDIA-driver specific info:
Mon Jun 6 18:54:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   60C    P8    N/A /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2663      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+


BOINC log file output:
Mon Jun 6 18:33:49 2022 | | cc_config.xml not found - using defaults
Mon Jun 6 18:33:50 2022 | | Starting BOINC client version 7.16.16 for x86_64-pc-linux-gnu
Mon Jun 6 18:33:50 2022 | | log flags: file_xfer, sched_ops, task
Mon Jun 6 18:33:50 2022 | | Libraries: libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3
Mon Jun 6 18:33:50 2022 | | Data directory: /home/jari/BOINC
Mon Jun 6 18:33:50 2022 | | CUDA: NVIDIA GPU 0: NVIDIA GeForce MX250 (driver version 515.48, CUDA version 11.7, compute capability 6.1, 4042MB, 3982MB available, 1215 GFLOPS peak)
Mon Jun 6 18:33:50 2022 | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce MX250 (driver version 515.48.07, device version OpenCL 3.0 CUDA, 4042MB, 3982MB available, 1215 GFLOPS peak)
Mon Jun 6 18:33:50 2022 | | libc: Debian GLIBC 2.31-13+deb11u3 version 2.31
Mon Jun 6 18:33:50 2022 | | Host name: mx
Mon Jun 6 18:33:50 2022 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz [Family 6 Model 142 Stepping 12]
Mon Jun 6 18:33:50 2022 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Mon Jun 6 18:33:50 2022 | | OS: Linux Debian: Debian GNU/Linux 11 (bullseye) [5.10.0-14-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u3)]
Mon Jun 6 18:33:50 2022 | | Memory: 15.37 GB physical, 0 bytes virtual
Mon Jun 6 18:33:50 2022 | | Disk: 479.54 GB total, 109.10 GB free
Mon Jun 6 18:33:50 2022 | | Local time is UTC +8 hours
Mon Jun 6 18:33:50 2022 | GPUGRID | General prefs: from GPUGRID (last modified 01-Jun-2022 19:37:34)
Mon Jun 6 18:33:50 2022 | GPUGRID | Computer location: home
Mon Jun 6 18:33:50 2022 | GPUGRID | General prefs: no separate prefs for home; using your defaults
Mon Jun 6 18:33:50 2022 | | Reading preferences override file
Mon Jun 6 18:33:50 2022 | | Preferences:
Mon Jun 6 18:33:50 2022 | | max memory usage when active: 7869.12 MB
Mon Jun 6 18:33:50 2022 | | max memory usage when idle: 7869.12 MB
Mon Jun 6 18:33:55 2022 | | max disk usage: 50.00 GB
Mon Jun 6 18:33:55 2022 | | max CPUs used: 6
Mon Jun 6 18:33:55 2022 | | (to change preferences, visit a project web site or select Preferences in the Manager)
Mon Jun 6 18:33:55 2022 | | Setting up project and slot directories
Mon Jun 6 18:33:55 2022 | | Checking active tasks


I could not find anything saved in the logs related to the system crash,
so I cannot point to a specific reason why it occurs.
Maybe it is a driver-related issue.
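
Since nothing gets logged at the moment of the crash, one option is to keep a small watcher running that records GPU readings every few seconds and forces each line to disk, so the last sample before a hard crash survives. The following is only a rough sketch under assumptions: the file name gpu_watch.log, the 5 second interval, and the helper names are made up for illustration. It relies on the nvidia-smi --query-gpu option with the timestamp, temperature.gpu, utilization.gpu and power.draw fields; on this MX250 the power reading is shown as N/A, so non-numeric fields are stored as None.

```python
import os
import subprocess
import time

# Fields requested from nvidia-smi, in this order.
QUERY = "timestamp,temperature.gpu,utilization.gpu,power.draw"


def parse_line(line):
    """Split one CSV line from nvidia-smi into a dict.

    Fields that are not numeric (for example "[N/A]") become None.
    """
    ts, temp, util, power = [field.strip() for field in line.split(",")]

    def num(text):
        try:
            return float(text)
        except ValueError:
            return None

    return {
        "timestamp": ts,
        "temp_c": num(temp),
        "util_pct": num(util),
        "power_w": num(power),
    }


def sample():
    """Run nvidia-smi once and return the CSV line for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().splitlines()[0]


def log_forever(path, interval_s=5):
    """Append one sample per interval and fsync so it survives a hard crash."""
    with open(path, "a") as log:
        while True:
            log.write(sample() + "\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical log file name; pick any location on a local disk.
    log_forever("gpu_watch.log")
```

After a crash, the last lines of the log show whether the temperature was climbing before the system went down. parse_line can be used to turn those lines into numbers for a closer look.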

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58887 - Posted: 6 Jun 2022 | 20:27:00 UTC - in response to Message 58885.

I believe the 515 series drivers are the short term branch.

The long term stable series is the 510 drivers.

I'd drop back from the cutting edge for a run and try the 510 series.

Not having any issues with the 510 series on all of my hosts.

Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 639
Message 58900 - Posted: 11 Jun 2022 | 13:04:44 UTC - in response to Message 58887.

It looks like the HP ENVY Laptop 17-ce0xxx 19.5V/65W charger is either broken
or too small for the heavy load caused by GPUGRID.
So I will try to purchase a larger 19.5V/90W charger to support the high
load caused by GPUGRID and NVIDIA.
The battery was always nearly empty, because the 65W charger could not supply
enough current for this application to run.
Thus even the smallest load peak may have caused a CPU/GPU undervoltage that crashed the system.
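
To put rough numbers on the charger reasoning: rated current is power divided by voltage, so a 19.5V/65W charger delivers about 3.3 A and a 19.5V/90W charger about 4.6 A. The small sketch below works this out, and also estimates how long the 53.2 Wh battery from the system info would last if it had to cover whatever the charger cannot supply. The 80 W system load is a purely hypothetical figure for illustration, since the real draw was not measured, and the function names are my own.

```python
def charger_current_a(power_w, voltage_v):
    """Rated output current of a charger in amperes: I = P / V."""
    return power_w / voltage_v


def hours_until_empty(capacity_wh, load_w, charger_w):
    """Hours a full battery lasts while covering the load the charger cannot supply.

    Returns None when the charger covers the whole load (battery is not drained).
    """
    deficit_w = load_w - charger_w
    if deficit_w <= 0:
        return None
    return capacity_wh / deficit_w


if __name__ == "__main__":
    VOLTAGE_V = 19.5        # charger voltage from the post
    BATTERY_WH = 53.2       # battery capacity from the system info
    ASSUMED_LOAD_W = 80.0   # hypothetical load, not a measured value

    for charger_w in (65.0, 90.0):
        amps = charger_current_a(charger_w, VOLTAGE_V)
        hours = hours_until_empty(BATTERY_WH, ASSUMED_LOAD_W, charger_w)
        drain = "battery not drained" if hours is None else f"battery empty in {hours:.1f} h"
        print(f"{charger_w:.0f} W charger: {amps:.2f} A, {drain}")
```

Under that assumed load the 65 W charger leaves a deficit the battery has to make up, while the 90 W charger covers it. Ignoring conversion losses and battery aging keeps the estimate simple, so treat the result as an order of magnitude only.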

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58901 - Posted: 11 Jun 2022 | 16:52:50 UTC

Glad to hear you figured out the problem was not enough power for the system.
