
Message boards : Number crunching : WUs on linux get mostly errors

Author Message
zooxit
Joined: 4 Jul 21
Posts: 23
Credit: 4,994,153,142
RAC: 24,640,293
Message 59011 - Posted: 22 Jul 2022 | 21:36:28 UTC

Hi,

For some time now I have been noticing that WUs mostly end with errors on my Linux (Debian) computer with a GTX 1070, while almost no errors are reported on my Windows computers, with either a GTX 1070 or an RTX 3080.

The same app is running on all computers: Python apps for GPU hosts v4.03 (cuda1131)

Any ideas what the issue is?

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,919,581,959
RAC: 6,279,126
Message 59012 - Posted: 23 Jul 2022 | 6:21:26 UTC - in response to Message 59011.

A lot of your errors are badly formatted tasks. Look at all the other failed wingmen that tried seven times before the task was retired.

But I also see an issue with your Debian system: its unarchive/decompression tools aren't handling the file archives correctly.

Check whether bzip2 is installed.
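
A minimal sketch of that check in Python, assuming the host has a system Python 3 built with the standard `bz2` module. The function name `bzip2_status` is made up for illustration. It only tests the host's own tools; the project's bundled environment may unpack its archives differently.

```python
import bz2
import shutil


def bzip2_status():
    """Return (cli_path, roundtrip_ok) for the local bzip2 support."""
    # Path of the bzip2 command-line tool, or None if it is not on PATH.
    cli_path = shutil.which("bzip2")

    # Compress and decompress a small payload with Python's bz2 module.
    payload = b"GPUGrid test payload " * 1000
    roundtrip_ok = bz2.decompress(bz2.compress(payload)) == payload
    return cli_path, roundtrip_ok


if __name__ == "__main__":
    path, ok = bzip2_status()
    print("bzip2 CLI:", path or "not found on PATH")
    print("bz2 round trip:", "ok" if ok else "FAILED")
```

If the command-line tool is missing, installing the distribution's bzip2 package is the usual fix. If the round trip fails on a machine that otherwise looks fine, faulty RAM is worth ruling out.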

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 203,127
Message 59017 - Posted: 23 Jul 2022 | 21:54:13 UTC

Windows systems are also getting: Exit status 195 (0xc3) EXIT_CHILD_FAILED
It's a common theme in all these tasks.

zooxit
Joined: 4 Jul 21
Posts: 23
Credit: 4,994,153,142
RAC: 24,640,293
Message 59186 - Posted: 2 Sep 2022 | 18:22:46 UTC

Hi,
thanks for the answers (I didn't have much time to troubleshoot lately, hence the late response...)

So, still troubleshooting, still not solved:
- tried Ubuntu, and now Win10, instead of Debian
- found that one RAM stick was faulty - removed it
- tried removing some of the graphics cards from the rig (originally there were 4 cards) - now I am down to only one graphics card on the troublesome computer (waiting for results)

... nothing helped. On this machine (GTX 1070) I still get almost no Valid tasks, while my other two machines (GTX 1070 and RTX 3080) crunch as expected.
I guess the fact that changing the OS did not affect the outcome (still almost no Valid tasks) means there is a hardware problem(?)


Any ideas?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,493,857,483
RAC: 71,175,505
Message 59187 - Posted: 2 Sep 2022 | 21:10:10 UTC - in response to Message 59186.

Most of your errors seem to occur in the extraction phase.

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,919,581,959
RAC: 6,279,126
Message 59188 - Posted: 2 Sep 2022 | 21:33:48 UTC

Check your storage system for a flaky hard drive/SSD/NVMe drive, bad cabling, or an incorrect transmission speed.

If your storage shares the PCIe bus with your GPUs, check that you are not trying to drive PCIe Gen2 storage alongside PCIe Gen3 GPUs at the same time.

As I mentioned previously, and as Ian also commented, your issue is in the decompression phase of the tasks, where the Python libraries get expanded.

Check that your storage is not running out of room. Check that your swapfile or swap partition is adequate.
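
A rough sketch of those two checks in Python, assuming a Linux host where swap figures can be read from `/proc/meminfo`. The helper names `free_disk_gib` and `parse_swap_gib` are made up for illustration, and the path passed in should be the BOINC data directory on your system (the current directory is used here only as a placeholder).

```python
import shutil


def free_disk_gib(path="."):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def parse_swap_gib(meminfo_text):
    """Extract (swap_total, swap_free) in GiB from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            # Lines look like "SwapTotal:  8388604 kB" (kB means KiB here).
            values[key] = int(rest.split()[0]) / 2**20
    return values.get("SwapTotal"), values.get("SwapFree")


if __name__ == "__main__":
    print(f"free disk: {free_disk_gib('.'):.1f} GiB")
    try:
        with open("/proc/meminfo") as f:
            total, free = parse_swap_gib(f.read())
        print("swap total/free (GiB):", total, free)
    except FileNotFoundError:
        print("/proc/meminfo not available (not a Linux host?)")
```

On Windows the page file size has to be checked in the system settings instead; only the disk check above carries over.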

zooxit
Joined: 4 Jul 21
Posts: 23
Credit: 4,994,153,142
RAC: 24,640,293
Message 59193 - Posted: 3 Sep 2022 | 14:54:48 UTC

Thanks for the answers.

Debian, Ubuntu and now Windows were all installed on different disks (all SSDs, admittedly rather old ones, 128-512GB in size). Is this size enough?

Will check the bios settings.

Is it normal that project status says there are 700+ tasks available, but my machine received less than an hour of work in almost two days (only 6 tasks, mostly errors)?

Also another question:
The computer with the RTX 3080 (running 24/7) has averaged about 130,000 RAC lately - isn't that low for this graphics card?

gemini8
Joined: 3 Jul 16
Posts: 31
Credit: 1,210,300,176
RAC: 2,007,062
Message 59198 - Posted: 7 Sep 2022 | 5:32:15 UTC
Last modified: 7 Sep 2022 | 5:39:19 UTC

Regarding the 3080 and its credits:
First, you can't compare the Python tasks with ACEMD or the earlier Short- and Long-runs.
They work differently, and they are credited differently.
Second, I'm running a 'flat-footed' 1080 at about 180k credits per day.
(I call it flat-footed because I limit its power to 110 watts using nvidia-smi to hopefully give it a longer life.)
When monitoring the utilisation of the GPU I saw that it is hardly used, so I set the machine to run two GPUGrid Pythons beside one MilkyWay work-unit. I chose MilkyWay because that project's work-units also don't fill up the 1080 completely, so the GPUGrid work-units should have enough space to breathe. You could choose any other project that doesn't run at 100% for the same purpose. The credits for the second project come on top, and since you can't really compare different projects' crediting for the same amount of work, you should be able to get more than the 55k to 60k credits MilkyWay gives me.
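
For anyone who wants to watch utilisation and power the same way, here is a small sketch in Python that reads the figures from `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that the card reports all three fields as numbers (some cards print "[N/A]" for power, which this parser does not handle). The function names are made up for illustration.

```python
import subprocess

QUERY = "utilization.gpu,power.draw,power.limit"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` rows into dicts."""
    rows = []
    for line in text.strip().splitlines():
        util, draw, limit = (field.strip() for field in line.split(","))
        rows.append({
            "util_percent": float(util),
            "power_draw_w": float(draw),
            "power_limit_w": float(limit),
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    try:
        for index, gpu in enumerate(query_gpus()):
            print(index, gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print("could not query nvidia-smi:", err)
```

The power limit itself is set outside this script, typically with `nvidia-smi -pl 110` run as root. A GPU that sits far below 100% utilisation while tasks are running has room for a second task.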
____________
Greetings, Jens

zooxit
Joined: 4 Jul 21
Posts: 23
Credit: 4,994,153,142
RAC: 24,640,293
Message 59263 - Posted: 17 Sep 2022 | 7:19:28 UTC

Thanks to everyone for help and ideas.

So, the problem was first of all one faulty stick of RAM, and the other problem was, as mentioned by others, disk space and the page file. I bought new RAM, swapped in a 500GB disk and increased the swap file to 50GB or more.
Far fewer errors.

I also realized that these tasks are very CPU intensive:
- Ryzen 7 3700X cannot 'feed' two GTX 1070s - all CPU cores are at maximum and the graphics cards are used at about 20%
- Ryzen 9 5900X uses 60-80% of its resources to 'feed' one RTX 3080 (GPU is used at nearly 100%)

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,493,857,483
RAC: 71,175,505
Message 59264 - Posted: 17 Sep 2022 | 16:09:50 UTC - in response to Message 59263.

You should switch back to Linux. The Windows app seems really slow in comparison.

I’m running 2x RTX 3060 with 3x tasks on each, for a total of 6 tasks on the system. Individual task times are about 13hrs, or about 4hrs:20min per task effective. (That’s for full-length runs, excluding ones that end early.)

This system runs on an EPYC 7443P 24-core processor: the same Zen 3 architecture as your 5900X, but with twice the cores. Running 6x tasks uses ~80-90% of the CPU, and ~56GB of system memory.
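
The effective figure is just the wall-clock time divided by the number of tasks sharing a GPU. A quick Python sketch of that arithmetic, with made-up helper names and assuming all concurrent tasks take about the same time:

```python
def effective_time(wall_hours, concurrent_tasks):
    """Return (hours, minutes) of effective run time per task."""
    total_minutes = round(wall_hours * 60 / concurrent_tasks)
    return divmod(total_minutes, 60)


def tasks_per_day(wall_hours, total_tasks):
    """Tasks completed per 24 hours when `total_tasks` run side by side."""
    return 24 / wall_hours * total_tasks


if __name__ == "__main__":
    hours, minutes = effective_time(13, 3)   # 3 tasks per GPU, 13 h each
    print(f"effective time per task: {hours}h {minutes}min")
    print(f"system throughput: {tasks_per_day(13, 6):.1f} tasks/day")
```

With 13-hour tasks and 3 per GPU this gives 4h 20min per task, matching the figure quoted above.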

You should be able to run 2x tasks on your 3080 for good completion times under Linux.

zooxit
Joined: 4 Jul 21
Posts: 23
Credit: 4,994,153,142
RAC: 24,640,293
Message 59295 - Posted: 22 Sep 2022 | 7:39:39 UTC

Thanks for the hints. Will try Linux again.
