Message boards : Graphics cards (GPUs) : Unit gets stuck at 2% when resuming after suspending

Author Message
jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 52,079,824
RAC: 28,390
Message 58989 - Posted: 9 Jul 2022 | 5:18:55 UTC

My son suspends the projects when he uses Clip Studio because when they're running (GPUGrid, Rosetta@home, etc.) they tend to bog down performance.

If a GPUGrid unit is being processed when he suspends everything, then whenever the projects are resumed the GPUGrid unit goes back to 2%, regardless of how far along it was before suspending, and stays there indefinitely. It never gets above 2% and I need to abort it in order to get a new unit.

Any ideas?

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58990 - Posted: 9 Jul 2022 | 5:23:14 UTC - in response to Message 58989.

Don't run Windows. Sorry . . . . I have no solution. On Linux, tasks can be stopped and restarted at will and they resume from their last checkpoint.

The percentage complete will reset to 2% upon restart, but after a couple of minutes it jumps forward to the completion percentage it was at before being suspended or stopped. The task will then run to completion and report.

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,042,450,871
RAC: 984,907
Message 58992 - Posted: 9 Jul 2022 | 18:48:41 UTC

I took a quick look and I would say you're running out of memory, which is forcing it to use swap space, and you're running out of that too.

16GB of physical memory is really tight for the GPUgrid Python app. It will reserve that much when a task starts. It actually uses about 10GB of that.

You can possibly get it to work if you increase your maximum swap space to 1x or 1.5x of your main memory, but you really can't run much else.

If you suspend and restart GPUgrid it should replay from the last checkpoint, so it will stay at 2% for quite a while doing that. If it's not, then it's stalled.

You can look at the stderr file in the slot the task is running in. Click on the Properties button for the task and look for the slot directory number.

Then go to the BOINC home folder under the ProgramData folder, or wherever you put it. Go to slots and then the slot number. Open the stderr file, scroll down toward the bottom and look at the messages.

If it still says Created Learner then it should be OK. If there are errors then it's jammed up and won't get any further.
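
As a rough illustration of the check described above, the following Python sketch scans the text of a slot's stderr file for the "Created Learner" message and for lines that look like errors. The function name, and the idea of treating any line containing "error" or "Traceback" as a problem, are assumptions made for this example and not something the GPUGrid application defines.

```python
# Marker string quoted in this thread.
OK_MARKER = "Created Learner"
# Assumption: lines containing these words are treated as signs of trouble.
ERROR_HINTS = ("error", "traceback")


def summarize_stderr(text, tail_lines=40):
    """Report whether the OK marker appears and which recent lines look like errors."""
    lines = text.splitlines()
    tail = lines[-tail_lines:]
    error_lines = [
        line for line in tail
        if any(hint in line.lower() for hint in ERROR_HINTS)
    ]
    return {
        "created_learner": any(OK_MARKER in line for line in lines),
        "error_lines": error_lines,
    }
```

To use it, read the stderr file from the slot directory into a string and pass it in. An empty "error_lines" list together with "created_learner" set to True is only a hint that the task is still healthy, not a guarantee.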

I would suggest not running anything that uses a lot of your system memory or GPU at the same time. Switching back and forth causes problems as well.

Unfortunately the GPUgrid Python project is somewhat complicated and is tough to run on your average home PC. I even gave up running this on my older HP workstations because they don't have enough memory.

I have been somewhat successful getting it to work on more powerful HP Windows servers but it's still problematic and has about a 40% error rate.

Let us know if you need more help to try figuring this out.

jkdma
Joined: 21 Mar 20
Posts: 6
Credit: 52,079,824
RAC: 28,390
Message 58997 - Posted: 10 Jul 2022 | 16:36:55 UTC - in response to Message 58992.

Thanks.

Flo
Joined: 28 Apr 19
Posts: 3
Credit: 24,310,683
RAC: 27,486
Message 59013 - Posted: 23 Jul 2022 | 10:04:39 UTC

Came here because I have the same problem as the OP.
At first it seemed like a glitch where it catches itself and jumps to the "real percentage", but recent WUs have unfortunately just crashed back to two percent when I pause the task before going to sleep. It's frustrating to lose 10 hours of GPU crunch time on an RTX 3080 Ti. I hope this gets fixed; running entirely on Linux is unfortunately not an option for my gaming setup.

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,042,450,871
RAC: 984,907
Message 59014 - Posted: 23 Jul 2022 | 19:22:10 UTC - in response to Message 59013.
Last modified: 23 Jul 2022 | 19:27:11 UTC

The 5 tasks you ran have completed successfully so it appears things are working OK. It's best not to suspend tasks but as long as they are restarting from the last checkpoint it should be fine.

You can tell that a task was suspended and restarted when you see the Detected memory leaks! message. As long as it gets back to the Define train loop state and Finished! it's working.

It does take a while to replay from the last checkpoint, so it will stay at 2% until it gets back to where it left off. You are not really losing compute time, but you are losing setup or wait time if you want to look at it that way.

clcarter1999
Joined: 9 Apr 20
Posts: 3
Credit: 125,011,493
RAC: 0
Message 59055 - Posted: 4 Aug 2022 | 15:23:59 UTC

When I get a GPUGRID Python task, it usually fails within a few minutes. Occasionally, it will run for several hours and eventually fail at about 2% complete. ACEMD tasks usually complete without problems on the rare occasion they are available.

I am using a PC with 16GB of memory, so it may be that the Python tasks are just running out of memory. Rosetta Python tasks usually complete eventually.

I may just need to abandon BOINC in favor of Folding@home. That system's tasks "just work".
Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 59056 - Posted: 4 Aug 2022 | 17:24:13 UTC - in response to Message 59055.
Last modified: 4 Aug 2022 | 17:25:47 UTC

The problem lies with Windows and not the tasks. Look in the Number Crunching or News forums for the discussion to get around Windows peculiarities with how it handles memory reservation.

A host with only 16GB of memory CAN be successful in crunching Python on GPU tasks IF you give the task enough virtual memory in the size of the pagefile.

There are Windows hosts with only 8GB running the tasks successfully.

Linux users do not have any issue as Linux does a better job with virtual memory reservation.

The Windows pagefile needs to be somewhere around 35-50GB and set as a custom size and NOT system or recommended size.
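
To compare the two sizing suggestions in this thread, here is a small Python sketch. It assumes the 1x to 1.5x rule of thumb from the earlier post and the 35-50GB custom range given above. The function names are hypothetical, and the numbers are forum advice, not official requirements.

```python
def swap_rule_of_thumb(ram_gb):
    """Return the (low, high) swap size in GB for the 1x to 1.5x suggestion."""
    return (1.0 * ram_gb, 1.5 * ram_gb)


def in_custom_range(pagefile_gb, low_gb=35, high_gb=50):
    """Check a pagefile size against the 35-50GB custom range suggested above."""
    return low_gb <= pagefile_gb <= high_gb


if __name__ == "__main__":
    low, high = swap_rule_of_thumb(16)
    print(f"1x to 1.5x of 16GB RAM: {low:.0f}GB to {high:.0f}GB")
    print("Upper bound inside the 35-50GB range:", in_custom_range(high))
```

For a 16GB host the rule of thumb gives 16GB to 24GB, which falls below the 35-50GB range, so the two suggestions are not equivalent.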

Flo
Joined: 28 Apr 19
Posts: 3
Credit: 24,310,683
RAC: 27,486
Message 59079 - Posted: 8 Aug 2022 | 6:08:34 UTC - in response to Message 59056.

Currently, I have a Python task that is making very slow progress, about 2 percent in 8 hours. If I suspend the task and restart it, it starts from zero. I can see that because the percentage immediately resets to a very low value like 0.013%, so it seems to discard the previous progress -_-

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 59082 - Posted: 8 Aug 2022 | 17:14:56 UTC - in response to Message 59079.

You really shouldn't do that. The very beginning of the task is doing nothing but unpacking the task and getting the setup files ready for computation.

It hasn't done any computation at this stage.

You keep interrupting the task setup by restarting it in the early setup stages every time. Don't do that.

Don't worry about the progress meter.

Flo
Joined: 28 Apr 19
Posts: 3
Credit: 24,310,683
RAC: 27,486
Message 59084 - Posted: 8 Aug 2022 | 19:36:08 UTC - in response to Message 59082.

8 hours after the task had started shouldn't be the beginning phase anymore imho.
I have the machine in my living room and can't run it 24/7 due to heat and noise. I have discarded that weird slow job that made no progress or noticeable load and got one that works normally.

robertmiles
Joined: 16 Apr 09
Posts: 502
Credit: 590,520,933
RAC: 32,241
Message 59085 - Posted: 9 Aug 2022 | 0:17:27 UTC
Last modified: 9 Aug 2022 | 0:17:42 UTC

What I've seen seems to mean that tasks resumed from checkpoints will, at first, show 2% progress. However, when it reaches its next checkpoint, it will resume showing the actual progress, including whatever was recovered from the first checkpoint.

One peculiarity I just saw: A task was showing 100% progress but still running. I spent a few minutes trying to decide if this was a hung task. Then all the files in the slot directory disappeared, and the task started sending a file back to the server.

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 59086 - Posted: 9 Aug 2022 | 7:09:20 UTC - in response to Message 59085.

You did the correct thing and exercised patience. The task is still in its finishing-up stages after the task completion hits 100%.

If you would pull up a System Monitor at the time it hits 100% and look at the 32 spawned Python processes, you would see them one by one remove themselves from running until they all have finished and ended.

The spawned processes are reporting back to the output file and then the output files are uploaded and the files are removed from the slot.

If you prematurely end the task at 100% by cancelling or interrupting it, you would be throwing the computation away for no reason.

kotenok2000
Joined: 18 Jul 13
Posts: 48
Credit: 11,353,293
RAC: 5,690
Message 59087 - Posted: 9 Aug 2022 | 10:18:10 UTC - in response to Message 59086.

Try installing MSYS2 and running tail -F C:/programdata/BOINC/slots/*/wrapper_run.out
and tail -F E:/programdata/BOINC/slots/*/stderr.txt from C:\tools\msys64\usr\bin\
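
For anyone without MSYS2, a minimal Python sketch of a similar idea is shown below. It prints the last few lines of each matching slot file once, rather than following the file continuously as tail -F does. The function name is hypothetical, and the glob pattern is copied from the command above, so adjust it to wherever your BOINC data directory lives.

```python
import glob
from collections import deque


def last_lines(pattern, count=10):
    """Map each file matching `pattern` to its last `count` lines."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the read from failing on odd bytes in log files.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            result[path] = [line.rstrip("\n") for line in deque(handle, maxlen=count)]
    return result


if __name__ == "__main__":
    # Pattern taken from the post above; it is an assumption about your install.
    pattern = "C:/programdata/BOINC/slots/*/wrapper_run.out"
    for path, lines in last_lines(pattern).items():
        print(f"==> {path} <==")
        print("\n".join(lines))
```

Running it again after a few minutes shows whether new lines are still being written, which is a quick way to tell a slow task from a stalled one.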

rtX
Joined: 2 Apr 09
Posts: 10
Credit: 63,791,416
RAC: 5,089
Message 59166 - Posted: 28 Aug 2022 | 16:21:40 UTC

I aborted after a couple of days stuck at 2% with a projected completion 46 days away and deadline of 5 days away. I aborted the downloaded replacement, too, when that showed 46 days to finish with another 5 day deadline. I've also said 'no more tasks' for now, and have material from other projects.

Much as I like GPUGrid and much as I like Linux, I have to operate within the Windows environment, and I want my spare processing capability to be used productively and without too much intervention on my part.

Keith Myers
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 59167 - Posted: 28 Aug 2022 | 18:55:54 UTC - in response to Message 59166.

You should have kept processing and just ignored the projected completion times.
BOINC does not know how to handle these dual gpu-cpu Python tasks and makes completely unbelievable and inaccurate estimated completion dates and times.

ptgst3Kfg7MDZf8ShnWCBMMnh...
Joined: 7 May 20
Posts: 2
Credit: 17,554,704
RAC: 56,398
Message 59183 - Posted: 1 Sep 2022 | 23:47:18 UTC - in response to Message 58989.

I started having this issue over a month ago, and finally worked it out today. I wonder if some different, larger units that use more memory have come through, because I hadn't noticed this issue in the years prior.

In the stderr file, I noticed memory errors, followed by what looked to me like an abort. I'm not sure why the run didn't simply abort at that point; I suppose it may have been waiting for the memory issues to be resolved. The errors in my stderr also mentioned a memory 'leak', and I'm yet to double-check this, but....

I set all the other running projects to not seek any more work and let them run dry. I also removed projects that were still in BOINC but that I'm not active in. This freed up a lot of memory. Even though I never saw the memory fill up, it possibly did at some point, after which the task would get 'stuck' and give up without aborting the run.

Now overnight it's got to 47% and is going strong. I think Milkyway and Einstein were also using a lot of memory and some GPU. The three together, alongside Rosetta and SiDock, did not play nicely with my GT 1030 and 16 GB RAM. On this PC, where I want to run GPUGRID, I will slowly try running another project simultaneously to test it out, so that another project can kick in if GPUGRID runs out of work.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Message 59184 - Posted: 2 Sep 2022 | 0:15:17 UTC - in response to Message 59183.

tasks in years prior used the acemd3 application, which used less than 1GB VRAM in most cases.

the newer Python application needs at least about 3GB from what I've seen. your 2GB GT 1030 doesn't have enough memory for these tasks.

ignore the message about memory leaks in the stderr.txt. it's always there from the Windows application and doesn't indicate anything wrong.