Unit gets stuck at 2% when resuming after suspending

Message boards : Graphics cards (GPUs) : Unit gets stuck at 2% when resuming after suspending

Author	Message
jkdma Send message Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0 Level Scientific publications	Message 58989 - Posted: 9 Jul 2022 \| 5:18:55 UTC
	My son suspends the projects when he uses Clip Studio because when they're running (GPUGrid, Rosetta@Home, etc.) they tend to bog down performance. If a GPUGrid unit is being processed when he suspends everything, then whenever the projects are resumed the GPUGrid unit goes back to 2%, regardless of how far along it was before suspending, and stays there indefinitely. It never gets above 2% and I need to abort it in order to get a new unit. Any ideas?

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,966,070 RAC: 13,485,092 Level Scientific publications	Message 58990 - Posted: 9 Jul 2022 \| 5:23:14 UTC - in response to Message 58989.
	Don't run Windows. Sorry . . . . I have no solution. On Linux, tasks can be stopped and restarted at will and they resume from their last checkpoint. The percentage complete will reset to 2% upon restart, but after a couple of minutes it jumps forward to the completion percentage it was at before being suspended or stopped. The task will then run to completion and report.

jjch Send message Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488 Level Scientific publications	Message 58992 - Posted: 9 Jul 2022 \| 18:48:41 UTC
	I took a quick look and I would say you're running out of memory, which is forcing it to use swap space, and you're running out of that too. 16GB of physical memory is really tight for the GPUGrid Python app. It will reserve that much when a task starts, though it actually uses about 10GB of that. You can possibly get it to work if you increase your maximum swap space to 1x or 1.5x of your main memory, but you really can't run much else. If you suspend and restart GPUGrid it should replay from the last checkpoint, so it will stay at 2% for quite a while doing that. If it's not replaying, then it's stalled. You can look at the stderr file in the slot the task is running in. Click on the Properties button for the task and look for the directory number for the slot. Then go to the BOINC home folder under the Program Data folder or wherever you put it. Go to slots and the slot #. Open the stderr file, scroll down toward the bottom and look at the messages. If it still says Created Learner then it should be OK. If there are errors then it's jammed up and won't get any further. I would suggest not running anything that uses a lot of your system memory or GPU at the same time. Switching back and forth causes problems as well. Unfortunately the GPUGrid Python project is somewhat complicated and is tough to run on your average home PC. I even gave up running this on my older HP workstations because they don't have enough memory. I have been somewhat successful getting it to work on more powerful HP Windows servers, but it's still problematic and has about a 40% error rate. Let us know if you need more help figuring this out.
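The stderr check described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the file name stderr.txt is taken from a later post in this thread, and the data directory and slot number are whatever your own BOINC installation shows in the task's Properties.

```python
from pathlib import Path


def tail_of_stderr(boinc_data_dir, slot, n_lines=20, filename="stderr.txt"):
    """Return the last n_lines of the stderr file for one BOINC slot.

    boinc_data_dir and slot are supplied by the caller; nothing here
    guesses where BOINC is installed.
    """
    path = Path(boinc_data_dir) / "slots" / str(slot) / filename
    text = path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-n_lines:]


def looks_healthy(lines, marker="Created Learner"):
    """True if the marker text from the post appears in the given lines.

    Assumption: the marker is matched as a plain, case-sensitive substring.
    """
    return any(marker in line for line in lines)
```

Calling tail_of_stderr with your data directory and the slot number from the task's Properties, then passing the result to looks_healthy, gives a quick yes/no. Reading the actual messages is still the better check, since errors near the bottom mean the task is stuck.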

jkdma Send message Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0 Level Scientific publications	Message 58997 - Posted: 10 Jul 2022 \| 16:36:55 UTC - in response to Message 58992.
	Thanks.

Flo Send message Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 954,849 Level Scientific publications	Message 59013 - Posted: 23 Jul 2022 \| 10:04:39 UTC
	Came here because I have the same problem as the OP. At first, it seemed like a glitch where it catches itself and jumps to the "real percentage", but recent WUs have unfortunately just crashed back to two percent when I pause the task before going to sleep. It's frustrating to lose 10 hours of GPU crunch time on an RTX 3080 Ti. Hope this gets fixed; running entirely on Linux is unfortunately not an option for my gaming setup.

jjch Send message Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488 Level Scientific publications	Message 59014 - Posted: 23 Jul 2022 \| 19:22:10 UTC - in response to Message 59013. Last modified: 23 Jul 2022 \| 19:27:11 UTC
	The 5 tasks you ran have completed successfully, so it appears things are working OK. It's best not to suspend tasks, but as long as they are restarting from the last checkpoint it should be fine. You can tell that a task was suspended and restarted when you see the Detected memory leaks! message. As long as it gets back to the Define train loop state and Finished! it's working. It does take a while to replay from the last checkpoint, so it will stay at 2% until it gets back to where it left off. You are not really losing compute time, but you are losing setup or wait time if you want to look at it that way.
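A rough way to apply the marker check above to a task's stderr text, as a sketch only. The three marker strings are copied from the post; treating them as plain case-sensitive substrings, and counting each Detected memory leaks! line as one restart, are assumptions of this example. The function name is hypothetical.

```python
def summarize_stderr(text):
    """Summarize restart/resume markers found in a task's stderr text."""
    lines = text.splitlines()
    # Each occurrence is treated as one suspend/restart (an assumption).
    restarts = [i for i, line in enumerate(lines)
                if "Detected memory leaks!" in line]
    last_restart = restarts[-1] if restarts else -1
    # Only look at what was logged after the most recent restart.
    after = lines[last_restart + 1:]
    return {
        "restarts": len(restarts),
        "resumed": any("Define train loop" in line for line in after),
        "finished": any("Finished!" in line for line in after),
    }
```

If "resumed" is False long after a restart, the task may not have made it back to the training loop. That is a hint to open the file and read it, not a verdict.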

clcarter1999 Send message Joined: 9 Apr 20 Posts: 3 Credit: 288,412,243 RAC: 3,428,535 Level Scientific publications	Message 59055 - Posted: 4 Aug 2022 \| 15:23:59 UTC
	When I get a GPUGRID Python task, it usually fails within a few minutes. Occasionally, it will run for several hours and eventually fail at about 2% complete. ACEMD tasks usually complete without problems on the rare occasion they are available. I am using a PC with 16GB of memory, so it may be that the Python tasks are just running out of memory. Rosetta Python tasks usually complete eventually. I may just need to abandon BOINC in favor of Folding-at-Home. That system's tasks "just work".

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,966,070 RAC: 13,485,092 Level Scientific publications	Message 59056 - Posted: 4 Aug 2022 \| 17:24:13 UTC - in response to Message 59055. Last modified: 4 Aug 2022 \| 17:25:47 UTC
	The problem lies with Windows and not the tasks. Look in the Number Crunching or News forums for the discussion to get around Windows peculiarities with how it handles memory reservation. A host with only 16GB of memory CAN be successful in crunching Python on GPU tasks IF you give the task enough virtual memory in the size of the pagefile. There are Windows hosts with only 8GB running the tasks successfully. Linux users do not have any issue as Linux does a better job with virtual memory reservation. The Windows pagefile needs to be somewhere around 35-50GB and set as a custom size and NOT system or recommended size.
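The Windows custom pagefile dialog takes its initial and maximum sizes in megabytes, so the 35-50GB range above has to be converted before it is typed in. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the default values are simply the range quoted in the post, not an official requirement.

```python
def custom_pagefile_mb(low_gb=35, high_gb=50):
    """Convert a pagefile range in GB to (initial, maximum) sizes in MB.

    Uses 1 GB = 1024 MB. Defaults are the range quoted in the post.
    """
    if low_gb <= 0 or high_gb < low_gb:
        raise ValueError("expected 0 < low_gb <= high_gb")
    return int(low_gb * 1024), int(high_gb * 1024)


initial_mb, maximum_mb = custom_pagefile_mb()
print(f"Initial size: {initial_mb} MB, maximum size: {maximum_mb} MB")
```

With the defaults this works out to 35840 MB and 51200 MB.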

Flo Send message Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 954,849 Level Scientific publications	Message 59079 - Posted: 8 Aug 2022 \| 6:08:34 UTC - in response to Message 59056.
	Currently, I have a Python task which is really slow to make progress, like 2 percent in 8 hours. If I suspend the task and restart it, it starts from zero. I can see that because the percentage directly resets to a very low value like 0.013%, so it seems to discard the previous progress -_-

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,966,070 RAC: 13,485,092 Level Scientific publications	Message 59082 - Posted: 8 Aug 2022 \| 17:14:56 UTC - in response to Message 59079.
	You really shouldn't do that. The very beginning of the task is doing nothing but unpacking the task and getting the setup files ready for computation. It hasn't done any computation at this stage. You keep interrupting the task setup by restarting it in the early setup stages every time. Don't do that. Don't worry about the progress meter.

Flo Send message Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 954,849 Level Scientific publications	Message 59084 - Posted: 8 Aug 2022 \| 19:36:08 UTC - in response to Message 59082.
	8 hours after the task started shouldn't be the beginning phase anymore, imho. I have the machine in my living room and can't run it 24/7 due to heat and noise. I have discarded that weird slow job that made no progress or noticeable load and got one that works normally.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,070,933 RAC: 197,388 Level Scientific publications	Message 59085 - Posted: 9 Aug 2022 \| 0:17:27 UTC Last modified: 9 Aug 2022 \| 0:17:42 UTC
	From what I've seen, tasks resumed from checkpoints will, at first, show 2% progress. However, when a task reaches its next checkpoint, it will resume showing the actual progress, including whatever was recovered from the first checkpoint. One peculiarity I just saw: a task was showing 100% progress but still running. I spent a few minutes trying to decide if this was a hung task. Then all the files in the slot directory disappeared, and the task started sending a file back to the server.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,966,070 RAC: 13,485,092 Level Scientific publications	Message 59086 - Posted: 9 Aug 2022 \| 7:09:20 UTC - in response to Message 59085.
	You did the correct thing and exercised patience. The task is still in the finishing-up stages after the task completion hits 100%. If you pull up a System Monitor at the time it hits 100% and look at the 32 spawned Python processes, you will see them remove themselves from running one by one until they have all finished and ended. The spawned processes are reporting back to the output file, and then the output files are uploaded and the files are removed from the slot. If you prematurely end the task at 100% by cancelling or interrupting it, you would be throwing the computation away for no reason.

kotenok2000 Send message Joined: 18 Jul 13 Posts: 78 Credit: 125,938,293 RAC: 1,493,363 Level Scientific publications	Message 59087 - Posted: 9 Aug 2022 \| 10:18:10 UTC - in response to Message 59086.
	Try installing MSYS2 and running tail -F C:/programdata/BOINC/slots//wrapper_run.out and tail -F E:/programdata/BOINC/slots//stderr.txt from C:\tools\msys64\usr\bin\
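If installing MSYS2 is not an option, roughly the same "keep watching the file" behaviour can be approximated in Python. This is a sketch, not a replacement for tail: the function names are made up for this example, the 2-second polling interval is arbitrary, and the file path is whatever you pass in (the slot number is not filled in here either).

```python
import os
import time


def read_new_lines(path, offset=0):
    """Return (new_lines, new_offset) for complete lines added since offset."""
    size = os.path.getsize(path)
    if size < offset:
        # File was truncated or replaced (new task in the slot): start over.
        offset = 0
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Keep only complete lines; a partial last line is picked up next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def follow(path, interval=2.0):
    """Print new lines as they appear, until interrupted with Ctrl+C."""
    offset = 0
    while True:
        try:
            lines, offset = read_new_lines(path, offset)
        except FileNotFoundError:
            lines = []  # File not there yet; keep waiting, as tail -F does.
        for line in lines:
            print(line)
        time.sleep(interval)
```

Calling follow() with the full path to the slot's stderr.txt or wrapper_run.out prints new messages as the task writes them.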

rtX Send message Joined: 2 Apr 09 Posts: 10 Credit: 80,975,982 RAC: 8,420 Level Scientific publications	Message 59166 - Posted: 28 Aug 2022 \| 16:21:40 UTC
	I aborted after a couple of days stuck at 2% with a projected completion 46 days away and deadline of 5 days away. I aborted the downloaded replacement, too, when that showed 46 days to finish with another 5 day deadline. I've also said 'no more tasks' for now, and have material from other projects. Much as I like GPUGrid and much as I like Linux, I have to operate within the Windows environment, and I want my spare processing capability to be used productively and without too much intervention on my part.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,966,070 RAC: 13,485,092 Level Scientific publications	Message 59167 - Posted: 28 Aug 2022 \| 18:55:54 UTC - in response to Message 59166.
	You should have kept processing and just ignored the projected completion times. BOINC does not know how to handle these dual GPU-CPU Python tasks and produces wildly inaccurate estimated completion dates and times.

ptgst3Kfg7MDZf8ShnWCBMMnh... Send message Joined: 7 May 20 Posts: 2 Credit: 351,659,704 RAC: 1,599,852 Level Scientific publications	Message 59183 - Posted: 1 Sep 2022 \| 23:47:18 UTC - in response to Message 58989.
	I started having this issue over a month ago, and finally worked it out today. I wonder if some different, larger units have come through that use more memory, because I hadn't noticed this issue for years prior. In the stderr file, I noticed there were memory errors, followed by what looked to me like aborting. Not sure why it didn't just abort the run at that point; I suppose it may have been hoping that the memory issues would get sorted. The errors in my stderr also mentioned a memory 'leak', and I've yet to double-check this, but here is what I did: I set all other running projects to not seek any more work and let them run dry, and I removed projects that were still in BOINC but that I'm not active in. This freed up a lot of memory. Even though I never saw the memory get full, it possibly did at some point, when the task would then get 'stuck' and abandon, but not abort, the run. Now overnight it's got to 47% and is going strong. I think running Milkyway or Einstein was also using a lot of memory and some GPU. The three together, alongside Rosetta and SiDock, did not play nicely with my GT 1030 and 16 GB RAM. On this PC where I want to run GPUGRID I will slowly try running another project simultaneously to test it out, just in case GPUGRID runs out of work so another project can kick in.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 59184 - Posted: 2 Sep 2022 \| 0:15:17 UTC - in response to Message 59183.
	Tasks in years prior used the acemd3 application, which used less than 1GB of VRAM in most cases. The newer Python application needs at least about 3GB from what I've seen. Your 2GB GT 1030 doesn't have enough memory for these tasks. Ignore the message about memory leaks in the stderr.txt. It's always there from the Windows application and doesn't indicate anything wrong.
