Author |
Message |
jkdma Send message
Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0 Level
Scientific publications
|
My son suspends the projects when he uses Clip Studio because when they're running (GPUGrid, Rosetta@Home, etc.) they tend to bog down performance.
If there is a GPUGrid unit being processed when he suspends everything, then whenever the projects are resumed, the GPUGrid unit goes back to 2% regardless of how far along it was before suspending, and stays there indefinitely. It never gets above 2% and I need to abort it in order to get a new unit.
Any ideas? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1341 Credit: 7,689,180,588 RAC: 13,303,833 Level
Scientific publications
|
Don't run Windows. Sorry . . . I have no solution. On Linux, tasks can be stopped and restarted at will, and they resume from their last checkpoint.
The percentage complete resets to 2% upon restart, but after a couple of minutes it jumps forward to the completion percentage it was at before being suspended or stopped.
And it will run to completion and report. |
|
|
jjch Send message
Joined: 10 Nov 13 Posts: 101 Credit: 15,586,400,388 RAC: 4,234,677 Level
Scientific publications
|
I took a quick look and I would say you're running out of memory, which is forcing it to use swap space, and you're running out of that too.
16GB of physical memory is really tight for the GPUGrid Python app. It will reserve that much when a task starts; it actually uses about 10GB of it.
You can possibly get it to work if you increase your maximum swap space to 1x or 1.5x of your main memory, but you really can't run much else.
If you suspend and restart GPUGrid it should replay from the last checkpoint, so it will stay at 2% for quite a while while doing that. If it's not replaying, it's stalled.
You can look at the stderr file in the slot the task is running in. Click the Properties button for the task and look for the slot directory number.
Then go to the BOINC home folder under the ProgramData folder, or wherever you put it. Go to slots and then the slot number. Open the stderr file, scroll down toward the bottom, and look at the messages.
If it still says Created Learner then it should be OK. If there are errors then it's jammed up and won't get any further.
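The checking steps above can be sketched in a few lines of Python. This is only a sketch: the data directory is the default Windows install path, and the error keywords are illustrative guesses, not an exhaustive list.

```python
# Sketch: report the status of each BOINC slot from the tail of its stderr file.
# Assumptions: default Windows data directory; illustrative error keywords.
from pathlib import Path

BOINC_SLOTS = Path(r"C:\ProgramData\BOINC\slots")  # assumed default install path

def check_slot(slot_dir: Path, tail_lines: int = 20) -> str:
    """Return a one-line status for a slot based on the tail of stderr.txt."""
    stderr = slot_dir / "stderr.txt"
    if not stderr.exists():
        return f"slot {slot_dir.name}: no stderr.txt"
    tail = "\n".join(stderr.read_text(errors="replace").splitlines()[-tail_lines:])
    if "Created Learner" in tail:
        return f"slot {slot_dir.name}: OK (Created Learner seen)"
    if any(word in tail for word in ("Error", "Traceback", "MemoryError")):
        return f"slot {slot_dir.name}: errors in stderr, task may be stalled"
    return f"slot {slot_dir.name}: no obvious marker, inspect manually"

if __name__ == "__main__" and BOINC_SLOTS.exists():
    for slot in sorted(BOINC_SLOTS.iterdir()):
        if slot.is_dir():
            print(check_slot(slot))
```

Run it while a task is active; a slot reported as having errors is the one worth opening by hand.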
I would suggest not running anything that uses a lot of your system memory or GPU at the same time. Switching back and forth causes problems as well.
Unfortunately the GPUgrid Python project is somewhat complicated and is tough to run on your average home PC. I even gave up running this on my older HP workstations because they don't have enough memory.
I have been somewhat successful getting it to work on more powerful HP Windows servers but it's still problematic and has about a 40% error rate.
Let us know if you need more help to try figuring this out.
|
|
|
jkdma Send message
Joined: 21 Mar 20 Posts: 6 Credit: 53,007,324 RAC: 0 Level
Scientific publications
|
Thanks. |
|
|
Flo Send message
Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 806,457 Level
Scientific publications
|
I came here because I have the same problem as the OP.
At first it seemed like a glitch where it catches itself and jumps to the
"real" percentage - recent WUs have unfortunately just crashed back to the two percent when I pause the task before going to sleep. It's frustrating to lose 10 hours of GPU crunch time on an RTX 3080 Ti. I hope this gets fixed; running entirely on Linux is unfortunately not an option for my gaming setup. |
|
|
jjch Send message
Joined: 10 Nov 13 Posts: 101 Credit: 15,586,400,388 RAC: 4,234,677 Level
Scientific publications
|
The 5 tasks you ran have completed successfully so it appears things are working OK. It's best not to suspend tasks but as long as they are restarting from the last checkpoint it should be fine.
You can tell that a task was suspended and restarted when you see the Detected memory leaks! message. As long as it gets back to the Define train loop state and Finished! it's working.
It does take a while to replay from the last checkpoint, so it will stay at 2% until it gets back to where it left off. You are not really losing compute time, but you are losing setup or wait time, if you want to look at it that way. |
|
|
|
When I get a GPUGRID Python task, it usually fails within a few minutes. Occasionally it will run for several hours and eventually fail at about 2% complete. ACEMD tasks usually complete without problems on the rare occasions they are available.
I am using a PC with 16GB of memory, so it may be that the Python tasks are just running out of memory. Rosetta Python tasks usually complete eventually.
I may just need to abandon BOINC in favor of Folding@home. That system's tasks "just work".
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1341 Credit: 7,689,180,588 RAC: 13,303,833 Level
Scientific publications
|
The problem lies with Windows, not the tasks. Look in the Number Crunching or News forums for the discussion of how to work around Windows' peculiarities in handling memory reservation.
A host with only 16GB of memory CAN successfully crunch the Python GPU tasks IF you give the task enough virtual memory via the size of the pagefile.
There are Windows hosts with only 8GB running the tasks successfully.
Linux users do not have any issue as Linux does a better job with virtual memory reservation.
The Windows pagefile needs to be somewhere around 35-50GB and set to a custom size, NOT the system-managed or recommended size. |
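As a rough sketch of why those numbers work out: the commit space Windows can hand out is physical RAM plus pagefile, so the custom pagefile is what lifts a 16GB host past the task's large reservation. The figures below just restate the advice above; they are not measured values.

```python
# Sketch: Windows commit space = physical RAM + pagefile.
# The pagefile range restates the forum advice above; nothing here is measured.
RAM_GB = 16              # the host size discussed in this thread
PAGEFILE_GB = (35, 50)   # suggested custom (fixed) pagefile range

def commit_space_gb(ram_gb: float, pagefile_gb: float) -> float:
    """Total virtual memory Windows can commit, in GB."""
    return ram_gb + pagefile_gb

# With the recommended custom pagefile, the 16GB host can commit 51-66GB,
# far more than a system-managed pagefile would typically provide.
print(commit_space_gb(RAM_GB, PAGEFILE_GB[0]), commit_space_gb(RAM_GB, PAGEFILE_GB[1]))
```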
|
|
Flo Send message
Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 806,457 Level
Scientific publications
|
Currently I have a Python task that is really slow to make progress - about 2 percent in 8 hours. If I suspend the task and restart it, it starts from zero; I can see that because the percentage immediately resets to a very low value like 0.013%, so it seems to discard the previous progress -_- |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1341 Credit: 7,689,180,588 RAC: 13,303,833 Level
Scientific publications
|
You really shouldn't do that. The very beginning of the task is doing nothing but unpacking the task and getting the setup files ready for computation.
It hasn't done any computation at this stage.
You keep interrupting the task setup by restarting it in the early setup stages every time. Don't do that.
Don't worry about the progress meter. |
|
|
Flo Send message
Joined: 28 Apr 19 Posts: 3 Credit: 170,932,850 RAC: 806,457 Level
Scientific publications
|
Eight hours after the task started shouldn't be the beginning phase anymore, IMHO.
I have the machine in my living room and can't run it 24/7 due to heat and noise. I have discarded that weird slow job that made no progress and produced no noticeable load, and got one that works normally. |
|
|
|
What I've seen suggests that tasks resumed from checkpoints will at first show 2% progress. However, once a task reaches its next checkpoint, it resumes showing the actual progress, including whatever was recovered from the checkpoint.
One peculiarity I just saw: a task was showing 100% progress but still running. I spent a few minutes trying to decide whether this was a hung task. Then all the files in the slot directory disappeared and the task started sending a file back to the server. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1341 Credit: 7,689,180,588 RAC: 13,303,833 Level
Scientific publications
|
You did the correct thing and exercised patience. The task is still in its finishing-up stages after completion hits 100%.
If you pulled up a System Monitor when it hits 100% and looked at the 32 spawned Python processes, you would see them remove themselves from running one by one until they have all finished and ended.
The spawned processes report back to the output file, then the output files are uploaded and the files are removed from the slot.
If you prematurely end the task at 100% by cancelling or interrupting it, you would be throwing the computation away for no reason. |
|
|
|
Try installing msys2 and running tail -F C:/programdata/BOINC/slots/*/wrapper_run.out
and tail -F E:/programdata/BOINC/slots/*/stderr.txt from C:\tools\msys64\usr\bin\
|
|
|
rtX Send message
Joined: 2 Apr 09 Posts: 10 Credit: 80,975,982 RAC: 6,908 Level
Scientific publications
|
I aborted after a couple of days stuck at 2%, with a projected completion 46 days away and a deadline 5 days away. I aborted the downloaded replacement too, when it also showed 46 days to finish with another 5-day deadline. I've also said 'no more tasks' for now, and have material from other projects.
Much as I like GPUGrid and much as I like Linux, I have to operate within the Windows environment, and I want my spare processing capability to be used productively and without too much intervention on my part. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1341 Credit: 7,689,180,588 RAC: 13,303,833 Level
Scientific publications
|
You should have kept processing and just ignored the projected completion times.
BOINC does not know how to handle these dual gpu-cpu Python tasks and makes completely unbelievable and inaccurate estimated completion dates and times. |
|
|
|
I started having this issue over a month ago and finally worked it out today. I wonder if some different, larger units have come through that use more memory, because I hadn't noticed this issue in the years prior.
In the stderr file I noticed there were memory errors, and then what seemed, to my understanding, to be aborting. I'm not sure why it didn't just abort the run at that point; I suppose it may have been hoping the memory issues would get sorted. The errors in my stderr also included mention of a memory 'leak', and I'm yet to double-check this, but....
I set all the other running projects to not fetch any more work and let them run dry, and removed projects I'm not active in but that were still in BOINC. This freed up a lot of memory. Even though I didn't see the memory fill completely, it possibly did at some point, when the task would then get 'stuck' and abandon, but not abort, the run.
Now, overnight, it's got to 47% and is going strong. I think running Milkyway or Einstein was also using a lot of memory and some GPU. Those three together, while also running Rosetta and SiDock, did not play nicely with my GT 1030 and 16 GB RAM. On this PC, where I want to run GPUGRID, I will slowly try running another project simultaneously to test it out, just in case GPUGRID runs out of work so another project can kick in. |
|
|
|
Tasks in years prior used the acemd3 application, which used less than 1GB of VRAM in most cases.
The newer Python application needs at least about 3GB from what I've seen. Your 2GB GT 1030 doesn't have enough memory for these tasks.
Ignore the message about memory leaks in stderr.txt. It's always there from the Windows application and doesn't indicate anything wrong.
____________
|
|
|