Author |
Message |
Greg _BE
Joined: 30 Jun 14 Posts: 137 Credit: 122,122,976 RAC: 16,522
|
There seems to be a bug in these tasks.
I'm seeing a 100% failure on my system and the wingmen behind me.
Windows 10 or 11 does not make a difference.
A linux user also has this.
One of my tasks had 4-5 failures behind me.
On another task my first wingman failed, but he runs a 780, which lacks something in its firmware/software that would allow it to run these tasks.
I have a 1080 and it failed. The last person had a 1050 and it ran ok.
I don't get what is going on and why this was not picked up in testing.
I find this to be a common error message in the stderr file: OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies. |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1365 Credit: 7,928,584,073 RAC: 3,996,856
|
That is a problem with Windows and memory reservation allocation when loading all the Python dll's.
Linux does not have the issue.
See this message of mine. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908
The solution is to increase the size of your paging file. |
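For anyone who wants to see how much commit headroom Windows currently has before a task starts, here is a minimal sketch. It assumes the relevant figure is the system commit limit (RAM plus paging file) reported by the Win32 call GlobalMemoryStatusEx; the function name commit_headroom_mb is made up for this example, and on Linux it simply returns None.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX structure (fixed-width fields)."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),  # commit limit: RAM + paging file
        ("ullAvailPageFile", ctypes.c_uint64),  # commit that can still be reserved
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def commit_headroom_mb():
    """Return (commit_limit_mb, commit_available_mb) on Windows, None elsewhere.

    Hypothetical helper for illustration; it is not part of BOINC or GPUGrid.
    """
    if sys.platform != "win32":
        return None
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    mib = 1024 * 1024
    return status.ullTotalPageFile // mib, status.ullAvailPageFile // mib


if __name__ == "__main__":
    print(commit_headroom_mb())
```

If the second number is small compared with the paging file sizes suggested in this thread, a WinError 1455 while loading the torch DLLs would not be surprising.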
|
|
jjch
Joined: 10 Nov 13 Posts: 101 Credit: 15,714,899,367 RAC: 2,134,283
|
I had to go back 6 tasks to find the one that failed with the paging file error. More recent tasks are having a different problem running out of memory somewhere.
Your system looks like it has 48GB of physical memory, so that should be sufficient to run the GPUgrid Python tasks unless there is a conflict with something else.
I have a Server running Win Server 2012 with the same amount of physical memory. The swap file is still set at "Automatically manage paging file size for all drives"
I left this one that way since it was working OK. With one GPUgrid Python task running it shows Currently allocated at 12800 MB, which is typical.
Check the free space available on your swap drive and make sure it has a minimum of 16GB available. If you have plenty of space there then I would suggest you set the swap space separately.
I have found that sometimes it seems the Automatic isn't fast enough so try setting it to System managed size first. If that doesn't help then set it to Custom size.
You might need to play with the sizing a bit, but you can try Initial size 16384 and Maximum size 24576 or more.
The last 5 tasks are failing with various not enough memory errors but the first traceback is something I have been seeing with a lot of the tasks failing.
Just make sure you are not running anything that is tying up too much memory and not leaving enough available for GPUgrid.
Other than that these could be an internal error in the GPUgrid Python tasks causing it. |
|
|
Greg _BE
|
I have a whole HDD set aside for BOINC with 303GB of space left.
All the data files are there.
I run FAH plus all the projects you see in my profile here.
I am just around 73% memory usage.
Disk setting is leave 20GB free
Memory setting is computer in use 90%
Not in use 98%
Leave non GPU in memory (yes)
Page/Swap use at most 90%
You would think with these settings it has more than enough space to do what it needs to do.
According to BOINC tasks the current task uses 1932 physical and 3632 virtual.
BOINC says virtual size is 3.55 and working set is 1.89
Checked again after maxing everything out and this error keeps repeating:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\data\slots\1\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "D:\data\slots\1\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
As for paging size, this seems to be an error in the code; I've opened up BOINC's limits to the max. I think Python CPU at RAH had a similar teething error, but that one was not about paging size.
And after adjustments I get this: Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13199} normal block at 0x000001B0A0972890, 8 bytes long.
Data: < > 00 00 94 A0 B0 01 00 00
..\lib\diagnostics_win.cpp(417) : {11918} normal block at 0x000001B0A0998B40, 1080 bytes long.
Data: <<j 4 > 3C 6A 00 00 CD CD CD CD 34 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x000001B0A09708F0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x000001B0A096AA80, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x000001B0A096ABD0, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x000001B0A096AD90, 44 bytes long.
Data: < > 01 00 00 00 00 00 CD CD B1 AD 96 A0 B0 01 00 00
{368} normal block at 0x000001B0A096AD20, 44 bytes long.
Data: < A > 01 00 00 00 00 00 CD CD 41 AD 96 A0 B0 01 00 00
Object dump complete.
09:46:01 (13124): wrapper (7.9.26016): starting
09:46:01 (13124): wrapper: running python.exe (run.py)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {13134} normal block at 0x0000023C80BA32A0, 8 bytes long.
Data: < R < > 00 00 52 82 3C 02 00 00
..\lib\diagnostics_win.cpp(417) : {11853} normal block at 0x0000023C80BCF400, 1080 bytes long.
Data: <$2 P > 24 32 00 00 CD CD CD CD 50 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {397} normal block at 0x0000023C80BA3C60, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{383} normal block at 0x0000023C80B9AA70, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{378} normal block at 0x0000023C80B9AC30, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{373} normal block at 0x0000023C80B9A840, 44 bytes long.
Data: < a < > 01 00 00 00 00 00 CD CD 61 A8 B9 80 3C 02 00 00
{368} normal block at 0x0000023C80B9A990, 44 bytes long.
Data: < < > 01 00 00 00 00 00 CD CD B1 A9 B9 80 3C 02 00 00
Object dump complete.
But then it goes on to start running. |
|
|
|
I posted some screenshots of paging file settings in message 58934. I'd had similar failures with only 8 GB system RAM installed: with 16 GB and those settings, the Python app ran, though it's not a very efficient use of that particular machine. |
|
|
Greg _BE
|
I've searched Windows and the net for how to do that, but nothing matches those screenshots, and nothing from the net matches my Win 10 64-bit software.
Can you tell me how to get to the tabs you did the screenshot of?
|
|
|
Greg _BE
|
Found this info in boinc_task_state.xml
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>e00028a00502-ABOU_rnd_ppod_expand_demos6_again2-0-1-RND4470_2</result_name>
<checkpoint_cpu_time>31287.720000</checkpoint_cpu_time>
<checkpoint_elapsed_time>15281.828158</checkpoint_elapsed_time>
<fraction_done>0.059200</fraction_done>
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>
I am assuming these huge values are in bytes?
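A quick way to sanity-check those numbers is to convert them. The sketch below assumes the peak_* fields are byte counts (that is how they are read later in this thread) and pulls them out of a copied excerpt with a regular expression rather than parsing the full file; the helper name peak_values_gb is made up for this example.

```python
import re

# Excerpt copied from the post above; the real file contains more fields.
TASK_STATE = """
<peak_working_set_size>2470195200</peak_working_set_size>
<peak_swap_size>6816833536</peak_swap_size>
<peak_disk_usage>17117387104</peak_disk_usage>
"""


def peak_values_gb(xml_text):
    """Map each <peak_*> tag to its value in decimal gigabytes (1 GB = 10**9 bytes)."""
    pairs = re.findall(r"<(peak_\w+)>(\d+)</\1>", xml_text)
    return {name: int(raw) / 1e9 for name, raw in pairs}


for name, gb in peak_values_gb(TASK_STATE).items():
    print(f"{name}: {gb:.2f} GB")
```

Read this way, the working set is roughly 2.5 GB, the swap peak roughly 6.8 GB, and the disk usage roughly 17 GB.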
|
|
|
|
All these low-level Windows management tools have barely changed since Windows NT 4 days, but the roadmap for finding them changes every time. The ones I posted were from Windows 7, but here's the routing for Windows 11 - split the difference...
For the final one, unset the first and third ('Automatic' and 'System' management), and set 'Custom' to open up all the options. |
|
|
Greg _BE
|
after a little trial and error I found a way to that location.
Set it to 144MB 3x physical to start and gave it 154MB max
See if this helps anything. |
|
|
Keith Myers
|
That's way undersized. It should be GB's . . . . not MB's
From your task data . . . <peak_disk_usage>17117387104</peak_disk_usage>
That is 17GB's of disk usage.
I would set 17GB or 17000MB for initial size and double it for max size.
or
34GB or 34000MB |
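The rule of thumb above (initial size about equal to the task's peak usage, maximum about double) can be written down as a small calculation. This is only a sketch of that rule, not an official GPUGrid recommendation; the function name and the rounding to the nearest 1000 MB are assumptions made for the example.

```python
def suggest_pagefile_mb(peak_bytes, step_mb=1000):
    """Return (initial_mb, maximum_mb) for the Windows custom paging file fields.

    Hypothetical helper: initial size is the peak usage rounded to the nearest
    step_mb (at least one step), and the maximum is twice the initial size.
    """
    peak_mb = peak_bytes / 1_000_000  # the Windows dialog takes sizes in MB
    initial = max(step_mb, round(peak_mb / step_mb) * step_mb)
    return initial, 2 * initial


# Value taken from the <peak_disk_usage> tag quoted in this thread.
initial_mb, maximum_mb = suggest_pagefile_mb(17117387104)
print(f"Initial size: {initial_mb} MB, Maximum size: {maximum_mb} MB")
```

For the task quoted here this gives the same 17000 MB and 34000 MB figures; note that both are typed into the dialog as plain MB values.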
|
|
Greg _BE
|
oh! thanks...will make the change
170000 and 340000 |
|
|
Greg _BE
|
Well that seems to have solved the problem on my Win10 machine.
2 tasks run and completed ok.
Thanks Keith!
Curious though: why, if it has too much space, does it error out, and only here, not in other projects? |
|
|
Keith Myers
|
Go back and read this post of mine.
https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908
Only affects projects that use pytorch in Windows that have large DLL's that Windows MUST reserve a lot of memory for.
Don't think there are any other BOINC projects that use pytorch.
So not affected. |
|
|
Greg _BE
|
I have never heard of that. I wondered what that was.
So after reading that, it explains why Python GPU or anything in GPU is used at my oldest project RAH. They have Python CPU to run, generated by an external client, but that's about it for us BOINC users. They keep all the really interesting stuff inhouse for the AI system. |
|
|
Keith Myers
|
Once again, GPUGrid is on the cutting edge of GPU science for BOINC projects with its machine learning and AI development. They were the first BOINC project to use GPUs. I like that they are still pushing the envelope.
The only other machine learning BOINC project I know about is MLC@home and they only use cpus now. Had a gpu app a few years ago but I don't think they are producing any tasks for gpus currently. |
|
|
Greg _BE
|
I like projects that push the boundaries. Look for stuff that has not been done before either in code or in ideas of what to send out for crunching. |
|
|
Jim1348
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach developing artificial intelligence and optimization methods in order to explore efficiently the highly combinatorial molecular space.
https://quchempedia.univ-angers.fr/athome/about.php
QuChemPedIA is an AI project, though CPU only. And it works best with Linux. You can use Windows with VirtualBox, but there are a lot of stuck work units you have to deal with. |
|
|
Greg _BE
|
I know it and due to that exact reason and other technical errors, I gave up.
I can't get it to run stable on my windows system, so forget it.
GPU's get enough action with this project and primegrid and FAH as well as Eisenstein.
I think I am attached to enough projects to keep this system busy all the time it runs (16 hours a day). |
|
|
Greg _BE
|
so...a new wrinkle.
I have two tasks running at the same time and RAH is complaining about disk space with the CPU Python.
I've maxed out the upper value.
rosetta python projects needs 3624.20MB more disk space. You currently have 15449.28 MB available and it needs 19073.49 MB.
So what do I have to do? I suppose I will have to restrict this project to 1 GPU in order to solve this disk space problem? |
|
|
|
Disk space limits can be solved by tweaking BOINC's limits.
They're quite separate and distinct from the memory (RAM) problems you were having here earlier. |
|
|
Greg _BE
|
Ok thanks...fixed |
|
|
Greg _BE
|
New problem...I stopped last night with 98% done and about an hour and a half to go on the task. I did all the normal shutdown procedures: suspend all computing, shut down the client, exit the program. When I restarted this morning the task had gone to hell. Time to finish is 159 days, it shows 2% done, and time remaining counts UP and not down.
CPU time
6d 11:39:36
CPU time since checkpoint
00:14:10
Elapsed time
3d 06:14:26
Estimated time remaining
159d 17:47:48
Fraction done
2.000%
Now after several restarts the time remaining goes down, but still 159 days.
I had another task that was also close to done, but the server considered it timed out. I guess I missed the deadline.
I'll let this task run for a bit longer, but to me it looks all messed up.
I don't see anything wrong in stderr or boinc_task_state |
|
|
Greg _BE
|
It settled down now. 47 minutes left.
|
|
|
Greg _BE
|
Different question: When looking at Boinc Tasks program and looking at the CPU%, why do I see 197% and 131% CPU usage? Is that just how these tasks work?
I thought CPU was for control and guidance only? This almost looks like it is processing as well. |
|
|
Keith Myers
|
It is normal for tasks to temporarily revert to 2% completion upon restart.
But they quickly jump back to their original completion done percentage at the point they were stopped in just a few minutes.
And then continue on till finish.
At least that is what they always do on all my Linux hosts.
But I have seen similar comments from others running Windows. Probably best not to chance stopping them on Windows.
The application does in fact use the cpu. Quite a bit in fact. The task will jump back and forth from running on the cpu to a quick spurt on the gpu and then back to the cpu.
The tasks spawn 32 individual python processes on the cpu so you are really using more than 100% of a single cpu core. That is what BoincTasks is detecting and showing.
From Message 59980: The reason Reinforcement Learning agents do not currently use the whole potential of the cards is because the interactions between the AI agent and the simulated environment are performed on CPU while the agent "learning" process is the one that uses the GPU intermittently. |
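To make the CPU percentages concrete: a monitor that adds up the usage of every process belonging to a task will report more than 100% as soon as the workers together keep more than one core busy. The sketch below is plain arithmetic; the per-worker figure of 6% is an invented number for illustration, and both function names are made up.

```python
def total_cpu_percent(per_process_percent):
    """Sum per-process CPU usage the way a per-task monitor would report it."""
    return sum(per_process_percent)


def core_equivalents(cpu_percent):
    """Convert a summed CPU percentage into 'cores kept busy'."""
    return cpu_percent / 100.0


# Illustrative assumption: 32 spawned workers each averaging 6% of one core.
workers = [6.0] * 32
print(total_cpu_percent(workers))   # summed usage across all workers
print(core_equivalents(197.0))      # the 197% reading expressed in cores
print(core_equivalents(131.0))      # the 131% reading expressed in cores
```

So readings of 197% and 131% just mean the task is keeping roughly two and roughly one and a third cores busy.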
|
|
|
The failure rate on the GPU tasks has reached the point where I feel it is a waste to even try to explain the processes of the failures: 97 out of 101 tasks have failed on either a GTX 1060 or an RTX 3080 and I aborted the RTX task after it wasted 5 days+ of running time, exceeded the return time limit, and still had double-digit days remaining. The three tasks that succeeded used only about 1800 to 3500 seconds of run time.
My patience has expired and I am terminating tasking on Grid for a couple of weeks or so and perhaps the problem can be solved using internal GPUs.
Added Comment: Just for the hell of it, I downloaded a new task just now on the GTX 1060 machine and the initial time to compute was shown as 30 DAYS; OH SURE!!! This does not constitute a sound confidence builder.
Billy Ewell 1931 (Yes, my year of birth)
|
|
|
Keith Myers
|
Sorry to hear you go.
The estimated time to complete values can be completely ignored at GPUGrid.
BOINC does not have the mechanism to compute the time remaining values of the dual cpu-gpu nature of these tasks and cannot estimate the time to complete correctly.
On modern gpus of at least Pascal generation, the tasks complete well within the standard 5 day deadlines. Typical compute times of around 20 minutes to 12 hours.
Windows needs to be set up correctly however to run these tasks properly.
The Windows pagefile size needs to be increased to around 35-50GB for the tasks to run and finish properly. |
|
|
Greg _BE
|
Billy, scroll down this thread a bit.
There is a post where Keith gives some upper and lower limits to the page file size. This cleared things up for me really fast.
I run a 1080 and a 1050 and once I did the page file setting I have never had an error on either card. Run time is about 3 days on these cards, but I am sharing them with Folding At Home, so that might slow things down a bit. |
|
|
Keith Myers
|
Thanks for the confirmation Greg that the Python tasks CAN in fact be properly run to completion well within their deadlines AS LONG as Windows is configured correctly.
Glad to hear you are successfully processing this new work and contributing to cutting edge science. |
|
|
|
Try installing BOINC on Rocky Linux 8 in VMware Workstation Player. It is free for home use. |
|
|
Greg _BE
|
Just chugging along now. Once that swap space issue was taken care of, no problems. This is a Win10 machine with AMD Ryzen. |
|
|
|
Adria, please fix the bug in your WUs.
error code 195 |
|
|
|
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:23:38 (19824): wrapper (7.9.26016): starting
23:23:38 (19824): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed:
Error invoking kernel: CUDA_ERROR_LAUNCH_FAILED (719)
23:50:50 (19824): bin/acemd3.exe exited; CPU time 1611.078125
A GeForce GTX 1660 Ti should be OK: check your drivers. |
|
|
|
Keith: Thanks for the input but I am personally cautious in changing items for fear I will screw up what I cannot fix.
Here are the current page filing settings on automatic and I have changed nothing so far. This is as currently specified:
Minimum allowed----16 MB
Recommended--------4957 MB
Currently----------45056 MB
As I understand the suggestion is I unclick the automatic setting option and set the Minimum as 35 and the others as ?????.
Await your reply: Bill |
|
|
|
Try to set it 51200 |
|
|
Keith Myers
|
Those settings pages take values in MB, not GB, and the Python tasks need sizes in the GB range.
So you need to multiply your 35 by 1000, in other words 35000 MB. |
|
|
|
Keith Myers and Kotenok2000:
Once I reset the pagefiles to the recommended values I have processed bunches of tasks without a skip. Thanks for the great advice. BET
The bottom number is 35000MB and the top is 51200MB.
It would seem practical to me for the admins/techs to incorporate the pagefiles criteria in such a way that all contributors will find it easy to find the instructions and likewise easy to modify their machines. |
|
|
Keith Myers
|
I just made a post about the pagefile mod needed for Python task in the FAQ section.
Just need an admin to make it sticky. |
|
|
|
If the GPUGrid project is willing to ask for and accept the in-kind donations of people's GPU time, then GPUGrid has an obligation to do what they can to resolve problematic tasks and code.
If WUs require mods to the defaults in config files, etc., people should NOT have to hunt around in forum posts to glean a solution.
BOINC Manager does have a Notices tab, and it is negligent of GPUGrid not to post the needed instructions there, or at least a direct link to the specific forum post, for resolution,
in particular when the problem is not isolated to just a few PCs.
Other projects DO extend the courtesy of communicating via the Notices tab.
LLP, PhD, Prof. Engr. |
|
|