
Message boards : Number crunching : some hosts won't get tasks

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 27
Message 57146 - Posted: 5 Jul 2021 | 14:44:55 UTC

This one is a real head-scratcher for me. Ever since the new app was released, two of my hosts have not been able to receive tasks. They don't give any error or other obvious sign that anything is wrong; they just always get the "no tasks available" response.

Now, I know that task availability is slim right now, but all hosts are requesting work on the same interval (every 100 s), and 3 of the 5 hosts are getting work fairly regularly. After running for several days, I would expect every host to get at least one task. It seems odd that 2 hosts never get any tasks; they can't be THAT unlucky.

These hosts have no problem getting some tasks occasionally:
[7] RTX 2080 Ti / EPYC 7402P / Ubuntu 20.04
[1] RTX 3080 Ti / R9-5950X / Ubuntu 20.04
[1] GTX 1660 Super / [2] EPYC 7642 / Ubuntu 20.04


These hosts have not received tasks since the new app was released:
[8] RTX 2070 / EPYC 7402P / Ubuntu 20.04
[7] RTX 2080 / EPYC 7502 / Ubuntu 20.04

All hosts have the same "venue/location" in preferences, same OS/software package, compatible drivers, the proper boost packages installed, and I've reset the project on all hosts. I can't see an obvious reason why the two haven't received any work.
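The "can't be THAT unlucky" intuition can be put into rough numbers. This is a back-of-envelope sketch; the per-request success rate below is an illustrative assumption, not a measured value:

```shell
# Chance a host receives nothing in n tries, if each scheduler request
# independently succeeds with probability p. p = 0.005 is an assumed,
# illustrative success rate; n = 2592 is ~3 days of requests every 100 s.
p_no_tasks() {  # usage: p_no_tasks <p_success> <n_requests>
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2e\n", (1 - p) ^ n }'
}
p_no_tasks 0.005 2592   # ~2e-06: two hosts both drawing blanks is implausible
```

Even with a success rate well under 1% per request, a host that never gets a single task over several days is far outside random chance.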

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 57148 - Posted: 5 Jul 2021 | 15:27:57 UTC - in response to Message 57146.

We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it beyond the fact that it happens.

Ian&Steve C.
Message 57149 - Posted: 5 Jul 2021 | 15:34:19 UTC - in response to Message 57148.

We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it beyond the fact that it happens.


That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address). In that case, a scheduler request would fail, but occasionally get through. I've worked around this problem for a long time and nothing has changed in that regard. These systems are spread across 3 physical locations, and one of them (the 8x 2070) is actually the only host at its IP; it's not competing with any other system.

So that's not the issue here. I have no problem making scheduler requests, and the hosts are always asking for work, but these two for some reason always get the response that no tasks are available. It seems unlikely that they would be THAT unlucky, never getting a resend while 3 other systems occasionally pick them up.


Jim1348
Message 57153 - Posted: 5 Jul 2021 | 17:28:56 UTC - in response to Message 57149.

that situation is different. what you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present; the second is longer-term.

Ian&Steve C.
Message 57154 - Posted: 5 Jul 2021 | 17:56:50 UTC - in response to Message 57153.

that situation is different. what you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present, the second problem is longer-term.


Can you link to some additional information about this second case? I've never seen that discussed here, only the one I mentioned.

But again, the server is responding, so it's not actually being blocked from communication; the server just always responds that there are no tasks, even when there probably are at times.

Jim1348
Message 57156 - Posted: 5 Jul 2021 | 19:27:53 UTC - in response to Message 57154.

It has been over a year since I last saw it mentioned. I searched my own posts, but unfortunately the search function does not work correctly.

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 10,265,339,066
RAC: 16,049,998
Message 57157 - Posted: 5 Jul 2021 | 21:05:50 UTC

I found this interesting Message #54344 from Retvari Zoltan, and also Message #51060 from kksplace.

Ian&Steve C.
Message 57158 - Posted: 5 Jul 2021 | 21:20:03 UTC - in response to Message 57157.
Last modified: 5 Jul 2021 | 21:22:01 UTC

Thanks for digging, but those both describe different situations. In the case from Zoltan, the user was getting a message that tasks won't finish in time. I am not getting any such message, only that tasks are not available.

And I’m not having any issue with, or getting messages about, low disk space preventing work.

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,894,103,302
RAC: 7,266,669
Message 57159 - Posted: 6 Jul 2021 | 0:31:39 UTC

I'm assuming GPUGrid is your only GPU project?

Any past GPU projects on those hosts?

You might still have a REC debt to those other projects.

Have you tried a work_fetch_debug or a rr_simulation_debug?
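For anyone trying this: the switches Keith mentions are client log flags set in cc_config.xml (in BOINC's own flag names they are work_fetch_debug and rr_simulation). A minimal sketch, assuming the usual Linux data directory:

```shell
# Write a minimal cc_config.xml enabling work-fetch and round-robin
# simulation debug logging. /var/lib/boinc-client is the common Linux
# data directory; adjust the path for your install.
cat > /tmp/cc_config.xml <<'EOF'
<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
    <rr_simulation>1</rr_simulation>
  </log_flags>
</cc_config>
EOF
# sudo cp /tmp/cc_config.xml /var/lib/boinc-client/cc_config.xml
# boinccmd --read_cc_config   # reload the flags without restarting the client
grep -c '<work_fetch_debug>' /tmp/cc_config.xml
```

The extra detail then shows up in the event log on each scheduler request, which makes it easy to see what the client thinks it is asking for.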

Ian&Steve C.
Message 57160 - Posted: 6 Jul 2021 | 0:54:25 UTC - in response to Message 57159.
Last modified: 6 Jul 2021 | 0:54:42 UTC

GPUGRID is the only non-zero-resource GPU project.

I never have to deal with REC; I do not run multiple projects at the same time, only one as primary (GPUGRID) and another as backup (Einstein), so it always prioritizes GPUGRID.

It's asking for work (1 sec, n devices), it just never gets any; the project always says no tasks are available. If it were some REC issue I would get a different response: either something stating that, or no work request at all (0 sec, 0 devices).

Ian&Steve C.
Message 57175 - Posted: 8 Jul 2021 | 3:57:58 UTC - in response to Message 57160.

this "feels" like a similar issue being described here. maybe not exactly the same, but something similar at least.

https://einsteinathome.org/content/invalid-global-preferences-problem#comment-169966

But as best I can tell, all of my global_prefs files are the same, and they come from WCG anyway; I'm not sure what that has to do with GPUGRID. And there is no <venue> line item in any of them, including on the hosts which get work fine. They are pretty much identical across all hosts.

But the symptoms of something like this seem similar: the project just telling you that no tasks are available when they really are.

Richard, do you remember anything about this?

I've updated preferences (just reinstating existing settings) at both WCG and GPUGRID, and also tried removing GPUGRID entirely from one host and adding it back. So far nothing has changed.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1626
Credit: 9,376,466,723
RAC: 19,051,824
Message 57176 - Posted: 8 Jul 2021 | 7:40:17 UTC - in response to Message 57175.

Richard, do you remember anything about this?

Yes, I remember it well. Message 150509 was one of my better bits of bug-hunting.

But I also draw your attention to Message 150489:

All my machines have global_prefs_override.xml files, so are functioning normally in spite of the oddities.

Ian&Steve C.
Message 57179 - Posted: 8 Jul 2021 | 11:41:46 UTC - in response to Message 57176.

I have an override file as well. What’s the significance of that? It has my local settings, but what’s the significance to this issue?

Keith Myers
Message 57180 - Posted: 8 Jul 2021 | 13:36:56 UTC - in response to Message 57179.

I have an override file as well. What’s the significance of that? It has my local settings, but what’s the significance to this issue?

An override file always takes precedence over any project preference file.

It is completely local to the host.

Richard Haselgrove
Message 57181 - Posted: 8 Jul 2021 | 13:48:01 UTC - in response to Message 57179.

The significance lies in the fact that Einstein has re-written large parts of their server code in Drupal. In some respects, their re-write didn't exactly correspond to the original Berkeley PHP (or whatever it was) version of the code.

In particular, the Drupal code barfed when reading the Berkeley version of the global_prefs.xml file, when venues were in use. The Berkeley file was mal-formed: the BOINC client was relaxed enough to read it, but the Drupal server was stricter and threw an error: that's when scheduler requests failed and no work was sent.

But 'no work' was a specifically Drupal (Einstein project) problem. Work could be fetched from other projects as normal. Einstein solved the problem by modifying their Drupal code so that the missing tag became a non-fatal error - it just wrote a warning into the server log instead. And they fixed the Berkeley code so that projects which updated their server code didn't trigger the bug any more.

So, I don't think the Einstein discussion will be a pointer to the cause of your problem here - even though this project still uses server version 613 (dating to around 2012-2013, before the Einstein fixes of 2016). My machines continue to request GPU work from all three of Einstein, GPUGrid and WCG. Only Einstein has 100% work availability; 'no work available' is common at the other two.

Ian&Steve C.
Message 57182 - Posted: 8 Jul 2021 | 14:18:46 UTC - in response to Message 57181.
Last modified: 8 Jul 2021 | 14:27:23 UTC

I figured the exact Einstein issue was not causing any issue at GPUGRID, just that some aspects of that situation feel similar to what's happening now.

I know this happened at SETI once before, and I think it was somehow related to how many "days" of work were set in the compute preferences. IIRC, if it was set too high, it would request work but the project always responded that no tasks were available, even when there were. Reducing the requested "days" of work allowed work to finally get sent.

That's what I think is happening now: not necessarily related to days requested, just some situation LIKE that. Nothing changed on my systems; one day, 2 hosts just stopped ever getting work, even when it was available (always getting the "no tasks available" response). Of course it's harder to troubleshoot now that there is much less work available and only the occasional resend.

I've now disabled work fetch on 2 of the "good" systems, and removed the script that constantly looks for a top-up on the 3rd; the 3080 Ti host will now only check for work on BOINC's own logic, while the two bad hosts are trying for work every 100 seconds. So the two bad hosts should have a MUCH greater probability of grabbing work than the 3080 Ti host, yet the 3080 Ti host still manages to catch some here and there, and the other two hosts have not received anything since July 1st, always getting the "no tasks available" message. That's what makes me think there's something deeper going on; the behavior is outside of statistical norms.

ServicEnginIC
Message 57183 - Posted: 8 Jul 2021 | 15:02:52 UTC - in response to Message 57182.
Last modified: 8 Jul 2021 | 15:03:22 UTC

...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

July 1st is exactly the date when new application version ACEMD 2.12 (cuda1121) was launched.
And both your problematic hosts haven't received any task of this new version.
Simply coincidence?
I think not.

Ian&Steve C.
Message 57184 - Posted: 8 Jul 2021 | 15:06:55 UTC - in response to Message 57183.

...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

July 1st is exactly the date when new application version ACEMD 2.12 (cuda1121) was launched.
And both your problematic hosts haven't received any task of this new version.
Simply coincidence?
I think not.


I agree with this, but so far I can find no difference in the setup of the two bad hosts that would prevent them from getting work. It's the same as on the hosts that are getting work.

ServicEnginIC
Message 57185 - Posted: 8 Jul 2021 | 15:50:34 UTC - in response to Message 57184.

I was thinking of some subtle change in the requirements for task sending on the server side, rather than on your hosts' side...

Ian&Steve C.
Message 57186 - Posted: 8 Jul 2021 | 17:32:21 UTC - in response to Message 57185.

I was thinking of some subtle change in the requirements for task sending on the server side, rather than on your hosts' side...


Yeah, but if the hosts look the same from the outside, they should meet the same requirements. I think it's something less obvious, where the server isn't telling me what the problem is.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,376,298,734
RAC: 3,488,278
Message 57195 - Posted: 11 Jul 2021 | 12:57:41 UTC

I suggest you force BOINC / GPUGrid to assign a new host ID to your non-working hosts.

Keith Myers
Message 57196 - Posted: 11 Jul 2021 | 14:56:30 UTC

That might be a solution. It's easy enough to do, and you can always merge the old host ID back into the new one.

Ian&Steve C.
Message 57197 - Posted: 11 Jul 2021 | 18:57:45 UTC - in response to Message 57195.

I suggest you force BOINC / GPUGrid to assign a new host ID to your non-working hosts.


Could be a solution, but right now not much work is available anyway, so I will wait until work is plentiful again and reassess. If I'm still not getting work when there are thousands of tasks ready to send, then I'll do it. I'd really prefer not to, though.

jiipee
Joined: 4 Jun 15
Posts: 19
Credit: 8,546,620,012
RAC: 2,485,789
Message 57198 - Posted: 12 Jul 2021 | 6:48:01 UTC
Last modified: 12 Jul 2021 | 6:49:18 UTC

The last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. The same error seems to happen on others' hosts too, yet one has successfully completed it:


<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
10:36:11 (18462): acemd3 exited; CPU time 0.000578
10:36:11 (18462): app exit status: 0x7f
10:36:11 (18462): called boinc_finish(195)

</stderr_txt>
]]>


Perhaps some bugs waiting to be solved?

Ian&Steve C.
Message 57199 - Posted: 12 Jul 2021 | 14:09:22 UTC - in response to Message 57198.

Last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. This same error seems to happen on other's hosts too, yet one has successfully completed it:


<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
10:36:11 (18462): acemd3 exited; CPU time 0.000578
10:36:11 (18462): app exit status: 0x7f
10:36:11 (18462): called boinc_finish(195)

</stderr_txt>
]]>


Perhaps some bugs waiting to be solved?


You need to install the boost 1.74 package from your distribution or from a PPA. I have no idea what system you have since your computers are hidden, and the install process varies from distribution to distribution. On Ubuntu there is a PPA for it.

that will fix your error.
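For Ubuntu-family systems, the fix sketched below is roughly the shape of it; the PPA name is an assumption (check which PPA currently ships boost 1.74 for your release):

```shell
# Install boost 1.74 so acemd3 can resolve libboost_filesystem.so.1.74.0.
# The PPA below is an assumption; on distributions that already ship 1.74,
# a plain "apt install" of the runtime package is enough.
#   sudo add-apt-repository ppa:mhier/libboost-latest && sudo apt update
#   sudo apt install libboost1.74-dev
have_boost_fs() {   # check whether the exact soname acemd3 wants is resolvable
    ldconfig -p 2>/dev/null | grep -q 'libboost_filesystem\.so\.1\.74\.0'
}
if have_boost_fs; then echo "libboost_filesystem 1.74 found"; else echo "still missing"; fi
```

The ldconfig check is the useful part: it confirms the dynamic linker can find the exact soname from the stderr output above, independent of how the library was installed.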

jiipee
Message 57200 - Posted: 12 Jul 2021 | 18:17:09 UTC - in response to Message 57199.

Ok, thanks for the info. My computers run mostly CentOS 6/7, but there is one Linux Mint and one Win10 also.

Ian&Steve C.
Message 57267 - Posted: 3 Sep 2021 | 15:17:40 UTC

I think it's resolved now.

Background:
When the cuda1121 app was released on July 1st, 2 of my hosts stopped receiving any tasks. The cuda100 Linux app was pulled and replaced with cuda1121. All systems had compatible drivers and displayed compatible driver versions; however, only some systems continued to receive the new app, while all others constantly got a "no tasks available" message. I had no problem getting cuda100 tasks before.

I run the coproc_info files on all my hosts in a locked-down state, so each always shows the same obfuscated driver version and doesn't change when I change drivers. This can be beneficial for testing (for example, it's the only way I can get Einstein to send me the new 1.28 beta app for AMD, because BOINC detects OpenCL 1.2 even with compatible drivers, and Einstein will not send the app unless you report OpenCL 2.0+), and it also gives me the ability to control what is actually shown. Usually I do not update the coproc file with the latest info; if I want to change something, I unlock it, change what needs to be changed, and lock it back down.

Recently:
They pushed updates for the cuda1121 app, but also brought back a new cuda101 app. It was this app that I received; I did not receive cuda1121.

So I had the thought: maybe they are actually checking the CUDA version reported by BOINC, not just checking for a compatible driver version. I checked the CUDA version listed in the "good" hosts' coproc files, and they all reported greater than 11.2. The bad hosts were outdated and still reported CUDA 11.1 from older driver installs, even though the driver version itself was listed as high enough for CUDA 11.2 based on the NVIDIA thresholds. This also made sense of why one of my hosts picked up tasks for the cuda101 app: previously cuda100 was taken away and the host didn't report a high enough CUDA version to get the 11.2 app, but once cuda101 was brought back, it qualified for that again.

So I've now recycled the coproc file on the two bad hosts to report CUDA 11.4, so I expect I'll get the new app now. It might be useful in the future to test cuda101 vs cuda1121 by manipulating the coproc file to control what gets sent; I assume GPUGRID will send the highest version you qualify for.

So the combination of an outdated coproc file (locked from updates) and the removal of the old cuda100 app is what caused my earlier problems getting work on a few hosts. If they had kept the old cuda100 app in play, I would still have received that.
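For anyone wanting to reproduce the coproc edit, this is roughly the change involved; the XML fragment below is a made-up stand-in for the real coproc_info.xml (normally in the BOINC data directory):

```shell
# Demo of rewriting <cudaVersion> so the host advertises CUDA 11.4
# (11040 in BOINC's integer encoding). The file here is a hypothetical
# miniature of coproc_info.xml, created just for illustration.
cat > /tmp/coproc_info.xml <<'EOF'
<coprocs>
  <coproc_cuda>
    <cudaVersion>11010</cudaVersion>
  </coproc_cuda>
</coprocs>
EOF
sed -i 's|<cudaVersion>[0-9]*</cudaVersion>|<cudaVersion>11040</cudaVersion>|' /tmp/coproc_info.xml
grep -o '<cudaVersion>[0-9]*</cudaVersion>' /tmp/coproc_info.xml
```

On a real host the client would normally regenerate this file at startup, which is why the locked-down state described above matters.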

Richard Haselgrove
Message 57279 - Posted: 7 Sep 2021 | 8:24:56 UTC - in response to Message 57267.

Ian,

Did you make any other changes - to coproc_info or elsewhere?

After rebooting a Linux Mint machine, I get

Tue 07 Sep 2021 09:03:21 BST | | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91, CUDA version 11.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Tue 07 Sep 2021 09:03:21 BST | | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91.03, device version OpenCL 1.2 CUDA, 5942MB, 3974MB available, 5153 GFLOPS peak)

- all of which seems to match your settings, but I've still never been sent a task beyond version 212. Any ideas?

Ian&Steve C.
Message 57282 - Posted: 7 Sep 2021 | 22:44:28 UTC - in response to Message 57279.
Last modified: 7 Sep 2021 | 22:48:11 UTC

I have to assume that their CUDA version is "11.2.1", the .1 denoting the Update 1 release, based on the fact that their app plan class is cuda1121.

Does BOINC report CUDA version 11.2.1 or greater in the coproc file? Your driver version is sufficient, but it's possible that BOINC isn't capturing these minor versions, and the project only knows what you have from what BOINC tells it.

Try upgrading the drivers to 465+ to get into CUDA 11.3+, to ensure that your version is greater than required.

Also keep in mind the low task availability; it seems new work hasn't been available for a few days. Maybe they pulled back on sending work after I reported the issues with the new app.

Richard Haselgrove
Message 57283 - Posted: 8 Sep 2021 | 11:10:25 UTC - in response to Message 57282.

OK, I'll see your 465 and raise you 470 (-:

Wed 08 Sep 2021 12:04:41 BST | | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57, CUDA version 11.4, compute capability 7.5, 4096MB, 3972MB available, 5530 GFLOPS peak)
Wed 08 Sep 2021 12:04:41 BST | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57.02, device version OpenCL 3.0 CUDA, 5942MB, 3972MB available, 5530 GFLOPS peak)

It sounds plausible: coproc_info had <cudaVersion>11020</cudaVersion>; it now has <cudaVersion>11040</cudaVersion>. No tasks on the first request but, as you say, they're as rare as hen's teeth. I'll leave it trying and see what happens.
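For reference, that <cudaVersion> integer appears to come from the CUDA driver API, which encodes the version as major*1000 + minor*10; there is no slot for an update release, so 11.2 and 11.2.1 look identical to the scheduler. A quick decoder, as a sketch under that assumption:

```shell
# Decode BOINC's <cudaVersion> integer into "major.minor". Note that 11.2
# and 11.2.1 both encode to 11020, so the server cannot tell them apart.
decode_cuda_version() {   # usage: decode_cuda_version <integer>
    awk -v v="$1" 'BEGIN { printf "%d.%d\n", v / 1000, (v % 1000) / 10 }'
}
decode_cuda_version 11020   # prints 11.2
decode_cuda_version 11040   # prints 11.4
```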

Richard Haselgrove
Message 57285 - Posted: 8 Sep 2021 | 21:18:07 UTC

OK, so I've got a Cryptic_Scout task running with v217 and cuda1121.

But it's on the machine where I didn't update the video driver. Go figure.

It's - according to BOINC Manager - on device 1, and two Einstein tasks are running on device 0. As usual.

I've had a long day in the hills (last day of summer weather), so I'll leave it for tonight. But at least I'll have some entrails to pick over in the morning.

Thought - I might exclude the project from devices other than 0, until we get to the bottom of this.

Richard Haselgrove
Message 57286 - Posted: 8 Sep 2021 | 21:38:48 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

ServicEnginIC
Message 57287 - Posted: 8 Sep 2021 | 21:48:21 UTC - in response to Message 57286.
Last modified: 8 Sep 2021 | 21:49:00 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

The nvidia-smi command will quickly confirm this.

Richard Haselgrove
Message 57288 - Posted: 8 Sep 2021 | 22:18:27 UTC - in response to Message 57287.

Yup, so it has.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... Off | 00000000:01:00.0 On | N/A |
| 55% 87C P2 126W / 125W | 1531MiB / 5941MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 166... Off | 00000000:05:00.0 Off | N/A |
| 31% 37C P8 11W / 125W | 8MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1133 G /usr/lib/xorg/Xorg 89MiB |
| 0 N/A N/A 49977 C bin/acemd3 302MiB |
| 0 N/A N/A 50085 C ...nux-gnu__GW-opencl-nvidia 1135MiB |
| 1 N/A N/A 1133 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+

acemd3 running on GPU 0 is conclusive. And so to bed.

kksplace
Joined: 4 Mar 18
Posts: 53
Credit: 2,713,228,362
RAC: 3,819,843
Message 57583 - Posted: 11 Oct 2021 | 19:07:43 UTC
Last modified: 11 Oct 2021 | 19:13:46 UTC

After not crunching for several months, I started back again about a month ago. It took some time due to limited work units, but I received some GPUGrid WUs starting the first week of October; now I haven't received any since October 6th. I have tried snagging one when some show as available, and only receive the message "No tasks are available for New version of ACEMD" in the BOINC Manager event log. Any ideas what I may have changed or not set correctly? (I am receiving and crunching Einstein and Milkyway WUs. GPUGrid's resource share is set 15 times higher than Einstein's and 50 times higher than Milkyway's.)

Nvidia 1080
Driver 470.63.01, CUDA version 11.4
Linux Mint OS

Edit: I have also tried a project reset, which did not help.

Computer is not hidden. Thank you for taking a look.

Ian&Steve C.
Message 57584 - Posted: 11 Oct 2021 | 19:16:58 UTC - in response to Message 57583.

No tasks available. Your system looks fine to me.

If you want to snag tasks or resends as they become available, you'll have to set up some kind of script or looping command to constantly check GPUGRID for more work. BOINC's default work-fetch behavior falls into a kind of hidden back-off and will stop checking after several instances of receiving no work. A script that checks periodically is the only sure way to defeat this. Just make sure it checks on an interval longer than the server's cooldown (I think it's 30 seconds); checking every 5 minutes will give you a good chance of catching resends or new work.
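A minimal sketch of such a loop, assuming boinccmd is installed; BOINCCMD and INTERVAL are knobs added here for illustration, and a count of 0 runs forever:

```shell
#!/bin/sh
# Periodically ask the local BOINC client to contact GPUGRID for work,
# sidestepping the client's hidden work-fetch back-off.
BOINCCMD="${BOINCCMD:-boinccmd}"   # the standard BOINC command-line tool
INTERVAL="${INTERVAL:-300}"        # 5 min: well above the ~30 s server cooldown
URL="https://www.gpugrid.net/"

fetch_loop() {   # usage: fetch_loop <count>   (0 = loop forever)
    count="$1"; i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        "$BOINCCMD" --project "$URL" update
        i=$((i + 1))
        sleep "$INTERVAL"
    done
}
# fetch_loop 0 &   # start it in the background on the host
```

`boinccmd --project <URL> update` forces a scheduler contact for that one project, so the loop doesn't disturb work fetch for any backup projects.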