Python Runtime (GPU, beta)


Author	Message
Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 57655 - Posted: 26 Oct 2021 \| 10:57:36 UTC
	If anybody wants to help debug a new application, please enable the above mentioned app.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57656 - Posted: 26 Oct 2021 \| 12:09:24 UTC - in response to Message 57655.
	I don't see anything new on https://www.gpugrid.net/apps.php yet?

Azmodes Send message Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 Level Scientific publications	Message 57657 - Posted: 26 Oct 2021 \| 12:30:43 UTC
	GPUGRID 10/26/2021 2:01:26 PM No tasks are available for Python apps for GPU hosts

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57659 - Posted: 26 Oct 2021 \| 15:46:17 UTC
	One system queued up and waiting. ____________

Azmodes Send message Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 Level Scientific publications	Message 57660 - Posted: 26 Oct 2021 \| 17:31:20 UTC
	Got a 2080 Ti and two 2070 Supers ready to roll.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57663 - Posted: 26 Oct 2021 \| 20:25:32 UTC - in response to Message 57656.
	"I don't see anything new on https://www.gpugrid.net/apps.php yet?" OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist.

mg13 [HWU] Send message Joined: 18 Nov 09 Posts: 7 Credit: 107,006 RAC: 0 Level Scientific publications	Message 57665 - Posted: 26 Oct 2021 \| 23:12:12 UTC - in response to Message 57663.
	"I don't see anything new on https://www.gpugrid.net/apps.php yet?" "OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist." Yes, there is the link: just go to the home page and click on "Join us", and on the page that opens, in the "Configuring your participation" section, point 2, click on "apps" and you will find it.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57666 - Posted: 27 Oct 2021 \| 0:57:38 UTC - in response to Message 57665.
	Thanks. You’re right, it’s there. But I’ll follow up that it’s a very odd place for it. Nearly every other BOINC project puts a link near/with credit statistics, directly on the main page, or as a link on the bottom of every page.

Bill F Send message Joined: 21 Nov 16 Posts: 32 Credit: 140,098,150 RAC: 388,093 Level Scientific publications	Message 57668 - Posted: 27 Oct 2021 \| 1:41:30 UTC
	Well, I am checked and enabled, including "run Test Apps"; we will see if I get a task assigned. Thanks, Bill F

dthonon Send message Joined: 26 Aug 21 Posts: 1 Credit: 454,270,067 RAC: 1,841,360 Level Scientific publications	Message 57678 - Posted: 27 Oct 2021 \| 13:58:31 UTC - in response to Message 57668.
	This application is enabled in my preferences, and I accept test applications, but I am not getting any python task : mer. 27 oct. 2021 15:51:04 \| GPUGRID \| Scheduler request completed: got 0 new tasks Server status shows 10 tasks waiting to be sent.

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57680 - Posted: 27 Oct 2021 \| 16:44:26 UTC
	Why are you not running more test tasks for the new app? Almost all of the tasks ended on one of Ian’s hosts … or is that enough feedback for now? Anyway, credit calculation looks almost random to me, at least for these tasks. Any chance you will fix that before this gets into production? (Comparison: a 700 sec runtime awarded ~100k vs. an admittedly lower-end card with a 110k sec runtime getting 565k credit. Seems out of scope.)

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57682 - Posted: 28 Oct 2021 \| 4:00:11 UTC - in response to Message 57680.
	wow, I didn't even notice, I was out all day. I just set the system (7x GPU) to check for work every 100s or so and only checking for beta GPU work, so it doesn't surprise me that it got so many. it would ask for 7 at once, and I guess it got lucky that it asked for some work before anyone else. beta tasks have always paid a lot of credit here for some reason. but as with previous beta tasks, I see no indication that these tasks actually did anything on the GPU. my guess is that they ran some stuff on the CPU then finished. I've asked before what their intentions are with these tasks, and it's clear they are doing some type of machine learning kind of thing, but they don't appear to be even using the GPU at all, which is very strange when they are labelled as a cuda app.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57683 - Posted: 28 Oct 2021 \| 8:28:16 UTC
	The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57686 - Posted: 28 Oct 2021 \| 15:39:24 UTC - in response to Message 57683.
	"The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs." cuda100* but yeah, looks to be the same app as listed in the Anaconda Python 3 category, same versioning.

Azmodes Send message Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 Level Scientific publications	Message 57706 - Posted: 1 Nov 2021 \| 10:25:58 UTC
	So, uh, that was it?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57711 - Posted: 2 Nov 2021 \| 14:13:27 UTC - in response to Message 57706.
	Not quite, but... Got a new Python task. It failed: 14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py) Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? Not yet, but I'll install it before the replacement task I got on report has a chance to start. We shouldn't need to do that. (and now my second Linux machine has got one too)
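The failure above comes down to pip shelling out to a `git` executable that is not on the PATH. As a rough illustration only, here is a minimal sketch of a pre-flight check that would surface the problem before the install step runs. The helper name `find_missing_tools` is hypothetical and is not taken from the project's run.py.

```python
import shutil


def find_missing_tools(tools):
    """Return the executable names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Check for the one external tool the log complains about.
missing = find_missing_tools(["git"])
if missing:
    print("Missing required tools:", ", ".join(missing))
else:
    print("All required tools found on PATH")
```

A check like this only reports the problem; as noted in the thread, the real fix is for the task not to depend on a host-installed git at all.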

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57712 - Posted: 2 Nov 2021 \| 14:38:54 UTC
	That looks better - I'd say the GPU is running: But what's [ObstacleTower (as boinc)]? It's appeared on my task bar, and opens to a tiny, all black, window?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57713 - Posted: 2 Nov 2021 \| 14:53:23 UTC
	Second machine has acquired an ObstacleTower, too. Interesting snip from stderr in running (repeated many times): (raylet) ModuleNotFoundError: No module named 'aiohttp.signals' (raylet) /var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command. (raylet) warnings.warn( (raylet) Traceback (most recent call last): (raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module> (raylet) import ray.new_dashboard.utils as dashboard_utils (raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module> (raylet) import aiohttp.signals (raylet) ModuleNotFoundError: No module named 'aiohttp.signals' WARNING:gym_unity:New seed 57 will apply on next reset. WARNING:gym_unity:New starting floor 0 will apply on next reset.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 57714 - Posted: 2 Nov 2021 \| 16:02:12 UTC - in response to Message 57711.
	This is being solved server-side, no need to install software of course. Not quite, but... Got a new Python task. It failed: 14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py) Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? Not yet, but I'll install it before the replacement task I got on report has a chance to start. We shouldn't need to do that. (and now my second Linux machine has got one too)

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57715 - Posted: 2 Nov 2021 \| 16:22:51 UTC - in response to Message 57711.
	The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world it is common to do research in 3D world simulations like this one. This is the github page of the project: https://github.com/Unity-Technologies/obstacle-tower-env We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on GPU, and so do the Deep Learning models learning how to solve the simulation. Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved from the server side as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57716 - Posted: 2 Nov 2021 \| 16:45:03 UTC - in response to Message 57715.
	The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world it is common to do research in 3D world simulations like this one. This is the github page of the project: https://github.com/Unity-Technologies/obstacle-tower-env We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on GPU, and so do the Deep Learning models learning how to solve the simulation. Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved from the server side as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment. do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? these are designed for machine learning tasks.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57718 - Posted: 2 Nov 2021 \| 17:50:17 UTC
	Thanks for the feedback - on that basis, I'll keep pushing them through. Had an odd finish: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073' (raylet) /var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command. (raylet) warnings.warn( (raylet) Traceback (most recent call last): (raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module> (raylet) import ray.new_dashboard.utils as dashboard_utils (raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module> (raylet) import aiohttp.signals (raylet) ModuleNotFoundError: No module named 'aiohttp.signals' INFO:mlagents_envs.environment:Environment shut down with return code 0. 15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794 15:21:11 (827067): app exit status: 0x1 15:21:11 (827067): called boinc_finish(195) "Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'.
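The apparent contradiction is that two different return codes are being reported: the Unity environment's own shutdown code (0) and the code the wrapper passes to boinc_finish (195). A small sketch of how one might label such codes when reading logs; the lookup table holds only the two values that appear in this thread, and the names `KNOWN_EXIT_CODES` and `describe_exit` are illustrative, not part of BOINC.

```python
# Only the codes seen in this thread; this is not a complete BOINC error table.
KNOWN_EXIT_CODES = {
    0: "success",
    195: "EXIT_CHILD_FAILED",
}


def describe_exit(code):
    """Format an exit code as decimal, hex, and a name if one is known."""
    name = KNOWN_EXIT_CODES.get(code, "unknown")
    return f"{code} (0x{code:x}) {name}"


print(describe_exit(195))  # 195 (0xc3) EXIT_CHILD_FAILED
print(describe_exit(0))    # 0 (0x0) success
```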

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57719 - Posted: 3 Nov 2021 \| 6:23:50 UTC
	Tried a LOT of the PythonGPU tasks today. Still no joy for a successful run. Think they are getting further along though since I think I see progress in how far they get before the environment collapses and errors out.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57721 - Posted: 3 Nov 2021 \| 9:30:00 UTC
	The next round of testing has started. e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by upload failure: <file_xfer_error> <file_name>e1a10-ABOU_PPOObstacle6-0-1-RND2533_0_0</file_name> <error_code>-131 (file size too big)</error_code>

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57722 - Posted: 3 Nov 2021 \| 9:36:16 UTC Last modified: 3 Nov 2021 \| 10:01:49 UTC
	Got another from what looks like the same batch. Limit is <max_nbytes>100000000.000000</max_nbytes> I'll catch the output and see how big it is. Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain.
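For a sense of scale, the numbers quoted above can be compared directly. This is only a back-of-the-envelope sketch: it assumes the 169 MB reading means decimal megabytes, and the function name `exceeds_limit` is made up for illustration.

```python
# Upload limit taken from the quoted <max_nbytes>100000000.000000</max_nbytes>.
MAX_NBYTES = 100_000_000


def exceeds_limit(size_bytes, limit=MAX_NBYTES):
    """Return True if an output file of size_bytes would trip the upload limit."""
    return size_bytes > limit


# Assumption: the "169 MB" observed in the slot directory is 169 * 10^6 bytes.
observed = 169 * 10**6

print(exceeds_limit(observed))   # True
print(observed / MAX_NBYTES)     # 1.69
```

Under that assumption the output file is roughly 1.7 times the allowed size, which matches the "-131 (file size too big)" upload error.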

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57723 - Posted: 3 Nov 2021 \| 10:25:40 UTC - in response to Message 57722. Last modified: 3 Nov 2021 \| 10:28:53 UTC
	Yes the file should be 170M approx.

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57724 - Posted: 3 Nov 2021 \| 10:26:09 UTC - in response to Message 57722.
	Yes the file should be 170M approx. ____________

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57725 - Posted: 3 Nov 2021 \| 10:33:04 UTC Last modified: 3 Nov 2021 \| 10:45:31 UTC
	Well, I got one for you to study: e1a8-ABOU_PPOObstacle7-0-1-RND2466_3 That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package. I've suspended work fetch for now - please let us know when the next iteration is ready to test. Edit - this is what the upload file contained: It seems a bit odd to return the ObstacleTower zip back to you unchanged?

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57726 - Posted: 3 Nov 2021 \| 10:55:09 UTC - in response to Message 57711. Last modified: 3 Nov 2021 \| 11:38:59 UTC
	The git-related errors should be solved now. ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH? We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback. ____________

Azmodes Send message Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 Level Scientific publications	Message 57727 - Posted: 3 Nov 2021 \| 12:19:15 UTC
	Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes: ==> WARNING: A newer version of conda exists. <== current version: 4.8.3 latest version: 4.10.3 Please update conda by running $ conda update -n base -c defaults conda 13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190 13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py) path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap'] git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git Traceback (most recent call last): File "run.py", line 340, in <module> main() File "run.py", line 53, in main print("GPU available: {}".format(torch.cuda.is_available())) NameError: name 'torch' is not defined 13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758 13:14:10 (11501): app exit status: 0x1 13:14:10 (11501): called boinc_finish(195) </stderr_txt> ]]>
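The traceback shows `torch` being referenced in run.py without the name ever having been bound, which usually means the import was skipped or failed earlier. How run.py actually imports it is not shown in the thread, so the following is only a guess at a defensive pattern; the function name `gpu_available` is hypothetical.

```python
def gpu_available():
    """Return torch.cuda.is_available(), or None if PyTorch cannot be imported."""
    try:
        import torch  # imported inside the function so a missing package is caught here
    except ImportError:
        return None
    return torch.cuda.is_available()


status = gpu_available()
if status is None:
    print("PyTorch is not installed in this environment")
else:
    print("GPU available: {}".format(status))
```

Reporting a missing package explicitly would at least make the stderr easier to read than a bare NameError.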

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57728 - Posted: 3 Nov 2021 \| 14:24:08 UTC Last modified: 3 Nov 2021 \| 14:24:55 UTC
	Got five PythonGPU tasks to finish and report after about ten minutes that were valid.

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57732 - Posted: 3 Nov 2021 \| 17:18:12 UTC
	My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks?

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57733 - Posted: 3 Nov 2021 \| 19:50:45 UTC - in response to Message 57732.
	Haven't heard of any posts by admin types that Windows apps will be made. That stated, often the new beta apps are tested first on Linux to get the bugs out and then the Windows apps are generated.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57735 - Posted: 3 Nov 2021 \| 19:55:47 UTC
	This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits. https://www.gpugrid.net/result.php?resultid=32660133

PDW Send message Joined: 7 Mar 14 Posts: 15 Credit: 4,909,494,525 RAC: 31,617,022 Level Scientific publications	Message 57737 - Posted: 3 Nov 2021 \| 20:05:30 UTC - in response to Message 57735.
	Did you notice if it used the GPU and if it did what percentage ? I had one that ran for about 3 hours before failing, never saw the fans running during that time.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57740 - Posted: 3 Nov 2021 \| 20:28:14 UTC Last modified: 3 Nov 2021 \| 20:41:52 UTC
	just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184 16:19:48 (1841951): wrapper (7.7.26016): starting 16:19:48 (1841951): wrapper (7.7.26016): starting 16:19:48 (1841951): wrapper: running /usr/bin/flock (/home/ian/BOINC/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /home/ian/BOINC/projects/www.gpugrid.net/miniconda && /home/ian/BOINC/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ") 0%\| \| 0/35 [00:00<?, ?it/s] Extracting : tk-8.6.8-hbc83047_0.conda: 0%\| \| 0/35 [00:00<?, ?it/s] Extracting : tk-8.6.8-hbc83047_0.conda: 3%\|2 \| 1/35 [00:00<00:11, 3.04it/s] Extracting : urllib3-1.25.8-py37_0.conda: 3%\|2 \| 1/35 [00:00<00:11, 3.04it/s] Extracting : libedit-3.1.20181209-hc058e9b_0.conda: 6%\|5 \| 2/35 [00:00<00:10, 3.04it/s] Extracting : libgcc-ng-9.1.0-hdf63c60_0.conda: 9%\|8 \| 3/35 [00:00<00:10, 3.04it/s] Extracting : ld_impl_linux-64-2.33.1-h53a641e_7.conda: 11%\|#1 \| 4/35 [00:00<00:10, 3.04it/s] Extracting : python-3.7.7-hcff3b4d_5.conda: 14%\|#4 \| 5/35 [00:00<00:09, 3.04it/s] Extracting : python-3.7.7-hcff3b4d_5.conda: 17%\|#7 \| 6/35 [00:00<00:06, 4.16it/s] Extracting : tqdm-4.46.0-py_0.conda: 17%\|#7 \| 6/35 [00:00<00:06, 4.16it/s] Extracting : ca-certificates-2020.1.1-0.conda: 20%\|## \| 7/35 [00:00<00:06, 4.16it/s] Extracting : wheel-0.34.2-py37_0.conda: 23%\|##2 \| 8/35 [00:00<00:06, 4.16it/s] Extracting : libstdcxx-ng-9.1.0-hdf63c60_0.conda: 26%\|##5 \| 9/35 [00:00<00:06, 4.16it/s] Extracting : certifi-2020.4.5.1-py37_0.conda: 29%\|##8 \| 10/35 [00:00<00:06, 4.16it/s] Extracting : readline-8.0-h7b6447c_0.conda: 31%\|###1 \| 11/35 [00:00<00:05, 4.16it/s] Extracting : ncurses-6.2-he6710b0_1.conda: 34%\|###4 \| 12/35 [00:00<00:05, 4.16it/s] Extracting : conda-package-handling-1.6.1-py37h7b6447c_0.conda: 37%\|###7 \| 13/35 [00:00<00:05, 4.16it/s] Extracting : chardet-3.0.4-py37_1003.conda: 40%\|#### \| 14/35 [00:00<00:05, 
4.16it/s] Extracting : zlib-1.2.11-h7b6447c_3.conda: 43%\|####2 \| 15/35 [00:00<00:04, 4.16it/s] Extracting : six-1.14.0-py37_0.conda: 46%\|####5 \| 16/35 [00:00<00:04, 4.16it/s] Extracting : pycparser-2.20-py_0.conda: 49%\|####8 \| 17/35 [00:00<00:04, 4.16it/s] Extracting : libffi-3.3-he6710b0_1.conda: 51%\|#####1 \| 18/35 [00:00<00:04, 4.16it/s] Extracting : pycosat-0.6.3-py37h7b6447c_0.conda: 54%\|#####4 \| 19/35 [00:00<00:03, 4.16it/s] Extracting : cffi-1.14.0-py37he30daa8_1.conda: 57%\|#####7 \| 20/35 [00:00<00:03, 4.16it/s] Extracting : _libgcc_mutex-0.1-main.conda: 60%\|###### \| 21/35 [00:00<00:03, 4.16it/s] Extracting : pyopenssl-19.1.0-py37_0.conda: 63%\|######2 \| 22/35 [00:00<00:03, 4.16it/s] Extracting : idna-2.9-py_1.conda: 66%\|######5 \| 23/35 [00:00<00:02, 4.16it/s] Extracting : pysocks-1.7.1-py37_0.conda: 69%\|######8 \| 24/35 [00:00<00:02, 4.16it/s] Extracting : xz-5.2.5-h7b6447c_0.conda: 71%\|#######1 \| 25/35 [00:00<00:02, 4.16it/s] Extracting : setuptools-46.4.0-py37_0.conda: 74%\|#######4 \| 26/35 [00:00<00:02, 4.16it/s] Extracting : ruamel_yaml-0.15.87-py37h7b6447c_0.conda: 77%\|#######7 \| 27/35 [00:00<00:01, 4.16it/s] Extracting : cryptography-2.9.2-py37h1ba5d50_0.conda: 80%\|######## \| 28/35 [00:00<00:01, 4.16it/s] Extracting : openssl-1.1.1g-h7b6447c_0.conda: 83%\|########2 \| 29/35 [00:00<00:01, 4.16it/s] Extracting : sqlite-3.31.1-h62c20be_1.conda: 86%\|########5 \| 30/35 [00:00<00:01, 4.16it/s] Extracting : pip-20.0.2-py37_3.conda: 89%\|########8 \| 31/35 [00:00<00:00, 4.16it/s] Extracting : yaml-0.1.7-had09818_2.conda: 91%\|#########1\| 32/35 [00:00<00:00, 4.16it/s] Extracting : requests-2.23.0-py37_0.conda: 94%\|#########4\| 33/35 [00:00<00:00, 4.16it/s] Extracting : conda-4.8.3-py37_0.tar.bz2: 97%\|#########7\| 34/35 [00:00<00:00, 4.16it/s] ==> WARNING: A newer version of conda exists. 
<== current version: 4.8.3 latest version: 4.10.3 Please update conda by running $ conda update -n base -c defaults conda 16:21:21 (1841951): /usr/bin/flock exited; CPU time 61.036800 16:21:21 (1841951): wrapper: running ./gpugridpy/bin/python (run.py) Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-wwv7ghqo /home/ian/BOINC/slots/15/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command. warnings.warn( Downloading... From: https://storage.googleapis.com/obstacle-tower-build/v4.1/obstacletower_v4.1_linux.zip To: /home/ian/BOINC/slots/15/obstacletower_v4.1_linux.zip 0%\| \| 0.00/170M [00:00<?, ?B/s] 1%\| \| 2.10M/170M [00:00<00:08, 19.9MB/s] 6%\|█ \| 10.5M/170M [00:00<00:02, 56.2MB/s] 11%\|██ \| 19.4M/170M [00:00<00:02, 70.8MB/s] 16%\|██ \| 27.8M/170M [00:00<00:02, 70.6MB/s] 22%\|███ \| 37.7M/170M [00:00<00:01, 76.7MB/s] 28%\|███ \| 47.7M/170M [00:00<00:01, 79.0MB/s] 34%\|████ \| 57.1M/170M [00:00<00:01, 82.8MB/s] 38%\|████ \| 65.5M/170M [00:00<00:01, 80.4MB/s] 43%\|█████ \| 73.9M/170M [00:00<00:01, 81.2MB/s] 49%\|█████ \| 82.8M/170M [00:01<00:01, 83.4MB/s] 54%\|██████ \| 91.2M/170M [00:01<00:00, 80.8MB/s] 59%\|██████ \| 101M/170M [00:01<00:00, 81.3MB/s] 65%\|███████ \| 110M/170M [00:01<00:00, 83.7MB/s] 70%\|███████ \| 119M/170M [00:01<00:00, 79.0MB/s] 75%\|████████ \| 127M/170M [00:01<00:00, 80.2MB/s] 80%\|████████ \| 137M/170M [00:01<00:00, 79.2MB/s] 85%\|█████████ \| 145M/170M [00:01<00:00, 80.1MB/s] 90%\|█████████ \| 154M/170M [00:01<00:00, 79.1MB/s] 96%\|██████████\| 163M/170M [00:02<00:00, 82.6MB/s] 100%\|██████████\| 170M/170M [00:02<00:00, 78.6MB/s] 16:21:54 (1841951): ./gpugridpy/bin/python exited; CPU time 22.798227 16:21:59 (1841951): called boinc_finish(0) </stderr_txt> 
<message> upload failure: <file_xfer_error> <file_name>e1a6-ABOU_PPOObstacle6-0-1-RND7771_2_0</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> </message> ran for about 2 mins and errored out. file size too big? how big could the file get in 2 minutes? lol. looks like everyone in this WU chain is having the same issue though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU? and I saw no evidence that it ever touched the GPU, refreshing nvidia-smi every 2 seconds showed no process running on the GPU. must still be using only the CPU. Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing. ____________
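Watching nvidia-smi by hand, as described above, can also be scripted. The sketch below assumes an NVIDIA driver with `nvidia-smi` on the PATH and uses its CSV query mode; the function names are illustrative, and the sample values fed to the parser are made up rather than taken from any task in this thread.

```python
import subprocess

# Ask nvidia-smi for per-GPU utilization (%) and used memory (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse 'util, mem' CSV lines into (utilization_percent, memory_used_mib) tuples.

    Assumes both fields are plain integers, as printed with the 'nounits' option.
    """
    stats = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return one tuple per GPU (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Made-up sample in the shape nvidia-smi prints for a two-GPU host.
print(parse_gpu_stats("31, 8000\n13, 512\n"))  # [(31, 8000), (13, 512)]
```

Calling `query_gpu_stats()` in a loop with a short sleep would approximate the two-second refresh mentioned in the post; a task that never touches the GPU would show utilization staying at or near zero.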

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57741 - Posted: 3 Nov 2021 \| 20:33:04 UTC
	The ones that have partially run and were validated only used 31% of the GPU in nvidia-smi. The one task that appears to have successfully run through to normal completion was done while I was out of the house, so unfortunately I did not see it run. Will have to wait for more to observe.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57743 - Posted: 3 Nov 2021 \| 21:41:16 UTC Last modified: 3 Nov 2021 \| 21:44:30 UTC
	Looks like the tasks fluctuate between a few seconds at 1% utilization before returning to hovering around 10-13% utilization. I was watching one on a 2070 and it was running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program. I think I killed it when I pulled up htop to look at how much CPU it was using, because it finished with an error at the same instant that htop populated the screen.

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57748 - Posted: 4 Nov 2021 \| 9:27:01 UTC - in response to Message 57725.
	The contents of the obstacletower.zip downloaded file are necessary to generate the data required for the machine learning agent to train. That is why the file itself is not modified. Only used to generate the training data. The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned. Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory. ____________

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57750 - Posted: 4 Nov 2021 \| 9:31:48 UTC - in response to Message 57748.
	"The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned." That makes much more sense. Standing by for the next round of debugging... :-)
	ID: 57750 \| Rating: 0 \| rate: / Reply Quote

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57751 - Posted: 4 Nov 2021 \| 9:47:15 UTC Last modified: 4 Nov 2021 \| 9:47:32 UTC
	That's the next bad news for me as my GPU is maxed out at 6GB. Without upgrading my GPU and that's not likely gonna be soon, I suppose I have to give up on these types of tasks - at least for the time being. Thanks for the update though
	ID: 57751 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57752 - Posted: 4 Nov 2021 \| 12:31:04 UTC - in response to Message 57748.
	Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory. why such low GPU utilization? and 8000? or do you mean 800? 8GB? or 800MB? ____________
	ID: 57752 \| Rating: 0 \| rate: / Reply Quote

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57753 - Posted: 4 Nov 2021 \| 12:42:20 UTC Last modified: 4 Nov 2021 \| 12:44:21 UTC
	I can only speculate in regards to the former one. But your latter question likely resolves to 8,000 MiB (mebibytes), which is just another convention for counting bytes – if he indeed meant to write 8,000. While k (kilo), M (mega), G (giga) and T (tera) are the SI prefixes, computed in base 10 as 10^3, 10^6, 10^9 and 10^12 respectively, the binary prefixes Ki (kibi), Mi (mebi), Gi (gibi) and Ti (tebi) are computed in base 2 as 2^10, 2^20, 2^30 and 2^40. As such M/Mi = (10^6/2^20) ~ 95.37%, or a difference of ~4.63% between the SI and binary prefix units. For example: 1 kB = 1000 B, whereas 1 KiB = 1024 B.
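For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion. The 8000 MiB figure is the one quoted in this thread; the constant and function names are just illustrative.

```python
# Binary (IEC) vs decimal (SI) prefixes, expressed in bytes.
MIB = 2**20   # 1 MiB = 1,048,576 bytes
GIB = 2**30
MB = 10**6
GB = 10**9


def mib_to_bytes(mib: float) -> float:
    """Convert a size in mebibytes to bytes."""
    return mib * MIB


reported_mib = 8000                      # figure quoted in the thread
n_bytes = mib_to_bytes(reported_mib)

print(f"{reported_mib} MiB = {n_bytes / GB:.3f} GB (SI)")
print(f"{reported_mib} MiB = {n_bytes / GIB:.4f} GiB (binary)")
print(f"MB/MiB ratio = {MB / MIB:.4%}")
```

So 8000 MiB is roughly 8.39 GB in SI units, or 7.8125 GiB; either way it is more than a 6 GB card offers.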
	ID: 57753 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57754 - Posted: 4 Nov 2021 \| 12:52:03 UTC - in response to Message 57753.
	yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB ____________
	ID: 57754 \| Rating: 0 \| rate: / Reply Quote

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57755 - Posted: 4 Nov 2021 \| 12:58:04 UTC Last modified: 4 Nov 2021 \| 12:58:33 UTC
	ah, all right. didn't mean to offend you if that's what I did. still don't understand their beta testing procedure anyway. so far not many tasks have been run, only a few of them successfully, but meanwhile nearly no information has been shared, rendering the whole procedure rather opaque and leaving others in the dark wondering about their piles of unsuccessful tasks. and the little information that is indeed shared seems to conflict a lot with the user experience and observations. for a ML task 8 GB isn't untypical though
	ID: 57755 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57756 - Posted: 4 Nov 2021 \| 13:06:37 UTC - in response to Message 57755.
	I agree that lots of memory use wouldn't be atypical for AI/ML work. and also agree that the admins should be a little more transparent about what these tasks are doing and the expected behaviors. it seems so far they have tons and tons of errors, then the admins come back and say they fixed the errors, then just more errors again. I'd also like to know if these are using the Tensor cores on RTX GPUs. ____________
	ID: 57756 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57757 - Posted: 4 Nov 2021 \| 13:17:33 UTC
	I think the Beta testing process is (as usual anywhere) very much an incremental process. It will have started with small test units, and as each little buglet surfaces and is solved, the process moves on to test a later segment that wasn't accessible until the previous problem had been overcome. Thus - Abouh has confirmed that yesterday's upload file size problem was caused by including a source data file in the output - "Should not be returned". I also noted that some of Keith's successful runs were resends of tasks which had failed on other machines - some of them generic problems which I would have expected to cause a failure on his machine too. So it seems that dynamic fixes may have been applied too. Normally, a new BOINC replication task is an exact copy of its predecessor, but I don't think that can be automatically assumed during this Beta phase. In particular, Keith's observation that one test task only used 200 MB of GPU memory isn't necessarily a foolproof guide to the memory demand of later tests.
	ID: 57757 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57758 - Posted: 4 Nov 2021 \| 13:22:50 UTC - in response to Message 57757.
	which is why I asked for clarification in light of the disparity between expected and observed behaviors. ____________
	ID: 57758 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57761 - Posted: 4 Nov 2021 \| 17:05:37 UTC - in response to Message 57754.
	yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB Yes, I have watched tasks complete fully to a proper boinc finish end and I never saw more than 290MB of gpu memory reported in nvidia-smi at a max 13% utilization. Unless nvidia-smi has an issue in reporting gpu RAM used, the 8GB of memory post is out of line. Or the tasks the scientist-developer mentioned haven't been released to us out of the laboratory yet.
	ID: 57761 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57766 - Posted: 5 Nov 2021 \| 10:37:31 UTC - in response to Message 57752.
	We are progressing in our debugging and have managed to solve several errors, but as mentioned in a previous post, it is an incremental process. We are trying to train an AI agent using reinforcement learning, which generally interleaves stages in which the agent collects data (a less GPU-intensive process) and stages in which the agent learns from that data. The nature of the problem, in which data is progressively generated, accounts for a lower GPU utilisation than in supervised machine learning, although we will work to progressively make it more efficient once debugging is completed. Since the obstacle tower environment (https://github.com/Unity-Technologies/obstacle-tower-env), the source of data, also runs on the GPU, during the learning stage the neural network and the training data together with the environment occupy approximately 8,000 MiB (mebibytes, it was not a typo) of GPU memory when checked locally with nvidia-smi.
	Basically, the python script has the following steps:
	step 1: Defining the conda environment with all dependencies.
	step 2: Downloading obstacletower.zip, a necessary file used to generate the data.
	step 3: Initialising the data generator using the contents of obstacletower.zip.
	step 4: Creating the AI agent and alternating data collection and data training stages.
	step 5: Returning the trained AI agent, and not obstacletower.zip.
	The GPU is used only after reaching step 4 and step 5. Some of the jobs that succeeded but barely used the GPU were sent to test that the problems in step 1 and step 2 had indeed been solved (most of them solved by Keith Myers). We noticed that the most recent failed jobs returned the following error at step 3:
	mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
	The environment does not need user interaction to launch
	The Agents' Behavior Parameters > Behavior Type is set to "Default"
	The environment and the Python interface have compatible versions.
	We are working to solve it. If step 3 is completed without errors, jobs reaching steps 4 and 5 should be using the GPU. We hope this helps shed some light on our work and the recent results. We will try to resolve any further doubts and keep you informed about our progress. ____________
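The alternation described in step 4 can be illustrated with a toy sketch. This is not the project's code: a two-armed bandit stands in for the environment, a pair of value estimates stands in for the neural network, and all names and numbers are made up for illustration. It only shows the shape of the loop, with a collection stage followed by a learning stage in every iteration.

```python
import random


def collect(values, n_steps, rng, epsilon=0.2):
    """Data-collection stage: act in the environment and store (action, reward)."""
    true_means = [0.2, 0.8]   # toy two-armed bandit, stands in for the environment
    batch = []
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(values))                       # explore
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < true_means[action] else 0.0
        batch.append((action, reward))
    return batch


def learn(values, batch, lr=0.1):
    """Learning stage: move each value estimate towards the observed rewards."""
    for action, reward in batch:
        values[action] += lr * (reward - values[action])


def train(n_iterations=20, steps_per_iteration=50, seed=0):
    rng = random.Random(seed)
    values = [0.0, 0.0]
    phases = []
    for _ in range(n_iterations):
        batch = collect(values, steps_per_iteration, rng)  # light on the GPU in a real job
        phases.append("collect")
        learn(values, batch)                               # where a real job would load the GPU
        phases.append("learn")
    return values, phases


values, phases = train()
print(phases[:4], [round(v, 2) for v in values])
```

In a real job the learning stage is where the network update runs, which would explain utilisation that rises and falls rather than staying flat.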
	ID: 57766 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57767 - Posted: 5 Nov 2021 \| 12:10:16 UTC - in response to Message 57766.
	Thanks for the more detailed answer. regarding the 8GB of memory used:
	- which step of the process does this happen?
	- was Keith's nvidia-smi screenshot that he posted in another thread showing low memory use, from an earlier unit that did not require that much VRAM?
	- will these units fail from too little VRAM?
	- what will you do or are you doing about GPUs with less than 8GB VRAM, or even with 8GB?
	- do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?
	____________
	ID: 57767 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57768 - Posted: 5 Nov 2021 \| 13:57:40 UTC - in response to Message 57767.
	-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB? It's certainly possible to set such a filter: referring again to Specifying plan classes in C++, the scheduler can check a CUDA plan class specification like
	if (!strcmp(plan_class, "cuda23")) {
	    if (!cuda_check(c, hu,
	        100,        // minimum compute capability (1.0)
	        200,        // max compute capability (2.0)
	        2030,       // min CUDA version (2.3)
	        19500,      // min display driver version (195.00)
	        384*MEGA,   // min video RAM
	        1.,         // # of GPUs used (may be fractional, or an integer > 1)
	        .01,        // fraction of FLOPS done by the CPU
	        .21         // estimated GPU efficiency (actual/peak FLOPS)
	    )) {
	        return false;
	    }
	}
	We last discussed that code in connection with compute capability, but I think we're still having problems implementing filters via tools like that.
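As a rough illustration of what such a filter decides, here is a sketch in Python. It is not BOINC scheduler code: the `Gpu` and `Requirements` classes and the `is_eligible` function are hypothetical, and the 8000 MiB threshold is simply the figure mentioned earlier in this thread.

```python
from dataclasses import dataclass

MIB = 2**20  # bytes per mebibyte


@dataclass
class Gpu:                      # hypothetical host GPU description
    name: str
    vram_bytes: int
    compute_capability: int     # 750 means 7.5, same encoding as the C++ snippet


@dataclass
class Requirements:             # hypothetical plan-class requirements
    min_vram_bytes: int
    min_compute_capability: int = 0


def is_eligible(gpu: Gpu, req: Requirements) -> bool:
    """Return True only if the GPU meets every minimum requirement."""
    return (gpu.vram_bytes >= req.min_vram_bytes
            and gpu.compute_capability >= req.min_compute_capability)


python_gpu = Requirements(min_vram_bytes=8000 * MIB)
small_card = Gpu("6 GiB card", 6144 * MIB, 750)
large_card = Gpu("11 GiB card", 11264 * MIB, 750)

for card in (small_card, large_card):
    print(card.name, "eligible" if is_eligible(card, python_gpu) else "skipped")
```

The point is only that a minimum-VRAM check is a single comparison once the scheduler knows the card's memory; whether and how the project wires this in is up to the admins.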
	ID: 57768 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57769 - Posted: 5 Nov 2021 \| 14:50:31 UTC - in response to Message 57767. Last modified: 5 Nov 2021 \| 14:51:06 UTC
	At step 3, initialising the environment requires a small amount of GPU memory (somewhere around 1GB). At step 4 the AI agent is initialised and trained, and a data storage class and a neural network are created and placed on the GPU. This is when more memory is required. However, in the next round of tests we will lower the GPU memory requirements of the script while debugging step 3. Eventually for steps 4 and 5 we expect it to require the 8G mentioned earlier. Keith's nvidia-smi screenshot showing a job with low memory use was a job that returned after step 2, to verify problems in steps 1 and 2 had been solved. ____________
	ID: 57769 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57770 - Posted: 5 Nov 2021 \| 17:08:45 UTC - in response to Message 57769.
	So was this WU https://www.gpugrid.net/result.php?resultid=32660133 the one that was completed after steps 1 and 2? Or after steps 4 and 5? I never got to witness this one in realtime. I had nvidia-smi polling update set at 1 second and I never saw the gpu memory usage go above 290MB for that screenshot. It was not taken from the task linked above. The BOINC completion percentage just went to 10% and stayed there and never showed 100% completion when it finished. Think that is an issue with BOINC historically.
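For catching the peak memory of a task nobody is watching, polling can be scripted instead of eyeballed. A minimal sketch, assuming `nvidia-smi` is on the PATH and that every queried field comes back as a plain integer (a field reported as `[N/A]` would need extra handling). The function names are made up, and the sample text below is illustrative, not real output from these tasks.

```python
import subprocess

# nvidia-smi query producing one CSV line per GPU: index, MiB used, % utilization
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_smi_csv(text):
    """Parse 'index, memory_used, utilization' lines into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        index, mem_mib, util = (field.strip() for field in line.split(","))
        rows.append({"index": int(index),
                     "memory_used_mib": int(mem_mib),
                     "utilization_pct": int(util)})
    return rows


def update_peaks(peaks, rows):
    """Keep the highest memory figure seen so far for each GPU index."""
    for row in rows:
        peaks[row["index"]] = max(peaks.get(row["index"], 0), row["memory_used_mib"])
    return peaks


def sample_once():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)


# Illustrative sample text, so the parsing can be tried without a GPU.
sample = "0, 290, 13\n1, 2027, 5\n"
print(update_peaks({}, parse_smi_csv(sample)))
```

Calling `sample_once()` in a loop with `time.sleep(1)` and feeding each result to `update_peaks` would record the maximum over a whole run.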
	ID: 57770 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57771 - Posted: 5 Nov 2021 \| 17:16:26 UTC
	The environment and the Python interface have compatible versions. Is the reason why I was able to complete a workunit properly because my local python environment matches the zipped wrapper python interface? I use several pypi applications that have probably set up the python environment variable. Is there something I can dump out of the host that completed the workunit properly that will help you debug the application package?
	ID: 57771 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57776 - Posted: 8 Nov 2021 \| 14:54:09 UTC - in response to Message 57770.
	This one completed the whole python script. Including steps 4 and 5. Should have used the GPU. ____________
	ID: 57776 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57777 - Posted: 8 Nov 2021 \| 16:00:17 UTC - in response to Message 57776.
	Thanks for confirming the one I completed used the gpu.
	ID: 57777 \| Rating: 0 \| rate: / Reply Quote

Greger Send message Joined: 6 Jan 15 Posts: 76 Credit: 24,192,102,249 RAC: 13,992,829 Level Scientific publications	Message 57778 - Posted: 9 Nov 2021 \| 21:23:30 UTC
	Did a check on one host running GPUGridpy units.
	e4a6-ABOU_ppo_gym_demos3-0-1-RND1018_0
	Run time: 4,999.53
	GPU memory: nvidia-smi reports 2027MiB
	No check-pointing yet but works well.
	ID: 57778 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57779 - Posted: 10 Nov 2021 \| 8:42:17 UTC - in response to Message 57778. Last modified: 10 Nov 2021 \| 8:45:13 UTC
	We sent out some jobs yesterday and almost all finished successfully. We are still working on avoiding the following error related to the Obstacle Tower environment: mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that : The environment does not need user interaction to launch The Agents' Behavior Parameters > Behavior Type is set to "Default" The environment and the Python interface have compatible versions. However, to test the rest of the code we tried another set of environments that are less problematic (https://gym.openai.com/). The successful jobs used these environments. While we find and test a solution for the Obstacle Tower locally, we will continue to send jobs with these environments to test the rest of the code. Note that reinforcement learning (RL) techniques are independent of the environment. The environment represents the world where the AI agent learns intelligent behaviours. Switching to another environment simply means applying the learning technique to a different problem that can be equally challenging (placing the agent in a different world). Thus, we will now finish debugging the app with these Gym environments simply because they are less prone to errors and, once we know the only possible source of problems is the environment, consider solving others. ____________
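The claim that the learning technique is independent of the environment can be sketched with two toy environments behind the same interface. Everything here is hypothetical and much simpler than Gym or Obstacle Tower; the only thing borrowed from Gym is the idea of a `reset()` / `step(action)` pair, here returning `(observation, reward, done)`.

```python
import random


class CoinFlipEnv:
    """Toy environment: guess a coin flip; an episode lasts 10 steps."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0                                  # single dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.rng.randrange(2) else 0.0
        return 0, reward, self.t >= 10


class CountdownEnv:
    """Toy environment: action 1 is rewarded; the episode ends when the counter hits 0."""

    def __init__(self, start=5):
        self.start = start
        self.counter = start

    def reset(self):
        self.counter = self.start
        return self.counter

    def step(self, action):
        self.counter -= 1
        reward = 1.0 if action == 1 else 0.0
        return self.counter, reward, self.counter <= 0


def run_episode(env, policy):
    """Environment-agnostic loop: it only relies on reset() and step()."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


def always_one(obs):
    """Trivial policy used for the demonstration."""
    return 1


for env in (CoinFlipEnv(), CountdownEnv()):
    print(type(env).__name__, run_episode(env, always_one))
```

The same `run_episode` drives both worlds unchanged, which is the sense in which swapping the environment leaves the rest of the code testable.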
	ID: 57779 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57782 - Posted: 10 Nov 2021 \| 13:47:20 UTC - in response to Message 57779. Last modified: 10 Nov 2021 \| 14:08:41 UTC
	I had a few failures: https://www.gpugrid.net/result.php?resultid=32660680 and https://www.gpugrid.net/result.php?resultid=32660448 seems to be a bad WU on both instances since all wingmen are erroring in the same way. mainly used ~6-7% GPU utilization on my 3080Ti, with intermittent spikes to ~20% every 10s or so. power use near idle, GPU memory utilization around 2GB, and system memory use around 4.8GB. make sure your system has enough memory in multi-GPU systems. ____________
	ID: 57782 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57785 - Posted: 10 Nov 2021 \| 14:48:30 UTC - in response to Message 57782.
	Thank you for the feedback. We had detected the error in https://www.gpugrid.net/result.php?resultid=32660448 but not the one in https://www.gpugrid.net/result.php?resultid=32660680 Having alternating phases of lower and higher GPU utilisation is normal in Reinforcement Learning, as the agent alternates between data collection (generally low GPU usage) and training (higher GPU memory and utilisation). Once we solve most of the errors we will focus on maximizing GPU efficiency during the training phases. ____________
	ID: 57785 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57786 - Posted: 10 Nov 2021 \| 15:04:20 UTC - in response to Message 57785. Last modified: 10 Nov 2021 \| 15:09:17 UTC
	have you considered creating a modified app that will use the RTX (and other) GPU's onboard Tensor cores? it should speed up things considerably. https://www.quora.com/Does-tensorflow-and-pytorch-automatically-use-the-tensor-cores-in-rtx-2080-ti-or-other-rtx-cards I'm guessing in addition to making the needed configuration changes, you'd need to adjust your scheduler to only send to cards with Tensor cores (GeForce RTX cards, TitanV, Tesla/QuadroRTX cards from Volta forward) ____________
	ID: 57786 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57789 - Posted: 10 Nov 2021 \| 16:28:26 UTC - in response to Message 57786. Last modified: 10 Nov 2021 \| 16:30:20 UTC
	information for pytorch here: https://github.com/NVIDIA/apex https://nvidia.github.io/apex/ ____________
	ID: 57789 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57790 - Posted: 10 Nov 2021 \| 17:04:13 UTC - in response to Message 57786.
	We are using PyTorch to train our agents, and for now we have not considered using mixed precision, which seems to be required for the Tensor cores. It could be an interesting possibility to reduce memory requirements and speed up training processes. I have to admit that I do not know how it affects performance in reinforcement learning algorithms, but it is an interesting option. ____________
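One reason mixed precision needs some care is that half precision cannot represent very small values, which is why PyTorch's automatic mixed precision tooling applies loss (gradient) scaling. The sketch below shows only that numeric effect, using the standard library's half-precision packing; it is not PyTorch code, and the gradient value and scale factor are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8        # a small gradient value (illustrative)
scale = 65536.0    # 2**16, an illustrative loss-scale factor

naive = to_fp16(grad)                    # too small for fp16: flushed to zero
scaled = to_fp16(grad * scale) / scale   # stored scaled in fp16, unscaled afterwards

print("fp16 without scaling:", naive)
print("fp16 with scaling:   ", scaled)
```

Without scaling the value is lost entirely; with scaling it survives the round trip to within fp16 rounding error. How much this matters for reinforcement learning specifically is an open question in this thread.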
	ID: 57790 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57792 - Posted: 10 Nov 2021 \| 18:53:33 UTC Last modified: 10 Nov 2021 \| 19:48:25 UTC
	Getting errors in the test5 run, like
	e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1
	e2a10-ABOU_ppod_gym_test5-0-1-RND0874_1
	And on the test6 run. This time, the error seems to be in placing the expected task files in the slot directory, prior to starting the main run.
	e3a17-ABOU_ppod_gym_test6-0-1-RND2029_0
	e3a11-ABOU_ppod_gym_test6-0-1-RND1260_4
	Both have
	File "run.py", line 393, in <module>
	  main()
	File "run.py", line 106, in main
	  feature_extractor_network=get_feature_extractor(args.nn),
	File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/pytorchrl/agent/actors/feature_extractors/__init__.py", line 19, in get_feature_extractor
	  raise ValueError("Specified model not found!")
	ValueError: Specified model not found!
	ID: 57792 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 57794 - Posted: 10 Nov 2021 \| 23:56:58 UTC Last modified: 10 Nov 2021 \| 23:57:13 UTC
	I got one that worked today. Then 6 more that didn't on the same PC https://www.gpugrid.net/workunit.php?wuid=27086033
	ID: 57794 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 57795 - Posted: 11 Nov 2021 \| 2:04:09 UTC
	I got another. So far it is running. Over 4 CPU threads at 1st, then 1 thread. For 1st 4min 13% completed, back to 10%, then no more progression. At 10%, then GPU load at 3-5%. 875MB VRAM. 78min so far.
	ID: 57795 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57796 - Posted: 11 Nov 2021 \| 6:49:43 UTC
	I've got several GPU Python beta tasks at my triple GPU Host #480458. Several of them have succeeded after around 5000 seconds execution time. But three of these tasks have exceeded this time. Task e1a20-ABOU_ppod_gym_test-0-1-RND4563_6 failed after 11432 seconds. Task e1a6-ABOU_ppod_gym_test-0-1-RND1186_1 failed after 18784 seconds. Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer. This last task is theoretically running at device 1. But it seems to be effectively running at device 0, sharing the same device with an ACEMD3 regular task e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5.
	ID: 57796 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57797 - Posted: 11 Nov 2021 \| 7:13:49 UTC
	I've got the same thing going on. BOINC says the task is running on Device2 while in reality it is sharing Device0 along with an Einstein GRP task. This is the task https://www.gpugrid.net/result.php?resultid=32661276
	ID: 57797 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57798 - Posted: 11 Nov 2021 \| 9:30:27 UTC - in response to Message 57796.
	Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer. The risk of beta testing: It finally failed after 42555 seconds. I hope this is somehow useful for debugging...
	ID: 57798 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57799 - Posted: 11 Nov 2021 \| 9:50:17 UTC - in response to Message 57798.
	Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer. FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201' The same for two of your predecessors on this workunit. Is there any way we could avoid re-inventing the wheel (slowly) for errors like this?
	ID: 57799 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57800 - Posted: 11 Nov 2021 \| 13:44:16 UTC - in response to Message 57799. Last modified: 11 Nov 2021 \| 13:48:31 UTC
	The excessively long training time problem and the problem related to FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201' have been fixed now. Most jobs sent today are being completed successfully. The reported issues were very helpful for debugging. Progress: The core research idea is to train populations of reinforcement learning agents that learn independently for a certain amount of time and, once they return to the server, put their learned knowledge in common with other agents to create a new generation of agents equipped with the information acquired by previous generations. Each GPUgrid job is one of these agents doing some training independently. In that sense, the first 4 letters of the job name identify the generation and the number of the agent (e.g. e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 refers to the epoch or generation number 1 and the agent number 2 within that generation). The debugging done recently has allowed more and more of these jobs to finish. An experiment currently running has already reached a 3rd generation of agents. As mentioned in an earlier post, we are now working with OpenAI gym environments (https://gym.openai.com/) ____________
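The naming scheme is easy to decode mechanically. A small sketch, assuming the prefix always has the form `e<generation>a<agent>-` as in the names quoted in this thread; the function name is made up.

```python
import re

# Prefix of the form e<generation>a<agent>-, e.g. "e1a2-..."
JOB_NAME = re.compile(r"^e(?P<generation>\d+)a(?P<agent>\d+)-")


def parse_job_name(name):
    """Return (generation, agent) for a Python GPU job name, or None if it does not match."""
    match = JOB_NAME.match(name)
    if match is None:
        return None
    return int(match["generation"]), int(match["agent"])


for name in ("e1a2-ABOU_ppod_gym_test-0-1-RND2391_3",
             "e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0",
             "e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5"):
    print(name, "->", parse_job_name(name))
```

The last name is an ACEMD3 task, which does not follow the scheme and therefore yields `None`. Note that the prefix can be longer than four characters once the generation or agent number reaches two digits.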
	ID: 57800 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57801 - Posted: 11 Nov 2021 \| 15:48:48 UTC - in response to Message 57800. Last modified: 11 Nov 2021 \| 15:54:02 UTC
	Are you working on fixing the issue that the tasks only run on Device#0 in BOINC? Even when Device#0 is already occupied by another task from another project? That leaves at least one device doing nothing because BOINC thinks it is occupied.
	ID: 57801 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57802 - Posted: 11 Nov 2021 \| 17:15:48 UTC - in response to Message 57801. Last modified: 11 Nov 2021 \| 17:16:26 UTC
	Are you working on fixing the issue that the tasks only run on Device#0 in BOINC? +1 At this other example, Device 0 is running 1 Gpugrid ACEMD3 task and 2 Python GPU tasks. Meanwhile, Device 1 and Device 2 remain idle.
	ID: 57802 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57803 - Posted: 11 Nov 2021 \| 17:25:26 UTC - in response to Message 57802.
	weird, I thought this problem had been fixed already. I guess I never realized since I've only been running the beta tasks on my single GPU system. ____________
	ID: 57803 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57804 - Posted: 11 Nov 2021 \| 17:31:21 UTC Last modified: 11 Nov 2021 \| 17:57:23 UTC
	Count me in on this, too. My client is running e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0 on device 1. I have GPUGrid excluded from device 0, so I can run tasks from other projects in the faster PCIe slot while testing. But ... Well, despite running on the wrong card, it finished and passed the GPUGrid validation test. I've swapped over the exclusion, and BOINC and GPUGrid are now in agreement that card 0 is the card to use.
	ID: 57804 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57805 - Posted: 11 Nov 2021 \| 18:32:50 UTC
	Hard to tell from the error code snippet whether the tasks are hardwired to run on Device#0 or whether the error snippet is just the result of where the task actually has run. [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
	ID: 57805 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57817 - Posted: 12 Nov 2021 \| 19:33:08 UTC Last modified: 12 Nov 2021 \| 19:39:58 UTC
	Well, I have a new python task running by itself now on Device#2. So it may mean they have fixed the issue where the tasks always ran on Device#0. See this new output in the stderr.txt that looks like it is allocating to Device#2. It hasn't been there in any other of my tasks till just now for this new task. Found GPU: True, Number 2 - 2
	ID: 57817 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 57818 - Posted: 12 Nov 2021 \| 19:51:54 UTC - in response to Message 57817.
	Yes, we have fixed the issue. It should be fine now. Please, let us know if you encounter any new device placement error. We just ran the tests and, as you mention, we print the device number in the stderr file. ____________
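How the fix was implemented is not stated here. One common pattern, shown purely as an assumption, is for the script to read the device number it was assigned and restrict CUDA to that device before any GPU library is initialised. The sketch assumes the number arrives as a `--device N` command-line argument; the function name is hypothetical.

```python
import argparse
import os


def pick_device(argv, environ):
    """Read --device N (default 0) and expose only that GPU to CUDA."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--device", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)   # ignore unrelated arguments
    # Must be set before CUDA is initialised; the chosen GPU then appears as cuda:0.
    environ["CUDA_VISIBLE_DEVICES"] = str(args.device)
    return args.device


# Demonstration on a copy of the environment, so nothing global is changed.
env = dict(os.environ)
device = pick_device(["--device", "2", "--other-flag"], env)
print("Found GPU number", device, "- CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
```

With this approach a tensor reported on `cuda:0` inside the task does not necessarily mean physical Device#0, which is worth keeping in mind when reading error snippets.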
	ID: 57818 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57819 - Posted: 12 Nov 2021 \| 20:08:22 UTC - in response to Message 57818.
	Thank you for fixing this issue. I don't know whether you test in a multi-gpu environment or not. I suspect a lot of projects don't. But there are lots of us that run many multi-gpu hosts that have been bit by this bug often.
	ID: 57819 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57821 - Posted: 12 Nov 2021 \| 20:15:16 UTC - in response to Message 57818.
	Thank you very much for your continuous support.
	ID: 57821 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57824 - Posted: 13 Nov 2021 \| 10:44:54 UTC Last modified: 13 Nov 2021 \| 10:46:38 UTC
	Overnight, every one of my 6 currently active Linux hosts received at least one task of the kind ...-ABOU_ppod_gym_test9-0-1-... All the tasks gave a valid result, none of them errored. This is promising! My triple-GPU host happened to receive several tasks in a short time, and three of them were executed concurrently. It caught my attention that there was a drastic change in overall system temperatures when transitioning from executing highly GPU/CPU intensive PrimeGrid tasks to the Gpugrid tasks. On the other hand, every GPU was effectively executing its own task, as nvidia-smi showed. This confirms Keith Myers' observation that the previous task-to-GPU assignment problem in multi-GPU systems is solved. Well done!
	ID: 57824 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 57827 - Posted: 13 Nov 2021 \| 13:27:09 UTC Last modified: 13 Nov 2021 \| 13:27:24 UTC
	I enabled Python on a 2nd PC with a 1070 and 1080 and they all error out https://www.gpugrid.net/result.php?resultid=32662330 Output in format: Requested package -> Available versions. Then it lists tons of packages and versions. When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all. I'm guessing there is some incompatibility between packages I have installed?
	ID: 57827 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57828 - Posted: 13 Nov 2021 \| 17:52:56 UTC
	You needn't install any packages. The tasks are entirely packaged with everything they need in the work unit bundle.
	ID: 57828 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 57830 - Posted: 13 Nov 2021 \| 23:15:03 UTC
	Supposedly, but then they should work. Another PC of mine also with Ubuntu 18.04, driver 470 and Pascal arch works OK. These tasks were all completed by others.
	ID: 57830 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 57831 - Posted: 14 Nov 2021 \| 1:34:01 UTC
	I can only guess the tasks are confusing the locally installed old Python 2.7 library with the Python 3.8 contained in the bundle. Python 2.7 is deprecated in current Linux distributions, which now ship Python 3.6 at minimum. You might want to either uninstall Python or upgrade it to the 3 series. I don't think uninstalling is desirable, though, as I believe a lot of stock applications are Python-based and you would lose those.
	ID: 57831 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 57832 - Posted: 14 Nov 2021 \| 2:27:31 UTC - in response to Message 57831.
	I think you can uninstall python2 without damage. At least I could on Ubuntu 20.04.3, though I had only BOINC and Folding installed on it. But I then made the mistake of trying to purge all python versions. It made the system unbootable, and I had to re-install it.
	ID: 57832 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 57833 - Posted: 14 Nov 2021 \| 9:12:32 UTC - in response to Message 57827.
	When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.
	To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:
	sudo apt install python-is-python3
	sudo apt install python3-pip
	And after that, you can uninstall unnecessary old packages with the command:
	sudo apt autoremove
	ID: 57833 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57834 - Posted: 14 Nov 2021 \| 10:24:54 UTC
	I also have two closely similar Linux machines: 132158 and 508381. Don't be fooled by the host IDs: 132158 is an inherited ID from an earlier generation of hardware, and is actually slightly younger than 508381. Both run the same version of Linux Mint 20.2, installed from the same ISO download, and the same basic software environment - but I do make tweaks to the installed packages separately, as I encounter different testing needs. Yesterday, I was away from home, but both machines downloaded tasks from the ppod_gym_test9 batch. 132158 failed to run them, 508381 succeeded. The problem occurs during the learner.step in Python, with a ValueError raised at line 55 during initialisation:
	File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
	raise ValueError(
	ValueError: Expected parameter loc (Tensor of shape (146, 8)) of distribution Normal(loc: torch.Size([146, 8]), scale: torch.Size([146, 8])) to satisfy the constraint Real(), but found invalid values:
	tensor([[nan, nan, nan, ..., nan, nan, nan],
	[nan, nan, nan, ..., nan, nan, nan],
	[nan, nan, nan, ..., nan, nan, nan],
	...,
	[nan, nan, nan, ..., nan, nan, nan],
	[nan, nan, nan, ..., nan, nan, nan],
	[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)
	The two 'file extraction' logs for the GPUGrid Python download seem to be different. I'll try to compare the software environment of the two machines and work out where the difference is coming from.
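	As a rough illustration of the failure mode in the traceback: the error is raised because the loc values handed to the Normal distribution are NaN, so they fail the Real() constraint. The plain-Python sketch below imitates that kind of validation without PyTorch. check_real is a hypothetical helper written for this illustration, not the torch implementation.

```python
import math


def check_real(name, rows):
    """Raise ValueError if any value in a 2-D list is NaN or infinite.

    Hypothetical stand-in for a 'must be real-valued' parameter check.
    """
    bad = [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]
    if bad:
        raise ValueError(
            f"Expected parameter {name} to be real-valued, "
            f"but found {len(bad)} invalid value(s), first at index {bad[0]}"
        )
    return rows


# Finite values pass through unchanged.
good = check_real("loc", [[0.0, 1.5], [-2.0, 3.0]])

# NaN values are rejected, much like the task's loc tensor was.
try:
    check_real("loc", [[float("nan"), 1.0], [2.0, float("nan")]])
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)
```

	A check like this only reports where the NaNs were noticed; the values themselves were produced earlier in the computation.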
	ID: 57834 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 57838 - Posted: 14 Nov 2021 \| 17:49:02 UTC - in response to Message 57834.
	Well, I've looked through the software installations for both machines, but I can't see any significant differences. Both have Python 3.8 installed (probably with the operating system), and no sign of any Python 2.x; I've installed a few sundries from terminal (libboost, git, some 32-bit libs for CPDN), but the same list on both machines. The 'file extraction' logs are different for every task, and sometimes the same filename appears more than once (is duplicated) in the list for a single task. For the tasks I ran successfully on host 508381, that was the only host that attempted them. The tasks that failed on host 132158 were issued to the full limit of 8 hosts, and failed on all of them. I can only assume that the difference between success and failure resulted from differences in the task data make-up, and not from differences in the installed software on my hosts.
	ID: 57838 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 57840 - Posted: 15 Nov 2021 \| 2:59:24 UTC - in response to Message 57833. Last modified: 15 Nov 2021 \| 3:00:56 UTC
	When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all. To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands: sudo apt install python-is-python3 sudo apt install python3-pip And after that, you can uninstall unnecessary old packages with the command: sudo apt autoremove
	The python-is command didn't work. So I followed the instructions here, starting with Option 1: https://phoenixnap.com/kb/how-to-install-python-3-ubuntu At the end I ran python --version to check. Same 2.7.17, even though the install seemed to complete. So I tried Option 2, from source. That worked OK too, with 3.7.5. I got to the end and saw the note about checking for specific versions. Uh, oh.
	python --version = 2.7.17
	python3 --version = 3.6.9
	python3.7 --version = 3.7.5
	So now I have 3 versions installed haha. Maybe one will work, dunno. But we'll need some more tasks to find out.
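	When several interpreters coexist like this, it can help to see which executable each command name actually resolves to. The following is a minimal sketch using only the standard library; the function name interpreter_versions and the default list of command names are assumptions chosen to match the three commands quoted above.

```python
import shutil
import subprocess


def interpreter_versions(names=("python", "python3", "python3.7")):
    """Map each command name to (path, version string), or None if not on PATH."""
    found = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            found[name] = None
            continue
        # Python 2 prints its version to stderr, Python 3 prints it to stdout.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        version = (result.stdout or result.stderr).strip()
        found[name] = (path, version)
    return found


if __name__ == "__main__":
    for name, info in interpreter_versions().items():
        print(name, "->", info if info else "not found")
```

	Run it with whichever Python 3 is available (3.7 or later, because of capture_output). Note that this only describes the system interpreters; per the earlier posts, the tasks ship their own Python in the work unit bundle.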
	ID: 57840 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level Scientific publications	Message 58077 - Posted: 12 Dec 2021 \| 14:10:45 UTC - in response to Message 57833.
	sudo apt install python-is-python3
	sudo apt install python3-pip
	Thanks, that worked and I now have python 3.8.10 installed on my two GG computers with cuda 11.4. I just noticed that one computer had previously attempted to run a python WU but it failed. https://www.gpugrid.net/result.php?resultid=32727968 The stderr said this among many other things:
	==> WARNING: A newer version of conda exists. <==
	current version: 4.8.3
	latest version: 4.11.0
	Please update conda by running
	$ conda update -n base -c defaults conda
	I tried running that command but it said "conda: command not found." The rig that didn't run a python WU installed many more lines of files. The rig that did run the failed python WU installed less than half of the files. What are all of the prerequisites I need to run these python WUs?
	ID: 58077 \| Rating: 0 \| rate: / Reply Quote

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,768,112,024 RAC: 21,327,418 Level Scientific publications	Message 58079 - Posted: 12 Dec 2021 \| 14:50:51 UTC - in response to Message 58077.
	What are all of the prerequisites I need to run these python WUs? I read Keith Myers' Message #58061. Then I executed: sudo apt install cmake. Chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5 The same WU had previously failed on five other hosts.
	ID: 58079 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level Scientific publications	Message 58081 - Posted: 12 Dec 2021 \| 15:28:52 UTC - in response to Message 58079.
	sudo apt install cmake Done. Fingers crossed. Thx
	ID: 58081 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 58083 - Posted: 12 Dec 2021 \| 16:47:26 UTC - in response to Message 58079.
	What are all of the prerequisites I need to run these python WUs? I read Keith Myers' Message #58061. Then I executed: sudo apt install cmake. Chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5 The same WU had previously failed on five other hosts. I was hoping to get a response from the researcher before interfering with the process. Happy someone beat me to it. So once again we crunchers need to help the process along by installing missing software on our hosts to properly crunch the work the researchers are sending out. It would be nice if the researchers ran some of their work on test systems of their own before releasing it to the public, or, as we are also known . . . "beta-testers"
	ID: 58083 \| Rating: 0 \| rate: / Reply Quote

abouh Project administrator Project developer Project tester Project scientist Send message Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level Scientific publications	Message 58105 - Posted: 14 Dec 2021 \| 16:59:44 UTC - in response to Message 58083.
	Hello everyone, sorry for the late reply. We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error. The error was related to atari_py, as some users reported; more specifically, to installing this Python package from GitHub (https://github.com/openai/atari-py), which allows using some Atari 2600 games as a test bench for reinforcement learning (RL) agents. Sorry for the inconvenience. Although the AI-agent part of the code has been tested and works, every time we test our agents in a new environment we need to replace the environment initialisation part of the code with one containing the new environment, in this case atari_py. I just sent another batch of 5 test jobs; 3 have already finished, and the others seem to be working without problems but have not yet finished. http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762 ____________
	ID: 58105 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level Scientific publications	Message 58106 - Posted: 14 Dec 2021 \| 19:08:48 UTC - in response to Message 58105. Last modified: 14 Dec 2021 \| 19:12:24 UTC
	http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762 I cannot open these links. Please use the [url][/url] tags to make them linkable. I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this, so I can't say exactly how much memory, but I think it said 22 GB each, while my System Monitor reported much less, on the order of 17 GB, which has since been relinquished. How much RAM should we have to run pythonGPU? https://www.gpugrid.net/result.php?resultid=32730780 https://www.gpugrid.net/result.php?resultid=32730783 BTW, I installed cmake and the latest python 3.8. Should I uninstall cmake as a better test? I recommend making its CPU use require 1 and not 0.963.
	ID: 58106 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 58107 - Posted: 14 Dec 2021 \| 19:24:18 UTC - in response to Message 58106.
	Those are private links, but you can see the result ID.
	ID: 58107 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 58108 - Posted: 14 Dec 2021 \| 20:21:24 UTC - in response to Message 58106. Last modified: 14 Dec 2021 \| 20:21:53 UTC
	http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760 http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762 I cannot open these links. Please use the [url][/url] tags to make them linkable. I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this so I can't say how much memory but I think it said 22 GB each but my System Monitor reported much less on the order of 17 GB which has been relinquished. How much RAM should we have to run pythonGPU? https://www.gpugrid.net/result.php?resultid=32730780 https://www.gpugrid.net/result.php?resultid=32730783 BTW, I installed cmake and latest python 3.8. Should I uninstall cmake as a better test? I recommend making its CPU use require 1 and not 0.963. real memory? or virtual memory allocation? high virt is normal, and on the order of tens of GB, even for acemd3 tasks. re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that. ____________
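	For readers who have not used an app config file before, the sketch below generates one that asks the BOINC client to budget one full CPU per GPU task. It is a sketch under assumptions, not a project-endorsed configuration: build_app_config is a hypothetical helper, and "PythonGPU" is an assumed short application name that should be checked against the name your own client reports.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, cpu_usage=1.0, gpu_usage=1.0):
    """Return app_config.xml text that budgets cpu_usage CPUs per GPU task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "PythonGPU" is an assumed short app name; verify it on your own host.
config_text = build_app_config("PythonGPU")
print(config_text)
```

	The resulting text would be saved as app_config.xml in the project's folder inside the BOINC data directory and picked up after asking the client to re-read its config files. These values only affect how the client schedules work; they do not limit how many cores the task itself uses.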
	ID: 58108 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level Scientific publications	Message 58109 - Posted: 15 Dec 2021 \| 12:25:37 UTC - in response to Message 58108. Last modified: 15 Dec 2021 \| 12:30:57 UTC
	re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that. I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963.
	ID: 58109 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 58110 - Posted: 15 Dec 2021 \| 15:20:14 UTC - in response to Message 58109.
	a trivial question garners a trivial response :) does it solve your problem? ____________
	ID: 58110 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,822,866,430 RAC: 19,442,844 Level Scientific publications	Message 58111 - Posted: 15 Dec 2021 \| 15:24:48 UTC - in response to Message 58109.
	re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that. I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963. Because the GPUGrid staff don't set that figure. There's an algorithm in the (Berkeley written) BOINC server code which generates the figure to use from a range of outdated, stupid, data. I discussed this at some length almost three years ago, in https://github.com/BOINC/boinc/issues/2949 - with examples drawn from GPUGrid, among other projects. I think that this was about the point that Berkeley stopped reading a single word of what I write. Someone else can get to grips with it this time.
	ID: 58111 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 58113 - Posted: 15 Dec 2021 \| 15:38:56 UTC - in response to Message 58111.
	thanks Richard, I had a thought that it was likely the output of some automated function within BOINC since nearly all projects end up with something like this by default if they don't manually set the figures. ____________
	ID: 58113 \| Rating: 0 \| rate: / Reply Quote

Bill F Send message Joined: 21 Nov 16 Posts: 32 Credit: 140,098,150 RAC: 388,093 Level Scientific publications	Message 58121 - Posted: 16 Dec 2021 \| 2:56:58 UTC - in response to Message 58113.
	Too bad they never worked to implement all or part of Richard's suggestion / GitHub issue. While I don't claim to be able to see the bigger picture in BOINC it sounds like a good path for automated adjustments when new GPU hardware is released. Bill F ____________ In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic; There was no expiration date.
	ID: 58121 \| Rating: 0 \| rate: / Reply Quote

bormolino Send message Joined: 16 May 13 Posts: 41 Credit: 88,126,864 RAC: 833 Level Scientific publications	Message 58205 - Posted: 24 Dec 2021 \| 15:20:12 UTC
	I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2). Is it normal behaviour that the WU uses more than 7GB of RAM?
	ID: 58205 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 58206 - Posted: 24 Dec 2021 \| 15:38:40 UTC - in response to Message 58205.
	I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2). Is it normal behaviour that the WU uses more than 7GB of RAM? Yes. ____________
	ID: 58206 \| Rating: 0 \| rate: / Reply Quote

bormolino Send message Joined: 16 May 13 Posts: 41 Credit: 88,126,864 RAC: 833 Level Scientific publications	Message 58207 - Posted: 24 Dec 2021 \| 15:40:09 UTC - in response to Message 58206.
	I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2). Is it normal behaviour that the WU uses more than 7GB of RAM? Yes. Thanks for answering.
	ID: 58207 \| Rating: 0 \| rate: / Reply Quote

Greg _BE Send message Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level Scientific publications	Message 58699 - Posted: 22 Apr 2022 \| 15:39:17 UTC Last modified: 22 Apr 2022 \| 15:40:29 UTC
	What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit.
	ID: 58699 \| Rating: 0 \| rate: / Reply Quote

Bill F Send message Joined: 21 Nov 16 Posts: 32 Credit: 140,098,150 RAC: 388,093 Level Scientific publications	Message 58712 - Posted: 25 Apr 2022 \| 14:05:30 UTC - in response to Message 58699.
	What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit. Your times may be about right. I have a GTX 1060 with 6GB and my times were similar.
	ID: 58712 \| Rating: 0 \| rate: / Reply Quote

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,070,933 RAC: 197,388 Level Scientific publications	Message 58948 - Posted: 19 Jun 2022 \| 18:44:06 UTC
	This task is giving wild results for its estimated time remaining. This morning, it was saying over 400 days remaining.
	Application: Python apps for GPU hosts 4.03 (cuda1131)
	Name: e23a16-ABOU_rnd_ppod_demo_sharing_large-0-1-RND7660
	State: Running
	Received: 6/19/2022 6:33:56 AM
	Report deadline: 6/24/2022 6:34:02 AM
	Resources: 0.949 CPUs + 1 NVIDIA GPU
	Estimated computation size: 1,000,000,000 GFLOPs
	CPU time: 23:13:06
	CPU time since checkpoint: 00:02:47
	Elapsed time: 06:42:45
	Estimated time remaining: 357d 12:44:26
	Fraction done: 22.580%
	Virtual memory size: 5.81 GB
	Working set size: 1.05 GB
	Directory: slots/10
	Process ID: 6376
	Progress rate: 3.240% per hour
	Executable: wrapper_6.1_windows_x86_64.exe
	I've seen other tasks start out claiming over 300 days remaining, and then finish in between 5 and 6 days. Is there something wrong in the data sent as task input, or is it the wild first ten tasks for a new application version?
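	For comparison with the client's figure, a remaining-time estimate can be extrapolated by hand from the elapsed time and fraction done listed above. The sketch below assumes progress is linear in time, which these tasks may not satisfy, and it is not how the BOINC client computes its own estimate; both function names are made up for this illustration.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def remaining_seconds(elapsed_s, fraction_done):
    """Linear extrapolation: time left if progress continues at the same rate."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1.0 - fraction_done) / fraction_done


# Figures taken from the task properties quoted above.
elapsed = hms_to_seconds("06:42:45")
remaining = remaining_seconds(elapsed, 0.22580)
print(f"{remaining / 3600:.1f} hours remaining")
```

	With these inputs the linear extrapolation comes out at roughly a day, far from the hundreds of days displayed.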
	ID: 58948 \| Rating: 0 \| rate: / Reply Quote

mmonnin Send message Joined: 2 Jul 16 Posts: 337 Credit: 7,617,738,971 RAC: 10,992,172 Level Scientific publications	Message 58950 - Posted: 20 Jun 2022 \| 18:38:06 UTC Last modified: 20 Jun 2022 \| 18:38:52 UTC
	Yup, non beta task but I've seen over 3k day ETAs recently.
	Name: e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212_1
	Application: Python apps for GPU hosts 4.03 (cuda1131)
	Workunit name: e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212
	State: Running High P.
	Received: 6/20/2022 12:17:53 PM
	Report deadline: 6/25/2022 12:17:53 PM
	Estimated app speed: 311.15 GFLOPs/sec
	Estimated task size: 1,000,000,000 GFLOPs
	Resources: 0.99 CPUs + 1 NVIDIA GPU
	CPU time at last checkpoint: 05:44:17
	CPU time: 05:47:46
	Elapsed time: 02:17:30
	Estimated time remaining: 2764d,01:55:33
	Fraction done: 11.890%
	Virtual memory size: 18,693.93 MB
	Working set size: 3,824.01 MB
	ID: 58950 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 58951 - Posted: 20 Jun 2022 \| 20:50:03 UTC
	Just ignore the ETA estimates. Garbage data. The tasks finish fine and well within their deadlines.
	ID: 58951 \| Rating: 0 \| rate: / Reply Quote

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,070,933 RAC: 197,388 Level Scientific publications	Message 58952 - Posted: 21 Jun 2022 \| 0:45:31 UTC Last modified: 21 Jun 2022 \| 0:47:44 UTC
	Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that. Note there seems to be no thread for discussing non-beta Python tasks.
	ID: 58952 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 58953 - Posted: 21 Jun 2022 \| 2:53:45 UTC - in response to Message 58952.
	Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that. Note there seems to be no thread for discussing non-beta Python tasks. No not at all. BOINC just has no mechanism for dealing with hybrid cpu-gpu tasks. The Python on GPU tasks are the first of their kind. It will take the BOINC devs a lot of time to accommodate them correctly. If they are getting in the way of your other work, I suggest stopping them or limiting them to only a single task at any time by changing your cache size to absolute minimal values.
	ID: 58953 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 58956 - Posted: 21 Jun 2022 \| 17:31:43 UTC - in response to Message 58952.
	Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that. Note there seems to be no thread for discussing non-beta Python tasks. The other threads are here: https://www.gpugrid.net/forum_thread.php?id=5323 https://www.gpugrid.net/forum_thread.php?id=5319
	ID: 58956 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,653,116,070 RAC: 13,421,873 Level Scientific publications	Message 58957 - Posted: 21 Jun 2022 \| 18:29:22 UTC Last modified: 21 Jun 2022 \| 18:29:32 UTC
	FYI, the Python on GPU tasks are the same as the beta Python tasks currently. Both tasks are using the latest application code. The devs said they would still keep the beta plan class available, just not in use, for whenever a new application might be developed. So everyone is getting the standard Python work even if they have beta selected.
	ID: 58957 \| Rating: 0 \| rate: / Reply Quote

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 834,294,060 RAC: 3,900,942 Level Scientific publications	Message 59715 - Posted: 12 Jan 2023 \| 15:51:14 UTC
	Does anybody know how many cpu threads would be ideal to run them efficiently? I gave them 12 threads exclusively but still the task is run primarily on the cpu with my 3070ti kicking in only sporadically for a second or two.
	ID: 59715 \| Rating: 0 \| rate: / Reply Quote

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 59716 - Posted: 12 Jan 2023 \| 16:09:32 UTC - in response to Message 59715. Last modified: 12 Jan 2023 \| 16:10:32 UTC
	giving it more cores won't necessarily make it run faster. as few as 4 cores per task works fine on my EPYC system. but if you are running other projects on the CPU it will slow them down as the processes compete with each other for CPU time. by default the program will use however many cores you have and you really can't change this with any BOINC settings. also I would recommend putting Linux on that system instead of Windows. Linux runs much faster. ____________
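	How the GPUGRID application picks its worker count is not stated in this thread. The sketch below only shows how a Python process can find out how many cores are available to it, which is the number a program that "uses however many cores you have" would typically start from. usable_cores is a hypothetical helper.

```python
import os


def usable_cores():
    """Number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours affinity masks such as those set with taskset.
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total logical core count.
    return os.cpu_count() or 1


print("cores visible to this process:", usable_cores())
```

	On Linux this count shrinks if the process is started under an affinity mask, which is independent of anything configured in BOINC.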
	ID: 59716 \| Rating: 0 \| rate: / Reply Quote

Drago Send message Joined: 3 May 20 Posts: 18 Credit: 834,294,060 RAC: 3,900,942 Level Scientific publications	Message 59719 - Posted: 13 Jan 2023 \| 10:59:48 UTC - in response to Message 59716.
	Gotcha! My Linux Laptop finishes them in 10 hours, my much faster Windows PC needs 18! Thanks again Ian & Steve
	ID: 59719 \| Rating: 0 \| rate: / Reply Quote
