Author |
Message |
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
As I said, we are currently compiling the Windows version.
GDF |
|
|
|
might as well compile it for CUDA 11.8 to bring Ada (40-series) support.
____________
|
|
|
HZL
Joined: 23 Nov 08 Posts: 1 Credit: 612,500 RAC: 0
|
Hello everyone! I'm in Shanghai, China. How can I get the GPU to run at 100% utilization? I've noticed that while tasks are running, the GPU stays at around 30%. |
|
|
|
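Hello everyone! I'm in Shanghai, China. How can I get the GPU to run at 100% utilization? I've noticed that while tasks are running, the GPU stays at around 30%.
This is normal for this Python application. It uses the CPU more than the GPU, so GPU utilization is limited by the CPU. If you run two tasks at the same time, you can raise GPU utilization. But with this Python application you cannot get the GPU to 100%.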
____________
|
|
|
guoyeah
Joined: 17 Mar 10 Posts: 1 Credit: 5,362,500 RAC: 0
|
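My Nvidia GPU reaches 80%. I am also running other CPU (20%) and Intel GPU (97%) projects at the same time. After setting the power plan to best performance, the CPU reaches 50%. Intel i7, 12th generation. |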
|
|
|
Looking around I see the present batch of protein ligand sims are crashing... DARNIT!
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)
anything else found?
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
|
Looking around I see the present batch of protein ligand sims are crashing... DARNIT!
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)
anything else found?
if someone can preserve the data files and slot directory before it gets uploaded and subsequently wiped from your system, should be easy to figure out what's wrong.
my guess is they didn't name that run.sh file properly (via open_name probably), or didn't add a task to extract the file in the wrapper config file (job.xml), or something along those lines.
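For anyone who wants to preserve a slot directory before the client cleans it up, here is a minimal sketch of one way to do it. The function name and any paths you pass in are my own assumptions, not anything the project ships; find your own BOINC data directory and slot number first.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir: str, backup_root: str) -> Path:
    """Copy a BOINC slot directory to a timestamped folder and return the new path."""
    src = Path(slot_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = Path(backup_root) / f"{src.name}-{stamp}"
    # symlinks=True copies links as links instead of following them
    shutil.copytree(src, dst, symlinks=True)
    return dst
```

Called as, say, `snapshot_slot("<boinc data dir>/slots/3", "<some backup folder>")` while the task is still in the slot, it leaves a copy you can inspect after the task has errored out.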
____________
|
|
|
|
actually I have some on my system so i took a look.
there appear to be many things wrong.
the job.xml file is calling just tar, with no reference to what tar is. this should probably be /bin/tar to use the system tar.
the extracted run.sh script looks woefully lacking in detail. i can see it trying to call python and conda from 'bin/' but that is not included in the input package and will fail. the input tarball only includes some text/config files and not the whole python package.
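A rough sketch of how one could check this kind of thing automatically: it reads the wrapper's job file and lists every `<application>` that is neither present in the slot directory nor findable on PATH. The function name is hypothetical, and this does not reproduce how the BOINC wrapper itself resolves program names; it only flags candidates worth a closer look.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def unresolved_applications(job_xml_text: str, slot_dir: str = ".") -> list[str]:
    """Return <application> entries that are neither in slot_dir nor on PATH."""
    root = ET.fromstring(job_xml_text)
    missing = []
    for task in root.iter("task"):
        app = (task.findtext("application") or "").strip()
        if not app:
            continue
        # An absolute path such as /bin/tar replaces slot_dir in the join below
        in_slot = (Path(slot_dir) / app).exists()
        on_path = shutil.which(app) is not None
        if not (in_slot or on_path):
            missing.append(app)
    return missing
```

Run against the job.xml from a failing task, an entry like `bin/python` would show up in the result if the input package never delivered a `bin/` directory.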
____________
|
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
What app exactly? |
|
|
|
What app exactly?
the new free energy one ('ATM' moniker). using the wrapper to call the run.sh script.
also it would be a good idea to add a checkbox for this app in project preferences. this app showed up with no warning and no announcement from the project, and with no way to prevent it, it seems. I'm not sure if it's marked as beta or not.
____________
|
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Yes, we should have made a beta, but this app is not related to this thread. |
|
|
|
Yes, we should have made a beta, but this app is not related to this thread.
you're right, but there is no announcement thread for this app, so nowhere else appropriate in the News section to get your attention about it.
____________
|
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Soon we will announce it. This is just testing to see if it works which should have been done on a beta app.
I expect tons of workunits using this app. Soon I will introduce a new postdoc running the simulations.
g |
|
|
|
interesting to see that Ada "should" run on the Ampere cubins. I know the app has an architecture compatibility check, and it may fail there even if it could otherwise work.
you could also consider compiling your apps with the PTX version for forward compatibility
like this:
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_86,code=compute_86
and the user can set the environment variable as needed. or you could set it in the wrapper config file
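To make the pattern in those two flags explicit, here is a small sketch that generates them for any list of compute capabilities: native code (`code=sm_XX`) for each listed architecture, plus embedded PTX (`code=compute_XX`) for the newest one so that later GPUs can JIT-compile it. The helper name is my own invention, not part of any build system.

```python
def gencode_flags(compute_capabilities: list[str]) -> list[str]:
    """Build nvcc -gencode flags: native code per arch, plus PTX for the newest arch."""
    if not compute_capabilities:
        raise ValueError("need at least one compute capability, e.g. '8.6'")
    # "8.6" -> "86"; sort numerically so the newest architecture comes last
    archs = sorted({cc.replace(".", "") for cc in compute_capabilities}, key=int)
    flags = [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]
    newest = archs[-1]
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags
```

`gencode_flags(["8.6"])` gives back exactly the two lines quoted above.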
____________
|
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
I am successfully running the current ACEMD_3 tasks on a GTX980ti, on a Quadro P5000, and on two RTX3070.
However, they fail on a GTX1650 after a few seconds:
https://www.gpugrid.net/result.php?resultid=33263379
https://www.gpugrid.net/result.php?resultid=33263343
can anyone tell me what might be the reason? |
|
|
|
As a first step, you can try resetting the GPUGRID project on the failing host.
But probably the reason is 4GB RAM being too short for executing these tasks. |
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
...
But probably the reason is 4GB RAM being too short for executing these tasks.
that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well). |
|
|
|
...
But probably the reason is 4GB RAM being too short for executing these tasks.
that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).
it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.
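On a Linux host, a quick way to check for AVX is to look at the feature flags in /proc/cpuinfo. The sketch below does just that; it is Linux-only (so it would not help directly on the Windows host in question), and the function names are my own.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text: str) -> bool:
    """True if the plain 'avx' flag is present (avx2 alone does not count)."""
    return "avx" in cpu_flags(cpuinfo_text)
```

Usage on Linux would be `has_avx(open("/proc/cpuinfo").read())`.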
____________
|
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.
perhaps one of the GPUGRID people could tell me if this is the case?
|
|
|
Ryan Munro
Joined: 6 Mar 18 Posts: 33 Credit: 1,067,170,900 RAC: 8,208,883
|
Just had one and it failed after 26 seconds on my 4090 |
|
|
|
Just had one and it failed after 26 seconds on my 4090
are the Python tasks working on your 4090? or were those run on a different GPU?
____________
|
|
|
Ryan Munro
Joined: 6 Mar 18 Posts: 33 Credit: 1,067,170,900 RAC: 8,208,883
|
Python run fine on my 4090, though they don't do much at all, all the work seems to be on the CPU. |
|
|
|
Python run fine on my 4090, though they don't do much at all, all the work seems to be on the CPU.
Thanks.
could you please report your failed task? click update on BOINC for GPUGRID to send back the result. I'd like to see the nature of the failure, to see if the architecture check is the reason for failure.
____________
|
|
|
Ryan Munro
Joined: 6 Mar 18 Posts: 33 Credit: 1,067,170,900 RAC: 8,208,883
|
Done :) |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) |
|
|
|
it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.
perhaps one of the GPUGRID people could tell me if this is the case?
Maybe you can tell, if you can run an ACEMD3 app on another host that is AVX-enabled: set the AVX offset in the BIOS of a capable host, then check whether the processor speed drops accordingly while the wrapper is running (with no other WU). |
|
|
|
...
But probably the reason is 4GB RAM being too short for executing these tasks.
that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).
interesting, larrywhitehead's 1060 3GB also does not seem to want to do these tasks
https://www.gpugrid.net/results.php?hostid=493191
only a vague stderr message; only this in the stderr:
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:38:59 (9616): wrapper (7.9.26016): starting
23:38:59 (9616): wrapper: running bin/acemd3.exe (--boinc --device 0)
23:39:01 (9616): bin/acemd3.exe exited; CPU time 0.000000
23:39:01 (9616): app exit status: 0xc0000135
23:39:01 (9616): called boinc_finish(195)
Yet I only observe a little over 2GB graphics memory being utilized max so far on my hosts. |
|
|
|
Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
That’s exactly what I thought would happen. I had the same experience with some other people trying to run the Einstein CUDA BRP7 app. Didn’t work on 11.7 but did work once I compiled it for 11.8 with gencode defined for CC 8.9
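A minimal sketch of why this happens: each GPU generation first became a valid compile target in a particular CUDA toolkit release, so an app bundled with an older toolkit cannot name the newer architecture at all. The table below lists only a few generations, and the helper name is hypothetical.

```python
# Minimum CUDA toolkit release that can target each compute capability.
MIN_TOOLKIT = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30-series)
    (8, 9): (11, 8),  # Ada (RTX 40-series)
}


def toolkit_can_target(cc: tuple[int, int], toolkit: tuple[int, int]) -> bool:
    """True if the given toolkit version knows the given compute capability."""
    required = MIN_TOOLKIT.get(cc)
    if required is None:
        raise KeyError(f"compute capability {cc} is not in this small table")
    # Tuples compare element by element, so (11, 8) >= (11, 2) is True
    return toolkit >= required
```

With the cuda1121 app, `toolkit_can_target((8, 9), (11, 2))` is False, which lines up with nvrtc rejecting the `-arch` value on a 4090, while `toolkit_can_target((8, 9), (11, 8))` is True.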
____________
|
|
|
|
Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) |
|
|
|
Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
That’s correct. The current CUDA 11.21 app does not support Ada 4000 series.
____________
|
|
|
|
Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
That’s correct. The current CUDA 11.21 app does not support Ada 4000 series.
Thanks for confirming. |
|
|
oemuser
Joined: 18 Sep 16 Posts: 10 Credit: 1,291,979 RAC: 0
|
I got ACEMD 3 task for my gtx 1080ti on Windows (2oiq-ADRIA_KDMD_1k_test_3809-0-1-RND9959).
The GPU stays at a very low clock speed of 750 MHz, with VRAM at 800 MHz. I expected 2x the GPU clock and 7x the VRAM clock. Or would that not bring any advantage?
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
I see that a new acemd3 app was published yesterday for the Linux hosts in an attempt to fix the expired Acellera licensing issue.
Unfortunately, the app is still not working and any new work is still failing, this time with more information: a problem with the Python packaging of the job files.
https://www.gpugrid.net/result.php?resultid=33722983
Looks like they've moved away from a standalone acemd3 binary which is what was used in the past work.
Looks like they tried to just use the Windows code and of course failed with trying to use a Windows only msvcrt Python function. |
|
|
|
Looks like they tried to just use the Windows code and of course failed with trying to use a Windows only msvcrt Python function.
It seems that you're right.
And it is currently still waiting to be addressed for Linux hosts:
Name 0_0-CRYPTICSCOUT_pocket_discovery_c82914d2_15b4_4300_b4db_cb72998e09bf-6-7-RND0445_6
Workunit 27641639
Created 26 Dec 2023 | 9:50:25 UTC
Sent 26 Dec 2023 | 9:50:26 UTC
Received 26 Dec 2023 | 9:57:14 UTC
Server state Over
Outcome Computation error
Client state Computation error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 186626
Report deadline 31 Dec 2023 | 9:50:26 UTC
Run time 23.07
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version ACEMD 3: molecular dynamics simulations for GPUs v2.21 (cuda1121)
Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
09:55:04 (339849): wrapper (7.7.26016): starting
09:55:25 (339849): wrapper (7.7.26016): starting
09:55:25 (339849): wrapper: running bin/acemd (--boinc --device 0)
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "runtime.py", line 8, in init runtime
File "/usr/lib/python3.10/platform.py", line 119, in <module>
import subprocess
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 72, in apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
from apport.report import Report
File "/usr/lib/python3/dist-packages/apport/report.py", line 12, in <module>
import subprocess, tempfile, os.path, re, pwd, grp, os, io
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'
Original exception was:
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "runtime.py", line 8, in init runtime
File "/usr/lib/python3.10/platform.py", line 119, in <module>
import subprocess
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'
Python error
09:55:26 (339849): bin/acemd exited; CPU time 0.032149
09:55:26 (339849): app exit status: 0x1
09:55:26 (339849): called boinc_finish(195)
</stderr_txt>
]]>
No hope for a solution in the short term, since universities usually shut down over Christmas...
Merry Xmas |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
I'm waiting till after New Years before bugging Gianni again with the request to fix the acemd3 app properly. |
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
I'm waiting till after New Years before bugging Gianni again with the request to fix the acemd3 app properly.
my Windows10 PCs were successfully crunching ACEMD 3 until this morning.
Within the past hour, some more ACEMD 3 tasks were downloaded and failed after about 1 minute.
See here: http://www.gpugrid.net/result.php?resultid=33725238 |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
I'm shocked to discover that this morning I have an acemd3 task that has been running for 50 minutes so far.
All previous tasks insta-failed, first on the missing license issue and then, once the app got updated in December, on a missing Windows file.
All my hosts are Linux based and no Windows has ever been installed.
The slot that has the running task in it has all the normal and usual files in it along with checkpoint files that made running acemd3 tasks so wonderful because they could be stopped and started without failing.
Wish the other tasks at GPUGrid had that same capability.
I must assume that the app got updated again and now works. And after looking at the apps list, I see that that is the case. New app released today for acemd3.
Thank you Gianni!! |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
But that is only one task out of about 20 so far today that is being successfully run. All the rest are ATMbeta and have failed due to bad configuration file inputs. |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
New Linux acemd3 app has an expiration date 3649 days into the future. Should not be an issue for years now.
#
# ACEMD version 3.7.3
#
# Copyright (C) 2017-2024 Acellera (www.acellera.com)
#
# By using ACEMD, you accept the terms and conditions of the ACEMD licence
# Check the licence by running "acemd --licence"
# More details: https://software.acellera.com/acemd/licence.html
#
# When publishing, please cite:
# ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
# M. J. Harvey, G. Giupponi and G. De Fabritiis,
# J Chem. Theory. Comput. 2009 5(6), pp1632-1639
# DOI: 10.1021/ct9000685
#
# Arguments:
# input: input
# platform:
# device: 2
# ncpus:
# precision: mixed
#
# ACEMD is running in Boinc mode!
#
# WARNING: This ACEMD version expires in 3649 days! |
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
New Linux acemd3 app has an expiration date 3649 days into the future. Should not be an issue for years now.
good news for the Linux crunchers.
However, it would be great if they did the same for the Windows version, and until that is done, they should stop sending out Windows tasks which keep failing within a minute.
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
You need to look at a running task while it is still in its slot and capture the stderr.txt and progress files for later examination before the task errors out and clears the slot.
Your uploaded result files do not have any useful information about why the tasks are failing.
You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.
acemd --licence would at least eliminate that as the issue. Or prove it. |
|
|
Erich56
Joined: 1 Jan 15 Posts: 1132 Credit: 10,250,932,676 RAC: 29,037,360
|
... Your uploaded result files do not have any useful information about why the tasks are failing.
yes, you are right, the task from the link I posted before does not show any stderr.txt, for whatever reason (I did not check this before, sorry for that). I have noticed that this is the case with all tasks from this PC, regardless of whether they succeed or fail; no idea why.
However, the stderr from the other PC where ACEMD 3 tasks also failed does work, here is an example:
http://www.gpugrid.net/result.php?resultid=33725327
You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.
acemd --licence would at least eliminate that as the issue. Or prove it.
Yesterday, ACEMD 3 tasks started failing at about the same time on both of my PCs (on a third PC, unfortunately, I cannot crunch ACEMD 3 because the app does not work with Ada Lovelace yet). So my guess, of course, was that this is not due to any problem with my hardware or software, but rather to a problem with the app itself, probably with the license.
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
The stderr.txt on Windows hosts never shows any reason for failing or succeeding.
I've never been able to decipher why all Windows tasks have the debug dump in their outputs.
You get the same dump output whether it succeeds or fails. They only ever display the generic error 195 BOINC catchall error code which does not explain anything.
The Linux stderr.txt output actually does show explicit reasons for why a task fails.
Your Quadro P5000 is NOT Ada generation, it's Pascal generation and Pascal cards have always worked with acemd tasks.
I've been trying to find the code path for these acemd tasks and haven't been able to deduce anything beyond the CONDA environment they set up in the job file and pass on the parameter file to the app.
They don't have the same layout the ATMbeta tasks use so you can follow along with the processing and figure out where they fail in the setup or processing flow.
There were a slug of acemd tasks released today that had the same issue with no Windows file found and all failed on all the Linux hosts. But they were not from the initial bad release but newly generated tasks today.
This task is an example of what I said.
https://www.gpugrid.net/result.php?resultid=33725353
It was attempted by 7 hosts of both Windows and Linux so the task itself is badly configured.
|
|
|
|
Actually, Erich's https://www.gpugrid.net/result.php?resultid=33725327 does contain a useful error code:
app exit status: 0xc0000135
That's a generic Windows NT code:
0xC0000135
STATUS_DLL_NOT_FOUND
{Unable To Locate Component} This application has failed to start because %hs was not found. Reinstalling the application might fix this problem.
You have to be careful and search Microsoft itself for that one: the general internet chatterbox will usually say that a specific component is at fault (usually the .NET framework), which is unlikely to be relevant for research applications. You might be able to get a name for the missing component by trying to launch the application manually in a terminal window - it should populate that %hs parameter. |
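As a small aid for reading those `app exit status` lines, here is a sketch that maps a hex status to its NTSTATUS name. The table holds only three well-known codes and is nowhere near complete, and the function name is my own; Microsoft's NTSTATUS reference remains the authority.

```python
# A few well-known NTSTATUS values; far from a complete list.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}


def describe_exit_status(text: str) -> str:
    """Turn a hex status such as '0xc0000135' into 'code NAME'."""
    code = int(text, 16) & 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(code, "(not in table)")
    return f"0x{code:08X} {name}"
```

`describe_exit_status("0xc0000135")` returns `0xC0000135 STATUS_DLL_NOT_FOUND`.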
|
|
|
4 ACEMD tasks received at this Linux host on January 7-8th still continued failing after a few seconds.
One example:
Application: ACEMD 3: molecular dynamics simulations for GPUs 2.22 (cuda1121)
Name: 0_2-CRYPTICSCOUT_pocket_discovery_f279f6d5_5830_427a_b012_ee7935c48e7f-1-3-RND8942
State: Computation error
Received: Mon 08 Jan 2024 02:58:37 WET
Report deadline: Sat 13 Jan 2024 02:58:36 WET
Resources: 0.49 CPUs + 1 NVIDIA GPU
Estimated computation size: 5,000,000 GFLOPs
CPU time: 00:00:00
Elapsed time: 00:00:33
Executable: wrapper_26198_x86_64-pc-linux-gnu |
|
|
|
They have "ModuleNotFoundError: No module named 'msvcrt'".
I think that stands for "MicroSoft Visual C RunTime [module]" - which is odd to see in a Linux package. |
|
|
|
Following on from the reported issue in the ATM thread ("exceeded elapsed time limit" error - message 61483):
I've finally caught one of these for inspection in daylight. It's on a Linux machine, so a slightly different version - v2.24, deployed 15 Apr 2024 - but it should be close enough.
Here are the vital statistics:
App speed: <flops> 6271039115434
Task size: <rsc_fpops_est> 1000000000000000000
Correction: <duration_correction_factor> 0.010000
for an estimated run time of 1594 seconds - or 26 minutes 34 seconds, shown in BOINC Manager.
The time limit for the task is set by <rsc_fpops_bound>, which is 10 times larger than the estimate. So, 4 hours, 25 minutes, 40 seconds on this GeForce GTX 1660 Ti. I'll let you know how it gets on - or you can look it up yourself this afternoon, at task 35250069.
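The arithmetic behind those figures can be sketched as follows. It simply follows the calculation described above (estimate = size / speed x correction factor, limit = 10 x estimate); it is not taken from the BOINC client source, and the function names are my own.

```python
def estimated_runtime_s(rsc_fpops_est: float, flops: float, dcf: float) -> float:
    """Estimated run time in seconds: task size / app speed, scaled by the correction factor."""
    return rsc_fpops_est / flops * dcf


def hms(seconds: float) -> str:
    """Format whole seconds as h:mm:ss."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


estimate = estimated_runtime_s(1_000_000_000_000_000_000, 6_271_039_115_434, 0.01)
limit = 10 * estimate  # the bound is described above as 10 times the estimate

print(hms(estimate))
print(hms(limit))
```

The estimate comes out at about 1594.6 seconds, i.e. 0:26:34. The limit prints as 4:25:46, a few seconds more than the figure quoted above, because that one multiplies the already rounded 26:34 by ten.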
Or not.
ACEMD failed:
Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)
Back to the drawing board, while it gets on with Quantum chemistry as usual! |
|
|
|
You may need to update your drivers.
____________
|
|
|
|
You may need to update your drivers.
It's a possibility - but the card/driver combo is accepted to run the cuda1121 version of QC. It's only the Python beta which needs cuda1131.
We'll see what happens when my other Linux machine catches a task - that does have a newer card and driver. |
|
|
|
OK, that's looking more plausible. My other machine (driver 535.99) has completed tasks on the primary RTX 3060 GPU, and is now running one on the secondary GTX 1660 GPU - no problems so far.
So I've upgraded the failing machine from driver 470.99 up to a matching 535.99: back to the long slow fishing game!
Meanwhile, I'll check the estimates for the task on the slower secondary card - that might be a (different) problem. |
|
|
|
I see we've been given a big new block of ACEMD tasks to chew on.
Here are my current estimates for host 132158, after 9 completed tasks:
nearly 12 days for Quantum Chemistry
5.5 hours for ACEMD 3
That's still pretty tight on maximum time, but I've already got two more tasks to run - and they're all running to completion for now. We'll take another look after 11 completed tasks, to see what effect that has. |
|
|
|
Yup, confirmed:
If you can get to 11 completed tasks, it's plain sailing from there on. The original 'time limit exceeded' problem was caused by the project's poor estimation of the work involved in completing the different work types - but it would be devilishly difficult for them to correct it at this late stage, without causing similar problems for other apps too. I suspect we'll have to live with it. |
|
|
|
I guess updating the drivers solved your previous problem.
the app may be labelled with an incorrect CUDA version requirement.
____________
|
|
|
|
I guess updating the drivers solved your previous problem.
Yes, that machine is running fine now - 6 tasks completed, plus two running.
It's still in the danger zone for 'exceeded elapsed time limit', but looks like it should pull through.
|
|
|
mrchips
Joined: 9 May 21 Posts: 16 Credit: 1,384,772,757 RAC: 2,689,499
|
All of my WU have failed for the past 3 days
-112 (0xffffffffffffff90) ERR_XML_PARSE
____________
|
|
|
|
It may be related to the ACEMD 3 app update to v2.28, deployed on 26/06/2024.
Your previous v2.27 tasks were completing correctly.
Wait until no tasks are running, then try resetting the GPUGRID project in BOINC Manager, so that all related libraries are downloaded afresh.
If that doesn't help, something might be wrong with the new version. |
|
|
mrchips
Joined: 9 May 21 Posts: 16 Credit: 1,384,772,757 RAC: 2,689,499
|
I just started a new computer yesterday to run gpugrid
all new uploads were performed, I think the new version is corrupt.
Is there a way to get back to the previous version?
____________
|
|
|
|
No way back.
Have to wait for debugging on Project's side... |
|
|
tomaras
Joined: 4 Mar 20 Posts: 15 Credit: 2,095,926,692 RAC: 11,984,665
|
Are there no work units? Is something amiss? |
|
|
roundup
Joined: 11 May 10 Posts: 63 Credit: 9,204,655,193 RAC: 55,255,849
|
Are there no work units? Is something amiss?
Watch the Server Status. It tells you how many work units are available:
https://www.gpugrid.net/server_status.php |
|
|
|
Bad batch of ACEMD 3: fails on Linux after ~20 sec, with:
ERROR: read error for file "input.coor", byte number 4: number of atoms (1880162304) != (107863) expected
ERROR: /home/user/mambaforge/conda-bld/acemd_1704215649797/work/src/mdsim/forcefield.cpp line 300: Cannot read BINCOORD file: input.coor
Tasks 35376930, 35377052, 35377128. |
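For anyone who still has one of these input files, a header check along the following lines might show whether the file is truncated or corrupted. Big caveat: it assumes the NAMD-style binary coordinate layout (a little-endian 32-bit atom count followed by three 64-bit floats per atom); whether ACEMD's BINCOORD reader uses exactly that layout is my assumption, and the function name is hypothetical.

```python
import struct


def check_bincoor(data: bytes, expected_atoms: int) -> list[str]:
    """Compare a binary coordinate file's header and size against the expected atom count."""
    if len(data) < 4:
        return ["file shorter than the 4-byte atom-count header"]
    problems = []
    # Assumed layout: int32 atom count, then x, y, z as float64 for every atom
    (natoms,) = struct.unpack("<i", data[:4])
    if natoms != expected_atoms:
        problems.append(f"header says {natoms} atoms, expected {expected_atoms}")
    expected_size = 4 + 24 * expected_atoms
    if len(data) != expected_size:
        problems.append(f"size {len(data)} bytes, expected {expected_size}")
    return problems
```

Usage would be `check_bincoor(open("input.coor", "rb").read(), 107863)`; an empty list means header and size both match under the assumed layout.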
|
|
Pascal
Joined: 15 Jul 20 Posts: 77 Credit: 1,581,272,434 RAC: 11,519,211
|
No problems for me under Linux Mint.
https://www.gpugrid.net/results.php?userid=563937
____________
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,671,980,095 RAC: 13,342,645
|
No problems for me under Linux Mint.
https://www.gpugrid.net/results.php?userid=563937
You just haven't been sent one of the bad ones yet.
https://www.gpugrid.net/result.php?resultid=35372532
https://www.gpugrid.net/result.php?resultid=35376959
Host https://www.gpugrid.net/show_host_detail.php?hostid=462662 |
|
|
roundup
Joined: 11 May 10 Posts: 63 Credit: 9,204,655,193 RAC: 55,255,849
|
Bad batch of ACEMD 3: fails on Linux after ~20 sec, with:
ERROR: read error for file "input.coor", byte number 4: number of atoms (1880162304) != (107863) expected
ERROR: /home/user/mambaforge/conda-bld/acemd_1704215649797/work/src/mdsim/forcefield.cpp line 300: Cannot read BINCOORD file: input.coor
Same here:
https://www.gpugrid.net/workunit.php?wuid=28923983 |
|
|
Peak7100
Joined: 15 Jun 09 Posts: 8 Credit: 468,764,244 RAC: 5,787,164
|
All of the ACEMD 3 2.28 tasks I received have failed, as they have for the other people who were running them. So I'd venture that, with it being a fresh release, there are some issues with this batch.
I recently started back with the project after a fairly long hiatus, so excuse me if this was obvious.
____________
|
|
|
Peak7100
Joined: 15 Jun 09 Posts: 8 Credit: 468,764,244 RAC: 5,787,164
|
Darn lag in posting!
____________
|
|
|
mrchips
Joined: 9 May 21 Posts: 16 Credit: 1,384,772,757 RAC: 2,689,499
|
mine are still failing
Client state Compute error
Exit status -112 (0xffffffffffffff90) ERR_XML_PARSE
____________
|
|
|
|
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
(unknown error) (317) - exit code 4294967184 (0xffffff90)</message>
]]>
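In case the two numbers look unrelated: the 4294967184 in the message and the -112 shown as exit status are the same 32-bit value, read once as unsigned and once as signed. A quick sketch of the conversion (the function name is mine):

```python
def to_signed32(code: int) -> int:
    """Reinterpret the low 32 bits of an exit code as a signed integer."""
    code &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement
    return code - 0x100000000 if code & 0x80000000 else code
```

`to_signed32(4294967184)` gives -112, which is the ERR_XML_PARSE code shown on the task page.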
Is there anything I can do on my end to resolve this issue?
Ryzen 4500 6core
GTX1060
32GB RAM
Win11
All drivers updated. |
|
|
|
Hi, same here .....
Name e51s3_e39s1p0f34-ADRIA_Explor_srcpp1_e2t_25ns_allcontacts4_10us_b2-0-1-RND8030_3
Workunit 28934849
Created 15 Jul 2024 | 8:36:59 UTC
Sent 15 Jul 2024 | 8:37:22 UTC
Received 15 Jul 2024 | 8:38:05 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -112 (0xffffffffffffff90) ERR_XML_PARSE
Computer ID 400693
Report deadline 20 Jul 2024 | 8:37:22 UTC
Run time 2.66
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version ACEMD 3: molecular dynamics simulations for GPUs v2.28 (cuda1121)
Stderr output
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
(unknown error) (317) - exit code 4294967184 (0xffffff90)</message>
]]>
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King) |
|
|
PascalSend message
Joined: 15 Jul 20 Posts: 77 Credit: 1,581,272,434 RAC: 11,519,211 Level
Scientific publications
|
Good evening.
To stop getting errors, I advise you to switch to Linux if you are on Windows.
Linux seems complicated, but it is simple. With a bit of searching, you can find all the answers on Google.
I used to hate Linux, but it works better than Windows.
I no longer get errors, except the ones I cause myself or cancellations by the GPUGRID server because of my slow internet connection.
I was on Windows for years and switched to Linux 3 or 4 months ago; I would not put Windows back on my tower. Above all, avoid AMD GPUs under Linux, because they are complicated to install, whereas NVIDIA GPUs are wonderfully simple.
linux mint-
i5 11400f-
rtx 4060-
rtx a2000-
32 GB of RAM
1 TB SSD-
msi z590 wifi
____________
|
|
|
|
Hi, thanks Pascal, but "mia mamma usa Windows" (my mom likes and uses Windows).
K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King) |
|
|
PascalSend message
Joined: 15 Jul 20 Posts: 77 Credit: 1,581,272,434 RAC: 11,519,211 Level
Scientific publications
|
That's a pity. I used Windows too and didn't want to learn Linux, but now it's the opposite. Ciao.
Good day to all our Italian friends.
____________
|
|
|
SteveVolunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message
Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello, it does appear that the Windows app version 2.28 is now broken. 2.27 worked. We are investigating. |
|
|
SteveVolunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message
Joined: 21 Dec 23 Posts: 46 Credit: 0 RAC: 0 Level
Scientific publications
|
It is now fixed; one of the files was corrupted. |
|
|
|
Hi, thanks a lot.
K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King) |
|
|
Peak7100 Send message
Joined: 15 Jun 09 Posts: 8 Credit: 468,764,244 RAC: 5,787,164 Level
Scientific publications
|
Have started getting the ACEMD 3: molecular dynamics simulations for GPUs v2.30 tasks for Windows.
It's good even if it's only a trickle, which has been 3 tasks today. Anything is welcome! |
|
|
|
Hi,
I was playing with ZLUDA and got the following error:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
http://www.gpugrid.net/show_host_detail.php?hostid=603175
http://www.gpugrid.net/result.php?resultid=35643503
Hopefully helpful
coproc_info.xml
<cudaVersion>12020</cudaVersion>
<major>8</major>
<minor>8</minor>
I'm thinking the CUDA toolkit major/minor version is outside the range the app is compatible with. Could we get v8?
I was not able to switch to a different CUDA version by installing different CUDA toolkits
(I tried 11.2, 11.6, 12.2, and 12.4; BOINC detects it as 12.2), and I don't have enough tasks to really validate my findings, as I got only 1 task in an hour.
|
|
|
|
I think, at this project, that the application is more likely to be having a problem with your GPU hardware than the driver version. |
|
|
|
ZLUDA will only work with PTX code, which is agnostic to the compute capability (CC) version.
The acemd3 app is not compiled with PTX code; it is compiled for discrete CC values (6.0, 7.5, 8.6, etc.). ZLUDA reports a CC of 8.8, which is not a real NVIDIA CC, so it is not possible to compile non-PTX code for it.
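To illustrate why a compute capability (CC) of 8.8 trips the compiler, here is a rough sketch that turns the major/minor values from coproc_info.xml into an architecture flag and checks them against a list of CCs NVIDIA has actually shipped. The list is my own, reflects the CUDA 12.2 era, and is not exhaustive; the helper names are hypothetical and this is not how acemd3 itself selects its architecture.

```python
# Assumption: illustrative, non-exhaustive set of real NVIDIA compute capabilities.
KNOWN_COMPUTE_CAPABILITIES = {
    (5, 0), (5, 2), (5, 3),
    (6, 0), (6, 1), (6, 2),
    (7, 0), (7, 2), (7, 5),
    (8, 0), (8, 6), (8, 7), (8, 9),
    (9, 0),
}

def arch_flag(major: int, minor: int) -> str:
    """Build a --gpu-architecture flag of the form nvrtc accepts, e.g. compute_86."""
    return f"--gpu-architecture=compute_{major}{minor}"

def is_known_capability(major: int, minor: int) -> bool:
    """True when (major, minor) is in the illustrative list above."""
    return (major, minor) in KNOWN_COMPUTE_CAPABILITIES

# Values reported in the coproc_info.xml excerpt from the ZLUDA host.
zluda_major, zluda_minor = 8, 8
flag = arch_flag(zluda_major, zluda_minor)
supported = is_known_capability(zluda_major, zluda_minor)
```

For the ZLUDA host, `supported` is false: 8.8 sits between the real values 8.7 and 8.9, so a compiler asked to target it has no matching architecture, which is consistent with the "invalid value for --gpu-architecture" error.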
____________
|
|
|