Message boards : Number crunching : All ATM beta error out

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,331,546,800
RAC: 0
Message 60554 - Posted: 27 Jun 2023 | 4:06:06 UTC
Last modified: 27 Jun 2023 | 4:09:03 UTC

A pair of RTX 2070 SUPERs:

6/26/2023 10:58:22 PM CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 2070 SUPER (driver version 535.98, CUDA version 12.2, compute capability 7.5, 8192MB, 8192MB available, 9216 GFLOPS peak)
6/26/2023 10:58:22 PM CUDA: NVIDIA GPU 1: NVIDIA GeForce RTX 2070 SUPER (driver version 535.98, CUDA version 12.2, compute capability 7.5, 8192MB, 8192MB available, 9216 GFLOPS peak)


Einstein and Asteroids tasks run fine, but 9 out of 10 GPUGRID tasks terminate after about 90 seconds. All 78 tasks show up as errors on the website.
I have no idea what the problem is.

http://www.gpugrid.net/results.php?hostid=605305&offset=60&show_names=0&state=0&appid=
____________
try my performance program, the BoincTasks History Reader.
Find and read about it here: https://forum.efmer.com/index.php?topic=1355.0

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 729,045,933
RAC: 56,364
Message 60768 - Posted: 29 Sep 2023 | 18:10:51 UTC - in response to Message 60554.

I often see a similar problem.

https://www.gpugrid.net/result.php?resultid=33631003

Also, these workunits usually sit at 100% completion for hours before they finish.

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 60786 - Posted: 31 Oct 2023 | 11:08:00 UTC

Failing for me as well, on both of my systems:

Linux Mint with Nvidia A4000
Windows 11 with RTX 4090

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60787 - Posted: 31 Oct 2023 | 12:06:07 UTC

Working fine on all my systems (Linux + various Nvidia GPUs). I see roughly a 7% error rate, which is in line with previous batches.

Keith Myers
Joined: 13 Dec 17
Posts: 1289
Credit: 5,281,381,959
RAC: 11,204,357
Message 60797 - Posted: 31 Oct 2023 | 18:17:13 UTC - in response to Message 60786.

I've never been able to decipher most of the errors on Windows hosts.

But your error on your Linux-A4000 host was simply that you interrupted the task during its run.

ATMbeta tasks can't be interrupted, paused, or restarted once they have begun, or the task will error out.

If you let the tasks run without any stoppage, they will complete just fine and validate.
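For anyone who wants to reduce the odds of BOINC interrupting a running ATMbeta task, here is a minimal sketch of a global_prefs_override.xml. These are standard BOINC client preference tags, but the values are illustrative, not project guidance: it raises the task-switch interval so BOINC doesn't rotate another GPU project in mid-run, and keeps suspended tasks in memory so a brief suspension doesn't force the fatal restart.

<global_preferences>
   <!-- keep suspended tasks in memory so they resume instead of restarting -->
   <leave_apps_in_memory>1</leave_apps_in_memory>
   <!-- switch between tasks every 600 minutes instead of the default 60 -->
   <cpu_scheduling_period_minutes>600</cpu_scheduling_period_minutes>
</global_preferences>

Drop the file into the BOINC data directory and use Options > Read local prefs file in BOINC Manager (or restart the client) for it to take effect.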

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60798 - Posted: 31 Oct 2023 | 18:36:19 UTC
Last modified: 31 Oct 2023 | 18:47:25 UTC

The Windows issue might be a permissions problem or something similar. All of Ryan's Windows tasks end with "cmd.exe exited", which might be due to BOINC lacking permission to call the local system's cmd.exe.

Try running BOINC as Administrator. Just a guess; maybe that's not it.

For his Linux task: he might not have interrupted it intentionally. It looks like the app (or the wrapper) hit a segfault for some reason after running for about 4 minutes. Then it restarted, and of course it failed after the restart, as these tasks always do.

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 60816 - Posted: 1 Nov 2023 | 14:15:34 UTC - in response to Message 60798.

The Windows issue might be a permissions problem or something similar. All of Ryan's Windows tasks end with "cmd.exe exited", which might be due to BOINC lacking permission to call the local system's cmd.exe.

Try running BOINC as Administrator. Just a guess; maybe that's not it.

For his Linux task: he might not have interrupted it intentionally. It looks like the app (or the wrapper) hit a segfault for some reason after running for about 4 minutes. Then it restarted, and of course it failed after the restart, as these tasks always do.


Just had another one fail on the Linux box. The machine had not been touched between the task downloading and me seeing it sitting there with a computation error. It failed in 42 s.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60817 - Posted: 1 Nov 2023 | 14:46:40 UTC - in response to Message 60816.

Just had another one fail on the Linux box. The machine had not been touched between the task downloading and me seeing it sitting there with a computation error. It failed in 42 s.


The one that just failed today actually ran for over an hour (see here). It failed with the common "Energy is NaN" error, which happens occasionally. You just had bad luck that it happened on one of the first tasks you've run from this batch. There is nothing wrong with your system; this just happens sometimes, to everyone.

The one that failed in "42 seconds" was from yesterday. It also actually ran for about 4 minutes, but the timer reset when the task tried to restart (which is what ultimately caused the computation error). You can clearly see the real runtime from the timestamps in the logs (see here):

Started at 09:26:01
Segfault at about 09:30:xx
Restarted at 09:30:26
Error at 09:30:44

All very understandable and explainable, given the current idiosyncrasies of the ATMbeta app and what specifically happened on your system.
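If you want to do the same check on your own tasks, here is a small illustrative sketch (my own, not a project tool) that pulls the wall-clock runtime out of a wrapper stderr dump, assuming the "HH:MM:SS (pid):" line format shown in the logs in this thread:

import re
from datetime import datetime, timedelta

def real_runtime(stderr_text):
    # collect every "HH:MM:SS (pid):" timestamp the wrapper printed
    stamps = [datetime.strptime(m.group(1), "%H:%M:%S")
              for m in re.finditer(r"^(\d{1,2}:\d{2}:\d{2}) \(\d+\):",
                                   stderr_text, re.M)]
    if not stamps:
        return timedelta(0)
    span = stamps[-1] - stamps[0]
    # crude guard for runs that cross midnight
    return span if span >= timedelta(0) else span + timedelta(days=1)

log = """09:26:01 (12345): wrapper (7.9.26016): starting
09:30:44 (12345): called boinc_finish(195)"""
print(real_runtime(log))  # 0:04:43, despite the "42 seconds" BOINC reported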



lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 724,942,359
RAC: 1,388,800
Message 60833 - Posted: 2 Nov 2023 | 15:08:52 UTC - in response to Message 60554.

I'm seeing the same problem with ATMbeta. They either error out in the first 5 minutes, or go to 100% and then take a long time before erroring out.

I just upgraded to an RTX 4080 from my GTX 980 Ti, as I was getting a lot of failures on Einstein tasks. That has stopped, but GPU tasks from this project are still erroring.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60834 - Posted: 2 Nov 2023 | 15:19:36 UTC - in response to Message 60833.
Last modified: 2 Nov 2023 | 15:20:12 UTC

I'm seeing the same problem with ATMbeta. They either error out in the first 5 minutes, or go to 100% and then take a long time before erroring out.

I just upgraded to an RTX 4080 from my GTX 980 Ti, as I was getting a lot of failures on Einstein tasks. That has stopped, but GPU tasks from this project are still erroring.


All of your tasks from the last few days completed successfully. You have no errors listed on your host details.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 724,942,359
RAC: 1,388,800
Message 60835 - Posted: 2 Nov 2023 | 16:05:23 UTC - in response to Message 60834.

Yes, because it's STILL running at 100% after over an hour. It was estimated to run 28 days, then went to 100% in 5 minutes, and it is now still running at 1h22m, 100% complete, and has not terminated. Task Manager still shows high utilization for the task.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60836 - Posted: 2 Nov 2023 | 16:41:46 UTC - in response to Message 60835.

This behavior has been known and discussed for like 8 months now.

The first segments of these tasks (those with “0-5” or “0-10” in the filename) will show normal progress. Subsequent segment tasks, with “1-5”, “2-5”, “1-10”, “2-10”, etc., will all jump to 100% immediately, due to a bug in how the application reports progress. If you leave such a task alone and don't touch it, it will complete successfully.

The tasks you have on your host now are indeed this type, with both listed as “2-10” units. This is normal/expected, and you can expect the same with the rest of the batch, all the way to the 9-10s. Just leave them be and let them do their thing.

It’s not an error unless BOINC says “Computation Error”, which it hasn’t.
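As a sketch of that naming convention (the helper below is mine, for illustration only): the segment marker can be read straight out of the task name, and anything past segment 0 is expected to sit at 100% while it runs.

import re

def segment_of(task_name):
    # e.g. "..._dih14fit-2-5-RND8000_0" -> (2, 5), i.e. segment 2 of 5
    m = re.search(r"-(\d+)-(\d+)-RND", task_name)
    return (int(m.group(1)), int(m.group(2))) if m else None

name = "JNK1_m29_m27_3-QUICO_ATM_500K_dih14fit-2-5-RND8000_0"
seg = segment_of(name)
if seg and seg[0] > 0:
    print("segment %d of %d: progress pegged at 100%% is normal" % seg)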

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 60837 - Posted: 2 Nov 2023 | 19:20:53 UTC

Looks like units are completing fine now.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 724,942,359
RAC: 1,388,800
Message 60840 - Posted: 3 Nov 2023 | 16:45:55 UTC - in response to Message 60836.

You spoke too soon:

https://www.gpugrid.net/workunit.php?wuid=27605492

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60841 - Posted: 3 Nov 2023 | 16:55:46 UTC - in response to Message 60840.
Last modified: 3 Nov 2023 | 17:27:52 UTC

You spoke too soon:

https://www.gpugrid.net/workunit.php?wuid=27605492


Not really. You hit the "Energy is NaN" error, which is also well known to happen. It happens to everyone occasionally: about 7% of my results hit this error at some point (69 out of 1102 completed results), sometimes 5 minutes in, sometimes 5 hours in. It usually means that there is something wrong with the task setup (from the project), though sometimes the task will complete successfully on another host on a resend.

That's not the same issue some folks are having where all tasks error out. That usually happens on Windows hosts, unfortunately, and might be a permissions issue or some other fringe thing with Windows that hasn't been hashed out yet by the project (or users). Some Windows users seem to have little issue, and others seem to have only issues.

But your host is in general working as expected, since you've submitted successful tasks. Hanging at 100% is NOT the root cause of the error you got; it's simply an idiosyncrasy of the higher-level task segments. Two totally different things are at play there. Your other tasks, which also stuck at 100%, completed fine.
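For scale, the quoted numbers work out as plain arithmetic:

errors, completed = 69, 1102
print("NaN-error rate: %.1f%%" % (100 * errors / completed))  # 6.3%, roughly the 7% quoted above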

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 724,942,359
RAC: 1,388,800
Message 60844 - Posted: 4 Nov 2023 | 19:12:43 UTC - in response to Message 60841.

Thanks for the background. I would, however, reconsider the term "well known"; it's clearly not well known to the casual participant.

Has anyone considered or tabulated the wasted power, time, and corresponding CO2 emissions from these broken WUs? Wasting resources on crypto is one thing, but we shouldn't be wasting volunteers' resources. These add up.

Another just hit:
https://www.gpugrid.net/workunit.php?wuid=27606226

Keith Myers
Joined: 13 Dec 17
Posts: 1289
Credit: 5,281,381,959
RAC: 11,204,357
Message 60845 - Posted: 4 Nov 2023 | 22:33:39 UTC - in response to Message 60844.

If you are concerned, then you can stop your consternation by simply removing this project.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 724,942,359
RAC: 1,388,800
Message 60849 - Posted: 5 Nov 2023 | 9:55:04 UTC - in response to Message 60845.

Done.

Thanks!

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 60855 - Posted: 7 Nov 2023 | 19:58:31 UTC

Still failing for me; I've blocked the 4090 machine for now. Has anyone on 40xx-series cards gotten them working? If so, what magic are you using?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60856 - Posted: 7 Nov 2023 | 20:12:34 UTC - in response to Message 60855.

Still failing for me; I've blocked the 4090 machine for now. Has anyone on 40xx-series cards gotten them working? If so, what magic are you using?


Did you try running BOINC as Administrator, so that you have elevated privileges?

Or switch to Linux.

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 60864 - Posted: 9 Nov 2023 | 16:02:00 UTC - in response to Message 60856.

Yeah, I tried running as admin; same issue. Switching this machine to Linux is not an option.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60865 - Posted: 9 Nov 2023 | 16:48:46 UTC - in response to Message 60864.

Do you have antivirus (AV) software on the system? Is it possible that it's blocking some activity from the app, like preventing the download of the extra packages it needs? It seems to be failing sometime between downloading the extra packages and running the app.

Try disabling your AV, or whitelisting the BOINC data directory.
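If the AV in question is Windows Defender, here is a sketch of the whitelisting step, driven from Python for illustration. Add-MpPreference is the stock Defender cmdlet for exclusions; the path is a guess at a default install (the logs in this thread show at least one host keeping its data on D:\data, so adjust accordingly), and it must be run elevated.

import subprocess

# illustrative default; point this at your actual BOINC data directory
BOINC_DATA = r"C:\ProgramData\BOINC"

# Add-MpPreference adds a Defender scan exclusion for the given path
subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     "Add-MpPreference -ExclusionPath '%s'" % BOINC_DATA],
    check=True,
)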

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 108,281,939
RAC: 39,088
Message 60866 - Posted: 10 Nov 2023 | 0:34:10 UTC

This is a dead task's stderr output (currently at 2 hours and showing 100%), and since I am going to bed for 8 hours, I am aborting it. It's only my 3rd, and I have a 4th like it.

23:28:42 (16788): wrapper (7.9.26016): starting
23:28:42 (16788): wrapper: running python.exe (bin/conda-unpack)
23:28:44 (16788): python.exe exited; CPU time 0.000000
23:28:44 (16788): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
run.bat
run.sh
tnks2_m5b_m5c_0.xml
tnks2_m5b_m5c_asyncre.cntl
tnks2_m5b_m5c.inpcrd
tnks2_m5b_m5c.prmtop
23:28:45 (16788): Library/usr/bin/tar.exe exited; CPU time 0.031250
23:28:45 (16788): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'D:\data\slots\13\tmp\pip-req-build-y0gn8rc9'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.


This is a completed one (at a quick glance at the opening commands, I can see no difference):

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<stderr_txt>
19:06:49 (13988): wrapper (7.9.26016): starting
19:06:49 (13988): wrapper: running python.exe (bin/conda-unpack)
19:06:52 (13988): python.exe exited; CPU time 0.000000
19:06:52 (13988): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
run.bat
run.sh
tnks2_m1b_m5h_0.xml
tnks2_m1b_m5h_asyncre.cntl
tnks2_m1b_m5h.inpcrd
tnks2_m1b_m5h.prmtop
19:06:53 (13988): Library/usr/bin/tar.exe exited; CPU time 0.015625
19:06:53 (13988): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'D:\data\slots\15\tmp\pip-req-build-o6pc0y1a'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
tar: run.log: file changed as we read it
03:07:01 (13988): C:/Windows/system32/cmd.exe exited; CPU time 27177.796875
03:07:01 (13988): called boinc_finish(0)
0 bytes in 0 Free Blocks.
310 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 427320182 bytes.
Dumping objects ->
{3078551} normal block at 0x00000261DB821170, 48 bytes long.
Data: <PATH=D:\data\slo> 50 41 54 48 3D 44 3A 5C 64 61 74 61 5C 73 6C 6F
{3078530} normal block at 0x00000261D9D674A0, 139 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {3078527} normal block at 0x00000261D9D35490, 8 bytes long.
Data: < &#131;&#219;a > 00 00 83 DB 61 02 00 00
{3077868} normal block at 0x00000261D9D67640, 139 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{3077233} normal block at 0x00000261D9D34C20, 8 bytes long.
Data: < {&#148;&#219;a > 80 7B 94 DB 61 02 00 00
..\zip\boinc_zip.cpp(122) : {278} normal block at 0x00000261D9D3F160, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{263} normal block at 0x00000261D9D408E0, 16 bytes long.
Data: <&#168; &#212;&#217;a > A8 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{262} normal block at 0x00000261D9D410B0, 16 bytes long.
Data: < &#212;&#217;a > 80 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{261} normal block at 0x00000261D9D41010, 16 bytes long.
Data: <X &#212;&#217;a > 58 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{260} normal block at 0x00000261D9D40E30, 16 bytes long.
Data: <0 &#212;&#217;a > 30 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{259} normal block at 0x00000261D9D40CF0, 16 bytes long.
Data: < &#212;&#217;a > 08 19 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{258} normal block at 0x00000261D9D40700, 16 bytes long.
Data: <&#224; &#212;&#217;a > E0 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{257} normal block at 0x00000261D9D393D0, 32 bytes long.
Data: <CUDA_DEVICE=0 PU> 43 55 44 41 5F 44 45 56 49 43 45 3D 30 00 50 55
{256} normal block at 0x00000261D9D407A0, 16 bytes long.
Data: < &#162;&#211;&#217;a > 10 A2 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00
{255} normal block at 0x00000261D9D3A210, 40 bytes long.
Data: <&#160; &#212;&#217;a &#208;&#147;&#211;&#217;a > A0 07 D4 D9 61 02 00 00 D0 93 D3 D9 61 02 00 00
{254} normal block at 0x00000261D9D411F0, 16 bytes long.
Data: <&#192; &#212;&#217;a > C0 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{253} normal block at 0x00000261D9D406B0, 16 bytes long.
Data: < &#212;&#217;a > 98 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{252} normal block at 0x00000261D9D38AD0, 32 bytes long.
Data: <C:/Windows/syste> 43 3A 2F 57 69 6E 64 6F 77 73 2F 73 79 73 74 65
{251} normal block at 0x00000261D9D41060, 16 bytes long.
Data: <p &#212;&#217;a > 70 18 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{250} normal block at 0x00000261D9D391F0, 32 bytes long.
Data: <xjvf input.tar.b> 78 6A 76 66 20 69 6E 70 75 74 2E 74 61 72 2E 62
{249} normal block at 0x00000261D9D40DE0, 16 bytes long.
Data: <&#184; &#212;&#217;a > B8 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{248} normal block at 0x00000261D9D40F70, 16 bytes long.
Data: < &#212;&#217;a > 90 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{247} normal block at 0x00000261D9D40F20, 16 bytes long.
Data: <h &#212;&#217;a > 68 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{246} normal block at 0x00000261D9D40CA0, 16 bytes long.
Data: <@ &#212;&#217;a > 40 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{245} normal block at 0x00000261D9D40C50, 16 bytes long.
Data: < &#212;&#217;a > 18 17 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{244} normal block at 0x00000261D9D40C00, 16 bytes long.
Data: <&#240; &#212;&#217;a > F0 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{242} normal block at 0x00000261D9D40BB0, 16 bytes long.
Data: < &#165;&#211;&#217;a > 20 A5 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00
{241} normal block at 0x00000261D9D3A520, 40 bytes long.
Data: <&#176; &#212;&#217;a p &#130;&#219;a > B0 0B D4 D9 61 02 00 00 70 11 82 DB 61 02 00 00
{240} normal block at 0x00000261D9D40B60, 16 bytes long.
Data: <&#208; &#212;&#217;a > D0 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{239} normal block at 0x00000261D9D40FC0, 16 bytes long.
Data: <&#168; &#212;&#217;a > A8 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{238} normal block at 0x00000261D9D39130, 32 bytes long.
Data: <Library/usr/bin/> 4C 69 62 72 61 72 79 2F 75 73 72 2F 62 69 6E 2F
{237} normal block at 0x00000261D9D411A0, 16 bytes long.
Data: < &#212;&#217;a > 80 16 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{236} normal block at 0x00000261D9D392B0, 32 bytes long.
Data: <bin/conda-unpack> 62 69 6E 2F 63 6F 6E 64 61 2D 75 6E 70 61 63 6B
{235} normal block at 0x00000261D9D40ED0, 16 bytes long.
Data: <&#200; &#212;&#217;a > C8 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{234} normal block at 0x00000261D9D40E80, 16 bytes long.
Data: <&#160; &#212;&#217;a > A0 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{233} normal block at 0x00000261D9D40B10, 16 bytes long.
Data: <x &#212;&#217;a > 78 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{232} normal block at 0x00000261D9D40A20, 16 bytes long.
Data: <P &#212;&#217;a > 50 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{231} normal block at 0x00000261D9D40610, 16 bytes long.
Data: <( &#212;&#217;a > 28 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{230} normal block at 0x00000261D9D40570, 16 bytes long.
Data: < &#212;&#217;a > 00 15 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{229} normal block at 0x00000261D9D41150, 16 bytes long.
Data: <&#224; &#212;&#217;a > E0 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{228} normal block at 0x00000261D9D40520, 16 bytes long.
Data: <&#184; &#212;&#217;a > B8 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{227} normal block at 0x00000261D9D40980, 16 bytes long.
Data: < &#212;&#217;a > 90 14 D4 D9 61 02 00 00 00 00 00 00 00 00 00 00
{226} normal block at 0x00000261D9D41490, 1488 bytes long.
Data: < &#212;&#217;a python.e> 80 09 D4 D9 61 02 00 00 70 79 74 68 6F 6E 2E 65
{90} normal block at 0x00000261D9D39070, 32 bytes long.
Data: <windows_x86_64__> 77 69 6E 64 6F 77 73 5F 78 38 36 5F 36 34 5F 5F
{89} normal block at 0x00000261D9D34900, 16 bytes long.
Data: <&#192;&#167;&#211;&#217;a > C0 A7 D3 D9 61 02 00 00 00 00 00 00 00 00 00 00
{88} normal block at 0x00000261D9D3A7C0, 40 bytes long.
Data: < I&#211;&#217;a p &#211;&#217;a > 00 49 D3 D9 61 02 00 00 70 90 D3 D9 61 02 00 00
{67} normal block at 0x00000261D9D34AE0, 16 bytes long.
Data: < &#234;,&#227;&#247; > 80 EA 2C E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x00000261D9D34A90, 16 bytes long.
Data: <@&#233;,&#227;&#247; > 40 E9 2C E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x00000261D9D35170, 16 bytes long.
Data: <&#248;W)&#227;&#247; > F8 57 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x00000261D9D348B0, 16 bytes long.
Data: <&#216;W)&#227;&#247; > D8 57 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x00000261D9D35350, 16 bytes long.
Data: <P )&#227;&#247; > 50 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x00000261D9D34EA0, 16 bytes long.
Data: <0 )&#227;&#247; > 30 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x00000261D9D352B0, 16 bytes long.
Data: <&#224; )&#227;&#247; > E0 02 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x00000261D9D34A40, 16 bytes long.
Data: < )&#227;&#247; > 10 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x00000261D9D353F0, 16 bytes long.
Data: <p )&#227;&#247; > 70 04 29 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x00000261D9D35260, 16 bytes long.
Data: < &#192;'&#227;&#247; > 18 C0 27 E3 F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 60867 - Posted: 10 Nov 2023 | 1:26:01 UTC - in response to Message 60866.

It's not dead; it's just not finished.

Please read the lengthy ATM post in the News forum. It's known that all task segments except the first exhibit this behavior of staying pegged at 100%. It's fine, just let it finish.

Greg _BE
Joined: 30 Jun 14
Posts: 126
Credit: 108,281,939
RAC: 39,088
Message 60868 - Posted: 10 Nov 2023 | 11:10:51 UTC - in response to Message 60867.

It's not dead; it's just not finished.

Please read the lengthy ATM post in the News forum. It's known that all task segments except the first exhibit this behavior of staying pegged at 100%. It's fine, just let it finish.


Ok, thanks

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 61070 - Posted: 25 Jan 2024 | 9:45:13 UTC

Thought I would try again; still failing. Did anyone ever come up with a fix to get these running on 40-series cards?

Failed unit: https://www.gpugrid.net/result.php?resultid=33751449

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 61077 - Posted: 25 Jan 2024 | 14:10:20 UTC - in response to Message 61070.

Using Linux seems to be the best way to run them.

Keith Myers
Joined: 13 Dec 17
Posts: 1289
Credit: 5,281,381,959
RAC: 11,204,357
Message 61081 - Posted: 25 Jan 2024 | 17:02:40 UTC - in response to Message 61070.

Thought I would try again; still failing. Did anyone ever come up with a fix to get these running on 40-series cards?

Failed unit: https://www.gpugrid.net/result.php?resultid=33751449

You got one of the bad work units that are still floating around.

Read the News ATMbeta thread. The researcher updated the Windows packaging to fix the error in the task type you referenced.

But it will only apply to newly generated work units.

Try and get one of the new ones.

Tex1954
Joined: 20 May 11
Posts: 16
Credit: 86,798,974
RAC: 1,129
Message 61085 - Posted: 25 Jan 2024 | 19:43:01 UTC - in response to Message 61081.
Last modified: 25 Jan 2024 | 19:44:34 UTC

I have a 5950X setup with a perfectly running 2080 Super, and 3 tasks I got today (so far) have all failed...

8-)

GPUGRID 1.09 ATMbeta: Free energy calculations of protein-ligand binding (cuda1121) JNK1_m29_m27_3-QUICO_ATM_500K_dih14fit-2-5-RND8000_0 00:00:32 (00:00:00) 1/25/2024 1:16:36 PM 1/25/2024 1:19:36 PM 0.989C + 1NV 0.0 Reported: Computation error (195,) Win11-Main

[BAT] Svennemans
Joined: 27 May 21
Posts: 50
Credit: 310,572,017
RAC: 3,237,180
Message 61087 - Posted: 25 Jan 2024 | 19:53:28 UTC - in response to Message 61085.

I have a 5950X setup with a perfectly running 2080 Super, and 3 tasks I got today (so far) have all failed...

8-)

GPUGRID 1.09 ATMbeta: Free energy calculations of protein-ligand binding (cuda1121) JNK1_m29_m27_3-QUICO_ATM_500K_dih14fit-2-5-RND8000_0 00:00:32 (00:00:00) 1/25/2024 1:16:36 PM 1/25/2024 1:19:36 PM 0.989C + 1NV 0.0 Reported: Computation error (195,) Win11-Main



As Keith already said just above: the fix is known, but it will be deployed at the earliest with the next batch of tasks. The current batch cannot be retroactively fixed...

See post here: https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#61076

I'm waiting on that just as you are. All it needs is a little patience.

Tex1954
Joined: 20 May 11
Posts: 16
Credit: 86,798,974
RAC: 1,129
Message 61089 - Posted: 25 Jan 2024 | 22:24:38 UTC - in response to Message 61087.

Ohhh! So those were stray OLD tasks...

Gotcha!

Thanks!

8-)

Ryan Munro
Joined: 6 Mar 18
Posts: 33
Credit: 469,025,077
RAC: 2,800,844
Message 61090 - Posted: 25 Jan 2024 | 22:50:31 UTC - in response to Message 61081.

Cheers, I'll leave it enabled then and see what happens.

Nuadormrac
Joined: 21 Jul 12
Posts: 7
Credit: 383,434,258
RAC: 1,291,631
Message 61092 - Posted: 26 Jan 2024 | 3:12:20 UTC
Last modified: 26 Jan 2024 | 3:15:11 UTC

I'm noticing an interesting aside with regard to ATM WUs. All of mine seem to be erroring out at the moment, but drilling into the WUs, it isn't just my computer. As an example:

https://www.gpugrid.net/workunit.php?wuid=27653498

Now, the odd thing is that all the other computers showing a failed result seem to have an Intel processor, and going down the list there is a failure for each. But then one result was returned successfully, from a host with an AMD processor. I know the bulk of the processing is on the GPU, but there is some CPU processing, as I understand it. Perhaps there's something in the code that plays nasty with Intel's implementation of something? Thought I'd pass this observation along, it being beta and all...

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 94,951,920
Message 61093 - Posted: 26 Jan 2024 | 3:27:40 UTC - in response to Message 61092.

The one that succeeded was because it was on Linux, not because it was on AMD.



ServicEnginIC
Joined: 24 Sep 10
Posts: 567
Credit: 6,507,327,024
RAC: 24,907,911
Message 61094 - Posted: 26 Jan 2024 | 9:37:12 UTC - in response to Message 60787.

In this last batch of ATMbeta tasks, I've noticed an increase in tasks failing with "ValueError: Energy is NaN".
Currently, about a 1/3 ratio of failures.
Since January 22nd: 64 valid tasks, and 20 failed with that error on my Linux hosts.
14 of the failed tasks exceeded 1 hour of processing time before erroring, and 9 of those exceeded 3 hours.
Anyway, I still consider this a reasonable failure ratio for these well-rewarded beta tasks, and I keep my hope of contributing to science.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 468
Credit: 8,566,947,716
RAC: 12,845,964
Message 61095 - Posted: 26 Jan 2024 | 12:33:16 UTC - in response to Message 61094.

In this last batch of ATMbeta tasks, I've noticed an increase in tasks failing with "ValueError: Energy is NaN".
Currently, about a 1/3 ratio of failures.
Since January 22nd: 64 valid tasks, and 20 failed with that error on my Linux hosts.
14 of the failed tasks exceeded 1 hour of processing time before erroring, and 9 of those exceeded 3 hours.
Anyway, I still consider this a reasonable failure ratio for these well-rewarded beta tasks, and I keep my hope of contributing to science.


I noticed that too; about 1/3 was about right for me at one point, on my Intel-chip computer: Energy is NaN. I usually get about a 90% success rate.
Ignore the AMD-chip computer; that's another story.

But I seem to be ending on a high note, with 4 in a row finishing successfully so far. Let's hope it continues.


