Message boards : News : New project in long queue
Hello all, | |
ID: 28895 | Rating: 0 | rate: / Reply Quote | |
I can't download any; I keep trying, but no long runs have come through in the last hour. | |
ID: 28896 | Rating: 0 | rate: / Reply Quote | |
These units appear to be very long; finishing time could be close to 20 hours on my computers. Assuming there are no errors!! | |
ID: 28899 | Rating: 0 | rate: / Reply Quote | |
I've had to abort 3 NOELIAs in the past 2 hours; GPU usage was at 100% and the memory controller was at 0%. I had to reboot the computer to get the GPUs working again. Windows popped up an error message complaining that "acemd.2865P.exe had to be terminated unexpectedly". As soon as I suspended the NOELIA work unit, the error message went away. | |
ID: 28911 | Rating: 0 | rate: / Reply Quote | |
So far I got one error after 14,000 secs :( and a second one which was successful after 15 hours (560 Ti 448-core edition, 157k credits). Now it's calculating a third one... let's see. | |
ID: 28930 | Rating: 0 | rate: / Reply Quote | |
The first one of these I received was: <core_client_version>7.0.52</core_client_version> <![CDATA[ <message> The system cannot find the path specified. (0x3) - exit code 3 (0x3) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. Assertion failed: a, file swanlibnv2.cpp, line 59 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. </stderr_txt> ]]> I have another one that's at 62.5% after 14 hours. Looking at some of the NOELIA WUs, they seem to be failing all over the place, some of them repeatedly. They're also too long for my machines to process and return in 24 hours. After the one that's running either errors out or completes, I will be aborting the NOELIA WUs. Losing 24+ hours of GPU time per failure is not my favorite way to waste electricity. Sorry. BTW, the TONI WUs run fine. | |
ID: 28933 | Rating: 0 | rate: / Reply Quote | |
I've found NOELIA WUs to be highly unreliable, even on the short queue. I don't like getting one as I've no idea if it'll complete without errors. I had to abort a short NOELIA one yesterday as it kept crunching in circles, meaning it crunched for some minutes and then went back to the beginning to start all over again. | |
ID: 28934 | Rating: 0 | rate: / Reply Quote | |
These new NOELIA tasks don't use a full CPU thread (core, if you like) to feed a Kepler-type GPU, the way other workunits (such as TONI_AGG) used to. Is this behavior intentional or not? Maybe that's why it takes so long to process them. It takes 40,400 secs for my overclocked GTX 580 to finish these tasks, while it takes 38,800 secs for a (slightly overclocked) GTX 670, so there is a demonstrable loss (~5%) in the GTX 670's performance. | |
ID: 28935 | Rating: 0 | rate: / Reply Quote | |
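For anyone who wants to test whether reserving a full CPU core per GPU task changes these run times, BOINC's app_config.xml mechanism is one way to try it. Below is a minimal sketch that writes such a file; the app name ("acemdlong"), the project folder name, and the Windows data path are assumptions, so check your own client_state.xml before using it. Note this only changes what the client budgets for the task, not what the science app actually does with that core.

```python
# Minimal sketch: write an app_config.xml that budgets a full CPU core per
# GPU task. The app name "acemdlong" and the paths below are assumptions --
# verify them against client_state.xml on your own host.
import os

APP_CONFIG = """<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

# Assumed default BOINC data directory on Windows; adjust for Linux or a
# non-standard install.
project_dir = r"C:\ProgramData\BOINC\projects\www.gpugrid.net"

with open(os.path.join(project_dir, "app_config.xml"), "w") as f:
    f.write(APP_CONFIG)

print("app_config.xml written; re-read config files in BOINC Manager or restart the client.")
```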
Some of the new NOELIA units are bugged somehow, I think. Some run fine, some of them not. | |
ID: 28936 | Rating: 0 | rate: / Reply Quote | |
I posted this in the "long application updated to the latest version" thread, but Firehawk implied these issues should be reported in this thread. I don't know if this is a 6.18 issue or a NOELIA WU issue, but I guess time will tell. So I apologize in advance for the double posting, if that is a bigger faux pas than not knowing which thread is the appropriate one to post to. ;-) | |
ID: 28937 | Rating: 0 | rate: / Reply Quote | |
You're getting us hawks mixed up; I've been using this name since '95 and I think that's the first time that's happened. | |
ID: 28938 | Rating: 0 | rate: / Reply Quote | |
Mea Culpa. | |
ID: 28940 | Rating: 0 | rate: / Reply Quote | |
I'm also getting several Noelia tasks making very slow progress, the same problem as was flagged in the beta test. The size of the upload is causing issues for me as well. | |
ID: 28941 | Rating: 0 | rate: / Reply Quote | |
The Noelia workunits refuse to run on my 660 Ti Linux system. They lock up or make no progress. I have finished one on two different Linux systems with 670s without problems. | |
ID: 28942 | Rating: 0 | rate: / Reply Quote | |
The Noelia longs either fail in the first 3 to 4 minutes on my GTX 560 and GTX 650 Ti (the only four failures I have had on GPUGRID), or else they complete successfully. | |
ID: 28943 | Rating: 0 | rate: / Reply Quote | |
Well Jim, now you can understand how most of us got 90% of our errors. If you had looked closer you would have noticed that almost all of them came from a first run of NOELIAs in early February. Instead, you thought you would display your distributed computing prowess and give us your expert advice, and proceeded to tell us about our substandard components, or our inability to overclock correctly, and the overheating issues we must be having. | |
ID: 28944 | Rating: 0 | rate: / Reply Quote | |
flashawk, | |
ID: 28945 | Rating: 0 | rate: / Reply Quote | |
I guess it's understandable; the best advice I could ever give in my 51 years is "wait and see". I don't walk into another's clubhouse and start rearranging the furniture. There's been many a time when I've jumped to quick conclusions in my own mind, only to find out later that I was wrong. | |
ID: 28946 | Rating: 0 | rate: / Reply Quote | |
I managed to get http://www.gpugrid.net/result.php?resultid=6567675 to run to completion by making sure it wasn't interrupted during computation. But 12 hours on a GTX 670 is a long time to run without task switching when you're trying to support more than one BOINC project. SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. The TONI task that followed, again on the same card, seems to have started and to be running normally. | |
ID: 28947 | Rating: 0 | rate: / Reply Quote | |
Richard Haselgrove wrote: I just had the same thing happen to me, Richard; right after the computation error, a TONI WU started on the same GPU card and it sat idle with 0% GPU load and 0% memory controller usage. I had to suspend BOINC and reboot to get the GPU crunching again. As far as times go on my GTX 670s, the NOELIA WUs have ranged from 112MB to 172MB so far, and the smaller one took 7.5 hours while the larger one took 11.75 hours. So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide. Edit: Check out this one, I just downloaded it a couple of minutes ago. I noticed it ended in a 6, which means I'm the 7th person to get it. This is off the hook - man! http://www.gpugrid.net/workunit.php?wuid=4210634 | |
ID: 28948 | Rating: 0 | rate: / Reply Quote | |
So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide. Far more likely that the tasks which run, by design, for a long time generate a large output file. After the last NOELIA failure (which triggered a driver restart), I ran a couple of small BOINC tasks from another project. The first one errored, the second ran correctly. After that, I ran a long TONI: successful completion, no computer restart needed. I'm running the 314.07 driver. | |
ID: 28950 | Rating: 0 | rate: / Reply Quote | |
My systems haven't been changed since the application upgrade. | |
ID: 28951 | Rating: 0 | rate: / Reply Quote | |
I'm having some new, weird issue, but only on my AMD 3x690 rig. Three times now: BSODs and system restarts. It only goes away if all the workunits (and the cache!!!) are aborted. I don't have a clue why this happens, but this AMD rig is rock solid in normal crunching and it's doing more than 2M per day on its own. | |
ID: 28952 | Rating: 0 | rate: / Reply Quote | |
My systems haven't been changed since the application upgrade. I have had the same issues, and on top of that I got an error message saying that acemd.2865.exe has crashed, and the video card ends up running at a slower speed. I have had more errors with this application than the last time I did beta testing. | |
ID: 28953 | Rating: 0 | rate: / Reply Quote | |
Hello! | |
ID: 28954 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread. | |
ID: 28955 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence? Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application. | |
ID: 28957 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18 | |
ID: 28958 | Rating: 0 | rate: / Reply Quote | |
041px48x2-NOELIA_041p-1-2-RND9263: after 15 hours, when the work on this task ended, the NVIDIA driver crashed and the work was marked as faulty. Another one was marked as valid (nn016_r2-TONI_AGGd8-38-100-RND3157_0), but these problems have been going on for more than a week already; it's insane. The NVIDIA driver even crashes on a proper shutdown of BOINC Manager, for example. | |
ID: 28959 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. Richard, this is my experience exactly. All TONIs run fine and 50% of NOELIAs crash. TONI should maybe give a clinic to the others. I don't think it has much to do with 6.18 either; it's just that the new NOELIAs were released at the same time as 6.18. | |
ID: 28961 | Rating: 0 | rate: / Reply Quote | |
aborted a NOELIA one after it began crunching in circles... | |
ID: 28962 | Rating: 0 | rate: / Reply Quote | |
The first Noelia failed with: SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841. That seems to be an out-of-GPU-memory error. So maybe someone should set stricter minimum memory limits on these Noelia tasks? Edit: Technically, that wasn't my first Noelia, just the first one of this batch. I got at least one, probably more, in February, and they took 25 hours but were otherwise fine. | |
ID: 28969 | Rating: 0 | rate: / Reply Quote | |
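If the 702 error really is GPU memory pressure, as speculated above (driver error codes can also point at other failures), one quick check is to watch free GPU memory while a NOELIA task runs and see whether it drops toward zero before the crash. A minimal sketch, assuming nvidia-smi is on the PATH and that a 5-second poll is frequent enough:

```python
# Poll GPU memory with nvidia-smi while a task runs. Assumes nvidia-smi is on
# the PATH; the interval and output format are arbitrary choices.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.check_output(QUERY, text=True)
    for line in out.strip().splitlines():
        idx, total, used, free = (s.strip() for s in line.split(","))
        print(f"GPU {idx}: {used}/{total} MiB used, {free} MiB free")
    time.sleep(5)
```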
The first Noelia I see that both WUs are marked as errors ("WU cancelled"). Something may be happening behind the scenes. | |
ID: 28970 | Rating: 0 | rate: / Reply Quote | |
These NOELIA WUs have been cancelled. Their successors will have a slightly different configuration that will hopefully be more stable. | |
ID: 28971 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed. | |
ID: 28972 | Rating: 0 | rate: / Reply Quote | |
Be aware also that these and subsequent WUs will fail if you have overridden the application version and are not running the latest. | |
ID: 28974 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. Were the TONI WUs cancelled too? They ran fine.. | |
ID: 28978 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. And the two I have in progress are still fine, and shown as viable on the website. | |
ID: 28980 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. Just got a couple new ones. Seems the queue coincidentally ran dry for a while: GPUGRID 03-04-13 13:45 Requesting new tasks for NVIDIA GPUGRID 03-04-13 13:45 Scheduler request completed: got 0 new tasks GPUGRID 03-04-13 13:45 No tasks sent GPUGRID 03-04-13 13:45 No tasks are available for Long runs (8-12 hours on fastest card) | |
ID: 28981 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed. Thank you guys. Another thing that I really appreciate about this project is your awesome and fast support, which didn't happen on the project I ran for the past 13 years... sadly. | |
ID: 28984 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. Were the issues related to the new application, the WUs, or both? | |
ID: 28989 | Rating: 0 | rate: / Reply Quote | |
How big are the uploads for these reworked NOELIAs supposed to be? The 3 I've finished were barely over 4MB after 11.5 hours of "crunching". Is this about right? | |
ID: 28998 | Rating: 0 | rate: / Reply Quote | |
... I got messages like "abort by user" - but I didn't abort any ... | |
ID: 29000 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. | |
ID: 29001 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. It's not possible to block specific tasks. At least that's what I learned from my own thread. http://www.gpugrid.net/forum_thread.php?id=3315 ____________ Team Belgium | |
ID: 29002 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. It happened to me too. I had 2 Noelia units that were aborted by user, which I didn't abort. They were otherwise running fine. So, what is happening? | |
ID: 29003 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. http://www.gpugrid.net/forum_thread.php?id=3311&nowrap=true#28972 | |
ID: 29004 | Rating: 0 | rate: / Reply Quote | |
Were the issues related to the new application, the WUs, or both? The WUs were not set to upload the smaller file size format we are now trying to move to. They were set to use the old format, which could result in very large file upload sizes, as some people complained about. The problem with the application was an obscure one. It wasn't an issue with the application per se, but rather with how the application interacts with BOINC and this specific type of configuration file for the simulations. In short, at the start of every WU the application was performing a function that it was only supposed to perform for the first WU in a chain. This caused all but the first WU in a chain to fail. This isn't a problem locally for us, but with how BOINC handles the files, it became a problem. We are working on a long-term fix, but we have simply found a way around it for now. ... I got messages like "abort by user" - but I didn't abort any ... I am not sure what is happening. Even if we cancel a group of WUs, they should complete on your computer (if they are good simulations). The "Abort by user" can only come from the user/client, typically when you deliberately cancel a WU with the "Abort" button. Hopefully there is nothing else going on... | |
ID: 29006 | Rating: 0 | rate: / Reply Quote | |
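To illustrate the kind of first-WU-versus-continuation logic being described (a sketch only, not the actual ACEMD code; the helper functions are hypothetical stubs): a continuation WU should resume from a restart file, while only the first WU in a chain should initialise from the packaged inputs. The bug described above amounts to running the first-WU step on every WU, which is consistent with continuations in a chain failing.

```python
# Illustrative sketch only (not the actual ACEMD code). The helper functions
# are hypothetical stubs standing in for the real restart/initialisation steps.
import os

def load_restart(path):
    # Stub: the real app would read saved coordinates/velocities here.
    return {"resumed_from": path}

def initialise_from_inputs(workdir):
    # Stub: the real app would build the starting state from the input files.
    return {"initialised_in": workdir}

def start_workunit(workdir):
    restart_file = os.path.join(workdir, "restart.coor")
    if os.path.exists(restart_file):
        # Continuation WU: resume where the previous WU in the chain stopped.
        return load_restart(restart_file)
    # First WU in the chain: start from the packaged inputs.
    # The bug described above behaves as if this branch ran for every WU.
    return initialise_from_inputs(workdir)

print(start_workunit("."))
```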
... I got messages like "abort by user" - but I didn't abort any ... Ah. That's one I can help you with. I got an 'aborted by user', too - task 6581613. If you look at the task details, it has "exit status 202". At some stage in the development of recent BOINC clients, David updated and expanded the range of error and exit status codes returned by the client. Unfortunately, he didn't - at first, and until prodded - update the decode tables used on project web sites. You need to update html/inc/result.inc on your web server to something later than http://boinc.berkeley.edu/trac/changeset/1f7ddbfe3a27498e7fd2b4f50f3bf9269b7dae25/boinc/html/inc/result.inc to get a proper website display using case 202: return "EXIT_ABORTED_BY_PROJECT"; Full story in http://boinc.berkeley.edu/dev/forum_thread.php?id=7704 | |
ID: 29008 | Rating: 0 | rate: / Reply Quote | |
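For anyone decoding these numbers by hand until their project site catches up, the fix Richard points to boils down to a lookup from exit status to symbolic name. A minimal sketch in Python rather than the project's PHP (html/inc/result.inc); only the 202 mapping is taken from the post above, so extend the table from BOINC's own source for other codes:

```python
# Map BOINC exit status codes to symbolic names. Only 202 is taken from the
# post above; add further entries from BOINC's source if you need them.
EXIT_STATUS_NAMES = {
    202: "EXIT_ABORTED_BY_PROJECT",  # server-side cancellation, shown as "Aborted by user" by old decode tables
}

def exit_status_name(code: int) -> str:
    return EXIT_STATUS_NAMES.get(code, f"unknown exit status {code}")

print(exit_status_name(202))  # EXIT_ABORTED_BY_PROJECT
```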
I was surprised to see some "Aborted By User" tasks this morning, especially since they happened while I was sleeping! | |
ID: 29009 | Rating: 0 | rate: / Reply Quote | |
Out of all 4 of my machines, I had 7 "Aborted by user" errors last night. My computers will be on probation by tomorrow and I won’t be able to download work units. | |
ID: 29013 | Rating: 0 | rate: / Reply Quote | |
I haven't had the server abort any Noelias lately. I've just had them all segfault within an hour or two. :( | |
ID: 29014 | Rating: 0 | rate: / Reply Quote | |
Still having the same problems with NOELIAs. That will put my biggest machine down, because this one is BSODing and ruining the whole cache, which is very hard to build at the moment. | |
ID: 29028 | Rating: 0 | rate: / Reply Quote | |
It seems to me that these NOELIAs are suffering from memory leaks: when my card finishes one and starts the next, the GPU pegs at 99-100% and the memory controller stays at 0%. If I reboot, all is well and it works fine. The previous WU won't release the memory on the GPU, thus the reboot. This is Windows XP Pro 64-bit; different operating systems seem to be dealing with it differently. Windows 7 and 8 are getting BSODs or driver crashes, and I also get the "acemd.2865P.exe had to be terminated unexpectedly" error. Oh well, I don't even know if this stuff we post helps or gets read. | |
ID: 29029 | Rating: 0 | rate: / Reply Quote | |
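One way to catch the "GPU pegged, memory controller idle" state described above without staring at a monitoring tool is to poll both utilization figures and flag the combination when it persists. A minimal sketch, assuming nvidia-smi is on the PATH; the thresholds and timings are arbitrary:

```python
# Flag GPUs that sit at high load with an idle memory controller for more
# than a minute. Assumes nvidia-smi is on the PATH; thresholds are arbitrary.
import subprocess
import time

def sample():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"],
        text=True)
    return [tuple(int(v) for v in line.split(","))
            for line in out.strip().splitlines()]

stuck_since = {}
while True:
    for idx, gpu_util, mem_util in sample():
        if gpu_util >= 95 and mem_util == 0:
            stuck_since.setdefault(idx, time.time())
            if time.time() - stuck_since[idx] > 60:
                print(f"GPU {idx} looks stuck: load {gpu_util}%, memory controller {mem_util}%")
        else:
            stuck_since.pop(idx, None)
    time.sleep(10)
```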
Same here... Seems that the new Noelia WU doesn't work well... it freezes the computer. I had to reset the project. | |
ID: 29030 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. | |
ID: 29032 | Rating: 0 | rate: / Reply Quote | |
Just NOELIAs incoming. Impossible to run the project at the moment. Too bad; it's a big farm. | |
ID: 29033 | Rating: 0 | rate: / Reply Quote | |
Just NOELIAs incoming. Impossible to run the project at the moment. Too bad; it's a big farm. I just moved to a different project too. Too bad, I liked helping out here, but they don't seem to test anything before release. | |
ID: 29036 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. Thanks, we're looking at it. Obviously this is pretty serious. I will submit some additional WUs to the long queue that I know for sure are good simulations, so that we can get a handle on this. Edit: I have now submitted to the long queue some simulations we know are good. If it is an issue with the app, we will find out. They are named NATHAN_dhfr36_3. | |
ID: 29038 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. I've just reported (in Number Crunching) a failure with a long queue task under Windows 7/64, which didn't freeze the computer or poison the GPU, while short queue tasks under XP/32 are (mostly) running. | |
ID: 29039 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. I noticed the NATHAN units; they are running really well. All machines are back... will report results ASAP :D | |
ID: 29040 | Rating: 0 | rate: / Reply Quote | |
A NATHAN has started running OK here too, even with no reboot after the NOELIA failure (technique as described in NC). | |
ID: 29041 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. In my case, both the 32-bit Windows XP and the 64-bit Windows 7 machines are having errors this morning. The units I downloaded yesterday seem to be okay. Though I did get a crash on my Windows 7 computer, on a unit that was running fine, when I did a reboot; another unit running on the other card didn't crash. The settings (GPU speed, memory and fan) on the video card on which the unit crashed were reset. I had to do another reboot, with the units suspended, to get the video card settings right. I also noticed that on the Windows 7 machine the units take 18-plus hours to finish, while on the Windows XP machine they take about 13 hours. This difference seems excessive. | |
ID: 29042 | Rating: 0 | rate: / Reply Quote | |
It's not limited to XP64; my XP32 got the error in acemd.2865P.exe as well. | |
ID: 29043 | Rating: 0 | rate: / Reply Quote | |
5-6 WUs failed for me this morning. I'm on driver 313.26. As far as I can tell these WUs have also failed for everyone else they were distributed to. I'm seeing these errors in dmesg (I thought my card might be failing, but I think there is a problem with the new WUs). | |
ID: 29044 | Rating: 0 | rate: / Reply Quote | |
New news here: http://www.gpugrid.net/forum_thread.php?id=3318 | |
ID: 29047 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. After my post (above), my 32-bit hosts had some failures and stuck workunits, so their previous, relatively successful behavior may just have been chance. | |
ID: 29060 | Rating: 0 | rate: / Reply Quote | |