Canceled SDOERR WU's

Message boards : News : Canceled SDOERR WU's

Author	Message
Stefan Project administrator Project developer Project tester Project scientist Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level Scientific publications	Message 30018 - Posted: 16 May 2013 \| 17:30:46 UTC Last modified: 16 May 2013 \| 20:49:46 UTC
	I apologize to all whose WU's were cancelled today in the process of stopping the old ones and re-sending the new fixed ones. At the beginning I didn't realize the effect and extent of the cancellation (since it was not intended to work this way), but I see now that many hours of computation and credits were lost due to this. It seems to have been a bug in a script that had not been seen until now, so we will refrain from using it until it has been fixed. I realize that the computation time is very important for everyone so we hope that such hiccups won't happen too often in the future. Many thanks for your patience and calculations!

STE\/E Send message Joined: 18 Sep 08 Posts: 368 Credit: 3,511,269,035 RAC: 53,109,994 Level Scientific publications	Message 30032 - Posted: 16 May 2013 \| 20:35:01 UTC
	Well at least now I know why a lot of my Wu's were getting the Boot from the Server ... o_0 ____________ STE\/E

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,159,713,370 RAC: 15,084,780 Level Scientific publications	Message 30034 - Posted: 16 May 2013 \| 22:47:38 UTC - in response to Message 30018.
	Okay, fine, learn from your mistakes, and let's go on!

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 30086 - Posted: 19 May 2013 \| 13:25:08 UTC - in response to Message 30034. Last modified: 19 May 2013 \| 13:28:58 UTC
	This morning I saw that a system had recovered from a problem and was sitting at the Windows log-on screen. When I logged on it became unresponsive, though after waiting about a minute (due to my reg settings) I got 2 app crashed error messages for SDOERR WU's:
	I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506_3 4467647 18 May 2013 \| 21:10:00 UTC 19 May 2013 \| 11:13:06 UTC Error while computing 18,441.84 18,362.63 --- Long runs (8-12 hours on fastest card) v6.18 (cuda42)
	I2HDQ_22R9-SDOERR_2HDQd-0-4-RND3283_1 4465782 18 May 2013 \| 17:35:46 UTC 19 May 2013 \| 11:36:31 UTC Aborted by user 33,710.79 33,611.30 --- Long runs (8-12 hours on fastest card) v6.18 (cuda42)
	These 2 WU's were around 50% and 79% complete (9h). I tried BOINC exits and log-ons, cold restarts, but always got the same app crashes. Even tried a clean driver install. 2 Climate models also failed (only 87h lost this time). When I suspended and enabled the GPUGrid WU's they showed a fixed progress (~50% and 79%), but the elapsed and remaining time kept ticking over.
	No heartbeat from core client for 30 sec - exiting
	This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
	Kernel not found
	Assertion failed: a, file swanlibnv2.cpp, line 59
	Whatever is wrong with these work units is bad. You would be doing the recipient of the re-sends a favor by server aborting them, even mid-run; it's better than a crash that kills the WU's and takes out other work!

Stoneageman Send message Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 152 Level Scientific publications	Message 30089 - Posted: 19 May 2013 \| 15:50:06 UTC - in response to Message 30086. Last modified: 19 May 2013 \| 16:04:49 UTC
	"You would be doing the recipient of the re-sends a favor by server aborting them, even mid-run; it's better than a crash that kills the WU's and takes out other work!"
	Uh Oh. That's me, for the first one. Let's see if Linux can crunch it, fingers crossed. After looking at past attempts, two were with Linux, so I don't hold out much hope.

Stefan Project administrator Project developer Project tester Project scientist Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level Scientific publications	Message 30091 - Posted: 19 May 2013 \| 16:53:30 UTC - in response to Message 30086. Last modified: 19 May 2013 \| 18:28:05 UTC
	Hm Skgiven, unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors. I am really sorry for the damage done to the other projects too; it must be very bad when so much work is lost. At this point, though, it doesn't seem that anyone else has encountered similar problems that would warrant cancelling the rest. I will definitely keep an eye on it, and if more people report the same problem I might cancel them.

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 30092 - Posted: 19 May 2013 \| 18:35:53 UTC - in response to Message 30091. Last modified: 19 May 2013 \| 18:37:26 UTC
	Stefan, I wasn't suggesting that you cancel the batch, just the individual WU's. It was just a grumpy suggestion. If you think these aren't WU-specific problems then there is no point cancelling them. I was concerned about the I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506 WU, as it has already produced 4 different errors on different systems (with varying operating systems, GPU types/generations and probably drivers). While 2 systems have a high error rate, two don't.
	Stoneageman, it seems you ended up with resends for both of my failures! http://www.gpugrid.net/workunit.php?wuid=4465782 I hope they fare better for you.

Stefan Project administrator Project developer Project tester Project scientist Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level Scientific publications	Message 30093 - Posted: 19 May 2013 \| 18:56:40 UTC - in response to Message 30092.
	Ah ok now I get it, sorry. I will keep the tab open and check if it fails again in which case it gets the boot. I don't mind much about canceling a single chain if it fails everywhere. Thanks for the heads up.

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 30099 - Posted: 20 May 2013 \| 0:31:43 UTC - in response to Message 30091.
	"unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors."
	For me the SDOERR WUs aren't running quite as well as the NATHAN WUs but are running MUCH better than the NOELIA WUs. I had virtually no problems with the NATHANs, 2 errors with the SDOERR (both recovered and completed). I'd count the problem NOELIAS but probably don't have enough fingers and toes...

Stefan Project administrator Project developer Project tester Project scientist Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level Scientific publications	Message 30102 - Posted: 20 May 2013 \| 8:37:57 UTC
	Completed and validated :) Yay! http://www.gpugrid.net/workunit.php?wuid=4467647

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 30108 - Posted: 20 May 2013 \| 11:58:29 UTC - in response to Message 30093.
	and the other one, http://www.gpugrid.net/workunit.php?wuid=4465782

John C MacAlister Send message Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 Level Scientific publications	Message 30198 - Posted: 22 May 2013 \| 13:40:08 UTC Last modified: 22 May 2013 \| 13:41:30 UTC
	Darn, these things run and run and run.
	I2HDQ_8R2-SDOERR_2HDQd-3-4-RND2343_0 using acemdlong version 618 (cuda42) in slot 5 has been running 17:46 with 4:06 to go and 63.8% complete.....
	GTX 650 Ti with AMD A10 5800K @ 3.8 GHz
	John

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 30201 - Posted: 22 May 2013 \| 14:26:51 UTC - in response to Message 30198.
	John,
	They take about 11 1/2 hours on my GTX 660s, so your results are in line with that.
	I2HDQ_28R1-SDOERR_2HDQd-2-4-RND3969_2 11:24:55 (11:21:05) 5/22/2013 1:05:26 AM 5/22/2013 1:14:31 AM 0.629C + 1NV (d1) 99.44 Reported: OK *
	I have not picked one up yet on my GTX 650 Ti (just started it up again), but it looks like they will squeak in under the deadline.

John C MacAlister Send message Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0 Level Scientific publications	Message 30205 - Posted: 22 May 2013 \| 14:58:26 UTC - in response to Message 30201.
	Thanks, Jim.

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 30207 - Posted: 22 May 2013 \| 15:14:01 UTC - in response to Message 30198. Last modified: 22 May 2013 \| 15:20:50 UTC
	"Darn, these things run and run and run. I2HDQ_8R2-SDOERR_2HDQd-3-4-RND2343_0 using acemdlong version 618 (cuda42) in slot 5 has been running 17:46 with 4:06 to go and 63.8% complete..... GTX 650 Ti with AMD A10 5800K @ 3.8 GHz"
	John, the SDOERR WUs are averaging a little under 16 hours on my OCed 650 Ti GPUs in Win7-64. (No failed WUs on the 3 OCed cards I might add.) A GTX 460/768 ran a bit over 20 hours (stock clocks).
	Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours.
	What % GPU usage are you getting?

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 30237 - Posted: 22 May 2013 \| 23:46:14 UTC - in response to Message 30207.
	"Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours."
	Finished in 17:33 on the stock MSI 650 Ti.
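For anyone wanting to sanity-check run times like the ones quoted above, here is a minimal sketch that extrapolates a total run time from the elapsed time and the percent complete. It assumes progress is linear in time, which may not hold for every WU, and it is not how the BOINC client computes its own "to go" figure. The function names (`parse_hhmm`, `estimate_total_hours`, `format_hhmm`) are made up for this illustration.

```python
def parse_hhmm(text: str) -> float:
    """Convert an 'H:MM' elapsed-time string into hours."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60.0


def estimate_total_hours(elapsed: str, fraction_done: float) -> float:
    """Linear extrapolation: total = elapsed / fraction complete.

    Assumption: the WU advances at a constant rate from start to finish.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return parse_hhmm(elapsed) / fraction_done


def format_hhmm(hours: float) -> str:
    """Format hours as 'H:MM', rounded to the nearest minute."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Stock 650 Ti: 73% after 12:49
print(format_hhmm(estimate_total_hours("12:49", 0.73)))   # 17:33
# GTX 650 Ti with A10 5800K: 63.8% after 17:46
print(format_hhmm(estimate_total_hours("17:46", 0.638)))  # 27:51
```

The first figure lines up with the 17:33 finish reported for the stock 650 Ti. The second is well above the 17:46 + 4:06 the client was showing, so treat a linear estimate and the client's remaining-time figure as two rough guesses rather than a prediction.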


