Message boards : News : Canceled SDOERR WU's

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30018 - Posted: 16 May 2013 | 17:30:46 UTC
Last modified: 16 May 2013 | 20:49:46 UTC

I apologize to everyone whose WU's were cancelled today while we were stopping the old ones and re-sending the new, fixed ones.
At first I didn't realize the effect and extent of the cancellation (it was not intended to work this way), but I see now that many hours of computation and credits were lost because of it.
It seems to have been a bug in a script that had not surfaced until now, so we will refrain from using the script until it has been fixed.
I realize that the computation time is very important for everyone so we hope that such hiccups won't happen too often in the future.

Many thanks for your patience and calculations!

STE\/E
Joined: 18 Sep 08
Posts: 368
Credit: 3,511,269,035
RAC: 53,109,994
Message 30032 - Posted: 16 May 2013 | 20:35:01 UTC

Well at least now I know why a lot of my Wu's were getting the Boot from the Server ... o_0
____________
STE\/E

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,159,713,370
RAC: 15,084,780
Message 30034 - Posted: 16 May 2013 | 22:47:38 UTC - in response to Message 30018.

Okay, fine, learn from your mistakes, and let's go on!

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 30086 - Posted: 19 May 2013 | 13:25:08 UTC - in response to Message 30034.
Last modified: 19 May 2013 | 13:28:58 UTC

This morning I saw that a system had recovered from a problem and was sitting at the Windows log-on screen. When I logged on it became unresponsive, though after waiting about a minute (due to my reg settings) I got 2 app crashed error messages for SDOERR WU's:

Task: I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506_3
Work unit: 4467647
Sent: 18 May 2013 | 21:10:00 UTC
Reported: 19 May 2013 | 11:13:06 UTC
Status: Error while computing
Run time (sec): 18,441.84
CPU time (sec): 18,362.63
Credit: ---
Application: Long runs (8-12 hours on fastest card) v6.18 (cuda42)

Task: I2HDQ_22R9-SDOERR_2HDQd-0-4-RND3283_1
Work unit: 4465782
Sent: 18 May 2013 | 17:35:46 UTC
Reported: 19 May 2013 | 11:36:31 UTC
Status: Aborted by user
Run time (sec): 33,710.79
CPU time (sec): 33,611.30
Credit: ---
Application: Long runs (8-12 hours on fastest card) v6.18 (cuda42)

These 2 WU's were around 50% and 79% complete (9h). I tried Boinc exits and log -ns, cold restarts, but always got the same app crashes. Even tried a clean driver install.

2 Climate models also failed (only 87h lost this time).

When I suspended and enabled the GPUGrid WU's they showed a fixed progress (~50% and 79%), but the elapsed and remaining time kept ticking over.

    No heartbeat from core client for 30 sec - exiting

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59



Whatever is wrong with these work units is bad. You would be doing the recipient of the re-sends a favor by server aborting them, even mid-run; it's better than a crash that kills the WU's and takes out other work!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 152
Message 30089 - Posted: 19 May 2013 | 15:50:06 UTC - in response to Message 30086.
Last modified: 19 May 2013 | 16:04:49 UTC

    You would be doing the recipient of the re-sends a favor by server aborting them, even mid-run; it's better than a crash that kills the WU's and takes out other work!


Uh Oh. That's me, for the first one. Let's see if Linux can crunch it, fingers crossed.
After looking at past attempts, two were with Linux, so I don't hold out much hope.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30091 - Posted: 19 May 2013 | 16:53:30 UTC - in response to Message 30086.
Last modified: 19 May 2013 | 18:28:05 UTC

Hm Skgiven, unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors.
I am really sorry for the damage done to the other projects too, it must be very bad when so much work is lost, but at this point it doesn't seem that anyone has encountered similar problems to warrant cancelling the rest. I will definitely keep an eye on it though and if more people report the same problem I might cancel them.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 30092 - Posted: 19 May 2013 | 18:35:53 UTC - in response to Message 30091.
Last modified: 19 May 2013 | 18:37:26 UTC

Stefan, I wasn't suggesting that you cancel the batch, just the individual WU's. It was just a grumpy suggestion. If you think these aren't WU specific problems then there is no point cancelling them.

I was concerned about the I2HDQ_21R8-SDOERR_2HDQd-1-4-RND9506 WU, as it has already produced 4 different errors on different systems (with varying operating systems, GPU types/generations and probably drivers). While 2 systems have a high error rate, two don't.


Stoneageman, it seems you ended up with resends for both of my failures!
http://www.gpugrid.net/workunit.php?wuid=4465782

I hope they fare better for you.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30093 - Posted: 19 May 2013 | 18:56:40 UTC - in response to Message 30092.

Ah ok now I get it, sorry. I will keep the tab open and check if it fails again in which case it gets the boot. I don't mind much about canceling a single chain if it fails everywhere. Thanks for the heads up.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 30099 - Posted: 20 May 2013 | 0:31:43 UTC - in response to Message 30091.

    unfortunately it is probably not related to the specific jobs. I cannot guarantee it obviously, but from the statistics I see that the jobs have a higher success rate than most others right now on GPUgrid. Also my modifications to Paola's jobs (which these jobs used to be) are essentially none except letting them run longer, meaning they have run before on GPUgrid without too grave errors.

For me the SDOERR WUs aren't running quite as well as the NATHAN WUs but are running MUCH better than the NOELIA WUs. I had virtually no problems with the NATHANs and 2 errors with the SDOERRs (both recovered and completed). I'd count the problem NOELIAs but probably don't have enough fingers and toes...

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 30102 - Posted: 20 May 2013 | 8:37:57 UTC

Completed and validated :) Yay!
http://www.gpugrid.net/workunit.php?wuid=4467647

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 30108 - Posted: 20 May 2013 | 11:58:29 UTC - in response to Message 30093.

and the other one,

http://www.gpugrid.net/workunit.php?wuid=4465782



John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 30198 - Posted: 22 May 2013 | 13:40:08 UTC
Last modified: 22 May 2013 | 13:41:30 UTC

Darn, these things run and run and run.

I2HDQ_8R2-SDOERR_2HDQd-3-4-RND2343_0 using acemdlong version 618 (cuda42) in slot 5 has been running 17:46 with 4:06 to go and 63.8% complete.....

GTX 650 Ti with AMD A10 5800K @ 3.8 GHz

John

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 30201 - Posted: 22 May 2013 | 14:26:51 UTC - in response to Message 30198.

John,

They take about 11 1/2 hours on my GTX 660s, so your results are in line with that.

I2HDQ_28R1-SDOERR_2HDQd-2-4-RND3969_2
11:24:55 (11:21:05) 5/22/2013 1:05:26 AM 5/22/2013 1:14:31 AM 0.629C + 1NV (d1) 99.44 Reported: OK *

I have not picked one up yet on my GTX 650 Ti (just started it up again), but it looks like they will squeak in under the deadline.

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 30205 - Posted: 22 May 2013 | 14:58:26 UTC - in response to Message 30201.

Thanks, Jim.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 30207 - Posted: 22 May 2013 | 15:14:01 UTC - in response to Message 30198.
Last modified: 22 May 2013 | 15:20:50 UTC

    Darn, these things run and run and run.

    I2HDQ_8R2-SDOERR_2HDQd-3-4-RND2343_0 using acemdlong version 618 (cuda42) in slot 5 has been running 17:46 with 4:06 to go and 63.8% complete.....

    GTX 650 Ti with AMD A10 5800K @ 3.8 GHz

John, the SDOERR WUs are averaging a little under 16 hours on my OCed 650 Ti GPUs in Win7-64.
(No failed WUs on the 3 OCed cards I might add.)
A GTX 460/768 ran a bit over 20 hours (stock clocks).

Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours.

What % GPU usage are you getting?

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 30237 - Posted: 22 May 2013 | 23:46:14 UTC - in response to Message 30207.

    Edit: A new 650 Ti GPU I installed yesterday (non-OCed version) is running its first SDOERR WU now and is at 73% after 12:49 hours.

Finished in 17:33 on the stock MSI 650 Ti.
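
For anyone wanting to sanity-check the progress figures quoted in this thread, here is a minimal sketch in Python. It assumes progress grows linearly with elapsed time, which is only an approximation and is not necessarily how the BOINC client computes its own remaining-time estimate. The helper names (`parse_hhmm`, `format_hhmm`, `estimate_total_minutes`) are made up for this illustration.

```python
def parse_hhmm(text):
    """Convert an 'H:MM' elapsed-time string into whole minutes."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def format_hhmm(total_minutes):
    """Format a number of minutes as 'H:MM', rounded to the nearest minute."""
    total = round(total_minutes)
    return f"{total // 60}:{total % 60:02d}"


def estimate_total_minutes(elapsed_minutes, fraction_done):
    """Linear extrapolation (an assumption): total = elapsed / fraction done."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done


if __name__ == "__main__":
    # Figures taken from the posts above: (elapsed time, fraction complete).
    for elapsed, fraction in [("12:49", 0.73), ("17:46", 0.638)]:
        elapsed_min = parse_hhmm(elapsed)
        total_min = estimate_total_minutes(elapsed_min, fraction)
        print(elapsed, fraction,
              "total:", format_hhmm(total_min),
              "remaining:", format_hhmm(total_min - elapsed_min))
```

Under that assumption, 73% after 12:49 extrapolates to a total of about 17:33, which matches the finish time reported above for the stock 650 Ti. The same arithmetic applied to 63.8% after 17:46 gives roughly 27:51 in total, i.e. about 10 hours remaining rather than the 4:06 the client displayed.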
