
Message boards : Graphics cards (GPUs) : Work unit failure rate

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13290 - Posted: 27 Oct 2009 | 17:49:29 UTC

I just looked into this on the seven systems I have actively running GPUGrid at the moment.

Of the past 70 workunits completed (10 per system), I had a total of 8 failures -- most of these quite close to completion. The systems involved have no overclocked GPUs. The GPUs range from one 9600GT and one GTS 250 to the rest being 9800GTs. The OS is either Windows XP or Windows 7. The driver version is either 190.38 or 190.62.

All workstations had one failure (one had two).

Seems to be a pretty high failure rate to cope with.
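[Editor's aside: with only 70 results, the plausible range around an observed ~11% failure rate is quite wide. A minimal Python sketch using a Wilson score interval puts numbers on that; the 8-of-70 figures come from the post above, everything else is illustrative.]

```python
import math

def wilson_interval(failures, total, z=1.96):
    """95% Wilson score interval for an observed failure proportion."""
    p = failures / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

# The sample reported above: 8 failures out of 70 workunits.
low, high = wilson_interval(8, 70)
print(f"observed {8/70:.1%}, 95% CI {low:.1%} .. {high:.1%}")
```

So the "true" failure rate behind this sample could plausibly be anywhere from roughly 6% to 21% -- small samples from a handful of hosts swing a lot.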

philip
Joined: 29 Jan 09
Posts: 1
Credit: 562,650
RAC: 0
Message 13292 - Posted: 27 Oct 2009 | 19:24:49 UTC - in response to Message 13290.

Since I upgraded to Windows 7 I have had nothing but failures. Running 191.07 with 6.10.16 on a Quad X9770 with a 9800GX2.

Bailing out. Just not worth it.

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13300 - Posted: 28 Oct 2009 | 15:46:42 UTC - in response to Message 13292.

Interesting -- I've not seen things as being worse with Win7 versus XP, but I think a 10% failure rate is 'suboptimal'.

> Since I upgraded to Windows 7 I have had nothing but failures. Running 191.07 with 6.10.16 on a Quad X9770 with a 9800GX2.
>
> Bailing out. Just not worth it.


Dennis-TW
Joined: 15 Jan 09
Posts: 6
Credit: 113,514,591
RAC: 0
Message 13304 - Posted: 29 Oct 2009 | 7:49:51 UTC

Recently I have had the same kind of failure rate... 2 of the last 15 workunits failed, both also quite close to completion, with a total loss of about 34 hours of GPU time.

Also on a 9800 GT; however, I didn't change anything on the software side recently. I started a thread of my own to compare some details, since the workunits don't seem to be total write-offs -- they were perfectly crunched by a GTX 260.

Dennis-TW
Joined: 15 Jan 09
Posts: 6
Credit: 113,514,591
RAC: 0
Message 13358 - Posted: 2 Nov 2009 | 16:20:37 UTC

And just another one bit the dust......that makes it 3 fails out of the last 13.

Any comment on this issue or should I look for another CUDA project??

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13361 - Posted: 3 Nov 2009 | 2:05:03 UTC - in response to Message 13358.

The thing is, at the moment, GPU project options are rather limited.

That 9800GT is supported on SETI (when they have work) and on Collatz (when it is running). Collatz is a very low-resource project, and it is also the only one supporting lower-powered ATI GPUs. When the other GPU project (MW -- which supports only double-precision GPUs, not cards like that 9800GT) is in trouble, as it is at the moment, the load on Collatz seems to be simply too much. At this moment both Collatz and MW are offline.

I use Collatz as my primary GPU project with GPUGrid these days as my backup GPU project.

Add to that the current work unit famine here (which probably is only short term), and for GPU BOINC folks, life can get rather tedious.

One thing that I've seen here is that when we report problems with workunits (such as in this thread), or note that workunits are running longer for the same credit payout, the response (if there is a response) tends to be either that we are not seeing what we are seeing, or 'what have you changed?' when we haven't changed anything. That tends to be a tad frustrating for me.


> And just another one bit the dust... that makes it 3 fails out of the last 13.
>
> Any comment on this issue or should I look for another CUDA project??

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 13365 - Posted: 3 Nov 2009 | 15:59:28 UTC - in response to Message 13361.

> The thing is, at the moment, GPU project options are rather limited.

Still sad but true ... the only good news is that Einstein is working on a GPU version ... though it is taking a lot longer than I have been expecting... especially in that EaH has been one of the most reliable projects to have work up and available and to handle outages with grace ... and volume too ...

> One thing that I've seen here is that when we report problems with workunits, or note that workunits are running longer for the same credit payout, the response (if there is a response) tends to be either that we are not seeing what we are seeing, or 'what have you changed?' when we haven't changed anything. That tends to be a tad frustrating for me.

I have been wondering if this is just a symptom of the "admin fatigue" that we have been discussing elsewhere ... it is about that time here too ...

As an answer to the situation, and as a reply to the admin's suggestion that problems with Resource Share are not a BOINC problem of interest to projects, I put GPU Grid in rotation with MW and Collatz. I have noted that not only have my earnings here dropped through the floor, even the amount of time spent does not seem to be properly balanced ... of course I have seen this all for a long time but have not been able to get UCB's attention ... and am not likely to get GPU Grid's either ... then again ... a pox on both their houses ...

I guess that means I am frustrated too ... :)

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 13398 - Posted: 7 Nov 2009 | 10:12:09 UTC - in response to Message 13365.

The very large majority of errors are produced by overclocking and/or poor cooling of the GPUs.
You might think that your GPU is not overclocked, but the manufacturer may have done it for you. Look up Nvidia's suggested clocks for your cards here:
http://en.wikipedia.org/wiki/GeForce_200_Series

and compare them with your card's. Reducing the clocks to the lower, recommended values might well fix all your problems.
With a proper installation you should be able to get a nearly 100% success rate, as several users do.

GDF

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 13399 - Posted: 7 Nov 2009 | 12:12:42 UTC

OC, heat, an underpowered PSU, drivers.
Lately I have been taking BOINC offline, making a copy of my BOINC data folder, and then using the copy to test out my new OC settings. This way, if I crash I can keep making copies without hammering the GPUGrid servers, and I also don't run into any "allowed WUs per day" limits. I have found that this is about the best approach for me, as there really don't seem to be any good testing tools for GPUs ... relying on visual inspection for artifacts is not a particularly rigorous process, and I wonder if those tools stress the GPU the same way that GPUGrid does. Aren't we more concerned with shaders first, memory second, while the core really doesn't matter?
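[Editor's aside: the copy-and-test loop described above can be scripted. A minimal Python sketch follows; the folder locations are assumptions and must be adjusted to your own BOINC installation.]

```python
import shutil
from pathlib import Path

# Assumed locations -- adjust for your own installation.
LIVE = Path(r"C:\ProgramData\BOINC")
SANDBOX = Path(r"C:\BOINC-oc-test")

def make_test_copy(live: Path = LIVE, sandbox: Path = SANDBOX) -> Path:
    """Clone the BOINC data folder so each OC experiment re-crunches the
    same cached work units instead of pulling fresh ones from the server
    (which avoids the 'allowed WUs per day' limit mentioned above)."""
    if sandbox.exists():
        shutil.rmtree(sandbox)      # discard the previous test round
    shutil.copytree(live, sandbox)  # fresh copy of the live data folder
    return sandbox
```

Take BOINC offline before copying, then point the test client at the sandbox folder for each OC round.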

@GDF - Could you please tell me which type of WUs are typically the most GPU intensive? This way I could refine my test process to make sure I am doing the best testing possible.

Thank you,
Steve

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13403 - Posted: 7 Nov 2009 | 16:09:03 UTC
Last modified: 7 Nov 2009 | 16:14:12 UTC

I have been running GPUGRID for more than a year now with this machine on it the whole time. It had been pretty regular about producing no errors most of the time. Some batches of work had it produce one error out of 20 work units. Lately it has failed on five of the last 18 work units, or just under 1/3 of them.

The only change of late has been upgrading to BOINC 6.10.3. This was nice, since now all my work units (that complete) get me the bonus without my having to manually delete the 4 or 5 extras it would download with the older version of BOINC.

A typical failure looks like:

Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1.46 GHz
# Total amount of global memory: 805044224 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.

I will open the box and blow the dust out of it in case it is overheating, but I can confirm that the error rate appears to have increased of late.
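[Editor's aside: failures like the one quoted above are easy to tally across saved task logs. A small illustrative Python helper; the regex is keyed to the exact error line quoted in this post, which is an assumption about how other logs look.]

```python
import re

# Matches lines like:
#   Cuda error: Kernel [pme_fill_charges_accumulate] failed in file
#   'fillcharges.cu' in line 73 : unspecified launch failure.
LAUNCH_FAILURE = re.compile(
    r"Cuda error: Kernel \[(\w+)\] failed .*unspecified launch failure")

def count_launch_failures(log_text: str) -> dict:
    """Tally which kernels appear in 'unspecified launch failure' lines."""
    counts: dict = {}
    for match in LAUNCH_FAILURE.finditer(log_text):
        counts[match.group(1)] = counts.get(match.group(1), 0) + 1
    return counts
```

Running this over a batch of stderr outputs would show whether the failures cluster on one kernel or are spread around.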

Dennis-TW
Joined: 15 Jan 09
Posts: 6
Credit: 113,514,591
RAC: 0
Message 13436 - Posted: 10 Nov 2009 | 0:46:15 UTC - in response to Message 13403.

> but I can confirm that the error rate appears to have increased of late.

Copy that; it turns out that about 1/3 of all my WUs have failed recently.

Though it's nice to read the general hints ... overclocking, heat, drivers, blah blah ... the fact is that none of it applies to my station here. In July/August everything was running fine, but now in October/November I get this failure rate.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13441 - Posted: 10 Nov 2009 | 6:09:38 UTC - in response to Message 13403.

> I have been running GPUGRID for more than a year now with this machine on it the whole time. It had been pretty regular about producing no errors most of the time. Some batches of work had it produce one error out of 20 work units. Lately it has failed on five of the last 18 work units, or just under 1/3 of them.
>
> The only change of late has been upgrading to BOINC 6.10.3.

Not sure of your video driver version but this was posted today in the BOINC change log:

The new Nvidia API that BOINC 6.10 uses has a minimum driver set of CUDA 2.2, 185.85.

If your present drivers are below this version, update first.
If your present drivers are below this and they're the last available for your hardware, you cannot update to 6.10; stay at the last 6.6.x version for your OS.

Spread the word.
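[Editor's aside: checking a driver against that 185.85 floor needs a numeric, component-by-component comparison -- a plain string compare would rank "99.99" above "185.85". A small sketch:]

```python
MINIMUM = "185.85"  # minimum driver quoted from the BOINC change log above

def version_tuple(version: str) -> tuple:
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))

def meets_minimum(driver: str, minimum: str = MINIMUM) -> bool:
    """True if the dotted driver version is at or above the minimum,
    compared numerically field by field (so 190.38 >= 185.85)."""
    return version_tuple(driver) >= version_tuple(minimum)
```

By this check, the 190.38/190.62 drivers reported earlier in the thread are all above the floor, so those hosts would be fine to run BOINC 6.10.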

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13554 - Posted: 14 Nov 2009 | 20:57:25 UTC - in response to Message 13436.
Last modified: 14 Nov 2009 | 20:58:00 UTC

It has gone downhill for me. Up to 100% error rate. Nothing completed since the 11th. I just checked the machine and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling.

oh, and fwiw, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13601 - Posted: 18 Nov 2009 | 16:09:50 UTC - in response to Message 13398.

Right, I understand the natural administrative inclination to blame the user. This is a natural and human failing, and it has the advantage of avoiding a review of work for which the administrator has some accountability (in my earlier life I had the admin tasks and know this inclination).

That being said, these are not overclocked CPUs or cards, and the problem is spread across 7 different configurations with three different GPU processors (9600GT, 9800GT, GTS 250). And it is getting WORSE. Over the past two weeks, 29 out of 85 completed results were failures.

Frankly, if Collatz were not so overburdened with work (they support *normal* ATI cards as well as *normal* CUDA cards), I'd simply back off of GPUGrid and wait for this project to resolve its problems (note, I have not had failures over at Collatz for the much larger sampling over there).

Perhaps when MW comes back to life, and Slicker returns from vacation over at GPU, I will fully move over to there and watch to see here if there is a response other than user error.


> The very large majority of errors are produced by overclocking and/or poor cooling of the GPUs. You might think that your GPU is not overclocked, but the manufacturer may have done it for you. Look up Nvidia's suggested clocks for your cards here:
> http://en.wikipedia.org/wiki/GeForce_200_Series
>
> and compare them with your card's. Reducing the clocks to the lower, recommended values might well fix all your problems. With a proper installation you should be able to get a nearly 100% success rate, as several users do.
>
> GDF

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13602 - Posted: 18 Nov 2009 | 16:13:51 UTC - in response to Message 13436.

Same failure rate here -- and *with the same systems* the rate was much lower in September. The change for me included a move to 6.10.x for the client. It certainly is a possible culprit. But that is not something *I* can control in that any project supporting ATI GPU's requires the 6.10 client. Then again, getting Berkeley to accept that they are part of any problem is even less likely than getting folks here to step up to the plate.

Curiously, I am NOT seeing these failures over at Collatz with either ATI or CUDA GPU's on the same computers.

> but I can confirm that the error rate appears to have increased of late.

> Copy that; it turns out that about 1/3 of all my WUs have failed recently.
>
> Though it's nice to read the general hints ... overclocking, heat, drivers, blah blah ... the fact is that none of it applies to my station here. In July/August everything was running fine, but now in October/November I get this failure rate.

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13603 - Posted: 18 Nov 2009 | 16:16:04 UTC - in response to Message 13554.

Perhaps if enough of us report this here with enough variation in OS and hardware, eventually the spotlight on user error might be changed to a mirror....

> It has gone downhill for me. Up to 100% error rate. Nothing completed since the 11th. I just checked the machine and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling.
>
> oh, and fwiw, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,894,537,378
RAC: 19,753,259
Message 13604 - Posted: 18 Nov 2009 | 16:21:48 UTC

I second this call. My recent failure rate has been 12 out of 58 - over 20% - across three cards: two Zotac 9800GT at completely stock speeds, and a Zotac 'AMP' edition (factory overclock) 9800GTX+. The cards have all been running on SETI since January with no sign of failure, and are doing so again as I type.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 13605 - Posted: 18 Nov 2009 | 16:27:13 UTC - in response to Message 13604.
Last modified: 18 Nov 2009 | 16:45:25 UTC

We will try to upload a new application compiled with cuda 2.3.
Let's see if this solves the problem. The only change we had was that we are now distributing only a cuda2.2 application.

gdf

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,875,294
RAC: 2
Message 13606 - Posted: 18 Nov 2009 | 16:32:41 UTC - in response to Message 13605.

OK -- let folks know when that is in place -- for the moment, my GPUGrid processing is going on hold.

> We will try to upload a new application compiled with cuda 2.3.
> Let's see if this solves the problem. The only change we had was that we are now distributing only a cuda2.2 application.
>
> gdf

Richard Bertrand
Joined: 15 Nov 09
Posts: 1
Credit: 0
RAC: 0
Message 13614 - Posted: 19 Nov 2009 | 11:26:47 UTC - in response to Message 13603.

To put in my 2cts...

I have got CUDA working on a 9600M GS card within Kubuntu amd64 with the nVidia 190.42 driver, and have a 100% failure rate thus far.

Just started on the 15th of November with BOINC 6.4.5 from the Ubuntu repositories. GPUGrid is running the v6.70 cuda version.

Error messages state that a file couldn't be renamed, so I am not 100% sure whether it is the same issue as discussed here, but an inspection of the permissions revealed no problems as far as I can see.

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 13642 - Posted: 22 Nov 2009 | 1:12:47 UTC

In November I've had 3 failures to 11 successes, non-overclocked, on an 8800GT, so it's interesting that others are reporting failures on 8800s and 9800s.

BlackNite
Joined: 21 Mar 09
Posts: 1
Credit: 2,518,637
RAC: 0
Message 13643 - Posted: 22 Nov 2009 | 1:33:02 UTC

I had 9 failures in the last 32 WUs on an 8800GTS512.

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13644 - Posted: 22 Nov 2009 | 2:39:16 UTC

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade CUDA from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time, in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGrid is back to its old stable self. I am not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 13649 - Posted: 22 Nov 2009 | 12:02:16 UTC

Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is Integer ... so... success on one project does not at all imply that there is not a problem with the hardware side ... all three projects are using different parts of the cards ...

Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13666 - Posted: 23 Nov 2009 | 19:13:24 UTC - in response to Message 13644.

> Things went from almost 100% failure back to 100% success for me on 16-Nov.
>
> I did upgrade cuda from 190.18 to 190.42 and boinc from 6.10.13

If you check, you'll see that almost all of these WUs were the ones talked about in this thread:

http://www.gpugrid.net/forum_thread.php?id=1468

They were later successfully completed by GTX 260 (and above) cards. It seems these WUs were pulled right around the 16th. I moved my sub-GTX 260 cards to other projects for a few days because they were experiencing the same errors you were. Now it seems things are sorted out and the sub-GTX 260 cards are running better.


Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,894,537,378
RAC: 19,753,259
Message 13686 - Posted: 24 Nov 2009 | 18:05:13 UTC
Last modified: 24 Nov 2009 | 18:37:50 UTC

Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession. It also looks as if in the meantime, it has been trashing SETI Beta tasks in the characteristically SETI way, i.e. reporting 'success' but exiting early (after 17 seconds or so) with a false -9 overflow message and no useful science.

This happened just before SETI closed for weekly maintenance, so I can't check their logs until later. But I've looked through the local log, and it was definitely the GPUGrid task which was the first to fail: the subsequent problems lasted long enough to drive SETI DCF way down (0.0219), so now I've got a major excess to work off.

I rebooted the machine, and it's completed the next SETI Beta in a much saner 17m 34s (DCF 0.0889). I'll do one more SETI, then start the new queued GPUGrid. But I would be worried if it turns out that GPUGrid errors are wrecking the science, not only of your own project, but potentially other projects too.

Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.

Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13698 - Posted: 25 Nov 2009 | 15:54:35 UTC - in response to Message 13686.
Last modified: 25 Nov 2009 | 15:56:44 UTC

> Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession.
>
> [...]
>
> Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.



I had 4 faulty ...TONI_HERG... WUs on a 9800GT in the last few days.
Each "ERROR" crashed the driver (reboot needed).

One of these WUs (http://www.gpugrid.net/workunit.php?wuid=961479) has already errored out (too many error results).

CTAPbIi
Joined: 29 Aug 09
Posts: 175
Credit: 259,509,919
RAC: 0
Message 13700 - Posted: 26 Nov 2009 | 3:23:41 UTC - in response to Message 13698.
Last modified: 26 Nov 2009 | 4:08:16 UTC

The last 3 WUs died just before the end...
32-IBUCH_2_reverse_TRYP_0911-9-40-RND8911
85-GIANNI_BIND_166_119-23-100-RND0667
8-GIANNI_BIND_2-34-100-RND3540

I just did a new OC; it looks stable, at least POEM's WUs are OK. The GPU was flashed years ago, so that's not the cause.

Daniel.Ahlborn
Joined: 12 Jan 09
Posts: 5
Credit: 3,359,168
RAC: 0
Message 13701 - Posted: 26 Nov 2009 | 9:40:50 UTC

It doesn't seem like an OC problem. For the past couple of days I have had a failure rate of nearly 100% on my machine with a GTS 250 as well.

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTS 250"
# Clock rate: 1.84 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.

</stderr_txt>
]]>

They are all failing after a couple of hours of running, for random reasons.

http://www.gpugrid.net/results.php?hostid=56508

To me it appears that the current WUs run well only on G200-based chips, since my other machine with a GTX260 (G200b, 55nm, 216 SPs), with the same OS and same driver, works well with anything they feed it.



Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13717 - Posted: 28 Nov 2009 | 18:19:28 UTC

I took a closer look at my results (last 2 weeks):

- not a single error on my highly overclocked GT200s (GTX260/GTX295)

- 12 errors (55 valid) on my four non-overclocked 9800GTs

-- 9 of the 12 errors were on '...TONI_HERG...' WUs, 3 on '...IBUCH_..._TRYPE...'

I found not a single valid '...TONI_HERG...' on any of the four 9800GTs.

(I tried BM 6.6.38 up to 6.10.17 and NV drivers 190.38/190.62/191.07 -- no difference in failure rate.)
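[Editor's aside: breakdowns like this are easy to automate from a host's task list. A hypothetical Python sketch; the name format is inferred from the work units quoted in this thread, so treat the parsing as an assumption, not a spec.]

```python
from collections import Counter

def batch_tag(wu_name: str) -> str:
    """Pull the researcher tag (TONI, IBUCH, GIANNI, ...) out of names
    like 'D14-TONI_HERGdof2-0-40-RND9670'.  Format inferred from the
    examples quoted in this thread -- an assumption, not a spec."""
    return wu_name.split("-")[1].split("_")[0]

# Failed work units reported earlier in the thread.
failed = [
    "D14-TONI_HERGdof2-0-40-RND9670",
    "32-IBUCH_2_reverse_TRYP_0911-9-40-RND8911",
    "85-GIANNI_BIND_166_119-23-100-RND0667",
    "8-GIANNI_BIND_2-34-100-RND3540",
]
print(Counter(batch_tag(name) for name in failed))
# e.g. Counter({'GIANNI': 2, 'TONI': 1, 'IBUCH': 1})
```

Tallies like this make it obvious when the errors cluster on one researcher's batches (as with the TONI_HERG units above) rather than being spread evenly.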

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13742 - Posted: 1 Dec 2009 | 5:52:14 UTC - in response to Message 13644.

> Things went from almost 100% failure back to 100% success for me on 16-Nov.
>
> I did upgrade CUDA from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time, in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGrid is back to its old stable self. I am not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.

It looks like Cthulhu ate everything he was given and wants more. I am back to a 100% error rate. I look at the WUs I failed, and others fail them as well.

Should this be taken as a formal announcement that G92 boards are no longer welcome on GPUGRID? Finding G92/Linux-friendly projects is becoming more and more difficult...

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13756 - Posted: 2 Dec 2009 | 6:06:19 UTC - in response to Message 13649.

> Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is integer ... so success on one project does not at all imply that there is not a problem on the hardware side ... all three projects are using different parts of the cards ...
>
> Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...

Ok, I will admit finding factual information is hard. Very hard. But newer GPUs like the GT240, based on the GT215 GPU, are compute level 1.2. The only difference between compute level 1.2 and compute level 1.3 that Nvidia documents is that compute level 1.3 supports double precision.

This begs the question: are GT240s, based on the GT215 chipset, supported by GPUGRID? We all know that GTS250s, based on the G92b chipset, are not, as are many GTX280s based on the G200 chipset, while GTX280s based on the G200b chipset DO work with GPUGRID.

Can boards that are expected to work be defined by their chipset, their compute level, or something else? Nvidia's conventions are hard to understand, but it is clear that G92 is not welcome on GPUGRID, nor is G200. G200b is. Is GT215?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13835 - Posted: 8 Dec 2009 | 17:26:03 UTC - in response to Message 13756.

For factual information, try here: http://en.wikipedia.org/wiki/GeForce_200_Series
You should note that Nvidia's newer cards are not necessarily based on anything in particular, and Nvidia's naming system is beyond ridiculous.

The G200 series seems to include G92a, G92b and G96 based cards.
There are 40nm, 55nm and 65nm cores, and the transistor count varies from 0.26 billion to 2.8 billion.
Release dates don't seem to matter much either.

Particularly annoying specs:
GTX 280 cards used a 65nm core and were usually slower than the GTX 275s.
The older GTX 260s used 65nm, and the GTX 260M used a G92 core.
The GTS 250 uses a 65nm G92 A2 core, but still sort of works here!
The GTS 240 uses a 55nm G92b core -- an afterthought, or perhaps a fulfil-contracts card.
The GT 220M uses a 65nm G96M core. I doubt that would work.

The combination of card factors that presently seem important to GPUGrid functionality includes:

Core size: 40nm good, 55nm OK, 65nm bad

GPU codename: GT216 good, GT215 good, GT200b good, GT200 poor/OK, G92 A2 poor/OK-ish, G92 bad. The G90 is no longer compatible.

Memory: GDDR5+DDR3 good, DDR3 a mix of good and bad, DDR2 presumably bad

Overall performance: a combination of the number and speed of the cores, shaders, memory, bus width, and other performance factors; determines whether the card can finish in time.

Temperatures: too hot and it will crash. Depends on the physical architecture of the GPU and computer, the use of fans, the GPUGrid workunit, and what else you are crunching...

and not forgetting:
how much use the card has seen, i.e. how close it is to failure, given the card's other factors!
