Message boards : Graphics cards (GPUs) : getting too many initialization errors

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,387,358,634
RAC: 1,238,535
Message 3667 - Posted: 4 Nov 2008 | 18:55:03 UTC
Last modified: 4 Nov 2008 | 18:59:03 UTC

I can crunch along just fine as shown here
http://www.gpugrid.net/results.php?hostid=16255
then after a while get this:
<core_client_version>6.3.21</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.
=============
Once this happens I have to do a power off, not just issue a reboot through remote desktop. If I don't catch it right away, I burn up the 4 or 8 or so work units I am allocated in 24 hours.
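
One way to catch this condition before the daily quota is used up is to scan a task's stderr text for the CUDA error line. The sketch below is only an illustration and is not part of BOINC or the GPUGRID application: the names `find_cuda_errors` and `has_cuda_init_error` are hypothetical, and the regular expression is an assumption modeled on the single log line quoted above, so other CUDA failures may be worded differently.

```python
import re

# Pattern modeled on the one stderr line quoted in this post (assumption).
CUDA_ERROR_RE = re.compile(
    r"Cuda error in file '(?P<file>[^']+)' in line (?P<line>\d+) : "
    r"(?P<reason>.+?)\.?[ \t]*$",
    re.MULTILINE,
)

def find_cuda_errors(stderr_text):
    """Return a list of (file, line, reason) tuples found in stderr_text."""
    return [
        (m.group("file"), int(m.group("line")), m.group("reason"))
        for m in CUDA_ERROR_RE.finditer(stderr_text)
    ]

def has_cuda_init_error(stderr_text):
    """True if any reported CUDA error is an initialization error."""
    return any(
        reason == "initialization error"
        for _, _, reason in find_cuda_errors(stderr_text)
    )

sample = (
    "# Using CUDA device 0\n"
    "Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.\n"
)
print(find_cuda_errors(sample))
print(has_cuda_init_error(sample))
```

What to do when the check fires (for example, suspending GPU work until the machine has been power cycled) is left out, since that depends on the client version and setup.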

This is unsatisfactory. I think I have plenty of cooling, and this system seems to have been working fine until I bought and started using that 9800 GTX+ board.

Any ideas? I could move the board to an intel platform and see if the problem goes away.

Since I have a work unit "caught" that has not been uploaded, is it possible to undo the "client compute error" and restart the work unit after a hard reboot? That way at least I can process one work unit until my quota is back.

BTW, most of the above errors were those checksum ones from when the server disk was full. The more recent ones were (it seems) from my graphics board locking up just after a successful upload.
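
To tell whether the card is usable again after a reboot or a move to another machine, a quick probe of the CUDA driver can help. The sketch below is an assumption-laden illustration: it does not reproduce `deviceQuery.cu` (which is not shown in this thread) but calls the CUDA driver API functions `cuInit` and `cuDeviceGetCount` through Python's `ctypes`. The helper names `load_cuda_driver` and `probe_cuda` are hypothetical, and the library file names (`nvcuda.dll`, `libcuda.so.1`) are the usual ones but may differ on a given install.

```python
import ctypes
import sys

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API

def load_cuda_driver():
    """Load the CUDA driver library, or return None if it cannot be loaded."""
    try:
        if sys.platform.startswith("win"):
            return ctypes.WinDLL("nvcuda.dll")
        return ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None

def probe_cuda(lib):
    """Return (ok, detail) after calling cuInit and cuDeviceGetCount on lib."""
    if lib is None:
        return False, "CUDA driver library not found"
    status = lib.cuInit(0)  # the flags argument must be 0
    if status != CUDA_SUCCESS:
        return False, f"cuInit failed with CUresult {status}"
    count = ctypes.c_int(0)
    status = lib.cuDeviceGetCount(ctypes.byref(count))
    if status != CUDA_SUCCESS:
        return False, f"cuDeviceGetCount failed with CUresult {status}"
    return True, f"{count.value} CUDA device(s) visible"

if __name__ == "__main__":
    print(probe_cuda(load_cuda_driver()))
```

If the probe fails right after a soft reboot but succeeds after a full power off, that would be consistent with the card itself being stuck rather than the application.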

Krunchin-Keith [USA]
Joined: 17 May 07
Posts: 512
Credit: 111,288,061
RAC: 0
Message 3668 - Posted: 4 Nov 2008 | 21:24:17 UTC - in response to Message 3667.

> I can crunch along just fine as shown here
> http://www.gpugrid.net/results.php?hostid=16255
> then after a while get this:
> <core_client_version>6.3.21</core_client_version>
> <![CDATA[
> <message>
> Incorrect function. (0x1) - exit code 1 (0x1)
> </message>
> <stderr_txt>
> # Using CUDA device 0
> Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.
> =============
> Once this happens I have to do a power off, not just issue a reboot through remote desktop. If I don't catch it right away, I burn up the 4 or 8 or so work units I am allocated in 24 hours.
>
> This is unsatisfactory. I think I have plenty of cooling, and this system seems to have been working fine until I bought and started using that 9800 GTX+ board.

What size power supply does it have?
NVIDIA's site says the 9800 GTX needs a minimum of 450 W.

> Any ideas? I could move the board to an intel platform and see if the problem goes away.

Certainly would not hurt to try this too.

> Since I have a work unit "caught" that has not been uploaded, is it possible to undo the "client compute error" and restart the work unit after a hard reboot? That way at least I can process one work unit until my quota is back.

No.

> BTW, most of the above errors were those checksum ones from when the server disk was full. The more recent ones were (it seems) from my graphics board locking up just after a successful upload.

JStateson
Message 3671 - Posted: 5 Nov 2008 | 1:26:07 UTC - in response to Message 3668.

Thanks, Keith.

I am using a Thermaltake 700 on a pair of dual Opterons, on a Tyan S2892 server board. I managed to add another fan and in the process discovered that I had the NVIDIA board in an x4 slot instead of an x16. There are two slots, but the second PCIe slot is x16 in form factor only and carries x4 signals. I discovered this by looking at the NVIDIA control panel. I had moved the board to the 2nd slot after the first lockup. The x4 signals account for why my first WU took 18,000 seconds but all subsequent ones took about 25,000 seconds. I put the board back in the full x16 slot.

I tried installing the 6.02 NVIDIA tools, but they froze the system whenever they ran. I am not using the Vista nag security option, and a right click and "run as administrator" still froze. I googled and other people are reporting freezes. I saw a reference to a 6.03 tool version but could not find it on the NVIDIA site, and I assume it was for XP or 32-bit stuff.


Anyway, I put a fan in an awkward place and propped the case up off the floor to let the air get in my new hole. The case I have is not designed for the server board, and I am waiting on a pair of heat pipe heat sinks, as one CPU runs about 10 C hotter than the other.

The quad Intel system runs much cooler in comparison, and I have a 550 W supply that is more than sufficient. If the board hangs up again I will move it to the Intel platform.

I would not have thought the NVIDIA board would work in an x4 slot. It is interesting that moving from x16 to x4 caused about a 50% increase in WU time. One would think it would be linear and the WU would take 4 times as long to complete.
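
As a quick check of the figures in this post, the arithmetic can be written out. This is a minimal sketch that takes the two approximate times quoted above (18,000 and 25,000 seconds) at face value; the function name `percent_increase` is just for illustration. With those numbers the increase comes out nearer 39% than 50%, and a time that scaled with the lane count would be far larger than either.

```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0

x16_seconds = 18_000  # approximate time reported for the first WU
x4_seconds = 25_000   # approximate time reported for the later WUs

observed = percent_increase(x16_seconds, x4_seconds)
lane_ratio = 16 / 4   # an x16 link has four times the lanes of an x4 link

print(f"observed increase: {observed:.1f}%")
print(f"time if it scaled with lane count: {x16_seconds * lane_ratio:.0f} s")
```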

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 3686 - Posted: 5 Nov 2008 | 22:49:21 UTC - in response to Message 3671.

> I would not have thought the NVIDIA board would work in an x4 slot. It is interesting that moving from x16 to x4 caused about a 50% increase in WU time. One would think it would be linear and the WU would take 4 times as long to complete.


Well.. no. Actually, one would expect the PCIe speed to have no influence, because that is just the speed of the interconnect between the GPU and the chipset / CPU, which is seldom used.

The WU time you are probably looking at is the CPU time, which is not directly related to GPU time. If you look at your individual tasks, the actual GPU time is listed there. Your first WU (the "fast" one) needed 47.3 ms/step, the 2nd 47.6 and the 3rd 47.3 ms/step.
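
To put those three step times side by side, here is a small sketch using only the numbers quoted above; the variable names are illustrative. The spread between the fastest and slowest task comes out well under 1%, which is why the slot change does not show up in GPU time.

```python
step_ms = [47.3, 47.6, 47.3]  # per-step GPU times of the three tasks

mean_ms = sum(step_ms) / len(step_ms)
spread_pct = (max(step_ms) - min(step_ms)) / min(step_ms) * 100.0

print(f"mean {mean_ms:.1f} ms/step, spread {spread_pct:.2f}%")
```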

MrS
____________
Scanning for our furry friends since Jan 2002

JStateson
Message 3694 - Posted: 6 Nov 2008 | 12:08:06 UTC - in response to Message 3686.

OK - I agree, the "WU time" should not change much, since the GPU runs on its own clock and memory. However, one of the CPU cores spent 25,000 seconds fiddling around while I had the card in the x4 slot and only about 18,000 seconds when I had it in the correct x16 slot. I don't remember whether I had the CPU utilization set to use all 4 CPUs or just 3, so possibly the 25k and 18k are not related to the x4 and x16 slots. After reading your response, I posted a question about whether I should be using all 4 CPUs or just 3, as Krunchin-Keith recommended.

ExtraTerrestrial Apes
Message 3701 - Posted: 6 Nov 2008 | 18:50:50 UTC

I posted an answer in the other thread. Anyway, running it in x16 mode cannot be worse, so I'd recommend it. Just don't expect miracles ;)

MrS
____________
Scanning for our furry friends since Jan 2002
