Author |
Message |
JStateson
Joined: 31 Oct 08 Posts: 186 Credit: 3,387,358,634 RAC: 1,238,535
|
I can crunch along just fine as shown here
http://www.gpugrid.net/results.php?hostid=16255
then after a while get this:
<core_client_version>6.3.21</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.
=============
Once this happens I have to power the machine off; issuing a reboot through remote desktop is not enough. If I don't catch it right away, I burn through the 4 or 8 or so work units I am allocated in 24 hours.
This is unsatisfactory. I think I have plenty of cooling, and this system seemed to be working fine until I bought and started using that 9800 GTX+ board.
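A rough sketch of how the failure could be spotted automatically instead of by eye. It assumes the stderr text of a task is already available as a string; the function name `find_cuda_init_error` is made up for illustration, and how to fetch the text from the client is not shown. The pattern only covers the exact message quoted above.

```python
import re

# Pattern copied from the stderr line quoted above.
# Assumption: other failure messages use a different wording and are not matched.
CUDA_INIT_ERROR = re.compile(
    r"Cuda error in file '([^']+)' in line (\d+) : initialization error"
)


def find_cuda_init_error(stderr_text):
    """Return (source_file, line_number) of the first CUDA initialization
    error found in stderr_text, or None if there is none."""
    match = CUDA_INIT_ERROR.search(stderr_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Sample text taken from the error output in this post.
sample = (
    "# Using CUDA device 0\n"
    "Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.\n"
)

print(find_cuda_init_error(sample))
```

A check like this could be run against each finished task so that the machine gets attention before the rest of the daily quota is used up.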
Any ideas? I could move the board to an intel platform and see if the problem goes away.
Since I have a work unit "caught" that has not been uploaded, is it possible to undo the "client compute error" and restart the work unit after a hard reboot? That way at least I can process one work unit until my quota is back.
BTW, most of the above errors were the checksum ones from when the server disk was full. The more recent ones seem to come from my graphics board locking up just after a successful upload. |
|
|
|
I can crunch along just fine as shown here
http://www.gpugrid.net/results.php?hostid=16255
then after a while get this:
<core_client_version>6.3.21</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
Cuda error in file 'deviceQuery.cu' in line 59 : initialization error.
=============
Once this happens I have to power the machine off; issuing a reboot through remote desktop is not enough. If I don't catch it right away, I burn through the 4 or 8 or so work units I am allocated in 24 hours.
This is unsatisfactory. I think I have plenty of cooling, and this system seemed to be working fine until I bought and started using that 9800 GTX+ board.
What size power supply does it have?
The NVIDIA site lists a minimum of 450 W for the 9800 GTX.
Any ideas? I could move the board to an intel platform and see if the problem goes away.
Certainly would not hurt to try this too.
Since I have a work unit "caught" that has not been uploaded, is it possible to undo the "client compute error" and restart the work unit after a hard reboot? That way at least I can process one work unit until my quota is back.
No.
BTW, most of the above errors were the checksum ones from when the server disk was full. The more recent ones seem to come from my graphics board locking up just after a successful upload.
|
|
|
JStateson
|
Thanks, Keith.
I am using a Thermaltake 700 on a pair of dual Opterons, on a Tyan S2892 server board. I managed to add another fan and in the process discovered that I had the NVIDIA board in an x4 slot instead of an x16. There are two slots, but the second PCIe slot is x16 in form factor only. I discovered this by looking at the NVIDIA control panel. I had moved the board to the 2nd slot after the first lockup. The x4 signaling would account for why my first WU took 18,000 seconds but all subsequent ones took about 25,000 seconds. I have put the board back in the full x16 slot.
I tried installing the 6.02 NVIDIA tools, but they froze the system whenever they ran. I am not using the Vista nag security option, and a right-click and "Run as administrator" still froze. I googled, and other people are reporting freezes. I saw a reference to a 6.03 version of the tools, but I could not find it on the NVIDIA site, and I assume it was for XP or 32-bit systems.
Anyway, I put a fan in an awkward place and propped the case up off the floor to let air get into my new hole. The case I have is not designed for the server board, and I am waiting on a pair of heat-pipe heat sinks, as one CPU runs about 10 °C hotter than the other.
The quad Intel system runs much cooler in comparison, and it has a 550 W supply that is more than sufficient. If the board hangs again I will move it to the Intel platform.
I would not have thought the NVIDIA board would work in an x4 slot. It is interesting that moving from x16 to x4 caused about a 50% increase in WU time. One would think it would be linear and the WU would take 4 times as long to complete. |
|
|
|
I would not have thought the NVIDIA board would work in an x4 slot. It is interesting that moving from x16 to x4 caused about a 50% increase in WU time. One would think it would be linear and the WU would take 4 times as long to complete.
Well, no. Actually one would expect the PCIe speed to have no influence, because that is just the speed of the interconnect between GPU and chipset / CPU, which is seldom used.
The WU time you are probably looking at is the CPU time, which is not directly related to GPU time. If you look at your individual tasks, the actual GPU time is written there. Your first WU (the "fast" one) needed 47.3 ms/step, the 2nd 47.6 and the 3rd 47.3 ms/step.
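To put numbers on that, here is a quick sketch that uses only the figures quoted in this thread (the approximate 18,000 s and 25,000 s CPU times and the three ms/step values). The variable and function names are just for illustration, and the "linear" figure is simply the x16 time scaled by the lane ratio, not a measurement.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Approximate CPU times reported in the thread (seconds).
cpu_time_x16 = 18_000
cpu_time_x4 = 25_000

# GPU time per step reported for the first three work units (milliseconds).
ms_per_step = [47.3, 47.6, 47.3]

cpu_increase = percent_change(cpu_time_x16, cpu_time_x4)
gpu_spread = percent_change(min(ms_per_step), max(ms_per_step))

# What a purely linear dependence on lane count (16 lanes vs 4) would predict.
linear_guess = cpu_time_x16 * 16 / 4

print(f"CPU time increase x16 -> x4: {cpu_increase:.1f}%")    # 38.9%
print(f"Spread in GPU ms/step:       {gpu_spread:.2f}%")      # 0.63%
print(f"Linear-scaling guess:        {linear_guess:.0f} s")   # 72000 s
```

So the reported CPU time went up by roughly 39%, while the GPU time per step moved by well under 1%.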
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
JStateson
|
I would not have thought the NVIDIA board would work in an x4 slot. It is interesting that moving from x16 to x4 caused about a 50% increase in WU time. One would think it would be linear and the WU would take 4 times as long to complete.
Well, no. Actually one would expect the PCIe speed to have no influence, because that is just the speed of the interconnect between GPU and chipset / CPU, which is seldom used.
The WU time you are probably looking at is the CPU time, which is not directly related to GPU time. If you look at your individual tasks, the actual GPU time is written there. Your first WU (the "fast" one) needed 47.3 ms/step, the 2nd 47.6 and the 3rd 47.3 ms/step.
MrS
OK, I agree: the "WU time" should not change much since the GPU runs on its own clock and memory. However, one of the CPU cores spent 25,000 seconds fiddling around while I had the card in the x4 slot and only fiddled for 18,000 or so seconds when I had it in the correct x16 slot. I don't remember whether I had the CPU utilization set to use all 4 or just 3 CPUs, so possibly the 25k and 18k are not related to the x4 and x16 slots. After reading your response I posted a question about whether I should be using all 4 CPUs or just 3, as Krunching Keith recommended. |
|
|
|
I posted an answer in the other thread. Anyway, running it in x16 mode cannot be worse, so I'd recommend it. Just don't expect miracles ;)
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|