Temujin
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
I'm getting a lot of errors like the ones below:
Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting Gn30779-TEST12-0-5-acemd_0
Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting task Gn30779-TEST12-0-5-acemd_0 using acemd version 625
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Computation for task Gn30779-TEST12-0-5-acemd_0 finished
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_1 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_2 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_3 for task Gn30779-TEST12-0-5-acemd_0 absent
This usually happens when the 2nd (or 3rd/4th) WU of a download batch runs; I don't think my rig has ever successfully processed a full batch of WUs.
The 1st WU of a batch normally runs to completion, so I set my "connect every" to 0.1 with 0 cache to try to download only 1 WU at a time, but the above error came from a single-WU download (this one).
The WU can process for anything from 13 seconds (as above) to 4 hours (this one) before failing.
Is anyone else getting errors like these?
Any ideas why it's happening?
Anyone else using an 8800GS successfully?
Fedora 7, Q6600 (running 3xseti & 1xps3grid), Asus 8800GS running 173.14 drivers (also happened with 169.09) |
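For anyone wanting to check how often this happens on their own host, here is a rough sketch. It assumes the BOINC messages have been saved to a text file in the same pipe-separated format as the log lines above; the function name `count_absent_outputs` is made up for illustration.

```shell
# Hypothetical helper: count "Output file ... absent" messages per task.
# Assumes a saved message log with "date|project|message" lines as above.
count_absent_outputs() {
    awk -F'|' '/Output file .* absent/ {
        n = split($3, w, " ")
        counts[w[n - 1]]++      # message ends "... for task NAME absent"
    }
    END { for (t in counts) print t, counts[t] }' "$1" | sort
}
```

Running `count_absent_outputs messages.txt` would print one line per affected task with the number of missing output files.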
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
Thanks for the accurate description. We knew there was a problem with a WU failing at start after a successful one, but this makes it much clearer.
Keep in touch. We hope to fix it soon.
GDF
|
|
|
|
I also have had 1 workunit fail right at the start.
Task ID 38220, and by the looks of things it's one of the ones Temujin had fail on him too (but after 924 Seconds)
Mine was moaning about "error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory" |
|
|
Temujin
|
While waiting for a fix, do you know of any tricks to reduce the number of failures as I've hit my 4 WU/day limit today with 4 failures :(
I've tried restarting boinc but that didn't work, would restarting the machine help?
Is there anything I can run to clean things up?
I don't mean to sound ungrateful/impatient but any idea how long "soon" will be?
are we talking days or weeks?
How has the take up of the GPU app been?
Any idea how many GPU users you have? |
|
|
|
I guess the fix will be implemented very soon... ;)
I also have had issues with one of my cards (driver problems plus some failing tasks like you described them), and hit the max. of 4 WUs per day. Unfortunately there is nothing you can do to reset it...
____________
pixelicious.at - my little photoblog |
|
|
GDF
|
Hi,
we are looking into it. The problem is that we cannot replicate it here. It does happen to others but much less frequently. At the moment your machine and Stefan's are summing up 90% of all errors; otherwise things are going well. It could be a driver problem for both. I hope that we can do something in days.
GDF |
|
|
Temujin
|
[quote]At the moment your machine and Stefan's are summing up 90% of all errors[/quote]
Oops
[quote]I hope that we can do something in days.[/quote]
Many thanks
|
|
|
|
Just got the same error: Task ID: 38629
Mon 21 Jul 2008 17:32:53 BST|PS3GRID|Restarting task xD30815-TEST12-1-5-acemd_0 using acemd version 625
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Computation for task xD30815-TEST12-1-5-acemd_0 finished
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_0 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_1 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_2 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_3 for task xD30815-TEST12-1-5-acemd_0 absent
Mind you, Boinc had been jumping about the workunits like a demented flea yesterday, due to it thinking that it would not reach the workunit deadline.
That workunit was listed as 0% and had just started after one had just finished.
I don't know if it's in any way related, but a workunit for the project I run alongside gpugrid had also just finished. |
|
|
|
[quote]... It could be a driver problem for both. I hope that we can do something in days.
GDF[/quote]
Well, I'm not sure if I can believe that it is a driver problem only...
I run a 8800GT with Ubuntu 7.10 and 169.14 drivers,
a 9800GTX with Fedora 9 and 173.14.09 drivers
and a GTX 260 with Ubuntu 8.04 and 177.13 drivers...
Three different machines, three different OS and three different driver versions, and all show the same errors from time to time...
The error with the WUs at start after a successful one with the error
"process exited with code 127 (0x7f, -129)
acemd_6.25_x86_64-pc-linux-gnu__cuda: error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"
and the second error
"process exited with code 1 (0x1, -255)"
I know that the 177.13 driver for the GTX260 is really crap because the PowerMizer does not work and it slows down the core clock speed of the card after the first successful WU, but the other two computers (drivers) too?
I really hope you can find out what's going wrong.
____________
pixelicious.at - my little photoblog |
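The three numbers in each of these messages look like the same exit status written three ways: decimal, hex, and (judging only from the values quoted in this thread) the status minus 256. A small sketch that reproduces the format; `format_exit_code` is a made-up name, not part of BOINC.

```shell
# Hypothetical helper: print an exit status in the "N (0xHEX, N-256)" form
# seen in the messages quoted in this thread.
format_exit_code() {
    code=$1
    printf '%d (0x%x, %d)\n' "$code" "$code" "$((code - 256))"
}

format_exit_code 127
format_exit_code 1
```

As a general Linux convention, and not something confirmed by the project here, status 127 is what the dynamic loader returns when a required shared library cannot be loaded, which would fit the libcudart.so message.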
|
|
Temujin
|
[quote]"process exited with code 127 (0x7f, -129)
acemd_6.25_x86_64-pc-linux-gnu__cuda: error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"[/quote]
I've only had this error once. In my boinc directory I only have libcudart32.so and libcudart64.so, i.e. no libcudart.so, so would it be worth trying
ln -s libcudart64.so libcudart.so
??
[quote]and the second error "process exited with code 1 (0x1, -255)"[/quote]
That's the one I get on all but 1 of my fails :(
|
|
|
|
libcudart.so was downloaded from ps3grid/gpugrid with your first WU and it should be in the BOINC/projects/www.ps3grid.net directory.
____________
pixelicious.at - my little photoblog |
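One way to see which libraries the loader cannot resolve is to run `ldd` on the application binary from inside the project directory. The sketch below only filters ldd-style output; `missing_libs` is a hypothetical helper name, and the binary name in the usage comment is the one from the error message above.

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# libraries that the loader reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Possible usage, run from the project directory:
#   ldd ./acemd_6.25_x86_64-pc-linux-gnu__cuda | missing_libs
```

If this prints nothing when run by hand but the task still fails under BOINC, the difference is probably in the environment the client starts the application with rather than in the files on disk.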
|
|
Temujin
|
Yep, you're right
I didn't think to look in there
|
|
|
|
Hmm, the only thing I can see is that we both use Quadcore CPUs...
Gianni, could these problems be related to Quadcores, or is this only a coincidence?
What are the CPU types of the other computers which throw out these errors?
____________
pixelicious.at - my little photoblog |
|
|
Temujin
|
[quote]What are the CPU types of the other computers which throw out these errors?[/quote]
You may be on to something.
I know of the following GPU users
UBT - NaRyan
Q6600, 5 good, 2 errors, 1 abort
Athlon 6000+, 11 good
sneakysaurus
Q6600, 6 good, 5 errors
JG4KEZ(Koichi Soraku)
Xeon X3360, 7 good, 1 error |
|
|
|
Looks obvious that there's something wrong with Quads, but who knows...
Let's see what G is saying.
____________
pixelicious.at - my little photoblog |
|
|
GDF
|
The vast majority of errors are at start-up. We have submitted a series of very fast WUs to check it now.
If you go over quota for the day, let me have your hostid.
GDF |
|
|
|
Ok, but my queue is pretty full with ps3grid WUs, I doubt I'll get new WUs until tomorrow.
But I'll try to stop running tasks, maybe I can get some new ones...
____________
pixelicious.at - my little photoblog |
|
|
Temujin
|
[quote]If you go over quota for the day, let me have your hostid.[/quote]
PM sent
|
|
|
GDF
|
Hi,
your card seems to be overclocked, which makes it unstable and causes the errors!
Is that right?
GDF
|
|
|
Temujin
|
[quote]your card seems to be overclocked which makes it unstable and causes the errors!
Is it right?
GDF[/quote]
who, me?
not as far as I know, I've certainly not tweaked anything.
nvclock gives the following
-- General info --
Card: Unknown Nvidia card
Architecture: G92 A2
PCI id: 0x606
GPU clock: 601.712 MHz
Bustype: PCI-Express
-- Shader info --
Clock: 1674.000 MHz
Stream units: 96 (1b)
ROP units: 12 (1b)
-- Memory info --
Amount: 384 MB
Type: 128 bit DDR3
Clock: 899.996 MHz
-- PCI-Express info --
Current Rate: 16X
Maximum rate: 16X
-- Sensor info --
Sensor: GPU Internal Sensor
GPU temperature: 18C
-- VideoBios information --
Version: 62.92.29.00.00
Signon message: ASUS EN8800GS TOP VGA BIOS Ver 62.92.29.00.AS13
Performance level 0: gpu 600MHz/shader 1700MHz/memory 900MHz/0.00V/100%
VID mask: 3
Voltage level 0: 0.95V, VID: 0
Voltage level 1: 1.00V, VID: 1
Voltage level 2: 1.05V, VID: 2
Voltage level 3: 1.10V, VID: 3
|
|
|
Temujin
|
[quote]If you go over quota for the day, let me have your hostid.[/quote]
got 3,
2 normal & 1 shortie, running the shortie now
thanks
|
|
|
GDF
|
Maybe it was overclocked by the vendor.
According to http://en.wikipedia.org/wiki/GeForce_8_Series
the shader should be clocked at 1375 MHz.
You could try to reduce it.
GDF |
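To put a number on the difference, the shader clock can be pulled out of the `nvclock` output and compared with the reference figure. This is only a sketch: it assumes the section layout shown in the nvclock output earlier in the thread, and `shader_clock_mhz` and `overclock_pct` are hypothetical helper names.

```shell
# Hypothetical helper: read nvclock-style output on stdin and print the
# shader clock in whole MHz (the "Clock:" line inside "-- Shader info --").
shader_clock_mhz() {
    awk '/^-- / { in_shader = /Shader info/ }
         in_shader && $1 == "Clock:" { printf "%d\n", $2 }'
}

# Hypothetical helper: integer percentage by which ACTUAL exceeds REFERENCE.
overclock_pct() {
    echo $(( ($1 - $2) * 100 / $2 ))
}

# With the figures from this thread: 1674 MHz measured, 1375 MHz reference.
overclock_pct 1674 1375
```

With those two values the shader comes out roughly a fifth above the reference clock.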
|
|
MJH (Project administrator, Project developer, Project scientist)
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Hi,
According to the Asus website, at http://www.asus.co.nz/news_show.aspx?id=9578, your card is a factory-overclocked 8800GS. Whilst overclocking is often acceptable for games, the GPUGRID application works the card very hard, so it is quite possible that it is becoming unstable. I suggest that you try reducing the clock back to the standard settings shown in the Wikipedia article given by GDF.
Before you can change the clock frequencies, you may need to add the following to the screen or device section of /etc/X11/xorg.conf:
Option "Coolbits" "1"
Then restart X. The nvidia-settings program will then have a panel called "clock settings".
MJH |
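Before restarting X it can be worth confirming that the option really is active and not commented out. A minimal sketch, with `has_coolbits` as a made-up helper name; it only checks that an uncommented Coolbits line exists, not which section it sits in.

```shell
# Hypothetical helper: succeed if the given xorg.conf contains an
# uncommented Option "Coolbits" line with a numeric value.
has_coolbits() {
    grep -Eq '^[[:space:]]*Option[[:space:]]+"Coolbits"[[:space:]]+"[0-9]+"' "$1"
}

# Possible usage:
#   has_coolbits /etc/X11/xorg.conf && echo "Coolbits is set"
```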
|
|
|
If you meant me with the oc'd card - no they aren't overclocked...
My 9800GTX is an EVGA 9800 GTX SC (super clocked) and is a little bit overclocked by default, but it was stable 24/7 over one week when I was running the Folding@home GPU Client under Windows - Haven't had one error on this card at FAH. And the other two cards do show the same errors with ps3grid and they are for sure not oc'd - not by me and not by the vendor...
-----
Just had another error after nine hours, with a new error code.
The WU was http://www.ps3grid.net/PS3GRID/result.php?resultid=38719 and the error was
<core_client_version>6.3.5</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
</stderr_txt>
]]>
____________
pixelicious.at - my little photoblog |
|
|
Temujin
|
[quote]According to the Asus website, at http://www.asus.co.nz/news_show.aspx?id=9578, your card is a factory overclocked 8800GS. Whilst overclocking is often acceptable for games, because the GPUGRID application works the card very hard it is quite possible that it is becoming unstable. I suggest that you try reducing the clock back to the standard settings shown in the Wikipedia article given by GDF.[/quote]
Yep, that would make sense.
Must admit, I didn't realise it was an overclocked card, I just bought it because it was a cheap 8800.
As far as temps go, it's 16C at idle and a max of 28C while running WUs.
[quote]Before you can change the clock frequencies, you may need to add the following to the screen or device section of /etc/X11/xorg.conf:
Option "Coolbits" "1"
Then restart X. The nvidia-settings program will then have a panel called "clock settings".[/quote]
Yep, done that and up pops the extra panel, but I can only adjust the GPU (at default of 600 MHz) and Memory (at default of 900 MHz) settings. There's no access to the Shader settings.
I'm willing to knock both GPU & Memory down if that will help, any suggestions what to set it to?
I've also done a "nvclock -r" (reset) but it didn't seem to change the output from "nvclock -i -f"
|
|
|
Temujin
|
ok, I've gone for GPU @ 550 and Memory @ 850
shader is still at 1700 though, anyone know how to adjust that?
edit
oops, nope, it's dropped down to 1566 MHz
I'll have a play around with GPU & Mem settings |
|
|
|
My 8800GT is factory overclocked; it should be 600MHz Core, 1500MHz Shader & 1800MHz Memory.
However it runs at 700MHz Core, 1700MHz Shader & 2000MHz Memory.
And it's the computer that's so far been 100% stable.
I do however have the fan set to 100% on it.
*EDIT*
oops forgot it was Quad systems having probs not dual core ones *ahem* |
|
|
|
Ok, that's your Dualcore without problems, but the Quadcore has also some WUs with errors...
Any word from GDF or MJH on whether these errors only appear on Quadcore computers?
____________
pixelicious.at - my little photoblog |
|
|
GDF
|
I don't see any reason why quad cores could create problems.
GDF |
|
|
|
Thing is, looking at the Top Computers every single Quad Core has them :(
Need someone with an AMD Quad to join to see if that also has the same probs.
|
|
|
|
Ok, let's look a little bit further at the top hosts list...
Computers with computation errors:
stefan@home Intel Quad Core
stefan@home Intel Quad Core
JG4KEZ(Koichi Soraku) Intel Quad Core Xeon
sneakysaurus Intel Quad Core
Anonymous user Intel Quad Core
UBT-NaRyan Intel Quad Core
stefan@home Intel Quad Core
Anonymous user Intel Dual Core !
[AF>Linux>Gentoo] elgrande71 Intel Dual Core !
Computers without computation errors:
UBT-NaRyan AMD Dual Core
OK, these are all the GPU computers I could find (except the computer of GDF, which I have excluded because, although he also had computation errors, they were dated before the official start of gpugrid)... It seems the errors are not only related to Quad Cores, but the only computer without errors is an AMD Dual Core...
Don't know what it means, but at least all Intel computers show these errors...
____________
pixelicious.at - my little photoblog |
|
|
Temujin
|
[quote]got 3, 2 normal & 1 shortie, running the shortie now[/quote]
The shortie turned out to be not a shortie :(
1st one crashed again but that was before I underclocked the card.
Current WU is now 12 hours in with an estimated 1.5 hours to go.
|
|
|
|
The shorties have the name "xxxxxxx-FASTTEST-x-x-xxxxx" and for me it took about 47 Seconds to complete for 1.99 credits :)
____________
Down with the Kredit Kops!!! |
|
|
Temujin
|
[quote]The shorties have the name "xxxxxxx-FASTTEST-x-x-xxxxx" and for me it took about 47 Seconds to complete for 1.99 credits :)[/quote]
I didn't get any of them then :(
|
|
|
Temujin
|
I think I may have sorted my GPU problems.
I run my own little boinc stats database for my team and mysqld gets a tad busy during updates for about 10 minutes.
I'm also in the process of upgrading my machine from Fedora 7 to Fedora 9.
To do that, I've borrowed a machine from work and have moved the stats over to that machine while I do the upgrade.
I'm waiting a couple of days before upgrading the main machine until I'm sure everything is running ok on the temp machine.
Since moving the stats, all GPU WUs have completed successfully.
Ok, it's only done 2 and a bit WUs but it's never managed that before; I was lucky if I had 1 in 5 succeed, and never 2 sequentially.
It's a bit early to claim 100% victory but it's looking good, maybe, touch wood :D |
|
|
|
I just got another error on my quad. Task ID 39247
Last one I had on the Quad was 4 days ago, so can't moan about it.
And the AMD dual core is still plodding along as happy as can be :)
____________
Down with the Kredit Kops!!! |
|
|
|
Hi guys! A couple of error types to report.
I loaded up two boxes with 8800GT's (the first box ran a 8600GT as a test bed for a couple of days).
One box (this one) seems a bit unstable: lots of compute errors. I've reduced the clock using nvclock as discussed earlier in the thread and we'll see what happens.
On the other box, I have some missing libcudart.so errors. Was there a fix found for the missing libcudart.so discussed earlier? This host seems to do that on every second WU - immediately it tries to start up the second WU after completing the first, it fails for missing libcudart. I've checked and the file is present in the /projects/ps3grid.net folder, so stumped really.
Both boxes are running ubuntu 8.04, with 8800GT's fed from dual core E4300's, which have been fine on boinc up till now. |
|
|
GDF
|
We have a workaround for the missing libcudart problem. It seems to happen in a strange way, which we cannot replicate on our Fedora box.
The workaround is to install the Nvidia toolkit (on the same page as the driver) and put the following command in your .bashrc file:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib
This should not be needed in the future. I will keep you updated when we find a solution.
gdf
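The export line above appends unconditionally, so re-sourcing .bashrc keeps growing the variable, and if LD_LIBRARY_PATH starts out empty the result begins with a colon (an empty entry, which the loader normally treats as the current directory). A slightly more defensive sketch of the same workaround; `append_lib_path` is a hypothetical function name, and /usr/local/cuda/lib is simply the path quoted above.

```shell
# Hypothetical helper: add a directory to LD_LIBRARY_PATH only if it is
# not already listed, without leaving a leading colon when it was empty.
append_lib_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir" ;;
    esac
    export LD_LIBRARY_PATH
}

# Possible usage in .bashrc:
#   append_lib_path /usr/local/cuda/lib
```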
|
|
|
|
Thanks for your tip, I'll keep a watch on the boxes and patch if needed.
I reduced the clocks slightly on the machine that appeared to have a stability problem, it seems fine now. The other machine has not encountered a libcudart.so fail since Saturday evening.
____________
|
|
|
Temujin
|
Hi
I had a WU running for ~9 hours when the benchmarks kicked in.
The WU resumed after the benchmarks and promptly failed with error 1
I know my card isn't the most stable but this failure looked to be caused by the benchmarks.
If a benchmark is due within the estimated WU run period is there any way that the benchmark could be run before starting the WU? |
|
|
GDF
|
The only thing we saw is that you were first running with the factory default frequency
# Clock rate: 1674000 kilohertz
After the restart, you were running with the underclocked values and crashed
# Clock rate: 1350000 kilohertz
It would have been less surprising the other way round...
gdf |
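To compare the clock the application saw before and after a restart, the "# Clock rate" lines can be pulled out of a task's stderr text. A minimal sketch, assuming the stderr has been saved or piped in with lines in the format quoted above; `clock_rates_mhz` is a hypothetical name.

```shell
# Hypothetical helper: read task stderr on stdin and print each reported
# clock rate converted from kilohertz to MHz, in order of appearance.
clock_rates_mhz() {
    awk '/^# Clock rate:/ { print $4 / 1000 }'
}

printf '# Clock rate: 1674000 kilohertz\n# Clock rate: 1350000 kilohertz\n' | clock_rates_mhz
```

Two different values in the output for one task would indicate that the clock changed between the start and the restart, as in the case described above.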
|
|
Filipe
Joined: 28 Apr 08 Posts: 3 Credit: 1,994,582 RAC: 0
|
Hi all,
keep up the good work !! and look forward to progressing with the first boinc project utilizing gpu's.
I've started 4 gpu test work units and 3 have failed (compute error), system details below.
1 x 9600GT slightly OC'ed card (by manufacturer)
Ubuntu hardy 8.04
cuda driver 177.13
client 6.3.5
Pentium D processor
out of 4 units 3 units failed with a stderr error of
"process exited with code 1 (0x1, -255)"
but the 4th work unit passed with exit code 0(0x0).
The 3 failed units ran for 35k, 384 (yes, small) and 63k CPU seconds before the compute error. Now I know that the 1st failure occurred when, at 35k, I hit the sleep button (ridiculously placed on the keyboard); when the system came out of sleep mode the compute error occurred. The others failed on their own... really. So my questions are below, and please excuse any that may seem simple with Linux, as I'm still picking it up after a few years off and getting used to Ubuntu differences.
1) It seems to me that the logging of actual GPU results as they are processed (or at set points) is a little buggy: after sleep mode it just didn't restart, and for the 384-second unit I actually closed the BOINC manager normally and opened it again, and the compute error happened. So does the data get stored correctly in this beta version as it's being processed? Or are there known issues here? Or haven't I done something right?
2) Do I need to set the BOINC files location in my path or lib path? I have to click the BOINC client and then the BOINC manager to start the project; clicking the BOINC manager by itself doesn't start the process. That's where I had "connection refused" coming up, before I realised that running the client created the files needed for the BOINC manager to connect to client 6.3.5 and start processing.
3) My dual core CPU indicates that both cores are at 90-100% all the time. I saw from other posts that only 1 core should indicate close to 100% because of polling. No other apps were running.
4) In the messages section of the BOINC manager it says it can't download any more files because of a 1 CPU limit. I guess this should say 1 GPU limit? Or if it is a CPU limit, that could explain 3) above too.
Other comments
After a compute error the claimed credit says 1987.41 no matter what the computation time was before the error occurred, i.e. 384s or 63,000s. I'm not too worried if I don't get any credit for these errors, as really this is all about science and testing a new system for future advancement... but it's fun :-)
Cheers,
Fil.
p.s sorry about any spelling mistakes...it's late and i'm going to sleep now.
|
|
|
GDF
|
1) Restart is usually rock solid. What went wrong here is that the sleep mode caused the GPU program to crash and report a compute error, so the client did not even try to restart. Even using the desktop heavily could cause the application to crash, as the display has priority over CUDA runs.
2) There is a problem when running the manager directly; this is a BOINC problem related to SELinux that was present even before the GPU application. Workaround: first start the client with boinc -daemon and then the graphical interface.
3) We use only 1 CPU to sync. What are the processes using the other one?
4) At the moment, boinc checks on the number of cpus. If you guys collect in a thread proven BOINC issues I will present them to D. Anderson when he comes to Barcelona in September.
gdf |
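The workaround in 2) can be wrapped in a small launcher. This is only a sketch under assumptions: `boinc -daemon` comes from the post above, while the manager command name `boincmgr`, the 5 second pause and the function name `start_boinc` are assumptions for illustration and may differ on a given install.

```shell
# Hypothetical launcher: start the BOINC client as a daemon first, give it
# a moment to come up, then start the graphical manager.
# Both command names can be overridden; the defaults are assumptions.
start_boinc() {
    client_cmd=${1:-boinc}
    manager_cmd=${2:-boincmgr}
    "$client_cmd" -daemon || return 1
    sleep "${BOINC_START_DELAY:-5}"
    "$manager_cmd" &
}

# Possible usage:
#   start_boinc
```

The pause is there because the manager can only connect once the client is listening; a fixed delay is the crudest way to allow for that.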
|
|
Temujin
|
Ahh, that's interesting.
I upgraded my machine to Fedora 9 yesterday, installed 173 cuda drivers, downloaded a WU and it crashed straightaway.
I checked nvclock and the card had reset to factory overclock, so I underclocked the card back down to 450 & 800 and then downloaded the WU in question.
The NVidia X utility and nvclock both reported the lower clock values before that WU started, but boinc saw the higher values until after the benchmarks, almost 9 hours later ??
me=puzzled :D
|
|
|
Filipe
|
Thanks gdf for your comments,
In relation to point 3), the client seems to be using 48-50% of my dual core CPU, as reported by the processes list of the system monitor in Ubuntu.
However, I previously mentioned that 100% of both cores was being used. This is true when viewing the resources animation tab, which reports both at 100% (most of the time). Looking on the web, I found that there is a bug with the compiz plugins 0.7.4 for Ubuntu (Hardy 8.04 included); actually there is a whole list of issues to be addressed with the compiz plugins, as reported at
https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/218726
I unloaded all the compiz plugins but I still get the problem. I guess it's not a ps3grid/BOINC issue but an Ubuntu/Linux issue.
cheers,
Fil.
[quote]Hi all,
1) Restart is usually rock solid. What went wrong here is that sleep mode caused the GPU program to crash and report a compute error, so the client did not even try to restart. Even using the desktop heavily could cause the application to crash, because the display has priority over CUDA runs.
2) There is a problem when running the manager directly; this is a BOINC problem related to SELinux that was present even before the GPU. Workaround: start the client first with boinc -daemon, and then start the graphical interface.
3) We use only 1 CPU to sync. What are the processes using the other one?
4) At the moment, BOINC checks the number of CPUs. If you guys collect proven BOINC issues in a thread, I will present them to D. Anderson when he comes to Barcelona in September.
gdf[/quote]
|
|
|
|
Hi, I just added my AMD Quad to the mix.
Nothing here OC'd
Evga 8800GS
AMD 9550
Gigabyte GA-M78SM-S2H mb, with 4 gig ram
Ubuntu 8.04
Boinc 6.3.5
Nvidia 173.14 drivers
I have been running other Boinc project WU's for about a week, just to test out the setup (with Boinc 6.3.5), and everything has been fine until today...
I just got it running on PS3Grid a few minutes ago, and went through 4 failed WUs with this error:
"process exited with code 193 (0xc1, -63)"
I'm not sure what is going on. And I am a Linux noob....
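A side note on reading that message: the three numbers in "code 193 (0xc1, -63)" are one and the same byte, shown as unsigned decimal, hex, and signed 8-bit decimal. The Python sketch below only reproduces that arithmetic. The function name is made up for illustration, and it says nothing about what the code actually means to BOINC or to the acemd application.

```python
def describe_exit_code(code: int) -> str:
    """Show one exit status byte as unsigned decimal, hex, and signed 8-bit.

    Hypothetical helper for illustration only; it is not part of BOINC.
    """
    unsigned = code & 0xFF  # keep only the low 8 bits
    # Reinterpret the same byte as a signed (two's complement) value.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return f"{unsigned} (0x{unsigned:x}, {signed})"


print(describe_exit_code(193))  # 193 (0xc1, -63)
```

So the -63 is not a second error, just another view of 193.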
____________
Consciousness: That annoying time between naps......
Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.
|
|
|
|
Ooops...Here are the tasks...
http://www.ps3grid.net/results.php?hostid=5914
____________
Consciousness: That annoying time between naps......
Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.
|
|
|
|
I had one of these overnight too.
Running on a 3GHz P4, not overclocked, with Ubuntu 8.04, a slow 8600GT and the 6.3.8 client.
Link to WU: http://www.ps3grid.net/result.php?resultid=42123
It failed right at the start of the run. I noticed some error messages in the message tab - Sorry, I'm not at the box now, so I can't be certain, but it was something like "result file not found". I'll post the exact message later today if anyone needs it.
/edit/ I should add that other boxes were converted over to 6.3.8 last night without issue.
____________
|
|
|
|
I installed the new 6.3.8 over the 6.3.5 version. It runs PrimeGrid WUs fine. I'll try PS3Grid WUs again this morning.
____________
Consciousness: That annoying time between naps......
Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.
|
|
|
|
So after some tests, I think I have a problem, but I can't solve it.
Only one WU has finished without problems!!
All the others ended with errors.
I don't understand why, but sometimes, after BOINC finishes a WU for another project and starts PS3Grid, my PC freezes!
And not always at the same time.
Yesterday it was near 2H. But I saw it this morning!
And when I restart, the WU is dead with computation errors!
As I can't be in front of the PC all the time, I think I will not try to crunch another WU. And after 4 WUs with errors, I can't download another one.
If I try other projects, I don't have any problems. So I think it's not Ubuntu that has a problem. I don't use any applications on it. It is just for testing PS3Grid and the GPU.
For information:
Ubuntu 8.04.1
Nvidia 177.13
Boinc 6.3.8
Q9450 @ 3.4GHz
GTX 260 not OC
8GB DDR2
Maybe I will try one more time, but I am eagerly waiting for the Windows version.
I don't have any problems with Windows, and I continue to think that Linux is not so stable. |
|
|
|
I found on one of my 8800GT boxes that a heavy O/C on the CPU made it unstable on this project. It has been 100% stable on CPU projects before. I reduced the O/C on the CPU by about 10% and it has crunched without fail since.
Hope that helps.
____________
|
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0
|
I found on one of my 8800GT boxes that a heavy O/C on the CPU made it unstable on this project. It has been 100% stable on CPU projects before. I reduced the O/C on the CPU by about 10% and it has crunched without fail since.
Hope that helps.
What is O/C?
gdf |
|
|
|
Sorry.....O/C = OverClock
His CPU is heavily overclocked and on one of my machines that made the GPU card unstable on this project. Just a 10% reduction in the CPU overclock brought it back - might be the same for him?
____________
|
|
|
|
OK, I tried changing the default GPU (550) and MEM (800) settings to 500 and 700. I picked up 2 new WUs (http://www.ps3grid.net/results.php?hostid=5914). So far the 1st WU has been running for ~10 minutes. I also have another BOINC project running on the other 3 CPUs.
If this works, I will bump up the settings a bit.
Time will tell....
Evga 8800GS
AMD 9550
Gigabyte GA-M78SM-S2H mb, with 4 gig ram
Ubuntu 8.04
Boinc 6.3.5
Nvidia 173.14 drivers
____________
Consciousness: That annoying time between naps......
Experience is a wonderful thing: it enables you to recognize a mistake every time you repeat it.
|
|
|
|
So after some tests, I think I have a problem, but I can't solve it.
Only one WU has finished without problems!!
All the others ended with errors.
I don't understand why, but sometimes, after BOINC finishes a WU for another project and starts PS3Grid, my PC freezes!
And not always at the same time.
Yesterday it was near 2H. But I saw it this morning!
And when I restart, the WU is dead with computation errors!
As I can't be in front of the PC all the time, I think I will not try to crunch another WU. And after 4 WUs with errors, I can't download another one.
If I try other projects, I don't have any problems. So I think it's not Ubuntu that has a problem. I don't use any applications on it. It is just for testing PS3Grid and the GPU.
For information:
Ubuntu 8.04.1
Nvidia 177.13
Boinc 6.3.8
Q9450 @ 3.4GHz
GTX 260 not OC
8GB DDR2
Maybe I will try one more time, but I am eagerly waiting for the Windows version.
I don't have any problems with Windows, and I continue to think that Linux is not so stable.
I have had the same problems with a GTX260 and the 177.13 drivers!
It seems these are driver-related problems...
I already sent a bug report to NVIDIA a few weeks ago, but got no answer, so I had to install Vista on this PC to crunch for Folding@home...
Hope there will be a Windows version of the GPUGRID application very soon! ;-)
____________
pixelicious.at - my little photoblog |
|
|