Message boards : Number crunching : NOELIAs are back!

tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29705 - Posted: 6 May 2013 | 15:52:37 UTC

Just got one. Glad I have a 1GB GPU 'cause it's using 874MB of its memory!
____________

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 11
Message 29706 - Posted: 6 May 2013 | 18:32:48 UTC

Have been getting many. None completed as yet, but so far no problems. Are these the same type as before?

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 29708 - Posted: 6 May 2013 | 20:41:43 UTC

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. They are not using all the hardware.

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Message 29709 - Posted: 6 May 2013 | 22:00:51 UTC - in response to Message 29708.

After over 4 hours execution with 12 hours estimated runtime to completion, I aborted one. The second one is showing the same issues. Estimated runtime to completion is rising with almost every second of execution.

The GPU is hardly working; its temps are lower than the CPU's, which is unusual.

The process is using 30-35% of GPU memory (12% display, balance for computing).

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29711 - Posted: 6 May 2013 | 23:46:31 UTC - in response to Message 29709.
Last modified: 6 May 2013 | 23:50:10 UTC

I suspended my Nathan WUs to have a look at one of these NOELIA_klebe WUs:

    GPU power 91%
    temperature 67C
    GPU usage 94% (with CPU tasks suspended)
    Fan speed 77%
    Core clock 1202MHz

CPU usage is fairly low; it looks like being ~44% of the GPU's runtime.
3% complete in just over 18min14sec, so the estimated run time is 36,466sec (just over 10h on a GTX660Ti in a W7x64 system). That's around twice as long as Nate's recent WUs.
When I allowed CPU tasks to run again (only 67% of the CPU in use overall) the GPU usage dropped to ~88%, and the power dipped to around 89%. With no CPU usage the GPU usage was a straight line. With 67% of the CPU in use the GPU usage is jagged. Even when I dropped the CPU to 50% in Boinc Manager the GPU usage graph remained jagged, but at ~92% GPU usage on average.
With CPU set to 25% the GPU usage line was reasonably straight, with some downward spikes.

When I set the CPU usage to 100% the graph started going all over the place: it initially dropped to 76%, spiked up to 91%, and fluctuated more than before, but mostly stayed between 80% and 90% GPU usage. While I only saw a drop from 1202MHz to 1189MHz, I expect the GPU usage on other systems would fluctuate more depending on the CPU/GPU setups. So I think running at 100% CPU usage could well result in the GPU downclocking when running these WUs.
I will let it run with only 1 CPU core being used, to get an accurate idea of performance with optimal settings on the 660Ti.
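
The run time estimate above is a straight linear extrapolation from the progress so far. A minimal Python sketch of that arithmetic, assuming progress stays linear over the whole WU (the function name is made up for illustration):

```python
def estimate_total_runtime(elapsed_seconds: float, fraction_done: float) -> float:
    """Linearly extrapolate the total run time from elapsed time and progress."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds / fraction_done


# 3% complete after 18min14sec, as reported in the post
elapsed = 18 * 60 + 14  # 1094 seconds
total = estimate_total_runtime(elapsed, 0.03)
print(f"estimated total: {total:,.0f} s (~{total / 3600:.1f} h)")
```

If the task speeds up or slows down later in the run, the real figure will drift away from this estimate.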
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,737,031,722
RAC: 571,307
Message 29712 - Posted: 7 May 2013 | 1:57:36 UTC

Just finished my first NOELIA_klebe workunit on my GTX 670 (W7)! Nothing unusual: runtime 34480.037 s, credit 127,800.00. That is similar to NOELIA_PEPTGPRC (29748.680 s, 113,250.00), the last/oldest NOELIA WU I was able to find in my records (after that I only had NATHANs). The credit/time ratio is about the same within NOELIAs, and hey, it is not all about credits, so there is no reason to abort just because NATHANs do better on credits; within GPUGRID it will level out. So I am quite happy with these new NOELIAs, as long as the WUs don't start to crash down the road.

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Message 29714 - Posted: 7 May 2013 | 2:53:37 UTC - in response to Message 29712.

I have a short run NOELIA klebe in progress on a 4-core AMD with 2 boincsimap tasks and wuprop running. CPU use is low in comparison to NATHAN: it varies between 15% and 100% depending on the core, but the CPU is never saturated.

After 4 hours, it is 10.7% complete and BOINC 7.0.28 on Ubuntu 13.04 forecasts completion after another 9 hours 54 minutes. This forecast is rising with every few seconds of execution. Nvidia driver version is 319.17. All updates to Ubuntu are applied daily. This is on a GTX660Ti that processes NATHANs in 5 hours 15 minutes.

This wu will NOT finish before I have to go to work in 5 hours so I will report results when I get back home after work.

I did not want to abort the workunits, but the Long Runs were working so slowly that I did not expect them to complete in less than 24 hours. This Short Run may take nearly that long to execute.

I personally hope that this batch of NOELIA klebe are exhausted soon.

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Message 29715 - Posted: 7 May 2013 | 7:33:18 UTC - in response to Message 29714.

After 9 hours of running, it is only 18% complete. Remaining time is still climbing.

I will come back to GPUGrid in a few days once these wu are gone. ACEMD for Linux cannot run these tasks correctly.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29717 - Posted: 7 May 2013 | 7:52:50 UTC

It might be better or even necessary to go for straight usage of one core, as the Nathans do. Freeing half a core for such a performance hit (on fully loaded systems) doesn't seem worth it.

MrS
____________
Scanning for our furry friends since Jan 2002

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29719 - Posted: 7 May 2013 | 9:24:47 UTC - in response to Message 29717.
Last modified: 7 May 2013 | 10:25:35 UTC

It might be better or even necessary to go for straight usage of one core, as the Nathans do. Freeing half a core for such a performance hit (on fully loaded systems) doesn't seem worth it.

MrS

Agreed.

Task: 148px47x4-NOELIA_klebe_run-0-3-RND5398_0 (ID 4431950)
Sent: 6 May 2013 | 20:26:40 UTC; reported: 7 May 2013 | 9:18:47 UTC; status: Completed and validated
Run time: 36,452.79 s; CPU time: 16,142.69 s; credit: 127,800.00

Task: I32R6-NATHAN_dhfr36_5-21-32-RND2187_1 (ID 4426158)
Sent: 5 May 2013 | 23:09:30 UTC; reported: 6 May 2013 | 8:22:25 UTC; status: Completed and validated
Run time: 18,944.86 s; CPU time: 18,944.86 s; credit: 70,800.00
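
The ~44% CPU figure quoted earlier falls straight out of the run time and CPU time of the two results above. A rough Python sketch of that ratio (the helper name is hypothetical; the numbers are the ones from the two results):

```python
def cpu_fraction(cpu_time_s: float, run_time_s: float) -> float:
    """Share of the task's wall-clock run time during which it used a CPU core."""
    return cpu_time_s / run_time_s


# (run time in s, CPU time in s) taken from the two results listed above
tasks = {
    "NOELIA_klebe": (36452.79, 16142.69),
    "NATHAN_dhfr36": (18944.86, 18944.86),
}

for name, (run_time, cpu_time) in tasks.items():
    print(f"{name}: CPU time is {cpu_fraction(cpu_time, run_time):.0%} of run time")
```

A value of 100% means the task kept a full core busy for the whole run, which is the behaviour MrS describes for the Nathans.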

I went into app_config and set the cpu_usage for GPUGrid to 1.0 to see what impact this had on running a NOELIA_klebe WU, with the CPU usage set in Boinc Manager to 75% (the most I typically use with 2 GPU's in the system):
GPU power 88%
temperature 65C
GPU usage ~90% still with some variation
Fan speed 74%
Core clock at 1189MHz (this dropped yesterday from 1202MHz and hasn’t risen yet).

There is still a 4% loss, going by GPU usage, but probably more if I ran an entire WU with these settings.
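
For anyone wanting to try the same cpu_usage setting, here is a hedged Python sketch that writes out an app_config.xml of the general shape BOINC documents. The app name "acemdlong" is only an assumed placeholder (check client_state.xml for the real short name of the app), and the function name is made up for illustration:

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, cpu_usage: float, gpu_usage: float = 1.0) -> str:
    """Return app_config.xml text reserving cpu_usage cores per GPU task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "acemdlong" is an assumption, not confirmed anywhere in this thread
xml_text = build_app_config("acemdlong", cpu_usage=1.0)
print(xml_text)
```

The resulting text would be saved as app_config.xml in the GPUGrid project directory. As far as is known, the client only picks it up on a restart or when told to re-read its config files, so treat this as a starting point rather than a tested recipe.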

I also suspended and restarted Noelia's task, used snooze GPU, and closed and reopened Boinc without problems.

My WU returned in just over 10h, as expected, but might have dipped below 10h if I hadn't been suspending and restarting...

-
With SWAN_SYNC set to 0 (and still using app_config) it didn’t make much difference:
GPU power 88% (no change)
temperature 65C (no change)
GPU usage ~90% still with some variation but very occasionally rising to 94% and staying at that for a few seconds.
Fan speed 74%
Core clock at 1189MHz (no change).
Actual progress was about the same. 1% after just over 6min.

Got a driver error when I exited Boinc. That suggests to me that the WU isn’t closing down gracefully. Maybe Boinc is to blame there.

Did a restart and removed app_config file, but left SWAN_SYNC in place. No change. Opened the init_data.xml file in the slot that the WU was running and it's still set to <ncpus>1.000000</ncpus>. So there's that theory confirmed. Once <ncpus> is set then that's it fixed for the duration of the run.
When I suspended the WU I got another driver restart.
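
Checking the <ncpus> value in a slot's init_data.xml can also be scripted. A small Python sketch, assuming the file is well-formed XML; the sample below is a trimmed stand-in (a real init_data.xml holds many more elements, and the root element name here is an assumption):

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for a slot's init_data.xml
SAMPLE_INIT_DATA = """<app_init_data>
<ncpus>1.000000</ncpus>
</app_init_data>"""


def read_ncpus(xml_text: str) -> float:
    """Return the <ncpus> value found anywhere below the root element."""
    root = ET.fromstring(xml_text)
    node = root.find(".//ncpus")
    if node is None or node.text is None:
        raise ValueError("no <ncpus> element found")
    return float(node.text)


print(read_ncpus(SAMPLE_INIT_DATA))
```

To use it on a real task, read the file from the slot directory the WU is running in and pass its contents to read_ncpus().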

-
Before my previous restart I had set <max_concurrent>2</max_concurrent> in app_config (thinking this would limit GPUGrid's WU cache to 2). Then I removed the app_config file from the project directory, aborted 2 WU's and restarted. I now have 4 WU's in the queue, but only 1 NVidia GPU. Either the number of WU's has been fixed in relation to time due to app_config (1 day, which is unlikely given that the WU's are 10h each and Boinc is saying that), and you have to keep using app_config always (or do a project reset to properly flush it out of the system), or it's in some way related to the GPU count and ignores the fact that the other GPU is an ATI. Either way, having to do a project reset just to make configuration changes is far from ideal.
Anyway, if anyone is planning to do a project reset because they were using app_config, I suggest you choose no more tasks, finish any running WU's and abort any queued WU's, then do the project reset. The project doesn't resend tasks after a project reset, so by aborting the unstarted WU's they won't be in limbo until they time out, and can be resent early (better for the project). If you don't, they will appear as In Progress online for two weeks.

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 29720 - Posted: 7 May 2013 | 10:23:37 UTC
Last modified: 7 May 2013 | 10:26:07 UTC

Hi, everyone:

I believe my system may not be running at optimal settings. Can someone please advise this non-technical user how to achieve maximum system usage?

Many thanks!

I get the following results with my AMD A10 5800K and GTX 650 Ti GPUs processing NOELIA tasks. I am also running three WCG SN2S tasks. I do not have an app_config file.

Device 0

temperature 46C
CPU usage 0.61
GPU usage ~70 -85% fluctuating
Fan speed 54%
Core clock at 979.8 MHz

Device 1

temperature 54C
CPU usage 0.61
GPU usage ~66 -92% fluctuating
Fan speed 46%
Core clock at 979.8 MHz

Running NATHAN tasks I was averaging 33,400 sec with 70,800 credits. The NOELIA tasks completed in about 67,700 sec, returning 127,800 credits.
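
To put those two results on the same footing, the credit can be expressed per hour of run time. A short Python sketch using only the figures above (the function name is made up for illustration):

```python
def credit_per_hour(credit: float, run_time_s: float) -> float:
    """Credit earned per hour of run time."""
    return credit / run_time_s * 3600


nathan = credit_per_hour(70_800, 33_400)
noelia = credit_per_hour(127_800, 67_700)

print(f"NATHAN: {nathan:,.0f} credits/h")
print(f"NOELIA: {noelia:,.0f} credits/h")
print(f"NOELIA relative to NATHAN: {noelia / nathan:.2f}")
```

On these numbers the NOELIA tasks earn somewhat less per hour than the NATHAN tasks on this setup, which fits the lower GPU usage reported for them.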

John

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29721 - Posted: 7 May 2013 | 10:29:19 UTC - in response to Message 29720.
Last modified: 7 May 2013 | 10:30:08 UTC

By running fewer than 3 CPU tasks on your triple core CPU you should see some improvement in GPU usage (at least for these NOELIA WU's). If you don't run any CPU projects you will see the most improvement, but what you run is a personal choice. I would probably run one CPU WU at a time with your setup, or just not use the CPU; your NVidia is vastly more powerful than your AMD CPU.

dskagcommunity
Joined: 28 Apr 11
Posts: 460
Credit: 842,648,730
RAC: 1,654,231
Message 29723 - Posted: 7 May 2013 | 12:27:17 UTC
Last modified: 7 May 2013 | 12:30:12 UTC

Still computing the first one, but it seems to run a normal 12h on my 560Ti 448 cores @ 98% GPU load and 2-6% Pentium 4 load :)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 29725 - Posted: 7 May 2013 | 12:53:01 UTC - in response to Message 29705.
Last modified: 7 May 2013 | 12:59:42 UTC

Just got one. Glad I have a 1GB GPU 'cause it's using 874MB of its memory!

Maybe that's why they're locking up my 4 GTX 460 768MB cards. Can Nathan and/or Toni perhaps help with designing these WUs? Please? Please?? Please???

Though they walk through the valley of the shadow of death, my GTX 460s fear no evil EXCEPT these crappy NOELIA WUs. Guess the 4 of them are off to other projects :-(

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29729 - Posted: 7 May 2013 | 14:11:08 UTC - in response to Message 29725.

If WU's are going to use more than 750MB GDDR, some crunchers need an easy way to select what tasks they run, or better still the server would allocate WU's based on the amount of GDDR the cards have (might be problematic for people with multiple mixed series cards). No point sending out work that will never complete.

If you're only going to send such tasks in the Long queue then it would be useful if there were alternative tasks in the short queue, and that this was announced.

The main GPU's that would suffer from this problem are the GTX460, GTS450, GT440, and some entry level laptop GPU's. It doesn't really affect the GeForce 500 or 600 cards, only the 400 series. However, these cards are CC2.1, just like others in the 400 and 500 series.

Consider Short, Long and Extra-Long queues, though that might impact on other plans.

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 29730 - Posted: 7 May 2013 | 15:18:04 UTC

They start to crash on all machines. Lost like 50hr of processing since yesterday. Uncool to say the least.

tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29731 - Posted: 7 May 2013 | 16:36:56 UTC - in response to Message 29730.

They start to crash on all machines. Lost like 50hr of processing since yesterday. Uncool to say the least.

There's quite a lot of doom & gloom over this latest batch of NOELIAs.
My own experience, on an oc'd GTX 460 1GB, is positive; one completed in under 18 hours with 127,800 credit, and one is now running, with almost six hours elapsed and 12+ hours remaining.

____________

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 29732 - Posted: 7 May 2013 | 16:48:58 UTC - in response to Message 29731.
Last modified: 7 May 2013 | 16:54:57 UTC

They start to crash on all machines. Lost like 50hr of processing since yesterday. Uncool to say the least.

There's quite a lot of doom & gloom over this latest batch of NOELIAs.
My own experience, on an oc'd GTX 460 1GB, is positive; one completed in under 18 hours with 127,800 credit, and one is now running, with almost six hours elapsed and 12+ hours remaining.

An experience of 1 is not much experience. Firehawk is having crashing problems on his very fast GPUs, sure a few completed there too. They don't seem to run at all on GPUs with under 1GB ram. Another user posted above that they will not run on Linux on his 660 Ti. Other people are also having crashes. Why are we inflicted with these WUs without testing or warning? It's ridiculous IMO. It's a waste of our resources and our money.

neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,000,351,335
RAC: 2,641,140
Message 29733 - Posted: 7 May 2013 | 17:51:06 UTC - in response to Message 29732.

They start to crash on all machines. Lost like 50hr of processing since yesterday. Uncool to say the least.

There's quite a lot of doom & gloom over this latest batch of NOELIAs.
My own experience, on an oc'd GTX 460 1GB, is positive; one completed in under 18 hours with 127,800 credit, and one is now running, with almost six hours elapsed and 12+ hours remaining.

An experience of 1 is not much experience. Firehawk is having crashing problems on his very fast GPUs, sure a few completed there too. They don't seem to run at all on GPUs with under 1GB ram. Another user posted above that they will not run on Linux on his 660 Ti. Other people are also having crashes. Why are we inflicted with these WUs without testing or warning? It's ridiculous IMO. It's a waste of our resources and our money.

I too have had a less-than-ideal experience with these new Noelia workunits. After weeks of successfully completing Nathans, my stable GTX 680 cruncher failed 17 Noelias in a row before I detected the issue and rebooted the box. I am still awaiting expiration of the WU limit timer to confirm whether this cruncher will even be able to complete one of these WUs. I concur with Beyond that this is a very unfortunate waste of resources that might have been avoided with some advance notification.

John C MacAlister
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 29734 - Posted: 7 May 2013 | 18:56:14 UTC

I have had two NOELIA tasks successfully complete with two more half way through....fingers crossed.

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 13,920,977,393
RAC: 423,698
Message 29735 - Posted: 7 May 2013 | 18:56:38 UTC
Last modified: 7 May 2013 | 19:29:03 UTC

Up to now 4 of them on Linux on a 660Ti without problems; the fourth one is about to finish. They take between 11 and 12 hours. Lower ppd than Nathan's, but better for the summer as they stress the GPUs less :)

Edit: crunching times seem to be improving; maybe they were that high due to the problems I'm having with WCG CEP unit uploading. I will report once I have additional data.

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 11
Message 29736 - Posted: 7 May 2013 | 19:00:21 UTC

Have processed 40 Noelias so far without error. I've had 15 Nathans fail in the last 7 days!

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29737 - Posted: 7 May 2013 | 19:03:45 UTC

I've completed 21 NOELIA's so far and have had 0 errors.

tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29738 - Posted: 7 May 2013 | 19:08:03 UTC - in response to Message 29732.

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

____________

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 29739 - Posted: 7 May 2013 | 19:30:43 UTC - in response to Message 29738.

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

So because they run for maybe 1/2 the people (looks like mainly XP and Linux, although I did so far have 1 finish in Win7/64), all is OK with you? I remember VERY well how loudly you screamed when YOU were having problems with some WUs.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29740 - Posted: 7 May 2013 | 19:35:28 UTC

Guys, let's be fair here: the issues with this run seem to be fewer than with previous Noelias. Whether the job was fit for the production queue is up for debate, but don't assume they didn't do any internal testing just because errors happen.

And don't assume it's all Noelia's fault and that the others could have avoided these errors easily. She is using features which had previously been implemented but had not been used in production runs before.

@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one (320.something).

@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.

@Beyond: the error message you're getting rather quickly after a WU started on your GTX460 768MB is

swanMakeTexture2D failed -- array create failedAssertion failed: a, file swanlibnv2.cpp, line 59

That looks like the creation / allocation of some array. This might indeed be due to running out of memory. BOINC should prevent this, but the amount of memory needed has to be reported properly by the WU, otherwise BOINC can't do anything about it. I'll forward this to the Devs.
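
The out-of-memory suspicion can be illustrated with a very simple check. This is only a sketch: it assumes the 874MB reported at the top of the thread is roughly what the task needs, which is uncertain since that figure may include display memory, and the function and card labels are made up for illustration:

```python
def fits_in_memory(required_mb: int, card_mb: int, reserved_mb: int = 0) -> bool:
    """True if the task's memory need, plus anything already reserved, fits on the card."""
    return required_mb + reserved_mb <= card_mb


reported_usage_mb = 874  # figure from the first post; treated here as an assumption

cards = {"GTX 460 768MB": 768, "1GB card": 1024}
for card, memory_mb in cards.items():
    verdict = "fits" if fits_in_memory(reported_usage_mb, memory_mb) else "does not fit"
    print(f"{card}: {verdict}")
```

A server-side version of this check would need the WU to report its memory requirement correctly, which is exactly the point raised above.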

@John: your CPU has 2 cores disguised as 4. As SK said, I'd first try to reduce the number of CPU tasks and see how GPU performance improves. Once you've got these numbers you can still decide whether this trade-off is worth it for you.

MrS
____________
Scanning for our furry friends since Jan 2002

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29741 - Posted: 7 May 2013 | 20:29:12 UTC

Hey guys, I wasn't trying to diminish the problems you are having, and I apologize for coming off wrong.

GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 29742 - Posted: 7 May 2013 | 20:48:52 UTC - in response to Message 29740.
Last modified: 7 May 2013 | 21:05:04 UTC



@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.


Well remembered mate. I had to roll back to this driver because the newer ones had way worse temperature control, but since these new units run cooler, I may give it a try before changing projects. Thanks for the heads up.

Update: this seems to have done the trick on the AMD 6x690 machine, which was suffering from poor performance on half of the GPUs. They are warming up right now, which is always a good sign. Will see about the stability in some hours. Thanks again ETApes

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29743 - Posted: 7 May 2013 | 21:18:27 UTC - in response to Message 29742.

I'm presently using 311.06 on W7x64. I've had a few driver resets and app crashes, but so far these appear close to the start of runs (up to 5%), and if I just close and open Boinc, Noelia's WU's restart and crunch away reasonably well.

The last time I had a driver restart, it was when I resumed a suspended Albert CPU WU. As soon as it resumed it suspended the second GPUGrid WU running so it could start. This immediately resulted in a driver restart. However the Noelia WU managed to restart, after the Albert WU finished, without any intervention on my part. When I did several suspends on a WU yesterday I didn’t have any driver problems, but they were further into the run.

I really dislike Boinc suspending GPU WU’s to use a CPU thread. Obviously I won't be running Albert tasks any time soon.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,532,759
RAC: 3,459,749
Message 29744 - Posted: 7 May 2013 | 21:22:44 UTC

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,532,759
RAC: 3,459,749
Message 29745 - Posted: 7 May 2013 | 21:24:20 UTC - in response to Message 29708.

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. They are not using all the hardware.

+1

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29746 - Posted: 7 May 2013 | 21:59:22 UTC - in response to Message 29745.

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.

file swanlibnv2.cpp, line 59 - might be a cuda bug/app issue.
The Cellresize error might be due to OC, or it might be something else/new; I don't recall seeing it before. Probably best to report such errors, just in case.

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. They are not using all the hardware.

+1

Yeah, we think these tasks could perhaps do with using more CPU resources (or possibly be higher priority), but conversely, they are using too much GPU memory for some cards. So, not using enough resources for some and using too much for others. Making everybody happy is a cinch :))

The reported downclocking may depend on setup and OS/drivers.

When I reduce the CPU usage I'm getting reasonably high GPU performance (W7x64). I'm running 4 climate models (for stability ;p) and two GPUGrid WU's. The GTX660Ti is presently using 88% power, 89% GPU utilization, 848MB GDDR5 (for the bigger equations perhaps) and the shaders are up to 1215MHz. The GTX470 is using 93% GPU and 751MB GDDR (smaller equation maybe).

Are there larger and smaller equations in use and are they being used based on GDDR capacity or Compute Capability (CC2.0 small, CC2.1 or above large)?
Perhaps there is some variation in memory usage of these tasks (different WU's use different amounts of memory); I'm seeing 751MB, 801MB and 848MB on different GPU's. Only the 751MB WU is running on a card not used for a display. Again, it might be a coincidence or there might be something in it. Those with 2 GPU's could check.

neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,000,351,335
RAC: 2,641,140
Message 29747 - Posted: 7 May 2013 | 22:28:46 UTC - in response to Message 29740.


@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one

MrS, many thanks for the suggestion! I normally avoid changing my crunching rig config as much as possible once it is stable, but I see that my other rig that has a mix of 650Ti & 560Ti cards running version 314.22 has now completed 4 Noelia WUs. I'll try upgrading the 680 rig as soon as I get home.

Do you tend to keep your crunch platforms on the latest drivers, or do you wait for systematic issues to arise before upgrading?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29749 - Posted: 8 May 2013 | 7:57:05 UTC - in response to Message 29747.

I like to play it half-safe: wait a few weeks to see if others discover any issues with newer drivers, and if not I'll upgrade when I feel like it. And I won't use beta drivers for crunching, unless there's a very good reason to do so.
Going straight for the leading edge can be painful in the BOINC world, it's more like the bleeding edge :p

MrS
____________
Scanning for our furry friends since Jan 2002

neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,000,351,335
RAC: 2,641,140
Message 29757 - Posted: 8 May 2013 | 18:33:57 UTC - in response to Message 29749.

Going straight for the leading edge can be painful in the BOINC world, it's more like the bleeding edge :p

Unfortunately, staying still for the sake of 'stability' seems to be just as bloody...

I've upgraded to version 314.22 WHQL driver, and my GTX680 platform (ID 87170) is still failing the long run WU within 6 seconds, with the following error in the BOINC log:
5/8/2013 10:57:22 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 290px29x1-NOELIA_klebe_run-1-3-RND8276_0 ( - exit code 98 (0x62))
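
When there are many such failures, pulling the task name and exit code out of the log by script can help spot a pattern. A minimal Python sketch, assuming the log lines keep exactly the format shown above (the regular expression and function name are illustrative, not part of BOINC):

```python
import re

LOG_LINE = (
    "5/8/2013 10:57:22 | GPUGRID | [sched_op] Reason: Unrecoverable error for task "
    "290px29x1-NOELIA_klebe_run-1-3-RND8276_0 ( - exit code 98 (0x62))"
)

# Captures the task name, the decimal exit code and the hexadecimal exit code
PATTERN = re.compile(
    r"error for task (?P<task>\S+) \( - exit code (?P<dec>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)


def parse_exit_code(line: str):
    """Return (task name, decimal code, hex code as int), or None if the line doesn't match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("task"), int(match.group("dec")), int(match.group("hex"), 16)


task, dec_code, hex_code = parse_exit_code(LOG_LINE)
print(task, dec_code, hex(hex_code))
```

The two captured codes are the same number in decimal and hexadecimal, so comparing them is a cheap sanity check that the line was parsed correctly.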

No hardware or software config changes have been made to this platform since the new long run WUs were queued. I have no clue why my other rig (ID 137898), on which I am constantly changing GPU setups and which runs hotter than the GTX680 platform, is completing long runs successfully, while 87170 can't even initialize the new long runs.

Any further insights would be greatly appreciated. Thank you in advance for the help and the patience!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 29760 - Posted: 8 May 2013 | 19:40:48 UTC - in response to Message 29757.

I've just seen that you're using an anonymous platform on that rig. There's been an app update in the not too distant past, introducing features which hadn't been used by the previous Nathans, but are being used now.

If you're still running an older app there we'd have an easy explanation.

MrS
____________
Scanning for our furry friends since Jan 2002

flashawk
Send message
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 29761 - Posted: 8 May 2013 | 19:47:00 UTC

When Tomba first started this thread, I jumped up and checked my computers and discovered that I had 3 running. I use Precision X to adjust my EVGA cards, and the 3 NOELIAs that were running had caused those cards' boost clocks to jump up considerably, along with more frequent voltage spikes. Now that all my cards are running these new NOELIAs, I've had to adjust the voltage and core clock on all the cards back down to where I know they are in a safe range.

Every card is different, and I have now turned on K-Boost, which locks all the voltages where I want them to be. I've had pretty good luck with that feature of Precision X, though I did have issues getting it to work in Windows XP Pro x64 because of the lack of SP3. When I installed it, I put it in compatibility mode (both the installation files and the installed program) and it worked. That doesn't mean that any of this will work for anyone else; I just thought it might be worth mentioning, especially how my cards' core clocks jumped up so high on their own. If I hadn't used a utility to adjust the clocks and voltages, these issues might not have shown up at all.

Robert Gammon
Send message
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Level
Lys
Message 29762 - Posted: 8 May 2013 | 19:56:00 UTC - in response to Message 29761.

I tested NOELIAs on Ubuntu 13.04 with Nvidia 319.17 on a GTX660Ti. Progress thru 9 hours of execution of a SHORT RUN, showed less than 20% progress. Further tests with Long Runs indicated a runtime of over 5000 minutes, minimum.

I will wait for these to be exhausted, or for the GPUGrid admins to fix ACEMD for Linux so that it runs these klebe NOELIAs as well as the Windoze version of ACEMD

neilp62
Send message
Joined: 23 Nov 10
Posts: 14
Credit: 8,000,351,335
RAC: 2,641,140
Level
Tyr
Message 29767 - Posted: 9 May 2013 | 1:21:25 UTC - in response to Message 29760.

If you're still running an older app there we'd have an easy explanation.


Yes - I believe you've found the root cause. I adapted an app_info.xml from the forum to run on rig 87170 August last year to permit execution of two low-utilization Paola 3EKO WUs at a time. The cudart32_42_9.dll, cufft32_42_9.dll and acemd.2562.cuda42.dll are dated 6/17/2012, and the tcl85.dll is 11/23/2010. I see from my other rig that app updates occurred on 11/4/2012 and 2/25/2013.

What's the fastest/simplest resolution - delete the current app_info.xml?

Thanks!

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29774 - Posted: 9 May 2013 | 11:13:05 UTC - in response to Message 29767.
Last modified: 9 May 2013 | 12:01:56 UTC

Yes, delete the app_info file and restart Boinc.

While I'm not overly impressed with app_config (confusing terminology, misleading reporting, and the requirement to reset the project to get rid of it), it's probably a better option than building and maintaining app_info files, and it might improve in later versions.

In the past some WU's didn't utilize the GPU to a high extent (<60%). These were mostly Betas and it isn't usually the situation. Although having some non-recommended configurations/setups with your GTX 680 (4095MB) might allow you to eke out some slight overall performance advantage, I can get both Nate's and Noelia's WU's to run at ~90% GPU utilization while running 5 CPU WU's, or 94/95% simply by reducing the CPU usage for CPU projects.
At present I doubt that you will get a significant improvement running two tasks, and you could simply tweak your setup to optimize for GPUGrid throughput. So I don't really see the need for using app_info or app_config.
Also, the very concept of trying to reach 100% GPU utilization with Keplers is a bit dubious; these cards self-adjust towards a GPU-specific power optimum. If your GPU utilization rises but your core clock rate drops, what's the point?

-
Just want to add that I've not had any Noelia WU failures, despite messing about, and my second GPU [GTX470] (operating at PCIE2 x8 in PCIE slot 1 going by GPUZ, the motherboard manufacturer and Afterburner, but GPU (device 0) according to BM) has only used 751MB of GDDR since I installed it (for 3 of Noelia's WU's). The GTX660Ti (top PCIE slot; operating at PCIE3 x8) has used varying amounts of GDDR, from ~800MB to almost 1GB.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

neilp62
Send message
Joined: 23 Nov 10
Posts: 14
Credit: 8,000,351,335
RAC: 2,641,140
Level
Tyr
Message 29780 - Posted: 9 May 2013 | 14:50:37 UTC - in response to Message 29774.

Yes, delete the app_info file and restart Boinc.

Thank you skgiven & MrS - I'm crunching long runs again!

At present I doubt that you will get a significant improvement running two tasks...

I agree completely. I only ran 2 WUs at once with Paola 3EKO long runs. IMHO, every GPUGRID long run WU since has had high enough GPU utilization to justify dedicating the GPU (even a GTX 680) to it alone. And as you can probably tell, I'd really much rather leave my cruncher config alone, and let it do its thing :)

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29783 - Posted: 9 May 2013 | 22:18:08 UTC

I have one Noelia finished in 147,962.58 seconds on a GTX550Ti with driver 314.7.
I now have two running, one on the same card and one on the older GTX285, which is expected to exceed 40 hours.
____________
Greetings from TJ

John C MacAlister
Send message
Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Level
Cys
Message 29784 - Posted: 10 May 2013 | 0:11:18 UTC

I have recently completed seven NOELIA tasks in an average of 69,589 seconds (19.3h) on my GTX 650 Ti GPUs, driver 314.22. CPU is AMD A10 5800K. Two more tasks are running with about 15h to go.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29787 - Posted: 10 May 2013 | 8:12:35 UTC - in response to Message 29784.

Going by that, and I know you both use W7, John's GTX650Ti is 2.12 times as fast as TJ's GTX550Ti. Don't know the CPU usage on both systems though, and that's important.
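
For anyone who wants to redo that comparison, here is a minimal sketch of the arithmetic. It only uses the two run times quoted above (TJ's single task and John's seven-task average); the function and variable names are made up for illustration, and it ignores CPU usage and any difference between individual WU's.

```python
def speed_ratio(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the second run time is than the first."""
    if slow_seconds <= 0 or fast_seconds <= 0:
        raise ValueError("run times must be positive")
    return slow_seconds / fast_seconds


# Run times quoted in the posts above (seconds per NOELIA task).
tj_gtx550ti_seconds = 147_962.58   # one task on TJ's GTX550Ti
john_gtx650ti_seconds = 69_589.0   # average of seven tasks on John's GTX650Ti

ratio = speed_ratio(tj_gtx550ti_seconds, john_gtx650ti_seconds)
print(f"GTX650Ti is about {ratio:.1f}x as fast")
```

The result lands a little above 2.1, in line with the figure quoted above. A single task against a seven-task average is a thin sample, so treat it as a ballpark.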
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29789 - Posted: 10 May 2013 | 8:35:43 UTC - in response to Message 29787.

Going by that, and I know you both use W7, Johns GTX650Ti is 2.12 times as fast as TJ's GTX550Ti. Don't know the CPU usage on both systems though, and that's important.

I use only x% of the CPU, so that 1 core is always free.
The GTX560Ti WU is using 0.49 CPU and the GTX285 WU is using 0.571 CPU.
By the way, both are Vista driven.
My W7 machine has ATI cards, which I will replace with an nVidia 680 or 690 as soon as I have some money left.
____________
Greetings from TJ

Profile dskagcommunity
Avatar
Send message
Joined: 28 Apr 11
Posts: 460
Credit: 842,648,730
RAC: 1,654,231
Level
Glu
Message 29791 - Posted: 10 May 2013 | 9:37:58 UTC

Computed already 9 units, no problems so far :), 42500 secs on (my new 24/7) 570 and 45000 on 560ti 448cores.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 29792 - Posted: 10 May 2013 | 12:10:23 UTC - in response to Message 29787.

Going by that, and I know you both use W7, Johns GTX650Ti is 2.12 times as fast as TJ's GTX550Ti. Don't know the CPU usage on both systems though, and that's important.

My 3 650 TI GPUs are running 59,400 - 64,900 seconds/WU OCed. The 4 GTX 460s are unfortunately off to other projects until the NOELIAS go away, so I'm 3 for 7...

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 29811 - Posted: 11 May 2013 | 7:50:45 UTC - in response to Message 29792.

We are looking into it.

Noelia is not to blame. She is simply trying desperately to run some new science which uses a new algorithm within acemd. The others are not.

We absolutely need this to run, but it is giving us problems it seems.

gdf

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29815 - Posted: 11 May 2013 | 9:27:35 UTC - in response to Message 29811.

We are looking into it.

Noelia is not to blame. She is simply trying desperately to run some new science which uses a new algorithm within acemd. The others are not.

We absolutely need this to run, but it is giving us problems it seems.
gdf

Since May 8th I have done 3 Noelias, which all finished without error. Only my GPUs are slow, so it took almost 2 days to complete one WU. And I have seen other PCs that finish these Noelias, so could it be something to do with the driver, hardware, BOINC or the OS? So perhaps the algorithm Noelia is using is okay.
I don't mind running these Noelias, especially if it is important for the project.
____________
Greetings from TJ

Dirk
Send message
Joined: 10 Oct 08
Posts: 18
Credit: 39,100,916
RAC: 0
Level
Val
Message 29816 - Posted: 11 May 2013 | 10:35:13 UTC

Ran a few Noelias recently on my 660 TI, no problems finishing them. Did have the issue with drivers crashing after suspending the wu. But just now I suspended the gpu to do some gaming and it didn't crash, odd.

nanoprobe
Send message
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Level
Leu
Message 29817 - Posted: 11 May 2013 | 11:54:29 UTC
Last modified: 11 May 2013 | 12:53:16 UTC

Just finished my first Noelia on a newly acquired 660TI. 10 hours to complete, no problems on Linux Mint, 127,800 credits. Will be rebooting to XP soon to see how they run there.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29818 - Posted: 11 May 2013 | 12:05:50 UTC - in response to Message 29816.
Last modified: 11 May 2013 | 15:51:48 UTC

I have had one Noelia WU failure and 17 WU's complete successfully. For me that's a slightly higher success rate than with Nate's less challenging WU's. Though I would add that I dropped using another CPU core to improve their performances.

In many cases (but not all) the key to successfully crunching these WU's is using GPUGrid's recommended Boinc setup; set Boinc to use <100% of the CPU, allow the GPU to be used when the system is being used by the user, don't run multiple projects (as this results in apps being suspended to run other GPU apps).

WRT the task crashes, driver restarts sometimes result in the WU's crashing, so if people affected by this tried the registry change suggestion Nanoprobe made, we might have a quick-fix/work around.

The driver restarts are a generic driver issue seen in several versions of recent NVidia drivers. It affects other GPU projects and other non-Boinc applications. It's not down to GPUGrid or any other Boinc project to fix NVidia's drivers or OS problems. However, GPUGrid still has to use the drivers, OS's and Boinc.

I totally agree that these WU's are essential to GPUGrid. It's down to GPUGrid to facilitate running these WU's, and down to the crunchers to facilitate GPUgrid where possible; GPUGrid doesn't force Boinc configurations upon us, so crunchers need to make the recommended adjustments to improve performance.

The obvious and simplest solution as far as GPUGrid is concerned is to put Noelia's WU's into a different queue. Have Short, Long and Very Long queues, or just use the Beta queue. This would allow users with cards that can't run these WU's to opt out or in. Ideally the server would allocate tasks based on their success rates. Perhaps this can be done on a system and queue basis?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

[AF>Le_Pommier] McRoger
Send message
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level
Pro
Message 29849 - Posted: 12 May 2013 | 7:24:00 UTC

Problem for me: much too long, won't finish on time on my 470GTX. :(

185 hours and a deadline on Monday 13...

GoodFodder
Send message
Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Level
Asp
Message 29850 - Posted: 12 May 2013 | 8:22:45 UTC

No problems at all for me with the Noelia's - 17 have gone through without a hitch - there again I am not overclocking my systems to their limit.

[AF>Le_Pommier] McRoger
Send message
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level
Pro
Message 29852 - Posted: 12 May 2013 | 8:58:05 UTC - in response to Message 29850.

Not overclocking at all, but this is obviously not the point. :(

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,532,759
RAC: 3,459,749
Level
Trp
Message 29854 - Posted: 12 May 2013 | 9:17:20 UTC - in response to Message 29849.
Last modified: 12 May 2013 | 9:20:13 UTC

Problem for me: much too long, won't finish on time on my 470GTX. :(

185 hours and a deadline on Monday 13...

Your GTX 470 should be fast enough to complete a NOELIA_klebe workunit in time (about 12 hours).
It takes 10h45m ~ 11h45m to process them on my (slightly overclocked) GTX 480.
So the problem is at your end, try to restart your host, or try not to crunch any CPU tasks on your P4.

[AF>Le_Pommier] McRoger
Send message
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level
Pro
Message 29855 - Posted: 12 May 2013 | 9:54:36 UTC - in response to Message 29854.
Last modified: 12 May 2013 | 9:58:15 UTC

Thanks, but that will not help I'm afraid, this is apparently not a performance issue.

Just looked at my account, all previous Noelia WU ended in error this week.

Like this one:
Stderr output
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SIGSEGV: segmentation violation
Stack trace (12 frames):
../../projects/www.gpugrid.net/acemd.2868(boinc_catch_signal+0x4d)[0x56709d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f775251c030]
/lib/x86_64-linux-gnu/libc.so.6(fwrite+0x34)[0x7f77517cf034]
../../projects/www.gpugrid.net/acemd.2868[0x47f9c7]
../../projects/www.gpugrid.net/acemd.2868[0x4813a0]
../../projects/www.gpugrid.net/acemd.2868[0x492d74]
../../projects/www.gpugrid.net/acemd.2868[0x47f18a]
../../projects/www.gpugrid.net/acemd.2868[0x422c27]
../../projects/www.gpugrid.net/acemd.2868[0x408c04]
../../projects/www.gpugrid.net/acemd.2868[0x407bc9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f7751784ead]
../../projects/www.gpugrid.net/acemd.2868[0x407a39]

Exiting...

</stderr_txt>
]]>

I've upgraded to the latest Nvidia PAU driver in the Debian Wheezy packages, and of course, stopped CPU WU, restarted the PC (was up for 35 days).

So far, the current task shows some progress (10% after 02:25:00) even though the ETC remains at 12 and a half hours. I'll keep you posted!


Edit: only uses up to 4% CPU, so this is really not the point, even for an old PIV.

[AF>Le_Pommier] McRoger
Send message
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level
Pro
Message 29856 - Posted: 12 May 2013 | 10:02:49 UTC

Now, the very same WU that was in progress a couple of minutes ago also ended up in error.

Sorry, but have to stop the Noelia long runs, they all end up in error, for some of them pretty fast, for others after days......

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29858 - Posted: 12 May 2013 | 12:22:22 UTC - in response to Message 29856.
Last modified: 12 May 2013 | 12:32:06 UTC

There is a problem on some Linux operating systems; the WU's want to take forever to complete. I think it's the more recent versions of Linux that are impacted, but not all. It's possible there is something missing in the drivers or in Linux (missing libraries, for example) that is preventing the correct use of the drivers.

If you consistently get SIGSEGV errors on Linux or these WU's are going to take hundreds of hours (go by the % complete) just abort them.
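
To put a number on "go by the % complete", here is a rough sketch that projects total run time from elapsed time and the progress figure, rather than trusting the client's estimate. It assumes progress is roughly linear over the run, which is only an approximation; the function name is invented for this example and the sample figures are ones reported earlier in the thread.

```python
def projected_total_hours(elapsed_hours: float, percent_complete: float) -> float:
    """Project total run time, assuming progress is linear in time."""
    if elapsed_hours < 0:
        raise ValueError("elapsed_hours must not be negative")
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_complete


# Under 20% after 9 hours on Linux, as reported above: at least this long overall.
print(projected_total_hours(9.0, 20.0))

# 10% after 02:25:00, as reported above, despite an ETC of 12.5 hours.
print(round(projected_total_hours(2 + 25 / 60, 10.0), 1))
```

If the projection comes out several times longer than the 8-12 hours quoted for long runs on the fastest cards, and well past the deadline, that is the case where aborting makes sense.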

If anyone finds a fix, post it up.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

nanoprobe
Send message
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Level
Leu
Message 29859 - Posted: 12 May 2013 | 13:01:43 UTC

Finished my first Noelia without errors on XP running a 660TI. Took about 35 minutes longer than the same card on Linux Mint

[AF>Le_Pommier] McRoger
Send message
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level
Pro
Message 29861 - Posted: 12 May 2013 | 13:14:27 UTC - in response to Message 29858.
Last modified: 12 May 2013 | 13:15:10 UTC

There is a problem on some Linux operating systems; they want to take for ever to complete. I think it's the more recent versions of Linux that are impacted, but not all. It's possible there is something missing in the drivers or Linux that is preventing the correct use of the drivers; missing libraries.

If you consistently get SIGSEGV errors on Linux or these WU's are going to take hundreds of hours (go by the % complete) just abort them.

If anyone finds a fix, post it up.


I'm running Debian Wheezy, so yes, a very recent version of kernel, drivers and libraries.

And if I want to install « glibc-2.13-1 » (containing libpthread.so.0 mentioned in the error message), apt tells me that « libc6 » is already installed instead.

So yes indeed, might be that this is a choice of library to compile the Linux application that is not compatible with latest versions (but I'm no developer, so that is just an assumption).

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29880 - Posted: 12 May 2013 | 18:44:01 UTC - in response to Message 29859.

Finished my first Noelia without errors on XP running a 660TI. Took about 35 minutes longer than the same card on Linux Mint

While you have only run one task type each on Linux and XP, it looks like Linux Mint (3.5.0-17-generic) is ~5% faster (4.5% for Nathan's and 5.8% for Noelia's):

Linux
306px37x2-NOELIA_klebe_run-1-3-RND7661_1 4440201 10 May 2013 | 20:53:39 UTC 11 May 2013 | 7:16:08 UTC Completed and validated 36,653.31 16,521.36 127,800.00
I40R14-NATHAN_dhfr36_6-10-32-RND5144_0 4440304 10 May 2013 | 20:53:39 UTC 11 May 2013 | 12:08:30 UTC Completed and validated 17,866.73 17,627.17 70,800.00

XP
306px2x1-NOELIA_klebe_run-1-3-RND0127_0 4442470 11 May 2013 | 19:28:14 UTC 12 May 2013 | 6:21:56 UTC Completed and validated 38,796.59 17,414.91 127,800.00
I12R11-NATHAN_dhfr36_6-13-32-RND4528_0 4442251 11 May 2013 | 19:29:37 UTC 12 May 2013 | 11:33:18 UTC Completed and validated 18,676.30 18,565.16 70,800.00

All 'Long runs (8-12 hours on fastest card) v6.18 (cuda42)'

That's more than I thought it would be (1%, possibly 3%). There might be some task variation, but running on the same system makes it a very solid comparison.
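
The percentages can be reproduced from the run time column (the first of the two time figures in each row above). A small sketch, with a helper name chosen just for this example:

```python
def percent_longer(reference_seconds: float, other_seconds: float) -> float:
    """Return how much longer other_seconds is than reference_seconds, in percent."""
    if reference_seconds <= 0:
        raise ValueError("reference_seconds must be positive")
    return (other_seconds / reference_seconds - 1.0) * 100.0


# Run times in seconds from the task lists above (Linux Mint first, then XP).
noelia_pct = percent_longer(36_653.31, 38_796.59)
nathan_pct = percent_longer(17_866.73, 18_676.30)

print(f"NOELIA: XP took {noelia_pct:.1f}% longer than Linux")
print(f"NATHAN: XP took {nathan_pct:.1f}% longer than Linux")
```

With only one task of each type per OS, this is a single data point per row, so some of the gap could still be task-to-task variation.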
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jozef J
Send message
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 17,863
Level
Met
Message 30024 - Posted: 16 May 2013 | 18:44:03 UTC

I2HDQ_17R4-SDOERR_2HDQc-1-4-RND1951_0
I99R11-NATHAN_dhfr36_6-8-32-RND2501_0 202 (0xca) EXIT_ABORTED_BY_PROJECT


http://www.gpugrid.net/result.php?resultid=6878277
http://www.gpugrid.net/result.php?resultid=6851367

Errors and errors again and again, and more Noelias incoming :-(((
This comes after about two weeks without problems or restarts/blue screens.
Every time my RAC reaches about 620k, some bad Noelia jobs come in. This is not the first time I have complained about the problem when my RAC was just about 600-620k. Now all my participation in the project over two weeks is in the ass. Is there any conspiracy behind it or just incompetence?

Can I do something more than complain here?...-)))

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 30026 - Posted: 16 May 2013 | 19:13:25 UTC - in response to Message 30024.

Is there any conspiracy behind it or just incompetence?

Neither.

Noelia is not to blame, she's just the first to use new features which the project needs in the future.

MrS
____________
Scanning for our furry friends since Jan 2002

Jozef J
Send message
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 17,863
Level
Met
Message 30031 - Posted: 16 May 2013 | 20:01:48 UTC - in response to Message 30026.
Last modified: 16 May 2013 | 20:04:57 UTC

Every time my RAC reaches about 620k, some bad Noelia jobs come in. This is not the first time I have complained about the problem when my RAC was just about 600-620k. This is exactly the third time I have had a problem with Noelias just when I reached 620k RAC. It is amazing, Mr. Scientist. And that's your answer, Mr. moderator?

I can prove it with the work logs.

Do you think that people with imperfect English cannot understand simulations of proteins?! Shame on you.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 30038 - Posted: 17 May 2013 | 3:16:50 UTC
Last modified: 17 May 2013 | 3:17:14 UTC

It's obviously an international conspiracy to keep your RAC low. We're all involved and participate in LJRAC (Lowering Jozef's Recent Average Credit). BTW, the checks are in the mail...

JK. Seriously, we're all suffering the same problem. This is not what one would call a smoothly running project. Just saying...

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 30039 - Posted: 17 May 2013 | 7:41:04 UTC

Well, I have two systems with nVidia cards, slow ones though. However, they have done Noelias without problems so far, taking between 2 and 3 days. The systems are stable, nothing is overclocked, and they are not on the latest BOINC or drivers. If it works then I leave it as is. If not, I'll try to update the video drivers. One CPU core is always free; that seems to be important.

It could be the setup of system and drivers that results in errors. Microsoft Windows overall is a complex and heavily controlling OS.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30047 - Posted: 17 May 2013 | 13:16:55 UTC - in response to Message 30039.
Last modified: 17 May 2013 | 13:46:33 UTC

If it works than I leave it as is.

Always the best advice.
Unfortunately I test stupid problems and get errors for my efforts. Today while testing something and looking into another issue/fix, I had to suspend WU's. This caused the driver to restart two or three times, and then I got a blue screen. On reboot lots of C+ errors and all my running WU's crashed and burned. Not an issue if I had been running FightMalaria@home, but I was running 5 climate models and lost several hundred hours - scunnered!

Possible fix here - works for me.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 30049 - Posted: 17 May 2013 | 14:06:38 UTC - in response to Message 30047.

I was running 5 climate models and lost several hundred hours - scunnered!

Ouch.

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 30054 - Posted: 17 May 2013 | 15:29:26 UTC

I have read here in the forum many times that suspending a GPUGRID WU will cause error and blue screen. That is why I have never tried it. For Albert and Einstein at home it can be done without harm (in my case).
____________
Greetings from TJ

Profile [PUGLIA] kidkidkid3
Avatar
Send message
Joined: 23 Feb 11
Posts: 100
Credit: 1,347,601,646
RAC: 2,174,675
Level
Met
Message 30058 - Posted: 18 May 2013 | 10:52:54 UTC - in response to Message 30054.

Hi all,
after 25 hours (72% completed) I suspended this Noelia WU.
On resume I got this error, and then I aborted the WU because it restarted from 0%.
Do you have any idea about this?
Thanks in advance.
K.

http://www.gpugrid.net/result.php?resultid=6879994

____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 30060 - Posted: 18 May 2013 | 12:04:11 UTC

I think it's pretty safe to say that with the current Noelias suspending a WU almost certainly triggers a driver reset. For me this has taken down a few hours of Einstein work, twice. Now I do my testing whenever I have other WUs running. Not ideal, but better than the alternative.

@Jozef: when was the last time that throwing insults at people actually helped you? Looking at your tasks I can see that in the last 2 weeks you had 3 Noelias and 2 Nathans fail with computation errors. That's unfortunate, but not unusual and you can be sure the scientists are looking into it. But it's nowhere near the scale of the global conspiracy which you seem to suspect.

Actually everyone gets bonus credits for each long-run WU, as the risk of losing them is higher than for shorter WUs. So you're being compensated for a certain failure rate from the beginning.

MrS
____________
Scanning for our furry friends since Jan 2002

flashawk
Send message
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 30075 - Posted: 18 May 2013 | 21:39:15 UTC

The suspend-restart blue screen has never happened to me and I suspend quite often (Windows XP Pro x64). Maybe it's an OS specific issue, I also have my checkpoints set to 900 seconds (15 minutes), I did this mainly for the climate models I run. I do have problems when finishing a SDOERR and starting a NOELIA on the same GPU, no crashes, just the card running wild on the GPU clock.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30088 - Posted: 19 May 2013 | 13:44:05 UTC - in response to Message 30075.
Last modified: 19 May 2013 | 13:52:56 UTC

I've only seen the 'suspend & crash' problem on W7. Seeing as different OS's handle the drivers differently, it's bound to be OS related.

On XP I think you still can't set Prefer maximum performance in NVidia control panel - might explain the 'card running wild on the GPU clock' issue. Anyway, it's a driver issue; they took that feature away.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 30090 - Posted: 19 May 2013 | 15:54:47 UTC - in response to Message 30088.

I've only seen the 'suspend & crash' problem on W7.

Add W8 to that!

MrS
____________
Scanning for our furry friends since Jan 2002

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,737,031,722
RAC: 571,307
Level
Arg
Message 30098 - Posted: 20 May 2013 | 0:17:07 UTC - in response to Message 30047.

This caused the driver to restart two or three times, and then I got a blue screen. On reboot lots of C+ errors and all my running WU's crashed and burned. Not an issue if I had been running FightMalaria@home, but I was running 5 climate models and lost several hundred hours - scunnered!


Sometimes I think it is not necessarily the GPUGRID WUs that cause the bluescreen; it might also be the CLIMATEPREDICTION.NET WUs. I just had a bluescreen around the same time as you, and afterwards one of the CLIMATEPREDICTION.NET WUs did not work anymore, while the GPUGRID WU continued. However, it mostly happens on a system with a GTX 570 card.

flashawk
Send message
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 30100 - Posted: 20 May 2013 | 2:46:37 UTC

The climate models (CPDN) are very, very sensitive to any kind of an interruption. When I set my checkpoints to every 15 minutes, my computation error rate dropped by 70% and if I do 3 or more suspend/restarts within 10 minutes, I'll get at least 1 error.

When I reboot my computers (every 200 hours), I suspend the tasks and close BOINC by clicking exit; that works every time for me. If I get any kind of crash or system freeze (neither has happened in over a year), it's a guarantee that I will lose some climate models. I even have new APC units just in case.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 30115 - Posted: 20 May 2013 | 13:58:37 UTC - in response to Message 30100.

The climate models (CPDN) are very, very sensitive to any kind of an interruption. When I set my checkpoints to every 15 minutes, my computation error rate dropped by 70% and if I do 3 or more suspend/restarts within 10 minutes, I'll get at least 1 error.

I've been wondering about CPDN, because the people reporting crashes often mention that they lose CPDN work. I'm not running that project and have never had any hard crashes, nothing but some ACEMD errors on certain WU types. Nothing else running on the machine is ever affected.

Vagelis Giannadakis
Send message
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 30286 - Posted: 24 May 2013 | 9:53:46 UTC
Last modified: 24 May 2013 | 9:56:19 UTC

Hi all,

My GTX 650Ti is currently working on a NOELIA, but the GPU utilization looks pretty low: elapsed 10h, remaining 17h. That will be a total of 27 hours! A previous SDOERR took 18h.

I use Linux and so can't observe GPU utilization directly, but judging by the temperature (50C), the GPU is clearly not being fully used. It gets to >60C when it is.

Has anyone else observed this?

Edit: The WU's process (acemd.2868) consumes 5-10% CPU, but this doesn't seem to be the cause of the under-utilization, rather the symptom of it. I tried suspending CPU tasks and projects and it didn't change.

My configuration:
Ubuntu Server 12.04 x86_64
Kernel 3.2.0-41-generic.
NVIDIA driver 319.17
BOINC 7.0.65
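
On a headless Linux box like this one, a rough way to watch the card is to poll `nvidia-smi` from a script. The sketch below is an illustration and not something from the post: it assumes the installed driver ships an `nvidia-smi` that supports `--query-gpu` with CSV output, and that the card reports the queried fields (some GeForce cards return a "not supported" marker for utilization, which the parser maps to `None`). The helper names `parse_gpu_line` and `read_gpus` are made up for this example.

```python
import subprocess

# Ask nvidia-smi for temperature and utilization as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '50, 37' into (temp_c, util_pct).

    Fields that are not plain integers (for example a 'not supported'
    marker) come back as None.
    """
    fields = [f.strip() for f in line.split(",")]
    fields = (fields + ["", ""])[:2]  # pad so unpacking never fails

    def to_int(text):
        try:
            return int(text)
        except ValueError:
            return None

    temp, util = (to_int(f) for f in fields)
    return temp, util


def read_gpus():
    """Return one (temp_c, util_pct) tuple per GPU, or [] if the query fails."""
    try:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return [parse_gpu_line(l) for l in out.splitlines() if l.strip()]


if __name__ == "__main__":
    for index, (temp, util) in enumerate(read_gpus()):
        print(f"GPU {index}: temp={temp} C, utilization={util} %")
```

If utilization is not reported, the temperature alone still gives the same rough signal used above (around 50C when under-used, above 60C when busy).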

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 30293 - Posted: 24 May 2013 | 14:54:04 UTC - in response to Message 30286.

My GTX 650Ti is currently working on a NOELIA, but the GPU utilization looks pretty low: elapsed 10h, remaining 17h. That will be a total of 27 hours! A previous SDOERR took 18h.

I use Linux and so can't observe GPU utilization directly, but judging by the temperature (50C), the GPU is clearly not being fully used. It gets to >60C when it is.

Has anyone else observed this?

On my 650 Ti GPUs (and others) the GPU utilization runs 5-6% lower on NOELIA and NATHAN_KID WUs than on SDOERR WUs. (Win7-64)

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30294 - Posted: 24 May 2013 | 14:54:53 UTC - in response to Message 30286.
Last modified: 24 May 2013 | 14:55:36 UTC

Is Prefer Maximum Performance 'presently' selected (as in, did you set it since you last rebooted)?


PS. Finger pointed at CPDN (no system or WU failures since I stopped crunching climate models)!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 30298 - Posted: 24 May 2013 | 15:27:49 UTC - in response to Message 30294.

PS. Finger pointed at CPDN (no system or WU failures since I stopped crunching climate models)!

I suspected such. Maybe a conflict between the apps?

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30311 - Posted: 24 May 2013 | 18:30:53 UTC - in response to Message 30298.

Possibly, or with Boinc, but on at least 2 occasions it/something has caused a blue screen/restart, which would have prevented Boinc and running apps from closing down properly. Most likely it was the CPDN app/WU's that caused Windows to fail, and the GPUGrid WU failures were just coincidental. I had presumed it was the GPUGrid app that had failed, triggering everything else to fail, because of the startup error messages and logs. I didn't think it was the CPDN apps, as several models had been running for several days. Since I've stopped running the CPDN WU's, I've had no problems...
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

flashawk
Send message
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level
Arg
Message 30312 - Posted: 24 May 2013 | 18:46:35 UTC

I run CPDN too and had BSoD's a while back, which I traced to USB 3.0. I uninstalled the drivers, rebooted, went into the BIOS and turned off USB 3.0. That was 4 to 5 months ago and it seems to have fixed it; maybe CPDN and the new USB don't get along. I figured that the drivers weren't mature enough. I don't need it, so it's no big deal for me (knocking on wood).

Vagelis Giannadakis
Send message
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 30314 - Posted: 24 May 2013 | 20:24:46 UTC - in response to Message 30294.

Is Prefer Maximum Performance 'presently' selected (as in, did you set it since you last rebooted)?


PS. Finger pointed at CPDN (no system or WU failures since I stopped crunching climate models)!

I don't use the Nvidia tool (nvidia-settings in Linux), as this is a headless machine, so effectively everything is "stock".

I aborted the NOELIA, as it was going to take waaaay too long (something like 4 days!) and I didn't want to risk a midway - or worse - failure. I am crunching a NATHAN_KID right now at full speed, with the GPU at 60-something degrees and a whole CPU core consumed. Total estimated runtime is ~22h.

Bottom line, it has to be something with the NOELIAs, at least some of them. Here is the WU discussed.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 30319 - Posted: 24 May 2013 | 21:34:48 UTC
Last modified: 24 May 2013 | 21:35:48 UTC

There's a discussion about 319.17 performing poorly for someone else running Ubuntu Server 12.04 x86_64. Going back to 310.44 fixed it for him.

BTW: I had all Einstein WUs crash upon a driver reset triggered by suspending Noelias, so it's not purely CPDN related. No bluescreen, though.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30326 - Posted: 25 May 2013 | 9:30:10 UTC - in response to Message 30319.
Last modified: 25 May 2013 | 9:31:17 UTC

I also saw a driver restart with Einstein (a couple of weeks ago) and a POEM WU yesterday (Vista rig). So it's common to many NVidia apps and WDDM OSes. The reg fix I posted has thus far prevented the driver restarts, but not the BSOD/restarts. So, two different issues.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 30328 - Posted: 25 May 2013 | 10:49:54 UTC

My last task on the GTX285 was a NATHAN and took 196,443.98 seconds. That is 10,000 seconds more than a NOELIA on the same card.
So these new NOELIAs are not all bad.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30337 - Posted: 25 May 2013 | 12:30:10 UTC - in response to Message 30328.
Last modified: 25 May 2013 | 12:49:26 UTC

My GTX650TiBoost finished a NOELIA_klebe in 13h 51min (49,849sec). Ubuntu 13.04, NVidia 304.88, Boinc 7.0.27.
Despite restarting several times while configuring things it was still 5.5% faster than on a 2008server (which is a bit faster than W7) on the same rig.

Task 6890757 | Work unit 4473004 | Sent 24 May 2013 21:20:02 UTC | Reported 25 May 2013 12:13:07 UTC | Completed and validated | Run time 49,848.80 s | CPU time 21,179.21 s | Credit 127,800.00 | Long runs (8-12 hours on fastest card) v6.18 (cuda42)

TJ, you really should sell that GTX285 heater and get something new (cheap to buy, much faster, less expensive to run). A GTX650Ti would more than triple the performance, a GTX650TiBoost would almost quadruple it and a GTX660 would be around 4.5 times as fast.
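
As a rough cross-check of the "almost quadruple" figure, the run times quoted in this thread can be compared directly. This is only a sketch: the GTX285 NOELIA time is inferred from TJ's remark that the NATHAN took about 10,000 seconds more, the two cards ran different work units under different operating systems, and the function names are invented for the example.

```python
def speedup(old_seconds, new_seconds):
    """How many times faster the new card finished, as a plain ratio."""
    return old_seconds / new_seconds


def as_h_m(seconds):
    """Convert seconds to (hours, minutes), rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


# Run times quoted in the thread.
GTX285_NATHAN_S = 196_443.98
GTX650TIBOOST_NOELIA_S = 49_848.80

# Assumption: a NOELIA on the GTX285 took roughly 10,000 s less than the NATHAN.
GTX285_NOELIA_S = GTX285_NATHAN_S - 10_000

print(as_h_m(GTX650TIBOOST_NOELIA_S))                               # (13, 51)
print(round(speedup(GTX285_NATHAN_S, GTX650TIBOOST_NOELIA_S), 2))   # 3.94
print(round(speedup(GTX285_NOELIA_S, GTX650TIBOOST_NOELIA_S), 2))   # 3.74
```

Either way the ratio lands a little under 4, which is in line with the estimate for the GTX650TiBoost.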
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 30344 - Posted: 25 May 2013 | 13:27:32 UTC - in response to Message 30337.
Last modified: 25 May 2013 | 13:28:34 UTC

TJ, you really should sell that GTX285 heater and get something new (cheap to buy, much faster, less expensive to run). A GTX650Ti would more than triple the performance, a GTX650TiBoost would almost quadruple it and a GTX660 would be around 4.5 times as fast.


I did. That is why I wrote "last task": it's out of the rig and the new GTX660 is in it.
Running MilkyWay now for testing, but it seems slower to me: a WU takes around 8 minutes to finish, where it was 6 minutes on the old GTX285, although the WU's are not the same. But everything seems slow, even browsing; I posted this in the cards thread.

Indeed a heater; the GTX660 runs cooler, now at 54°C
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30348 - Posted: 25 May 2013 | 13:56:06 UTC - in response to Message 30344.
Last modified: 25 May 2013 | 13:57:45 UTC

MilkyWay requires FP64 (double precision). It's a bad project for GK104 cards.
The GTX285 was better because it has better double precision than a GTX660.
For Single Precision and CUDA4.2 the GTX660 is much better than a GTX285.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,737,031,722
RAC: 571,307
Level
Arg
Message 30357 - Posted: 25 May 2013 | 16:01:51 UTC

Just a quick question: I run an 8600 GT, a 9800 GT and some 8400s (never dump old gear) on PRIMEGRID rather than GPUGRID, as I am still thinking of buying something between a GTX 660 and a GTX 770 in the near future. But looking at my electric bill from last month (roughly USD 350.00), I was discouraged from investing in an additional card. So, as you are discussing the GTX 285 at this very moment: do you think it would pay off to buy a GTX 650 Ti and dump the 9800 GT and the 8600 GT? The latter is causing trouble and does about 10000 credits each day. The 8400s had just been lying around, so I thought why not put them in some computers.

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 30360 - Posted: 25 May 2013 | 16:42:25 UTC - in response to Message 30357.
Last modified: 25 May 2013 | 17:38:41 UTC

After a quick look I think that PG's credits are roughly comparable to here and a 650Ti is a decent enough GPU for here.

The 8600GT, 9800GT and 8400's are of no use for here and are SP only, so no use at MW either.
While they might work at PG and several other projects their use also depends on project up-time, so POEM might work on an 8400GS but there aren't enough WU's around to justify keeping the card for that project.

Individually these cards are not very expensive to run but their performance is relatively poor, and project compatibility is limited.

They might still be of some use as entry-level gaming cards, so you might get something for them, but not much.

If the $350/month only came down by ~$50 a GTX650Ti would pay for itself in a couple of months, and a 660 would probably pay for itself within 3 or 4 months, just by getting rid of the old cards' running cost.
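
The payback estimate is simple division, and it can be turned around to see what purchase price each claim implies. A minimal sketch using only the ~$50/month saving from the post; no card prices are assumed, and the function names are made up for this example.

```python
def payback_months(card_price, monthly_saving):
    """Months until the saving on the bill has covered the purchase price."""
    return card_price / monthly_saving


def break_even_price(monthly_saving, months):
    """Highest purchase price that is paid back within the given months."""
    return monthly_saving * months


MONTHLY_SAVING_USD = 50  # the "~$50" reduction mentioned in the post

for months in (2, 3, 4):
    price = break_even_price(MONTHLY_SAVING_USD, months)
    print(f"{months} months -> up to ${price}")
```

So "a couple of months" corresponds to a card costing about $100, and "3 or 4 months" to roughly $150 to $200. Whether the bill would really drop by $50 depends on the local electricity rate, which the thread does not give.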

An 8400GS has a TDP of between 25W and 38W, so there isn't a lot of electric being used per GPU.
The 8600GT only has a TDP of 43W, so again there isn't much of a saving from one, but neither of these cards can do much crunching anyway.
The 9800GT however varies between 59W and 125W, depending on the model (?)

I see you have two 8400GS GPU's running at PG, along with an 8600GT and a 9800GT. These are bringing you a RAC of <35K for ~(40or65+35+60to110)W = 135W to 210W (depending on the models). Obviously your choice what you crunch with and who you crunch for, but if you pulled those GPU's and ran a GTX650Ti here you could get a RAC of ~190,000, and for ~100W. So you would be saving some electric (between 35W and 110W) and earning more than five times the credits. Presuming your 9800GT has either a 105W or a 125W TDP (it's not a GE version) then the Electric saving would be at least 50W.
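
The wattage arithmetic can be written out as a short sketch. The per-card figures are the estimates from the post (two 8400GS at 40W or 65W combined, 35W for the 8600GT, 60W to 110W for the 9800GT, about 100W for a GTX650Ti); the variable and function names are invented here, and the old RAC uses 35,000 as an upper bound because the post only says "<35K".

```python
# Per-card power estimates in watts as (low, high), taken from the post.
OLD_CARDS_W = {
    "2x 8400GS": (40, 65),
    "8600GT": (35, 35),
    "9800GT": (60, 110),
}
NEW_CARD_W = 100      # roughly 100W for a GTX650Ti, per the post
OLD_RAC = 35_000      # upper bound; the post says "<35K"
NEW_RAC = 190_000


def total_range(cards):
    """Sum the low and high estimates across all cards."""
    low = sum(lo for lo, _ in cards.values())
    high = sum(hi for _, hi in cards.values())
    return low, high


def saving_range(cards, new_card_w):
    """Power saved by replacing all the old cards with one new card."""
    low, high = total_range(cards)
    return low - new_card_w, high - new_card_w


print(total_range(OLD_CARDS_W))               # (135, 210)
print(saving_range(OLD_CARDS_W, NEW_CARD_W))  # (35, 110)
print(round(NEW_RAC / OLD_RAC, 1))            # 5.4
```

Adding up the per-card estimates gives 135W to 210W for the old cards, a saving of 35W to 110W, and at least 5.4 times the credits.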

If you want to save further on the Electric front, get rid of your E6750 system, and possibly your E8500. Even if you just stopped using those CPU's to crunch you would save as much as you would from getting rid of an old GPU.
You have 3 good rigs at PG, but two old beasts that don't really do much other than eat electric.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 30385 - Posted: 26 May 2013 | 9:01:07 UTC - in response to Message 30360.

I second SK's suggestion. And you wouldn't have to throw those Core 2 Duos away: they're still decent surf stations / office boxes, especially if equipped with 4 GB RAM and an SSD.

MrS
____________
Scanning for our furry friends since Jan 2002

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,737,031,722
RAC: 571,307
Level
Arg
Message 30437 - Posted: 27 May 2013 | 4:29:30 UTC - in response to Message 30385.

If the $350/month only came down by ~$50 a GTX650Ti would pay for itself in a couple of months, and a 660 would probably pay for itself within 3 or 4 months, just by getting rid of the old cards' running cost.

That’s what I was thinking: dismiss all the old cards, replace them with a GTX 650 Ti and save enough on the electric bill to justify the investment.

An 8400GS has a TDP of between 25W and 38W, so there isn't a lot of electric being used per GPU.

Does about 3371 credits a day in PG = 88 credits / 1 W

The 8600GT only has a TDP of 43W, so again there isn't much of a saving from one, but neither of these cards can do much crunching anyway.

Around 10000 credits a day in PG = 232 credits / 1 W

The 9800GT however varies between 59W and 125W, depending on the model (?)

Around 25000 credits a day in PG = 200 credits / 1 W

I see you have two 8400GS GPU's running at PG, along with an 8600GT and a 9800GT. These are bringing you a RAC of <35K for ~(40or65+35+60to110)W = 135W to 210W (depending on the models). Obviously your choice what you crunch with and who you crunch for, but if you pulled those GPU's and ran a GTX650Ti here you could get a RAC of ~190,000, and for ~100W. So you would be saving some electric (between 35W and 110W) and earning more than five times the credits. Presuming your 9800GT has either a 105W or a 125W TDP (it's not a GE version) then the Electric saving would be at least 50W.

That just got me thinking: my GTX 570 (factory overclocked) has a TDP of 218 W and gives around 250000 credits a day… OK, you can’t throw it away after 2 years of operation, because of the grey energy… So, should it be a GTX 650 Ti 2 GB and not a Boost, because of the TDP?
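
The credits-per-watt figures above come from dividing daily credit by each card's upper TDP. A small sketch that reproduces them and adds the two newer cards for comparison; the GTX 650 Ti row uses skgiven's estimate of ~190,000 credits for ~100 W, the table and function names are invented for this example, and TDP is only a stand-in for real power draw.

```python
# (credits per day, TDP in watts) as quoted in the thread.
CARDS = {
    "8400GS": (3_371, 38),
    "8600GT": (10_000, 43),
    "9800GT": (25_000, 125),
    "GTX 570": (250_000, 218),
    "GTX 650 Ti (estimate)": (190_000, 100),
}


def credits_per_watt(credits_per_day, tdp_watts):
    """Daily credit earned per watt of rated TDP."""
    return credits_per_day / tdp_watts


for name, (credits, watts) in CARDS.items():
    print(f"{name}: {credits_per_watt(credits, watts):.1f} credits/W")
```

On these numbers the GTX 570 earns roughly 1,147 credits per watt and the GTX 650 Ti about 1,900, against 88 to 232 for the old cards, so the old cards are the ones worth retiring first.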

If you want to save further on the Electric front, get rid of your E6750 system, and possibly your E8500. Even if you just stopped using those CPU's to crunch you would save as much as you would from getting rid of an old GPU.
You have 3 good rigs at PG, but two old beasts that don't really do much other than eat electric.

I second SK's suggestion. And you wouldn't have to throw those Core 2 Duos away: they're still decent surf stations / office boxes, especially if equipped with 4 GB RAM and an SSD.

PG, Einstein and Collatz are just a side show; my real interest is in climateprediction.net, GPUGRID (skgiven, there we have the same interests) and to some extent Malariacontrol. So I thought I would use all the gear I have for BOINC, but the electric bill made me think... So the two Core 2 Duos will only come in if there is climateprediction work around in the future.
MrS, you are right: the two Core 2 Duos are just reserve computers for when I have interns or other helpers who need a computer; they work well for them.

Finally, on the bluescreen topic: I still think Climateprediction.net gets along very well with GPUGRID.
