
Message boards : Graphics cards (GPUs) : NOELIA WUs getting "stuck"

Author Message
ETQuestor
Joined: 11 Jul 09
Posts: 27
Credit: 1,000,618,568
RAC: 0
Message 26611 - Posted: 15 Aug 2012 | 20:36:39 UTC

I am seeing about 10% of my NOELIA WUs getting "stuck" - the "fraction done" output stops moving. This happens most often with "run4" WUs, but I have also seen it with the other run numbers. If I restart BOINC, the WU starts over from 0.00000. Sometimes it freezes again at another spot, sometimes it finishes successfully after the restart, and sometimes it errors out.

Link to the machine - http://www.gpugrid.net/show_host_detail.php?hostid=111125

Linux x86_64 (Fedora 17 3.4.6-2.fc17.x86_64)
NVIDIA UNIX x86_64 Kernel Module 304.37
GeForce GTX 560 Ti
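As an aside, a stall like this can be spotted automatically by polling the client and comparing the "fraction done" between snapshots. A minimal Python sketch (hypothetical: the task names below are invented, and in practice the snapshots would be built by parsing `boinccmd --get_tasks` output, which is not shown here):

```python
# Sketch: flag "stuck" tasks by comparing two fraction-done snapshots.
# Each snapshot maps task name -> fraction done (0.0 to 1.0); in practice
# you would build them by parsing `boinccmd --get_tasks` output.

def stuck_tasks(prev, curr, epsilon=1e-6):
    """Names of unfinished tasks whose fraction done did not advance."""
    return [name for name, frac in curr.items()
            if name in prev and frac - prev[name] < epsilon and frac < 1.0]

if __name__ == "__main__":
    before = {"noelia_run4_123": 0.42, "trypsin_lig_7": 0.10}
    after = {"noelia_run4_123": 0.42, "trypsin_lig_7": 0.15}
    print(stuck_tasks(before, after))  # ['noelia_run4_123']
```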

Dylan
Joined: 16 Jul 12
Posts: 98
Credit: 386,043,752
RAC: 0
Message 26612 - Posted: 15 Aug 2012 | 21:15:16 UTC

Hmm, I haven't encountered this issue, but I run Windows, so that might be why.

For me, tasks sometimes restart when BOINC does, but not every time.

Sorry I can't help you.

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 389,810,184
RAC: 1,368,385
Message 26615 - Posted: 16 Aug 2012 | 3:55:32 UTC
Last modified: 16 Aug 2012 | 3:56:07 UTC

This has happened to me twice to date (I am running Windows). The first time, it was two days before I noticed, so I just aborted the WU (utilization was practically 0% and progress wasn't moving). The second time I saw it, I closed and reopened BOINC, after which the WU errored out.

Kind of a pain, but I'm just keeping an eye out at this point.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,217,465,968
RAC: 1,257,790
Message 26617 - Posted: 16 Aug 2012 | 7:30:45 UTC
Last modified: 16 Aug 2012 | 7:34:15 UTC

One of my hosts is crunching such a workunit right now. It has been running for 21h26m and is at 78.320%, progressing very slowly. This type of workunit usually takes less than 7 hours to complete. I tried pausing and restarting the task, then moved it to another GPU in the same host, but there was no change in its speed. I've double-checked that none of this host's GPUs is downclocked.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 26622 - Posted: 16 Aug 2012 | 11:31:21 UTC - in response to Message 26617.
Last modified: 16 Aug 2012 | 11:55:48 UTC

The same workunit was aborted on another system with these verbosely challenged details:


    Stderr output

    <core_client_version>6.12.34</core_client_version>



Most likely some problem with the tasks, but perhaps this needs more CPU or bandwidth. Does freeing up another CPU core/thread help? (If it's bandwidth, you would have to suspend any CPU tasks to notice.) Are GPU temps/fan speeds normal?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,217,465,968
RAC: 1,257,790
Message 26628 - Posted: 16 Aug 2012 | 23:09:32 UTC - in response to Message 26622.

It's finished after 27 hours...

Stderr output:

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
MDIO: cannot open file "restart.coor"
No heartbeat from core client for 30 sec - exiting
# Time per step (avg over 545000 steps): 19.800 ms
# Approximate elapsed time for entire WU: 99000.143 s
called boinc_finish

</stderr_txt>
]]>


Since then my host has finished a couple of workunits without any problems, and without any restarts.

ETQuestor
Joined: 11 Jul 09
Posts: 27
Credit: 1,000,618,568
RAC: 0
Message 26629 - Posted: 16 Aug 2012 | 23:26:56 UTC - in response to Message 26622.

I was seeing this issue with the much shorter "trypsin_lig" runs. When they ran successfully, they finished very quickly (under an hour). CPU utilization has been under 20% and the GPU temperature/fan are normal. I have been running other ACEMD and long runs on this same machine without issue for almost a year.

ETQuestor
Joined: 11 Jul 09
Posts: 27
Credit: 1,000,618,568
RAC: 0
Message 26630 - Posted: 16 Aug 2012 | 23:29:11 UTC - in response to Message 26611.

Here are some examples of failed WUs - three errored out and two I aborted after they got stuck.

http://www.gpugrid.net/result.php?resultid=5726005
http://www.gpugrid.net/result.php?resultid=5724879
http://www.gpugrid.net/result.php?resultid=5723761
http://www.gpugrid.net/result.php?resultid=5719313
http://www.gpugrid.net/result.php?resultid=5713069

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 26632 - Posted: 17 Aug 2012 | 12:44:47 UTC - in response to Message 26630.

We are seeing three different errors here.
Zoltan's system had a "No heartbeat from core client for 30 sec" error, and ETQuestor had two different errors (3 tasks were aborted). One error was a SIGABRT and the other was an "energies have become nan" error:

SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
SIGABRT: abort called
Stack trace (15 frames):
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
/lib64/libc.so.6(+0x359a0)[0x7fdff7ad39a0]
/lib64/libc.so.6(gsignal+0x35)[0x7fdff7ad3925]
/lib64/libc.so.6(abort+0x148)[0x7fdff7ad50d8]
/lib64/libc.so.6(+0x2e6a2)[0x7fdff7acc6a2]
/lib64/libc.so.6(+0x2e752)[0x7fdff7acc752]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdff7abf735]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]

Exiting...

--
ERROR: file deven.cpp line 1106: # Energies have become nan

Each of these errors can be caused by more than one problem. There have been suggestions about these errors in the past.

While the No Heartbeat error can simply be caused by not having enough access to the CPU, it's often due to not being able to write to the hard drive because other tasks are basically thrashing the drive (I think any process that reads or writes to the drive would be prioritised over all things BOINC). Things like automatic disk defrags, Windows search indexing and AV scans can trigger this. I have also experienced this error when a SATA cable became loose (the sub-standard red ones)! I suspect opening or leaving the Event Log window open might contribute to this as well (if you have lots of cc_config flags set).

The SIGABRT (an abort-task signal) and the Not a Number errors could well be task related, but they could also be Linux setup/driver/library issues. In the past, similar errors were supposedly caused by BOINC running CPU benchmarks, among other things. A libc.so.6 "double free or corruption" was reported back in January when crunching a TONI task, though there are no suggestions in that thread. Soneageman also reported this error; again, I don't see any specific helpful info. If the problem continues, look at hardware (temps/fan speed/noise), the driver and the BOINC client (config/updates).
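The drive-thrashing theory can at least be checked on Linux by sampling /proc/diskstats twice: field 13 of each line is the cumulative milliseconds the device spent doing I/O, so the delta over a sampling interval gives a rough busy percentage. A hedged sketch (the device name and any threshold you would act on are up to the reader):

```python
# Sketch: estimate disk busyness from /proc/diskstats. Field 13 of each
# line is cumulative milliseconds spent doing I/O; a delta close to the
# sampling interval means the drive is saturated (thrashing).

def io_ms(diskstats_text, device):
    """Cumulative ms spent doing I/O for `device` (e.g. 'sda')."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise ValueError("device %r not found" % device)

def busy_percent(ms_before, ms_after, interval_s):
    """Percentage of the interval the device spent doing I/O."""
    return 100.0 * (ms_after - ms_before) / (interval_s * 1000.0)

# Usage: read /proc/diskstats, sleep interval_s seconds, read it again,
# then busy_percent(io_ms(first, "sda"), io_ms(second, "sda"), interval_s).
```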

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,217,465,968
RAC: 1,257,790
Message 26633 - Posted: 17 Aug 2012 | 16:23:17 UTC - in response to Message 26632.

We are seeing three different errors here.
Zoltan's system had a "No heartbeat from core client for 30 sec" error, ....

While the No Heartbeat error can simply be caused by not having enough access to the CPU, its often due to not being able to write to the hard drive, due to other tasks basically thrashing the drive (I think any process that reads or writes to the drive would be prioritised over all things Boinc). Things like automatic disk defrags, Windows search engines and AV scans can trigger this. I have also experienced this error when a SATA cable became lose (the sub-standard Red ones)! I suspect opening or leaving the Event Log window open might contribute to this as well (if you have lots of cc_config flags set).

While this is true, it's not the source of this workunit's slowness; it was slow right from the start.
My rosetta@home tasks were going wild, using 400 to 850 MB each, so when I started Skype, the BOINC manager shut down every task, because the memory used by BOINC applications exceeded the threshold of maximum usable physical memory (90%). It then restarted the tasks one by one, but rosetta@home tasks read 1.3 GB (and write 130 MB) at startup, and since I don't have an SSD in this PC, this could overwhelm the file system, causing tasks to start and stop several times (because of the "No heartbeat from core client" error) and rendering my PC unusable for a couple of minutes. Since then I've doubled the RAM in this host (it now has 12 GB).
The other three tasks running at the same time experienced this "No heartbeat" error, but they didn't slow down. The error makes the BOINC manager stop the task and restart it from the last checkpoint.
Here is a list of my workunits which experienced this error but didn't slow down:
5743222 2 times
5743031 2 times
5742951 2 times
5742426 4 times
5741987 2 times
5741772 2 times
5741561 2 times
5741537 2 times
5740471
5740130
5739827
5739821
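The 90% memory threshold described above is simple arithmetic; a sketch with illustrative numbers (the task sizes below are made up, not measurements from this host):

```python
# Sketch of the memory check described above: BOINC suspends tasks when
# the combined working set of its applications exceeds a configured
# fraction of physical RAM (90% here). All figures are illustrative.

def over_threshold(task_mb, ram_gb, max_frac=0.90):
    """True if the tasks' combined memory exceeds the usable limit."""
    return sum(task_mb) > max_frac * ram_gb * 1024

# Six large rosetta-style tasks plus other use can exceed 90% of 6 GB
# (5529.6 MB usable) but stay well under 90% of 12 GB (11059.2 MB).
tasks = [850, 850, 850, 850, 850, 850, 700]
print(over_threshold(tasks, 6))   # True  -> tasks get suspended
print(over_threshold(tasks, 12))  # False -> enough headroom
```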

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 26635 - Posted: 17 Aug 2012 | 20:07:54 UTC - in response to Message 26633.
Last modified: 17 Aug 2012 | 20:39:22 UTC

This is something of an aside, as the no-heartbeat issue (30 seconds without a response -> stop task) is clouding the problem here, but I think a delay needs to be introduced during task startup/restart. This should really be done at the BOINC level rather than in the app or by users. That said, a script to do so would be a good workaround, similar to the Linux startup delay, but on a task-by-task basis (allowing a few seconds for each task to load).
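The task-by-task delay could be scripted roughly as follows (a sketch only: the resume callback is a placeholder, e.g. something wrapping `boinccmd --task <url> <name> resume`, and the delay value is arbitrary):

```python
# Sketch: resume tasks one at a time with a fixed delay between them,
# so each task finishes loading its checkpoint before the next starts.
import time

def staggered_schedule(task_names, delay_s=5):
    """Pair each task with the offset (seconds) at which to resume it."""
    return [(name, i * delay_s) for i, name in enumerate(task_names)]

def resume_staggered(task_names, resume_fn, delay_s=5, sleep=time.sleep):
    """Call resume_fn on each task, sleeping delay_s between calls."""
    for name, offset in staggered_schedule(task_names, delay_s):
        if offset:
            sleep(delay_s)
        resume_fn(name)
```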

If possible, using a secondary hard drive should help avoid this issue. That said, for normal usage you would want the SSD as the system drive to boost overall performance, especially startup/shutdown, rather than dedicating it to BOINC. Of course, I'm guilty of buying an SSD just to support some of the more challenging projects, but then I like a challenge.

I really see this as a problem between Rosetta and BOINC. Basically, I think BOINC should always prioritise GPU projects over CPU projects, even if that means using delayed writes or suspending the CPU project. Frankly, I never want any CPU project to interfere with a GPUGrid long run for any reason (HDD, CPU, RAM...). If CPU projects could ascertain how many resources were available to them after the GPU project starts as a priority, this sort of problem would never happen.

Profile Raptures Riot
Joined: 30 Apr 11
Posts: 6
Credit: 220,588,795
RAC: 0
Message 26636 - Posted: 17 Aug 2012 | 21:09:13 UTC

I am getting a lot of "Energies have become nan" errors, the same as many others. This usually occurs 3 or 4 hours into the calculation. I'm presuming "nan" means "indeterminate", which is a legitimate conclusion to the model, so I do not understand why the calculation ends in an error. If this result is useful info, could a "completed successfully" please be awarded? There seems to be some disenchantment in the forums over this topic. I know everyone here is dedicated and respectful, and we try really hard for results. Let's understand why "nan" can be a good three-letter word.
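For what it's worth, "nan" is IEEE-754 "Not a Number". Once a simulation's energies become NaN, every subsequent arithmetic step stays NaN, so the rest of the trajectory carries no usable information; that is presumably why the application treats it as a fatal error rather than a legitimate result. A small illustration:

```python
# NaN poisons every later calculation: a simulation whose energies
# become NaN cannot recover and produces no further usable data.
import math

energy = float("nan")
print(math.isnan(energy + 1.0))  # True: NaN propagates through arithmetic
print(energy == energy)          # False: NaN is not even equal to itself
```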

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,217,465,968
RAC: 1,257,790
Message 26637 - Posted: 17 Aug 2012 | 21:18:05 UTC - in response to Message 26636.

The nan error has its own thread.
