I am seeing about 10% of my NOELIA WUs getting "stuck" - the "fraction done" output stops moving. This seems to happen most often with "run4" WUs, but I have also seen it with the other run numbers. If I restart BOINC, it starts the WU over from 0.00000. Sometimes it will freeze again at another spot, sometimes the WU will finish successfully after the restart, and sometimes it errors out.
Link to the machine - http://www.gpugrid.net/show_host_detail.php?hostid=111125
Linux x86_64 (Fedora 17 3.4.6-2.fc17.x86_64)
NVIDIA UNIX x86_64 Kernel Module 304.37
GeForce GTX 560 Ti
|
|
Dylan
Joined: 16 Jul 12   Posts: 98   Credit: 386,043,752   RAC: 0
Hmm, I haven't encountered this issue, but I run Windows, so that might be why.
For me, tasks sometimes restart when BOINC does, but not every time.
Sorry I can't help you.
|
|
K1atOdessa
Joined: 25 Feb 08   Posts: 249   Credit: 389,810,184   RAC: 1,368,385
This has happened to me twice so far (I am running Windows). The first time, it was two days before I noticed, so I just aborted (given that utilization was practically 0% and progress was not moving). The second time I saw it, I closed BOINC and reopened it, after which the WU errored out.
Kind of a pain, but I'm just keeping an eye out at this point.
|
|
|
One of my hosts is crunching such a workunit right now. It has been running for 21h26m and is at 78.320%, progressing very slowly. This type of workunit usually takes less than 7 hours to complete. I've tried pausing and restarting the task, and I've moved it to another GPU in the same host, but there's no change in its speed. I've double-checked that none of this host's GPUs is downclocked.
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09   Posts: 3968   Credit: 1,995,359,260   RAC: 0
The same workunit was aborted on another system with these verbosely challenged details:
Stderr output
<core_client_version>6.12.34</core_client_version>
Most likely it's some problem with the tasks themselves, but perhaps they need more CPU or bandwidth. Does freeing up another CPU core/thread help (if it's bandwidth, you would have to suspend your CPU tasks to notice)? Are GPU temps/fan speeds normal?
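If you want to catch it in the act, here is a minimal GPU-watchdog sketch (my own rough idea, nothing official). It assumes nvidia-smi is on the PATH and that your driver's nvidia-smi supports the --query-gpu fields used below, which older releases may not:

#!/usr/bin/env python3
# Minimal GPU watchdog sketch: log temperature, utilization and fan speed
# once a minute, so a "stuck" task (utilization near 0%) is easy to spot later.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,name,temperature.gpu,utilization.gpu,fan.speed",
         "--format=csv,noheader"]

def sample():
    # One CSV line per GPU, e.g. "2012/09/01 12:00:00, GeForce GTX 560 Ti, 63, 97 %, 55 %"
    return subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    with open("gpu_watch.log", "a") as log:
        while True:
            log.write(sample() + "\n")
            log.flush()
            time.sleep(60)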
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
|
|
|
|
It finished after 27 hours...
Stderr output:
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
MDIO: cannot open file "restart.coor"
No heartbeat from core client for 30 sec - exiting
# Time per step (avg over 545000 steps): 19.800 ms
# Approximate elapsed time for entire WU: 99000.143 s
called boinc_finish
</stderr_txt>
]]>
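As a back-of-the-envelope check (the ~5,000,000-step total below is inferred from those two numbers, it isn't stated in the log), the figures are consistent with each other and with the 27-hour run time:

ms_per_step = 19.800         # "# Time per step (avg over 545000 steps)"
elapsed_s = 99000.143        # "# Approximate elapsed time for entire WU"
steps = elapsed_s / (ms_per_step / 1000.0)
print(round(steps))                  # ~5,000,000 steps implied
print(elapsed_s / 3600.0)            # ~27.5 hours at 19.8 ms/step
print(7 * 3600.0 / steps * 1000.0)   # ~5 ms/step would be needed for the usual sub-7-hour finish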
Since then, my host has finished a couple of workunits without any problems and without any restarts.
|
|
|
I was seeing this issue with the much shorter "trypsin_lig" runs. When these ran successfully, they ran very quickly (under an hour). CPU utilization has been under 20% and the GPU temperature/fan are normal. I have been running other ACEMD tasks and long runs on this same machine without issue for almost a year.
|
|
|
Here are some examples of failed WUs - three errored out and two I aborted after they got stuck.
http://www.gpugrid.net/result.php?resultid=5726005
http://www.gpugrid.net/result.php?resultid=5724879
http://www.gpugrid.net/result.php?resultid=5723761
http://www.gpugrid.net/result.php?resultid=5719313
http://www.gpugrid.net/result.php?resultid=5713069
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09   Posts: 3968   Credit: 1,995,359,260   RAC: 0
We are seeing three different errors here.
Zoltan's system had a "No heartbeat from core client for 30 sec" error, and ETQuestor had 2 different errors (3 tasks were aborted). One error was a SIGABRT and the other was an "energies have become nan" error:
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
SIGABRT: abort called
Stack trace (15 frames):
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
/lib64/libc.so.6(+0x359a0)[0x7fdff7ad39a0]
/lib64/libc.so.6(gsignal+0x35)[0x7fdff7ad3925]
/lib64/libc.so.6(abort+0x148)[0x7fdff7ad50d8]
/lib64/libc.so.6(+0x2e6a2)[0x7fdff7acc6a2]
/lib64/libc.so.6(+0x2e752)[0x7fdff7acc752]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdff7abf735]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]
Exiting...
--
ERROR: file deven.cpp line 1106: # Energies have become nan
Each of these errors can be caused by more than one problem. There have been suggestions about these errors in the past.
While the No Heartbeat error can simply be caused by not having enough access to the CPU, it is often due to not being able to write to the hard drive because other tasks are thrashing the drive (I think any process that reads or writes to the drive is prioritised over anything BOINC does). Things like automatic disk defrags, Windows search indexing and AV scans can trigger it. I have also experienced this error when a SATA cable came loose (the sub-standard red ones)! I suspect opening or leaving the Event Log window open might contribute as well (if you have lots of cc_config flags set).
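For background on what that heartbeat actually is: the science application expects the client to signal it regularly, and if nothing arrives for about 30 seconds the app gives up and exits, which is exactly the line in Zoltan's stderr. The snippet below is only a simplified illustration of that watchdog pattern (the names and structure are mine, not the real BOINC code):

import os
import threading
import time

HEARTBEAT_TIMEOUT = 30.0           # seconds, matching the message in the stderr above
_last_heartbeat = time.monotonic()

def on_heartbeat():
    # Would be called whenever a heartbeat arrives from the client.
    global _last_heartbeat
    _last_heartbeat = time.monotonic()

def _watchdog():
    while True:
        time.sleep(1.0)
        if time.monotonic() - _last_heartbeat > HEARTBEAT_TIMEOUT:
            print("No heartbeat from core client for 30 sec - exiting")
            os._exit(0)            # hard exit; the client later restarts the task from its last checkpoint

threading.Thread(target=_watchdog, daemon=True).start()

if __name__ == "__main__":
    # Tiny demo: heartbeats arrive for 5 seconds, then stop;
    # about 30 seconds later the watchdog fires and the process exits.
    for _ in range(5):
        on_heartbeat()
        time.sleep(1.0)
    time.sleep(HEARTBEAT_TIMEOUT + 5)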
The SIGABRT (an abort-task signal) and the Not a Number errors could well be task related, but they could also be Linux setup/driver/library issues. In the past, similar errors were supposedly caused by BOINC running CPU benchmarks, amongst other things. A libc.so.6 "double free or corruption" was reported back in January when crunching a TONI task, though there are no suggestions in that thread. Soneageman also reported this error; again, I don't see any specific helpful info there. If the problem continues, look at the hardware (temps/fan speed/noise), the driver and the BOINC client (config/updates).
|
|
|
We are seeing three different errors here.
Zoltan's system had a "No heartbeat from core client for 30 sec" error, ....
While the No Heartbeat error can simply be caused by not having enough access to the CPU, it is often due to not being able to write to the hard drive because other tasks are thrashing the drive (I think any process that reads or writes to the drive is prioritised over anything BOINC does). Things like automatic disk defrags, Windows search indexing and AV scans can trigger it. I have also experienced this error when a SATA cable came loose (the sub-standard red ones)! I suspect opening or leaving the Event Log window open might contribute as well (if you have lots of cc_config flags set).
While this is true, it's not the source of the slowness of this workunit. It was slow right from the start.
My rosetta@home tasks were going wild, using 400 to 850 MB, so when I started Skype, the BOINC manager shut down every task because the memory used by BOINC applications exceeded the threshold of maximum usable physical memory (90%). Then it restarted the tasks one by one, but rosetta@home tasks read 1.3 GB (and write 130 MB) at startup, and since I don't have an SSD in this PC, this could overwhelm the file system, causing tasks to start and stop several times (because of the "No heartbeat from core client" error) and rendering my PC unusable for a couple of minutes. Since then I've doubled the RAM in this host (it now has 12 GB).
The other 3 tasks running at the same time were experiencing this "No heartbeat" error too, but they didn't slow down. This error makes the BOINC manager stop the task and restart it from the last checkpoint.
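(For anyone wondering why a stop/restart only costs the time since the last checkpoint: the pattern is roughly what's sketched below, with a made-up file name and format - the task periodically saves its progress and, when it is started again, it resumes from whatever was last saved instead of from zero.)

import json
import os

CHECKPOINT = "checkpoint.json"       # made-up name; the real app has its own state files

def load_checkpoint():
    # A freshly started task has no checkpoint yet and begins at step 0.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    # Write to a temporary file first, then rename it over the old checkpoint,
    # so a crash mid-write cannot corrupt the existing one.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT)

start = load_checkpoint()
for step in range(start, 1_000_000):
    # ... one simulation step would run here ...
    if step % 10_000 == 0:
        save_checkpoint(step)        # an unexpected exit loses at most 10,000 steps of work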
Here is a list of my workunits which experienced this error, but didn't slow down:
5743222 2 times
5743031 2 times
5742951 2 times
5742426 4 times
5741987 2 times
5741772 2 times
5741561 2 times
5741537 2 times
5740471
5740130
5739827
5739821
|
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09   Posts: 3968   Credit: 1,995,359,260   RAC: 0
Sort of an aside, as the (no heartbeat) 30-sec no-response->stop-task issue is clouding the problem here, but I think a delay needs to be introduced during task startup/restart. This should really be done at the BOINC level rather than in the app or by the users. That said, a script to do so would be a good workaround, similar to the Linux startup delay, but on a task-by-task basis (allow a few seconds for each task to load).
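Something like the rough sketch below is what I mean. It is only a hypothetical workaround; it assumes boinccmd is installed and that its --get_tasks output contains "name:" and "project URL:" lines (the exact field labels can differ between client versions). The idea is to suspend everything and then resume the tasks one at a time with a pause in between, so several restarting tasks don't all hit the disk at the same moment:

#!/usr/bin/env python3
import re
import subprocess
import time

DELAY = 10  # seconds between resuming tasks

def get_tasks():
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True).stdout
    names = re.findall(r"^\s*name: (\S+)", out, re.M)
    urls = re.findall(r"^\s*project URL: (\S+)", out, re.M)
    return list(zip(urls, names))

def set_task(url, name, action):
    subprocess.run(["boinccmd", "--task", url, name, action], check=False)

if __name__ == "__main__":
    tasks = get_tasks()
    for url, name in tasks:
        set_task(url, name, "suspend")
    for url, name in tasks:
        set_task(url, name, "resume")
        time.sleep(DELAY)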
If possible, using a secondary hard drive should help avoid this issue. That said, for normal usage you would want an SSD for the whole system, to boost performance (especially startup/shutdown), rather than one used just for BOINC. Of course, I'm guilty of buying an SSD just to support some of the more challenging projects, but then I like a challenge.
I really see this as a problem with Rosetta and BOINC. Basically, I think BOINC should always prioritise GPU projects over CPU projects, even if it means using delayed writes or suspending the CPU project. Frankly, I never want any CPU project to interfere with a GPUGrid long run for any reason (HDD, CPU, RAM...). If CPU projects could ascertain how many resources were available to them after the GPU project starts as a priority, this sort of problem should never happen.
|
|
|
I am getting a lot of 'Energies have become nan' errors, the same as many others. This usually occurs 3 or 4 hours into the calculation. I'm presuming 'nan' means 'indeterminate', which is a legitimate conclusion for the model. Therefore I do not understand why the calculation ends in an error. Please, if this result is useful info, can a 'completed successfully' be awarded? There seems to be some disenchantment in the forums over this topic. I know everyone here is dedicated and respectful, and we try really hard for results. Let's understand why 'nan' can be a good three-letter word.
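For what it's worth, at the floating-point level 'nan' ("not a number") is what appears once the numbers themselves have broken down, and it then contaminates everything calculated from it; a generic illustration (nothing ACEMD-specific):

import math

energy = 1e308 * 10        # overflow -> inf
spread = energy - energy   # inf - inf is undefined -> nan
print(spread)              # nan
print(spread + 1.0)        # still nan: it propagates through every later calculation
print(math.isnan(spread))  # True; presumably a check like this is what triggers "Energies have become nan"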
|
|
|
The NaN error has its own thread.
|
|