Message boards : News : New project in long queue
Hello all, | |
ID: 28895 | Rating: 0 | rate: / Reply Quote | |
I can't download any; I keep trying, but no long runs have come through in the last hour. | |
ID: 28896 | Rating: 0 | rate: / Reply Quote | |
These units appear to be very long; finishing time could be close to 20 hours on my computers. Assuming there are no errors!! | |
ID: 28899 | Rating: 0 | rate: / Reply Quote | |
I've had to abort 3 NOELIAs in the past 2 hours; GPU usage was at 100% and the memory controller was at 0%. I had to reboot the computer to get the GPUs working again. Windows popped up an error message complaining that "acemd.2865P.exe had to be terminated unexpectedly". As soon as I suspended the NOELIA work unit, the error message went away. | |
ID: 28911 | Rating: 0 | rate: / Reply Quote | |
So far I got one error after 14,000 secs :( and a second one which was successful after 15 hours (560 Ti 448-core edition, 157k credits). Now it's calculating a third one... let's see. | |
ID: 28930 | Rating: 0 | rate: / Reply Quote | |
The first one of these I received was: <core_client_version>7.0.52</core_client_version> <![CDATA[ <message> The system cannot find the path specified. (0x3) - exit code 3 (0x3) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. Assertion failed: a, file swanlibnv2.cpp, line 59 This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information. </stderr_txt> ]]> I have another one that's at 62.5% after 14 hours. Looking at some of the NOELIA WUs, they seem to be failing all over the place, some of them repeatedly. They're also too long for my machines to process and return in 24 hours. After the one that's running either errors out or completes, I will be aborting the NOELIA WUs. Losing 24+ hours of GPU time per failure is not my favorite way to waste electricity. Sorry. BTW, the TONI WUs run fine. | |
ID: 28933 | Rating: 0 | rate: / Reply Quote | |
I've found NOELIA WUs to be highly unreliable, even on the short queue. I don't like getting one as I've no idea if it'll complete without errors. I had to abort a short NOELIA one yesterday as it kept crunching in circles, meaning it crunched for some minutes and then went back to the beginning to start all over again. | |
ID: 28934 | Rating: 0 | rate: / Reply Quote | |
These new NOELIA tasks don't use a full CPU thread (core, if you like) to feed a Kepler-type GPU, the way other workunits (such as TONI_AGG) used to. Is this behavior intentional or not? Maybe that's why it takes so long to process them. It takes 40,400 secs for my overclocked GTX 580 to finish these tasks, while it takes 38,800 secs for a (slightly overclocked) GTX 670, so there is a demonstrable loss (~5%) in the GTX 670's performance. | |
ID: 28935 | Rating: 0 | rate: / Reply Quote | |
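For anyone who wants to test whether reserving a full CPU core per GPU task changes these run times, BOINC's app_config.xml mechanism is one way to try it. Below is a minimal sketch that writes such a file; the app name ("acemdlong"), the project folder name, and the Windows data path are assumptions, so check your own client_state.xml before using it. Note this only changes what the client budgets for the task, not what the science app actually does with that core.

```python
# Minimal sketch: write an app_config.xml that budgets a full CPU core per
# GPU task. The app name "acemdlong" and the paths below are assumptions --
# verify them against client_state.xml on your own host.
import os

APP_CONFIG = """<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

# Assumed default BOINC data directory on Windows; adjust for Linux or a
# non-standard install.
project_dir = r"C:\ProgramData\BOINC\projects\www.gpugrid.net"

with open(os.path.join(project_dir, "app_config.xml"), "w") as f:
    f.write(APP_CONFIG)

print("app_config.xml written; re-read config files in BOINC Manager or restart the client.")
```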
Some of the new NOELIA units are bugged somehow, I think. Some run fine, some of them not. | |
ID: 28936 | Rating: 0 | rate: / Reply Quote | |
I posted this in the "long application updated to the latest version" thread, but Firehawk implied these issues should be reported in this thread. I don't know if this is a 6.18 issue or a NOELIA WU issue, but I guess time will tell. So I apologize in advance for the double posting, if that is a bigger faux pas than not knowing which thread is the appropriate one to post to. ;-) | |
ID: 28937 | Rating: 0 | rate: / Reply Quote | |
You're getting us hawks mixed up; I've been using this name since '95 and I think that's the first time that's happened. | |
ID: 28938 | Rating: 0 | rate: / Reply Quote | |
Mea Culpa. | |
ID: 28940 | Rating: 0 | rate: / Reply Quote | |
I'm also getting several Noelia tasks making very slow progress, the same problem as was flagged in the beta test. The size of the upload is causing issues for me as well. | |
ID: 28941 | Rating: 0 | rate: / Reply Quote | |
The Noelia workunits refuse to run on my 660 Ti Linux system. They lock up or make no progress. I have finished one on two different Linux systems with 670s without problems. | |
ID: 28942 | Rating: 0 | rate: / Reply Quote | |
The Noelia longs either fail in the first 3 to 4 minutes on my GTX 560 and GTX 650 Ti (the only four failures I have had on GPUGRID), or else they complete successfully. | |
ID: 28943 | Rating: 0 | rate: / Reply Quote | |
Well Jim, now you can understand how most of us got 90% of our errors. If you had looked closer you would have noticed that almost all of them came from a first run of NOELIAs in early February. Instead, you thought you would display your distributed computing prowess and give us your expert advice, and proceeded to tell us about our substandard components, or our inability to overclock correctly, and the overheating issues we must be having. | |
ID: 28944 | Rating: 0 | rate: / Reply Quote | |
flashawk, | |
ID: 28945 | Rating: 0 | rate: / Reply Quote | |
I guess it's understandable; the best advice I could ever give in my 51 years is "wait and see". I don't walk into another's clubhouse and start rearranging the furniture. There's been many a time when I've jumped to quick conclusions in my own mind, only to find out later that I was wrong. | |
ID: 28946 | Rating: 0 | rate: / Reply Quote | |
I managed to get http://www.gpugrid.net/result.php?resultid=6567675 to run to completion by making sure it wasn't interrupted during computation. But 12 hours on a GTX 670 is a long time to run without task switching when you're trying to support more than one BOINC project. SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574. The TONI task that followed, again on the same card, seems to have started and to be running normally. | |
ID: 28947 | Rating: 0 | rate: / Reply Quote | |
Richard Haselgrove wrote: I just had the same thing happen to me, Richard; right after the computation error, a TONI WU started on the same GPU card and it sat idle with 0% GPU load and 0% memory controller usage. I had to suspend BOINC and reboot to get the GPU crunching again. As far as times go on my GTX 670s, the NOELIA WUs have ranged from 112MB to 172MB so far, and the smaller one took 7.5 hours while the larger one took 11.75 hours. So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide. Edit: Check out this one, I just downloaded it a couple of minutes ago. I noticed it ended in a 6, which means I'm the 7th person to get it. This is off the hook - man! http://www.gpugrid.net/workunit.php?wuid=4210634 | |
ID: 28948 | Rating: 0 | rate: / Reply Quote | |
So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide. Far more likely that the tasks which run, by design, for a long time generate a large output file. After the last NOELIA failure (which triggered a driver restart), I ran a couple of small BOINC tasks from another project. The first one errored, the second ran correctly. After that, I ran a long TONI: successful completion, no computer restart needed. I'm running the 314.07 driver. | |
ID: 28950 | Rating: 0 | rate: / Reply Quote | |
My systems haven't been changed since the application upgrade. | |
ID: 28951 | Rating: 0 | rate: / Reply Quote | |
I'm having some new, weird issue, but only on my AMD 3x690 rig. Three times now: BSODs and system restarts. It only goes away if all the workunits (and the cache!!!) are aborted. I don't have a clue why this happens, but this AMD rig is rock solid in normal crunching and it's doing more than 2M per day on its own. | |
ID: 28952 | Rating: 0 | rate: / Reply Quote | |
My systems haven't been changed since the application upgrade. I have had the same issues, and on top of that I got an error message saying that acemd.2865.exe has crashed, and the video card ends up running at a slower speed. I have had more errors with this application than the last time I did beta testing. | |
ID: 28953 | Rating: 0 | rate: / Reply Quote | |
Hello! | |
ID: 28954 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. I think we need to try and distinguish between 'application' problems and 'task' problems - or 'project' (as in research project), as Noelia called it in starting this thread. | |
ID: 28955 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. So do you think that the fact that we are getting these errors after we changed to application version 6.18 is mere coincidence? Maybe you are right. There is a way to prove this: run the failed units under application 6.17. If they fail, it's the units, but if they don't fail, it's the new application. | |
ID: 28957 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. In my personal experience, all TONI tasks, and 50% of NOELIA tasks, have run correctly under application version 6.18 | |
ID: 28958 | Rating: 0 | rate: / Reply Quote | |
041px48x2-NOELIA_041p-1-2-RND9263: after 15 hours, when the work on this task ended, the NVIDIA driver crashed and the work was marked as faulty. Another one was marked as valid (nn016_r2-TONI_AGGd8-38-100-RND3157_0), but these problems have been going on for more than a week already; it's insane. The NVIDIA driver even crashes on a proper shutdown of BOINC Manager, for example. | |
ID: 28959 | Rating: 0 | rate: / Reply Quote | |
I have had more errors with this application than the last time I did beta testing. Richard, this is my experience exactly. All TONIs run fine and 50% of NOELIAs crash. TONI should maybe give a clinic to the others. I don't think it has much to do with 6.18 either; it's just that the new NOELIAs were released at the same time as 6.18. | |
ID: 28961 | Rating: 0 | rate: / Reply Quote | |
aborted a NOELIA one after it began crunching in circles... | |
ID: 28962 | Rating: 0 | rate: / Reply Quote | |
The first Noelia failed with: SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1841. That seems to be an out-of-GPU-memory error. So maybe someone should set stricter minimum memory limits on these Noelia tasks? Edit: Technically, that wasn't my first Noelia, just the first one of this batch. I got at least one, probably more, in February, and they took 25 hours but were otherwise fine. | |
ID: 28969 | Rating: 0 | rate: / Reply Quote | |
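If the 702 error really is GPU memory pressure, as speculated above (driver error codes can also point at other failures), one quick check is to watch free GPU memory while a NOELIA task runs and see whether it drops toward zero before the crash. A minimal sketch, assuming nvidia-smi is on the PATH and that a 5-second poll is frequent enough:

```python
# Poll GPU memory with nvidia-smi while a task runs. Assumes nvidia-smi is on
# the PATH; the interval and output format are arbitrary choices.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.check_output(QUERY, text=True)
    for line in out.strip().splitlines():
        idx, total, used, free = (s.strip() for s in line.split(","))
        print(f"GPU {idx}: {used}/{total} MiB used, {free} MiB free")
    time.sleep(5)
```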
The first Noelia I see that both WUs are marked as errors ("WU cancelled"). Something may be happening behind the scenes. | |
ID: 28970 | Rating: 0 | rate: / Reply Quote | |
These NOELIA WUs have been cancelled. Their successors will have a slightly different configuration that will hopefully be more stable. | |
ID: 28971 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed. | |
ID: 28972 | Rating: 0 | rate: / Reply Quote | |
Be aware also that these and subsequent WUs will fail if you have overridden the application version and are not running the latest. | |
ID: 28974 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. Were the TONI WUs cancelled too? They ran fine.. | |
ID: 28978 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. And the two I have in progress are still fine, and shown as viable on the website. | |
ID: 28980 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. Just got a couple new ones. Seems the queue coincidentally ran dry for a while: GPUGRID 03-04-13 13:45 Requesting new tasks for NVIDIA GPUGRID 03-04-13 13:45 Scheduler request completed: got 0 new tasks GPUGRID 03-04-13 13:45 No tasks sent GPUGRID 03-04-13 13:45 No tasks are available for Long runs (8-12 hours on fastest card) | |
ID: 28981 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. She'll resend new simulations that avoid the problems in the next day or so. The large upload sizes will also be fixed. Thank you guys. Another thing that I really appreciate about this project is your awesome and fast support, which didn't happen on the project I ran for the past 13 years... sadly. | |
ID: 28984 | Rating: 0 | rate: / Reply Quote | |
We're looking at the issue. The problematic WUs have been cancelled for now. The problem was clearly on our end, but it seems that there were multiple reasons they were having issues, and mostly not Noelia's fault. Were the issues related to the new application, the WUs, or both? | |
ID: 28989 | Rating: 0 | rate: / Reply Quote | |
How big are the uploads for these reworked NOELIAs supposed to be? The 3 I've finished were barely over 4MB after 11.5 hours of "crunching". Is this about right? | |
ID: 28998 | Rating: 0 | rate: / Reply Quote | |
... I got messages like "abort by user" - but I didn't abort any ... | |
ID: 29000 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. | |
ID: 29001 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. It's not possible to block specific tasks. At least that's what I learned from my own thread. http://www.gpugrid.net/forum_thread.php?id=3315 ____________ Team Belgium | |
ID: 29002 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. It happened to me too. I had 2 Noelia units that were aborted by user, which I didn't abort. They were otherwise running fine. So, what is happening? | |
ID: 29003 | Rating: 0 | rate: / Reply Quote | |
I also have the same problem with the tasks killing the acemd process. When I checked the thread I got even more pissed off - one week after the "bomb" was thrown, no reaction from Noelia, no official response, no retraction of the packages - nothing. http://www.gpugrid.net/forum_thread.php?id=3311&nowrap=true#28972 | |
ID: 29004 | Rating: 0 | rate: / Reply Quote | |
Were the issues related to the new application, the WUs, or both? The WUs were not set to upload the smaller file size format we are now trying to move to. They were set to use the old format, which could result in very large file upload sizes, as some people complained about. The problem with the application was an obscure one. It wasn't an issue with the application per se, but rather with how the application interacts with BOINC and this specific type of configuration file for the simulations. In short, at the start of every WU the application was performing a function that it was only supposed to perform for the first WU in a chain. This caused all but the first WU in a chain to fail. This isn't a problem locally for us, but with how BOINC handles the files, it became a problem. We are working on a long-term fix, but we have simply found a way around it for now. ... I got messages like "abort by user" - but I didn't abort any ... I am not sure what is happening. Even if we cancel a group of WUs, they should complete on your computer (if they are good simulations). The "Abort by user" can only come from the user/client, typically when you deliberately cancel a WU with the "Abort" button. Hopefully there is nothing else going on... | |
ID: 29006 | Rating: 0 | rate: / Reply Quote | |
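To illustrate the kind of first-WU-versus-continuation logic being described (a sketch only, not the actual ACEMD code; the helper functions are hypothetical stubs): a continuation WU should resume from a restart file, while only the first WU in a chain should initialise from the packaged inputs. The bug described above amounts to running the first-WU step on every WU, which is consistent with continuations in a chain failing.

```python
# Illustrative sketch only (not the actual ACEMD code). The helper functions
# are hypothetical stubs standing in for the real restart/initialisation steps.
import os

def load_restart(path):
    # Stub: the real app would read saved coordinates/velocities here.
    return {"resumed_from": path}

def initialise_from_inputs(workdir):
    # Stub: the real app would build the starting state from the input files.
    return {"initialised_in": workdir}

def start_workunit(workdir):
    restart_file = os.path.join(workdir, "restart.coor")
    if os.path.exists(restart_file):
        # Continuation WU: resume where the previous WU in the chain stopped.
        return load_restart(restart_file)
    # First WU in the chain: start from the packaged inputs.
    # The bug described above behaves as if this branch ran for every WU.
    return initialise_from_inputs(workdir)

print(start_workunit("."))
```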
... I got messages like "abort by user" - but I didn't abort any ... Ah. That's one I can help you with. I got an 'aborted by user', too - task 6581613. If you look at the task details, it has "exit status 202". At some stage in the development of recent BOINC clients, David updated and expanded the range of error and exit status codes returned by the client. Unfortunately, he didn't - at first, and until prodded - update the decode tables used on project web sites. You need to update html/inc/result.inc on your web server to something later than http://boinc.berkeley.edu/trac/changeset/1f7ddbfe3a27498e7fd2b4f50f3bf9269b7dae25/boinc/html/inc/result.inc to get a proper website display using case 202: return "EXIT_ABORTED_BY_PROJECT"; Full story in http://boinc.berkeley.edu/dev/forum_thread.php?id=7704 | |
ID: 29008 | Rating: 0 | rate: / Reply Quote | |
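For anyone decoding these numbers by hand until their project site catches up, the fix Richard points to boils down to a lookup from exit status to symbolic name. A minimal sketch in Python rather than the project's PHP (html/inc/result.inc); only the 202 mapping is taken from the post above, so extend the table from BOINC's own source for other codes:

```python
# Map BOINC exit status codes to symbolic names. Only 202 is taken from the
# post above; add further entries from BOINC's source if you need them.
EXIT_STATUS_NAMES = {
    202: "EXIT_ABORTED_BY_PROJECT",  # server-side cancellation, shown as "Aborted by user" by old decode tables
}

def exit_status_name(code: int) -> str:
    return EXIT_STATUS_NAMES.get(code, f"unknown exit status {code}")

print(exit_status_name(202))  # EXIT_ABORTED_BY_PROJECT
```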
I was surprised to see some "Aborted By User" tasks this morning, especially since they happened while I was sleeping! | |
ID: 29009 | Rating: 0 | rate: / Reply Quote | |
Out of all 4 of my machines, I had 7 "Aborted by user" errors last night. My computers will be on probation by tomorrow and I won’t be able to download work units. | |
ID: 29013 | Rating: 0 | rate: / Reply Quote | |
I haven't had the server abort any Noelias lately. I've just had them all segfault within an hour or two. :( | |
ID: 29014 | Rating: 0 | rate: / Reply Quote | |
Still having the same problems with NOELIAs. That will put my biggest machine down, because this one is BSODing and ruining the whole cache, which is very hard to build at the moment. | |
ID: 29028 | Rating: 0 | rate: / Reply Quote | |
It seems to me that these NOELIAs are suffering from memory leaks: when my card finishes one and starts the next, the GPU pegs at 99-100% and the memory controller stays at 0%. If I reboot, all is well and it works fine. The previous WU won't release the memory on the GPU, thus the reboot. This is Windows XP Pro 64-bit; different operating systems seem to be dealing with it differently. Windows 7 and 8 are getting BSODs or driver crashes, and I also get the "acemd.2865P.exe had to be terminated unexpectedly" error. Oh well, I don't even know if this stuff we post helps or gets read. | |
ID: 29029 | Rating: 0 | rate: / Reply Quote | |
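One way to catch the "GPU pegged, memory controller idle" state described above without staring at a monitoring tool is to poll both utilization figures and flag the combination when it persists. A minimal sketch, assuming nvidia-smi is on the PATH; the thresholds and timings are arbitrary:

```python
# Flag GPUs that sit at high load with an idle memory controller for more
# than a minute. Assumes nvidia-smi is on the PATH; thresholds are arbitrary.
import subprocess
import time

def sample():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"],
        text=True)
    return [tuple(int(v) for v in line.split(","))
            for line in out.strip().splitlines()]

stuck_since = {}
while True:
    for idx, gpu_util, mem_util in sample():
        if gpu_util >= 95 and mem_util == 0:
            stuck_since.setdefault(idx, time.time())
            if time.time() - stuck_since[idx] > 60:
                print(f"GPU {idx} looks stuck: load {gpu_util}%, memory controller {mem_util}%")
        else:
            stuck_since.pop(idx, None)
    time.sleep(10)
```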
Same here... Seems that the new Noelia WU doesn't work well... it freezes the computer. I had to reset the project. | |
ID: 29030 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. | |
ID: 29032 | Rating: 0 | rate: / Reply Quote | |
Just NOELIAs incoming. Impossible to run the project at the moment. Too bad; it's a big farm. | |
ID: 29033 | Rating: 0 | rate: / Reply Quote | |
Just NOELIAs incoming. Impossible to run the project at the moment. Too bad; it's a big farm. I just moved to a different project too. Too bad, I liked helping out here, but they don't seem to test anything before release. | |
ID: 29036 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. Thanks, we're looking at it. Obviously this is pretty serious. I will submit some additional WUs to the long queue that I know for sure are good simulations, so that we can get a handle on this. Edit: I have now submitted to the long queue some simulations we know are good. If it is an issue with the app, we will find out. They are named NATHAN_dhfr36_3. | |
ID: 29038 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. I've just reported (in Number Crunching) a failure with a long queue task under Windows 7/64, which didn't freeze the computer or poison the GPU, while short queue tasks under XP/32 are (mostly) running. | |
ID: 29039 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. I noticed the NATHAN units; they are running really well. All machines are back... will report results ASAP :D | |
ID: 29040 | Rating: 0 | rate: / Reply Quote | |
A NATHAN has started running OK here too, even with no reboot after the NOELIA failure (technique as described in NC). | |
ID: 29041 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. In my case, both the 32-bit Windows XP and the 64-bit Windows 7 machines are having errors this morning. The units I downloaded yesterday seem to be okay. Though I did get a crash on my Windows 7 computer, on a unit that was running fine, when I did a reboot; another unit running on the other card didn't crash. The settings (GPU speed, memory and fan) on the video card on which the unit crashed were reset. I had to do another reboot, with the units suspended, to get the video card settings right. I also noticed that on the Windows 7 machine the units take 18-plus hours to finish, while on the Windows XP machine they take about 13 hours. This difference seems excessive. | |
ID: 29042 | Rating: 0 | rate: / Reply Quote | |
It's not limited to XP64; my XP32 got the error in acemd.2865P.exe as well. | |
ID: 29043 | Rating: 0 | rate: / Reply Quote | |
5-6 WUs failed for me this morning. I'm on driver 313.26. As far as I can tell these WUs have also failed for everyone else they were distributed to. I'm seeing these errors in dmesg (I thought my card might be failing, but I think there is a problem with the new WUs). | |
ID: 29044 | Rating: 0 | rate: / Reply Quote | |
New news here: http://www.gpugrid.net/forum_thread.php?id=3318 | |
ID: 29047 | Rating: 0 | rate: / Reply Quote | |
It looks to me like this problem is related to the architecture of the host operating system, as all (1, 2, 3) of my Windows XP x64 systems have a lot of errors, while all (1, 2, 3) of my Windows XP x86 systems are running these NOELIAs fine. After my post (above), my 32-bit hosts had some failures and stuck workunits, so their previous, relatively successful behavior may just have been chance. | |
ID: 29060 | Rating: 0 | rate: / Reply Quote | |