Message boards : News : WU: OPM995 simulations
Author | Message |
---|---|
Here we go again :) This time with 33% more credits plus corrected runtimes, which means an additional 2x credit for WUs that take more than 18 hours on a 780. The batch only contains WUs that take up to a maximum of 24 hours on a 780. I hope I don't seriously overshoot on credits this time, but it's really a bit hit & miss. | |
ID: 43600 | Rating: 0 | rate: / Reply Quote | |
Thanks Stefan! | |
ID: 43602 | Rating: 0 | rate: / Reply Quote | |
Thanks Stefan! Good suggestion. Given the length of these tasks (extra-long or at least some of them), and so many being available, there is no point in people hoarding tasks - they will just miss bonus deadlines and get less credit. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help | |
ID: 43604 | Rating: 0 | rate: / Reply Quote | |
Thanks Stefan! Are you referring to the setting: "Maintain enough work for an additional" I set mine to 0.03 several hours ago and updated my client. Yet it downloaded another WU shortly after one was finished, just as the running WU had barely started. Is there something else to tweak? Thanks, Win ____________ | |
ID: 43628 | Rating: 0 | rate: / Reply Quote | |
Yes, in Boinc Manager (advanced view) under Options, Computing preference and the Computing tab you need to set two values:
| |
ID: 43632 | Rating: 0 | rate: / Reply Quote | |
Please note that really low buffer settings cause increased stress on project scheduler servers, for all projects you are attached to. | |
ID: 43633 | Rating: 0 | rate: / Reply Quote | |
... short buffers don't really help GPUGrid throughput ... Not necessarily true. I'm not speaking specifically about the OPM simulations here, but I think most GPUGrid work is run as a sort of relay race - you hold the baton for a short while, complete your lap of the track, and then hand it back in for somebody else to take over. If you sit at the side of the track for a day and a half before you even start running, that particular baton - series of linked tasks, each generated from the result of the previous lap - is permanently delayed, and the final results aren't available for the scientists to study until that much later. | |
ID: 43634 | Rating: 0 | rate: / Reply Quote | |
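The relay-race point above can be put into rough numbers. This is a minimal sketch, not GPUGrid's actual scheduling logic: it assumes a chain of strictly sequential tasks where each link is only generated once the previous one is reported, and the function name, the chain length, and the per-link times are all hypothetical values chosen for illustration.

```python
def chain_completion_days(n_links, run_days, wait_days):
    """Days until the last link of a sequential task chain is returned.

    Each link can only be generated after the previous one is reported,
    so time spent waiting in a host's cache adds up along the chain.
    """
    return n_links * (run_days + wait_days)


# Hypothetical numbers, for illustration only.
links = 10   # linked tasks in one chain
run = 0.5    # days of actual crunching per link

prompt = chain_completion_days(links, run, wait_days=0.0)
hoarded = chain_completion_days(links, run, wait_days=1.5)

print(f"started immediately:        {prompt:.1f} days")
print(f"1.5 days in cache per link: {hoarded:.1f} days")
```

Under these assumptions the cache wait, not the crunching, dominates how long the scientists wait for the final result of the chain.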
That had slipped my mind. But, if GPUGrid was having a problem getting the batons back for the next runners, and they wanted to ensure that the race kept running smoothly, they could tighten the deadlines on the relay chunks if need be. | |
ID: 43635 | Rating: 0 | rate: / Reply Quote | |
Until the scheduler is re-written at a per device/device-specific level there will be issues with attaching to multiple projects (when using multiple devices). However, these have been addressed as far as reasonably feasible with the existing manager. | |
ID: 43639 | Rating: 0 | rate: / Reply Quote | |
It can't take into account what else you crunch. That's exactly the reason that you shouldn't make blanket suggestions on suggested cache settings that benefit GPUGrid most, without also specifying some of the drawbacks :) I digress. For my particular scenario, I have modified my cache settings a bit, in order to try to keep all my GPUs sustained at 2-GPUGrid-tasks-per-GPU without taking on additional work from other attached GPU projects. I'm using 0.9d+0.9d on the PC that has GTX970+GTX660Ti+GTX660Ti, and 0.5d+0.5d on the PC that has GTX980Ti+GTX980Ti. To each their own. | |
ID: 43640 | Rating: 0 | rate: / Reply Quote | |
For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues. | |
ID: 43641 | Rating: 0 | rate: / Reply Quote | |
It can't take into account what else you crunch. My suggestions are predominantly for GPUGrid only and are typically optimisations for GPUGrid throughput and user/team credit. I don't make suggestions at GPUGrid to facilitate every conceivable combination of Boinc-wide project admix, nor could I - it can't be done. You have different views, values, opinions and objectives which you are quite entitled to express and implement for yourself and to your own ends. My advice is mostly aimed at new, novice or just GPUGrid-new crunchers, or people with a specific problem here. Usually they need a setup to facilitate crunching here and often changes just to make it work. Occasionally I digress too, to advise on an experience crunching elsewhere, or to pass on some observations or knowledge, but there is no catch-all super setup for Boinc. I enjoy the fact that people crunch for a diversity of reasons with different setups and takes on crunching. Highlighting different circumstances and experiences adds to my knowledge and to crunchers' knowledge as a whole, but one shoe doesn't fit all, and this is a GPUGrid forum, not the Boinc central forum where generic advice might better be propagated. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help | |
ID: 43642 | Rating: 0 | rate: / Reply Quote | |
For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues. I don't bother any more. IMO it is what it is and that's just about all it will ever be. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help | |
ID: 43643 | Rating: 0 | rate: / Reply Quote | |
For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues. Gave up too. However it is supremely important to devise more ways for people to burn up their phones while doing nothing useful. | |
ID: 43644 | Rating: 0 | rate: / Reply Quote | |
Stefan, | |
ID: 43654 | Rating: 0 | rate: / Reply Quote | |
initial replication 2 | |
ID: 43655 | Rating: 0 | rate: / Reply Quote | |
Perhaps the question is: | |
ID: 43656 | Rating: 0 | rate: / Reply Quote | |
Probably validation; any proof of concept experiment to demonstrate ability needs to contain appropriate verification for it to be accepted as a model/framework for performing experiments. | |
ID: 43659 | Rating: 0 | rate: / Reply Quote | |
Thanks! | |
ID: 43661 | Rating: 0 | rate: / Reply Quote | |
Hmm... validation deals with quorum though, and also, I thought the way these GPUGrid tasks worked was that the results couldn't really be validated against each other. I might be mistaken though. | |
ID: 43662 | Rating: 0 | rate: / Reply Quote | |
Wasn't thinking about task validation in the Boinc sense but rather validation of the experimental procedure - does it hold any weight? If we consider an experiment as a batch of work, validation of the experiment (and procedures) in scientific terms usually requires that the whole experiment be replicated, and perhaps many times before the results/methods are accepted. Of course Stefan might be doing this for different reasons. | |
ID: 43663 | Rating: 0 | rate: / Reply Quote | |
I see what you mean now. I hope he has another reason. | |
ID: 43666 | Rating: 0 | rate: / Reply Quote | |
GTX970 on W10 24h and 41min with a bit of upload time too (118MB). | |
ID: 43671 | Rating: 0 | rate: / Reply Quote | |
I was fortunate enough to get and complete successfully 2 of these units: | |
ID: 43678 | Rating: 0 | rate: / Reply Quote | |
Is it so, that when the new students arrive, that you would consider creating more short tasks? | |
ID: 43681 | Rating: 0 | rate: / Reply Quote | |
Agreed: pity there are so few shorts..... | |
ID: 43683 | Rating: 0 | rate: / Reply Quote | |
Before I received 2m59_SDOERR_opm994 (short WU) - Three prior hosts (GT640 / GTX950 / GTX970 r361&r364 driver) produced outcome -55 exit code (0xffffffffffffffc9) Unknown error zero runtime's. | |
ID: 43685 | Rating: 0 | rate: / Reply Quote | |
I had one of these WUs fail with this error message: | |
ID: 43695 | Rating: 0 | rate: / Reply Quote | |
I had one of these WUs fail with this error message: See the WARNING/CHALLENGE: VERY LONG WU (VERYLONG_CXCL12_confAna) thread. It's embarrassing that we've run into this again. | |
ID: 43696 | Rating: 0 | rate: / Reply Quote | |
I've got 2d57-SDOERR_opm994-0-1-RND4399_1 running. The file description in client_state.xml is <file> <name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name> <nbytes>0.000000</nbytes> <max_nbytes>5000000.000000</max_nbytes> <status>0</status> <upload_url>http://www.gpugrid.org/PS3GRID_cgi/file_upload_handler</upload_url> </file> - so the maximum size allowed is 5,000,000 bytes. So far, it's reached 852 KB at about 80% progress - which sounds like plenty of headroom, and perhaps not a widespread problem. But I'll keep an eye on it as it approaches completion. | |
ID: 43697 | Rating: 0 | rate: / Reply Quote | |
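The headroom check described above can be sketched in a few lines. This is only an illustration: the `<file>` fragment is the one quoted in the post, the helper name `upload_headroom` is hypothetical, 1 KB is assumed to be 1024 bytes, and the projection to 100% assumes the output file grows linearly with progress, which the thread does not confirm.

```python
import xml.etree.ElementTree as ET

# <file> entry as quoted from client_state.xml in the post above.
FILE_XML = """
<file>
    <name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name>
    <nbytes>0.000000</nbytes>
    <max_nbytes>5000000.000000</max_nbytes>
    <status>0</status>
    <upload_url>http://www.gpugrid.org/PS3GRID_cgi/file_upload_handler</upload_url>
</file>
"""


def upload_headroom(file_xml, current_bytes):
    """Return (limit in bytes, fraction of the limit already used)."""
    elem = ET.fromstring(file_xml.strip())
    limit = float(elem.findtext("max_nbytes"))
    return limit, current_bytes / limit


current = 852 * 1024   # 852 KB observed, assuming 1 KB = 1024 bytes
progress = 0.80        # about 80% done at the time of the observation

limit, used = upload_headroom(FILE_XML, current)
projected = (current / progress) / limit   # assumes linear file growth

print(f"limit: {limit:,.0f} bytes")
print(f"used so far: {used:.1%}")
print(f"projected at completion: {projected:.1%}")
```

With these figures the file stays well under the 5,000,000-byte limit, which is consistent with the "plenty of headroom" reading in the post.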
I apologize for not answering in a while, I have been a bit busy with writing my thesis. | |
ID: 43698 | Rating: 0 | rate: / Reply Quote | |
Hi, Stefan: | |
ID: 43699 | Rating: 0 | rate: / Reply Quote | |
2d57-SDOERR_opm994-0-1-RND4399_1 uploaded cleanly, so it's not a universal problem. | |
ID: 43700 | Rating: 0 | rate: / Reply Quote | |
Before I received 2m59_SDOERR_opm994 (short WU) - Three prior hosts (GT640 / GTX950 / GTX970 r361&r364 driver) produced outcome -55 exit code (0xffffffffffffffc9) Unknown error zero runtime's. WUid=11616186 (1a0r OPM994) crashed my system multiple times - this WU had 100% GPU usage / 1% MCU / 20% power (65W) before the (first ever driver reset(s) I've encountered computing ACEMD in three years.) The (1a0r) WU ended with a -97 (0xffffffffffffff9f) Unknown error number after 102sec at reference stock clock once I noticed the first couple of driver recoveries OCed. (FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1965) A few other stable wingman (980ti / (2) 970's) high-end RAC systems (6 total) have error(s) (<100sec) with (1a0r) WU. As of now (2) OPM995 are without issue on my 970's at very high OC's: (WUid=11614432) 4a6fRO (50479 atoms with 9411 waters in system) 20.25hr estimated runtime at 12-15% CPU usage (3.2GHz) / 63% GPU usage (1511MHz) / 31% MCU (7200MHz) / 27% BUS (PCIe3.0 x4) / 34% power (110W) / 42C core / 820MB memory usage (WUid=116143650 4u15RO (51270 atoms with 8255 waters in system) 20.5hr estimated runtime at 12-15% CPU usage (3.2GHz) / 65% GPU usage (1511MHz) / 34% MCU (7010MHz) / 22% BUS (PCIe3.0 x8) / 60% power (120W) / 45C core / 843MB memory usage | |
ID: 43703 | Rating: 0 | rate: / Reply Quote | |
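The hexadecimal values shown next to the exit codes above are simply the same negative numbers written as unsigned 64-bit integers (two's complement). A minimal sketch of the conversion in both directions follows; the helper names are hypothetical and nothing here says what the codes mean inside ACEMD.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def as_unsigned64_hex(exit_code):
    """Render a signed exit code as the 64-bit hex string the task page shows."""
    return hex(exit_code & MASK64)


def as_signed64(hex_text):
    """Convert a 64-bit hex string back to the signed exit code."""
    value = int(hex_text, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


for code in (-55, -97):
    print(code, as_unsigned64_hex(code))

print(as_signed64("0xffffffffffffffc9"))
print(as_signed64("0xffffffffffffff9f"))
```

So "-55 (0xffffffffffffffc9)" and "-97 (0xffffffffffffff9f)" each describe one value, not two separate error numbers.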
1s4wR0-SDOERR_opm995-0-1-RND5214_0 11614436 3 Jun 2016 | 6:47:02 UTC 3 Jun 2016 | 20:01:33 UTC Completed and validated 45,293.51 20,015.48 147,829.50 | |
ID: 43704 | Rating: 0 | rate: / Reply Quote | |
4azpR0-SDOERR_opm995-0-1-RND6483_1 looks safe as well - 1,283 KB at 61%. | |
ID: 43705 | Rating: 0 | rate: / Reply Quote | |
Too many errors (may have bug) 1a0r-SDOERR_opm994-0-1-RND9594 | |
ID: 43706 | Rating: 0 | rate: / Reply Quote | |
(2) new OPM995 that should make the maximum size file_xfer allowed 5,000,000 bytes: Before I received 2m59_SDOERR_opm994 (short WU) - Three prior hosts (GT640 / GTX950 / GTX970 r361&r364 driver) produced outcome -55 exit code (0xffffffffffffffc9) Unknown error zero runtime's. | |
ID: 43713 | Rating: 0 | rate: / Reply Quote | |
Any TX/980ti/980/970 (Present batch) SDOERR_opm99 grant 1,000,000 credit? | |
ID: 43724 | Rating: 0 | rate: / Reply Quote | |
Any TX/980ti/980/970 (Present batch) SDOERR_opm99 grant 1,000,000 credit? 4by0-SDOERR_opm994-0-1-RND5591_1 58.472s (16h 14m 26s) 1.023.036 credits 170941 atoms 11.696 ns/day 5M steps. This workunit is very interesting: the initial replication was 2, and the other host which received this workunit also received the +50% bonus, even though it returned it after 1d 14h. | |
ID: 43725 | Rating: 0 | rate: / Reply Quote | |
This workunit is very interesting, as the initial replication was 2, the other host which received this workunit also received the +50% bonus, while it has returned it after 1d 14h. AFAIK that's the way it's always worked here. The first reported WU sets the credit for everyone. | |
ID: 43727 | Rating: 0 | rate: / Reply Quote | |
Finally got an OPM on my Ubuntu 16.04 rig. Alas it didn't turn out to be an extra-long run and completed in 12h 35min at stock. Had 4 OPMs finish today. The credit on all of them is 1/2 or less per hour compared to any other long WUs. Guess the credit wasn't fixed after all. | |
ID: 43728 | Rating: 0 | rate: / Reply Quote | |
Got 2 real extra-long tasks on my Win10 system and one 'fake' extra-long task on my Linux system. The real extra-long tasks got 900K Boinc credits whereas the normal-long task only received 147K credits (or thereabouts). | |
ID: 43729 | Rating: 0 | rate: / Reply Quote | |
I got 2 really long tasks on my Win10 system and one fake long task on my Linux system. The real long tasks got |900K Boinc credits whereas the not-really-long task (normal-ling) only received 147K credits (or there about). Remedial math is a good post graduate course... ;-) | |
ID: 43730 | Rating: 0 | rate: / Reply Quote | |
Just after correcting my remedial English :) | |
ID: 43731 | Rating: 0 | rate: / Reply Quote | |
Just after correcting my remedial English :) I don't think it's you that needs the remedial math, and yep it's that time again. | |
ID: 43733 | Rating: 0 | rate: / Reply Quote | |
Really, I am out of ideas on how to fix the credits any further. I did everything I could imagine being wrong. I could blindly multiply the credits by whatever factor you guys tell me, but right now I have to base it off our usual credit calculation script. | |
ID: 43745 | Rating: 0 | rate: / Reply Quote | |
Really, I am out of ideas on how to fix the credits any further. I did everything I could imagine being wrong. I could blindly multiply the credits by whatever factor you guys tell me, but right now I have to base it off our usual credit calculation script. Recent comparisons, OPM vs. CXCL12VOLK. Example from one of my machines: 1gzmR0-SDOERR_opm995-0-1-RND1802_0 11614349 3 Jun 2016 | 6:36:43 UTC 5 Jun 2016 | 9:02:02 UTC Completed and validated 162,200.44 47,231.66 237,804.00 e6s24_e1s9p0f524-GERARD_CXCL12VOLK_15782120_2-0-1-RND1978_0 11613059 28 May 2016 | 21:23:31 UTC 30 May 2016 | 4:49:33 UTC Completed and validated 96,473.03 31,352.66 233,875.00 Here's another one of my computers. This WU had 131548 Natoms: 2w61-SDOERR_opm994-0-1-RND7728_0 11616211 2 Jun 2016 | 17:30:20 UTC 5 Jun 2016 | 18:09:42 UTC Completed and validated 243,192.27 35,757.08 262,409.00 e4s9_e1s18p0f473-GERARD_CXCL12VOLK_15782120_2-0-1-RND7513_1 11609049 1 Jun 2016 | 14:32:18 UTC 2 Jun 2016 | 22:47:34 UTC Completed and validated 98,520.08 29,575.68 233,875.00 From the OPM WUs I've been running lately it seems that the credit is about 45% - 60% per hour compared to other/previous long WUs. On top of that there is a greater chance of failure with these long WUs. I would suggest erring on the high side rather than the low side when estimating credit as it costs you nothing and it's one of the few tokens of appreciation that we receive for our small contribution to the great science that you guys are doing. Whining aside, keep up the excellent work. For a lot of us this is a small way that we can contribute to science. | |
ID: 43749 | Rating: 0 | rate: / Reply Quote | |
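The "45% - 60% per hour" estimate above can be reproduced from the four task rows quoted in the post. This sketch assumes the three trailing numbers in each row are run time in seconds, CPU time in seconds, and granted credit, in that order; that column reading is an assumption, and the function and variable names are made up for the example.

```python
def credit_per_hour(run_time_s, credit):
    """Granted credit divided by wall-clock run time in hours."""
    return credit / (run_time_s / 3600.0)


# (OPM run time s, OPM credit, CXCL12VOLK run time s, CXCL12VOLK credit),
# one tuple per machine, copied from the rows quoted in the post.
machines = [
    (162_200.44, 237_804.00, 96_473.03, 233_875.00),
    (243_192.27, 262_409.00, 98_520.08, 233_875.00),
]

ratios = []
for opm_time, opm_credit, ref_time, ref_credit in machines:
    opm_rate = credit_per_hour(opm_time, opm_credit)
    ref_rate = credit_per_hour(ref_time, ref_credit)
    ratios.append(opm_rate / ref_rate)
    print(f"OPM: {opm_rate:,.0f}/h  CXCL12VOLK: {ref_rate:,.0f}/h  "
          f"ratio: {ratios[-1]:.0%}")
```

On these two machines the OPM tasks come out at roughly 60% and 45% of the CXCL12VOLK credit rate, in line with the range stated in the post.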
I thought you guys might appreciate seeing what can go wrong in a simulation ;) I always love these mistakes. Still, only 1 out of 600+ systems managed to break like this so I'm quite impressed. | |
ID: 43777 | Rating: 0 | rate: / Reply Quote | |
Cool animation, Stefan! Thanks for sharing! :) | |
ID: 43778 | Rating: 0 | rate: / Reply Quote | |