Author |
Message |
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
I'm starting testing again on the "cpumd" multi-threaded CPU app. Please post observations here.
Matt |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
14/10/12 08:21:16 | GPUGRID | No tasks are available for Test application for CPU MD
14/10/12 08:22:31 | GPUGRID | No tasks are available for Test application for CPU MD
14/10/12 08:23:55 | GPUGRID | No tasks are available for Test application for CPU MD
14/10/12 08:28:16 | GPUGRID | No tasks are available for the applications you have selected.
Are CPUMD available now? |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Yes, there are a few hundred there. |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
All set- received 8 tasks.
The CPUMD batch a few months back included a file that gave information about individual atoms ( I can't remember the name of file) - now I only see in progress file the energies (Bond Angle/ Proper Dih./Improper Dih./ Coulomb-14/LJ (SR)/ Coulomb (SR)/ Coul. recip./Potential/ Kinetic En./ Total Energy /Conserved En./ Temperature Pressure (bar)) for every thousand steps along with the input parameters.
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: SSE2
Is AVX available? Or is there no speed-up with AVX? |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Only SSE2 at the moment. Will probably make builds with higher levels of optimisation later but for now I'm concerned about correctness rather than performance.
Matt |
|
|
|
Matt,
I was able to get a task on my Windows box which only runs CPU tasks. That task is currently under way.
On my Linux box, I run GPU tasks most of the time. When I tried to pull a CPU task, the event log says that the computer has reached a limit on tasks in progress, even when there are no CPU tasks downloaded. I'm guessing that it has reached a limit on GPU tasks in progress so it won't pull anything else down.
If you need more information, please let me know.
Edit:
I just changed my device profile to only pull CPU tasks and now it says that there are no CPU tasks available. Maybe it was out earlier and gave the misleading message. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Hi Matt,
rigs are ready to accept beta app and CPU tasks but nothing coming in yet.
Message:
10/12/2014 3:12:37 PM | GPUGRID | Sending scheduler request: To fetch work.
10/12/2014 3:12:37 PM | GPUGRID | Requesting new tasks for CPU
10/12/2014 3:12:39 PM | GPUGRID | Scheduler request completed: got 0 new tasks
10/12/2014 3:12:39 PM | GPUGRID | No tasks sent
10/12/2014 3:12:39 PM | GPUGRID | No tasks are available for ACEMD beta version
10/12/2014 3:12:39 PM | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
10/12/2014 3:12:39 PM | GPUGRID | No tasks are available for CPU only app
10/12/2014 3:12:39 PM | GPUGRID | No tasks are available for Test application for CPU MD
10/12/2014 3:12:57 PM | GPUGRID | work fetch suspended by user
However there are some tasks, according to server page. Could be a BOINC thing though.
____________
Greetings from TJ |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
exapower - is it using all CPU cores?
Matt |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Hi Matt,
Win7 x64, BOINC 7.2.42. I have got one and it runs on 5 CPU's. I have set in BOINC to use maximum 70% of CPU's (have 8 threads). So this is correct as two are working on GPUGRID LR's.
So it works! 1.5% done in 3.20 minutes on i7-4774 3.50GHz.
However, CPU usage according to Task Manager is 100%. mdrun.846.exe is using 60 or 61% CPU according to Task Manager.
If you need/want more information please let me know and I will provide it.
____________
Greetings from TJ |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
exapower - is it using all CPU cores?
Matt
Yes- Working flawlessly- CPUMD running at a consistent 92-95% in Task Manager (I have an audio program running along with a total of 17 background processes/19 Windows processes). With the program HWiNFO64, [2] physical [2] logical are always at 96-98% utilization. The program SIV64X reads BOINC usage for each core (thread), showing 98% for CPUMD.
4hours of processing- BOINC @ 56% of task progress. |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Observation about steps: This is from stderr file- Reading file topol.tpr, VERSION 4.6.1 (single precision)
Using 1 MPI thread
Using 4 OpenMP threads
starting mdrun 'Protein in water'
5000000 steps, 10000.0 ps.
And in progress file where input parameters are: nsteps= 5000000
Yet the progress file reads 740000 steps done so far, with BOINC task progress at 79.260%. Is the total number of steps 1 million instead of 5 million? |
|
|
|
Task: 1315-MJHARVEY_CPUDHFR-0-1-RND2531_0
This test application is announced with a runtime of around 2.5 hrs.
Now it has been running for over 5 hrs and the WU is 80% done.
http://www.gpugrid.net/result.php?resultid=13198539 |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
http://www.gpugrid.net/result.php?resultid=13196328
17hr runtime so far- BOINC @ 97% (it has been at this percentage for the last few hours without moving). Progress file shows steps currently @ 1.5 million. If the task is 5 million total steps, total runtime for one work unit will be ~56hr. Is there a way to increase the deadline? I have 8 tasks downloaded. Or should I just boot them when they are close to expiring?
First results are in - runtime 170-400Hr. ~20hr* each core. 8threads=160hr 16threads=320hr. Will CPUMD tasks be in the performance tab? There are certainly enough of them to compare. |
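The ~56hr figure above is a straight-line extrapolation from the step count. A minimal sketch of that arithmetic, assuming every step costs about the same wall time (the function name is made up for illustration):

```python
def extrapolate_total_hours(elapsed_hours, steps_done, total_steps):
    """Linear extrapolation: assumes each step takes the same wall time."""
    if steps_done <= 0:
        raise ValueError("steps_done must be positive")
    return elapsed_hours * total_steps / steps_done


# Numbers from the post: 17 h elapsed, 1.5 million of 5 million steps done.
total = extrapolate_total_hours(17.0, 1_500_000, 5_000_000)
print(f"estimated total runtime: {total:.1f} h")
```

This only uses the step counter from the progress file; it says nothing about how BOINC itself derives its percentage.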
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Hi,
That's a shame, those run times are rather longer than I'd anticipated. Probably have to dial down the length, though they are already close to the usable minimum.
Yes, they'll appear in the performance tab in due course.
Matt |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Long runtimes are not bothersome to me- BOINC is not reliable at estimating runtimes. If the step count is already at the bare minimum, is there a way to split tasks in half, so that different users each receive a half?
BTW- HFR is a great research choice- each chain has many Biological process- positive regulation of T cell mediated cytotoxicity-antigen processing and presentation of peptide antigen via MHC class I- regulation of defense response to virus by virus along with many others.
What is specifically rendered in work units? (how I miss the old file with atom/bonds description) Stderr has "starting mdrun 'Protein in water'"
A few websites to understand type of research being done here not all specific to this work unit.
http://www.ebi.ac.uk/pdbe-srv/view/entry/1a6z/summary
http://www.ebi.ac.uk/pdbe-srv/view/entry/1de4/summary
http://www.rcsb.org/pdb/gene/B2M
http://amigo.geneontology.org/amigo/term/GO:0019882 |
|
|
|
http://www.gpugrid.net/result.php?resultid=13196328
17hr runtime so far- BOINC @ 97% (it has been at this percentage for the last few hours without moving). Progress file shows steps currently @ 1.5 million. If the task is 5 million total steps, total runtime for one work unit will be ~56hr. Is there a way to increase the deadline? I have 8 tasks downloaded. Or should I just boot them when they are close to expiring?
First results are in - runtime 170-400Hr. ~20hr* each core. 8threads=160hr 16threads=320hr. Will CPUMD tasks be in the performance tab? There are certainly enough of them to compare.
I confirm this
http://www.gpugrid.net/results.php?userid=58967&offset=0&show_names=0&state=0&appid=27
after 14-17 Hours stops at 98% without moving on ...
aborted all Wus |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I fired up an extra rig yesterday evening, with no usable GPU's for here but an older i7.
What I now see on three rigs doing the CPU tasks (two use 5 threads, one 4 threads) is that the first 99% gets done in about 17 hours and the last 1% takes way more. On the older i7 it has now been running for 20 hours (0.300% to go) and the others have been running for 22 hours and are not yet finished. I will let them run of course, but wouldn't it be better for a GPU to handle this?
____________
Greetings from TJ |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
I fired up an extra rig yesterday evening, with no usable GPU's for here but an older i7.
What I now see on three rigs doing the CPU tasks (two use 5 threads, one 4 threads) is that the first 99% gets done in about 17 hours and the last 1% takes way more. On the older i7 it has now been running for 20 hours (0.300% to go) and the others have been running for 22 hours and are not yet finished. I will let them run of course, but wouldn't it be better for a GPU to handle this?
TJ- in the progress file- how many steps so far? I've been at 98% for last 7 hours running 4 threads with about 2.5million steps left out 5mil total. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I fired up an extra rig yesterday evening, with no usable GPU's for here but an older i7.
What I now see on three rigs doing the CPU tasks (two use 5 threads, one 4 threads) is that the first 99% gets done in about 17 hours and the last 1% takes way more. On the older i7 it has now been running for 20 hours (0.300% to go) and the others have been running for 22 hours and are not yet finished. I will let them run of course, but wouldn't it be better for a GPU to handle this?
TJ- in the progress file- how many steps so far? I've been at 98% for last 7 hours running 4 threads with about 2.5million steps left out 5mil total.
eXaPower, there is a lot of information in the progress file, but I don't see it. Or better I don't know where to look. Can you give me a hint?
____________
Greetings from TJ |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Scroll down to the very end of the file- this is where the most current step is. Every ten minutes a checkpoint is created.
Every thousand steps should show
Step Time Lambda
1680000 3360.00000 0.00000
Energies (kJ/mol)
Bond Angle Proper Dih. Improper Dih. LJ-14
1.98361e+003 5.27723e+003 6.64863e+003 3.42704e+002 2.23267e+003
Coulomb-14 LJ (SR) Coulomb (SR) Coul. recip. Potential
2.89505e+004 3.73032e+004 -3.83508e+005 3.47532e+003 -2.97294e+005
Kinetic En. Total Energy Conserved En. Temperature Pressure (bar)
6.02571e+004 -2.37037e+005 -2.60930e+005 2.99386e+002 -3.39559e+002 |
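For anyone who would rather not scroll, here is a rough sketch that pulls the most recent step out of the progress file text. It assumes the layout shown above (a "Step Time Lambda" header followed by a line of numbers); the function name and the sample string are illustrative only.

```python
def last_step(log_text):
    """Return (step, time_ps) from the last 'Step Time Lambda' block, or None."""
    lines = log_text.splitlines()
    result = None
    for i, line in enumerate(lines[:-1]):
        if line.split() == ["Step", "Time", "Lambda"]:
            fields = lines[i + 1].split()
            if len(fields) >= 2:
                try:
                    result = (int(fields[0]), float(fields[1]))
                except ValueError:
                    pass  # header not followed by numbers; ignore this block
    return result


sample = """\
   Step   Time   Lambda
   1679000   3358.00000   0.00000
   Energies (kJ/mol)
   Step   Time   Lambda
   1680000   3360.00000   0.00000
"""

step, time_ps = last_step(sample)
print(step, time_ps)
```

To use it on a real task you would read the progress file from the task's slot directory into a string first; the exact file location is not given in this thread.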
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Found it, I have some data
Step Time Lambda
4891000 9782.00000 0.00000
(99.895% done, 5CPU, 24:23:20h)
Step Time Lambda
3856000 7712.00000 0.00000
(99.887% done, 5CPU, 23:53:20h)
Step Time Lambda
2938000 5876.00000 0.00000
(97.075% done, 4CPU, 18:51:13h)
So will take more time to finish. At one rig I got 4 more, they will not meet the deadline of 17 October.
____________
Greetings from TJ |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Finally got the first one finished in 24h55m!
____________
Greetings from TJ |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
exapower,
Actually it's dihydrofolate reductase http://en.wikipedia.org/wiki/Dihydrofolate_reductase
Matt
|
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
TJ - odd that it ran with 5 threads -- do you have a CPU limit set ?
Matt |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
For the WUs going on tonight I've increased the expected compute cost by 10x, and the deadline to 7 days.
Matt |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Thanks for the correction. Gives me another enzyme to learn about. These CPUMD tasks on Ivy Bridge with 4 threads look to have a ~48hr runtime. My Westmere-generation Pentium's runtime is about ~90hr. (It's really a dual-core Xeon 30TDP L3403.) The reason why- a 4.8GT/s QPI internal link and an external DMI link. No Westmere Pentium has QPI- only DMI links. Intel has been rebadging Xeons for a while now.
I just received a new task with 10x compute cost-- 1084 hours estimated runtime! The old task estimated a 5hr runtime while taking ~48hr to finish. Going to be very interesting.
http://www.gpugrid.net/workunit.php?wuid=10165559
Edit*** Task http://www.gpugrid.net/workunit.php?wuid=10157944 is now running in high priority mode, kicking one of my two GPU tasks off to "waiting to run". All settings for BOINC have no limitations. Manually restarting the task and shutting off SLI has no effect. This just started to happen when the task went into high priority mode after I downloaded a new CPUMD task with high compute cost.
I aborted http://www.gpugrid.net/workunit.php?wuid=10165559 and GPU waiting to run task started back up. A new task http://www.gpugrid.net/workunit.php?wuid=10165641 downloaded and stopped GPU task again- boinc saying waiting to run state. Aborted task and second GPU starts task again.
What's changed from prior batch of CPUMD?
Having a new 10x compute cost task in cache suspends one of two GPU tasks running. If no 10x compute task are in cache- both GPU tasks run normal.
All this is happening while the older CPUMD task is running in high priority mode with two GPU SDOERR tasks and one 10x compute CPUMD task in cache. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
TJ - odd that it ran with 5 threads -- do you have a CPU limit set ?
Matt
Yes it is at 70 so that 2 are free to feed the GPU's and 1 for the system to stay responsive so I can use this site and do some other things.
I have another rig set to 100 so the next one should run at 100% when the current one has finished.
____________
Greetings from TJ |
|
|
Chilean Send message
Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0 Level
Scientific publications
|
Running one @ 8 threads (4 cores + HT).
EDIT: is there a way to let it use only 6 cores and free up two cores (so vLHC can run as well...)
____________
|
|
|
|
Hmm... I got a task, which will correctly run at 6 of my 8 CPUs on my rig (since I'm using "Use at most 75% CPUs" to presently accommodate 2 VM tasks outside of BOINC):
http://www.gpugrid.net/workunit.php?wuid=10166037
BUT...
It immediately started running in "High Priority" "Earliest Deadline First" mode. Could this mean that the 1-week-deadline is too short? The estimated runtime is 1130+ hours, which is, uhh, 6.7 weeks? Does that sound right?
This configuration must be incorrect. The deadline should never be less than the expected initial runtime, right? |
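The mismatch is easy to see with plain arithmetic. A small sketch, assuming nothing about BOINC's real deadline logic (which is more involved than a single comparison); the helper name is hypothetical:

```python
HOURS_PER_WEEK = 7 * 24


def deadline_check(estimated_hours, deadline_days):
    """Return (estimate in weeks, whether the estimate fits inside the deadline)."""
    estimated_weeks = estimated_hours / HOURS_PER_WEEK
    fits = estimated_hours <= deadline_days * 24
    return estimated_weeks, fits


# Numbers from the post: 1130 h estimated runtime, 7-day deadline.
weeks, fits = deadline_check(1130, 7)
print(f"{weeks:.1f} weeks, fits within deadline: {fits}")
```

With an estimate several times longer than the deadline, it is not surprising that the client treats the task as urgent from the first second.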
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Yeah, no idea what went wrong with the runtime estimates. The last set got an estimate of 5h, but when I increased the cost by 10x the estimate went up 200x.
Matt |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Task http://www.gpugrid.net/workunit.php?wuid=10157944 has been @ 99.978% complete with an estimated 3 minutes left for the last 12hr, with currently 2 million steps to go before it finishes. This task has been past 98% for about 25hr out of 32hr running. The newer CPUMD tasks sitting in cache kill one of the GPU tasks, keeping it in "waiting to run" mode. Once the 10x compute cost task is booted, all is normal again. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
Running one @ 8 threads (4 cores + HT).
EDIT: is there a way to let it use only 6 cores and free up two cores (so vLHC can run as well...)
Yes that can be achieved, but not with the current one running.
Eight cores, so 12.5 per core. You want to use 6 cores thus 12.5x6=75.
So you have to set in BOINC Manager, Tools, Computing preferences, and then at the bottom: On multiprocessor systems, use at most and there you set it to 75%.
This works only for new WUs that have not been started yet. Once your current WU has finished, the next one will only use 75%, thus 6 threads.
____________
Greetings from TJ |
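The percentage calculation above, written out as a sketch. The rounding-down in the second helper is an assumption; it is merely consistent with what is reported in this thread (70% of 8 threads gave 5, 75% gives 6). Both function names are made up.

```python
import math


def cpu_percent_for_threads(wanted_threads, logical_cpus):
    """Percentage to enter under 'use at most N% of the processors'."""
    if not 1 <= wanted_threads <= logical_cpus:
        raise ValueError("wanted_threads must be between 1 and logical_cpus")
    return 100.0 * wanted_threads / logical_cpus


def threads_for_percent(percent, logical_cpus):
    """Threads a given percentage works out to (assumes rounding down, minimum 1)."""
    return max(1, math.floor(logical_cpus * percent / 100.0))


print(cpu_percent_for_threads(6, 8))   # setting needed for 6 of 8 threads
print(threads_for_percent(70, 8))      # what a 70% setting gives on 8 threads
```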
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
I am thinking about hard-coding the CPU use to [number-of-CPU-cores] - [number-of-GPUs]. Opinions?
Matt |
|
|
|
Running one @ 8 threads (4 cores + HT).
EDIT: is there a way to let it use only 6 cores and free up two cores (so vLHC can run as well...)
Another way to accomplish this is to set up an app_config.xml file like the following:
<app_config>
  <app>
    <name>android</name>
    <max_concurrent>8</max_concurrent>
  </app>
  <app_version>
    <app_name>cpumd</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>6</avg_ncpus>
    <cmdline>--nthreads 6</cmdline>
  </app_version>
</app_config>
The "avg_ncpus" parameter sets the number of threads reserved in the BOINC scheduler and the "--nthreads" parameter sets the number of threads you want a task to use when the task runs.
The app_config.xml file goes into the /Boinc/projects/www.gpugrid.net folder.
And just as a reminder, if you are using Windows, use Notepad to edit the app_config.xml file. Do not use Word, as it puts in extra formatting characters that confuse the XML interpreter. If you are using Ubuntu, use Gedit to edit the app_config.xml file.
That will leave 2 threads free for BOINC to schedule other tasks.
Sometimes you can make this effective by opening BOINC Manager and clicking on "Advanced" then "Read config files". In the message log, you should get a message that says something like app_config.xml found for www.gpugrid.net. You might need to shut down BOINC and start it back up again to make the app_config.xml effective. Then the parameters will apply to any new work that is downloaded after the parameters are effective.
Hope that helps. |
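If you want to avoid hand-editing mistakes, the app_version part of the file above can be generated. This is only a sketch: it reuses the element names and the "--nthreads" option exactly as given in the post, leaves out the unrelated app block, and takes the plan class as a parameter because both "mt" and "mtsse2" show up in this thread. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(threads, app_name="cpumd", plan_class="mt"):
    """Return app_config.xml text with avg_ncpus and --nthreads kept equal."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return (
        "<app_config>\n"
        "  <app_version>\n"
        f"    <app_name>{app_name}</app_name>\n"
        f"    <plan_class>{plan_class}</plan_class>\n"
        f"    <avg_ncpus>{threads}</avg_ncpus>\n"
        f"    <cmdline>--nthreads {threads}</cmdline>\n"
        "  </app_version>\n"
        "</app_config>\n"
    )


config_text = build_app_config(6)
ET.fromstring(config_text)  # raises ParseError if the text is not well-formed XML
print(config_text)
```

Writing both values from the same variable is the point: the thread budget and the thread count cannot drift apart.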
|
|
|
MJH:
Yeah, no idea what went wrong with the runtime estimates. The last set got an estimate of 5h, but when I increased the cost by 10x the estimate went up 200x.
Did you increase the <rsc_flops_bound> value appropriately? I believe it's used for task size, and hence task runtime estimation.
I am thinking about hard-coding the CPU use to [number-of-CPU-cores] - [number-of-GPUs]. Opinions?
I personally don't care either way, but I absolutely care that you make absolutely sure that the number of cores used (via commandline) matches the number of cores budgeted (via ncpus), so BOINC doesn't overcommit or undercommit the CPU. I hope that makes sense. |
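On the user side, the same consistency can be checked in an existing app_config.xml. A rough sketch, assuming the file uses the "avg_ncpus" element and a "--nthreads N" command line as in the example posted earlier in this thread; the function name is made up.

```python
import re
import xml.etree.ElementTree as ET


def mismatched_versions(xml_text):
    """List (app_name, budgeted_cpus, threads_used) where the two values differ."""
    root = ET.fromstring(xml_text)
    problems = []
    for version in root.findall("app_version"):
        budget = float(version.findtext("avg_ncpus", default="nan"))
        match = re.search(r"--nthreads\s+(\d+)", version.findtext("cmdline", default=""))
        used = int(match.group(1)) if match else None
        # NaN never compares equal, so a missing avg_ncpus is reported too.
        if used is None or budget != used:
            problems.append((version.findtext("app_name"), budget, used))
    return problems


good = ("<app_config><app_version><app_name>cpumd</app_name>"
        "<avg_ncpus>6</avg_ncpus><cmdline>--nthreads 6</cmdline>"
        "</app_version></app_config>")
bad = good.replace("--nthreads 6", "--nthreads 8")

print(mismatched_versions(good))
print(mismatched_versions(bad))
```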
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
I am thinking about hard-coding the CPU use to [number-of-CPU-cores] - [number-of-GPUs]. Opinions?
Matt
Doesn't hurt to try. Only Nvidia GPU? How much would a GPU help with task time? If runtime is lowered by half or more then I'd say this is workable. Would be an option to run a task with half of GPU cores so maybe two of MD task can run at a time? Or is whole GPU required?
|
|
|
|
Side note:
Is it possible that progress % is not hooked up correctly?
For instance, progress.log says:
nsteps = 5000000
and if I scroll to the bottom:
Step Time Lambda
1254000 2508.00000 0.00000
... yet, the BOINC UI only says "1.777% done"
Shouldn't it say 25% done? |
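The 25% figure is simply the current step divided by nsteps. As a sketch (hypothetical helper name, numbers from the log excerpt above):

```python
def percent_done(current_step, nsteps):
    """Progress based purely on the step counter in the log."""
    if nsteps <= 0:
        raise ValueError("nsteps must be positive")
    return 100.0 * current_step / nsteps


print(f"{percent_done(1_254_000, 5_000_000):.2f}% done")
```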
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
The app isn't reporting its progress to the client yet. It's just being estimated from the flops.
Matt |
|
|
|
Is it possible to change it so that the app can report a better progress %? |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
It's on the TODO list, yes. It'll be appearing in the Linux version first, as that's the easier to develop.
Matt |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I am thinking about hard-coding the CPU use to [number-of-CPU-cores] - [number-of-GPUs]. Opinions?
Matt
For GPUGRID only crunchers this is great, but I think that those who do other project on the CPU are not that happy.
For me you can do it.
Edit: there are also projects that use the i-GPU, that is also a thread on the CPU.
____________
Greetings from TJ |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I have set BOINC to use 100% of the CPUs and only the CPU app for GPUGRID is in the queue, but strangely only 5 CPUs are used. Should be 8.
I noticed this in the progress file:
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
Family: 6 Model: 26 Stepping: 5
Features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
Acceleration most likely to fit this hardware: SSE4.1
Acceleration selected at GROMACS compile time: SSE2
Binary not matching hardware - you might be losing performance.
Acceleration most likely to fit this hardware: SSE4.1
Acceleration selected at GROMACS compile time: SSE2
Also I think as the estimation of run time is not correct yet the first 99% goes rather quick and then the last 1% take between 20-28 hours to finish. But Matt knows this already.
____________
Greetings from TJ |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
TJ- from a post earlier.
Only SSE2 at the moment. Will probably make builds with higher levels of optimisation later but for now I'm concerned about correctness rather than performance.
Matt
TJ or Jacob- for your SLI system- have you noticed any task being kicked out while running an older CPUMD task with a new x10 compute cost task in cache? For me- with an old CPUMD running in high priority mode and a new CPUMD in cache- one of the two running GPU tasks goes into "waiting to run" mode. |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
TJ- from a post earlier.
Only SSE2 at the moment. Will probably make builds with higher levels of optimisation later but for now I'm concerned about correctness rather than performance.
Matt
Aha, thanks now I understand.
TJ or Jacob- for your SLI system- have you noticed any task being kicked out while running an older CPUMD task with a new x10 compute cost task in cache? For me- with an old CPUMD running in high priority mode and a new CPUMD in cache- one of the two running GPU tasks goes into "waiting to run" mode.
I have only "old" ones running and in queue, will let them finish first. However none is yet running at high priority.
____________
Greetings from TJ |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
TJ- from a post earlier.
Only SSE2 at the moment. Will probably make builds with higher levels of optimisation later but for now I'm concerned about correctness rather than performance.
Matt
Aha, thanks now I understand.
TJ or Jacob- for your SLI system- have you noticed any task being kicked out while running an older CPUMD task with a new x10 compute cost task in cache? For me- with an old CPUMD running in high priority mode and a new CPUMD in cache- one of the two running GPU tasks goes into "waiting to run" mode.
I have only "old" ones running and in queue, will let them finish first. However none is yet running at high priority.
Currently- the CPU task from the first batch is NOT running in high priority- but when I download a "new" CPUMD 10x compute task- the "old" task goes into high priority and kicks out one of the two GPU tasks computing. |
|
|
|
Ok, let's get a thing straight here. Client scheduling.
The order, I believe, goes something like this:
1) "High Priority" coprocessor (GPU/ASIC) tasks
2) "High Priority" CPU tasks (up to ncpus + 1) (MT tasks allowed to overcommit)
3) "Regular" coprocessor (GPU/ASIC) tasks (up to ncpus + 1)
4) "Regular" CPU tasks (up to ncpus + 1) (MT tasks allowed to overcommit)
So...
When one of the new GPUGrid MT CPU tasks comes in, if it is set to use all of the CPUs, and it runs high priority... It gets scheduled in "order 2", which is above the GPU tasks which come in at "order 3".
And then, it will additionally schedule as many "order 3" GPU tasks as it can, but only up to the point that it budgets 1 additional CPU. (So, if your GPU tasks are set to use 0.667 CPUs like I have scheduled mine via app_config, then it will run 1 GPU task, but not 2).
This is NOT a problem of "oh wow, GPUGrid MT tasks are scheduling too many CPUs."
This IS a problem of "oh wow, GPUGrid MT tasks go high-priority immediately. That throws off all of the scheduling on the client."
Hopefully that helps clarify.
PS: Here is some dated info that is a useful read:
http://boinc.berkeley.edu/trac/wiki/ClientSched
http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen |
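To make the ordering above concrete, here is a toy model in Python. It is not BOINC's scheduler: it only encodes the four classes and the "ncpus + 1" budget as described in this post, ignores how many GPUs exist, and all names are invented for illustration.

```python
def class_order(kind, high_priority):
    """Scheduling class from the list above (1 is considered first)."""
    if high_priority:
        return 1 if kind == "gpu" else 2
    return 3 if kind == "gpu" else 4


def schedule(tasks, ncpus):
    """Toy model. tasks: list of (name, kind, high_priority, cpus_budgeted).

    kind is "gpu", "cpu" or "cpu_mt". Returns the names of the tasks that start.
    """
    limit = ncpus + 1
    used = 0.0
    started = []
    for name, kind, high_priority, cpus in sorted(
        tasks, key=lambda t: class_order(t[1], t[2])
    ):
        if kind == "cpu_mt":
            fits = used < limit          # MT tasks are allowed to overcommit
        else:
            fits = used + cpus <= limit  # everything else stays within ncpus + 1
        if fits:
            used += cpus
            started.append(name)
    return started


# 8 logical CPUs, one high-priority MT task using all 8, two GPU tasks at 0.667 CPU each.
tasks = [
    ("gpu_task_1", "gpu", False, 0.667),
    ("gpu_task_2", "gpu", False, 0.667),
    ("cpumd_mt", "cpu_mt", True, 8),
]
print(schedule(tasks, ncpus=8))
```

In this model the MT task claims 8 CPUs first, one GPU task still fits under the budget of 9, and the second does not, which matches the "waiting to run" behaviour reported in this thread.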
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Jacob- thank you for the information about client scheduling.
Matt- I see you released a CPUMD app for Linux with support for SSE4/AVX. Will Windows also see an upgrade? Do you have an idea what the speed-up with the SSE4/AVX app will be compared to the standard SSE2 app? |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Will windows also see an upgrade?
Probably within the week.
Do you have an idea what the speed-up with the SSE4/AVX app will be compared to the standard SSE2 app?
10-30% for AVX on Intel, I think.
|
|
|
=Lupus=Send message
Joined: 10 Nov 07 Posts: 10 Credit: 12,777,491 RAC: 0 Level
Scientific publications
|
ohmyohmy...
http://www.gpugrid.net/result.php?resultid=13195959
running on 3 out of 4 cpu cores, nsteps=5000000
at 57 hours:
Writing checkpoint, step 3770270 at Tue Oct 14 23:57:52 2014
seems it will finish... in 24 more hours.
seems something went rly weird with est runtime. question: should I abort the 6 other workunits? |
|
|
Chilean Send message
Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0 Level
Scientific publications
|
Step = 1 744 000
After 9 hrs 20 min running on full 8 threads. This might be the most expensive (computing wise) WU I've run ever since I started DC'ing.
____________
|
|
|
|
I noticed this bit of info off of the task I ran, using 3 out of the 4 available cores on my computer:
http://www.gpugrid.net/result.php?resultid=13201370
Using 1 MPI thread
Using 3 OpenMP threads
NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin thread to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).
Can this become an issue for computers that aren't running the task with all cores? |
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
This is the last bit of the stderr file:
starting mdrun 'Protein in water'
5000000 steps, 10000.0 ps (continuing from step 3283250, 6566.5 ps).
Writing final coordinates.
Core t (s) Wall t (s) (%)
Time: 32457.503 32458.000 100.0
9h00:58
(ns/day) (hour/ns)
Performance: 9.140 2.626
gcq#0: Thanx for Using GROMACS - Have a Nice Day
16:39:45 (4332): called boinc_finish(0)
It ran on 5 CPUs (8 were allowed). Am I right in seeing that it took 9 hours to finish?
It took a bit more, see this:
1345-MJHARVEY_CPUDHFR-0-1-RND9787_0 10159887 153309 12 Oct 2014 | 17:58:49 UTC 15 Oct 2014 | 14:38:34 UTC Completed and validated 94,864.92 567,905.50 2,773.48 Test application for CPU MD v8.46 (mt)
____________
Greetings from TJ |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
It ran on 5 CPUs (8 were allowed). Am I right in seeing that it took 9 hours to finish?
No - it took just over a day. The performance was ~9ns/day, the sim was 10ns in length.
Matt |
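The numbers in TJ's log can be cross-checked with a few lines of arithmetic. This sketch assumes that the 9-hour timer covers only the segment since the last restart ("continuing from step 3283250"), which is an interpretation of the log, not something stated by the app; the function name is made up.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(steps, timestep_ps, wall_seconds):
    """Simulated nanoseconds per day of wall time."""
    simulated_ns = steps * timestep_ps / 1000.0
    return simulated_ns / (wall_seconds / SECONDS_PER_DAY)


timestep_ps = 10_000.0 / 5_000_000       # from "5000000 steps, 10000.0 ps"
segment_steps = 5_000_000 - 3_283_250    # steps run since the last restart
rate = ns_per_day(segment_steps, timestep_ps, 32_458.0)
full_run_hours = 10.0 / rate * 24        # whole 10 ns simulation at that rate

print(f"{rate:.3f} ns/day, about {full_run_hours:.1f} h for the full 10 ns")
```

Under that assumption the rate comes out at roughly the 9.14 ns/day printed in the log, and the full-run estimate lands close to the 94,864.92 s run time BOINC reported for the task.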
|
|
Chilean Send message
Joined: 8 Oct 12 Posts: 98 Credit: 385,652,461 RAC: 0 Level
Scientific publications
|
Had 3 errors on one of my PCs:
http://www.gpugrid.net/results.php?hostid=185425
All errored out with:
"Program projects/www.gpugrid.net/mdrun.846, VERSION 4.6.3
Source code file: ..\..\..\gromacs-4.6.3\src\gmxlib\checkpoint.c, line: 1562
File input/output error:
Cannot read/write checkpoint; corrupt file, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors"
This computer has no problems running other projects... including vLHC@Home, Rosetta, etc.
____________
|
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
A few observations about CPUMD tasks- a dual core CPU will need 4 days/~96hr to complete a task- a dual core with HT [4 threads] requires 2 days/~48hr- a quad core with no HT [4 threads] takes ~16-36hr- a quad core with HT [8 threads] completes tasks in ~24hr- a 6 core [12 threads] finishes a task in ~8-16hr- while a 16 thread CPU manages CPUMD tasks in under ~12hr. Some CPUs finish faster from being overclocked and having 1833MHz or higher RAM clocks. Disk usage is low for CPUMD- note that when running GPU tasks disk usage can be higher for certain tasks. (unfold_Noelia)
CPU temps are low with the SSE2 app- when the AVX CPUMD app is released temps will be higher. For people who are running an Intel AVX CPU- there is a possible 10-30% speed-up coming when the AVX app is released.
Some info-- http://en.wikipedia.org/wiki/Dihydrofolate_reductase |
|
|
|
I completed my first CPU task on my main rig on Windows 10 Technical Preview x64.
http://www.gpugrid.net/result.php?resultid=13206356
Observations:
- It used app: Test application for CPU MD v8.46 (mtsse2)
- It had horrible estimates, along with an inability to report progress correctly, and had a 1-week deadline, and so it ran as high-priority the entire time, interfering with the BOINC client scheduling of my GPUGrid GPU tasks. I will not be running this type of task again unless the estimation is fixed.
- It did not report progress correctly.
- It ran using 6 (of my 8) logical CPUs, as I had BOINC set to use 75% CPUs, since I am running 2 RNA World VM tasks outside of BOINC
- It took 162,768.17s (45.2 hours) of wall time
- It consumed 721,583.90s (200.4 hours) of CPU time
- It did checkpoint every so often, which I was happy to see. It appeared to resume from checkpoints just fine.
- It completed successfully, with the output text below
- It validated successfully, and granted credit.
- It seems weird that the time values in the output do not match either the wall time or CPU time values that BOINC reported. Bug?
Core t (s) Wall t (s) (%)
Time: 18736.176 18736.000 100.0
5h12:16
(ns/day) (hour/ns)
Performance: 5.491 4.371
Let us know when the estimation and progress problems are fixed, and then maybe I'll run another one for you!
Thanks,
Jacob |
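The timing figures in the report above can be cross-checked with a little arithmetic. This is only a sketch using the numbers quoted in that post; the variable names are made up for illustration. It shows how wall time, CPU time and the GROMACS performance line relate to each other, and that the GROMACS log's 18736 s is the "5h12:16" printed under it. That figure being far shorter than BOINC's wall time would be consistent with the log covering only part of the run, though the thread does not confirm this.

```python
# Figures copied from the task report above.
wall_s = 162_768.17      # BOINC wall (elapsed) time, seconds
cpu_s = 721_583.90       # BOINC CPU time summed over all threads, seconds
threads = 6              # logical CPUs BOINC gave the task

wall_h = wall_s / 3600
cpu_h = cpu_s / 3600
avg_busy_threads = cpu_s / wall_s          # average number of threads kept busy
utilisation = avg_busy_threads / threads   # fraction of the 6 threads in use

# GROMACS reports ns/day and hour/ns; one is 24 divided by the other.
ns_per_day = 5.491
hours_per_ns = 24 / ns_per_day

# The GROMACS log's own wall time, 18736 s, formatted as h:mm:ss.
log_wall_s = 18_736
h, rem = divmod(log_wall_s, 3600)
m, s = divmod(rem, 60)
log_wall_hms = f"{h}h{m:02d}:{s:02d}"

print(f"wall {wall_h:.1f} h, CPU {cpu_h:.1f} h")
print(f"average busy threads {avg_busy_threads:.2f} ({utilisation:.0%} of {threads})")
print(f"{hours_per_ns:.3f} hour/ns, log wall time {log_wall_hms}")
```

The ratio of CPU time to wall time gives a rough idea of how many of the six allotted threads were actually kept busy.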
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Deleted post |
|
|
|
eXaPower: Your questions about Windows 10 should have been a PM. I'll send you a PM response. |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority; up until now all ran with no issues. Just now, seemingly at random, BOINC decided to stop one of the GPU tasks, sending it to "waiting to run" mode. If I suspend the CPUMD task, both GPU tasks run. Allowing the CPUMD task to run shuts out a GPU task. |
|
|
|
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority; up until now all ran with no issues. Just now, seemingly at random, BOINC decided to stop one of the GPU tasks, sending it to "waiting to run" mode. If I suspend the CPUMD task, both GPU tasks run. Allowing the CPUMD task to run shuts out a GPU task.
Read here: http://www.gpugrid.net/forum_thread.php?id=3898&nowrap=true#38505
It's not random.
When your GPU tasks switched out of "high priority" (deadline panic) mode, they also became lower on the food chain of client task scheduling. Instead of order 1 (where they were scheduled before the MT task) they became order 3 (scheduled after the MT task). And then, since the scheduler will only schedule up to ncpus+1, that is why only 1 GPU task is presently scheduled, instead of both (assuming each of your GPU tasks is budgeted to use 0.5 or more CPU also).
Not random at all. Working as designed, correctly...
... given the circumstances of the GPUGrid MT task estimates being completely broken. |
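The "ncpus+1" rule described in the post above can be illustrated with a toy model. This is a simplified sketch, not BOINC's actual scheduler code: the `Task` class, the `pick_runnable` function, the 4-CPU host and the 0.9-CPU budget per GPU task are all assumptions chosen for illustration. It walks the tasks in scheduling order and leaves waiting any task that would push the committed CPU count past ncpus + 1.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: float  # CPU budget of the task (assumed values below)


def pick_runnable(tasks_in_order, ncpus):
    """Walk tasks in scheduling order; skip any that would push CPU use past ncpus + 1."""
    limit = ncpus + 1
    used = 0.0
    running, waiting = [], []
    for task in tasks_in_order:
        if used + task.cpus <= limit:
            used += task.cpus
            running.append(task.name)
        else:
            waiting.append(task.name)
    return running, waiting


# Order after the GPU tasks dropped out of high priority: the MT task comes first.
order = [Task("CPUMD (MT)", 4.0), Task("GPU task A", 0.9), Task("GPU task B", 0.9)]
running, waiting = pick_runnable(order, ncpus=4)
print("running:", running)
print("waiting:", waiting)
```

With the MT task scheduled first, 4.0 + 0.9 CPUs fit under the limit of 5, but a second GPU task would bring the total to 5.8, so it is left "waiting to run". The real client applies further rules that this sketch ignores.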
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority; up until now all ran with no issues. Just now, seemingly at random, BOINC decided to stop one of the GPU tasks, sending it to "waiting to run" mode. If I suspend the CPUMD task, both GPU tasks run. Allowing the CPUMD task to run shuts out a GPU task.
Read here: http://www.gpugrid.net/forum_thread.php?id=3898&nowrap=true#38505
It's not random.
When your GPU tasks switched out of "high priority" (deadline panic) mode, they also became lower on the food chain of client task scheduling. Instead of order 1 (where they were scheduled before the MT task) they became order 3 (scheduled after the MT task). And then, since the scheduler will only schedule up to ncpus+1, that is why only 1 GPU task is presently scheduled, instead of both (assuming each of your GPU tasks is budgeted to use 0.5 or more CPU also).
Not random at all. Working as designed, correctly...
... given the circumstances of the GPUGrid MT task estimates being completely broken.
Jacob, one GPU task has been running for 37 hr straight in high priority mode, another GPU task for 22 hr straight in high priority, and one CPUMD task for 24 straight hours in high priority mode. During this time I haven't added any tasks to the cache. If all three tasks were already running in high priority mode (order 1 or 3; is there a way to find out which?), why did BOINC kick one out after all this time? Since the very beginning these three tasks have been in high priority, and I haven't changed any BOINC scheduler settings or the allowed CPU usage. I had a similar issue when a CPUMD task was in the cache, so I've stopped allowing any tasks to sit in the cache, keeping only tasks that can compute on the available GPU/CPU.
If I suspend the CPUMD task, both GPU tasks run, one in high priority and the other not. If I suspend the CPUMD task, the GPU task that is in high priority changes to non-high priority. When the CPUMD task is running alongside one GPU task and the task that is waiting to run is suspended, the running GPU task leaves high priority mode. |
|
|
|
"High priority mode" for a task means that "Presently, if tasks were scheduled in a FIFO order in the round-robin scheduler, the given task will not make deadline. We need to prioritize it to be run NOW." The UI shows whether a task is in "High Priority" mode on the Tasks tab, in the Status column.
A task can move out of "High priority mode" when the round-robin simulation indicates that it WOULD make deadline. When tasks are suspended/resumed/downloaded, when progress percentages get updated, when running estimates get adjusted (as tasks progress), when the computer's on_frac and active_frac and gpu_active_frac values change ... the client re-evaluates all tasks to determine which ones need to be "High priority" or not.
Did you read the information in the links that were in my post? They're useful. After reading that information, do you still think the client scheduler is somehow broken?
Also, you can turn on some cc_config flags to see extra output in Event Log... specifically, you could investigate rr_simulation, rrsim_detail, cpu_sched, cpu_sched_debug, or coproc_debug. I won't be able to explain the output, but you could probably infer the meaning of some of it. |
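For anyone who wants to try the flags named above, a minimal cc_config.xml might look like the fragment below. This is a sketch that assumes the standard BOINC client configuration layout, with the flags placed inside `<log_flags>`; the file goes in the BOINC data directory, and the client picks it up after a restart or after being told to re-read its config files.

```xml
<cc_config>
  <log_flags>
    <rr_simulation>1</rr_simulation>
    <rrsim_detail>1</rrsim_detail>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
    <coproc_debug>1</coproc_debug>
  </log_flags>
</cc_config>
```

These flags make the Event Log quite verbose, so it is probably best to set them back to 0 once the scheduling question has been answered.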
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Some information from the cc_config flags. BOINC thinks I'm going to miss the deadline for the CPUMD task (1138 hr remaining estimate):
14/10/16 13:34:52 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
BOINC says the CPUMD task is 20% complete after 24 hr; the progress file is at 3.5 million steps.
BOINC will run the unfold Noelia task (97% complete, 18 hr estimated remaining) in high priority while the CPUMD task is running:
14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
while booting the task BOINC thinks will miss a deadline, the 63% complete SDOERR task (174 hr remaining estimate):
14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0
Here are some newer task states that have changed:
14/10/16 13:43:13 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 1 next 1 task state 0
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 13:56:05 | GPUGRID | [rr_sim] 24011.34: unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (721404.58G/30.04G)
14/10/16 14:00:07 | GPUGRID | [rr_sim] 4404370.74: 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 finishes (4.00 CPU) (54297244.54G/12.33G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] 658381.65: I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] I1R119-SDOERR_BARNA5-38-100-RND1580_0 misses deadline by 348785.46
14/10/16 13:58:05 | GPUGRID | [cpu_sched_debug] skipping GPU job I1R119-SDOERR_BARNA5-38-100-RND1580_0; CPU committed
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
Now all three tasks are running with new task states after being rescheduled (I downloaded a new long task):
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1 |
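The [rr_sim] lines above are easier to read once the seconds are converted to days. The sketch below parses the three "finishes" lines as posted (timestamps dropped). The reading of the last parenthesis as remaining work divided by processing rate is an assumption based on how closely that quotient matches the finish time, not something the thread states.

```python
import re

# The three rr_sim "finishes" lines from the log above, without timestamps.
log = """\
[rr_sim] 24011.34: unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (721404.58G/30.04G)
[rr_sim] 4404370.74: 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 finishes (4.00 CPU) (54297244.54G/12.33G)
[rr_sim] 658381.65: I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
"""

pattern = re.compile(r"\[rr_sim\] ([\d.]+): (\S+) finishes .*\(([\d.]+)G/([\d.]+)G\)")

SECONDS_PER_DAY = 86_400
results = {}
for line in log.splitlines():
    m = pattern.search(line)
    if m:
        finish_s = float(m[1])
        name = m[2]
        work_over_rate_s = float(m[3]) / float(m[4])  # assumed: remaining work / rate
        results[name] = (finish_s, work_over_rate_s)
        print(f"{name}: finishes in {finish_s / SECONDS_PER_DAY:.2f} days; "
              f"work/rate = {work_over_rate_s / SECONDS_PER_DAY:.2f} days")

# "misses deadline by 348785.46" for the SDOERR task, expressed in days.
miss_days = 348_785.46 / SECONDS_PER_DAY
print(f"SDOERR task misses its deadline by {miss_days:.1f} days")
```

Seen this way, the simulation expects the CPUMD task to need roughly 51 days, which is the broken estimate discussed earlier in the thread.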
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
CPUMD tasks completed past the deadline: credit is awarded.
http://www.gpugrid.net/workunit.php?wuid=10159833
http://www.gpugrid.net/workunit.php?wuid=10158842 |
|
|
|
I have a problem with the Test application for CPU MD work units. This is obviously a test setup, according to both application name and this discussion thread, and the work units are being pushed to my machines even though my profile is set to not receive WUs from test applications.
I'm happy to do GPU computing for you guys, but I'm not willing to let you take over complete machines for days. Please control your app to respect the "Run test applications?" setting in our profiles.
Thank you,
David |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Hm, sorry about that. Should only be going to machines opted in to test WUs.
I should point out the app is close to production - the main remaining problem with it is the ridiculous runtime estimates the client is inexplicably generating.
Matt |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Are the working SSE2 CPUMD tasks on vacation? Were the returned results incomplete or invalid? 10000 tasks disappeared.
From the look of the BOINC stats and GPUGRID graphs, a decent number of new users' CPU-only machines were added, with credit awarded. |
|
|
sis651Send message
Joined: 25 Nov 13 Posts: 66 Credit: 193,925,538 RAC: 0 Level
Scientific publications
|
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units, and I crunched non-AVX/SSE2 units before.
My problem is that when I paused the units and restarted BOINC, none of the CPU work units resumed crunching from their last progress. They started crunching from the beginning. In an area with short but frequent blackouts, it's not possible to run these CPU units. |
|
|
|
I believe the project admins dumped the AVX mt program because of some flaws in it. When I ran the AVX program I also noticed the program never checkpointed.
from MJH on another post:
The buggy Windows AVX app is gone now. Please abort any instances of it still running. It's replaced with the working SSE2 app.
http://www.gpugrid.net/forum_thread.php?id=3812&nowrap=true#38680
For now at least, there are no other CPU beta workunits to test. I guess the project admins will revise and replace the workunits when they are ready and able to. |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units, and I crunched non-AVX/SSE2 units before.
Make sure that the application executable that you are running has "sse2" in its name, not "avx". Manually delete the old AVX app binary from the project directory if necessary.
MJH |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
Received 5 abandoned 9.03 "AVX" tasks. All are computing with SSE2 even with the AVX app binary in the directory. Checkpoints are working. BOINC client progress reporting is still off (at 70% with 3.7 million steps left to compute). The progress file is reporting the steps computed properly. |
|
|
|
Hola, Amigos en Barcelona!
No CPU tasks received: are there any available?
Thanks!
John |
|
|
AstiesanSend message
Joined: 8 Jun 10 Posts: 3 Credit: 937,688,770 RAC: 3,195,274 Level
Scientific publications
|
mdrun-463-901-sse-32 occasionally causes a soft system freeze when going from the active state into the sleeping state, i.e. screensaver off to on.
By soft system freeze, I mean that the start bar/menu (I do use start8, but it's confirmed to occur without this active as well), all parts of it are locked. Windows-R can bring up the Run menu, and I can use cmd and taskkill mdrun and the start menu itself will return to normalcy, however the bar will continue to be unresponsive. Killing explorer.exe to reset the start bar will result in a hard freeze requiring reboot. During the soft freeze, alt-tab and other windows will be VERY slow to respond until mdrun is killed, afterwards all other windows work fine, but the start bar is unusable and will force a reboot of the system.
There is nothing in the error logs.
Any assistance or ideas in resolving this would be appreciated.
My system:
Windows 8.1 64-bit
i7 4790K @ stock
ASRock Z97-Extreme4
EVGA GTX 970 SC ACX @ stock
2x8GB HyperX Fury DDR3-1866 @ stock |
|
|
tomba Send message
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0 Level
Scientific publications
|
I gave four cores of my AMD FX-8350 to the app. I've done four WUs, which all completed in a remarkably consistent time of just over 16 hours, with a-bit-mean 920 credits each.
I just checked the server status:
...and was a little surprised to see my 16 hours well under the minimum run time of 19.16 hours.
|
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
A current CPUMD task is 2.5 million steps, not 5 million as in prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps. |
|
|
tomba Send message
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0 Level
Scientific publications
|
A current CPUMD task is 2.5 million steps, not 5 million as in prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.
I did complete this 5M-step WU on 24 October and got 3342 credits... |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
Yes, the credit allocation is wrong - need to work out how to fix that.
Matt |
|
|
tomba Send message
Joined: 21 Feb 09 Posts: 497 Credit: 700,690,702 RAC: 0 Level
Scientific publications
|
Yes, the credit allocation is wrong - need to work out how to fix that.
Matt
A fixed 2.5M per completion would be a nice 'n' easy solution ;)
|
|
|
|
I have completed 2 of the new (I think?) tasks, of application type "Test application for CPU MD v9.01 (mtsse2)", on my host (id: 153764), running 8 logical CPUs (4 cores hyperthreaded).
When I first got the tasks, I think the estimated run time was something like 4.5 hours. But then, after it completed the first task (which took way longer - it took 15.75 hours of run time), it realized it was wrong, and adjusted the estimated run times for the other tasks to be ~16 hours.
For each of the 2 completed tasks:
- Task size: 2.5 million steps
- Run Time: ~16.4 hours
- CPU Time: ~104 hours (My CPUs were slightly overcommitted by my own doing)
- Credit granted: ~3700
I will continue to occasionally run these, to help you test, especially when new versions come out.
Regards,
Jacob |
|
|
TrotadorSend message
Joined: 25 Mar 12 Posts: 103 Credit: 13,393,727,393 RAC: 70,894,421 Level
Scientific publications
|
I'm crunching some of these units on my dual-processor 32- and 48-thread machines. They are Sandy Bridge (32 threads) and Ivy Bridge (48 threads) Xeon-based machines.
On the 32-thread machine it has been quite straightforward: it finished the first unit in 3h5m executing CPU MD v9.02 (mtavx), with the CPU kicking in at turbo speed (3.3 GHz). No other BOINC project was running.
On the 48-thread machine it has been a little bit funnier :). The first units all crashed right at the beginning. Reading the stderr output, I learnt that the GROMACS application cannot work well with more than 32 threads, but it will try anyway, so launching with 46 threads available (the other two reserved for two GPUGRID GPU units) ended in error (11 units in a row).
So, while I investigated how to set up an app_config.xml file for MT units, I reduced the percentage of available processors until it reached 32 threads and started another MT unit, which this time executed properly and finished in a little less than 3h.
Then I copied the app_config.xml file into the GPUGRID folder, enabled 46 threads again and crossed my fingers. It worked fine: 1 MT task using 32 threads, with the rest of the threads executing Rosetta units, plus 2 GPUGRID GPU tasks. This time it needed about 3h10m, which I think is because of the overall load on the machine.
I'll execute some more units and report any further findings worth noting.
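The post above does not show the app_config.xml that was used, so the fragment below is only a guess at what such a file could look like for capping an MT task at 32 threads. The element names are standard BOINC app_config elements, but the app name `cpumd` and the `--nthreads` command-line option are assumptions that would need to be checked against the project's actual application before use.

```xml
<app_config>
  <app_version>
    <app_name>cpumd</app_name>          <!-- assumed short name of the CPU MD app -->
    <plan_class>mtavx</plan_class>      <!-- plan class as shown in the app version name -->
    <avg_ncpus>32</avg_ncpus>           <!-- CPUs BOINC budgets for each task -->
    <cmdline>--nthreads 32</cmdline>    <!-- assumed thread-count option passed to the app -->
  </app_version>
</app_config>
```

The file would sit in the GPUGRID project folder, as described in the post, and leaves the remaining threads free for other projects.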
|
|
|
TJSend message
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level
Scientific publications
|
I powered on my old workstation with two Xeons and two slow GTX 660s.
I have allowed BOINC to use all 8 cores, requested new work for GPUGRID and got 2 GPU WUs (SR) and one CPU WU. This CPU WU says it runs on 4 cores, but in Task Manager it actually used 92%. I don't mind, as I allowed it to use all cores, but I would have expected it to use 6 cores, as there are 6 cores free: two for the GPU WUs, so 8-2=6.
Am I thinking wrong here?
____________
Greetings from TJ |
|
|
|
Matt,
Scheduling suggestion: One of my PCs has 16 threads. I just ran it dry so I could install Windows updates uninterrupted. When I started back up, the first project I allowed to download new CPU tasks was GPUGRID. It downloaded 16 of the multi-thread tasks. Each task is scheduled to take 70 hours. Since the tasks run one at a time, it will take a while to work through 16 tasks at 70 hours per task.
My suggestion is that the number of tasks downloaded at one time be 2 or 3. The number of tasks downloaded should not equal the number of threads on the machine. I think the BOINC default is to initially download the same number of CPU tasks as there are threads on the machine. That may need to be changed for multi-thread tasks.
Thanks for all the effort you put in.
captainjack |
|
|
|
Matt,
Scheduling suggestion: One of my PCs has 16 threads. I just ran it dry so I could install Windows updates uninterrupted. When I started back up, the first project I allowed to download new CPU tasks was GPUGRID. It downloaded 16 of the multi-thread tasks. Each task is scheduled to take 70 hours. Since the tasks run one at a time, it will take a while to work through 16 tasks at 70 hours per task.
My suggestion is that the number of tasks downloaded at one time be 2 or 3. The number of tasks downloaded should not equal the number of threads on the machine. I think the BOINC default is to initially download the same number of CPU tasks as there are threads on the machine. That may need to be changed for multi-thread tasks.
Thanks for all the effort you put in.
captainjack
I am actually very familiar with BOINC Work Fetch.
Essentially, what it does is: You have 2 preferences, the "Min Buffer" and the "Additional Buffer".
- When BOINC doesn't have enough work to keep all devices busy for "Min Buffer", or has an idle device presently, it will ask projects for work.
- When it asks, it asks for "Enough work to fill the idle devices, plus enough work to saturate the devices for [Min Buffer + Additional Buffer] time.", properly taking into account that some tasks are MT and some aren't. It correctly asks for that amount, because it minimizes the RPC web calls to the projects.
When BOINC contacted GPUGrid, it likely worked correctly, to satisfy your cache settings. If you think otherwise, then turn on <work_fetch_debug>, abort all of the unstarted tasks, and then let work fetch run, then copy the Event Log data to show us what happened.
Feel free to turn on the <work_fetch_debug> flag to see what BOINC is doing during work fetch. http://boinc.berkeley.edu/wiki/Client_configuration
Regards,
Jacob |
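The buffer logic described in the post above can be put into rough numbers. This is a deliberately simplified sketch, not the client's real work-fetch code: the function names are invented, the 0.5-day values for "Min Buffer" and "Additional Buffer" are hypothetical, and it ignores details such as idle instances and per-project shares. It only shows why a multi-threaded task that occupies all 16 threads for 70 hours can fill a buffer on its own.

```python
import math

SECONDS_PER_DAY = 86_400


def work_request_seconds(n_instances, min_buffer_days, extra_buffer_days,
                         queued_instance_seconds=0.0):
    """Instance-seconds of work to request so every instance stays busy through the buffer."""
    target = n_instances * (min_buffer_days + extra_buffer_days) * SECONDS_PER_DAY
    return max(0.0, target - queued_instance_seconds)


def tasks_to_fill(request_seconds, est_runtime_hours, cpus_per_task):
    """Tasks needed to cover the request; an MT task occupies cpus_per_task instances at once."""
    per_task = est_runtime_hours * 3600 * cpus_per_task
    return math.ceil(request_seconds / per_task)


# Hypothetical settings: 16 threads, empty cache, 0.5 + 0.5 days of buffer.
req = work_request_seconds(16, 0.5, 0.5)
n_mt = tasks_to_fill(req, est_runtime_hours=70, cpus_per_task=16)
print(f"request: {req:.0f} instance-seconds, MT tasks needed: {n_mt}")
```

Under these assumed settings a single 70-hour, 16-thread task more than covers the request, which suggests a download of 16 such tasks would need either much larger buffers or a much lower runtime estimate at the time of the request.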
|
|
|
Jacob,
Per your suggestion, I aborted all tasks, disabled the app_config file, turned on the work_fetch_debug option, started BOINC, and allowed new GPUGRID tasks. It downloaded one task.
Then I aborted that task, enabled the app_config file, restarted BOINC and allowed new tasks. It downloaded one task.
Then I turned off the work_fetch_debug option, aborted the task, restarted BOINC, and allowed new tasks. It downloaded one task.
No idea why it downloaded 16 tasks at one time yesterday. Must have been sun spots or something like that. Anyway, it seems to be working today.
Thanks for the suggestion.
captainjack |
|
|
|
Strange.
The only things I can think of, offhand, would be:
- maybe your local cache of work-on-hand had been much lower during the "16-task-work-fetch", as compared to the "1-task-work-fetch"
- maybe your cache settings ("Min buffer" and "Max additional buffer") were different between the fetches.
Anyway, I'm glad to hear it's working for you!
If you have any questions/problems related to work fetch, grab some <work_fetch_debug> Event Log data, and feel free to PM me. I am a work fetch guru -- I helped David A (the main BOINC designer) make sure work fetch works well across projects, resources (cpus, gpus, asics), task types (st single threaded, mt multi threaded), etc. The current BOINC 7.4.27 release does include a handful of work fetch fixes compared to the prior release. You should make sure all your devices are using 7.4.27.
Regards,
Jacob |
|
|
|
This application appears to have problems restarting from a checkpoint -
I suspended it for a few days, then when I told it to resume, it gave a
computation error less than a second later.
Test application for CPU MD v9.01 (mtsse2)
http://www.gpugrid.net/result.php?resultid=13426589
http://www.gpugrid.net/workunit.php?wuid=10302711 |
|
|
skgivenVolunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Robert, was LAIM on?
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
TrotadorSend message
Joined: 25 Mar 12 Posts: 103 Credit: 13,393,727,393 RAC: 70,894,421 Level
Scientific publications
|
So, is it still a Test application? Not ready for science production yet? |
|
|
|
Robert, was LAIM on?
What's LAIM? How do I tell if it's on? |
|
|
|
Robert, was LAIM on?
What's LAIM? How do I tell if it's on?
LAIM |
|
|
|
LAIM stands for the "Leave application in memory" (while suspended) setting in the BOINC client, under the disk and memory usage tab. |
|
|
|
LAIM stands for the "Leave application in memory" (while suspended) setting in the BOINC client, under the disk and memory usage tab.
It's on. However, I may have installed some updates and rebooted while the workunit was suspended.
|
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
The CPU work units apparently use GROMACS 4.6, which has provisions for GPU acceleration also. Is that being planned? |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
It looks like the work units have now gone from 4 cores to 6 cores. It is possible that the difference is due to an increase in the number of cores I allowed in BOINC, but I think it is more likely to be a change in the work units themselves.
GPUGRID 9.03 Test application for CPU MD (mtavx) 73801-MJHARVEY_CPUDHFR2-0-1-RND3693_0 - (-) 6C
That is perfectly OK with me, and I am glad to find a project that uses AVX. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
To answer my own question, it looks like it is due to the changes that I made in BOINC. It is now up to 8 cores with the latest WU download, though it seems to me that it was limited by something else when I first started, but that was a while ago. |
|
|
|
I'm pretty sure that the thread count of the task is set either at time-of-download, or time-of-task-start... And it's based on the "Use at most X% of CPUs" setting. |
|
|
eXaPowerSend message
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
So, is it still a Test application? Not ready for science production yet?
MJH:
Is CPUMD MJHARVEY_CPUDHFR2 finished? Will a batch of new (test) CPUMD tasks be available? Or is CPUMD transitioning to production?
|
|
|
|
Bump. Are there new CPU workunits coming? Hope all is well with the project. |
|
|
MJHProject administrator Project developer Project scientist Send message
Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level
Scientific publications
|
The CPU work is temporarily in abeyance while we prepare a new application.
Check back later or, if you have an AMD GPU, please participate in testing the new app.
Matt |
|
|
|
Fancy word!
a·bey·ance
əˈbāəns/
noun: abeyance
a state of temporary disuse or suspension.
https://www.google.com/search?q=define%3Aabeyance
PS: Please use simpler words :-p |
|
|