
Message boards : News : ACEMD 4

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Message 58395 - Posted: 1 Mar 2022 | 12:37:14 UTC

Hello everybody,

You probably have noticed the new ACEMD 4 app and have a few questions.


    What is the difference between ACEMD 3 and ACEMD 4?
    A key new feature is the integration of machine learning into molecular simulations. Specifically, we are implementing a new method called NNP/MM.

    What is NNP/MM?
    NNP/MM is a hybrid simulation method combining neural network potentials (NNP) and molecular mechanics (MM). An NNP can model molecular interactions more accurately than the conventional force fields used in MM, but it is still not as fast. Thus, only the important part of a molecular system is simulated with the NNP, while the rest is simulated with MM. You can read more in a pre-print of the NNP/MM article: https://arxiv.org/abs/2201.08110
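A toy sketch of the NNP/MM split just described (illustrative only; the function name and energy terms here are assumptions, not ACEMD's actual API): the important region is scored by a neural network potential, the rest by a classical force field, and the total energy is their sum plus a coupling term.

```python
# Toy illustration of NNP/MM partitioning (not ACEMD's real interface):
# the ligand region is evaluated with a neural network potential, the
# surrounding protein/solvent with a classical force field, and the two
# regions are coupled through an MM-level interaction term.
def nnp_mm_energy(nnp_region, mm_region, nnp_model, mm_forcefield, coupling):
    e_nnp = nnp_model(nnp_region)            # accurate but slow (e.g. a TorchMD-NET model)
    e_mm = mm_forcefield(mm_region)          # fast conventional force field
    e_int = coupling(nnp_region, mm_region)  # NNP/MM interaction term
    return e_nnp + e_mm + e_int

# Example with stand-in energy functions (values in kcal/mol):
e = nnp_mm_energy(["ligand"], ["solvent"],
                  nnp_model=lambda a: -12.0,
                  mm_forcefield=lambda a: -30.0,
                  coupling=lambda a, b: -1.5)
print(e)  # -43.5
```

In ACEMD 4 the NNP part is evaluated with PyTorch, which is why the software stack grew to 3 GB.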

    How much more accurate is NNP?
    You can read a pre-print of the TorchMD-NET article: https://arxiv.org/abs/2202.02541

    What are software/hardware requirements for ACEMD 4?
    Pretty much the same as for ACEMD 3. The only significant change is the size of the software stack: ACEMD 3 and all its dependencies need just 1 GB, while for ACEMD 4 that has increased to 3 GB, notably due to PyTorch (https://pytorch.org). Also, at the moment, only a Linux version for CUDA >= 11.2 is available.

    When will ACEMD 3 be replaced by ACEMD 4?
    Within a few months, we will release ACEMD 4 officially and ACEMD 3 will be deprecated. For the moment, the apps will coexist and you will receive WUs for both of them.

    What will happen next?
    We have already sent several WUs to test the deployment of ACEMD 4 and will continue this week. Let us know if you notice any irregularities. Next week, we are aiming to start sending production WUs.


Happy computing,

Raimondas

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58396 - Posted: 1 Mar 2022 | 15:37:27 UTC

Thanks for the explanations! I'll be reading those preprints soon.

Do we need to check the Run Test Applications preference?

There's no acemd4 option in preferences?

Why are there 2 Use CPU boxes?

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Message 58397 - Posted: 1 Mar 2022 | 17:25:27 UTC - in response to Message 58396.

Thanks! I have added the ACEMD 4 option.

Why are there 2 Use CPU boxes?

Where do you see these boxes?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58398 - Posted: 1 Mar 2022 | 18:59:08 UTC - in response to Message 58397.

Thanks! I have added the ACEMD 4 option.

Why are there 2 Use CPU boxes?

Where do you see these boxes?


Click on Edit Preferences and there's one at the top and one at the bottom.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58399 - Posted: 1 Mar 2022 | 22:30:50 UTC - in response to Message 58398.

Thanks! I have added the ACEMD 4 option.

Why are there 2 Use CPU boxes?

Where do you see these boxes?


Click on Edit Preferences and there's one at the top and one at the bottom.

Not seeing this on my account.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 17,709
Message 58400 - Posted: 1 Mar 2022 | 22:41:02 UTC - in response to Message 58399.

Resource share

Use CPU Enforced by version 6.10+ no
Use ATI GPU Enforced by version 6.10+ no
Use NVIDIA GPU Enforced by version 6.10+ yes

Run test applications? yes
Is it OK for GPUGRID and your team (if any) to email you? yes
Should GPUGRID show your computers on its web site? yes

Default computer location ---

Maximum CPU % for graphics 0 ... 100 20

Run only the selected applications ACEMD 3: no
ACEMD 4: no
Quantum Chemistry (CPU): no
Quantum Chemistry (CPU, beta): no
Python Runtime (CPU, beta): no
Python Runtime (GPU, beta): yes

If no work for selected applications is available, accept work from other applications? yes

Use Graphics Processing Unit (GPU) if available yes
Use Central Processing Unit (CPU) yes


This is what my default location looks like.
The CPU is at the top and bottom.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58401 - Posted: 1 Mar 2022 | 23:15:18 UTC - in response to Message 58400.

Resource share

Use CPU Enforced by version 6.10+ no
Use ATI GPU Enforced by version 6.10+ no
Use NVIDIA GPU Enforced by version 6.10+ yes

Run test applications? yes
Is it OK for GPUGRID and your team (if any) to email you? yes
Should GPUGRID show your computers on its web site? yes

Default computer location ---

Maximum CPU % for graphics 0 ... 100 20

Run only the selected applications ACEMD 3: no
ACEMD 4: no
Quantum Chemistry (CPU): no
Quantum Chemistry (CPU, beta): no
Python Runtime (CPU, beta): no
Python Runtime (GPU, beta): yes

If no work for selected applications is available, accept work from other applications? yes

Use Graphics Processing Unit (GPU) if available yes
Use Central Processing Unit (CPU) yes


This is what my default location looks like.
The CPU is at the top and bottom.


OK, after refreshing the browser, I am seeing use cpu in both places also.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58402 - Posted: 1 Mar 2022 | 23:16:24 UTC - in response to Message 58397.

Thanks! I have added the ACEMD 4 option.

Why are there 2 Use CPU boxes?

Where do you see these boxes?

In Project Preferences

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58403 - Posted: 1 Mar 2022 | 23:17:13 UTC

Next week, we are aiming to start sending production WUs.


Can you say at what quantities?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58404 - Posted: 2 Mar 2022 | 13:24:19 UTC
Last modified: 2 Mar 2022 | 14:00:02 UTC

Just tried to run one of the new v1.01 tasks. Failed with this error message:

13:16:51 (224808): wrapper: running bin/conda-unpack ()
/usr/bin/env: ‘python’: No such file or directory
13:16:52 (224808): bin/conda-unpack exited; CPU time 0.000612
13:16:52 (224808): app exit status: 0x7f

Another is downloading on my second Linux machine as I type: I'll try to watch it running.

Edit - second task failed the same way. Trying to work out what those unprintable (?punctuation?) characters represent.

Edit2 - no, all mine are failing, and I can't work out those character codes (not a format I recognise). &#152/153 feels like a start-stop pair (open quote/close quote?), but I can't get further than that.

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Message 58405 - Posted: 2 Mar 2022 | 14:41:56 UTC - in response to Message 58403.

About 400 WUs

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Message 58406 - Posted: 2 Mar 2022 | 14:43:19 UTC - in response to Message 58404.

Just tried to run one of the new v1.01 tasks. Failed with this error message:

13:16:51 (224808): wrapper: running bin/conda-unpack ()
/usr/bin/env: ‘python’: No such file or directory
13:16:52 (224808): bin/conda-unpack exited; CPU time 0.000612
13:16:52 (224808): app exit status: 0x7f

Another is downloading on my second Linux machine as I type: I'll try to watch it running.

Edit - second task failed the same way. Trying to work out what those unprintable (?punctuation?) characters represent.

Edit2 - no, all mine are failing, and I can't work out those character codes (not a format I recognise). &#152/153 feels like a start-stop pair (open quote/close quote?), but I can't get further than that.


Don't worry, I'll get rid of that conda-unpack.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58407 - Posted: 2 Mar 2022 | 15:31:06 UTC - in response to Message 58406.

Don't worry, I'll get rid of that conda-unpack.

No probs. I've just failed another one, so I'll set 'No new Tasks' until you're ready. Can you let us know when you're set for the next test, please?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58408 - Posted: 2 Mar 2022 | 16:25:24 UTC
Last modified: 2 Mar 2022 | 16:29:25 UTC

One out of nine has passed so far. Here's the latest failure; I can't help with what code 195 means, just sharing.
Edit: Dr Google found it: "ERR_NO_APP_VERSION -195 - BOINC couldn't find the application's version number."
https://boinc.mundayweb.com/wiki/index.php?title=Error_code_-191_to_-200_explained

Stderr output
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
04:49:47 (986503): wrapper (7.7.26016): starting
04:49:47 (986503): wrapper (7.7.26016): starting
04:49:47 (986503): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.bz2)
04:56:02 (986503): /bin/tar exited; CPU time 369.002826
04:56:02 (986503): wrapper: running bin/conda-unpack ()
/usr/bin/env: &#226;&#128;&#152;python&#226;&#128;&#153;: No such file or directory
04:56:03 (986503): bin/conda-unpack exited; CPU time 0.001065
04:56:03 (986503): app exit status: 0x7f
04:56:03 (986503): called boinc_finish(195)
</stderr_txt>
]]>
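For what it's worth, those numeric character references are just the UTF-8 bytes of the curly quotes U+2018/U+2019 that appear around 'python' in the error message, printed one byte per decimal entity. A quick Python check confirms it:

```python
# The decimal character references in the stderr (&#226;&#128;&#152; ... &#153;)
# are the raw UTF-8 bytes of the curly quotes U+2018 / U+2019 around 'python',
# shown byte by byte.
raw = bytes([226, 128, 152]) + b"python" + bytes([226, 128, 153])
decoded = raw.decode("utf-8")
print(decoded)  # ‘python’
```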

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58409 - Posted: 2 Mar 2022 | 17:08:44 UTC - in response to Message 58408.

Your computers are hidden, so I can't see the details of the task which succeeded. Was it the most recent (implying that Raimondas has fixed conda-unpack already), or an old v100 left over from the previous (working) run?

Erich56
Send message
Joined: 1 Jan 15
Posts: 944
Credit: 3,685,513,165
RAC: 826,467
Message 58410 - Posted: 2 Mar 2022 | 19:04:06 UTC - in response to Message 58405.

Raimondas wrote:

About 400 WUs

Linux only, or Windows also?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Message 58413 - Posted: 3 Mar 2022 | 4:29:19 UTC - in response to Message 58410.

He said earlier it would be a Linux only application. For now.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58414 - Posted: 3 Mar 2022 | 13:53:12 UTC

Raimondas has issued a new download - 2.86 gigabytes this time. It's downloading (for now) at around 7.5 megabytes/sec, which is great. But I fear that speeds may drop if too many people try to download it at once (as some people reported yesterday).

Other BOINC projects have overcome this problem by redirecting these big downloads for new releases to a caching online server service: because everyone is downloading the identical file, caching is viable and valuable.

The new file has now completed downloading, and is running as v1.02 - seemingly successfully. It started making accurate progress reports at 50%, and then progressed in 5% steps. It completed without errors in 6 minutes 44 sec (GTX 1660). Just waiting for a slot in the DDOS protection to report.

Task 32747705

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 17,709
Message 58415 - Posted: 3 Mar 2022 | 15:30:45 UTC - in response to Message 58414.

Mine will take almost 11 hours to download. It might finish in time on a GTX 1070.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58416 - Posted: 3 Mar 2022 | 15:37:49 UTC
Last modified: 3 Mar 2022 | 16:28:59 UTC

T13_7-RAIMIS_TEST-0-2-RND0280_1 did a bad thing. It took over from an acemd3 WU that was 88% done. It should wait its turn.
I'm checking to see if it'll Suspend On Checkpoint.

Edit: It got to 80% at 9:06 minutes and suspended. So it has at least one checkpoint.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 194
Credit: 539,137,515
RAC: 13
Message 58418 - Posted: 4 Mar 2022 | 1:53:27 UTC

My task took 9.5 hours to download, and ran for 8 minutes. Of those 8 minutes, the first several minutes showed zero load on the GPU. I assume it was unpacking the task then. So it really ran for only about 5 minutes on the GPU.
____________
Reno, NV
Team: SETI.USA

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 17,709
Message 58419 - Posted: 4 Mar 2022 | 2:17:12 UTC - in response to Message 58418.
Last modified: 4 Mar 2022 | 2:17:38 UTC

My task took 9.5 hours to download, and ran for 8 minutes. Of those 8 minutes, the first several minutes showed zero load on the GPU. I assume it was unpacking the task then. So it really ran for only about 5 minutes on the GPU.

That is pretty much what I got, though I did not pay attention to the load. It was much ado about nothing. When they get more and longer ones, I will try again.

Azmodes
Send message
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 82
Message 58420 - Posted: 4 Mar 2022 | 14:15:49 UTC
Last modified: 4 Mar 2022 | 14:16:37 UTC

My hosts received WUs, but they error out after 5 minutes:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:39:40 (150677): wrapper (7.7.26016): starting
14:39:40 (150677): wrapper (7.7.26016): starting
14:39:40 (150677): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.bz2)
14:44:45 (150677): /bin/tar exited; CPU time 299.093084
14:44:45 (150677): wrapper: running bin/acemd (--boinc --device 0)
ERROR: /home/user/conda/conda-bld/acemd_1646158992086/work/src/mdio/amberparm.cpp line 76: Unsupported PRMTOP version!
14:44:46 (150677): bin/acemd exited; CPU time 0.205850
14:44:46 (150677): app exit status: 0x9e
14:44:46 (150677): called boinc_finish(195)

</stderr_txt>
]]>


I noticed that the only ones failing are v1.02; the v2.19 ones validate (well, one has so far, the rest are still running). I'm a bit confused: is v1 ACEMD 3 and v2 ACEMD 4? Or are both different versions of ACEMD 4?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58421 - Posted: 4 Mar 2022 | 14:24:38 UTC - in response to Message 58420.

I noticed that the only ones failing are v1.02. v2.19 ones validate (well, one did so far, the rest are still running). I'm a bit confused, so is v1 ACEMD 3 and v2 ACEMD 4? Or are both different versions of ACEMD 4?


It's confusing the way they switched to a long-winded name and stopped labeling them acemd3 (v2.19) and acemd4 (v1.0).

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58422 - Posted: 4 Mar 2022 | 14:49:58 UTC - in response to Message 58420.
Last modified: 4 Mar 2022 | 15:01:39 UTC

You've also had errors like one I've just encountered: EXIT_DISK_LIMIT_EXCEEDED

Like you, my first failure was with Unsupported PRMTOP version! (task 32749480), followed by

<workunit>
<name>T1_NNPMM_1ajv_07-RAIMIS_NNPMM-0-2-RND3217</name>
<app_name>acemd4</app_name>
<version_num>102</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>

(task 32749572)

It's possible that the first error didn't clean up properly behind itself, and caused the combined project total to exceed that 10,000,000,000 byte limit - though that seems unlikely.

All hell is breaking out at GPUGrid today, with ADRIA acemd3 tasks completing in under an hour - we'll just have to wait while things get sorted out one by one, and the dust settles!

Edit - the slot directory where my next task is running contains 17,298 items, totalling 10.3 GB (10.3 GB on disk) - above the limit, although BOINC hasn't noticed yet.

Edit 2 - it has now. Task 32750112
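The limit being tripped here is the workunit's <rsc_disk_bound> shown above (10,000,000,000 bytes). A rough sketch of how slot usage can be tallied against it (the path in the comment is an example; this is not BOINC's actual accounting code):

```python
import os

RSC_DISK_BOUND = 10_000_000_000  # bytes, from the workunit definition above

def slot_usage(slot_dir):
    """Sum the sizes of all regular files under a slot directory."""
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

# e.g. slot_usage("/var/lib/boinc-client/slots/3") > RSC_DISK_BOUND
# would correspond to the EXIT_DISK_LIMIT_EXCEEDED error for this workunit.
```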

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,384,861,035
RAC: 14,849
Message 58423 - Posted: 4 Mar 2022 | 15:10:45 UTC

Could it have anything to do with your running Borg BOINC? Resistance is futile :-)

Azmodes
Send message
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 82
Message 58424 - Posted: 4 Mar 2022 | 15:21:40 UTC

Ah, so the failing tasks are actually acemd3?

I'm not sure what that disk limit error is about. All my relevant hosts are set to allow between 30 and 75 GB of space for BOINC. The event logs confirm this setting.

Also, unrelated and nothing new, but this site is such a supreme pain in the butt to navigate if you are running multiple hosts from the same external IP... Ridiculous.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58425 - Posted: 4 Mar 2022 | 15:36:46 UTC

We currently have:

Advanced molecular dynamics simulations for GPUs v2.19 - that's acemd3
Advanced molecular dynamics simulations for GPUs v1.02 - that's acemd4

Clear as mud ??!!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58426 - Posted: 4 Mar 2022 | 15:45:33 UTC - in response to Message 58424.

I'm not sure what that disk limit error is about.

The limit in question is set at the workunit level, and applies to the amount copied to the working ('slot') directory, plus any data generated during the run. The BOINC limits are applied to the sum total of all files, in all directories, under the control of BOINC.

To prove the point, I caught task 32750377 and suspended it before it started running. Then, I shut down the BOINC client, and edited client_state.xml to double the workunit limit. It ran to completion, and was validated.

I'll do another one to check, but I can't be sitting here manually editing BOINC files every five minutes - this needs catching at source, and quickly.
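The manual workaround described above can also be scripted. A hedged sketch (stop the BOINC client first; the file path and workunit name are placeholders, not values from this thread):

```python
# Sketch of the manual fix described above: with the BOINC client stopped,
# double <rsc_disk_bound> for one named workunit in client_state.xml.
import xml.etree.ElementTree as ET

def double_disk_bound(state_file, wu_name):
    tree = ET.parse(state_file)
    changed = False
    for wu in tree.getroot().iter("workunit"):
        if wu.findtext("name") == wu_name:
            bound = wu.find("rsc_disk_bound")
            bound.text = "%f" % (float(bound.text) * 2)
            changed = True
    if changed:
        tree.write(state_file)
    return changed

# e.g. double_disk_bound("/var/lib/boinc-client/client_state.xml",
#                        "T1_NNPMM_..._RND0000")  # placeholder name
```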

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58427 - Posted: 4 Mar 2022 | 16:10:53 UTC

And the next one - workunit 27113095, created 15:43:59 UTC - already has the fix in place. Kudos to whoever was watching our conversation here.

It's also running a lot slower - nearly two minutes for each 2.5% step - so hopefully we're starting to get some real science done.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 194
Credit: 539,137,515
RAC: 13
Message 58428 - Posted: 4 Mar 2022 | 16:17:26 UTC - in response to Message 58427.

And the next one - workunit 27113095, created 15:43:59 UTC - already has the fix in place. Kudos to whoever was watching our conversation here.

It's also running a lot slower - nearly two minutes for each 2.5% step - so hopefully we're starting to get some real science done.


Did the fix require the whole app to be re-downloaded? Please say no...
____________
Reno, NV
Team: SETI.USA

Azmodes
Send message
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 82
Message 58429 - Posted: 4 Mar 2022 | 16:39:30 UTC
Last modified: 4 Mar 2022 | 16:52:03 UTC

Thanks for the explanation, RH. :)

Getting validating acemd4 tasks now.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Message 58430 - Posted: 4 Mar 2022 | 17:01:24 UTC - in response to Message 58428.

Did the fix require the whole app to be re-downloaded? Please say no...

No!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Message 58431 - Posted: 4 Mar 2022 | 17:08:20 UTC

I just downloaded the 2.8 GB tar package, and it only took a few minutes at the roughly 15 Mbps transfer rate.

The ~500 MB model file, however, is going quite slowly at ~200 Kbps and is dragging along.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Message 58432 - Posted: 4 Mar 2022 | 21:06:38 UTC

Up to 3GB file. 3hr45min to get to 66% download.

Greger
Send message
Joined: 6 Jan 15
Posts: 68
Credit: 6,775,854,299
RAC: 150,355
Message 58433 - Posted: 4 Mar 2022 | 21:11:39 UTC
Last modified: 4 Mar 2022 | 21:15:27 UTC

The 3 GB file got downloaded in a few minutes and the unit looks to be working fine.

Task: http://www.gpugrid.net/result.php?resultid=32751730

stderr out

Stderr output
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
17:49:50 (2974532): wrapper (7.7.26016): starting
17:49:50 (2974532): wrapper (7.7.26016): starting
17:49:50 (2974532): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.bz2)
17:58:25 (2974532): /bin/tar exited; CPU time 509.278471
17:58:25 (2974532): wrapper: running bin/acemd (--boinc --device 0)
21:34:24 (2974532): bin/acemd exited; CPU time 12867.200624
21:34:24 (2974532): called boinc_finish(0)

</stderr_txt>
]]>


run.log

#
# ACEMD version 4.0.0rc6
#
# Copyright (C) 2017-2022 Acellera (www.acellera.com)
#
# When publishing, please cite:
# ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
# M. J. Harvey, G. Giupponi and G. De Fabritiis,
# J Chem. Theory. Comput. 2009 5(6), pp1632-1639
# DOI: 10.1021/ct9000685
#
# Arguments:
# input: input
# platform:
# device: 0
# ncpus:
# precision: mixed
#
# ACEMD is running in Boinc mode!
#
# Read input file: input
# Parse input file
$
$# Forcefield configuration
$
$ parmFile structure.prmtop
$ nnpFile model.json
$
$# Initial State
$
$ coordinates structure.pdb
$ binCoordinates input.coor
$ binVelocities input.vel
$ extendedSystem input.xsc
$# temperature 298.15 # Explicit velocity field provided
$
$# Output
$
$ trajectoryPeriod 25000
$ trajectoryFile output.xtc
$
$# Electrostatics
$
$ PME on
$ cutoff 9.00 # A
$ switching on
$ switchDistance 7.50 # A
$ implicitSolvent off
$
$# Temperature Control
$
$ thermostat on
$ thermostatTemperature 310.00 # K
$ thermostatDamping 0.10 # /ps
$
$# Pressure Control
$
$ barostat off
$ barostatPressure 1.0000 # bar
$ barostatAnisotropic off
$ barostatConstRatio off
$ barostatConstXY off
$
$# Integration
$
$ timeStep 2.00 # fs
$ slowPeriod 1
$
$# External forces
$
$
$# Restraints
$
$
$# Run Configuration
$
$ restart off
$ run 500000
# Parse force field and topology files
# Force field: AMBER
# PRMTOP file: structure.prmtop
#
# Force field parameters
# Number of atom parameters: 12
# Number of bond parameters: 14
# Number of angle parameters: 22
# Number of dihedral parameters: 20
# Number of improper parameters: 0
# Number of CMAP parameters: 0
#
# System topology
# Number of atoms: 5058
# Number of bonds: 5062
# Number of angles: 136
# Number of dihedrals: 240
# Number of impropers: 0
# Number of CMAPs: 0
#
# Initializing engine
# Version: 7.7
# Plugin directory: /var/lib/boinc-client/slots/3/lib/acemd3
# Loaded plugins
# CPU
# PME
# CUDA
# CudaCompiler
# WARNING: there is no library for "OpenCL" plugin
# PlumedCUDA
# WARNING: there is no library for "PlumedOpenCL" plugin
# PlumedReference
# TorchReference
# TorchCUDA
# WARNING: there is no library for "TorchOpenCL" plugin
# Available platforms
# CPU
# CUDA
#
# Bonded interactions
# Harmonic bond interactions
# Number of terms: 5062
# Harmonic angle interactions
# Number of terms: 136
# Urey-Bradley interactions
# Number of terms: 0
# Number of skipped terms (zero force constant): 136
# NOTE: Urey-Bradley interations skipped
# Proper dihedral interations
# Number of terms: 224
# Number of skipped terms (zero force constants): 16
# Improper dihedral interations
# Number of terms: 0
# NOTE: improper dihedral interations skipped
# CMAP interactions
# Number of terms: 0
# NOTE: CMAP interations skipped
#
# Non-bonded interactions
# Number of exclusions: 5391
# Lennard-Jones terms
# Cutoff distance: 9.000 A
# Switching distance: 7.500 A
# Coulombic (PME) term
# Ewald tolerance: 0.000500
# No NBFIX
# No implicit solvent
#
# NNP
# Configuration file: model.json
# Model type: torch
# Model file: model.nnp
# Number of atoms: 75
#
# Constraining hydrogen (X-H) bonds
# Number of constrained bonds: 3356
# Making water molecules rigid
# Number of water molecules: 1661
# Number of constraints: 5017
#
# Reading box sizes from input.xsc
#
# Creating simulation system
# Number of particles: 5058
# Number of degrees of freedom 10154
# Periodic box size: 37.314 37.226 37.280 A
#
# Integrator
# Type: velocity Verlet
# Step size: 2.00 fs
# Constraint tolerance: 1.0e-06
#
# Thermostat
# Type: Langevin
# Target temperature: 310.00 K
# Friction coefficient: 0.10 ps^-1
#
# Setting up platform: CUDA
# Interactions: 1 2 4 7 14 12
# Platform properties:
# DeviceIndex: 0
# DeviceName: NVIDIA GeForce RTX 3070
# UseBlockingSync: false
# Precision: mixed
# UseCpuPme: false
# CudaCompiler: /usr/local/cuda/bin/nvcc
# TempDirectory: /tmp
# CudaHostCompiler:
# DisablePmeStream: false
# DeterministicForces: false
#
# Set initial positions from an input file
#
# Initial velocities
# File: input.vel
#
# Optimize platform for MD
# Number of constraints: 5017
# Harmonic bond interations
# Initial number of terms: 5062
# Optimized number of terms: 45
# Remaining interactions: 2 4 7 14 12 1
#
# Running simulation
# Current step: 0
# Number of steps: 500000
#
# Trajectory output
# Positions: output.xtc
# Period: 25000
# Wrapping: off
#
# Log, trajectory, and restart files are written every 50.000 ps (25000 steps)
# Step Time Bond Angle Urey-Bradley Dihedral Improper CMAP Non-bonded Implicit External Potential Kinetic Total Temperature Volume
# [ps] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [kcal/mol] [K] [A^3]
25000 50.00 7.8698 22.5692 0.0000 62.2167 0.0000 0.0000 -15734.1899 0.0000 -1379454.7644 -1395096.2985 3140.7230 -1391955.5755 311.303 51783.76
# Speed: average 6.63 ns/day, current 6.63 ns/day
# Progress: 5.0, remaining time: 3:26:16, ETA: Fri Mar 4 21:35:50 2022
50000 100.00 6.7940 26.1562 0.0000 56.6657 0.0000 0.0000 -15592.6529 0.0000 -1379450.5644 -1394953.6014 3101.0705 -1391852.5309 307.372 51783.76
# Speed: average 6.66 ns/day, current 6.68 ns/day
# Progress: 10.0, remaining time: 3:14:43, ETA: Fri Mar 4 21:35:03 2022
75000 150.00 4.6298 22.0773 0.0000 58.5973 0.0000 0.0000 -15798.4139 0.0000 -1379459.1078 -1395172.2173 3143.0430 -1392029.1743 311.533 51783.76
# Speed: average 6.66 ns/day, current 6.68 ns/day
# Progress: 15.0, remaining time: 3:03:42, ETA: Fri Mar 4 21:34:50 2022
100000 200.00 7.8170 26.8665 0.0000 59.5203 0.0000 0.0000 -15618.1979 0.0000 -1379453.2405 -1394977.2346 3138.9295 -1391838.3051 311.125 51783.76
# Speed: average 6.67 ns/day, current 6.68 ns/day
# Progress: 20.0, remaining time: 2:52:48, ETA: Fri Mar 4 21:34:42 2022
125000 250.00 8.7645 24.8827 0.0000 59.5112 0.0000 0.0000 -15731.4732 0.0000 -1379450.9954 -1395089.3103 3081.5431 -1392007.7672 305.437 51783.76
# Speed: average 6.67 ns/day, current 6.68 ns/day
# Progress: 25.0, remaining time: 2:41:56, ETA: Fri Mar 4 21:34:37 2022
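As a sanity check on the log above: a run of 500000 steps at a 2 fs time step is 1 ns of simulated time, which at the reported ~6.63 ns/day works out to about 3.6 hours of wall time, consistent with the printed ETA:

```python
# Cross-check the run.log figures: run 500000 steps, timeStep 2 fs,
# reported average speed ~6.63 ns/day.
steps = 500_000
timestep_fs = 2.0
speed_ns_per_day = 6.63

total_ns = steps * timestep_fs * 1e-6      # 1 fs = 1e-6 ns
runtime_h = total_ns / speed_ns_per_day * 24
print(f"{total_ns:.2f} ns simulated, ~{runtime_h:.2f} h wall time")
# -> 1.00 ns simulated, ~3.62 h wall time (consistent with the ETA above)
```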

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Message 58434 - Posted: 4 Mar 2022 | 21:28:32 UTC

Is it really necessary to spend 4-5 minutes every task extracting the same 3 GB file? It seems unnecessary. If it's not downloading a new file every time, then why extract the same file over and over? Why not just leave it extracted?
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Message 58436 - Posted: 4 Mar 2022 | 22:53:25 UTC - in response to Message 58432.

Up to 3GB file. 3hr45min to get to 66% download.


5hr20min to download
10+ min to start processing and already at 50% complete
Task completed in just under 12 min. Less than 2 minutes of processing on a 3070Ti at around 55% load.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 194
Credit: 539,137,515
RAC: 13
Message 58437 - Posted: 5 Mar 2022 | 1:52:50 UTC

Since the original task, I have received several more, with no giant file to download again. Then I got a task for the same machine, and it is downloading another beast of a file, veeeery slowly. 4.5 hours so far, and only 29% complete. BOINCtasks says it's 9.71 KBps. Ouch.

____________
Reno, NV
Team: SETI.USA

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 194
Credit: 539,137,515
RAC: 13
Message 58438 - Posted: 5 Mar 2022 | 4:26:09 UTC - in response to Message 58437.

Since the original task, I have received several more, with no giant file to download again. Then I got a task for the same machine, and it is downloading another beast of a file, veeeery slowly. 4.5 hours so far, and only 29% complete. BOINCtasks says it's 9.71 KBps. Ouch.


A follow-up: the download seems to stall eventually, but going into BOINC and toggling networking off and on seems to restart it at a reasonable pace, so it finishes in a matter of minutes instead of hours.

This project needs to get its networking in order.

____________
Reno, NV
Team: SETI.USA

Erich56
Send message
Joined: 1 Jan 15
Posts: 944
Credit: 3,685,513,165
RAC: 826,467
Message 58439 - Posted: 5 Mar 2022 | 8:32:19 UTC - in response to Message 58438.

This project needs to get its networking in order.

that's what I have been saying often enough in the past

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58440 - Posted: 5 Mar 2022 | 15:48:15 UTC

it's also interesting to see that this new ACEMD4 application does not have the same high PCIe bus use as the ACEMD3 app. this should allow faster processing on systems with narrower bus widths (cards on USB risers, cards plugged in via the chipset, older systems with PCIe 2.0 or less, etc.)

it seems fairly bound on the memory bandwidth though. my 3080Ti is using up to about 80% memory bus which is a bit higher than the ACEMD3 app, fast cards with a smaller bus will be more bound. but this is better for speed I think, reaching back and forth to GPU RAM is a lot faster than reaching back and forth over the PCIe bus to system RAM.

still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58441 - Posted: 5 Mar 2022 | 16:33:16 UTC - in response to Message 58440.

still curious about the constant unpacking of the compressed file for every task. wastes 5 mins for each task doing the same thing over and over. if you just leave it unpacked then you save 5 mins on each subsequent task.

It's BOINC that empties the slot directory when a task has finished uploading, exited and reported. BOINC will check again that the allocated slot is still empty before starting a new task. It won't re-use an old slot if there's anything left behind, whether it's the (a) same project, same application, (b) same project, different application, or (c) a different project entirely.

The slot directory is also the 'working' directory in operating system terms, and both the operating system and the GPUGrid project use it in that sense. To use a different location for persistent files would require some effort in modifying the Path environment to let GPUGrid run.

Personally, I suspect the "everything, including the kitchen sink" compressed files are perhaps over-specified. The 17,298 items, 10.3 GB I found in there yesterday feels like an 'oh, include it, just in case' solution. When testing is complete and production is about to start, perhaps the project could audit the compressed archives and strip them back to the bare minimum?
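The "leave it extracted" idea raised upthread would need a persistent location outside the slot directory, plus a guard so the unpack only happens once. A minimal sketch of that guard follows; `extract_once` is a hypothetical helper, not BOINC or GPUGrid code:

```python
import tarfile
from pathlib import Path

def extract_once(archive: Path, dest: Path) -> bool:
    """Extract archive into dest only if a marker file is absent.

    Returns True if extraction actually happened, False if it was skipped.
    """
    marker = dest / ".extracted"
    if marker.exists():
        return False  # already unpacked; skip the lengthy decompression
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    marker.touch()
    return True
```

Because BOINC wipes the slot after each task, `dest` would have to live somewhere persistent (e.g. the project directory) for the marker to survive between tasks.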

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58442 - Posted: 5 Mar 2022 | 17:14:48 UTC

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58443 - Posted: 5 Mar 2022 | 17:50:37 UTC - in response to Message 58441.

unless they build a single binary file for processing, like most other projects do. then they just dump the binary into the projects folder and it gets used over and over.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Level
Met
Scientific publications
watwatwatwatwat
Message 58444 - Posted: 5 Mar 2022 | 18:58:49 UTC - in response to Message 58442.

Another problem in P0_NNPMM_2p95_19-RAIMIS_NNPMM-0-20-RND5821_0 (task 32755591):

Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
working set size > client RAM limit: 14361.16MB > 14353.11MB


I had one too. What kind of app needs 40GB of memory?
working set size > client RAM limit: 38573.95MB > 38554.31MB

On the next user it was aborted by the project. I've had 3 canceled by server as well, 2 while running.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58445 - Posted: 5 Mar 2022 | 19:22:28 UTC - in response to Message 58444.

And I had P0_NNPMM_1hpo_19-RAIMIS_NNPMM-1-20-RND4653_0 cancelled as well, same machine. Same sequence, also ran ~50 minutes. Maybe somebody pulled the batch?

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58452 - Posted: 7 Mar 2022 | 13:57:10 UTC

Hello everybody,

Thank you for your feedback on the ACEMD 4 app.

Response to the reported issues:

    Long download time of the software package.
    The GPUGRID server is connected to a university network with substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and reused.


    Long download time of the input files.
    The WUs have to download ~500 MB of input files. At the moment, I cannot do much about this, but this will be reduced eventually.


    Long decompression time.
    I have changed to a different compression format (gzip), so now it takes 1-2 min to decompress. As a side note, the ACEMD 3 app does the same, but it uses a built-in ZIP decompressor, which doesn't report that in the log. In the case of the ACEMD 4 app, there is an issue: the built-in decompressor doesn't support files >2 GB, so I had to add the decompression as a separate task.


    Excessive memory usage.
    I have fixed a memory leak. Now it should consume a reasonable amount of memory (2-4 GB).


If I missed something important, feel free to remind me.

Happy computing,

Raimondas

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58453 - Posted: 7 Mar 2022 | 15:20:23 UTC - in response to Message 58452.
Last modified: 7 Mar 2022 | 15:25:58 UTC

thanks for giving some attention to the package decompression :). much faster now.

another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500cred. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58454 - Posted: 7 Mar 2022 | 16:59:33 UTC - in response to Message 58452.

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.

On another tack - and this applies to all the GPUGrid researchers as a group - it would help the work proceed more smoothly if you could find a way of paying more careful attention to the meta-data which BOINC passes downstream to our computers with each task.

The key value is the estimated size, the <rsc_fpops_est>, of each task. At the moment, I have various machines working on:

AbNTest_micro tasks for ADRIA, which run for over a day
AbNTest_counts tasks for ADRIA, which run for about an hour
Today's NNPMM task from your good self, which looks set to run for about 8 hours.

Earlier test runs only lasted a few minutes, but all seem to be given the same project standard <rsc_fpops_est> setting of 5,000,000,000,000,000.

The BOINC client uses the fpops estimate, plus its own measurement of elapsed time, to keep track of the effective speed of our machines, and thus the anticipated runtime of newly downloaded tasks.

It's tedious, I know, but if the task size estimate isn't routinely adjusted to take account of the testing being undertaken, anticipated runtimes can get seriously distorted. In the worst case, a succession of short tests (if not described accurately) can make our BOINC clients think that our machines have become suddenly many times faster, and can even cause 'normal' tasks to be aborted for taking too long. Experienced volunteers can anticipate and work through these problems, but the project's main work will proceed more smoothly if they don't arise in the first place.
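The feedback loop Richard describes can be sketched numerically. This is a simplified model of the client's bookkeeping, not actual BOINC source code:

```python
def apparent_speed_gflops(rsc_fpops_est: float, elapsed_s: float) -> float:
    """Speed the client infers from one finished task (simplified model)."""
    return rsc_fpops_est / elapsed_s / 1e9

# A short test WU carrying the project-standard estimate of 5e15 fpops
# but finishing in 5 minutes makes the host look absurdly fast:
fast = apparent_speed_gflops(5e15, 300)    # ~16,667 GFLOPS
# The same WU honestly sized at 5e13 fpops gives a plausible figure:
honest = apparent_speed_gflops(5e13, 300)  # ~167 GFLOPS
```

A run of such mislabelled short tests drags the host's speed estimate upward, which is what later makes honestly sized tasks look like they are overrunning.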

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58455 - Posted: 7 Mar 2022 | 17:35:58 UTC - in response to Message 58454.

yeah I agree about the est flops size. it throws things all out of wack. my task now which will run for about 3hrs, started with an estimated runtime of 10 days lol.
____________

Sven
Send message
Joined: 26 Nov 20
Posts: 1
Credit: 25,611,398
RAC: 6
Level
Val
Scientific publications
wat
Message 58457 - Posted: 7 Mar 2022 | 21:12:06 UTC

Hello everyone,

after restarting boinc-client.service, the progress indicator reset from about 35% to just 1% and is working up from there (ACEMD 4 simulations for GPUs 1.03). Rather surprising that the task did not manage to save its progress and resume from there, as other applications do.

Any explanation?

Best
Sven

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Level
Met
Scientific publications
watwatwatwatwat
Message 58459 - Posted: 7 Mar 2022 | 22:42:11 UTC - in response to Message 58454.
Last modified: 7 Mar 2022 | 22:44:01 UTC

Thank you for your responses and explanations. Two further points - one an amplification, and the other something different.

    Long download time of the software package.
    The GPUGRID server is connected to a university network with substantial bandwidth, so it is not a limiting factor. Most likely, your ISPs are throttling the download speed. As mentioned before, the software package has increased due to PyTorch, which by itself takes ~1 GB. I have removed some more unnecessary files, but it is still ~3 GB. The software package is cached by the BOINC client, so it is only downloaded once for each app version and reused.


I'm on the end of a pretty stable 70 megabit ISP download connection - I don't think they are throttling it, either. I think the answer probably lies in the intermediate routing - the switching centres and interconnectors that the signals pass through between the University and our ISPs. Sometimes they work well, sometimes they drop packets, and sometimes they slow to a crawl. I don't think that's something that can be fixed at the research lab level, but it might be worth mentioning it upstream to the University's technical support team - they might be able to monitor it.


I agree, it's not on our end either. As mentioned somewhere, a pause and resume of networking in the client can speed up the download. I did this on the 3GB download.

The site is often slow and will time out. Once it times out, it will reload on a refresh. I'd put more weight on a DDoS-type restriction somewhere: too many requests and the speed drops or is cut off.

The app rename with v3 vs v4 in the name makes things easier.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58462 - Posted: 8 Mar 2022 | 10:44:15 UTC

Another 3+ GB download, for ACEMD4 v1.03

This one is coming down at 6MB/sec (so far - the average speed figure is still stabilising while the data download shares the connection), but it looks on target to finish within 5 minutes. Not a problem by itself, but users with slow connections might choose to delay requesting new work until the surge is over.

08/03/2022 10:28:57 | GPUGRID | Started download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9
08/03/2022 10:36:55 | GPUGRID | Finished download of x86_64-pc-linux-gnu__cuda1121.tar.gz.0f9ebf1ac84d8d1f5ae2c260dc903be9

OK, 7 minutes 58 seconds. And with a six day estimate and a one day deadline, the previous task was booted aside and the new task started immediately. That's called EDF (Earliest Deadline First), or Panic Mode On!

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 186
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58482 - Posted: 10 Mar 2022 | 11:33:48 UTC - in response to Message 58453.

another comment I have is regarding the credit reward for the ACEMD4 tasks. they seem to be set to a static 1500cred. it might make sense to implement the same credit reward model from the ACEMD3 tasks (higher credit and actually scaled to difficulty/runtime).

I do agree that credits granted for ACEMD4 tasks are very undervalued, comparing to same processing times for tasks from other projects, or even ACEMD3 tasks from this same project.
For example, Host #186626
PrimeGrid Genefer 18 tasks: ~1250 seconds processing time --> 1750 cedits
Gpugrid ACEMD3 ADRIA KIXCMYB tasks: ~82000 seconds processing time --> 540000 credits (+50% bonus, base credits: 360000)
Gpugrid ACEMD3 ADRIA e100 tasks: ~3300 seconds processing time -->27000 credits
Gpugrid ACEMD4 RAIMIS tasks: ~25600 seconds processing time --> 1500 credits (?)
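Normalising those four data points to credits per hour makes the imbalance explicit (plain arithmetic on the figures above; the KIXCMYB figure includes the +50% bonus):

```python
# Seconds of processing time and credits granted, as listed above.
tasks = {
    "PrimeGrid Genefer 18":  (1250, 1750),
    "ACEMD3 ADRIA KIXCMYB":  (82000, 540000),  # includes +50% bonus
    "ACEMD3 ADRIA e100":     (3300, 27000),
    "ACEMD4 RAIMIS":         (25600, 1500),
}
rate = {name: cred / secs * 3600 for name, (secs, cred) in tasks.items()}
# ACEMD4 works out to roughly 211 credits/hour, versus ~29,000/hour
# for the ACEMD3 e100 tasks on the same host.
```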

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58623 - Posted: 12 Apr 2022 | 12:17:05 UTC

Hello everybody,

A quick update on the ACEMD 4 app:

    Credits
    I have re-calibrated the estimation of credits. Now the granted credits will be more in line with ACEMD 3, i.e. 1 hour of NVIDIA RTX 2080 Ti calculation is valued at 60 000 credits, not including the additional bonuses.


    Estimated flops
    Currently, there is no automated mechanism to set the flops; the app maintainers just set some arbitrary numbers. I have decreased the flops estimate by two orders of magnitude. Hopefully, this is more in line with the actual work.



What happens next? Today I have sent several test WUs. If no issues are discovered, I will send ~1300 production WUs.

Happy computing,

Raimondas

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58626 - Posted: 12 Apr 2022 | 17:05:11 UTC
Last modified: 12 Apr 2022 | 17:07:00 UTC

Still having issues with the acemd4 application.

Time exceeded 1800 seconds errors still.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58627 - Posted: 12 Apr 2022 | 17:17:35 UTC - in response to Message 58626.

Still having issues with the acemd4 application.

Time exceeded 1800 seconds errors still.

Can't download the python/anaconda environment in time.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>
<stderr_txt>
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper (7.7.26016): starting
08:51:16 (3808596): wrapper: running /bin/tar (xf x86_64-pc-linux-gnu__cuda1121.tar.gz)
08:52:11 (3808596): /bin/tar exited; CPU time 52.866473
08:52:11 (3808596): wrapper: running bin/acemd (--boinc --device 0)

</stderr_txt>
]]>

Keith, I saw this too in your tasks list. I'm betting that the reduction in estimated flops caused this. reducing the flops estimate makes BOINC think a task will take less time to complete, and it sets the timeout limit lower accordingly.

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58628 - Posted: 12 Apr 2022 | 18:13:55 UTC

You are probably correct, Ian. I fixated on the 1800 seconds error since that was the error I saw with the python tasks.

But since this is acemd4, no python involved here I believe.

The reduction in estimated gflops was likely the culprit.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58629 - Posted: 12 Apr 2022 | 19:18:35 UTC
Last modified: 12 Apr 2022 | 19:29:51 UTC

Got an ACEMD4 task on Linux: T0_NNPMM_frag_00-RAIMIS_NNPMM-1-3-RND2497_5. The five previous attempts have all timed out, in between 2,400 seconds and 5,000 seconds.

My metrics are
<flops>181962433195.469788</flops>
<rsc_fpops_est>1000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
<duration_correction_factor>12.396821</duration_correction_factor>

size / speed gives 5.5 seconds uncorrected estimate. With DCF, that becomes 68 seconds, and that's what's displayed in BOINC Manager.

'bound' is 1000 x 'est', so it will time out in 5,500 seconds (if DCF is ignored, as I suspect it is). I'll bump them both by 1000 x, and see how it fares while I'm out.

Edit - with a new estimate of 18.5 hours, and a 24 hour deadline, it's gone straight into panic mode. Should get an idea how its doing before I go to bed.
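For anyone wanting to check the arithmetic, plugging those four metrics into the (simplified) client formulas reproduces the figures above:

```python
flops = 181_962_433_195.47  # host speed estimate, flops/s (<flops>)
fpops_est = 1e12            # <rsc_fpops_est>
fpops_bound = 1e15          # <rsc_fpops_bound>
dcf = 12.396821             # <duration_correction_factor>

raw_estimate = fpops_est / flops     # ~5.5 s, uncorrected
shown_estimate = raw_estimate * dcf  # ~68 s, as displayed in the Manager
abort_limit = fpops_bound / flops    # ~5,500 s timeout, DCF ignored
```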

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58632 - Posted: 12 Apr 2022 | 21:33:19 UTC

It's reached 50% in 2 hours 7 minutes, so this task is heading for about four and a quarter hours on my GTX 1660 Super. That would have failed without manual intervention.

@ Raimondas - you need to consider both speed and size when making adjustments.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Level
Met
Scientific publications
watwatwatwatwat
Message 58633 - Posted: 12 Apr 2022 | 23:38:01 UTC

File size too big by both users on upload.
https://www.gpugrid.net/result.php?resultid=32882663

Just take all the limits and x100000000.

Ok, not that much, but it's sad that tasks error out on artificial limits, especially on upload.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,290,493,256
RAC: 130,788
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58645 - Posted: 14 Apr 2022 | 14:35:08 UTC

Thu 14 Apr 2022 09:27:26 AM CDT | GPUGRID | Aborting task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1: exceeded elapsed time limit 7231.33 (1000000.00G/138.29G)
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | Computation for task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1 finished
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | Output file T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1_4 for task T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_1 exceeds size limit.
Thu 14 Apr 2022 09:27:28 AM CDT | GPUGRID | File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58646 - Posted: 14 Apr 2022 | 16:29:03 UTC
Last modified: 14 Apr 2022 | 16:33:01 UTC

Still getting elapsed time limit errors. Looks like the estimated GFLOPS was changed but still not enough.

exceeded elapsed time limit 2675.08 (1000000.00G/373.82G)</message>

exceeded elapsed time limit 1758.43 (1000000.00G/568.69G)</message>

exceeded elapsed time limit 1803.39 (1000000.00G/554.51G)</message>

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58648 - Posted: 14 Apr 2022 | 19:03:16 UTC

bombed out after 25mins and 20% completion on a 3080Ti

exceeded elapsed time limit 1538.81


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58649 - Posted: 14 Apr 2022 | 21:06:20 UTC

Just got T1_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND6653_5.

Mine don't (usually) start immediately, so I could get to it before it started. Added x1000 to the fpops measures, x100 to the _4 upload size (thanks captainjack). It's running now, should finish overnight.
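For volunteers attempting the same surgery, the idea looks roughly like this. The fragment below is illustrative, not a real client_state.xml entry, and the client must be stopped before editing the real file:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment of the kind of <file> entry being patched.
fragment = """<file>
    <name>EXAMPLE_TASK_0_4</name>
    <max_nbytes>10000000.000000</max_nbytes>
</file>"""

root = ET.fromstring(fragment)
node = root.find("max_nbytes")
# Multiply the upload size limit by 100 (10 MB -> 1 GB).
node.text = f"{float(node.text) * 100:.6f}"
patched = ET.tostring(root, encoding="unicode")
```

The fpops bump works the same way on the <rsc_fpops_est> and <rsc_fpops_bound> values in the matching <workunit> entry.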

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58679 - Posted: 19 Apr 2022 | 16:36:25 UTC
Last modified: 19 Apr 2022 | 17:11:24 UTC

got another new task.

flops bound had been increased by 10x from previous values (based on previous comments of what the value used to be).

however, the max_nbytes of the _4 output file has not been increased at all, so I expect another computation error if the file size ends up too big.

computation has already begun, and it's in a system with mixed GPUs, so stopping BOINC to edit the size limit and restarting is not a great option; it risks the task restarting on another GPU and erroring instantly.

as far as run behavior:
on an RTX 3080Ti
~80% GPU core use
~50% GPU memory bus use
~1% PCIe bus use
~2300MB VRAM used
~265W (with a 300W limit set)

not really taking full advantage of the GPU resources.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58680 - Posted: 19 Apr 2022 | 18:20:17 UTC - in response to Message 58679.
Last modified: 19 Apr 2022 | 18:35:14 UTC


however, the max_nbytes of the _4 output file has not been increased at all, so I expect another computation error if the file size ends up too big.


called it. ran for 2hrs18mins and errored right after completion

T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0

upload failure: <file_xfer_error>
<file_name>T2_NNPMM_frag_01-RAIMIS_NNPMM-0-2-RND4664_0_4</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

from the event log:
Tue 19 Apr 2022 02:13:28 PM EDT | GPUGRID | File size: 20443520.000000 bytes. Limit: 10000000.000000 bytes


it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58681 - Posted: 19 Apr 2022 | 18:33:37 UTC - in response to Message 58680.

it's important for the devs to see that these error out rather than fiddling with things on my end to ensure I get credit. otherwise they may be under the impression that it's not a problem.

Agreed. But in that case, it's also helpful to post the local information from the event log that the devs can't easily see - like captainjack's note

File size: 137187308.000000 bytes. Limit: 10000000.000000 bytes

That gives them the magnitude of the required correction, as well as its location.

My (single) patched run did indeed complete successfully after surgery, so the file size should be the last correction needed.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58682 - Posted: 19 Apr 2022 | 18:37:09 UTC - in response to Message 58681.

i just edited with that info from the log.

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58683 - Posted: 20 Apr 2022 | 0:01:42 UTC
Last modified: 20 Apr 2022 | 0:03:08 UTC

Two more, half the run time, and 10x the file size for _4 output file. both run on 3080Tis again.

odd that these ones showed different run behavior. more similar to how the ACEMD3 app works.

~96% GPU core use
~1-2% GPU memory bus use
~12% PCIe bus use

T2_GAFF2_frag_02-RAIMIS_NNPMM-0-1-RND3120_1

Tue 19 Apr 2022 07:54:14 PM EDT | GPUGRID | File size: 213191804.000000 bytes. Limit: 10000000.000000 bytes


T2_GAFF2_frag_00-RAIMIS_NNPMM-0-1-RND5192_2
Tue 19 Apr 2022 07:53:26 PM EDT | GPUGRID | File size: 213539276.000000 bytes. Limit: 10000000.000000 bytes

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58684 - Posted: 20 Apr 2022 | 4:41:55 UTC
Last modified: 20 Apr 2022 | 4:43:16 UTC

Same thing here. Ran 2 1/2 hours to completion and then failed on too large an upload file.

upload failure: <file_xfer_error>
<file_name>T2_GAFF2_frag_02-RAIMIS_NNPMM-0-1-RND3120_2_4</file_name>
<error_code>-131 (file size too big)</error_code>

Waste of resources.

Wish the app dev would fix this issue. Like right now!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58685 - Posted: 20 Apr 2022 | 8:39:01 UTC

Woke up this morning to find two unstarted ACEMD4 tasks awaiting my attention (and downloaded a third since).

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

The initial estimates (10 - 12 minutes, with DCF) look tight, but I've left them alone to check if the devs' adjustments are adequate.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58686 - Posted: 20 Apr 2022 | 11:56:43 UTC - in response to Message 58685.

the estimated runtimes of my tasks were very close. at inception they showed right around 2hrs, and that's how long they took. that was with no adjustments, so the estimated flops seem correct.

they just need to bump the file size limit by at least 25x. maybe 100x to be safe. it really is a waste to trash good work on something like an arbitrary and artificial file size limit.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58687 - Posted: 20 Apr 2022 | 14:00:36 UTC

The other thing they still have to sort out is checkpointing. I've just come home to find that BOINC had downloaded and started a new ACEMD4 task - for some reason, it pre-empted the two Einstein tasks running on the GPU I dedicate to GPUGrid. That must have been EDF kicking in, but with a six hour cache and a 24 hour deadline, it shouldn't have been needed.

Anyway, I applied the upload correction, and the task restarted from 1% - I had stopped it at something like 16% in 44 minutes. That implies a longer runtime than this morning's group, so the output size may be larger as well. 25x may not be enough, if not all tasks are created equal. I'll check it when it finishes.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58688 - Posted: 20 Apr 2022 | 15:53:48 UTC

Another 3 hours wasted because of too large an upload file.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 186
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58689 - Posted: 20 Apr 2022 | 16:05:21 UTC - in response to Message 58685.

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

It is highly likely that the one (and only) user shown on the current Server Status page under "successful users in last 24h" for ACEMD 4 tasks is you ;-)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58690 - Posted: 20 Apr 2022 | 16:32:58 UTC - in response to Message 58689.

I've fixed the file size problem, partly to stop them recirculating to other users, but also to get some real science done, if possible.

It is highly likely that the one (and only) user shown on the current Server Status page under "successful users in last 24h" for ACEMD 4 tasks is you ;-)

It might be one user, but it's four tasks and counting so far:

Host 132158
Host 508381

I'll try and see off this run of timewasters, even if I have to do it all myself!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58691 - Posted: 20 Apr 2022 | 18:13:31 UTC - in response to Message 58687.

That implies a longer runtime than this morning's group, so the output size may be larger as well.

Turned out not to be a problem - the file size was 20.4 MB, despite running nearly twice as long. I can't see anything about the filename which would reliably distinguish between "quick run, large file" and "slow run, small file" tasks.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58692 - Posted: 20 Apr 2022 | 18:48:51 UTC - in response to Message 58691.

That implies a longer runtime than this morning's group, so the output size may be larger as well.

Turned out not to be a problem - the file size was 20.4 MB, despite running nearly twice as long. I can't see anything about the filename which would reliably distinguish between "quick run, large file" and "slow run, small file" tasks.


T2_GAFF2_frag_00-RAIMIS_NNPMM = short run, large file size

T2_NNPMM_frag_01-RAIMIS_NNPMM = longer run, smaller (but still too big) file size

i processed several of both types.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58693 - Posted: 20 Apr 2022 | 18:56:37 UTC - in response to Message 58692.

Yup, that works.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 82,034,439
RAC: 1,231
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58694 - Posted: 20 Apr 2022 | 19:35:58 UTC

Got another Python task. I will let it run its course this time.
BOINC estimates 167 days and 19 hours remaining after just 3 hours of run time.
CPU usage is 104%.

Estimated app speed 3,176.92 GFLOPs/sec
Virtual memory size 5,589.28 MB
Working set size 2,517.45 MB

Running on a GTX 1080.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58700 - Posted: 22 Apr 2022 | 16:52:31 UTC

a handful of new ACEMD4 tasks went out today. I got one.

worked fine this time. great job :)

T3_NNPMM_frag_01-RAIMIS_NNPMM-1-2-RND2943_0

looks like they finally ironed out the config issues. open the floodgates :)
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58704 - Posted: 22 Apr 2022 | 21:20:45 UTC

Same here. Got two tasks today that completed and validated successfully.

Agree . . . . open the floodgates for more.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 186
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58705 - Posted: 23 Apr 2022 | 17:47:29 UTC

I also got an ACEMD 4 task yesterday:
T3_NNPMM_frag_02-RAIMIS_NNPMM-1-2-RND2618_1
It was processed to the end and validated successfully.

However, I noticed that it belonged to work unit #27219575, and it had previously failed on another system. The reported message:

exceeded elapsed time limit 16199.05 (10000000.00G/617.32G)

Maybe some fine-tuning of WU configuration parameters is still required on the project side.
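For reference, the number in that message is simply the task's rsc_fpops_bound divided by the host's estimated speed; here is a quick check with the values from the message (the formula is my understanding of the BOINC client's limit, not anything from the project's code):

```python
# Reproducing the elapsed-time limit from the error message above.
# My understanding: the BOINC client aborts a task once
#   elapsed > rsc_fpops_bound / estimated_device_speed.
rsc_fpops_bound = 10_000_000.00e9  # "10000000.00G" operations
device_flops = 617.32e9            # "617.32G" ops/s for that host

limit_seconds = rsc_fpops_bound / device_flops
print(f"{limit_seconds:.2f} s")    # 16199.05 s, matching the message
```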

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58706 - Posted: 23 Apr 2022 | 19:20:11 UTC

Yes, bummer for missing the expected compute time by less than 90 seconds.

Looks like the fpops estimate needs to be increased by as little as 500 to get the fast cards like his 3080 Ti to meet the estimated crunching time.

Maybe pad it out by another 10K or 100K to be safe.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58707 - Posted: 24 Apr 2022 | 0:20:28 UTC - in response to Message 58706.
Last modified: 24 Apr 2022 | 0:23:06 UTC

my 3080Ti completed one no problem. I think something was wrong with that person's machine and it was hung up, or there was something slowing down the computation. My 3080Ti completed it in half the time.

many people are blissfully unaware that you need to leave some breathing room for the CPU on GPU tasks. they just set CPU use to 100% and walk away. especially when the project has set <1.0 CPU use per GPU task (which BOINC basically sees as 0). CPU overcommit is common.
____________

Raimondas
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 26 Mar 18
Posts: 7
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58713 - Posted: 25 Apr 2022 | 16:54:08 UTC

Hello everybody,

A quick update on the ACEMD 4 app:

    Estimated flops
    I have tried to tune the flops, but in the end the final values are very similar to the original ones. It seems many volunteers are adjusting the factors themselves, so fixing them for some ruins them for others.


    File sizes
    I have updated the limits of the file sizes to match the scale of the new WUs.


What happens next? I have started sending 1000+ WUs, so let's do some science...

Happy computing,

Raimondas

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58714 - Posted: 25 Apr 2022 | 17:30:49 UTC - in response to Message 58713.

I doubt that there's much large-scale tuning of flops (the speed measure) by general users out in the community. The few who post here regularly might well have tweaked it, of course.

Instead, it's more likely that the BOINC server software is still tuning it. In general,

1) The initial value for a new application is initialised to some very low value. Low speed ==> very long estimated run times.

2) After the first 100 "completed" (success, valid) tasks have been returned - by the fastest hosts in the population of users, naturally - the running average of the most recent 100 completed tasks is used to replace the initial values.

3) Once any single host has reached 11 "completed" tasks, its own individual average speed is used as the basis for future tasks.

That's my best attempt at understanding the combined effect of the "Client scheduling changes" and "Job runtime estimation" design documents, but only David Anderson would have an authoritative overview of the current server code.

fpops (the size estimate) is much simpler: you have complete control of the value declared by the server, through your workunit generator.
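As a sanity check on my understanding, the three stages above can be sketched roughly like this (this is a paraphrase of how I read the server's behaviour, not actual BOINC server code, and all names are invented for illustration):

```python
# Rough sketch of the three speed-estimate stages described above.
# Not actual server code; the thresholds (11 and 100) are the ones
# quoted in the post.
def estimated_speed(host_completed, host_avg_flops,
                    app_completed, app_avg_flops, initial_flops):
    if host_completed >= 11:   # stage 3: host's own average takes over
        return host_avg_flops
    if app_completed >= 100:   # stage 2: running average of recent tasks
        return app_avg_flops
    return initial_flops       # stage 1: deliberately low initial value

# A new host on a mature app gets the app-wide average:
print(estimated_speed(0, 0.0, 500, 150e9, 1e9))   # 150000000000.0
```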

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58715 - Posted: 25 Apr 2022 | 18:04:57 UTC

I'm loaded up as much as possible for now. one on each GPU. short 1-day deadlines (can't remember if they were always that short)

but i'm getting interesting error messages when asking for more work.

two systems report that it won't send more work because they "won't complete in time", but that seems at odds with the fact that I have a 3-day cache set and BOINC's estimates are that it will take only ~2 hours to complete. so why does it think it won't complete in time?

another system (7-GPU) says I do not have enough disk space, claiming I need an additional ~6GB, saying I have 12GB free and need 18GB for the task. but again this is at odds with BOINC preferences set to allow ~95% disk use, and the fact that I have 100+GB free (BOINC reports this as "free, available to BOINC"). although this system does have ~86GB being used for GPUGRID alone. is there a 100GB per-project limit somewhere? otherwise it seems nonsensical and my settings should be allowing plenty of space.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58716 - Posted: 25 Apr 2022 | 18:16:37 UTC

While I was typing the previous post, the server sent my host 132158 task 32888461.

The server has sent me

<flops>181882964117.026398</flops>

(181 Gflops), which is actually quite close to the running APR of 198 Gflops for the nine tasks it's completed so far - not enough to be considered definitive yet.

The task size is

<rsc_fpops_est>10000000000000.000000</rsc_fpops_est>

so from size / speed, the raw runtime estimate is 55 seconds (limit 15.25 hours). That should be good enough for now. Card is a GTX 1660 Super.
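The division is trivial, but for anyone who wants to reproduce it, here it is with the values copied from the scheduler reply above:

```python
# Raw runtime estimate = size / speed, with the server's values above.
flops = 181_882_964_117.026398        # <flops>, ops/s
rsc_fpops_est = 10_000_000_000_000.0  # <rsc_fpops_est>, operations

raw_estimate = rsc_fpops_est / flops
print(f"{raw_estimate:.0f} s")        # 55 s
```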

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,290,493,256
RAC: 130,788
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58717 - Posted: 25 Apr 2022 | 20:28:36 UTC

Received 25 tasks on Linux. They ran about 45-60 seconds and all errored. Error message:

process exited with code 195 (0xc3, -61)


Received 0 tasks on Windows.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58718 - Posted: 25 Apr 2022 | 20:59:11 UTC - in response to Message 58717.

Received 25 tasks on Linux. Ran about 45 - 60 seconds. All error. Error message:
process exited with code 195 (0xc3, -61)


I received several of your resends, they are processing fine on my system.

but that brings up a talking point. I see many older GPUs having errors on these. I have 2080Tis, 3070Tis, and 3080Tis and they have all processed successfully.

I see one user with a GTX 650, and he errors with an architecture error, so the app was obviously built without Kepler support.

I see several other users with <2GB VRAM that errored out; I also assume these cases might be due to too little memory.

then cases like yours where it should be supported and with enough memory, but for some reason causes errors.

has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?

Received 0 tasks on Windows.


that's because there is no Windows app. these ACEMD4 tasks only have a Linux application for now.

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58719 - Posted: 25 Apr 2022 | 21:20:12 UTC

captainjack is running a GTX 970 under Linux, and the acemd child is failing with

app exit status: 0x87

I don't recognise that one, but it should be in somebody's documentation.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58720 - Posted: 25 Apr 2022 | 21:30:16 UTC - in response to Message 58718.



has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?


I just allowed tasks on my 1080Ti to see if it will run on Pascal.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58721 - Posted: 25 Apr 2022 | 21:34:51 UTC - in response to Message 58719.
Last modified: 25 Apr 2022 | 21:35:28 UTC

captainjack is running a GTX 970 under Linux, and the acemd child is failing with

app exit status: 0x87

I don't recognise that one, but it should be in somebody's documentation.


yes I know. but he also has a Windows system. I was letting him know the reason his Windows system didn't get any: an app for Windows does not exist at this time.

check this WU. it's one that he (and several other Maxwell card systems) failed with the same 0x87 code. another Maxwell Quadro M2000 failed as well with 0x7.

https://gpugrid.net/workunit.php?wuid=27220282

my system finally completed it without issue. makes me wonder if the app works on Maxwell or Pascal cards at all. if it turns out that these tasks don't run on Maxwell/Pascal, the project might need to do additional filtering in their scheduler by compute capability to exclude incompatible systems.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58722 - Posted: 25 Apr 2022 | 21:39:32 UTC - in response to Message 58720.



has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?


I just allowed tasks on my 1080Ti to see if it will run on Pascal.


that'll be a good test. thanks.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58723 - Posted: 25 Apr 2022 | 22:00:22 UTC - in response to Message 58722.
Last modified: 25 Apr 2022 | 22:01:35 UTC



has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?


I just allowed tasks on my 1080Ti to see if it will run on Pascal.


that'll be a good test. thanks.

It will be a while before I can report success or failure.
Have to download the 3.5GB application file still before starting the task.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58724 - Posted: 25 Apr 2022 | 22:05:54 UTC - in response to Message 58723.



has anyone had successful ACEMD4 runs on Maxwell or Pascal cards?


I just allowed tasks on my 1080Ti to see if it will run on Pascal.


that'll be a good test. thanks.

It will be a while before I can report success or failure.
Have to download the 3.5GB application file still before starting the task.


haha, a few of my systems did the same earlier.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58725 - Posted: 25 Apr 2022 | 22:28:18 UTC

Pascal works: https://gpugrid.net/result.php?resultid=32888358
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58726 - Posted: 26 Apr 2022 | 0:18:33 UTC - in response to Message 58725.

Mine just started. So far, so good.

I had a lot of _0..._5 failures before getting to one of mine and they were all on low RAM Maxwell or low RAM Pascal cards like a 1050 or 950.

So maybe the low RAM count cards are the suspect ones.

Looks like even an 8GB 1070 works. My 11GB 1080 Ti should have no issues.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58727 - Posted: 26 Apr 2022 | 3:21:27 UTC

The 1080 Ti can crunch the acemd4 tasks with no issues.
1000 seconds faster than my 2070 Supers or 2080s.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58728 - Posted: 26 Apr 2022 | 3:34:04 UTC - in response to Message 58727.

i think the CPU speed plays a pretty significant role in GPU speed on these tasks. so your fast 5950X is helping out a good bit.

my 300W 3080Ti under the 7443P runs roughly 700s faster than an equivalent 300W 3080Ti under a 7402P.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58729 - Posted: 26 Apr 2022 | 4:15:17 UTC

Same speed across all my 5950X hosts, of which the 1080 Ti host is the newest. So the CPU speed is not the reason for the speedup.

Must be the 11GB VRAM versus 8GB VRAM on the 2080s and 2070 Supers.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58730 - Posted: 26 Apr 2022 | 7:08:24 UTC
Last modified: 26 Apr 2022 | 7:36:27 UTC

Checkpointing is still not active.

Both my Linux machines crashed hard overnight - I've yet to work out why. But one machine I've restarted had an ACEMD 4 task at about the midway point: it's started again from 1%.

Edit - very odd. Second machine looks completely inert - no POST, no beep, no video output. But BOINC remote monitoring shows all tasks are running, including both GPUs (I got a clue from the SSD activity LED). I'm draining the cache - may call out for some help later.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,479,966,482
RAC: 361,077
Level
Met
Scientific publications
watwatwatwatwat
Message 58733 - Posted: 26 Apr 2022 | 10:47:28 UTC

exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>

Why are there time limits? Up them!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58734 - Posted: 26 Apr 2022 | 13:13:48 UTC - in response to Message 58729.

Same speed for all my 5950X hosts which the 1080 Ti is the newest. So the cpu speed is not the reason for the speedup.

Must be 11GB VRAM versus 8GB VRAM on the 2080's, 2070 Supers


i wasn't saying there was any "speedup". just that your fast CPU is helping, compared with how a 1080Ti might perform with a lower-end CPU. put the same GPU on a slower CPU and it will slow down to some extent.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58735 - Posted: 26 Apr 2022 | 13:16:36 UTC - in response to Message 58733.

exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>

Why are there time limits? Up them!


your hosts are hidden. what are the system specifics? what CPU/RAM specs, etc.

the time limit is set by BOINC, not the project directly. BOINC bases the limits on the speed of the device and the estimated flops set by the project.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58736 - Posted: 26 Apr 2022 | 13:38:45 UTC - in response to Message 58730.

Checkpointing is still not active.
I confirm that.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58737 - Posted: 26 Apr 2022 | 13:59:11 UTC - in response to Message 58735.

exceeded elapsed time limit 12932.15 (10000000.00G/773.27G)</message>
exceeded elapsed time limit 7695.31 (10000000.00G/1299.49G)</message>

Why are there time limits? Up them!


your hosts are hidden. what are the system specifics? what CPU/RAM specs, etc.

the time limit is set by BOINC, not the project directly. BOINC bases the limits on the speed of the device and the estimated flops set by the project.

It's possible that the lack of checkpointing contributed to this problem. ACEMD 4 tells BOINC that it has checkpointed (and I think that's correct - the checkpoint files have been written). So BOINC thinks it's OK to task-switch to another project.

But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result file so far, and starts from the beginning. But it retains the memory of elapsed time so far, bringing it much closer to the abort point.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58738 - Posted: 26 Apr 2022 | 14:50:19 UTC - in response to Message 58737.

if it's retaining the timer when restarting the task from 0 then yes I agree the checkpointing could be the root cause. if it's not checkpointing, the timer should reset to 0.

I still think it's a bit of a user config error to allow any task switching for GPU projects. set the task switch interval longer than the estimated run time and the task will run unimpeded, barring any other high-priority work (though with the 24hr deadlines these are going right into panic mode and jumping to the front of the line anyway lol).

time and time again, GPUGRID shows that the tasks don't like being interrupted. you have the ACEMD3 tasks that can't be restarted on a different device, and sometimes restarting on the same device gets detected as a "different" device. sometimes you have to work around the project rather than making the project work around you. these tasks are ~1/5th the size (runtime) of the current ACEMD3 batch, I don't think it's too burdensome to just let it run to completion without interruption.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58739 - Posted: 26 Apr 2022 | 14:56:39 UTC

I still don't understand the messages that the tasks won't complete in time lol. it will only download 1 per GPU, then no more.

at inception of the run, BOINC gives an estimated runtime of about 25 mins. this is of course too low and they end up running ~2-2.5 hrs.

so if BOINC thinks the tasks are only 25 minutes long, and there's a 24hr deadline, what's the logic in saying it can't have another task due to not enough time for completion? even following BOINC's own logic at that point, it should realize that it has 23:30 to complete another 0:25 task, no?

can anyone explain what BOINC is doing here?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58740 - Posted: 26 Apr 2022 | 15:53:17 UTC

We mostly concentrate on the estimates calculated by the client, from size, speed, and (at this project) DCF.

But before a task is issued, a similar calculation is made by the server. In 'the good old days' (pre-2010), these were pretty much in lock-step: you can see the working out in Einstein's public server logs. In particular, both client and server included DCF in the estimate.

Einstein's server is pretty old-school: the server here is in a curious time-warp state, somewhere between Einstein and full CreditNew. Without access to the server logs, we can't tell exactly what old features have been stripped out, and what new features have been added in. That makes it very hard to answer your question.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 186
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58741 - Posted: 26 Apr 2022 | 16:56:07 UTC - in response to Message 58739.

I still don't understand the messages that the tasks wont complete in time lol. it will only download 1 per GPU, then no more.

I find that this warning appears for ACEMD 4 tasks when you try to set a work buffer greater than 1 day.
BOINC probably "thinks": why should I store more than one day of tasks, for tasks with a one-day deadline?
The same happens with ACEMD 3 tasks when setting a work buffer greater than 5 days.

It's in some way tricky...

That is, for current ACEMD 4 tasks (24-hour deadline): if you want to get more than one task per GPU, set your work buffer provisionally to a value lower than 1 day, and revert to your custom value once a further task is downloaded.
(Or leave your buffer permanently set to 0.95 ;-)
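If it helps, here is one possible (unconfirmed) model of the server's check that is consistent with the behaviour above; the queueing-delay assumption is mine, not taken from server code:

```python
# One possible (unconfirmed) model of the "won't complete in time" check.
# Assumption, not from server code: the scheduler treats the work buffer
# as queueing delay before a new task would start, and refuses the task
# if delay + estimated runtime exceeds the deadline.
def would_complete_in_time(buffer_days, est_runtime_hours, deadline_days):
    queue_delay_hours = buffer_days * 24
    return queue_delay_hours + est_runtime_hours <= deadline_days * 24

# With BOINC's own ~25-minute runtime estimate and a 1-day deadline:
print(would_complete_in_time(3.0, 0.42, 1.0))   # False: 3-day buffer blocks it
print(would_complete_in_time(0.95, 0.42, 1.0))  # True: the workaround above
```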

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58742 - Posted: 26 Apr 2022 | 17:11:27 UTC - in response to Message 58741.
Last modified: 26 Apr 2022 | 17:21:00 UTC

interesting observation. I've experienced similar things in the past with other projects with counterintuitive behavior/response to a cache level set "too high".

I'll try that out.

[edit]
Indeed that was it! thanks :)

the extracted package for these tasks is also huge. 7 tasks running (7 GPUs), and ~88GB disk space used by the GPUGRID project lol.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58743 - Posted: 26 Apr 2022 | 18:02:58 UTC
Last modified: 26 Apr 2022 | 18:03:49 UTC

The BOINC manager UI shows 42 minutes left, while the work fetch debug shows 2811 minutes:

[work_fetch] --- state for NVIDIA GPU --- [work_fetch] shortfall 0.00 nidle 0.00 saturated 167688.92 busy 167688.92
That's odd, because the fraction_done_exact isn't set in the app_config.xml

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58744 - Posted: 26 Apr 2022 | 19:31:11 UTC - in response to Message 58737.


But when the time comes to return to GPUGrid, ACEMD fails to read the checkpoint files, deletes the result file so far, and starts from the beginning. But it retains the memory of elapsed time so far, bringing it much closer to the abort point.


I had one of these ACEMD4 tasks paused with a short amount of computation completed, at 2% and about 6 minutes.

with the task paused, the task remained in its "extracted" state. upon resuming the task, it restarts from 1%, not 0%. I'm guessing 0-1% is for the file extraction. but indeed the timer stayed where it was at ~6 minutes and continued from there, and did not reset to the actual time elapsed for 1%.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58745 - Posted: 26 Apr 2022 | 19:47:36 UTC

Take a look at the tasks on my host.
It's very easy to spot the one which was restarted without the checkpoint.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58746 - Posted: 26 Apr 2022 | 23:09:57 UTC

The host from my previous post has received an ACEMD3 task. It had reached 16.3% when the host received an ACEMD4 task, which took over, as the latter has a much shorter deadline. The ACEMD3 task could restart from the checkpoint, so it will finish eventually. I wonder how many times the ACEMD3 task will be suspended, and how many days will pass until it's completed.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58747 - Posted: 26 Apr 2022 | 23:42:33 UTC
Last modified: 26 Apr 2022 | 23:45:24 UTC

Not much chance at all, I think. We've blown through all those 1000 tasks. Not much chance your acemd3 task will get pre-empted again.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58748 - Posted: 27 Apr 2022 | 0:00:46 UTC - in response to Message 58747.
Last modified: 27 Apr 2022 | 0:07:50 UTC

I don't think much chance at all. We've blown through all those 1000 tasks I think.
Not yet.
The task which preempted the ACEMD3 task is:
P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0
In that name, the second of the two middle numbers (10) is the total number of tasks in the given sequence.
The first of the pair (5) is the number of the task in the given sequence (starting from 0, so the last one will be 9-10).
The trailing number (the _0 suffix) is the number of resends.
So those 1000 tasks are actually 100 task sequences, each sequence broken into 10 pieces.
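A small parser for this naming scheme, assuming the field positions hold for the whole batch (the helper name is mine, invented for illustration):

```python
# Parser for this batch's task names; the field positions are inferred
# from the example above and may not hold for other batches.
def parse_wu_name(name):
    fragment, batch, index, total, tail = name.split("-")
    _rnd, resend = tail.rsplit("_", 1)
    return {"fragment": fragment, "batch": batch,
            "index": int(index), "total": int(total),
            "resend": int(resend)}

info = parse_wu_name("P0_NNPMM_frag_85-RAIMIS_NNPMM-5-10-RND6112_0")
print(info["index"], info["total"], info["resend"])  # 5 10 0
```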

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1072
Credit: 1,451,778,214
RAC: 400,326
Level
Met
Scientific publications
watwatwatwatwat
Message 58749 - Posted: 27 Apr 2022 | 1:00:16 UTC - in response to Message 58748.

Thanks for the task enumeration explanation.

I thought, since it had been ages since I got any steady work from GPUGrid, that the REC balance would take forever to balance out against my other projects.

So when I didn't ask for any replacement work and when I manually updated and got none to send, I thought we had blown through all the work already.

I see I will have to put my update script back into action to keep the hosts occupied.

Only halfway through the sequence, judging by your task's numbering, I see.

Thanks again for the knowledge.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58750 - Posted: 27 Apr 2022 | 1:09:46 UTC - in response to Message 58749.

it seems like they are only letting out about 100 into the wild at any given time rather than releasing them all at once.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58751 - Posted: 27 Apr 2022 | 6:18:52 UTC
Last modified: 27 Apr 2022 | 6:20:50 UTC

One of my machines currently has:

P0_NNPMM_frag_76-RAIMIS_NNPMM-3-10-RND5267_0 sent 27 Apr 2022 | 1:25:23 UTC
P0_NNPMM_frag_85-RAIMIS_NNPMM-7-10-RND6112_0 sent 27 Apr 2022 | 3:14:45 UTC

Not all models seem to be progressing at the same rate (note neither is a resend from another user - both are first run after creation).

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58756 - Posted: 27 Apr 2022 | 23:29:15 UTC - in response to Message 58746.
Last modified: 27 Apr 2022 | 23:31:04 UTC

The ACEMD3 could restart from the checkpoint, so it will finish eventually.
It has failed actually. :(
It was suspended a couple of times, so I set "no new task" to make it finish within 24 hours, but it didn't help.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 281,648
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58757 - Posted: 28 Apr 2022 | 11:54:58 UTC

Looks like we're coming to the end of this batch - time to take stock. For this post, I'm looking at host 132158, running on a GTX 1660 Super.

I have one remaining task, about to start, with an initial estimate of 01:31:38 (5,498 seconds) - but that's with DCF hard up against the limit of 100. The raw estimate will be 55 seconds, and the typical actual runtime has been just over 4 hours (14,660 seconds).

We need to get those estimates sorted out - but gently, gently. A sudden large change will make matters even worse.

My raw metrics are:

<flops>181882876215.769470</flops>
<rsc_fpops_est>10000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>10000000000000000.000000</rsc_fpops_bound>

- so speed 181,882,876,216 or 181.88 Gflops. This website has an APR of 198.76, but it stopped updating after 9 completed tasks with v1.03 (I've got 13 showing valid at the moment). Size/speed gives 54.98, confirming what BOINC is estimating.

I reckon the size estimate (fpops_est) should be increased by a factor of about 250, to get closer to the target of DCF=1.

BUT DON'T DO IT ALL IN ONE MASSIVE JUMP.

We could probably manage a batch with the estimate 5x the current value, to get DCF away from the upper limit (DCF corrects very, very, slowly when it gets this far out of equilibrium, and limits work fetch requests to a nominal 1 second). Then, a batch with a further increase of 10x, and a third batch with another 10x, should do it. But monitor the effects of the changes carefully.
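For the record, the factor can be re-derived from the numbers above:

```python
# Re-deriving the "factor of about 250" from the figures in this post.
flops = 181_882_876_215.769470  # <flops>, ops/s
rsc_fpops_est = 1e13            # current <rsc_fpops_est>, operations
actual_runtime = 14_660.0       # typical observed runtime, seconds

# For DCF ~ 1 we want rsc_fpops_est / flops == actual_runtime, so the
# size estimate should grow by roughly:
factor = actual_runtime * flops / rsc_fpops_est
print(f"{factor:.0f}x")         # ~267x, in line with "about 250"
```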

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58758 - Posted: 28 Apr 2022 | 14:20:08 UTC
Last modified: 28 Apr 2022 | 14:20:39 UTC

The ACEMD4 app puts less stress on the GPU, than the ACEMD3 app.
ACEMD3 on RTX 3080Ti: 1845MHz 342W
ACEMD4 on RTX 3080Ti: 1920MHz 306W
I made similar observations on RTX 2080Ti, though I didn't record the exact numbers yet.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58759 - Posted: 28 Apr 2022 | 15:38:46 UTC - in response to Message 58758.
Last modified: 28 Apr 2022 | 16:17:10 UTC

I also notice that these tasks don't fully utilize the GPU (mentioned in an earlier post with usage stats). But I think it's a CPU bottleneck with these tasks. A faster CPU allows the GPU to work harder.

My 3080Tis run at 2010MHz @ ~265W and 70-80% GPU utilization. That's with an AMD EPYC 7402P @ 3.35GHz.

On another system I have another 3080Ti, this one paired with a faster EPYC 7443P running at ~3.5-3.6GHz. Here the 3080Ti runs at the same 2010MHz, but has 80-85% GPU utilization and about 275-280W power draw.

If the i3-9300 on your 3080Ti system is running over 4.0 (maybe 4.2?) GHz then your higher power draw (than mine) makes sense and supports that it’s a CPU bottleneck.

The "GAFF2" version of these tasks (which were ACEMD4), sent out during the testing phase, ran with behavior much more similar to ACEMD3. See my post here: https://gpugrid.net/forum_thread.php?id=5305&nowrap=true#58683

So it seems to be how the WU is set up/configured rather than the application itself.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58760 - Posted: 28 Apr 2022 | 17:45:12 UTC - in response to Message 58759.

Running another GAFF2 task now. GPU utilization is higher (~96%) but power use is even lower, at about 230W for a 3080Ti (power limit at 300W).

I suspect these GAFF2 tasks have some code/function in them that's causing serialized computations. We saw this exact same behavior with the Einstein tasks (high reported utilization, low power draw) before one of our teammates was able to find and deploy a code fix for the app, after which it went to maxing out the GPU.

I've now got some "QM" tasks in the queue as well. I'll report how they behave, if different.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58761 - Posted: 28 Apr 2022 | 18:46:37 UTC - in response to Message 58760.

QM tasks run in the same manner as the GAFF2 tasks: high ~96% GPU utilization, low GPU power draw, ~1GB VRAM used, ~2% VRAM bus use, high PCIe use.
____________

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,290,493,256
RAC: 130,788
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58763 - Posted: 28 Apr 2022 | 23:11:25 UTC

ACEMD 4 tasks now seem to be processing OK on my antique GTX 970. I had previously reported that they were all failing.
Two GAFF2 tasks completed and validated. One QM task has been underway for 25 minutes and is processing normally.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 744
Credit: 4,948,103,494
RAC: 778,033
Level
Arg
Scientific publications
wat
Message 58764 - Posted: 29 Apr 2022 | 18:54:10 UTC

Out of the ~1200 or so that I assume went out, I processed over 400 of them myself. All completed successfully with no errors (excluding ones cancelled or aborted). I "saved" many _7s too.
____________
