
Message boards : Number crunching : Strange really big wrong ETAs on workunits

dskagcommunity
Joined: 28 Apr 11 | Posts: 460 | Credit: 842,743,941 | RAC: 1,650,777
Message 62050 - Posted: 18 Dec 2024 | 13:33:16 UTC

Recently my hosts have been showing huge ETAs on the WUs, for example 46d, so a host does not download another workunit until the current one finishes at 100% (the systems actually finish the WUs in 1-12h, depending on their complexity). So the backup project is always triggered during the upload/download phase after a task hits 100%. What can be done to correct this wrong runtime estimation for the GPUs in GPUGrid only?
____________
DSKAG Austria Research Team: http://www.research.dskag.at



KeithBriggs
Joined: 29 Aug 24 | Posts: 20 | Credit: 1,269,001,212 | RAC: 14,658,023
Message 62051 - Posted: 18 Dec 2024 | 17:34:40 UTC - in response to Message 62050.

The project set it the way they did to fix something on their end.

There was an earlier thread about this. Eventually, after say 100 tasks, it normalizes.

A couple of things I set in my BOINC config files are:

<skip_cpu_benchmarks>1</skip_cpu_benchmarks>
<fraction_done_exact/>

Skipping the benchmark will let it normalize. I was told that running the benchmark resets it back to tens of days.

fraction_done_exact gets the current task's remaining-time estimate accurate ASAP, so it allows you to get a second task sooner.

The solution is weak because it only works for one app name at a time. Maybe there's a way to specify it for all the apps you're running, but I haven't been able to figure that out. (A sketch of the file layout follows below.)
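For reference, here is a rough sketch of how these could be laid out, assuming the standard BOINC config file locations. The app name acemd3 is only an example; substitute the app names that actually appear in your client_state.xml.

In cc_config.xml (in the BOINC data directory):

<cc_config>
  <options>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
  </options>
</cc_config>

In app_config.xml (in the GPUGrid project directory):

<app_config>
  <app>
    <name>acemd3</name>
    <fraction_done_exact/>
  </app>
  <!-- as far as I know, the <app> block can be repeated for each app name you run -->
</app_config>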


Others have mentioned shutting BOINC down and editing client_state.xml's

<duration_correction_factor>0.721557</duration_correction_factor>

but I have not tried that.

dskagcommunity
Joined: 28 Apr 11 | Posts: 460 | Credit: 842,743,941 | RAC: 1,650,777
Message 62052 - Posted: 18 Dec 2024 | 18:24:34 UTC

Thanks, I will try that out!
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Keith Myers
Joined: 13 Dec 17 | Posts: 1358 | Credit: 7,896,753,504 | RAC: 6,536,190
Message 62053 - Posted: 19 Dec 2024 | 3:41:48 UTC

Definitely use fraction_done_exact in an app_config.

But the best and fastest solution is to edit client_state.xml with a text editor: in the GPUGrid section of the file, change the DCF to 0.01 and save the file.

Depending on the mix of work being done, it will eventually start climbing again and you will have to re-edit the file. But I find I only have to do that every few months at the most.
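Roughly, the fragment to look for is shown below (stop the BOINC client before editing; the surrounding fields here are illustrative and vary by client version):

<project>
    <master_url>https://www.gpugrid.net/</master_url>
    ...
    <duration_correction_factor>0.010000</duration_correction_factor>
    ...
</project>

Find the <project> block whose master_url is GPUGrid's, change only the duration_correction_factor line, save, and restart the client.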

Erich56
Joined: 1 Jan 15 | Posts: 1142 | Credit: 10,921,530,840 | RAC: 22,608,743
Message 62054 - Posted: 19 Dec 2024 | 6:47:07 UTC - in response to Message 62053.

Definitely use fraction_done_exact in an app_config.

But the best and fastest solution is to edit client_state.xml with a text editor: in the GPUGrid section of the file, change the DCF to 0.01 and save the file.

Depending on the mix of work being done, it will eventually start climbing again and you will have to re-edit the file. But I find I only have to do that every few months at the most.

Some time ago I applied the above-mentioned change in client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.

KeithBriggs
Joined: 29 Aug 24 | Posts: 20 | Credit: 1,269,001,212 | RAC: 14,658,023
Message 62055 - Posted: 19 Dec 2024 | 16:24:08 UTC - in response to Message 62054.


Some time ago I applied the above-mentioned change in client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.


It reset even with benchmarking turned off?

KeithM sent me the link for the config documentation, which states that you can't even run the benchmark manually when it's turned off.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1078 | Credit: 40,231,533,983 | RAC: 24
Message 62056 - Posted: 19 Dec 2024 | 16:34:02 UTC - in response to Message 62055.


Some time ago I applied the above-mentioned change in client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.


It reset even with benchmarking turned off?

KeithM sent me the link for the config documentation, which states that you can't even run the benchmark manually when it's turned off.


Benchmarking does not change the DCF value.

0.01 is the minimum value acceptable for DCF in BOINC. If Erich tried setting it lower than that, that's why it didn't stick.
____________

Erich56
Joined: 1 Jan 15 | Posts: 1142 | Credit: 10,921,530,840 | RAC: 22,608,743
Message 62057 - Posted: 19 Dec 2024 | 16:39:02 UTC - in response to Message 62056.


0.01 is the minimum value acceptable for DCF in BOINC. If Erich tried setting it lower than that, that's why it didn't stick.

No, I did NOT set it lower.

KeithBriggs
Joined: 29 Aug 24 | Posts: 20 | Credit: 1,269,001,212 | RAC: 14,658,023
Message 62058 - Posted: 19 Dec 2024 | 17:32:06 UTC - in response to Message 62057.

I just ran the CPU benchmark even though I have it turned off, and it ran anyway. My ETA for a future task went from 36d 8h to 89d 19h after the benchmark. Prior to the benchmark my DCF was 0.686320, and now it's 1.696240. I wasn't expecting it to run, but thankfully it didn't trash my 3.5h running task.

CPU benchmarking does appear to impact ETAs for GPU tasks.

Turning the benchmark off doesn't prevent you from running it manually.

KeithBriggs
Joined: 29 Aug 24 | Posts: 20 | Credit: 1,269,001,212 | RAC: 14,658,023
Message 62059 - Posted: 19 Dec 2024 | 18:59:55 UTC - in response to Message 62058.

Quick follow-up: the above experiment was on a Win11 core laptop. That one task did finish, thankfully.

The new DCF after completion is 1.679309, or 99% of the previous value.

Something else is going on: even though my DCF jumped up ~247% and then came down 1%, my ETA for the next task is 36d 3h, or ~99.5% of the ETA before benchmarking.
DCF does not completely control the GPU task duration estimate.

Keith Myers
Joined: 13 Dec 17 | Posts: 1358 | Credit: 7,896,753,504 | RAC: 6,536,190
Message 62060 - Posted: 20 Dec 2024 | 4:30:25 UTC - in response to Message 62059.

Quick follow-up: the above experiment was on a Win11 core laptop. That one task did finish, thankfully.

The new DCF after completion is 1.679309, or 99% of the previous value.

Something else is going on: even though my DCF jumped up ~247% and then came down 1%, my ETA for the next task is 36d 3h, or ~99.5% of the ETA before benchmarking.
DCF does not completely control the GPU task duration estimate.

Why would you expect it to?

Only the task itself, the application, and the host's compute performance and loading determine how long a task takes to finish computation.

All the DCF does is affect the way BOINC estimates each task's computation time.

BUT, DCF applies across the ENTIRE project, meaning ONE DCF value applies to ALL task sub-types.

So every time you change task sub-types, the client/scheduler combination has to recompute the DCF value. Run a long-running task type, and the next time you run a short-running task type the estimated times are wildly skewed.

Follow a run of short-running tasks with the next long-running type, and the DCF is again wildly skewed in the other direction.

If there were a DCF value applied to EACH sub-type of task, the estimated times would stabilize and be pretty much spot on.

But the BOINC server code on GPUGrid does not allow that. So we just have to accept that on projects that use the DCF mechanism in their server code and run many different sub-types of tasks, you will get DCF values that ping-pong back and forth, and estimated completion times will never be correct.

The most a user can do is set the DCF value to the lowest the BOINC code allows, which is 0.01.

Or get the project admins to run a different server code base. The current BOINC code removed the DCF mechanism and changed the DCF to a static value of 1.0, so projects that run that code do not see gyrations in the estimated times. But it is up to each project which BOINC server code they decide to run and how much they modify it to suit their needs.

Benchmarking itself does not change the DCF value. It's the variation in task running times among all the varied sub-types that changes the DCF value.
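To put rough numbers on it, the client's estimate works out to something like this (a simplification of the actual BOINC logic):

estimated_time = rsc_fpops_est / projected_flops * DCF

So if rsc_fpops_est / projected_flops comes out to, say, 10 hours of nominal compute, a DCF of 0.1 yields a 1-hour estimate, while a DCF that has ping-ponged up to 100 yields an estimate of over 40 days, even though nothing about the actual task changed.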

KeithBriggs
Joined: 29 Aug 24 | Posts: 20 | Credit: 1,269,001,212 | RAC: 14,658,023
Message 62061 - Posted: 20 Dec 2024 | 5:51:12 UTC - in response to Message 62060.

That's very helpful. Makes sense.

I know both you and Ian say that DCF isn't impacted by a CPU benchmark, but I saw it change mid-task, before and after a benchmark. I'll chalk it up as a mystery.

I'm fine with everything as it is.

Keith Myers
Joined: 13 Dec 17 | Posts: 1358 | Credit: 7,896,753,504 | RAC: 6,536,190
Message 62062 - Posted: 20 Dec 2024 | 7:34:55 UTC - in response to Message 62061.

There probably is some interaction between running the benchmarks and computing the DCF.

But a cursory examination of the BOINC codebase didn't find an intersection between the two.

I'd have to really dig into the code and try to find some commonality.

Or it could just be that running the benchmarks simply changes the flops estimate for the host.

Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2356 | Credit: 16,377,930,940 | RAC: 3,470,976
Message 62065 - Posted: 20 Dec 2024 | 19:04:51 UTC

My observations on the estimated runtimes are the following:

- Even setting the DCF value to its minimum (0.01) is too high for the present ACEMD3 tasks on an RTX 2080 Ti.
- The estimated runtimes are off at the beginning (for ACEMD3 tasks), then they skyrocket into the 30-day range from one task to the next. In reality these tasks last 30 minutes or less. This results in "Tasks won't finish in time" messages.
- The ATMML tasks immediately break the DCF value and push the estimated runtime for the ACEMD3 tasks to unacceptable levels.

I think the rsc_fpops_est value of different tasks is set to the same number, regardless of their actual amount of floating-point operations. So it's no wonder the estimates have lost their function, and task queueing is not working properly (= not working at all) with GPUGrid tasks in the queue.
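This is easy to check in client_state.xml, as each workunit carries its own estimate. The numbers and the workunit name below are made up for illustration:

<workunit>
    <name>some_ACEMD3_workunit</name>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    ...
</workunit>

If every GPUGrid workunit reports the same rsc_fpops_est while actual runtimes differ by an order of magnitude or more, the runtime estimates cannot be meaningful.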

Keith Myers
Joined: 13 Dec 17 | Posts: 1358 | Credit: 7,896,753,504 | RAC: 6,536,190
Message 62066 - Posted: 21 Dec 2024 | 3:22:56 UTC

Strange that I have never had a single instance of the "tasks won't finish in time" message on my two hosts with a 2080 Ti, running every type of task that GPUGrid offers, in all the time I've run this project.

While you have. Too large a cache in your case, perhaps?

Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2356 | Credit: 16,377,930,940 | RAC: 3,470,976
Message 62067 - Posted: 21 Dec 2024 | 13:36:53 UTC - in response to Message 62066.

The maximum queue length you can set in BOINC Manager is 10+10 days. When a task's runtime estimate says it will run for 30+ days, the queue (cache) size cannot be what causes the "Tasks won't finish in time" message.
BTW, my queue length is set to 0.01 or 0.1 days; I set larger values only when I'm debugging. Recently I crunch mainly FAH tasks, and some FAH work servers are sometimes very slow from my home (it takes 30-50 minutes to download a new workunit), so my BOINC projects run only during these "outages", hence the short queue. During one such outage I gave GPUGrid a try, but it's still a mess (= it's an unreliable source of work for my computers, which heat our apartment, so I need a different, more reliable source).

Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2356 | Credit: 16,377,930,940 | RAC: 3,470,976
Message 62068 - Posted: 21 Dec 2024 | 19:33:49 UTC

I'd like to add that there are two distinct batches in ACEMD3 (ADRIA and ANTONIOM), and when an ADRIA task gets in between ANTONIOM tasks, it breaks the duration correction factor for the latter, so I have to adjust it manually in client_state.xml on a daily basis if I want to keep my queue full (4 tasks).
