Message boards : Number crunching : Strange really big wrong ETAs on workunits

Author Message
dskagcommunity
Joined: 28 Apr 11
Posts: 460
Credit: 842,161,339
RAC: 1,630,920
Message 62050 - Posted: 18 Dec 2024 | 13:33:16 UTC

Recently I've been seeing my hosts show ETAs on WUs of, for example, 46d, so a host does not download another workunit until the current one finishes at 100% (while the systems actually finish the WUs, depending on their complexity, in 1-12h). So the backup project always kicks in during the upload/download phase after the 100% mark. What can be done to correct this wrong runtime estimation for the GPUs on GPUGrid only?
____________
DSKAG Austria Research Team: http://www.research.dskag.at



KeithBriggs
Joined: 29 Aug 24
Posts: 20
Credit: 1,268,908,599
RAC: 14,779,204
Message 62051 - Posted: 18 Dec 2024 | 17:34:40 UTC - in response to Message 62050.

The project set the estimates the way they did to fix something.

There was an earlier thread about this. Eventually, after say 100 tasks, it normalizes.

A couple of things I set in app_config.xml are:

<skip_cpu_benchmarks>1</skip_cpu_benchmarks>
<fraction_done_exact/>

Skipping the benchmark will let it normalize. I was told that running the benchmark resets it back to tens of days.

fraction_done_exact gets the current task's remaining-time estimate accurate ASAP, so that you can get a second task.

The solution is weak because it only works for one app name at a time. Maybe there's a way to specify it for all the apps you're running, but I have not been able to figure that out.
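
For reference, a minimal sketch of the two files as I understand them (the app name below is just a placeholder, and note that skip_cpu_benchmarks is documented as a cc_config.xml option rather than an app_config.xml one):

<!-- cc_config.xml, in the BOINC data directory -->
<cc_config>
   <options>
      <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
   </options>
</cc_config>

<!-- app_config.xml, in the GPUGrid project directory -->
<app_config>
   <app>
      <name>acemd3</name> <!-- placeholder app name; add one <app> block per app -->
      <fraction_done_exact/>
   </app>
</app_config>

Read them in with Options > Read config files in BOINC Manager, or restart the client.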


Others have mentioned shutting BOINC down and editing client_state.xml's

<duration_correction_factor>0.721557</duration_correction_factor>

but I have not tried that.

dskagcommunity
Joined: 28 Apr 11
Posts: 460
Credit: 842,161,339
RAC: 1,630,920
Message 62052 - Posted: 18 Dec 2024 | 18:24:34 UTC

Thanks, I will try that out!
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,070,605
RAC: 6,525,004
Message 62053 - Posted: 19 Dec 2024 | 3:41:48 UTC

Definitely use fraction_done_exact in an app_info.

But the best and fastest solution is to edit client_state.xml with a text editor, change the dcf in the GPUGrid section of the file to 0.01, and save the file.

Depending on the mix of work being done, it will eventually start climbing again and you will have to re-edit the file. But I find I only have to do that every few months at the most.
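
For anyone trying this, a sketch of the fragment to look for (the values here are illustrative; stop the client before editing, and only the duration_correction_factor line gets changed):

<project>
    <master_url>https://www.gpugrid.net/</master_url>
    <project_name>GPUGRID</project_name>
    ...
    <duration_correction_factor>0.010000</duration_correction_factor>
</project>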

Erich56
Joined: 1 Jan 15
Posts: 1142
Credit: 10,911,180,840
RAC: 22,039,105
Message 62054 - Posted: 19 Dec 2024 | 6:47:07 UTC - in response to Message 62053.

Definitely use fraction_done_exact in an app_info.

But the best and fastest solution is to edit client_state.xml with a text editor, change the dcf in the GPUGrid section of the file to 0.01, and save the file.

Depending on the mix of work being done, it will eventually start climbing again and you will have to re-edit the file. But I find I only have to do that every few months at the most.

Some time ago I applied the above-mentioned change in the client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.

KeithBriggs
Joined: 29 Aug 24
Posts: 20
Credit: 1,268,908,599
RAC: 14,779,204
Message 62055 - Posted: 19 Dec 2024 | 16:24:08 UTC - in response to Message 62054.


Some time ago I applied the above-mentioned change in the client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.


It reset with benchmarking turned off?

KeithM sent me the link for app_config, which states that you can't even manually benchmark when it's turned off.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Message 62056 - Posted: 19 Dec 2024 | 16:34:02 UTC - in response to Message 62055.


Some time ago I applied the above-mentioned change in the client_state.xml. However, after a couple of days the value fell back to what it was before :-(
So I gave up.


It reset with benchmarking turned off?

KeithM sent me the link for app_config, which states that you can't even manually benchmark when it's turned off.


Benchmarking does not change the DCF value.

0.01 is the minimum value acceptable for DCF in BOINC. If Erich tried setting it lower than that, that's why it didn't stick.
____________

Erich56
Joined: 1 Jan 15
Posts: 1142
Credit: 10,911,180,840
RAC: 22,039,105
Message 62057 - Posted: 19 Dec 2024 | 16:39:02 UTC - in response to Message 62056.


0.01 is the minimum value acceptable for DCF in BOINC. If Erich tried setting it lower than that, that's why it didn't stick.

No, I did NOT set it lower.

KeithBriggs
Joined: 29 Aug 24
Posts: 20
Credit: 1,268,908,599
RAC: 14,779,204
Message 62058 - Posted: 19 Dec 2024 | 17:32:06 UTC - in response to Message 62057.

I just ran the CPU benchmark even though I have it turned off, and it ran anyway. My ETA for a future task went from 36d 8h to 89d 19h after the benchmark. Prior to the benchmark my dcf was 0.686320, and now it's 1.696240. I wasn't expecting it to run, but thankfully it didn't trash my 3.5h running task.

CPU benchmarking does appear to impact ETAs for GPU tasks.

Turning benchmarking off doesn't prevent you from running it manually.
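
(For what it's worth, the manual trigger I mean is the one the client itself exposes, e.g.

boinccmd --run_benchmarks

or the equivalent menu entry in BOINC Manager; by my reading that path ignores the skip_cpu_benchmarks setting, which would explain the above.)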

KeithBriggs
Joined: 29 Aug 24
Posts: 20
Credit: 1,268,908,599
RAC: 14,779,204
Message 62059 - Posted: 19 Dec 2024 | 18:59:55 UTC - in response to Message 62058.

Quick follow-up: the above experiment was on a Win11 Core laptop. Thankfully, that one task did finish.

The new dcf after completion is 1.679309, or 99% of the previous value.

Something else is going on: even though my dcf jumped up ~247% and then down 1%, my ETA on the next task is 36d 3h, or ~99.5% of the ETA before benchmarking.
So dcf does not completely control the GPU task duration.

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,070,605
RAC: 6,525,004
Message 62060 - Posted: 20 Dec 2024 | 4:30:25 UTC - in response to Message 62059.

Quick follow-up: the above experiment was on a Win11 Core laptop. Thankfully, that one task did finish.

The new dcf after completion is 1.679309, or 99% of the previous value.

Something else is going on: even though my dcf jumped up ~247% and then down 1%, my ETA on the next task is 36d 3h, or ~99.5% of the ETA before benchmarking.
So dcf does not completely control the GPU task duration.

Why would you expect it to?

Only the task itself, the application, and the host's compute performance and loading determine how long a task takes to finish computation.

All the DCF does is affect the way BOINC estimates each task's computation time.

BUT, the DCF applies across the ENTIRE project, meaning ONE DCF value applies to ALL task sub-types.

So, every time you change task sub-species, the client/scheduler combination has to recompute the DCF value. Run a long-running task type, and the next time you run a short-running task type the estimated times are wildly skewed.

Follow a run of short-running tasks with the next long-running species and the DCF is again wildly skewed in the other direction.

If there were a DCF value applied to EACH sub-species of task, the estimated times would stabilize and be pretty much spot on.

But the BOINC server code on GPUGrid does not allow that. So we just have to accept that on projects that use the DCF mechanism in their server code and run many different sub-species of tasks, you will get DCF values that ping-pong back and forth, and estimated completion times will never be correct.

The most a user can do is set the DCF value to the lowest the BOINC code limits allow, which is 0.01.

Or get the project admins to run a different server code base. Current BOINC code removed the DCF mechanism and changed the DCF to a static value of 1.0, so projects that run that code do not see gyrations in the estimated times. But it is up to each project which BOINC server code they decide to run and how much they modify it to suit their needs.

Benchmarking itself does not change the DCF value. It's the variation in task running times among all the varied sub-species that changes the DCF value.
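
A worked example with made-up numbers, assuming every task carries the same flops estimate and therefore the same raw 2 h estimate on a given host:

long sub-species, actual runtime 10 h  -> DCF driven toward 10 / 2 = 5.0
next short sub-species, actual 0.5 h   -> estimated at 2 h x 5.0 = 10 h, 20x too long
                                       -> DCF walked back toward 0.5 / 2 = 0.25

Every switch between sub-species restarts that walk. That's the ping-pong.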

KeithBriggs
Joined: 29 Aug 24
Posts: 20
Credit: 1,268,908,599
RAC: 14,779,204
Message 62061 - Posted: 20 Dec 2024 | 5:51:12 UTC - in response to Message 62060.

That's very helpful. Makes sense.

I know that both you and Ian say the DCF isn't impacted by a CPU benchmark, but I saw it change mid-task before and after a benchmark. I'll chalk it up as a mystery.

I'm fine with everything as it is.

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,070,605
RAC: 6,525,004
Message 62062 - Posted: 20 Dec 2024 | 7:34:55 UTC - in response to Message 62061.

There probably is some interaction between running the benchmarks and computing the dcf.

But a cursory examination of the BOINC codebase didn't find an intersection between the two.

I'd have to really dig into the code and try and find some commonality.

Or it could just be that running benchmarks simply changes the flops estimate for the host.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 62065 - Posted: 20 Dec 2024 | 19:04:51 UTC

My observations on the estimated runtimes are the following:

- Even setting the DCF value to its minimum (0.01) is too high for the present ACEMD3 tasks on an RTX 2080 Ti.
- The estimated runtimes are off at the beginning (for ACEMD3 tasks), then they skyrocket to the 30-day range from one task to the next. In reality these tasks last 30 minutes or less. This results in "Tasks won't finish in time" messages.
- The ATMML tasks immediately break the DCF value and push the estimated runtime for the ACEMD3 tasks to unacceptable levels.

I think the rsc_fpops_est value of different tasks is set to the same value, regardless of their actual number of floating point operations. So it's no wonder the estimates have lost their function, and task queueing is not working properly (= not working at all) with GPUGrid tasks in the queue.
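
If I read the client logic correctly, the estimate is roughly

estimated runtime = rsc_fpops_est / (app version's estimated flops) x duration_correction_factor

so if rsc_fpops_est is identical for a 30-minute ACEMD3 task and a multi-hour ATMML task, no single DCF value can make both estimates correct at once.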

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,070,605
RAC: 6,525,004
Message 62066 - Posted: 21 Dec 2024 | 3:22:56 UTC

Strange that I have never had a single instance of the "tasks won't finish in time" message on my two hosts with a 2080 Ti, running every type of task that GPUGrid offers, in all the time that I've run this project.

While you have. Too large a cache in your case, perhaps?

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 62067 - Posted: 21 Dec 2024 | 13:36:53 UTC - in response to Message 62066.

The maximum queue length you can set in BOINC Manager is 10+10 days. When a task's runtime estimate says it will run for 30+ days, the queue (cache) size cannot be what causes the "Tasks won't finish in time" message.
BTW, my queue length is set to 0.01 or 0.1 days; I set larger values only when I'm debugging. I've been crunching mainly FAH tasks recently, and some FAH work servers are sometimes very slow from my home (it takes 30-50 minutes to download a new workunit), so my BOINC projects run only during these "outages", hence the short queue. During one such outage I gave GPUGrid a try, but it's still a mess (= it's unreliable at supplying work to my computers, which heat our apartment, so I need a different, more reliable source).

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 62068 - Posted: 21 Dec 2024 | 19:33:49 UTC

I'd like to add that there are two distinct batches in ACEMD3 (ADRIA and ANTONIOM), and when an ADRIA task gets in between ANTONIOM tasks it breaks the duration correction factor of the latter, so I have to adjust it manually in the client_state.xml file on a daily basis if I want to fill up my queue (4 tasks).
