Message boards : Number crunching : resume from checkpoint: same gpu?

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,906,691
RAC: 813,778
Message 52228 - Posted: 10 Jul 2019 | 15:22:07 UTC
Last modified: 10 Jul 2019 | 15:23:13 UTC

When rebooting, does the checkpointed task resume on the same GPU (e.g. a GTX 1070), or could it be assigned to a different GPU, such as a GTX 1060, or vice versa?

I have a mix of NVIDIA cards and was looking at why some tasks take much longer than others.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 52229 - Posted: 10 Jul 2019 | 16:15:24 UTC - in response to Message 52228.

When rebooting, does the checkpointed task resume with the same GPU (eg GTX-1070) or could it be assigned to a different GPU like the GTX-1060 or vice versa?
It could be assigned to a different GPU.

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,906,691
RAC: 813,778
Message 52233 - Posted: 10 Jul 2019 | 20:21:12 UTC - in response to Message 52229.
Last modified: 10 Jul 2019 | 20:29:51 UTC

It could be assigned to a different GPU.


OK, that explains what I see. I am doing a study of risers on various slots.

The motherboard has 4 slots: x8, x4, x8, x4.
The risers, in slot order, are:
x1: 1070
x1: 1060
none
4-in-1: three 1060s

I lost track of which board was d0, d1, etc., so I rebooted without noting which board was running the task that showed 9 hours elapsed and another day and a half to complete. After rebooting I checked the BOINC messages, and the 1070 had the task that supposedly had another day and a half to go. It was running at 75% GPU load and was very warm, unlike the three on the 4-in-1 riser. I thought something was wrong with the 1070, but it seems one of the tasks from the 4-in-1 riser was reassigned to the 1070. I was not aware that tasks can be reassigned, but that must have happened, as that task on the 1070 gained a day in just a couple of hours and is back to the normal completion time (almost) for a GTX 1070. I am seeing GPU load of 42-55% on the 4-in-1 riser (with three boards); the load on the single 1060 is around 65% and on the 1070 around 75%.

Very strange: I cannot get the format correct after posting, but the preview is OK. I am going to post the missing text below. Not sure how it got cut off, and I don't see any control characters in my text.

I was not aware that tasks can be reassigned, but that must have happened, as that task on the 1070 gained a day in just a couple of hours and is back to the normal completion time (almost) for a GTX 1070. I am seeing GPU load of 42-55% on the 4-in-1 riser (with three boards); the load on the single 1060 is around 65% and on the 1070 around 75%.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 52234 - Posted: 10 Jul 2019 | 21:56:17 UTC - in response to Message 52233.
Last modified: 10 Jul 2019 | 21:56:39 UTC

The GPUGrid app needs more PCIe bandwidth than (Bit)Coin mining, or other BOINC projects (like SETI@home) to achieve optimal GPU usage.
It is also recommended to use a non-WDDM OS (preferably Linux).

Here's an excerpt from the stderr output of task 21098601 from your host in question:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 1060 3GB] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 3 :
# Name : GeForce GTX 1060 3GB
# ECC : Disabled
# Global mem : 3072MB
# Capability : 6.1
# PCI ID : 0000:08:00.0
# Device clock : 1708MHz
# Memory clock : 4004MHz
# Memory width : 192bit
# Driver version : r430_00 : 43086
# GPU 0 : 75C
# GPU 1 : 49C
# GPU 2 : 54C
# GPU 3 : 56C
# GPU 4 : 39C
...
# GPU [GeForce GTX 1070] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 1070
# ECC : Disabled
# Global mem : 8192MB
# Capability : 6.1
# PCI ID : 0000:03:00.0
# Device clock : 1683MHz
# Memory clock : 4004MHz
# Memory width : 256bit
# Driver version : r430_00 : 43086
# GPU 0 : 51C
# GPU 1 : 40C
# GPU 2 : 42C
# GPU 3 : 41C
# GPU 4 : 43C
...
# Time per step (avg over 5905000 steps): 3.989 ms
# Approximate elapsed time for entire WU: 39887.681 s
# PERFORMANCE: 64800 Natoms 3.989 ns/day 0.000 ms/step 0.000 us/step/atom
called boinc_finish
</stderr_txt>
]]>
It is quite obvious that it switched over to a different GPU.
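
For anyone who wants to check this across many tasks rather than by eye, here is a rough Python sketch. It assumes each (re)start of the app writes a "# GPU [name]" banner followed by a "# SWAN Device N" line, as in the excerpt above. The function names are made up for illustration and are not part of BOINC or the GPUGrid app, and the sample text is a shortened copy of the excerpt.

```python
import re

# Each (re)start is assumed to log "# GPU [name] ..." followed by
# "# SWAN Device N". Temperature lines such as "# GPU 0 : 75C" have no
# "[" after "# GPU " and are therefore not matched.
BANNER = re.compile(r"# GPU \[([^\]]+)\].*?# SWAN Device (\d+)", re.DOTALL)


def devices_used(stderr_text):
    """Return (device index, GPU name) pairs in the order they were logged."""
    return [(int(dev), name) for name, dev in BANNER.findall(stderr_text)]


def switched_gpu(stderr_text):
    """True if the task was logged on more than one distinct device."""
    return len(set(devices_used(stderr_text))) > 1


# Shortened copy of the stderr excerpt quoted in this post.
sample = (
    "# GPU [GeForce GTX 1060 3GB] Platform [Windows] Rev [3212] VERSION [80] "
    "# SWAN Device 3 : # Name : GeForce GTX 1060 3GB # GPU 0 : 75C # GPU 1 : 49C ... "
    "# GPU [GeForce GTX 1070] Platform [Windows] Rev [3212] VERSION [80] "
    "# SWAN Device 0 : # Name : GeForce GTX 1070 # GPU 0 : 51C"
)

print(devices_used(sample))
print(switched_gpu(sample))
```

The pattern works whether the stderr text is on one line or split across lines, since it only relies on the order of the two markers.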

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,906,691
RAC: 813,778
Message 52235 - Posted: 11 Jul 2019 | 1:21:46 UTC - in response to Message 52234.

The GPUGrid app needs more PCIe bandwidth than (Bit)Coin mining, or other BOINC projects (like SETI@home) to achieve optimal GPU usage.
It is also recommended to use a non-WDDM OS (preferably Linux).


I was running GPUGrid on Ubuntu 16.04 with a pair of 1060s, and it was doing very well for a while. I switched to Windows over a month ago to test risers for a study I am interested in. I wanted results from TechPowerUp's GPU logs and temperatures from TThrottle, which are available under Windows.

I wanted to compare performance for various projects and had created a Windows app for performance calculations:
https://forum.efmer.com/index.php?board=47.0

I will be moving the 5 nvidia boards to a TB85 (6 slot miner motherboard) for testing. The 4-in-1 riser gives a real hit on performance on gpugrid. The TB85 runs 18.04 Ubuntu so that will be an interesting comparison plus there are enough x1 slots to where I don't need a splitter.

The 4-in-1 seemed OK on SETI and Einstein, and I am putting a table of results together. While I cannot get GPU load under Ubuntu (??), the comparison of elapsed time is probably just as good a marker.

Not sure what is going on web-wise here, but I noticed that my previous post now has all the correct text, whereas before only the preview was correct. Sometimes when text in a post is missing, it is because of a control character accidentally left in the body. Hopefully this post is intact.

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,064,647
RAC: 6,529,813
Message 52236 - Posted: 11 Jul 2019 | 3:30:21 UTC - in response to Message 52235.

while I cannot get gpuload under ubuntu (??)

Huh ? ? ?
You most certainly can.
From a Terminal session just start nvidia-smi which shows gpu wattage, gpu utilization, memory used and temperature.

nvidia-smi -l 1

polls the installed cards and refreshes the display every second.

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 52237 - Posted: 11 Jul 2019 | 7:45:54 UTC - in response to Message 52235.

If it is just temperature you are chasing to import into a custom app (instead of using tthrottle), you can use:

nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits

This returns only the temperature, as an integer. The command works on both Linux and Windows.

You can add the "-i x" switch (where x is the card number) to get the temperature of a specific card.
The "--format" switch strips out the heading and other characters so it is easier to capture the temperature value.
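
If the custom app is not tied to one language, the same query can be wrapped in a few lines of Python. This is only a sketch: the helper names are invented for illustration, it assumes nvidia-smi is on the PATH, and it assumes every card reports a plain integer (a card that reports something like "[N/A]" would need extra handling).

```python
import subprocess


def temperature_command(card=None):
    """Build the nvidia-smi command; 'card' is the optional "-i" index."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=temperature.gpu",
        "--format=csv,noheader,nounits",
    ]
    if card is not None:
        cmd += ["-i", str(card)]
    return cmd


def parse_temperatures(output):
    """Turn the command output (one integer per line) into a list of ints."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_temperatures(card=None):
    """Run nvidia-smi and return the temperatures in degrees C."""
    result = subprocess.run(
        temperature_command(card), capture_output=True, text=True, check=True
    )
    return parse_temperatures(result.stdout)


if __name__ == "__main__":
    try:
        print(read_temperatures())   # all cards
        print(read_temperatures(0))  # card 0 only
    except (OSError, subprocess.CalledProcessError, ValueError) as err:
        print("nvidia-smi not usable here:", err)
```

Keeping the parsing separate from the subprocess call means the parsing can be checked on a machine without an NVIDIA card.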

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2356
Credit: 16,377,158,793
RAC: 3,481,243
Message 52238 - Posted: 11 Jul 2019 | 9:06:07 UTC - in response to Message 52235.

I will be moving the 5 nvidia boards to a TB85 (6 slot miner motherboard) for testing. The 4-in-1 riser gives a real hit on performance on gpugrid. The TB85 runs 18.04 Ubuntu so that will be an interesting comparison plus there are enough x1 slots to where I don't need a splitter.
The reduction of the GPUGrid app's performance comes from the reduced PCIe bandwidth, regardless of the limiting factor (an x1 riser or a motherboard with x1 slots will give the same result on the same OS).

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,906,691
RAC: 813,778
Message 52239 - Posted: 11 Jul 2019 | 18:17:34 UTC - in response to Message 52238.
Last modified: 11 Jul 2019 | 18:24:56 UTC

The reduction of the GPUGrid app's performance comes from the reduced PCIe bandwidth, regardless of the limiting factor (an x1 riser or a motherboard with x1 slots will give the same result on the same OS).


Yes, agreed, but I was testing a 4-in-1 adapter for comparison across various projects that are good for Gridcoin mining. That adapter caused a performance hit.

Here are results from the same motherboard, but with each board on an x1 riser (no splitter) in slots x8, x4, x8, x4. I failed to get a screenshot when the splitter was used, but the GPU load was in the mid 40s to 50s for the boards on the splitter. Considering that the splitter had three GTX 1060s, one might have expected 60-80% divided by 3, for a 20-30% load, so the splitter caused degradation but it was not that bad.

Unaccountably, I had to use http instead of https for the image below. Some sites require secure, others don't seem to care. My images are on GoDaddy at my (now folded up) motorcycle club web site, which I put together before I retired. Keith listed tools in Ubuntu for accessing GPU info which I was unaware of. Same for rod4x4. I will look into that Windows app rod4x4 mentioned, because I was a C, C#, and VB Windows developer for years. I retired when my company switched platform to Linux and CORBA. I have been looking into accessing GPU-Z shared memory to get values, but it seems easier to log results to the disk drive and read them using my C# program. I will look at nvidia-smi, however.



[edit] strange, parts of my post are missing but are present in the preview. I just pasted this url into chrome and there is no missing text in my post. Not sure what is happening in Microsoft Edge that is causing text to be missing.

Erich56
Joined: 1 Jan 15
Posts: 1142
Credit: 10,908,930,840
RAC: 22,075,475
Message 52240 - Posted: 11 Jul 2019 | 19:01:18 UTC

for the actual runs, I'd avoid GPU temps higher than 70°C. Lower than that is even better.

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,906,691
RAC: 813,778
Message 52241 - Posted: 11 Jul 2019 | 19:06:50 UTC - in response to Message 52240.

for the actual runs, I'd avoid GPU temps higher than 70°C. Lower than that is even better.


Temps are all under 65°C, as I had not started up Afterburner when I took the screenshot.

Keith Myers
Joined: 13 Dec 17
Posts: 1358
Credit: 7,895,064,647
RAC: 6,529,813
Message 52242 - Posted: 12 Jul 2019 | 0:52:05 UTC

nvidia-smi is available on both Linux and Windows platforms. Shows the same information.

Use rod4x4's format parameter to dump all the nvidia-smi output into a CSV file for later analysis.

nvidia-smi is located at C:\Program Files\NVIDIA Corporation\NVSMI in Windows.

You access it via a Terminal in Windows also.

nvidia-smi --help

prints out all the possible parameters.
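
As a rough illustration of the "later analysis" step, here is a Python sketch that averages utilization and temperature per card from such a log. It assumes the log was produced with something like "nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu --format=csv,noheader,nounits -l 1 > gpu_log.csv", so each line holds three numbers. The file name, the function names, and the sample values are invented for illustration; they are not measurements from this thread.

```python
import csv
from collections import defaultdict


def average_by_gpu(rows):
    """Average utilization (%) and temperature (C) per GPU index.

    'rows' is any iterable of [index, utilization, temperature] records,
    for example a csv.reader over the log file.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for row in rows:
        if len(row) != 3:
            continue  # skip blank or truncated lines
        index, util, temp = (field.strip() for field in row)
        entry = totals[int(index)]
        entry[0] += float(util)
        entry[1] += float(temp)
        entry[2] += 1
    return {idx: (u / n, t / n) for idx, (u, t, n) in sorted(totals.items())}


def average_from_file(path):
    """Read a log file written by the nvidia-smi command described above."""
    with open(path, newline="") as handle:
        return average_by_gpu(csv.reader(handle))


# Made-up sample lines in the assumed log format: index, utilization, temp.
sample = ["0, 75, 51", "1, 45, 40", "0, 77, 53", "1, 55, 42"]
print(average_by_gpu(csv.reader(sample)))
# prints {0: (76.0, 52.0), 1: (50.0, 41.0)}
```

Comparing these averages between a splitter run and a no-splitter run would give the same kind of number as the GPU load readings discussed earlier in the thread.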
