Message boards : Graphics cards (GPUs) : Problem with Boinc device vs Nvidia X Server gpu allocation

jlhal
Joined: 1 Mar 10
Posts: 147
Credit: 1,077,535,540
RAC: 0
Message 41565 - Posted: 27 Jul 2015 | 21:13:00 UTC
Last modified: 27 Jul 2015 | 21:16:57 UTC

Hi everybody !

My MB is an ASUS X99-E WS with a Core i7-5820K at its stock 3.3 GHz clock and 16 GB of DDR4.
It runs Lubuntu 15.04 with 2 strictly identical GPUs (Gigabyte GTX Titan Black) in auto mode, with the latest manually installed drivers.

Whichever GPU I tell BOINC to ignore in cc_config.xml, GPU-0 is always the one used...
The WU runs to completion.

CC_CONFIG :

    <cc_config>
    <options>
    <report_results_immediately>1</report_results_immediately>
    <use_all_gpus>1</use_all_gpus>
    <ignore_nvidia_dev>0</ignore_nvidia_dev>
    </options>
    <log_flags>
    <coproc_debug>1</coproc_debug>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    </log_flags>
    </cc_config>



BOINC LOG with coproc_debug :

    lun. 27 juil. 2015 20:45:31 CEST | | Starting BOINC client version 7.4.23 for x86_64-pc-linux-gnu
    lun. 27 juil. 2015 20:45:31 CEST | | log flags: file_xfer, sched_ops, task, coproc_debug
    lun. 27 juil. 2015 20:45:31 CEST | | Libraries: libcurl/7.38.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
    lun. 27 juil. 2015 20:45:31 CEST | | Data directory: /var/lib/boinc-client
    lun. 27 juil. 2015 20:45:31 CEST | | [coproc] launching child process at /usr/bin/boinc
    lun. 27 juil. 2015 20:45:31 CEST | | [coproc] relative to directory /
    lun. 27 juil. 2015 20:45:31 CEST | | [coproc] with data directory /var/lib/boinc-client
    lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 0 (ignored by config): GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5, compute capability 3.5, 4096MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 1: GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5, compute capability 3.5, 4096MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | OpenCL: NVIDIA GPU 0 (ignored by config): GeForce GTX TITAN Black (driver version 352.21, device version OpenCL 1.2 CUDA, 6144MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | OpenCL: NVIDIA GPU 1: GeForce GTX TITAN Black (driver version 352.21, device version OpenCL 1.2 CUDA, 6143MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | NVIDIA library reports 2 GPUs
    lun. 27 juil. 2015 20:45:31 CEST | | No ATI library found
    lun. 27 juil. 2015 20:45:31 CEST | | Host name: odysseusV
    lun. 27 juil. 2015 20:45:31 CEST | | Processor: 12 GenuineIntel Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz [Family 6 Model 63 Stepping 2]
    lun. 27 juil. 2015 20:45:31 CEST | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
    lun. 27 juil. 2015 20:45:31 CEST | | OS: Linux: 3.19.0-23-generic
    lun. 27 juil. 2015 20:45:31 CEST | | Memory: 15.58 GB physical, 31.98 GB virtual
    lun. 27 juil. 2015 20:45:31 CEST | | Disk: 203.13 GB total, 166.69 GB free
    lun. 27 juil. 2015 20:45:31 CEST | | Local time is UTC +2 hours
    lun. 27 juil. 2015 20:45:31 CEST | Milkyway@Home | Found app_config.xml
    lun. 27 juil. 2015 20:45:31 CEST | | Config: report completed tasks immediately
    lun. 27 juil. 2015 20:45:31 CEST | | Config: ignoring NVIDIA GPU 0
    lun. 27 juil. 2015 20:45:31 CEST | | Config: GUI RPCs allowed from:
    lun. 27 juil. 2015 20:45:31 CEST | Milkyway@Home | URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 624246; resource share 100
    lun. 27 juil. 2015 20:45:31 CEST | World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 3343766; resource share 100
    lun. 27 juil. 2015 20:45:31 CEST | GPUGRID | URL http://www.gpugrid.net/; Computer ID 226017; resource share 100
    lun. 27 juil. 2015 20:45:31 CEST | World Community Grid | General prefs: from World Community Grid (last modified 24-Feb-2015 22:06:56)
    lun. 27 juil. 2015 20:45:31 CEST | World Community Grid | Computer location: home
    lun. 27 juil. 2015 20:45:31 CEST | | General prefs: using separate prefs for home
    lun. 27 juil. 2015 20:45:31 CEST | | Reading preferences override file
    lun. 27 juil. 2015 20:45:31 CEST | | Preferences:
    lun. 27 juil. 2015 20:45:31 CEST | | max memory usage when active: 11962.05MB
    lun. 27 juil. 2015 20:45:31 CEST | | max memory usage when idle: 11962.05MB
    lun. 27 juil. 2015 20:45:31 CEST | | max disk usage: 162.50GB
    lun. 27 juil. 2015 20:45:31 CEST | | (to change preferences, visit a project web site or select Preferences in the Manager)
    lun. 27 juil. 2015 20:45:31 CEST | | gui_rpc_auth.cfg is empty - no GUI RPC password protection
    lun. 27 juil. 2015 20:45:31 CEST | | Not using a proxy
    lun. 27 juil. 2015 20:45:32 CEST | GPUGRID | [coproc] Assigning NVIDIA instance 0 to e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:46:32 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:46:32 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:47:33 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:47:33 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:48:33 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:48:33 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:49:33 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:49:33 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:50:34 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:50:34 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:51:34 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:51:34 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:51:53 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:51:53 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:52:54 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:52:54 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:53:54 CEST | GPUGRID | [coproc] NVIDIA instance 0; 1.000000 pending for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0
    lun. 27 juil. 2015 20:53:54 CEST | GPUGRID | [coproc] NVIDIA instance 0: confirming 1.000000 instance for e1s2_1-GERARD_FXCXCL12_LIG_501831-0-1-RND4749_0



CONSOLE output for "date ; ps -ef | grep -v grep | grep device" :

    lundi 27 juillet 2015, 20:57:12 (UTC+0200)
    boinc 4626 4590 13 20:45 ? 00:01:32 ../../projects/www.gpugrid.net/acemd.846-65.bin --device 1



NVIDIA X SERVER SETTINGS :

shows that GPU-0 is working and GPU-1 is doing NOTHING !

This happens whatever the driver or BOINC version !

Using both GPUs (no "ignore" statement in cc_config.xml) :
BOINC never works with 2 GPUs running : the WU assigned to device 0 freezes (its CPU core is stuck) and only the WU on device 1 progresses, while in fact only GPU-0 is working ! In the end everything aborts.

Any clues ?
____________
Lubuntu 16.04.1 LTS x64

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,233,119,560
RAC: 12,859,654
Message 41566 - Posted: 28 Jul 2015 | 0:36:31 UTC

Try using this line, because you have 2 GPUs.

<use_all_gpus>2</use_all_gpus>





Retvari Zoltan
Joined: 20 Jan 09
Posts: 2350
Credit: 16,296,321,943
RAC: 4,009,836
Message 41568 - Posted: 28 Jul 2015 | 7:30:15 UTC - in response to Message 41566.

Try using this line, because you have 2 gpus.

<use_all_gpus>2</use_all_gpus>

That's wrong. The "use_all_gpus" option is a boolean, so its value can only be 0 or 1.
See the BOINC client configuration wiki page.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2350
Credit: 16,296,321,943
RAC: 4,009,836
Message 41569 - Posted: 28 Jul 2015 | 7:44:59 UTC - in response to Message 41565.

Your BOINC log shows that it's ignoring GPU 0 according to the cc_config:

    lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 0 (ignored by config): GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5, compute capability 3.5, 4096MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 1: GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5, compute capability 3.5, 4096MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | OpenCL: NVIDIA GPU 0 (ignored by config): GeForce GTX TITAN Black (driver version 352.21, device version OpenCL 1.2 CUDA, 6144MB, 4009MB available, 6396 GFLOPS peak)
    lun. 27 juil. 2015 20:45:31 CEST | | OpenCL: NVIDIA GPU 1: GeForce GTX TITAN Black (driver version 352.21, device version OpenCL 1.2 CUDA, 6143MB, 4009MB available, 6396 GFLOPS peak)

This line also confirms that the task was started on GPU 1:

    lundi 27 juillet 2015, 20:57:12 (UTC+0200)
    boinc 4626 4590 13 20:45 ? 00:01:32 ../../projects/www.gpugrid.net/acemd.846-65.bin --device 1


So perhaps the NVidia X server numbers the GPUs differently from the BOINC manager. I'm not a Linux expert, so I'm just guessing, but you should try to disable the other GPU in cc_config (e.g. "<ignore_nvidia_dev>1</ignore_nvidia_dev>"), and then check NVidia X server again.
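
To make BOINC's own view of the GPUs easy to compare against another tool, the startup lines quoted above can be summarised with a small script. This is only a sketch: the function name `boinc_gpu_status` is made up for illustration, and the pattern assumes the "CUDA: NVIDIA GPU n" line format shown in this log, which may differ in other BOINC versions.

```python
import re

# Matches BOINC startup lines in the format quoted in this thread, e.g.
# "CUDA: NVIDIA GPU 0 (ignored by config): GeForce GTX TITAN Black (driver ..."
# The format is taken from the log above; other client versions may differ.
GPU_LINE = re.compile(
    r"CUDA: NVIDIA GPU (?P<index>\d+)"
    r"(?P<ignored> \(ignored by config\))?"
    r": (?P<name>.+?) \(driver"
)


def boinc_gpu_status(log_text):
    """Return {device_index: (name, ignored)} from BOINC startup log text."""
    status = {}
    for line in log_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            status[int(match.group("index"))] = (
                match.group("name"),
                match.group("ignored") is not None,
            )
    return status


if __name__ == "__main__":
    sample = (
        "lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 0 (ignored by config): "
        "GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5)\n"
        "lun. 27 juil. 2015 20:45:31 CEST | | CUDA: NVIDIA GPU 1: "
        "GeForce GTX TITAN Black (driver version 352.21, CUDA version 7.5)\n"
    )
    for index, (name, ignored) in sorted(boinc_gpu_status(sample).items()):
        print(index, name, "ignored" if ignored else "in use")
```

This only reports what BOINC believes; it says nothing about which physical card the driver actually loads.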

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,233,119,560
RAC: 12,859,654
Message 41571 - Posted: 28 Jul 2015 | 10:16:38 UTC - in response to Message 41568.
Last modified: 28 Jul 2015 | 10:29:33 UTC

You're right on that. It can be any number, and it works the same. My mistake!

jlhal
Joined: 1 Mar 10
Posts: 147
Credit: 1,077,535,540
RAC: 0
Message 41572 - Posted: 28 Jul 2015 | 11:40:09 UTC - in response to Message 41569.

So perhaps the NVidia X server numbers the GPUs differently from the BOINC manager. I'm not a Linux expert, so I'm just guessing, but you should try to disable the other GPU in cc_config (e.g. "<ignore_nvidia_dev>1</ignore_nvidia_dev>"), and then check NVidia X server again.


Thanks for your answers ;-)

The Nvidia driver enumerates GPUs in the order found on the PCI bus, that is :

PCIE16_1 -> GPU-0 : Boinc device 0, the one which is CRUNCHING and should be ignored according to the config
PCIE16_2 -> not used
PCIE16_3 -> GPU-1 : Boinc device 1, the one that should be crunching and does NOTHING !

If I ignore device 1, Boinc says it is using device 0. That's OK: as in the first case, Boinc is consistent with itself, BUT
NVidia X server shows that it is GPU-1 which is CRUNCHING !

So to REALLY ignore the first (0) GPU/device I must ignore number 1, and vice-versa !!
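
One way to check the driver's side of this numbering is to ask `nvidia-smi` for each GPU's index and PCI bus ID and list them in bus order. The sketch below assumes `nvidia-smi` is on the PATH and that the installed driver supports the `--query-gpu` option; the function names and the bus IDs in the sample are hypothetical, not taken from this machine.

```python
import csv
import io
import subprocess

# nvidia-smi query for index, PCI bus ID and name, in CSV without a header row.
QUERY = ["nvidia-smi", "--query-gpu=index,pci.bus_id,name", "--format=csv,noheader"]


def parse_gpu_table(csv_text):
    """Parse 'index, pci.bus_id, name' rows into dicts sorted by PCI bus ID."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, bus_id, name = (field.strip() for field in fields)
        rows.append({"index": int(index), "bus_id": bus_id, "name": name})
    return sorted(rows, key=lambda row: row["bus_id"])


def query_gpus():
    """Run nvidia-smi and return the parsed table (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_table(result.stdout)


if __name__ == "__main__":
    # Hypothetical output for two identical cards; real bus IDs will differ.
    sample = (
        "0, 0000:05:00.0, GeForce GTX TITAN Black\n"
        "1, 0000:09:00.0, GeForce GTX TITAN Black\n"
    )
    for row in parse_gpu_table(sample):
        print(row["bus_id"], "-> driver index", row["index"], row["name"])
```

With identical cards the bus ID is the only reliable way to tell them apart, so comparing this table with the device number BOINC passes to the application (`--device 1` above) should show whether the two numberings agree.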


____________
Lubuntu 16.04.1 LTS x64

captainjack
Joined: 9 May 13
Posts: 171
Credit: 3,777,302,328
RAC: 17,864,089
Message 41583 - Posted: 28 Jul 2015 | 17:54:41 UTC

jlhal,

I am also running an ASUS board with Ubuntu. It has an NVIDIA 770 and an NVIDIA 970. When I look at the NVIDIA X Server Settings software, it says the 770 is GPU number 0 and the 970 is GPU number 1. However, when I look in the BOINC event log, it says that the 770 is GPU number 1 and the 970 is GPU number 0. The GPU numbers are reversed in BOINC. The same reversal could be happening to you. Luckily for me, my GPUs are different models, so it is easy to spot the reversal.

Hope that helps.
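
If the numbering really is reversed, the practical workaround is to translate the X server GPU number into the BOINC device number before writing `<ignore_nvidia_dev>`. The sketch below illustrates that idea only: the mapping dictionary must be filled in by hand from what each tool reports, and both function names are invented for this example.

```python
import xml.etree.ElementTree as ET


def boinc_device_for(xserver_gpu, xserver_to_boinc):
    """Translate an X server GPU number into the BOINC device number."""
    return xserver_to_boinc[xserver_gpu]


def build_cc_config(ignore_boinc_device):
    """Build a minimal cc_config.xml string that ignores one NVIDIA device."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "use_all_gpus").text = "1"
    ET.SubElement(options, "ignore_nvidia_dev").text = str(ignore_boinc_device)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed mapping for a reversed two-GPU system, as described in this thread:
    # X server GPU 0 is BOINC device 1, and X server GPU 1 is BOINC device 0.
    reversed_mapping = {0: 1, 1: 0}
    # To keep X server GPU 0 idle, ignore the BOINC device it maps to.
    print(build_cc_config(boinc_device_for(0, reversed_mapping)))
```

The generated file contains only the two GPU options; any other settings already in cc_config.xml would have to be merged in by hand.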

jlhal
Joined: 1 Mar 10
Posts: 147
Credit: 1,077,535,540
RAC: 0
Message 41584 - Posted: 28 Jul 2015 | 18:46:31 UTC - in response to Message 41583.

jlhal,

I am also running an ASUS board with Ubuntu. It has an NVIDIA 770 and an NVIDIA 970. When I look at the NVIDIA X Server Settings software, it says the 770 is GPU number 0 and the 970 is GPU number 1. However, when I look in the BOINC event log, it says that the 770 is GPU number 1 and the 970 is GPU number 0. The GPU numbers are reversed in BOINC. The same reversal could be happening to you. Luckily for me, my GPUs are different models, so it is easy to spot the reversal.

Hi captainjack !
It would be very interesting if some other crunchers could confirm this.

I can't tell which of your computers is the one concerned.
Could you please give us the model of your MB, your Linux OS and your Boinc version ?
Regards

____________
Lubuntu 16.04.1 LTS x64

captainjack
Joined: 9 May 13
Posts: 171
Credit: 3,777,302,328
RAC: 17,864,089
Message 41585 - Posted: 28 Jul 2015 | 19:28:22 UTC

jlhal,

My motherboard is an ASUS P9X79 LE, Ubuntu 15.04 64-bit, BOINC 7.2.42 (manually installed), and NVIDIA drivers 346.47 (manually installed).

The computer shows up twice on the list (I tried to merge the two entries but it wouldn't let me). It is the one that shows as having 2 GTX 970s and running Linux.

Let me know if you need more information.
