pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
I saw a thread that mentioned fan speeds, but I've yet to figure out how to get the fan speeds to adjust automatically according to workload. I would think the board drivers would do that, but on mine they don't. I have a pair of GTX 260s in SLI. The fan speeds seem to default to 40%. When I start running a CUDA app at full saturation, the GPU temps begin climbing, but the fan speeds never increase. Temps continue to climb until the board seizes and causes a driver error and a computation error on the work unit.
I installed the nVidia System Tools package, which allows me to set the fan speeds manually, but when I'm not running CUDA, such as when listening to music or gaming, I don't want my machine sounding like a set of turbines next to me, nor do I want to manually change the fan speeds every time I do something different. I tried using the new feature the tools added to create rules for fan speed based on temp, but the rules appear to have no effect: the speed stays wherever you manually set it, regardless of temperature. I have the latest nVidia drivers. Again, I would think the drivers would do this automatically. Am I missing something? Any help appreciated, since right now it's either manually crank up the fans or don't run CUDA. Thanks.

pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
Well... I'm stubborn and wouldn't give up. I think I found a workaround. It's not perfect, but it works. As I said before, the fans wouldn't ramp up with the temp, so the cards would overheat and become unstable, then I'd either get a message saying the driver had an error and has recovered or, worse, a BSOD saying the system has halted to prevent damage, followed by a reboot. I've never had any problems with these cards until they get over about 65 degrees Celsius, so that's the challenge: keep them cool when running full tilt.
I kept trying in the nVidia Control Panel to get the fans to speed up as the temperatures climbed. I tried both device rules in the Device Settings menu and stored profiles with profile policies set to switch to higher-speed profiles above certain temperatures. Neither worked. I've come to believe there's a bug in either the video BIOS or the drivers that keeps the fans from responding to temperatures correctly.
So, I tried attacking the problem from a different angle, and it worked. Like I said, not perfect, but good enough for my situation for now. What I did was configure BOINC to only run when in screensaver mode, then configured the video cards to crank up the cooling fans whenever the screensaver comes on. I figure if the screensaver is not on, I'm probably playing games, listening to music, or something else where I don't want CUDA loading the machine and the fans loud anyway, so this works for me.
Here's what I did, for others who may be having the same problem:
1. Go to the BOINC computing preferences.
   a. Set 'Suspend while computer in use' to Yes.
   b. Set 'In use means activity in last' to exactly the same number of minutes as your screensaver's delay.
   c. On the Projects tab in BOINC Manager, click Update so BOINC loads the new preferences.
Be careful: if you set the preferences in BOINC individually for each computer, make sure you do this on at least all the CUDA-capable ones, or else clear the local preferences and set them on the web page. (If you'd rather keep everything local, see the override-file sketch just below.)
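For completeness, the same two settings can also be driven from BOINC's local override file instead of the website. This is just a sketch of the standard global_prefs_override.xml mechanism; the 5-minute value is an example and should match your screensaver delay:

  <!-- global_prefs_override.xml, placed in the BOINC data directory -->
  <global_preferences>
    <run_if_user_active>0</run_if_user_active>  <!-- 'Suspend while computer in use' = Yes -->
    <idle_time_to_run>5</idle_time_to_run>      <!-- 'In use means activity in last' N minutes -->
  </global_preferences>

After saving it, use Advanced -> Read local prefs file in BOINC Manager (or run 'boinccmd --read_global_prefs_override') so the client picks it up.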
Then, here's how to set up the video card action:
1. Of course, make sure you have the latest drivers installed.
2. Open the nVidia Control Panel.
3. Click 'Device Settings' on the right side. You should now have the 'Create Profile' tab showing.
4. In the 'Cooling' section, move the slider for GeForce GPU to 90%. (I experimented, and I really have to crank the fan that high to keep the GPU below 60C when running CUDA. Don't worry: when we're done, it will only run at that speed when doing CUDA.)
5. Repeat this for each of your GPUs if you have more than one card.
6. Click Save. It will prompt for a name; give it a new one, such as 'CUDA.nsu'. Just don't overwrite the existing profiles.
7. Now click 'Profile Policies' on the left side.
8. Create two new rules. First, create one that loads the profile you just saved ('CUDA.nsu', or whatever you named it) when the screensaver is launched. Second, create another that loads 'osbootpf.nsu' when the screensaver is stopped. Make sure both are checked. You should now have three rules, since there would already be the default rule that loads osbootpf.nsu when Windows starts.
That's it. Now, whenever your machine's screensaver comes on and presumably CUDA starts to crank, your fans will spin up as well to keep the cards cool. Then, when you wake the machine up to play a game or something, CUDA will stop and the fans will slow back down to normal again. I still think the fan speed should be automatic based on temps, but at least this does the job.
Two other things you may want to consider. On the preferences page for BOINC, I would change 'Leave applications in memory while suspended' to No, since CUDA apps can really load up your video RAM and impact performance on things like games. Also, I personally tweaked my osbootpf.nsu profile from the default 40% to 45% for each card. If you do other things that heat up the cards, you may need to set that profile even higher, but very few programs or games run the video cards as hot as all-out CUDA processing does, so you probably don't need the fans cranked all the time. That's probably why nVidia defaults to 40%: it's the best balance between cooling and noise for most people. Anyway, going to 45% I can hardly hear the difference, but it keeps the cards a few degrees cooler even at idle.
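If you're using the override file sketched above, that memory setting has a matching tag as well; again, just a sketch:

  <leave_apps_in_memory>0</leave_apps_in_memory>  <!-- 0 = No: drop suspended apps from memory -->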
Hope this helps people get rid of driver errors and BSODs when running CUDA. Since I made these changes, I've not had a single error at all. I think heat was the cause of all my stability problems with CUDA. Now, if nVidia would fix the temp responses and BOINC would add preferences for CUDA GPU loads and such, things would be perfect.

pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
Update: now I've run into another bug that others have reported. When BOINC says it's suspended, at least one of the CUDA tasks keeps running. I've even tried manually selecting the task and clicking Suspend, and it does show 'suspended' next to it, yet it keeps ticking up progress and time, and since my fans have slowed back down to non-screensaver speed, my card starts heating up again. Plus, even if I left my fans running at full, I don't want CUDA running while I'm trying to play a game.
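One way to cross-check what the core client actually thinks each task is doing, rather than trusting the Manager's display, is the boinccmd tool that ships with the 6.x clients. A sketch; the URL and task name are placeholders for your own:

  boinccmd --get_tasks                                    (lists every task and its current state)
  boinccmd --task http://project.url/ task_name suspend   (suspends a single task by name)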
I give up. I'll have to pass on CUDA projects for now, until either BOINC fixes the suspend issue with CUDA processes and adds a way to throttle how hard it works the GPUs, or nVidia improves the cooling behavior in their video BIOS or drivers. The issue where you have to turn SLI off to use all your video cards for CUDA needs to be fixed as well.
Oh well... back to 'regular' BOINC.

Tank_Master
Use RivaTuner to control fan speeds (it works with every ATI and nVidia driver version I have encountered, including betas).
As for BOINC, suspend the GPU project and all tasks stop. I've done this with the Windows 64-bit versions 6.4.5 and 6.6.3; both work. nVidia took the fan speed control out a few driver versions back.
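If you'd rather script that than click around in the Manager, boinccmd can do the same thing; a sketch, with the project URL as a placeholder:

  boinccmd --project http://your.project.url/ suspend   (all tasks from that project stop)
  boinccmd --project http://your.project.url/ resume    (they pick up where they left off)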
I can't stress this enough... make sure your drivers and DirectX are up to date. There are updates to DX every few months; the last one was November 2008. If you are still having problems with blue screens even at 60C and with updated drivers, you should return your card, because something is wrong. The GPU is designed to run stable up to around 105C. 75C is quite normal for a GPU; 60C is actually on the cool side. Mine runs 100% stable at 73C with the fan at the default 35% (8800GTS 512, 64-bit Server 2008/Win7).
Just a note: video memory and system memory are treated differently. Apps never stay in video memory; it's only used while the WU is active, and only for the computation at hand.
http://www.nvidia.com/Download/index.aspx?lang=en-us <- nvidia drivers
http://www.microsoft.com/downloads/details.aspx?FamilyID=2da43d38-db71-4c1b-bc6a-9b6652cd92a3&DisplayLang=en <- direct X updater for all win OSs (including vista and win7, and 64bit variants)
Just curious, what OS are you running?

pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
I'm running:
2 GTX 260 cards
Driver: 181.22
Latest DirectX 10 updates
Vista 64bit SP1 with all current updates
ASUS P6T6 WS Revolution system board
6 GB DDR3 Triple Channel RAM
BOINC 6.6.3
I know they are supposed to suspend when you tell BOINC to, but I'm definitely not the only one who has posted about CUDA tasks continuing to run even when BOINC says they're suspended. My utilities also show the GPU being utilized. I hope it's just a bug that will be fixed in the next BOINC release.
As for the cooling issue, I think it's systemic to all the new GTX 200 series cards. You can Google and find gobs of posts in forums, including from gamers who have nothing to do with BOINC or CUDA, fighting BSODs or messages from the OS saying a driver error has occurred and recovered. I've even seen a couple of websites where the dreaded 'nvlddmkm.sys' error, for which I see a thousand supposed solutions on the web, was reproduced repeatedly by simply overheating the video card.
The GTX 200 series boards have by far the fastest, hottest GPUs nVidia has released. The boards are already two slots wide, and I think nVidia has been desperate to avoid going to a triple-wide board. Unfortunately, that severely limits the available space for fans and heatsinks. Under load these chips run hotter than the CPU, yet they're forced to make do with tiny fans and heatsinks most modders would laugh at for a CPU. Even the stock Intel fan and heatsink for Core i7 CPUs is three times the size of what the GTX 200 series GPUs have to work with. In fact, I've already seen aftermarket coolers being released for those who have room in their case (and between the boards) to replace the stock fan and heatsink with a much larger one. Also, there are versions of the GTX line being released with water cooling attachments already fitted.
I think we are seeing some of the same issues extreme gamers see, because CUDA is capable of putting a maximum load on these cards, far more intense than typical PC use. nVidia is simply reaching the point where they will need to improve the cooling solutions on these boards. That likely means working with motherboard manufacturers on things like PCI slot spacing, to ensure there's room for multiple GPUs for those who want to do SLI. Even now many motherboards don't have room for 3 or 4 boards in SLI, and even if they have room, most won't run all those slots at full x16 speed. You have to choose the system board carefully.
People balked when they made longer cards, and again when cards first went two slots thick. Now I think they'll need to go to three slots so they can fit much beefier heatsinks and fans. The engineer in me thinks even the current fans and heatsinks could be vastly improved; again, look how large many of the heatsink/fan combos for a single CPU are. It's not surprising that the GTX 295, with two GPUs on a single board, is among the cards people have the most trouble keeping cool with the dinky fan and heatsink they put on it. That's part of why I went with two 260s instead of a single 295, but I still have to crank the fan to keep them cool under load. That's life when pushing the envelope, I guess.
I think since nVidia is really starting to push CUDA, and as games even more graphically intense than, say, Crysis come along, they'll have no choice but to rethink the cooling of these monsters.
I'm thinking at this point I'd like to see someone come out with a small compressor-type unit that would simply feed chilled air into a case. If all the fans and heatsinks in my case were drawing in air that was cooler to begin with, it would help everything; right now they're all working with room-temperature air. I'd prefer that over liquid cooling. Just pump cold air into the case. It wouldn't need to be a very large compressor at all to cool a single PC case. Entrepreneurs? Given that modders buy all sorts of extreme cooling gear, if the price was right I think this would sell. I would think the server market would be big as well. I'd buy one.

pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
Eh... just had a thought... if you chill the air in the chassis much below room temp, you'll get water condensation... oh well. I still think they could improve cooling with simple things like metals that conduct heat better, improved fan and heatsink fin designs, and designing cooler-running chips to begin with.

pharrg (Joined: 12 Jan 09, Posts: 36, Credit: 1,075,543, RAC: 0)
Hi... just noticed one catch to the cooling process I gave above. If you run a program like Media Player that is set to disable the screensaver, the fans will never kick in, since the screensaver won't come on. However, because BOINC's setting is based purely on time since the last keyboard or mouse activity, it will start running anyway. So when I started playing some music, or watching a movie, Media Player prevented the screensaver, and thus the fan speedup, yet BOINC resumed after a few minutes of idle mouse, and it wasn't long before the cards overheated and crashed.
So either turn off the option in apps that prevents the screensaver, or snooze BOINC. Also, remember that the snooze is time-limited and BOINC will resume again if you don't watch out. That's another thing I wish they'd change: if I snooze BOINC, I want it to remain suspended until I tell it otherwise. Or you could just exit BOINC and the science apps altogether until you're done with your music, movie, or whatever, then relaunch afterwards.
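One stopgap for the time-limited snooze: boinccmd can suspend computation indefinitely. A sketch, assuming the stock command-line tool:

  boinccmd --set_run_mode never   (stay suspended until you change it back)
  boinccmd --set_run_mode auto    (resume running according to preferences)

Unlike the Manager's snooze, 'never' sticks until you explicitly switch it back.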
Perhaps the best solution right now is to do as Tank_Master suggested and use RivaTuner to manage fan speeds based on temps. But then you have to create profiles, launchers, etc.
I just don't understand why nVidia took automatic fan speeds out of their drivers. I understand there are a few overclockers out there who want to control them differently. Fine, add a checkbox in the control panel to disable the driver's built-in automatic fan control. But then, for the other 99% of PC users on the planet who simply want to put a card in and have it work on its own, everything would just work. Most PC users don't monitor temps, don't tweak things, and don't download third-party hardware tweakers. All they know is they try to run a game like Crysis on a card that should handle it at full settings, and suddenly they get 'nvlddmkm.sys' errors, or messages saying the driver encountered an error and recovered, or simply BSODs, and they don't realize the cause is their cards overheating and becoming unstable.
nVidia needs to fix this, because people are increasingly giving them a reputation for error-prone cards, especially the GTX 200 series that have a hard time staying cool anyway. It's a simple fix. Why screw 99% of users to make a few overclockers happy, when the checkbox I've described would make everyone happy? All of these ideas we're working on are simply trying to deal with an intentional 'feature' of nVidia's crappy drivers. Frustrating...

Jeremy (Joined: 15 Feb 09, Posts: 55, Credit: 3,542,733, RAC: 0)
nVidia did NOT disable automatic fan control. However, I've found that any number of software apps can break it. The default is a 40% baseline speed. If you EVER adjust the fan speed in the nVidia control panel, automatic fan speed goes away, and the fan will not spin faster or slower than whatever you set there. Certain tweaks in RivaTuner will break automatic fan speed adjustment as well. Uninstall and reinstall the nVidia drivers to fix this and make automatic fan adjustment work again.
My eventual solution to the "fan speed doesn't ramp up until the GPU exceeds 80C" problem was to re-flash my video card BIOS. This is what you'll need:
- GPU-Z to dump the current BIOS from the card (it will also show you how fast your GPU fan is spinning)
- NiBiTor to modify the BIOS
- nvFlash to re-flash your card
Within NiBiTor, under the Temperature tab, there is a FanSpeed IC button that lets you change how the video card fan behaves. I'd be happy to share my settings if you'd like; they've worked quite well for me, and temps are much better controlled than they were before.
It's really not as intimidating as it sounds, and it's a much more elegant solution than playing around with software to achieve mediocre and unreliable results, IMHO. Google searching "nvidia bios flash" and similar will turn up step-by-step guides to the process.
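For anyone attempting this, here is the rough shape of the nvFlash step, from my memory of the tool, so double-check the exact switches against nvflash's own help output before flashing anything (the file names are placeholders):

  nvflash -b backup.rom    (save the card's current BIOS so you can go back)
  nvflash modified.rom     (flash the BIOS you edited in NiBiTor)

Keep that backup somewhere safe; a bad flash on your only video card is painful to recover from.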