|
Dear all,
I am quite confused by this error and can't wrap my head around it. Could you help me out by pointing out what might cause it to recur?
As I only started out recently, I was pretty cautious about overclocking my GPU. It is an ancient GTX 750 Ti that was not overclocked at the beginning; it returned results quite slowly, but all things considered it performed well, returning valid WUs. Once I applied a stable overclock, a few WUs errored out almost immediately, stating that memory leakages were detected. After that I returned to stock clock speeds and adjusted back up again very slowly from there. At the current modest OC setting, WUs were flowing in steadily and getting validated until recently, when I received the following error on 3 consecutive WUs: 195 (0xc3) EXIT_CHILD_FAILED
See this WU, for example: http://www.gpugrid.net/result.php?resultid=28047723
The following is the stderr output of the corresponding WU. My system is Windows-based with a Xeon X5660, one dedicated thread for the GPU WUs, and a modest overclock on the GPU. Could this error also be connected to my OC? Should I revert to stock settings? Thanks for any input!
<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
05:18:52 (6772): wrapper (7.9.26016): starting
05:18:52 (6772): wrapper: running acemd3.exe (--boinc input --device 0)
# Engine failed: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
06:36:29 (6772): acemd3.exe exited; CPU time 4632.046875
06:36:29 (6772): app exit status: 0x1
06:36:29 (6772): called boinc_finish(195)
0 bytes in 0 Free Blocks.
504 bytes in 8 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 53130984 bytes.
Dumping objects ->
{1623} normal block at 0x0000023FEFEF8470, 48 bytes long.
Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52
{1612} normal block at 0x0000023FEFEF88D0, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{1601} normal block at 0x0000023FEFEF8D30, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{1590} normal block at 0x0000023FEFEF8240, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{1579} normal block at 0x0000023FEFEF81D0, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{1568} normal block at 0x0000023FEFED5F80, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {1565} normal block at 0x0000023FEFEF45B0, 8 bytes long.
Data: < ? > 00 00 E7 EF 3F 02 00 00
{895} normal block at 0x0000023FEFED47D0, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{202} normal block at 0x0000023FEFEF4650, 8 bytes long.
Data: <` ? > 60 8F EF EF 3F 02 00 00
{196} normal block at 0x0000023FEFEF8940, 48 bytes long.
Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D
{195} normal block at 0x0000023FEFEF4970, 16 bytes long.
Data: <X ? > 58 BF ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{194} normal block at 0x0000023FEFEF5000, 16 bytes long.
Data: <0 ? > 30 BF ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{193} normal block at 0x0000023FEFEF4E20, 16 bytes long.
Data: < ? > 08 BF ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{192} normal block at 0x0000023FEFEF5230, 16 bytes long.
Data: < ? > E0 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{191} normal block at 0x0000023FEFEF47E0, 16 bytes long.
Data: < ? > B8 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{190} normal block at 0x0000023FEFEF4EC0, 16 bytes long.
Data: < ? > 90 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{189} normal block at 0x0000023FEFEF8710, 48 bytes long.
Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F
{188} normal block at 0x0000023FEFEF4920, 16 bytes long.
Data: <@ ? > 40 B9 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{187} normal block at 0x0000023FEFEEA320, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{186} normal block at 0x0000023FEFEF51E0, 16 bytes long.
Data: < ? > 18 B9 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{184} normal block at 0x0000023FEFEF4AB0, 16 bytes long.
Data: < ? > F0 B8 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{183} normal block at 0x0000023FEFEF4D80, 16 bytes long.
Data: < ? > C8 B8 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{182} normal block at 0x0000023FEFEF52D0, 16 bytes long.
Data: < ? > A0 B8 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{181} normal block at 0x0000023FEFEF48D0, 16 bytes long.
Data: <x ? > 78 B8 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{180} normal block at 0x0000023FEFEF4F10, 16 bytes long.
Data: <P ? > 50 B8 ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{179} normal block at 0x0000023FEFEDB850, 280 bytes long.
Data: < O ? ? > 10 4F EF EF 3F 02 00 00 D0 81 EF EF 3F 02 00 00
{178} normal block at 0x0000023FEFEF49C0, 16 bytes long.
Data: <p ? > 70 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{177} normal block at 0x0000023FEFEF4600, 16 bytes long.
Data: <H ? > 48 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{176} normal block at 0x0000023FEFEF4A60, 16 bytes long.
Data: < ? > 20 BE ED EF 3F 02 00 00 00 00 00 00 00 00 00 00
{175} normal block at 0x0000023FEFEDBE20, 496 bytes long.
Data: <`J ? acemd3.e> 60 4A EF EF 3F 02 00 00 61 63 65 6D 64 33 2E 65
{64} normal block at 0x0000023FEFEF4510, 16 bytes long.
Data: < > 80 EA 18 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000023FEFEE8830, 16 bytes long.
Data: <@ > 40 E9 18 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000023FEFEE83D0, 16 bytes long.
Data: < W > F8 57 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000023FEFEE8790, 16 bytes long.
Data: < W > D8 57 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000023FEFEE86A0, 16 bytes long.
Data: <P > 50 04 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000023FEFEE8330, 16 bytes long.
Data: <0 > 30 04 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000023FEFEE8970, 16 bytes long.
Data: < > E0 02 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000023FEFEE8100, 16 bytes long.
Data: < > 10 04 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000023FEFEE8A10, 16 bytes long.
Data: <p > 70 04 15 1A F7 7F 00 00 00 00 00 00 00 00 00 00
{55} normal block at 0x0000023FEFEE8240, 16 bytes long.
Data: < > 18 C0 13 1A F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]> |
|
|
|
"a few WUs errored out almost immediately, stating that memory leakages were detected." That's a false alarm. Even successful tasks contain that message, for example: http://www.gpugrid.net/result.php?resultid=28081659
"The specs of my system ... modest overclock on the GPU. Could this error also be connected to my OC?" Yes.
"Should I revert back to the stock settings?" I would start with that. Different workunits tolerate different levels of overclocking.
If this card does not have its own PCI power connector, then it draws all of its power from the motherboard, in this case overclocking is not recommended.
Check the following too:
What is its operating temperature?
Are its fans rotating ok?
Is its heatsink clean?
Is the thermal interface material ok between the GPU chip and the heatsink?
Are the 12V pins (yellow cables) on the 24-pin motherboard power connector ok? |
|
|
|
Thanks Zoltán for getting back to me!
Glad to hear that the apparent error message I received also appears in non-erroneous WUs.
I suspected the OC nonetheless. I'll probably revert to stock settings and go from there. Still curious why it occurred, though, because it was stable for a couple of days before the error messages appeared.
This card specifically is an ASUS GTX750TI - OC - 2GB GDDR5, so it already comes with a factory OC. I managed to go as high as roughly 1.4 GHz on the core clock (an OC of ~200 MHz on both memory and core clock), which was stable in hour-long stress testing and has held up so far in many other BOINC projects. At the same setting, however, GPU-Grid WUs always errored out immediately (< 3 min). This is why I scaled it back by a factor of 2 to a "modest" 100 MHz OC on memory and core clock. After 2 days of successfully running GPU Grid 24/7 and only getting valid WUs, I raised the OC by 15 MHz per cycle and scaled it back once the first WU failed. I arrived at an OC setting of +115 MHz on both clocks, which has seemed stable over the last week since the WU queue filled up again.
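The step-up-then-back-off search described above can be sketched as a simple loop. This is just an illustration of the procedure, not anything GPU-specific; `run_wu_batch` is a hypothetical stand-in for crunching a batch of WUs at a given clock offset and reporting whether they all validated:

```python
def find_stable_offset(run_wu_batch, start_mhz=100, step_mhz=15, max_mhz=200):
    """Raise the core/mem clock offset one step at a time, backing off
    (i.e. stopping at the last good offset) as soon as a batch fails."""
    offset = start_mhz
    while offset + step_mhz <= max_mhz:
        if not run_wu_batch(offset + step_mhz):  # a WU errored out at the higher offset
            break
        offset += step_mhz
    return offset

# Stand-in stability model: everything up to +115 MHz validates.
print(find_stable_offset(lambda mhz: mhz <= 115))  # → 115
```

With a 15 MHz step starting from the known-good +100 MHz, the search lands on +115 MHz, matching the setting described above.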
This particular dual-slot card is installed on an old motherboard with a PCIe v2 x16 slot (75W) but comes with an additional separate power connector directly on the card. As the card is rated at 60W TDP, power supply shouldn't be an issue: the PSU is 475W and my processor is 95W, so there should be more than enough juice for this card.
Regarding your checklist, I can already check a couple of boxes.
1) Operating temperature is 61-63 degrees at moderate ambient temps, and up to 65 degrees in summer (>=30 degrees outside), on the Windows machine with a custom fan curve via MSI Afterburner; that seems okay to me. On Linux it tends to run 2-3 degrees hotter, though never above 66/67 degrees. Both OSes of the dual-boot system load the GPU at nearly 100%.
2) Fans are rotating steadily and don't make the "whining" sound of a defective fan.
3+4) I bought it second-hand recently, and the first thing I did was clean the fans and heatsink and renew the thermal paste.
5) Hadn't checked that before, but it's fine :)
Anyway, at the current processing rate and RAC, I will need roughly 10 days of 24/7 operation per 1M credits. Considering the efficiency aspect as well, I might "upgrade" my card once the new cards cause the second-hand market to be swamped with older ones. Electricity isn't cheap here either... I might aim for a GTX 1050/1060 or one of the more powerful 1080/1650/1660 cards. I would probably prefer a Turing card such as the GTX 1650, which is also rated below the 100W mark. Any advice on that?
Thanks again!! |
|
|
|
Thanks Zoltán for getting back to me!
+1
It's always nice to read Zoltan's precise comments!
bozz4science: Congratulations! You've troubleshot your problem perfectly.
I also currently have one factory-overclocked GTX750Ti card in production, and I appreciate it.
(Recently, I had to hardware-troubleshoot it, as I described in this post).
"I might consider 'upgrading' my card once the new cards will cause the second-hand market to get swamped with older cards."
You might be interested in taking a look at this thread, where this card's performance and efficiency are compared to other more recent models.
|
|
|
|
Hey ServicEnginIC, thanks for pointing out these threads and the kind words ;)
Looking especially at your detailed RP comparison table, a GTX 1660 Ti should be a safe bet, but after checking the current resale prices on eBay, I am really surprised how well the prices for those cards have held up over time. The issue I see is that you never really know what you're getting when buying old hardware, as you can't really see inside those cards from their listings. As the warranty is also long voided, and I would grill those cards pretty hard, I don't know how to feel about spending 150-250€ on a GTX 1650/1660 Ti that might only hold up for a few months or so...
I'll probably start saving up to renew my whole system anyway. I'll keep this in mind! Thanks for the valuable information :) |
|
|
Erich56
Joined: 1 Jan 15 | Posts: 1132 | Credit: 10,267,132,676 | RAC: 28,868,347
|
"... This card specifically is an ASUS GTX750TI - OC - 2GB GDDR5, so it already comes with a factory OC. I managed to go as high as roughly 1.4 GHz on the core clock speed"
It's interesting for me to read this.
I am also running a GTX750Ti on two of my machines; one is made by MSI, the other by Zotac.
The max core clock speed I can achieve is slightly above 1.2 GHz. Beyond this, GPU-Z shows a performance cap: "limited by total power limit". This seems to be because these cards draw all their electricity from the motherboard alone (as Zoltan mentioned above).
Nevertheless, I am very satisfied with these two cards; they have been crunching for more than 3 years.
|
|
|
|
I am pretty happy with this card as well, and might decide to keep it running along a newer GPU someday soon hopefully :)
Note, however, Erich, that the OC settings I mentioned, which yield a max of 1414 MHz core clock and 2900 MHz memory clock, are only stable on a few projects and not on GPU Grid; Milkyway is one such example. For most math-based projects that run GPU apps (PrimeGrid, SRBase, NumberFields), I only get a stable running environment with an OC setting right in between that maximum and the one I apply here at GPU Grid (and Einstein).
Usually the card can maintain the boost clock defined by the low-to-medium OC setting over time at high overall load, but not the highest OC setting; those boosts are rather short and usually don't yield any major benefit to average runtimes. So I usually stay with the "modest" OC setting, with the core clock sitting at 1320 MHz and the memory clock at 2805 MHz, which still yields a noticeable speedup over stock clocks but should also be healthier for the card itself... Temps usually sit a little above 60 degrees and fan speeds usually don't go above 60%, so that's fine with me.
Hopefully mine will also endure a couple more years like yours! |
|
|
Steffen
Joined: 2 Mar 19 | Posts: 2 | Credit: 48,438,972 | RAC: 0
|
I get the same error with A100 cards.
https://www.gpugrid.net/result.php?resultid=28815282
Had the opportunity to also test V100s and these were fine.
Ideas?
Many thanks!
Steffen |
|
|
Keith Myers
Joined: 13 Dec 17 | Posts: 1341 | Credit: 7,681,571,308 | RAC: 13,214,224
|
"I get the same error with A100 cards.
https://www.gpugrid.net/result.php?resultid=28815282
Had the opportunity to also test V100s and these were fine.
Ideas?
Many thanks!
Steffen"
Thanks for the post. I wondered whether a new application would be needed; I thought it likely, given the compute capability (CC) of 8.6.
All the previous apps topped out at CC 7.5, i.e. Turing and Volta.
So we will probably need a new wrapper app and science app.
Interestingly, the 3080 is running OK on the existing Milkyway, Einstein and Primegrid apps. |
|
|
|
"I get the same error with A100 cards.
https://www.gpugrid.net/result.php?resultid=28815282
Had the opportunity to also test V100s and these were fine."
You can find the error message containing the real reason for the failure five lines further down:
# Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
It says that the Ampere architecture is not supported by the app (yet). |
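In other words, the app compiles its kernels at runtime (via nvrtc) for a fixed list of GPU architectures, and an Ampere card's compute capability falls outside that list. A minimal sketch of that kind of compatibility check follows; the supported-CC set and architecture names here are illustrative, not the app's actual list:

```python
# Compute capabilities a hypothetical app was built for; Turing (7.5) is the newest.
SUPPORTED_CCS = {"3.5", "5.0", "5.2", "6.1", "7.0", "7.5"}

ARCH_NAMES = {"5.0": "Maxwell", "6.1": "Pascal", "7.0": "Volta",
              "7.5": "Turing", "8.0": "Ampere", "8.6": "Ampere"}

def check_gpu(cc: str) -> str:
    """Report whether a card's compute capability is covered by the app."""
    arch = ARCH_NAMES.get(cc, "unknown")
    if cc in SUPPORTED_CCS:
        return f"CC {cc} ({arch}): supported"
    return f"CC {cc} ({arch}): invalid value for --gpu-architecture"

print(check_gpu("7.0"))  # V100 (Volta): supported
print(check_gpu("8.0"))  # A100 (Ampere): rejected, matching the nvrtc error above
```

This matches the reports in this thread: V100s (CC 7.0) work, while A100s (CC 8.0) and consumer Ampere cards (CC 8.6) fail until the app is rebuilt with Ampere targets.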
|
|
Keith Myers
Joined: 13 Dec 17 | Posts: 1341 | Credit: 7,681,571,308 | RAC: 13,214,224
|
I can understand why the Ampere cards are working OK at Milkyway and Einstein: both use OpenCL apps, and the Nvidia OpenCL API has already been updated for the new cards.
But why the Primegrid CUDA app still works, I don't have a clue. The existing app is from May 2019, well before Ampere, unless Primegrid's app is also pulling from the current Ampere-compatible CUDA API.
And GPUGrid's does not. It seems we will need Toni to comment on this issue.
|
|
|
|
I am also running a GTX 750 Ti (pulled from a 2015 Alienware) and, after seeing what ServicEnginIC was able to squeeze out of his Asus (the best, IMHO), I found my card runs cool and stable at 1350 MHz core and 2833 MHz memory.
I tried pushing the memory too hard and threw a couple of exception errors late in the runs.
It now crunches around 109K points a day. |
|
|
|
Just FYI: after a lot of tweaking, I now run the card at a 1361 MHz core and 2820 MHz memory clock, and those settings have proven rock solid across many GPU projects. I have never run GPUGrid exclusively 24/7 long enough for the RAC to converge to a steady value, but I reckon it also produces slightly above 100k points a day. The card also runs cool (60-62 degrees) at moderate fan settings (35-55%). I guess we squeezed the maximum out of this card. |
|
|