Message boards : Number crunching : ERR: cudart64_80.dll all nulls. Should be just a link to the real one

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,729,024
RAC: 815,470
Message 52129 - Posted: 23 Jun 2019 | 14:07:16 UTC
Last modified: 23 Jun 2019 | 14:32:27 UTC

Got the error message as shown below. I did a binary compare of the file to the same dll on another computer and the problem file had its content all "0".
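For anyone wanting to repeat the "all nulls" check without a second machine to compare against, here is a minimal sketch in Python. The function name `is_all_nulls` and the example path are made up for illustration; the sketch only reports whether every byte in a file is 0x00 and says nothing about why.

```python
def is_all_nulls(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte in it is 0x00."""
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            seen_data = True
            # A chunk is all nulls when the count of null bytes equals its length.
            if chunk.count(b"\x00") != len(chunk):
                return False
    # An empty file is reported as False: it has no content rather than null content.
    return seen_data


# Hypothetical usage (the slot number is an assumption, not from the thread):
# print(is_all_nulls(r"C:\ProgramData\BOINC\slots\0\cudart64_80.dll"))
```

The file is read in chunks so that a large file such as the 145 MB cufft DLL does not have to be loaded into memory at once.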

(1) I have BOINC set to NOT ignore image checks on downloads. One would think this should not happen.

(2) I then noticed that the project directory had copies of all the dlls and I tested them and they were not null.

Directory of C:\ProgramData\BOINC\projects\www.gpugrid.net

06/18/2019 11:06 PM 366,016 _cudart64_80.dll
06/18/2019 11:12 PM 145,769,016 _cufft64_80.dll
06/18/2019 11:07 PM 1,262,080 _tcl86.dll
06/18/2019 11:07 PM 112,640 _zlib1.dll


IMHO, the DLLs in the slots should just be links to the actual DLLs in the project directory. Something is wrong, kaput or not kosher.
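A rough way to confirm that a slot copy still matches the project-directory original is a byte-for-byte comparison. The sketch below assumes the only renaming is the leading underscore seen in the directory listing above; `compare_slot_copy` and the slot path in the usage comment are illustrative names, not part of BOINC.

```python
import filecmp
from pathlib import Path


def compare_slot_copy(project_dir, slot_dir, open_name):
    """Compare <slot_dir>/<open_name> against <project_dir>/_<open_name>.

    Returns "identical", "different", or "missing".
    """
    project_file = Path(project_dir) / ("_" + open_name)
    slot_file = Path(slot_dir) / open_name
    if not project_file.is_file() or not slot_file.is_file():
        return "missing"
    # shallow=False forces a content comparison instead of trusting size/mtime.
    same = filecmp.cmp(project_file, slot_file, shallow=False)
    return "identical" if same else "different"


# Hypothetical usage (slot number 0 is an assumption):
# print(compare_slot_copy(
#     r"C:\ProgramData\BOINC\projects\www.gpugrid.net",
#     r"C:\ProgramData\BOINC\slots\0",
#     "cudart64_80.dll",
# ))
```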

Anyway, I replaced the nulled out dll with a good one and am keeping my fingers crossed.


Anyone know what happened to the BOINC website? If down for maintenance they should have put a redirect to an info page. I wanted to ask if the message below was from their program. I think it is a windows 10 error and not from their image check.


Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 52132 - Posted: 23 Jun 2019 | 17:10:39 UTC - in response to Message 52129.

Message in the Seti@home forums regarding BOINC. Dr. Anderson is quoted as saying the server for BOINC is dead. They look to replace it tomorrow sometime. Weekend and no one there to do it.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1626
Credit: 9,376,466,723
RAC: 19,051,824
Message 52133 - Posted: 23 Jun 2019 | 20:54:15 UTC - in response to Message 52129.
Last modified: 23 Jun 2019 | 20:56:45 UTC

It is the deployment policy of this project that the CUDA DLLs are copied, rather than linked, before each task starts. Many projects do this, so that they can download a fully-versioned reference file, but restore it to the generic name on copy as required for the dynamic linking to work.

<file_ref>
    <file_name>_cudart64_80.dll</file_name>
    <open_name>cudart64_80.dll</open_name>
    <copy_file/>
</file_ref>
<file_ref>
    <file_name>_cufft64_80.dll</file_name>
    <open_name>cufft64_80.dll</open_name>
    <copy_file/>
</file_ref>

Note that in the GPUGrid case, the renaming is very subtle - simply removing the prefix underscore. If the file copy on your machine resulted in a nulled image, then something is wrong with your hardware.
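To make the mapping concrete, the two file_ref entries quoted above can be read with a few lines of Python. This is only a sketch of how the fragment could be interpreted: the wrapping `<refs>` element is added here so the fragment parses as a single document, `parse_file_refs` is a made-up helper, and treating a missing `<copy_file/>` as "link" is an assumption based on the copied-rather-than-linked description above.

```python
import xml.etree.ElementTree as ET

# The fragment quoted in the post, laid out one element per line.
FILE_REFS = """
<file_ref>
    <file_name>_cudart64_80.dll</file_name>
    <open_name>cudart64_80.dll</open_name>
    <copy_file/>
</file_ref>
<file_ref>
    <file_name>_cufft64_80.dll</file_name>
    <open_name>cufft64_80.dll</open_name>
    <copy_file/>
</file_ref>
"""


def parse_file_refs(fragment):
    """Return one dict per <file_ref>: stored name, name the app opens, copy flag."""
    # Wrap the sibling elements in a dummy root so ElementTree accepts them.
    root = ET.fromstring("<refs>" + fragment + "</refs>")
    refs = []
    for ref in root.findall("file_ref"):
        refs.append({
            "file_name": ref.findtext("file_name"),
            "open_name": ref.findtext("open_name"),
            "copy": ref.find("copy_file") is not None,
        })
    return refs


for entry in parse_file_refs(FILE_REFS):
    action = "copy" if entry["copy"] else "link"
    print(f'{action}: {entry["file_name"]} -> {entry["open_name"]}')
```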

Speedy
Joined: 19 Aug 07
Posts: 43
Credit: 40,991,082
RAC: 68,082
Message 52134 - Posted: 23 Jun 2019 | 21:06:18 UTC - in response to Message 52132.
Last modified: 23 Jun 2019 | 21:07:42 UTC

Message in the Seti@home forums regarding BOINC. Dr. Anderson is quoted as saying the server for BOINC is dead. They look to replace it tomorrow sometime. Weekend and no one there to do it.

David said "hopefully we will have a new one up tomorrow" (Monday, June 24 today)

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 52135 - Posted: 23 Jun 2019 | 21:08:20 UTC - in response to Message 52129.
Last modified: 23 Jun 2019 | 21:09:33 UTC

delete

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 52136 - Posted: 23 Jun 2019 | 23:41:44 UTC
Last modified: 23 Jun 2019 | 23:47:12 UTC

I think it is a windows 10 error and not from their image check.

It does look like a Windows error, rather than an Application message.

If the file copy on your machine resulted in a nulled image, then something is wrong with your hardware

Your other two machines don't appear to have the same issue. So I would suspect hardware or OS corruption issue on that one machine.
Your PCs appear identical, so perhaps try swapping the RAM etc to another PC to see if the issue follows the hardware swap.
Are there any errors in the Event log, or has any software changed such as AV? If AV has been updated try White listing your BOINC data directory.

0xc000012f error in Windows can point to Software Driver issues, corruption and failed hardware.
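As far as I am aware, 0xC000012F is the NTSTATUS value STATUS_INVALID_IMAGE_NOT_MZ, meaning the loader did not find the "MZ" signature at the start of an executable image. That would fit a DLL whose content is all zeros. The sketch below checks only those first two bytes; `has_mz_header` and `find_bad_images` are illustrative helpers, and a file that passes is not thereby proven to be a valid DLL.

```python
from pathlib import Path


def has_mz_header(path):
    """Return True if the file starts with the two-byte DOS signature 'MZ'."""
    with open(path, "rb") as f:
        return f.read(2) == b"MZ"


def find_bad_images(directory, suffixes=(".dll", ".exe")):
    """List the names of DLL/EXE files in a directory that lack the MZ signature."""
    bad = []
    for p in sorted(Path(directory).iterdir()):
        if p.is_file() and p.suffix.lower() in suffixes and not has_mz_header(p):
            bad.append(p.name)
    return bad


# Hypothetical usage (slot number 0 is an assumption):
# print(find_bad_images(r"C:\ProgramData\BOINC\slots\0"))
```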

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,729,024
RAC: 815,470
Message 52153 - Posted: 27 Jun 2019 | 16:52:33 UTC - in response to Message 52136.
Last modified: 27 Jun 2019 | 17:33:15 UTC

I think it is a windows 10 error and not from their image check.

It does look like a Windows error, rather than an Application message.

If the file copy on your machine resulted in a nulled image, then something is wrong with your hardware

Your other two machines don't appear to have the same issue. So I would suspect hardware or OS corruption issue on that one machine.
Your PCs appear identical, so perhaps try swapping the RAM etc to another PC to see if the issue follows the hardware swap.
Are there any errors in the Event log, or has any software changed such as AV? If AV has been updated try White listing your BOINC data directory.

0xc000012f error in Windows can point to Software Driver issues, corruption and failed hardware.


Been looking at this. Each GTX1070 works fine on other systems. Both gtx1070 work fine for SETI, Einstein, etc but NOT gpugrid.

Symptom: I put the 2nd GTX1070 in and start BOINC. Project start is delayed 60 seconds to allow some debugging; otherwise the computer freezes. I get to look at the event messages and sure enough things do not look good (remember, this was working fine before I returned the GTX1070 to this system):

1 6/27/2019 11:17:35 AM Starting BOINC client version 7.14.2 for windows_x86_64
2 6/27/2019 11:17:35 AM log flags: file_xfer, sched_ops, task
3 6/27/2019 11:17:35 AM Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
4 6/27/2019 11:17:35 AM Data directory: C:\ProgramData\BOINC
5 6/27/2019 11:17:35 AM Running under account frick
6 6/27/2019 11:17:35 AM [error] Couldn't parse account file account_www.gpugrid.net.xml
7 6/27/2019 11:17:36 AM [error] Couldn't parse statistics_www.gpugrid.net.xml
8 6/27/2019 11:17:38 AM CUDA: NVIDIA GPU 0: GeForce GTX 1070 (driver version 430.64, CUDA version 10.2, compute capability 6.1, 4096MB, 3556MB available, 6852 GFLOPS peak)
9 6/27/2019 11:17:39 AM CUDA: NVIDIA GPU 1: GeForce GTX 1070 (driver version 430.64, CUDA version 10.2, compute capability 6.1, 4096MB, 3556MB available, 6463 GFLOPS peak)
10 6/27/2019 11:17:39 AM OpenCL: NVIDIA GPU 0: GeForce GTX 1070 (driver version 430.64, device version OpenCL 1.2 CUDA, 8192MB, 3556MB available, 6852 GFLOPS peak)
11 6/27/2019 11:17:39 AM OpenCL: NVIDIA GPU 1: GeForce GTX 1070 (driver version 430.64, device version OpenCL 1.2 CUDA, 8192MB, 3556MB available, 6463 GFLOPS peak)
12 GPUGRID 6/27/2019 11:17:39 AM [error] Project GPUGRID is in state file but no account file found
13 6/27/2019 11:17:39 AM [error] Application acemdlong outside project in state file
14 6/27/2019 11:17:39 AM [error] Application acemdshort outside project in state file


I stop BOINC (this is not necessary if GPUGRID is suspended), go to \ProgramData\BOINC and examine those two files:

account_www.gpugrid... is full of nulls
statistics_www.gpugrid…. is full of nulls

These files have been re-written by, it seems, cs_account.cpp, a module in the boinc client. At least that is where I traced the error messages to. If I could run the client under VS2017 I could possibly help debug the problem. It could be hardware but since both GTX1070 seem to work fine with other GPU projects then code is suspect. I replaced those two files from another system that also had gtx1070 and same problem: Both were re-written full of nulls when the boinc client started up.

Could one of the gtx1070's be attempting to process the checkpoint file that was created by the other gtx1070? There are differences as one is SC the other is plain jane.

GPUGrid makes more demands on hardware than SETI, so possibly the hardware problem may not show up easily or frequently.

[EDIT] With SETI running, I rebooted the system. This ensures that SETI will continue running and GPUGRID will have to wait its turn. The account file did not get rewritten with nulls. That indicates that cs_account.cpp probably did not rewrite it, so the project executable must have attempted a download of the account file, and either that works or the nulls happen then. Also, the statistics_www.gpugrid file is not the same one that I had copied from another computer (which is understandable), so there must be a problem elsewhere. However, the event messages show a problem with the statistics file (but not the account file), which is strange because, looking at the XML, I don't see anything unusual.
6/27/2019 11:58:27 AM [error] Couldn't parse statistics_www.gpugrid.net.xml


I keep temps low with EVGA Precision X, but sometimes that program does not start with Windows. I will probably pull the 1070 and use it elsewhere. I am thinking there is a hardware problem or an incompatibility of some type. I only noticed this problem after the 1903 Microsoft feature update.

[EDIT-2] Want to clarify something: I dragged the "good" account and statistics files from another system that has a GTX1070 and dropped them into the BOINC ProgramData directory. The files are good; they are only filled with nulls after I start the client, and that happens when GPUGRID is NOT suspended, i.e. the nulls occur (hardware or software or whichever) when the project code starts executing.
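One way to pin down when the two files get overwritten is to record their size and hash before starting the client and again afterwards. The sketch below is an assumption about how that could be done, not something from the BOINC code base; `snapshot` and `changed_files` are made-up names, and the file names in the usage comment are taken from the event log above.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to (size, sha256 hex digest), or None if the file is absent."""
    state = {}
    for p in map(Path, paths):
        if p.is_file():
            data = p.read_bytes()  # these XML files are small, so read them whole
            state[str(p)] = (len(data), hashlib.sha256(data).hexdigest())
        else:
            state[str(p)] = None
    return state


def changed_files(before, after):
    """Return the paths whose size or hash differs between two snapshots."""
    return sorted(p for p in before if before[p] != after.get(p))


# Hypothetical usage:
# files = [
#     r"C:\ProgramData\BOINC\account_www.gpugrid.net.xml",
#     r"C:\ProgramData\BOINC\statistics_www.gpugrid.net.xml",
# ]
# before = snapshot(files)   # taken with GPUGRID suspended
# ...resume the project, wait, then:
# print(changed_files(before, snapshot(files)))
```

Taking one snapshot with the project suspended and another after it resumes would show whether the rewrite coincides with the project starting.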

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Message 52155 - Posted: 27 Jun 2019 | 23:44:55 UTC - in response to Message 52153.

Sounds like you have spent some time on this!

Other steps to try:
1. "Reset Project" for GPUgrid in BOINC Manager. This should replace the GPUgrid account files.
2. Alternatively try to "Remove" GPUgrid project and then check if files are left over in the "ProgramData\BOINC\projects\www.gpugrid.net" folder. Manually remove any left over files and then re-add GPUgrid project
3. Complete all project tasks, Suspend all projects and then check for files left in ProgramData\BOINC\slots folder, remove any files here manually.

From what you have stated, not sure I would blame the GPU just yet. The nulled files could possibly indicate a disk error. Have you tried a check disk on the hard drive?

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,729,024
RAC: 815,470
Message 52156 - Posted: 28 Jun 2019 | 3:55:30 UTC - in response to Message 52155.
Last modified: 28 Jun 2019 | 4:03:42 UTC

Have you tried a check disk on the hard drive?


Pretty sure this is a hardware problem. I discovered that it can quickly reboot and pick up where it left off, to the point where I don't notice the problem. I ran that x86 memtest, swapped out the video board, and got a flashlight and looked for bad capacitors. I checked the power supply fan. No obvious problem, and I know what bad caps look like. I am doing a second check disk, but unaccountably there is no display of the check disk information, only a very dim raster. I know check disk is working because I can see the LED blink and hear clicking noise from the disk. Do you know of any way to check the results of the chkdsk /f/r after the system boots? This is definitely a hardware problem, as BOINC was not even running and the system reset the instant I logged in remotely, which was totally unexpected. By "reset" I mean it simply turned off as if the power cord was pulled at the exact instant I pressed the return key after entering my password.

rod4x4
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Message 52157 - Posted: 28 Jun 2019 | 5:11:36 UTC - in response to Message 52156.

Do you know of any way to check the results of the chkdsk /f/r after the system boots?

Chkdsk results can be found in the Event Viewer, Application Log (Hopefully your Windows starts). This website describes the process of finding chkdsk results:
https://support.4it.com.au/article/how-to-extract-the-check-disk-chkdsk-logs-from-event-viewer-on-windows/

The sudden power outages can cause Disk corruption. Also check your Event Log for unusual restarts of your PC.
The Power Supply is your usual suspect, but have seen bad motherboard, bad RAM or bad CPU cause reboots.

Good luck.

JStateson
Joined: 31 Oct 08
Posts: 186
Credit: 3,407,729,024
RAC: 815,470
Message 52163 - Posted: 30 Jun 2019 | 3:59:20 UTC

Replaced power supply and things are back to normal.

I did not see any bad caps in the power supply but it had dust packed tightly in nooks and corners, the worst I have ever seen. I had to disassemble it and remove the fan to allow cleaning. This was a very dense "pro 1-kW" from XFX and the compressed air cans I use were not capable of clearing out all the dust. Had to use tweezers to remove some of the caked dust. Going to use the 1KW on a mining system. I had forgotten I had one that big and that HPz-400 only needed 600 watts.
