How to make sure that the GPU is used during training?

Solved

khoefle
Contributor II

Hi,

I am new to eIQ Portal and am wondering which cuDNN/TensorFlow versions must be preinstalled, or whether they ship with the program itself.

When launching eIQ Portal.exe, I get the following messages:

[CONVERTER] 2021-09-14 11:26:19.915128: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-09-14 11:26:19.915436: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

11:26:20.378 > [TRAINER] 2021-09-14 11:26:20.378297: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-09-14 11:26:20.378535: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

11:26:21.592 > [CONVERTER] 2021-09-14 11:26:21.592557: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll

11:26:21.624 > [CONVERTER] 2021-09-14 11:26:21.624599: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

11:26:21.627 > [CONVERTER] 2021-09-14 11:26:21.627850: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: ***-***-*****
2021-09-14 11:26:21.628134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: ***-***-*****

11:26:22.178 > [TRAINER] 2021-09-14 11:26:22.178820: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll

11:26:22.208 > [TRAINER] 2021-09-14 11:26:22.208431: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:b3:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 11.00GiB deviceMemoryBandwidth: 573.69GiB/s
2021-09-14 11:26:22.209377: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-09-14 11:26:22.210229: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
2021-09-14 11:26:22.211134: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found

11:26:22.212 > [TRAINER] 2021-09-14 11:26:22.212279: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2021-09-14 11:26:22.213105: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found

11:26:22.214 > [TRAINER] 2021-09-14 11:26:22.213866: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found

11:26:22.214 > [TRAINER] 2021-09-14 11:26:22.214517: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
2021-09-14 11:26:22.214696: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...



This shows that the GPU (GeForce RTX 2080 Ti) is detected, but the CUDA and cuDNN libraries fail to load, so GPU registration is skipped. Do I need to install them, and if so, which versions?

Regards,

K

1 Solution
mgandhi
Contributor II

@khoefle You will need to install cuDNN v7.x for the CUDA Toolkit version you have installed. You can download it from the cuDNN archive page (an NVIDIA Developer login is required, I believe). Here are the instructions on how to install cuDNN. I wasn't able to run cuDNN v8.x; it looks like eIQ Portal requires v7.x (the cudnn64_7.dll file). On the archive page, v7.x goes up to CUDA v10.2, so if you have a later version installed, you may need to uninstall it and install an older one.
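As a rough aid for reading the log in the question: the DLL names in the loader warnings hint at which versions are being requested. The sketch below assumes the usual Windows naming convention (the last digit of cudart64_NNN.dll is the minor version, and cudnn64_N.dll carries the cuDNN major version). guess_version is a hypothetical helper written for this post, not part of eIQ Portal or TensorFlow.

```python
import re

def guess_version(dll_name):
    """Guess a CUDA/cuDNN version from a Windows DLL name (assumed naming convention)."""
    match = re.fullmatch(r"(cudart|cudnn)64_(\d+)\.dll", dll_name)
    if match is None:
        return None  # not a name this sketch knows how to interpret
    lib, digits = match.groups()
    if lib == "cudart" and len(digits) >= 2:
        # Assumption: the last digit is the minor version, e.g. 101 -> 10.1.
        return f"CUDA {digits[:-1]}.{digits[-1]}"
    if lib == "cudnn":
        return f"cuDNN {digits}.x"
    return None

# DLL names copied from the "Could not load dynamic library" warnings above.
for name in ["cudart64_101.dll", "cudnn64_7.dll"]:
    print(name, "->", guess_version(name))
```

Under that assumption, the log is asking for the CUDA 10.1 runtime and cuDNN 7.x.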

It would be nice if NXP had some instructions on this (I didn't find any). 


10 Replies
david_piskula
NXP Employee

Hello @khoefle ,

The CUDA Toolkit version depends on the TensorFlow version supported by eIQ Portal. In the currently released eIQ Portal, the TensorFlow version is 2.3.2.

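Please try installing cuDNN v7.6, as @mgandhi mentioned, together with CUDA 10.1 (the log above asks for cudart64_101.dll, and the table linked below lists CUDA 10.1 for TensorFlow 2.3).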

[Attached image: david_piskula_0-1631799368542.png]

https://www.tensorflow.org/install/source#gpu
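One way to sanity-check the system-wide CUDA/cuDNN installation outside of eIQ Portal is to ask TensorFlow directly which GPUs it can see. This is only a sketch and rests on an assumption: that you have a separate Python environment with a matching TensorFlow release (for example 2.3.x) installed. It does not inspect the TensorFlow copy bundled with eIQ Portal. visible_gpus is a hypothetical helper name.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf  # assumed to be installed separately from eIQ Portal
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not available in this Python environment")
elif gpus:
    print("TensorFlow sees:", gpus)
else:
    print("TensorFlow sees no GPU; check the CUDA/cuDNN libraries")
```

An empty list here usually goes along with the same "Could not load dynamic library" warnings that eIQ Portal prints.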

Best Regards,

David

Ramson
Contributor IV

Hi @mgandhi 
Do we have to build TensorFlow 2.3.2 from source? We are facing an issue building it (https://github.com/tensorflow/tensorflow/issues/52092). We have installed CUDA 10.1 with cuDNN 7.6, but the application is still running on the CPU and not on the GPU. We have been struggling for days to get the application to run on the GPU. Can you please help me out?

Thanks and regards,

Ramson Jehu K

khoefle
Contributor II

Hey @Ramson 
You do not need to build it from source; it is perfectly fine to install everything the "official" way, as documented by NVIDIA.

A good way to locate the problem is to start eIQ Portal.exe from the command line: it prints a lot of information about what is missing and which DLLs cannot be loaded.
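If the console output is long, a small script can pull out the relevant lines. The following is a minimal sketch, assuming the output has been saved as text and that the messages look like the TensorFlow lines quoted in this thread; summarize is a hypothetical helper. Messages that the console has split in the middle of a word will not match until the pieces are joined again.

```python
import re

LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
MISSING = re.compile(r"Could not load dynamic library '([^']+)'")
GPU_READY = "Created TensorFlow device"  # printed once TensorFlow has set up a GPU device

def summarize(log_text):
    """Report which libraries loaded, which are missing, and whether a GPU device was created."""
    return {
        "loaded": sorted(set(LOADED.findall(log_text))),
        "missing": sorted(set(MISSING.findall(log_text))),
        "gpu_device_created": GPU_READY in log_text,
    }

# Two messages copied from the log in the original question.
sample = (
    "I tensorflow/stream_executor/platform/default/dso_loader.cc:48] "
    "Successfully opened dynamic library nvcuda.dll\n"
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:59] "
    "Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found\n"
)
print(summarize(sample))
```

An empty "missing" list together with a "Created TensorFlow device" line is a good sign that the trainer has picked up the GPU.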

Regards,

Kevin

Ramson
Contributor IV

Hi @khoefle, when running from the command line, this is the output. Do you see any issues in it? Please let me know.

 

C:\nxp\eIQ_Toolkit_v1.0.5>"eIQ Portal.exe"
C:\nxp\eIQ_Toolkit_v1.0.5>
NXP eIQ Portal version 2.1.30
Launch -> C:\nxp\eIQ_Toolkit_v1.0.5

12:56:24.705 > Launching Application
Display size is 2048x1152

12:56:30.810 > [CONVERTER] 2021-10-06 12:56:30.810279: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll

12:56:31.176 > [TRAINER] 2021-10-06 12:56:31.176046: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll

12:56:45.324 > [CONVERTER] 2021-10-06 12:56:45.324089: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll

12:56:45.422 > [CONVERTER] 2021-10-06 12:56:45.422699: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

12:56:45.425 > [CONVERTER] 2021-10-06 12:56:45.425487: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: Surya-PC
2021-10-06 12:56:45.425564: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: Surya-PC

12:56:50.625 > [TRAINER] 2021-10-06 12:56:50.625269: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll

12:56:50.723 > [TRAINER] 2021-10-06 12:56:50.723698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.83GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2021-10-06 12:56:50.723719: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll

12:56:51.347 > [TRAINER] 2021-10-06 12:56:51.347532: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll

12:56:51.518 > [TRAINER] 2021-10-06 12:56:51.518268: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll

12:56:51.632 > [TRAINER] 2021-10-06 12:56:51.632815: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll

12:56:52.157 > [TRAINER] 2021-10-06 12:56:52.156900: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll

12:56:52.523 > [TRAINER] 2021-10-06 12:56:52.522859: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll

12:56:53.614 > [CONVERTER] * Serving Flask app "deepview.modelserver.modelserver" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off

12:56:53.615 > [CONVERTER] * Running on http://127.0.0.1:10816/ (Press CTRL+C to quit)

12:56:53.648 > [TRAINER] 2021-10-06 12:56:53.648551: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll

12:56:53.674 > [TRAINER] 2021-10-06 12:56:53.674163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0

12:56:53.678 > [TRAINER] 2021-10-06 12:56:53.678657: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

12:56:54.013 > [TRAINER] 2021-10-06 12:56:54.013205: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x29e7b24c130 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-06 12:56:54.013221: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version

12:56:54.030 > [TRAINER] 2021-10-06 12:56:54.030133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.83GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2021-10-06 12:56:54.030153: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2021-10-06 12:56:54.030159: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2021-10-06 12:56:54.030163: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cufft64_10.dll
2021-10-06 12:56:54.030168: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library curand64_10.dll
2021-10-06 12:56:54.030172: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusolver64_10.dll
2021-10-06 12:56:54.030176: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cusparse64_10.dll
2021-10-06 12:56:54.030181: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2021-10-06 12:56:54.030236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0

12:56:57.031 > [TRAINER] 2021-10-06 12:56:57.031582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-06 12:56:57.031604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-10-06 12:56:57.031609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N

12:56:57.076 > [TRAINER] 2021-10-06 12:56:57.076635: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

12:56:57.081 > [TRAINER] 2021-10-06 12:56:57.081194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4722 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)

12:56:57.116 > [TRAINER] 2021-10-06 12:56:57.116205: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x29e446551c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-06 12:56:57.116221: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2060, Compute Capability 7.5

 

Thanks and regards

Ramson jehu 

khoefle
Contributor II

Hi,

I do not see any issues here; it looks roughly the same as mine. I can only tell you that the line:

 [CONVERTER] 2021-10-06 12:56:45.422699: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

IS NOT A PROBLEM. 

I have the same GPU and eIQ Portal utilizes it. You can use nvidia-smi to check whether the GPU is being utilized.
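For a quick reading while a training run is in progress, nvidia-smi can be queried from a script. The sketch below is only an illustration: it assumes nvidia-smi is on the PATH and uses its --query-gpu and --format options; parse_row and query_gpus are hypothetical helper names. Fields that the driver cannot report come back as placeholders such as [N/A] and are mapped to None.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def to_int(text):
    """Convert a numeric field; non-numeric placeholders such as [N/A] become None."""
    return int(text) if text.isdigit() else None

def parse_row(row):
    """Split one CSV row into GPU name, utilization (%) and used memory (MiB)."""
    name, util, mem = [field.strip() for field in row.rsplit(",", 2)]
    return {"name": name, "utilization_pct": to_int(util), "memory_used_mib": to_int(mem)}

def query_gpus():
    """Run nvidia-smi once; return None if the tool is missing or the call fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return [parse_row(line) for line in result.stdout.splitlines() if line.strip()]

print(query_gpus())
```

If the utilization climbs well above its idle value once training starts, the GPU is most likely being used.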

Ramson
Contributor IV

Hi @khoefle ,

Thanks for the help. When I run the nvidia-smi command, I get the following output. Two things worry me. First, the CUDA version is shown as 11.2, although I have already uninstalled it and installed 10.2. Second, the GPU memory usage shows N/A.

[Attached image: MicrosoftTeams-image (4).png]

The nvidia-smi help gives the following explanation:

"used_gpu_memory" or "used_memory"
Amount memory used on the device by the context. Not available on Windows when running in WDDM mode because Windows KMD manages all the memory not NVIDIA driver.

khoefle
Contributor II

From this point on I can only guess, but maybe check whether your CUDA_PATH etc. are set correctly. Sorry for not being able to help you further.
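Two notes that may help with that check. The "CUDA Version" printed by nvidia-smi is the highest CUDA version the installed driver supports, not the toolkit version that is installed, so seeing 11.2 there does not by itself mean the older toolkit is missing. And here is a minimal sketch for checking the environment: it prints CUDA_PATH and looks for the DLLs named in the TensorFlow log in the directories on PATH. It assumes the libraries are resolved via PATH (eIQ Portal may also look elsewhere), and locate is a hypothetical helper.

```python
import os

# DLL names taken from the TensorFlow loader messages earlier in this thread.
REQUIRED = [
    "cudart64_101.dll", "cublas64_10.dll", "cufft64_10.dll", "curand64_10.dll",
    "cusolver64_10.dll", "cusparse64_10.dll", "cudnn64_7.dll",
]

def locate(names, search_dirs):
    """Map each file name to the first directory that contains it, or None."""
    found = {}
    for name in names:
        found[name] = next(
            (d for d in search_dirs if d and os.path.isfile(os.path.join(d, name))),
            None,
        )
    return found

print("CUDA_PATH =", os.environ.get("CUDA_PATH"))
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
for name, directory in locate(REQUIRED, path_dirs).items():
    print(f"{name}: {directory or 'NOT FOUND on PATH'}")
```

Any file reported as not found points at a toolkit or cuDNN directory that is missing from PATH, or at a version mismatch.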

Ramson
Contributor IV

Thank you so much for the help so far @khoefle . You have been really helpful.

Ramson
Contributor IV

Hi @khoefle ,

I'm not asking about building CUDA from source; I'm asking about building TensorFlow from source.

But running from the command line, as you suggested, is a great idea. Thank you so much, Kevin.
