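GPU-accelerated deep learning with CUDA and cuDNN has been mainstream for years, and in that time the graphics driver, CUDA, and the cuDNN library have each been updated many times. With the release of TensorFlow 2.0 and PyTorch 1.3, it is time to bring our deep learning machines up to the latest runtime environment, especially if they are still on CUDA 8.0. This article describes how to uninstall an old CUDA version (8.0 here) and install a new one (10.0).
Back in 2017 the author of the AI Lemon blog wrote an article on installing the GPU build of TensorFlow, "Installing the GPU Version of TensorFlow on Linux" (《Linux系统下安装TensorFlow的GPU版本》). Refer to that article for the TensorFlow installation itself; it is kept up to date on TensorFlow questions such as which dependency versions match which release.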
Preparation
First, download the following two files from the NVIDIA website: CUDA 10.0 and cuDNN 7.4.
- cuda_10.0.130_410.48_linux
- cudnn-10.0-linux-x64-v7.4.2.24.solitairetheme8
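Uninstalling the old CUDA version
Before uninstalling, stop the graphics-related services, such as the X display manager lightdm. Press Ctrl+Alt+F1 to switch to a text console, log in with your username and password, and run the following commands.
$ sudo systemctl stop lightdm
$ cd /usr/local/cuda-8.0/bin
$ sudo ./uninstall_cuda_8.0.pl
This uninstalls CUDA 8.0. The leftover "cuda-8.0/" directory can simply be deleted afterwards.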
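To confirm which toolkit directories are still present after the uninstall, something like the following sketch may help. It assumes the toolkits live under /usr/local, as in this article; `list_cuda_dirs` is a hypothetical helper name, not part of CUDA.

```shell
#!/usr/bin/env bash
# list_cuda_dirs (hypothetical helper): print the names of cuda-* directories
# found under ROOT (default: /usr/local), one per line.
list_cuda_dirs() {
    local root="${1:-/usr/local}"
    local d
    for d in "$root"/cuda-*; do
        # The glob stays literal when nothing matches, so test for a directory.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Show what is still installed and where the cuda symlink points, if it exists.
list_cuda_dirs /usr/local
if [ -L /usr/local/cuda ]; then
    echo "/usr/local/cuda -> $(readlink -f /usr/local/cuda)"
fi
```

If "cuda-8.0" still shows up here after the uninstall script has run, it is the leftover directory mentioned above.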
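Installing the new CUDA version
Locate the CUDA 10 and cuDNN 7.4 files downloaded earlier, and start by installing CUDA 10 with the following command.
$ sudo sh cuda_10.0.130_410.48_linux
The installer first shows the CUDA end user license agreement. You can press Ctrl+C to skip straight to the end of it, then type "accept" to accept the agreement.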
Logging to /tmp/cuda_install_11026.log
Using more to view the EULA.
End User License Agreement
--------------------------

Preface
-------

The Software License Agreement in Chapter 1 and the Supplement
in Chapter 2 contain license terms and conditions that govern
the use of NVIDIA software. By accepting this agreement, you
agree to comply with all the terms and conditions applicable
to the product(s) included herein.

NVIDIA Driver

Description

This package contains the operating system driver and
fundamental system software components for NVIDIA GPUs.

NVIDIA CUDA Toolkit

Description

The NVIDIA CUDA Toolkit provides command-line and graphical
tools for building, debugging and optimizing the performance
of applications accelerated by NVIDIA GPUs, runtime and math
libraries, and documentation including programming guides,
user manuals, and API references.

Default Install Location of CUDA Toolkit

Windows platform:
%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v#.#

Linux platform:
/usr/local/cuda-#.#

Mac platform:
/Developer/NVIDIA/CUDA-#.#

NVIDIA CUDA Samples

Description

This package includes over 100+ CUDA examples that demonstrate
various CUDA programming principles, and efficient CUDA
implementation of algorithms in specific application domains.

Do you accept the previously read EULA?
accept/decline/quit: accept
Because the NVIDIA driver also has to be updated, answer "y" to the prompt "Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?" so that the new driver is installed.
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: y

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: y

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:

Install the CUDA 10.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-10.0 ]:

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 10.0 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
 [ default is /home/gpu ]:

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
Missing recommended library: libGLU.so
Missing recommended library: libXmu.so

Installing the CUDA Samples in /home/gpu ...
Copying samples to /home/gpu/NVIDIA_CUDA-10.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /home/gpu, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_11026.log
Signal caught, cleaning up
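Once the installer ends with output like this and reports no other errors, the new CUDA version is installed. Next, set up the environment variables for it by appending the following to the end of ~/.bashrc:
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
The first two (PATH and LD_LIBRARY_PATH) are the variables recommended by the official CUDA installation guide. The third (CUDA_HOME) is needed by the GPU build of TensorFlow.
After editing the environment variables, be sure to reload ~/.bashrc, otherwise the changes do not take effect in the current shell. Rebooting the machine also makes them take effect.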
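The `${PATH:+:${PATH}}` form in the lines above expands to ":$PATH" only when PATH is already set and non-empty, so no stray trailing colon is left behind. One drawback of plain export lines is that every extra `source ~/.bashrc` prepends the same directory again. The sketch below shows one possible way to avoid that; `prepend_dir` is a hypothetical helper written for this illustration, not something CUDA or bash provides.

```shell
#!/usr/bin/env bash
# prepend_dir (hypothetical helper): print VALUE with DIR placed in front,
# unless DIR is already one of the colon-separated entries of VALUE.
prepend_dir() {
    local dir="$1" value="$2"
    case ":$value:" in
        *":$dir:"*) printf '%s\n' "$value" ;;          # already present, keep as-is
        *) printf '%s\n' "$dir${value:+:$value}" ;;    # same ${var:+...} trick as above
    esac
}

# Same three variables as in the article, but safe to source repeatedly.
export PATH="$(prepend_dir /usr/local/cuda-10.0/bin "$PATH")"
export LD_LIBRARY_PATH="$(prepend_dir /usr/local/cuda-10.0/lib64 "${LD_LIBRARY_PATH:-}")"
export CUDA_HOME=/usr/local/cuda
```

The plain export lines from the article work just as well; the helper only keeps the variables from growing when the file is sourced many times.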
$ source ~/.bashrc
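Skipping this step is a classic trap for newcomers: everything has been configured exactly as the tutorial says, yet the GPU still cannot be used.
Next, check that the new graphics driver was installed correctly: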
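A quick way to see whether the variables are really in effect in the current shell is sketched below. It assumes the CUDA 10.0 paths used in this article as defaults; `cuda_env_report` is a hypothetical helper name. `nvcc --version` is the toolkit's own compiler version command.

```shell
#!/usr/bin/env bash
# cuda_env_report (hypothetical helper): print one status line per variable.
# BIN and LIB default to the CUDA 10.0 locations used in this article.
cuda_env_report() {
    local bin="${1:-/usr/local/cuda-10.0/bin}" lib="${2:-/usr/local/cuda-10.0/lib64}"
    case ":${PATH:-}:" in
        *":$bin:"*) echo "PATH: ok" ;;
        *) echo "PATH: missing $bin" ;;
    esac
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$lib:"*) echo "LD_LIBRARY_PATH: ok" ;;
        *) echo "LD_LIBRARY_PATH: missing $lib" ;;
    esac
    echo "CUDA_HOME: ${CUDA_HOME:-unset}"
}

cuda_env_report
# nvcc ships with the toolkit, so finding it suggests PATH is set correctly.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found; try: source ~/.bashrc"
fi
```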
$ nvidia-smi
Fri Oct 27 15:46:57 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   29C    P0    24W / 250W |      0MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
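Instead of reading the version off the table by eye, the driver version can also be queried and compared in a script. The sketch below assumes GNU `sort` with the `-V` (version sort) option is available and takes 410.48, the driver bundled with the runfile used above, as the minimum; `version_ge` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# version_ge (hypothetical helper): succeed if version $1 >= version $2.
# Relies on sort -V so that, for example, 410.104 counts as newer than 410.48.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="410.48"   # driver version shipped in cuda_10.0.130_410.48_linux
if command -v nvidia-smi >/dev/null 2>&1; then
    # Print only the driver version, one line per GPU; keep the first line.
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "driver $installed is new enough for CUDA 10.0"
    else
        echo "driver $installed is older than $required"
    fi
else
    echo "nvidia-smi not found; the driver may not be installed"
fi
```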
Finally, bring the graphical display back up:
$ sudo systemctl start lightdm
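Setting up the cuDNN library
First, give the cuDNN file a .tgz name so that it is easier to extract. For other versions, adjust the file name to match what you downloaded.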
$ cp cudnn-10.0-linux-x64-v7.4.2.24.solitairetheme8 cudnn-10.0-linux-x64-v7.4.2.24.tgz
Then extract it:
$ tar zxvf cudnn-10.0-linux-x64-v7.4.2.24.tgz
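The rename works because the download is an ordinary gzip-compressed tar archive that merely carries an odd extension. If extraction fails, a check along the following lines can tell a broken download apart from a naming problem. This is only a sketch; `is_gzip` is a hypothetical helper that looks for the gzip magic bytes 1f 8b at the start of the file.

```shell
#!/usr/bin/env bash
# is_gzip (hypothetical helper): succeed if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

archive="cudnn-10.0-linux-x64-v7.4.2.24.solitairetheme8"
if [ -f "$archive" ]; then
    if is_gzip "$archive"; then
        # List the archive contents without extracting anything.
        tar tzf "$archive"
    else
        echo "$archive does not look like a gzip archive; consider downloading it again"
    fi
else
    echo "$archive not found in the current directory"
fi
```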
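Then copy the libraries and the header file into the CUDA directory. This must be the directory you actually installed to, for example /usr/local/cuda-10.0. If the installation went correctly, Ubuntu will normally already have the symlink /usr/local/cuda -> /usr/local/cuda-10.0/, so the commands below can use /usr/local/cuda.
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
Next, make the copied files readable by all users: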
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
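To check which cuDNN version actually ended up in the CUDA directory, the version macros in the header can be read back. The sketch below assumes the cuDNN 7.x layout, where cudnn.h itself defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL (later cuDNN releases keep them in a separate header); `cudnn_version` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# cudnn_version (hypothetical helper): print MAJOR.MINOR.PATCHLEVEL as defined
# by the "#define CUDNN_..." lines of the given header file.
cudnn_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}

header=/usr/local/cuda/include/cudnn.h
if [ -r "$header" ]; then
    echo "cuDNN $(cudnn_version "$header")"
else
    echo "$header not found or not readable"
fi
```

For the archive used in this article, the reported version should match the 7.4.2 in the file name.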
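With that done, we can go ahead and install the latest TensorFlow and PyTorch.
Appendix: verifying that TensorFlow can use the GPU
Open a terminal and enter the following commands: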
$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
True
>>>
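If the result is "True", the GPU is working. Otherwise, go back over the steps above using the specific error message as a guide, or search Google or Baidu to see whether something was missed.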
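The same check can be run non-interactively from the shell, and extended to PyTorch. The sketch below is one way to do it; `gpu_check` is a hypothetical helper, and it assumes that `python3` or `python` on PATH is the interpreter the frameworks were installed into. `torch.cuda.is_available()` is PyTorch's counterpart to the TensorFlow call above.

```shell
#!/usr/bin/env bash
# gpu_check (hypothetical helper): if MODULE can be imported, run the one-line
# Python CODE and print its output; otherwise report that it is not installed.
gpu_check() {
    local module="$1" code="$2" py
    py="$(command -v python3 || command -v python)" || { echo "$module: no python found"; return 0; }
    if "$py" -c "import $module" >/dev/null 2>&1; then
        # Framework log messages go to stderr and are discarded here.
        echo "$module: $("$py" -c "$code" 2>/dev/null)"
    else
        echo "$module: not installed"
    fi
}

gpu_check tensorflow "import tensorflow as tf; print(tf.test.is_gpu_available())"
gpu_check torch "import torch; print(torch.cuda.is_available())"
```

A line ending in "True" means that framework can see the GPU. An empty result after the colon means the check itself raised an error, in which case running it interactively as shown above gives the full message.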
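Copyright notice: Unless otherwise stated, all articles on this blog are original and copyright of the author. Reposting is welcome; please credit the author and link to the source. Thank you. Article URL: https://blog.ailemon.net/2019/10/28/linux-remove-old-cuda-and-install-new-cuda/ All articles are under Attribution-NonCommercial-ShareAlike 4.0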