Installing the GPU Version of LightGBM on a Remote Server Without Root Access

Main Text

First, thanks to the LightGBM team for their help:

https://github.com/microsoft/LightGBM/issues/6399

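Notes

This post covers installing LightGBM on a remote Linux server.

The fix for a given build error can differ slightly between compiler versions. For reference, these are the versions I compiled with: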

-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 4.8.5

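This post assumes CUDA is already set up on the server. If CUDA lives in a non-default location, you can point CMake at it with -DOpenCL_LIBRARY and -DOpenCL_INCLUDE_DIR. Note that these two options locate the OpenCL library and headers shipped with CUDA, so they matter for the OpenCL build (-DUSE_GPU=1) rather than for the -DUSE_CUDA=1 build.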
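Before starting the build, it can help to confirm that the compilers, CMake, and the CUDA compiler are actually visible on PATH in the current shell or conda environment. The following is a minimal sketch; the function name `find_tools` and the list of tool names are my own choices for illustration, not something the LightGBM build requires in this exact form.

```python
import shutil


def find_tools(names=("gcc", "g++", "cmake", "nvcc")):
    """Map each tool name to its resolved path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per tool so missing ones are easy to spot.
    for name, path in find_tools().items():
        print(f"{name:6s} {path or 'NOT FOUND on PATH'}")
```

If `nvcc` is reported as missing, CUDA may still be installed but simply not on PATH for your user account.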

Compiling

Download LightGBM and build the CUDA version:

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake -DUSE_CUDA=1 .. # build the CUDA version
make -j4

Note:

I am not sure whether this is specific to my setup, but if I use -DUSE_GPU=1 instead of -DUSE_CUDA=1 (that is, build the GPU version rather than the CUDA version), the installation succeeds, yet running LightGBM on the GPU in Jupyter Notebook immediately fails with the following error:

The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.
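A kernel crash like this hides the underlying error, because the notebook process dies together with the native library. One way to get more information is to run the training call in a child Python process and inspect its exit code and stderr. The sketch below does that; the helper name `run_isolated` and the `GPU_SNIPPET` string are illustrative and not taken from the GitHub issue, and the snippet assumes `lightgbm` and `numpy` are installed in the same environment.

```python
import subprocess
import sys

# Hypothetical reproduction snippet: a tiny training run on the OpenCL ("gpu") device.
GPU_SNIPPET = """
import numpy as np
import lightgbm as lgb
X = np.random.rand(500, 5)
y = (X[:, 0] > 0.5).astype(int)
lgb.train({"objective": "binary", "device": "gpu", "verbose": -1},
          lgb.Dataset(X, label=y), num_boost_round=5)
print("ok")
"""


def run_isolated(code, timeout=600):
    """Run a Python snippet in a child interpreter and return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Calling `rc, out, err = run_isolated(GPU_SNIPPET)` from a notebook cell keeps the kernel alive even if the child process dies. On Linux, a negative `rc` means the child was killed by a signal (for example, -11 for a segmentation fault), and `err` holds whatever the library printed before it went down.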

If the following error occurs at the CMake step while building the CUDA version as described above:

x86_64-conda-linux-gnu-cc: error: unrecognized command-line option '-march'

Fix: open CMakeCache.txt in the build folder, search for -march, and find a passage similar to this:

//Flags used by the C compiler during all build types.
CMAKE_C_FLAGS:STRING=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /public/home/xxxx/mambaforge/envs/kaggle/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /public/home/xxxx/mambaforge/envs/ncsvp/include -march

Delete the trailing -march, then rerun cmake -DUSE_CUDA=1 ..; if that succeeds, continue with make -j4.

If your file differs from the above, adjust the -march-related content to fit your own case.
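The manual edit can also be scripted. The sketch below removes any bare `-march` token (one with no `=<value>` attached) from the `*_FLAGS` entries of CMakeCache.txt and keeps valid options such as `-march=nocona`. The function names and the default path `build/CMakeCache.txt` are assumptions for illustration. The script rebuilds the affected line by splitting on whitespace, so it is only suitable when the flags contain no quoted arguments with embedded spaces.

```python
from pathlib import Path


def strip_bare_march(line):
    """Remove bare '-march' tokens from one cache line, keeping '-march=<value>'."""
    if line.startswith(("//", "#")) or "=" not in line:
        return line  # comment or non-entry line: leave untouched
    key, value = line.split("=", 1)
    if "_FLAGS" not in key:
        return line  # only compiler/linker flag entries are edited
    tokens = value.split()
    if "-march" not in tokens:
        return line  # nothing to fix, so keep the original spacing
    return key + "=" + " ".join(tok for tok in tokens if tok != "-march")


def fix_cmake_cache(path="build/CMakeCache.txt"):
    """Rewrite the cache file in place; return True if anything changed."""
    cache = Path(path)
    original = cache.read_text()
    lines = original.splitlines()
    fixed = [strip_bare_march(line) for line in lines]
    if fixed == lines:
        return False
    # Keep a backup next to the original before overwriting it.
    cache.with_name(cache.name + ".bak").write_text(original)
    cache.write_text("\n".join(fixed) + "\n")
    return True
```

After `fix_cmake_cache()` returns `True`, rerun the cmake command from the build folder as described above.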

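Installing

Once compilation finishes, you can already use LightGBM on Linux. To go on and install the Python package, a few more steps are needed:

Return to the LightGBM directory and install the Python interface: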

$ cd ../
$ sh ./build-python.sh install --precompile

Once the LightGBM module is installed, you can pass device='CUDA' to run training on the GPU.

For more detail on the steps above, see the GitHub issue linked at the top of this post.
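A quick way to check that the installed module really trains on the GPU is a small smoke test on synthetic data. This is a minimal sketch rather than a benchmark: the names `CUDA_PARAMS` and `cuda_smoke_test` and the toy dataset are made up for illustration, and the device value is written in lowercase as `cuda`, which is how the LightGBM parameter documentation spells it.

```python
# Parameters for a tiny binary-classification run on the CUDA device.
CUDA_PARAMS = {"objective": "binary", "device": "cuda", "verbose": -1}


def cuda_smoke_test(num_boost_round=10):
    """Train a few trees on random data and return how many trees were built."""
    # Imported lazily so this file can be loaded even where lightgbm is absent.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple, learnable target

    booster = lgb.train(
        CUDA_PARAMS,
        lgb.Dataset(X, label=y),
        num_boost_round=num_boost_round,
    )
    return booster.num_trees()
```

If the build lacks CUDA support, `cuda_smoke_test()` should raise an error from LightGBM instead of returning a tree count, which makes a misconfigured install easy to spot.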

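Afterword

I actually tried to install the GPU version of LightGBM on the server last year, but after struggling with it for a while I could not get it working, so I gave up.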

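So why try again now? The motivation goes back to last month, when I entered a Kaggle Playground competition. LightGBM without GPU acceleration was far too slow, and I had joined late (only six or seven days before the deadline), so I ran hyperparameter optimization (HPO) for XGBoost and LightGBM GPU on my own laptop. Fortunately the results were good: my best result would have ranked eighth, but I picked the wrong submission ~~(the price of blindly trusting the public score)~~.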

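But I digress. As mentioned, running on my laptop's 2060 was indeed much faster than the CPU, but carrying a gaming laptop around every day, getting to the classroom early to claim a seat with a power outlet, and putting up with its raging fan noise and appalling battery life was wearing me out. Looking at the three 3090s on the server, I made up my mind to install it there as well.

Fortunately the installation succeeded in the end. I am recording the detours I took here in the hope that it saves others some time.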