
Training on GPU fails (OSError: exception: access violation) #1717

@Mtale


I have been trying to run LightGBM GPU for some time without success. The software works well on CPU.

I've compiled LightGBM both with MinGW and with MSVC (Visual Studio 2017), following the respective build instructions.

Regardless of which build I use, training a model from Python in Jupyter fails with the same error message:

OSError: exception: access violation reading 0x0000000000000020

More details on the error are below. The traceback shown is from the sklearn API, but the error is the same when I use the lightgbm.cv API.
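A minimal script along the following lines could help isolate the failure from the original notebook. It is only a sketch: the synthetic data and the helper names `gpu_params` and `run_repro` are hypothetical and not taken from the report, while `device`, `gpu_platform_id` and `gpu_device_id` are LightGBM parameters (`-1` asks for the system default platform and device).

```python
def gpu_params(platform_id=-1, device_id=-1):
    """Build LightGBM parameters that request the OpenCL GPU learner.

    -1 lets LightGBM pick the default OpenCL platform/device.
    """
    return {
        "objective": "binary",
        "device": "gpu",
        "gpu_platform_id": platform_id,
        "gpu_device_id": device_id,
        "verbose": 1,
    }


def run_repro(n_rows=1000, n_cols=10, seed=0):
    """Train a tiny model on synthetic data (assumed, not the reporter's data)."""
    # Imported here so the helper above stays usable without these packages.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(seed)
    X = rng.rand(n_rows, n_cols)
    y = (X[:, 0] > 0.5).astype(int)

    train_set = lgb.Dataset(X, label=y)
    # On an affected setup, the access violation would be expected to surface
    # while the Booster is constructed inside this call.
    return lgb.train(gpu_params(), train_set, num_boost_round=10)
```

Calling `run_repro()` from a plain Python process rather than Jupyter, and then again with `"device": "cpu"`, would show whether the crash depends on the notebook environment or only on the GPU code path.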

When I ran the CLI example from the MinGW build instructions, the program failed silently. I currently have the MSVC build installed, so I can't reproduce this right now, but relative to the image in the instructions, the silent failure occurs after the line Total bins 6143.

[Image: output of CLI example]

I've run TensorFlow on the GPU before, so the GPU itself works. However, GPU Caps Viewer fails silently on startup. This is probably related, but I wasn't able to find anything about that problem online.
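Since this LightGBM GPU build runs on OpenCL rather than CUDA, it may be worth confirming that the OpenCL loader responds at all, independently of both LightGBM and GPU Caps Viewer. The sketch below uses only `ctypes` to call the standard OpenCL entry point `clGetPlatformIDs`; the function name `count_opencl_platforms` is hypothetical, and the code assumes a 64-bit Python process like the one in this report.

```python
import ctypes
import ctypes.util


def count_opencl_platforms():
    """Return the number of OpenCL platforms, or None if the query fails."""
    lib_name = ctypes.util.find_library("OpenCL")
    if lib_name is None:
        return None  # no OpenCL loader found on the library search path
    try:
        # Assumes a 64-bit process, where the default calling convention
        # matches the one OpenCL uses on Windows.
        opencl = ctypes.CDLL(lib_name)
    except OSError:
        return None

    # cl_int clGetPlatformIDs(cl_uint num_entries,
    #                         cl_platform_id *platforms,
    #                         cl_uint *num_platforms)
    opencl.clGetPlatformIDs.restype = ctypes.c_int
    opencl.clGetPlatformIDs.argtypes = [
        ctypes.c_uint,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_uint),
    ]

    num_platforms = ctypes.c_uint(0)
    # Passing 0 entries and a NULL array asks only for the platform count.
    status = opencl.clGetPlatformIDs(0, None, ctypes.byref(num_platforms))
    if status != 0:  # CL_SUCCESS is 0; anything else means the query failed
        return None
    return num_platforms.value
```

A result of `None` would point at the OpenCL installation or driver rather than at LightGBM. A count above 1 would suggest trying explicit `gpu_platform_id` values, since the default platform may not be the NVIDIA one.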

I've tried suggestions in the following issues:

#836
#1028

Environment info

Operating System: Windows 10 Home

CPU Model: i7 7700
GPU model: GeForce GTX 1060 6GB
CUDA: 9.0.176.2
OpenCL: 1.2

C++/Python/R version: Python 3.6

Error message

in model(features, test_features, encoding, n_folds)
125 eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
126 eval_names = ['valid', 'train'], categorical_feature = cat_indices,
--> 127 early_stopping_rounds = 100, verbose = 200)
128
129 # Record the best iteration

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
697 verbose=verbose, feature_name=feature_name,
698 categorical_feature=categorical_feature,
--> 699 callbacks=callbacks)
700 return self
701

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
500 verbose_eval=verbose, feature_name=feature_name,
501 categorical_feature=categorical_feature,
--> 502 callbacks=callbacks)
503
504 if evals_result:

C:\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
188 # construct booster
189 try:
--> 190 booster = Booster(params=params, train_set=train_set)
191 if is_valid_contain_train:
192 booster.set_train_data_name(train_data_name)

C:\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, silent)
1474 train_set.construct().handle,
1475 c_str(params_str),
-> 1476 ctypes.byref(self.handle)))
1477 # save reference to data
1478 self.train_set = train_set

OSError: exception: access violation reading 0x0000000000000020
