OverflowError when training with 100k+ iterations #2265
Description
Environment info
Operating System: Windows 7 SP2 (same issue on macOS 10.13.6, where it crashes the Python kernel without any message)
CPU/GPU model: CPU
C++/Python/R version: Python 3.6
LightGBM version or commit hash: 2.2.3 (and 2.2.0)
Error message
When training LightGBM with more than 100,000 iterations, the model can finish training (there is still enough memory), but it fails when it tries to exit the training process.
[358000] training's mape: 0.000139252
[360000] training's mape: 0.00013805
[362000] training's mape: 0.000136836
[364000] training's mape: 0.000135664
[366000] training's mape: 0.000134525
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-22-f940fa105e9d> in <module>()
11
12 # train model
---> 13 model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)
14
15 y_pred = model.predict(df_test[cols_feats])
c:\python36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
240 booster.best_score[dataset_name][eval_name] = score
241 if not keep_training_booster:
--> 242 booster.model_from_string(booster.model_to_string(), False).free_dataset()
243 return booster
244
c:\python36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
2096 # if buffer length is not long enough, re-allocate a buffer
2097 if actual_len > buffer_len:
-> 2098 string_buffer = ctypes.create_string_buffer(actual_len)
2099 ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
2100 _safe_call(_LIB.LGBM_BoosterSaveModelToString(
c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
58 return buf
59 elif isinstance(init, int):
---> 60 buftype = c_char * init
61 buf = buftype()
62 return buf
OverflowError: The '_length_' attribute is too large
However, if I set keep_training_booster=True, the entire training finishes without a problem. So this seems to happen only when LightGBM tries to serialize the model to a string before freeing it.
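The failure mode can be reproduced in isolation: ctypes builds the buffer as a `c_char * length` array type, and a length that does not fit the platform's C ssize_t makes that type creation raise OverflowError. A minimal sketch (the `2**63` length is just an illustrative value beyond the limit, standing in for the huge serialized-model size):

```python
import ctypes

# create_string_buffer(n) internally builds the type (c_char * n).
# When n exceeds what a C ssize_t can hold, the array type cannot be
# created and ctypes raises OverflowError -- the same error surfaced
# by model_to_string() in the traceback above.
try:
    ctypes.create_string_buffer(2**63)  # beyond ssize_t on 64-bit builds
except OverflowError as exc:
    print('OverflowError:', exc)
```

On a 32-bit Python build the limit is correspondingly lower, which is why a multi-gigabyte model string can trip it even when training itself fits in memory.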
Reproducible examples
You can reproduce this with any regression problem with ~50,000 samples and 150 features, trained for ~300,000 iterations with a small learning rate such as 0.001.
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt', 'task': 'train', 'objective': 'mse', 'metric': 'mse',
    'feature_fraction': 0.9, 'learning_rate': 0.001, 'num_leaves': 255,
}
lgb_other_params = {'num_boost_round': 366000, 'verbose_eval': 2000}

lgb_train = lgb.Dataset(df_train[cols_feats], df_train[col_target]).construct()
model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)
where df_train in our case has about 50,000 samples and 150 features, and it still fits in our 16 GB of memory during training. It only fails when exiting training with keep_training_booster=False.