
OverflowError when training with 100k+ iterations #2265

@louis925

Description

Environment info

Operating System: Windows 7 SP2 (and same issue on macOS 10.13.6 but it crashes python kernel without any message)

CPU/GPU model: CPU

C++/Python/R version: Python 3.6

LightGBM version or commit hash: 2.2.3 (and 2.2.0)

Error message

When training LightGBM with more than 100,000 iterations, the model can finish training (there is still enough memory) but fails when it tries to exit the training process.

[358000]	training's mape: 0.000139252
[360000]	training's mape: 0.00013805
[362000]	training's mape: 0.000136836
[364000]	training's mape: 0.000135664
[366000]	training's mape: 0.000134525
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-22-f940fa105e9d> in <module>()
     11 
     12 # train model
---> 13 model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)
     14 
     15 y_pred = model.predict(df_test[cols_feats])

c:\python36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    240         booster.best_score[dataset_name][eval_name] = score
    241     if not keep_training_booster:
--> 242         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    243     return booster
    244 

c:\python36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2096         # if buffer length is not long enough, re-allocate a buffer
   2097         if actual_len > buffer_len:
-> 2098             string_buffer = ctypes.create_string_buffer(actual_len)
   2099             ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
   2100             _safe_call(_LIB.LGBM_BoosterSaveModelToString(

c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
     58         return buf
     59     elif isinstance(init, int):
---> 60         buftype = c_char * init
     61         buf = buftype()
     62         return buf

OverflowError: The '_length_' attribute is too large

However, if I set keep_training_booster=True, the entire training finishes without a problem. So this seems to happen only when LightGBM tries to serialize the model into a string before freeing it.
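The ctypes frame in the traceback hints at the mechanism: create_string_buffer(n) builds the array type c_char * n, and ctypes rejects any requested length that does not fit in the platform's Py_ssize_t. A minimal sketch, independent of LightGBM (the ~2 GiB figure assumes a 32-bit Python build, where this limit is easiest for a serialized model to hit):

```python
import ctypes
import sys

# Any buffer length past the platform's Py_ssize_t limit is rejected by
# ctypes before allocation. On a 32-bit Python that limit is ~2 GiB, which
# a model string from hundreds of thousands of trees can plausibly exceed.
too_large = sys.maxsize + 1
try:
    ctypes.create_string_buffer(too_large)
    raised = False
except OverflowError:
    raised = True
print(raised)  # True
```

The exact OverflowError message varies by Python version and platform, but the rejection point is the same `c_char * init` line shown in the traceback.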

Reproducible examples

You can reproduce this with any regression problem with ~50,000 samples and 150 features, trained for ~300,000 iterations with a small learning rate such as 0.001.

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt', 'task': 'train', 'objective': 'mse', 'metric': 'mse',
    'feature_fraction': 0.9, 'learning_rate': 0.001, 'num_leaves': 255,
}
lgb_other_params = {'num_boost_round': 366000, 'verbose_eval': 2000}
lgb_train = lgb.Dataset(df_train[cols_feats], df_train[col_target]).construct()
model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)

where df_train in our case has about 50,000 samples and 150 features and still fits in our 16 GB of memory during training. It only fails when exiting training with keep_training_booster=False.
