[Experimental] Add cache mechanism for dataset groups to avoid long waiting time for initilization #1178
Conversation
Thank you for this! This seems to be really useful. I was thinking about how I could reduce the waiting time for preprocessing, but I had not thought of pickling the dataset group object. However, pickling feels a bit aggressive. I wonder whether caching just the image sizes might be enough to reduce the waiting time...
Actually, not only the size: caching the metadata also means we don't need to wait for listdir and per-image checks. With a cached dataset I can keep the startup time under 1 minute, from pressing Enter to seeing the tqdm progress bar. Pickling is aggressive; I just used it to show how much it helps at first XD
BTW, I will implement a version which caches the absolute path list and image size for each subset.
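A minimal sketch of what such a per-subset metadata cache could look like (the function and file names here are hypothetical, not the PR's actual code). It stores absolute path, caption, and image size as JSON so that later runs can skip the directory scan and per-image size reads; the size reader is passed in as a callable (e.g. `imagesize.get` from the `imagesize` package, which reads only the file header):

```python
import json
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")

def build_cache(image_dir, read_size, cache_name="metadata_cache.json"):
    """Scan image_dir once and write a JSON metadata cache.

    read_size: callable path -> (width, height), e.g. imagesize.get,
    which parses only the image header instead of decoding pixels.
    """
    entries = []
    for name in sorted(os.listdir(image_dir)):
        if os.path.splitext(name)[1].lower() not in IMAGE_EXTS:
            continue
        path = os.path.abspath(os.path.join(image_dir, name))
        # sd-scripts convention: caption lives in a sibling .txt file
        caption_file = os.path.splitext(path)[0] + ".txt"
        caption = ""
        if os.path.exists(caption_file):
            with open(caption_file, encoding="utf-8") as f:
                caption = f.read().strip()
        w, h = read_size(path)
        entries.append({"path": path, "caption": caption, "size": [w, h]})
    with open(os.path.join(image_dir, cache_name), "w", encoding="utf-8") as f:
        json.dump(entries, f)
    return entries

def load_cache(image_dir, read_size, cache_name="metadata_cache.json"):
    """Return cached metadata if present, otherwise build it."""
    cache_path = os.path.join(image_dir, cache_name)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)  # skips listdir scanning and size probing
    return build_cache(image_dir, read_size, cache_name)
```

On a subsequent run, `load_cache` returns immediately from the JSON file, which is where the startup-time saving comes from.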
That's nice! I think it is straightforward :)
@kohya-ss I have done the implementation. I only implemented it for DreamboothDataset at first.
|
Thank you for the update! This is really nice. I will copy it to the other datasets :) I may change the format to JSON or something else for future-proofing. It makes the metadata three times bigger or more, but I believe that is no problem. I appreciate your understanding.
…aiting time for initilization (kohya-ss#1178)

* support meta cached dataset
* add cache meta scripts
* random ip_noise_gamma strength
* random noise_offset strength
* use correct settings for parser
* cache path/caption/size only
* revert mess up commit
* revert mess up commit
* Update requirements.txt
* Add arguments for meta cache.
* remove pickle implementation
* Return sizes when enable cache

---------

Co-authored-by: Kohya S <52813779+kohya-ss@users.noreply.github.com>
For large-scale datasets, sd-scripts suffers from long waiting times to read image sizes and other metadata.
So I propose 2 improvements:

1. Use the `imagesize` library to read image sizes; don't use PIL, which is overkill for this.
2. Cache the dataset group metadata so subsequent runs can skip the directory scan and size reads.

With my cache script, it took only half an hour to build the dataset groups (which would cost 4 hours if I directly ran 4-card training).
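To illustrate why header-only size reading is so much cheaper than PIL, here is a small stdlib-only sketch for the PNG case: the width and height live at fixed offsets in the IHDR chunk, so 24 bytes are enough (the `imagesize` package generalizes this idea to many formats; this function is just an illustration, not part of the PR):

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(path):
    """Read a PNG's (width, height) from its header without decoding pixels."""
    with open(path, "rb") as f:
        header = f.read(24)  # 8-byte signature + IHDR length/type + w/h
    if header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # bytes 16..24 hold big-endian 4-byte width and height
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Because no pixel data is touched, reading sizes for hundreds of thousands of files stays I/O-bound on tiny reads, which is consistent with the speedup reported above.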
Loading the cached dataset groups also seems fine. I have done a quick sanity check that the first few images are the same, but it needs more checking from the community.
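The "first few images are the same" sanity check mentioned above could be sketched like this (a hypothetical helper, assuming entries are dicts with `path` and `size` keys; not the author's actual check):

```python
def sanity_check(fresh_entries, cached_entries, n=5):
    """Compare the first n entries of a fresh scan against the cache.

    Raises AssertionError if the cache is stale or ordered differently.
    """
    for fresh, cached in zip(fresh_entries[:n], cached_entries[:n]):
        assert fresh["path"] == cached["path"], "entry order changed"
        assert fresh["size"] == cached["size"], "stale size in cache"
    return True
```

A fuller community check would also compare captions and run it over the whole dataset, not just the head.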