I've been working with legal documents lately and am indexing 300k of them. Everything works perfectly fine with normal-sized docs (dozens of pages). However, when a document becomes very large, like the example below with 86,000,000 characters, splitting takes an eternity. I killed the process after an hour and have no idea how long it would actually take. I will let it run overnight and see whether it finishes eventually. The one CPU core in use sits at 100%, which I take as a sign that the code is still working rather than hanging or failing silently.
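One way to get a rough estimate rather than waiting blindly is to time the splitter on growing prefixes of the large document and fit a power law to the timings. The following is only a sketch: `split_fn` is a hypothetical stand-in for whatever call does the splitting (it is assumed to take a string and return an iterable of chunks), and `fit_power_law` and `time_prefixes` are names made up for this illustration, not part of text-splitter.

```python
import math
import time
from typing import Callable, Iterable, List, Sequence, Tuple


def fit_power_law(sizes: Sequence[int], seconds: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of seconds ~= c * size**k in log-log space. Returns (k, c)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                        # slope in log-log space = growth exponent
    c = math.exp(mean_y - k * mean_x)    # intercept gives the constant factor
    return k, c


def time_prefixes(
    text: str,
    split_fn: Callable[[str], Iterable[str]],  # hypothetical: your splitter call
    sizes: Sequence[int],
) -> List[float]:
    """Time split_fn on text[:size] for each size, in seconds."""
    timings = []
    for size in sizes:
        start = time.perf_counter()
        list(split_fn(text[:size]))      # consume the result in case it is lazy
        elapsed = time.perf_counter() - start
        timings.append(max(elapsed, 1e-9))  # avoid log(0) on very fast runs
    return timings
```

With timings for, say, 100k, 200k, 400k and 800k characters, an exponent `k` close to 1 would suggest roughly linear behaviour, while a value close to 2 would point to a quadratic step somewhere. `c * 86_000_000 ** k` then gives a very rough extrapolation for the full document, assuming the trend holds at that size.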
Possible solutions
Is there maybe a bottleneck somewhere in the code where complexity explodes? Considering how fast text-splitter is for normal-sized docs, it feels like this should be doable in a couple of seconds.
Could we maybe apply some additional heuristics here for dealing with very large docs?
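To illustrate the kind of heuristic I have in mind, here is a minimal sketch that first cuts a huge document into bounded windows at coarse separators (blank lines, then newlines, then spaces) and only then hands each window to the real splitter. This is not how text-splitter works internally, and `coarse_windows`, `split_large`, `split_fn` and the default `max_chars` are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterable, Iterator, Sequence


def coarse_windows(
    text: str,
    max_chars: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> Iterator[str]:
    """Yield consecutive slices of text, each at most max_chars long.

    Each window ends right after the last occurrence of the coarsest
    separator found inside it; if none is found, it is cut at max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    start, n = 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:  # the final window needs no separator search
            for sep in separators:
                # Search only inside the current window, never the whole text.
                cut = text.rfind(sep, start + 1, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        yield text[start:end]
        start = end


def split_large(
    text: str,
    split_fn: Callable[[str], Iterable[str]],  # hypothetical: your splitter call
    max_chars: int = 1_000_000,                # assumed window size, tune as needed
) -> Iterator[str]:
    """Run split_fn on each coarse window instead of on the whole document."""
    for window in coarse_windows(text, max_chars):
        yield from split_fn(window)
```

The trade-off is that chunks next to a window boundary can differ from what splitting the whole document at once would produce, since the splitter never sees text across the cut. For documents with regular paragraph breaks that difference should be small, but I have not verified it on the example file.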
Here is my example (37 MB parquet file):
Couldn't upload the file here in the issue so I'm hosting it on my Drive here.
In any case, do you have any other ideas for how to make this work?