Heuristics for very large documents #184

### Problem 
I've been working on legal documents lately and am indexing 300k of them. Everything works fine with normal-sized docs (dozens of pages). However, when a document becomes very large, like the example below with 86,000,000 characters, splitting takes an eternity. I quit the process after 1 hour and have no idea how long it would actually take. I will let it run overnight and see whether it eventually finishes. The one CPU core in use is at 100%, so I take this as a sign that the code is still working rather than failing unexpectedly.
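To get a rough idea of how long the full document might take, one option is to time the splitter on growing prefixes of the text and look at how the runtime scales. The sketch below is only an illustration: `time_prefixes` is a hypothetical helper, not part of text-splitter, and `chunk_fn` stands for whatever callable does the splitting (for example `TextSplitter((1500, 2000)).chunks`).

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_prefixes(
    text: str,
    chunk_fn: Callable[[str], List[str]],
    sizes: Iterable[int],
) -> List[Tuple[int, float]]:
    """Time `chunk_fn` on prefixes of `text` and return (length, seconds) pairs.

    Hypothetical helper for illustration; not part of any library.
    """
    results = []
    for size in sizes:
        prefix = text[:size]  # shorter than `size` if the text runs out
        started = time.perf_counter()
        chunk_fn(prefix)
        results.append((len(prefix), time.perf_counter() - started))
    return results


# Example usage (assumes `splitter` and `df` from the snippet in this issue):
# for length, seconds in time_prefixes(
#     df.text.item(), splitter.chunks, [100_000, 200_000, 400_000, 800_000]
# ):
#     print(f"{length:>10} chars: {seconds:.2f} s")
```

If doubling the prefix length roughly doubles the time, the cost is close to linear; if it roughly quadruples, the cost is closer to quadratic, and extrapolating to 86,000,000 characters would explain the runtime. This is only a rough estimate, since timings on small prefixes can be noisy.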

### Possible solutions
1. Is there a bottleneck somewhere in the code where complexity explodes? Considering how fast text-splitter is for normal-sized docs, it feels like this should be doable in a couple of seconds.
2. Could we apply some additional heuristics for dealing with very large docs?
3. Maybe point 2 in combination with multiprocessing (#165) would be an option?
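One heuristic along the lines of point 2 can be sketched as a workaround on the caller's side: cut the huge text into coarse segments at paragraph breaks first, then run the splitter on each segment. This is a minimal sketch under my own assumptions, not how text-splitter works internally. `coarse_segments` and `chunk_large_text` are hypothetical names, the `"\n\n"` separator and the 1,000,000-character segment size are arbitrary choices, and `chunk_fn` is any callable that splits a string into chunks.

```python
from typing import Callable, Iterator, List


def coarse_segments(text: str, max_chars: int, sep: str = "\n\n") -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_chars` long.

    A cut is placed right after the last `sep` inside the current window;
    if the window contains no usable `sep`, it is cut at `max_chars`.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    start = 0
    total = len(text)
    while start < total:
        end = min(start + max_chars, total)
        if end < total:
            cut = text.rfind(sep, start, end)
            if cut > start:
                end = cut + len(sep)
        yield text[start:end]
        start = end


def chunk_large_text(
    text: str,
    chunk_fn: Callable[[str], List[str]],
    max_chars: int = 1_000_000,
) -> List[str]:
    """Apply `chunk_fn` to each coarse segment and concatenate the results."""
    chunks: List[str] = []
    for segment in coarse_segments(text, max_chars):
        chunks.extend(chunk_fn(segment))
    return chunks


# Example usage (assumes `splitter` and `df` from the snippet in this issue):
# test_chunks = chunk_large_text(df.text.item(), splitter.chunks)
```

The trade-off is that chunks near a segment boundary may differ from what a single pass over the whole document would produce. Since the segments are independent, they could also be handed to separate worker processes, which is where point 3 would come in.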

Here is my example (37 MB parquet file):

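```python
import pandas as pd
from semantic_text_splitter import TextSplitter

# The parquet file holds a single row; its `text` column is the full document.
df = pd.read_parquet("free_trade_agreement.parquet")

# Target chunk size: between 1500 and 2000 characters.
splitter = TextSplitter((1500, 2000))

# This is the call that does not finish in reasonable time (~86M characters).
test_chunks = splitter.chunks(df.text.item())
```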
I couldn't upload the file here in the issue, so I'm hosting it on my Drive [here](https://drive.google.com/file/d/1Xnp5jJhjIWNA6R5u9w96L9WO_Hb61Jmh/view?usp=sharing).

Do you have any other ideas for how to make this work?


