
Heuristics for very large documents #184

@do-me


Problem

I've been working with legal documents lately, indexing 300k of them. Everything works perfectly fine with normal-sized docs (dozens of pages). However, when a document becomes very large, like the example below with 86,000,000 characters, it takes an eternity. I stopped the process after 1 hour and have no clue how long it might take. I'll let it run overnight and see whether it eventually finishes. The one CPU core in use is at 100%, which I take as a sign that the code is still working rather than failing unexpectedly.

Possible solutions

  1. Is there maybe a bottleneck somewhere in the code where complexity explodes? Considering how fast text-splitter is for normal-sized docs, it feels like this should be doable in a couple of seconds.
  2. Could we maybe apply some additional heuristics here for dealing with very large docs?
  3. Maybe point 2 in combination with multiprocessing (see "Performance: use all available CPU cores", #165) might be an idea?
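To make point 2 concrete, here is a rough sketch of one possible heuristic: cut the huge text into coarse segments at paragraph breaks first, then run the splitter on each segment separately. This is not how text-splitter works internally. The helper names `coarse_segments` and `chunk_large_text`, the `max_chars` default, and the choice of a blank line as separator are all assumptions made for illustration, and `split_fn` stands in for whatever chunking callable is used.

```python
from typing import Callable, Iterable, Iterator, List


def coarse_segments(text: str, max_chars: int = 1_000_000, sep: str = "\n\n") -> Iterator[str]:
    """Yield consecutive slices of `text`, each at most `max_chars` long.

    Hypothetical helper: each slice ends just after the last `sep` found
    inside the current window; if there is none, it is cut at `max_chars`.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    start, n = 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # Search from start + 1 so that every slice is non-empty.
            cut = text.rfind(sep, start + 1, end)
            if cut != -1:
                end = cut + len(sep)
        yield text[start:end]
        start = end


def chunk_large_text(
    text: str,
    split_fn: Callable[[str], Iterable[str]],
    max_chars: int = 1_000_000,
) -> List[str]:
    """Run `split_fn` on each coarse segment and concatenate the results."""
    chunks: List[str] = []
    for segment in coarse_segments(text, max_chars):
        chunks.extend(split_fn(segment))
    return chunks
```

With the snippet from this issue, `split_fn` could be `splitter.chunks`. Chunks near a segment boundary may differ from what the splitter would produce on the whole text, since it never sees across the cut. Because the segments are independent, they could also be distributed over several processes (point 3), although I have not checked whether the splitter object can be passed between processes.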

Here is my example (37 MB parquet file):

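from semantic_text_splitter import TextSplitter
import pandas as pd

# The parquet file holds a single row; its "text" column is the whole document
df = pd.read_parquet("free_trade_agreement.parquet")
# Chunk capacity given as a (min, max) range in characters
splitter = TextSplitter((1500, 2000))
# .item() returns the one string in the column (about 86,000,000 characters)
test_chunks = splitter.chunks(df.text.item())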

I couldn't upload the file to this issue, so I'm hosting it on my Drive here.

In any case, do you have any other ideas on how to make it work?
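For question 1, one way to check for a complexity blow-up without waiting for the full 86,000,000 characters is to time the splitter on prefixes of growing length and look at how the runtime scales. The sketch below is only an illustration: `time_prefixes` and `scaling_exponent` are hypothetical helper names, and `split_fn` stands in for the chunking callable (for example `splitter.chunks` from the snippet above).

```python
import math
import time
from typing import Callable, Iterable, List, Sequence, Tuple


def time_prefixes(
    text: str,
    split_fn: Callable[[str], Iterable[str]],
    sizes: Sequence[int],
) -> List[Tuple[int, float]]:
    """Time `split_fn` on prefixes of `text`; return (prefix length, seconds) pairs."""
    timings = []
    for size in sizes:
        prefix = text[:size]
        t0 = time.perf_counter()
        for _ in split_fn(prefix):  # consume the result in case it is lazy
            pass
        timings.append((len(prefix), time.perf_counter() - t0))
    return timings


def scaling_exponent(timings: Sequence[Tuple[int, float]]) -> float:
    """Least-squares slope of log(seconds) against log(length).

    A slope near 1 suggests roughly linear scaling, near 2 roughly quadratic.
    """
    # Drop points that cannot be put on a log scale.
    points = [(math.log(n), math.log(t)) for n, t in timings if n > 0 and t > 0]
    if len(points) < 2:
        raise ValueError("need at least two measurements with positive time")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    if den == 0:
        raise ValueError("need at least two distinct prefix lengths")
    return num / den
```

Running this with sizes such as 100,000, 200,000, 400,000 and 800,000 characters should give a first indication within minutes. The estimate is noisy for very short runs, so it is a hint about where to look rather than a measurement of the algorithm's complexity.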
