Max Body Size #28

@JakeAustwick

While crawling sites, you sometimes hit something like a video file, or just an extremely large response. This usually wouldn't be a problem, but because gocrawl uses a single goroutine per host, this can lock the entire thing up for a long time.

My suggestion is that, rather than using ioutil.ReadAll() (https://github.com/PuerkitoBio/gocrawl/blob/master/worker.go#L331), gocrawl could support a configuration option to read only the first N bytes and then just proceed.

The problem I foresee is that this might require an API change, because your Visit function would have to receive a variable that tells you whether the entire body was downloaded or not. The only way around this that I can think of without an API change would be to pass truncated responses to a separate VisitBodyNotCompleted function instead, but this isn't ideal.
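A rough sketch of that second, non-breaking option follows. The interfaces are deliberately simplified stand-ins and do not match gocrawl's real Visit signature. The names visitor, partialVisitor and dispatch are hypothetical. Skipping truncated bodies when no VisitBodyNotCompleted exists is just one possible policy I am assuming here.

```go
package main

import "fmt"

// visitor is a simplified stand-in for the existing callback.
// It is NOT gocrawl's actual Visit signature.
type visitor interface {
	Visit(body []byte)
}

// partialVisitor is a hypothetical optional interface. Implementing it
// opts in to receiving bodies that were cut off at the size limit.
type partialVisitor interface {
	VisitBodyNotCompleted(body []byte)
}

// dispatch routes a body to the matching callback and returns the name of
// the callback it used. Existing visitors keep seeing only complete bodies.
func dispatch(v visitor, body []byte, truncated bool) string {
	if !truncated {
		v.Visit(body)
		return "Visit"
	}
	if pv, ok := v.(partialVisitor); ok {
		pv.VisitBodyNotCompleted(body)
		return "VisitBodyNotCompleted"
	}
	// Assumed policy: never hand an incomplete body to plain Visit.
	return "skipped"
}

// plain implements only Visit, like code written before the change.
type plain struct{}

func (plain) Visit(body []byte) {}

// aware also accepts truncated bodies.
type aware struct{ plain }

func (aware) VisitBodyNotCompleted(body []byte) {}

func main() {
	fmt.Println(dispatch(plain{}, []byte("full"), false))
	fmt.Println(dispatch(plain{}, []byte("part"), true))
	fmt.Println(dispatch(aware{}, []byte("part"), true))
}
```

The type assertion keeps existing code compiling unchanged, which is the appeal of this route. The cost is a second callback that most users would have to implement with nearly the same logic as Visit.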

How open are you to breaking API changes, or can you think of a way around this?

What are your thoughts on supporting this?

Thanks,
Jake
