While crawling through sites, sometimes you hit something like a video file, or just an extremely large response. This usually wouldn't be a problem; however, because gocrawl uses a single goroutine per host, this can lock the entire crawl up for a long time.
My suggestion is that rather than using ioutil.ReadAll() (https://github.com/PuerkitoBio/gocrawl/blob/master/worker.go#L331), gocrawl could support a configuration option to read only the first N bytes of the body and then just proceed.
The problem I foresee is that this would probably require an API change, because your Visit function would need to receive a flag telling it whether the entire body was downloaded. The only workaround I can think of that avoids an API change is to pass everything forward to a separate VisitBodyNotCompleted function, but that isn't as ideal.
How open are you to breaking API changes, or can you think of a way around this?
What are your thoughts on supporting this?
Thanks,
Jake