A Guide on Configuring a Gradio Queue for High-Volume Traffic #2558

Merged: abidlabs merged 19 commits into `main` from `performance-guide` on Nov 6, 2022

Conversation

@abidlabs (Member) commented Oct 28, 2022

@github-actions (Contributor)

All the demos for this PR have been deployed at https://huggingface.co/spaces/gradio-pr-deploys/pr-2558-all-demos

abidlabs marked this pull request as ready for review on November 4, 2022 at 20:17
@freddyaboulton (Collaborator) left a comment

Awesome guide @abidlabs! I love the structure and the concrete recommendations. I noticed a couple of typos, but that's about it.

Is it worth mentioning `api_open`, in the sense that "closing" the API may help with scaling by making sure the queue isn't skipped?
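
For illustration, a minimal sketch of what "closing" the API looks like, assuming the `api_open` flag on `queue()` that this comment refers to (the `predict` function is a placeholder):

```python
import gradio as gr

def predict(text):
    # Placeholder standing in for a real model call.
    return text[::-1]

demo = gr.Interface(predict, "textbox", "textbox")

# With api_open=False, clients cannot POST to the REST endpoint directly,
# so every request has to go through the queue.
demo.queue(api_open=False)
demo.launch()
```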

Comment thread on guides/06_other_tutorials/setting_up_a_gradio_demo_for_maximum_performance.md (outdated)

If you write your function to process a batch of samples, Gradio will automatically batch incoming requests together and pass them into your function as a batch of samples. You need to set `batch` to `True` (by default it is `False`) and set a `max_batch_size` (by default it is `4`) based on the maximum number of samples your function is able to handle. These two parameters can be passed into `gr.Interface()` or to an event in Blocks such as `.click()`.

While enabling batching is conceptually similar to having workers process requests in parallel, it is often *faster* than raising the `concurrency_count` for deep learning models. The downside is that you may need to adapt your function slightly to accept batches of samples rather than individual samples.
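
As a concrete illustration of the paragraph above, here is a minimal sketch of a batched function (the `trim_words` logic is a made-up placeholder; `batch` and `max_batch_size` are the parameters named above):

```python
import gradio as gr

def trim_words(words, lengths):
    # Gradio passes each input as a list spanning the whole batch.
    trimmed = [w[: int(l)] for w, l in zip(words, lengths)]
    # Return one list per output component.
    return [trimmed]

demo = gr.Interface(
    trim_words,
    inputs=["textbox", "number"],
    outputs=["textbox"],
    batch=True,         # the function accepts and returns batches
    max_batch_size=16,  # at most 16 requests are grouped per call
)
demo.queue()  # batching requires the queue to be enabled
demo.launch()
```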
Collaborator:

Do we have an answer as to whether `concurrency_count` plays well on GPUs?

I wonder if we should mention that `concurrency_count` is better suited to IO-bound demos and `batch_size` is better suited to CPU/GPU-bound demos. You kind of hint at that below, where you say that a high batch size likely means the concurrency count should be set to 1.

@abidlabs (Member, Author):

Checking this right now

@abidlabs (Member, Author):

I assigned a GPU to this Space and it seems to work just fine: https://huggingface.co/spaces/abidlabs/image-classifier. Specifically, I was able to process requests in parallel and cut the average latency in half by using `concurrency_count=2`!

A user just needs to keep in mind that their GPU memory may differ from their CPU memory, so they need to ensure that multiple workers will not OOM their GPU. I'll add a note in the hardware section.
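
For reference, a minimal sketch of the setup described above, assuming the Gradio 3.x `queue(concurrency_count=...)` API (the `classify` function is a placeholder for a GPU-bound model call):

```python
import time
import gradio as gr

def classify(image):
    # Placeholder for a GPU-bound model call; each parallel worker
    # holds its own per-request state in GPU memory.
    time.sleep(1)
    return {"cat": 0.9, "dog": 0.1}

demo = gr.Interface(classify, gr.Image(), gr.Label())

# Two workers pull from the queue in parallel; before raising this,
# check that the extra copies of per-request state fit in GPU memory.
demo.queue(concurrency_count=2)
demo.launch()
```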

Comment thread on guides/06_other_tutorials/setting_up_a_gradio_demo_for_maximum_performance.md (outdated)
@abidlabs (Member, Author) commented Nov 4, 2022

Thanks for the review @freddyaboulton! I'll fix the typos and add a section on `api_open` as well, since we've seen at least one concrete example of that being an issue.

@abidlabs (Member, Author) commented Nov 5, 2022

@aliabd I'm seeing some weird behavior when I try to add my Guide to the Interface class and build the website locally: only this Guide shows up underneath the docs for the Interface class:

[screenshot: Interface docs listing only the new performance guide]


What's weird is that if I remove my additional guide from the Interface docstring (in other words, revert all changes to the interface.py file), not a single Guide shows up:

[screenshot: Interface docs showing no guides at all]

Even though the Interface class lists 6 guides here:

    Guides: quickstart, key_features, sharing_your_app, interface_state, reactive_interfaces, advanced_interface_features

Do you know what might be going on?

abidlabs merged commit 05dcd87 into main on Nov 6, 2022
abidlabs deleted the performance-guide branch on November 6, 2022 at 01:36

So why not set this parameter much higher? Keep in mind that since requests are processed in parallel, each request will consume memory to store the data and weights for processing. This means that you might get out-of-memory errors if you increase the `concurrency_count` too high.

**Recommendation**: Increase the `concurrency_count` parameter as high as you can until you hit memory limits on your machine. You can [read about Hugging Face Spaces machine specs here](https://huggingface.co/docs/hub/spaces-overview).
@omerXfaruq (Contributor) commented Nov 7, 2022

Hello @abidlabs, a beautiful guide as always, good work!

IMO this recommendation is not a good one, because increasing concurrency does not directly translate into better performance, for various reasons such as the cost of context switching and GIL limitations. So I would suggest changing it to something like:

'Increase concurrency as long as you see an improvement in throughput.'
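
To make that suggested tuning loop concrete, here is a rough, hypothetical throughput probe (the endpoint URL and payload shape are assumptions to adapt to your demo, and hitting the REST endpoint directly also requires the API to be open):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:7860/api/predict"  # assumed endpoint; adjust to your demo
N_REQUESTS = 32

def send_one(_):
    # Payload shape is an assumption; match it to your demo's inputs.
    return requests.post(URL, json={"data": ["test input"]}, timeout=60)

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(send_one, range(N_REQUESTS)))
elapsed = time.time() - start

print(f"{N_REQUESTS / elapsed:.2f} requests/sec")
```

If throughput plateaus as `concurrency_count` grows, the bottleneck is likely context switching or the GIL rather than memory.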

@abidlabs (Member, Author):

Ah I see, thanks for the suggestion @farukozderim! I'll update the Guide to reflect that.


Development

Successfully merging this pull request may close these issues:

Write an Advanced Guide about Queue, Concurrency, Batching in Gradio