3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -46,7 +46,8 @@ The `api_name` parameter will take precedence over the `fn_index` parameter.
* `gr.Blocks.load()` now correctly loads example files from Spaces [@abidlabs](https://github.com/abidlabs) in [PR 2594](https://github.com/gradio-app/gradio/pull/2594)

## Documentation Changes:
No changes to highlight.
* Added a Guide on how to configure the queue for maximum performance by [@abidlabs](https://github.com/abidlabs) in [PR 2558](https://github.com/gradio-app/gradio/pull/2558)


## Testing and Infrastructure Changes:
No changes to highlight.
2 changes: 1 addition & 1 deletion gradio/interface.py
@@ -59,7 +59,7 @@ def image_classifier(inp):
demo = gr.Interface(fn=image_classifier, inputs="image", outputs="label")
demo.launch()
Demos: hello_world, hello_world_3, gpt_j
Guides: quickstart, key_features, sharing_your_app, interface_state, reactive_interfaces, advanced_interface_features
Guides: quickstart, key_features, sharing_your_app, interface_state, reactive_interfaces, advanced_interface_features, setting_up_a_gradio_demo_for_maximum_performance
"""

# stores references to all currently existing Interface instances
@@ -0,0 +1,101 @@
# Setting Up a Gradio Demo for Maximum Performance

Let's say that your Gradio demo goes *viral* on social media -- you have lots of users trying it out simultaneously, and you want to provide your users with the best possible experience or, in other words, minimize the amount of time that each user has to wait in the queue to see their prediction.

How can you configure your Gradio demo to handle the most traffic? In this Guide, we dive into some of the parameters of Gradio's `.queue()` method as well as some other related configurations, and discuss how to set these parameters in a way that allows you to serve lots of users simultaneously with minimal latency.

This is an advanced guide, so make sure you know the basics of Gradio already, such as [how to create and launch a Gradio demo](https://gradio.app/quickstart/). Most of the information in this Guide is relevant whether you are hosting your demo on [Hugging Face Spaces](https://hf.space) or on your own server.

## Enabling Gradio's Queueing System

By default, a Gradio demo does not use queueing and instead sends each prediction request via an HTTP POST request to the server where your Gradio app and Python code are running. However, regular POST requests have two big limitations:

(1) They time out -- most browsers raise a timeout error if they do not get a response to a POST request after a short period of time (e.g. 1 minute). This can be a problem if your inference function takes longer than 1 minute to run or if many people are trying out your demo at the same time, resulting in increased latency.

(2) They do not allow bi-directional communication between the Gradio demo and the Gradio server. This means, for example, that you cannot get a real-time ETA of how long your prediction will take to complete.

To address these limitations, any Gradio app can be converted to use **websockets** instead, simply by adding `.queue()` before launching an Interface or a Blocks. Here's an example:

```py
import gradio as gr

app = gr.Interface(lambda x: x, "image", "image")
app.queue()  # <-- Sets up a queue with default parameters
app.launch()
```

In the demo `app` above, predictions will now be sent over a websocket instead.
Unlike POST requests, websockets do not time out and they allow bidirectional traffic. On the Gradio server, a **queue** is set up, which adds each incoming request to a list. When a worker is free, the first available request is passed into the worker for inference. When the inference is complete, the queue sends the prediction back through the websocket to the particular Gradio user who called that prediction.

Note: If you host your Gradio app on [Hugging Face Spaces](https://hf.space), the queue is already **enabled by default**. You can still call the `.queue()` method manually in order to configure the queue parameters described below.

## Queueing Parameters

There are several parameters that can be used to configure the queue and help reduce latency. Let's go through them one-by-one.

### The `concurrency_count` parameter

The first parameter we will explore is the `concurrency_count` parameter of `queue()`. This parameter is used to set the number of worker threads in the Gradio server that will be processing your requests in parallel. By default, this parameter is set to `1` but increasing this can linearly multiply the capacity of your server to handle requests.
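
As a minimal sketch, you might set this parameter like so (the value `3` is just an illustration; the right number depends on your machine):

```py
import gradio as gr

app = gr.Interface(lambda x: x, "image", "image")
app.queue(concurrency_count=3)  # <-- Up to 3 requests are processed in parallel
app.launch()
```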

So why not set this parameter much higher? Keep in mind that since requests are processed in parallel, each request will consume memory to store the data and weights for processing. This means that you might get out-of-memory errors if you increase the `concurrency_count` too high.

**Recommendation**: Increase the `concurrency_count` parameter as high as you can until you hit memory limits on your machine. You can [read about Hugging Face Spaces machine specs here](https://huggingface.co/docs/hub/spaces-overview).
@omerXfaruq (Contributor) commented on Nov 7, 2022:

Hello @abidlabs, a beautiful guide as always, good work!

IMO this recommendation is not a good one, because increasing concurrency does not directly translate into performance, due to various costs such as context switching and GIL limitations. So I would suggest changing it to something like:

'Increase concurrency as long as you see improvement in the throughput.'

@abidlabs (Member, Author) replied:

Ah I see, thanks for the suggestion @farukozderim! I'll update the Guide to reflect that


### The `max_size` parameter

A more blunt way to reduce the wait times is simply to prevent too many people from joining the queue in the first place. You can set the maximum number of requests that the queue processes using the `max_size` parameter of `queue()`. If a request arrives when the queue is already of the maximum size, it will not be allowed to join the queue and instead, the user will receive an error saying that the queue is full and to try again. By default, `max_size=None`, meaning that there is no limit to the number of users that can join the queue.
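
As a sketch (the value `20` is just an illustration):

```py
import gradio as gr

app = gr.Interface(lambda x: x, "image", "image")
app.queue(max_size=20)  # <-- If 20 requests are already waiting, new ones are rejected
app.launch()
```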

Paradoxically, setting a `max_size` can often improve user experience because users are not dissuaded by very long queue wait times. Users who are more interested and invested in your demo will keep trying to join the queue, and will be able to get their results faster.

**Recommendation**: For a better user experience, set a `max_size` that is reasonable given your expectations of how long users might be willing to wait for a prediction.

### The `max_batch_size` parameter

Another way to increase the parallelism of your Gradio demo is to write your function so that it can accept **batches** of inputs. Most deep learning models can process batches of samples more efficiently than processing individual samples.

If you write your function to process a batch of samples, Gradio will automatically batch incoming requests together and pass them into your function as a batch of samples. You need to set `batch` to `True` (by default it is `False`) and set a `max_batch_size` (by default it is `4`) based on the maximum number of samples your function is able to handle. These two parameters can be passed into `gr.Interface()` or to an event in Blocks such as `.click()`.

While setting a batch is conceptually similar to having workers process requests in parallel, it is often *faster* than setting the `concurrency_count` for deep learning models. The downside is that you might need to adapt your function a little bit to accept batches of samples instead of individual samples.
A collaborator commented:

Do we have an answer as to whether `concurrency_count` plays well on GPUs?

I wonder if we should mention that `concurrency_count` is better suited to IO-bound demos and `batch_size` is better suited to CPU/GPU-bound demos. You kind of hint at that below where you say that a high batch size likely means that the concurrency count should be set to 1.

@abidlabs (Member, Author) replied:

Checking this right now

@abidlabs (Member, Author) followed up:

I assigned a GPU to this Space and it seems to work just fine: https://huggingface.co/spaces/abidlabs/image-classifier. Specifically, I was able to process requests in parallel and cut the latency in half on average by using concurrency_count=2!

A user just needs to keep in mind that their GPU memory might be different from their CPU memory, so they need to ensure that multiple workers will not OOM their GPU. I'll add a note in the hardware section.


Here's an example of a simple function that does *not* accept a batch of inputs -- it processes a single input at a time:

```py
def trim_words(word, length):
    # Trim a single word to the given character length
    return word[:int(length)]

```

Here's the same function rewritten to take in a batch of samples:

```py
def trim_words(words, lengths):
    # Trim each word in the batch to its corresponding length
    trimmed_words = []
    for w, l in zip(words, lengths):
        trimmed_words.append(w[:int(l)])
    # Return a list of output batches (one list per output component)
    return [trimmed_words]

```

The second function can be used with `batch=True` and an appropriate `max_batch_size` parameter.
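
For instance, here is a minimal sketch of wiring the batched function above into an interface (the component choices and the value `max_batch_size=16` are illustrative):

```py
import gradio as gr

def trim_words(words, lengths):
    # Trim each word in the batch to its corresponding length
    return [[w[:int(l)] for w, l in zip(words, lengths)]]

demo = gr.Interface(
    fn=trim_words,
    inputs=["textbox", "number"],
    outputs="textbox",
    batch=True,         # <-- The function accepts a batch of samples
    max_batch_size=16,  # <-- At most 16 samples are batched together
)
demo.queue()
demo.launch()
```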

**Recommendation**: If possible, write your function to accept batches of samples, and then set `batch` to `True` and the `max_batch_size` as high as possible based on your machine's memory limits. If you set `max_batch_size` as high as possible, you will most likely need to set `concurrency_count` back to `1` since you will no longer have the memory to have multiple workers running in parallel.

### Upgrading your Hardware (GPUs, TPUs, etc.)

If you have done everything above and your demo is still not fast enough, you can upgrade the hardware that your model is running on. Moving a deep learning model from CPU to GPU will usually provide a 10x-50x speedup in inference.

It is particularly straightforward to upgrade your hardware on Hugging Face Spaces. Simply click on the "Settings" tab in your Space and choose the Space Hardware you'd like.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-gpu-settings.png)

While you might need to adapt portions of your code to run on a GPU (here's a [handy guide](https://cnvrg.io/pytorch-cuda/) if you are using PyTorch), Gradio is agnostic to the choice of hardware and will work just as well whether you use it with CPUs, GPUs, TPUs, or any other hardware!
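
As a minimal sketch of what that adaptation looks like in PyTorch (the toy `nn.Linear` model is just a stand-in for your own network):

```py
import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# A toy model standing in for your own network
model = nn.Linear(10, 2).to(device)  # <-- Move the model's weights to the device

def predict(x):
    with torch.no_grad():
        # Inputs must live on the same device as the model's weights
        return model(x.to(device)).cpu()

print(predict(torch.randn(1, 10)))
```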

## Conclusion

Congratulations! You know how to set up a Gradio demo for maximum performance. Good luck on your next viral demo!