How to increase Inference speed #2089

Dammerzone · 2025-03-18T15:31:19Z

Hello guys,

I'm working on a local version of the latest Gemma 3 Colab notebook on a Jetson AGX Orin with 64 GB of memory.

Everything works fine, with only 17 GB of VRAM reserved, but token generation is quite slow even though I call the for_inference() method.

My question: is there a way to speed up token generation by increasing the VRAM allocation, given that I only use 17 GB of the 61 GB available?

Here is how I load my model:

from unsloth import FastModel

# Note: I also tried FastVisionModel; it doesn't change anything.
vlm, processor = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
    device_map = "cuda",
    # token = "hf_...",
)

# Enable Unsloth's inference mode
FastModel.for_inference(vlm)
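
To put a number on "quite slow", here is a minimal sketch of a throughput check. It is not part of Unsloth; `tokens_per_second` and `bandwidth_bound_ceiling` are hypothetical helper names, and `generate` stands for any zero-argument callable that returns the full list of token ids (prompt plus new tokens). The second helper rests on the common rule of thumb that single-stream decoding reads all the weights once per token, so it tends to be limited by memory bandwidth and not by how much VRAM is left free. The 204.8 GB/s figure in the usage comment is an assumed value for this board and should be checked against NVIDIA's spec sheet.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[], Sequence[int]],
    prompt_len: int,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Average number of newly generated tokens per second.

    `generate` must return the whole sequence (prompt + new tokens) as
    plain Python ints, so that any pending GPU work has finished by the
    time it returns.
    """
    for _ in range(warmup):
        generate()  # warm-up calls are not timed

    new_tokens = 0
    elapsed = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output_ids = generate()
        elapsed += time.perf_counter() - start
        new_tokens += max(len(output_ids) - prompt_len, 0)

    return new_tokens / elapsed if elapsed > 0 else float("inf")


def bandwidth_bound_ceiling(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on decode speed (tokens/s) when every token
    requires reading all model weights from memory once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


# Hypothetical usage with the model loaded above, where `inputs` is whatever
# you already pass to vlm.generate():
#
#   prompt_len = inputs["input_ids"].shape[1]
#   rate = tokens_per_second(
#       lambda: vlm.generate(**inputs, max_new_tokens=128)[0].tolist(),
#       prompt_len,
#   )
#   ceiling = bandwidth_bound_ceiling(17.0, 204.8)  # 204.8 GB/s is an assumption
```

If the measured rate is already near the estimated ceiling, reserving more of the free memory is unlikely to help on its own; a smaller or more aggressively quantized model, or a different inference runtime, would be the things to try.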

gauravrawat-ai · 2025-08-07T08:26:37Z

Did you find a solution?
