Hi, thanks for the great work!
I'm currently trying to write a training script based on the provided inference code and ran into some confusion regarding the `scaling_factor` of the VAE.
In the inference code (`decode_latents`), I noticed you apply the unscaling operation:

```python
latents = 1 / self.vae.config.scaling_factor * latents
```
However, in the encoding function (`_encode_vae_frames`), the `scaling_factor` is not applied before the latents are passed to the diffusion process:

```python
frame_latent = self.vae.encode(frames).latent_dist.mode()
```
Looking at the `AutoencoderKLTemporalDecoder` source, the `encode` method doesn't apply the scaling factor internally either.
Could you clarify the intended usage?
- Is the scaling factor effectively unused during the forward diffusion process, and the unscaling in `decode_latents` a remnant of standard SVD pipelines?
- Or should the training script apply `latents = latents * scaling_factor` after encoding to match the input variance the UNet expects, with the current inference code relying on the diffusion model having learned the unscaled distribution?
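For what it's worth, my understanding of the usual diffusers convention is that the two operations are symmetric: training multiplies encoded latents by `scaling_factor`, and inference divides by it before decoding, so the UNet only ever sees the scaled distribution. A minimal sketch of that round trip (the value 0.18215 is the Stable Diffusion default and purely illustrative here; `scale_for_unet` / `unscale_for_decode` are hypothetical names standing in for the encode/decode paths):

```python
import math

# Illustrative stand-in for vae.config.scaling_factor (SD default value).
scaling_factor = 0.18215

def scale_for_unet(raw_latents):
    # Training side: scale raw encoded latents before feeding the UNet.
    return [x * scaling_factor for x in raw_latents]

def unscale_for_decode(latents):
    # Inference side: undo the scaling before vae.decode, mirroring
    # the `1 / scaling_factor * latents` line in decode_latents.
    return [x / scaling_factor for x in latents]

raw = [1.0, -2.5, 0.3]
roundtrip = unscale_for_decode(scale_for_unet(raw))
print(all(math.isclose(a, b) for a, b in zip(raw, roundtrip)))  # → True
```

If only one side of this pair is applied (unscaling at decode but no scaling at encode), the latent distribution the UNet is trained on differs from the one it sees at inference, which is exactly the mismatch I'm worried about.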
I want to make sure the training latents match the distribution expected by the model. Any insight would be appreciated!