Question about asymmetric scaling_factor usage in encode/decode during training vs. inference #41

@aaPoriaa


Hi, thanks for the great work!

I'm currently writing a training script based on the provided inference code and ran into some confusion about how the VAE's scaling_factor is handled.

In the inference code (decode_latents), I noticed you apply the unscaling operation:
latents = 1 / self.vae.config.scaling_factor * latents

However, in the encoding function (_encode_vae_frames), the scaling_factor is not applied before passing the latents to the diffusion process:
frame_latent = self.vae.encode(frames).latent_dist.mode()

Looking at the AutoencoderKLTemporalDecoder source, the encode method itself doesn't apply the scaling factor internally either.

Could you clarify the intended usage?

  1. Is the scaling factor effectively not used during the forward diffusion process, and the unscaling in decode_latents is a remnant from standard SVD pipelines?
  2. Or, should the training script apply latents = latents * scaling_factor after encoding to match the expected input variance of the UNet, and the current inference code relies on the diffusion model having learned the unscaled distribution?

I want to make sure the training latents match the distribution expected by the model. Any insight would be appreciated!
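For reference, here is a minimal sketch of the symmetric convention described in option 2 above, i.e. the standard diffusers/SVD pattern where encode scales latents up and decode undoes it. The helper names and the default value below are my own assumptions for illustration (0.18215 is the common SD-family default; the actual value should be read from `vae.config.scaling_factor`), not code from this repo:

```python
# Sketch of the symmetric scaling convention (assumption: standard
# diffusers behavior, not necessarily what this repo's model was
# trained with).
SCALING_FACTOR = 0.18215  # illustrative default; read vae.config.scaling_factor in practice


def scale_latents(latents, scaling_factor=SCALING_FACTOR):
    """Option 2: apply after vae.encode(frames).latent_dist.mode()
    so latent variance matches what the UNet expects."""
    return latents * scaling_factor


def unscale_latents(latents, scaling_factor=SCALING_FACTOR):
    """Mirror of the inference line
    `latents = 1 / self.vae.config.scaling_factor * latents`,
    applied before vae.decode(...)."""
    return latents / scaling_factor
```

If training scales and inference unscales (or neither side applies the factor), the round trip is consistent; the mismatch arises only when one side applies it and the other does not.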
