Hi, thanks for the great work!
I'm currently trying to write a training script based on the provided inference code and ran into some confusion regarding the `scaling_factor` of the VAE.
In the inference code (`decode_latents`), I noticed you apply the unscaling operation:

```python
latents = 1 / self.vae.config.scaling_factor * latents
```
However, in the encoding function (`_encode_vae_frames`), the `scaling_factor` is not applied before the latents are passed to the diffusion process:

```python
frame_latent = self.vae.encode(frames).latent_dist.mode()
```
Looking at the `AutoencoderKLTemporalDecoder` source, the `encode` method doesn't apply the scaling factor internally either.
Could you clarify the intended usage?
- Is the scaling factor effectively unused during the forward diffusion process, and the unscaling in `decode_latents` a remnant of standard SVD pipelines?
- Or should the training script apply `latents = latents * scaling_factor` after encoding to match the input variance the UNet expects, with the current inference code relying on the diffusion model having learned the unscaled distribution?
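For what it's worth, my understanding of the usual diffusers convention is that the two operations are symmetric: training multiplies encoded latents by `scaling_factor`, and inference divides by it before decoding, so the UNet only ever sees the scaled distribution. A minimal sketch of that round trip (the value 0.18215 is the Stable Diffusion default and purely illustrative here; `scale_for_unet` / `unscale_for_decode` are hypothetical names standing in for the encode/decode paths):

```python
import math

# Illustrative stand-in for vae.config.scaling_factor (SD default value).
scaling_factor = 0.18215

def scale_for_unet(raw_latents):
    # Training side: scale raw encoded latents before feeding the UNet.
    return [x * scaling_factor for x in raw_latents]

def unscale_for_decode(latents):
    # Inference side: undo the scaling before vae.decode, mirroring
    # the `1 / scaling_factor * latents` line in decode_latents.
    return [x / scaling_factor for x in latents]

raw = [1.0, -2.5, 0.3]
roundtrip = unscale_for_decode(scale_for_unet(raw))
print(all(math.isclose(a, b) for a, b in zip(raw, roundtrip)))  # → True
```

If only one side of this pair is applied (unscaling at decode but no scaling at encode), the latent distribution the UNet is trained on differs from the one it sees at inference, which is exactly the mismatch I'm worried about.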
I want to make sure the training latents match the distribution expected by the model. Any insight would be appreciated!