Why is the speaker embedding g used to condition the Posterior Encoder and the Decoder?

I am confused why the speaker embedding `g` is used to condition multiple model components (_Posterior Encoder, Decoder, Flow_) as opposed to just _Flow_.

From the model diagram in **Fig. 1 (a)** (training procedure), the speaker embedding `g` is used to condition the normalising _Flow_. This makes sense: at inference time, this information is used in the reversed _Flow_ to map the `z'` distribution back into a speaker-informed `z`, which was modelled on the real data `x_lin` by the _Posterior Encoder_.
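To make this concrete, here is a toy sketch of how I understand a speaker-conditioned, shift-only coupling step. It is not the repository's code: `shift`, `flow_forward`, and `flow_reverse` are hypothetical names, and the `tanh` function merely stands in for whatever network computes the shift from `g`.

```python
import math

def shift(z0, g):
    # Hypothetical stand-in for the conditioning network:
    # any deterministic function of the untouched half z0 and the speaker embedding g.
    return [math.tanh(a + sum(g)) for a in z0]

def flow_forward(z, g):
    # Training direction: z -> z' (written z_p here), conditioned on g.
    half = len(z) // 2
    z0, z1 = z[:half], z[half:]
    m = shift(z0, g)
    return z0 + [b + mi for b, mi in zip(z1, m)]

def flow_reverse(z_p, g):
    # Inference direction: z' -> z, conditioned on g.
    half = len(z_p) // 2
    z0, z1 = z_p[:half], z_p[half:]
    m = shift(z0, g)
    return z0 + [b - mi for b, mi in zip(z1, m)]

z = [0.3, -1.2, 0.8, 0.5]   # toy latent
g_a = [0.1, 0.4]            # toy embedding for speaker A
g_b = [-0.7, 0.2]           # toy embedding for speaker B

z_p = flow_forward(z, g_a)
z_same = flow_reverse(z_p, g_a)    # same speaker: recovers z up to float rounding
z_other = flow_reverse(z_p, g_b)   # different speaker: a different latent
```

Reversing with the same `g` recovers the original latent, while reversing with a different `g` yields a different `z`, which is why conditioning the _Flow_ alone already seems to carry the speaker identity into `z`.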

To me this seems like enough supervision, and I am confused why `g` is used in other places too:
- In the _Posterior Encoder_, which takes `x_lin` as input, `g` is also supplied, but it shouldn't be needed because `x_lin` already contains the speaker information. (Also, `g` is not mentioned in Section 2.2.2 of the paper, where this encoder is discussed.)
- In the _Decoder_, similarly, `z` is already informed by the speaker embedding, so why does `g` need to be supplied explicitly here as well?
