
[bugfix] firered conv2dsubsampling4 and transformer cmvn for padded inputs #2806

Open
shen9712 wants to merge 5 commits into wenet-e2e:main from shen9712:main

Conversation

@shen9712

  1. Incorrect mask computation in conv2dsubsampling4

Current implementation:

mask = x_mask[:, :, :-2:2][:, :, :-2:2]

This slicing-based mask downsampling is inaccurate: for padded sequences it can over-count valid frames and does not match the actual output lengths of the conv2d subsampling.

I propose to recompute the mask from sequence lengths instead:

# recompute the output lengths of the two stride-2 convs, then rebuild the mask
x_lens = torch.floor((torch.floor((x_lens - 1) / 2) - 1) / 2).to(x_lens.dtype)
mask = make_non_pad_mask(x_lens).unsqueeze(1)

This matches the real length transformation of conv2dsubsampling4 and avoids accumulated alignment errors.
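A minimal sketch of the mismatch (using a simplified stand-in for wenet's `make_non_pad_mask`, here taking an explicit `max_len`; names are illustrative). Each stride-2, kernel-3 conv maps a length `L` to `floor((L - 1) / 2)`, and the slicing-based mask over-counts valid frames for padded sequences:

```python
import torch

def make_non_pad_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # True for valid positions, False for padding (simplified helper for this sketch)
    seq = torch.arange(max_len, device=lengths.device)
    return seq.unsqueeze(0) < lengths.unsqueeze(1)

T = 16
x_lens = torch.tensor([16, 11, 8])
x_mask = make_non_pad_mask(x_lens, T).unsqueeze(1)  # (B, 1, T)

# current slicing-based downsampling of the mask
sliced = x_mask[:, :, :-2:2][:, :, :-2:2]

# proposed: recompute lengths through the two stride-2 convs
out_lens = torch.floor((torch.floor((x_lens - 1) / 2) - 1) / 2).to(x_lens.dtype)
recomputed = make_non_pad_mask(out_lens, sliced.size(-1)).unsqueeze(1)

print(sliced.sum(-1).squeeze().tolist())      # valid counts from slicing
print(recomputed.sum(-1).squeeze().tolist())  # true conv output lengths
```

For the fully valid sequence (length 16) both agree, but for the padded sequences (lengths 11 and 8) the sliced mask keeps more positions than the convolutions actually produce.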

  2. CMVN is applied to padded positions before convolution

Current code:

if self.global_cmvn is not None:
    xs = self.global_cmvn(xs)
xs, pos_emb, masks = self.embed(xs, masks)

Here CMVN is applied to padded frames as well. Since the embedding module contains convolutions with right context, padded positions (which become non-zero after CMVN) leak into valid frames, producing incorrect features.

CMVN should ignore padded positions, or padded frames should be explicitly zeroed after CMVN before convolution.

if self.global_cmvn is not None:
    xs = self.global_cmvn(xs)
# zero out padded frames so CMVN-shifted padding cannot leak through the conv
xs = xs * masks.transpose(1, 2)
xs, pos_emb, masks = self.embed(xs, masks)
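The leakage can be demonstrated with a toy 1-D analogue of the embedding conv (kernel 3, one frame of right context); the module names and values below are illustrative, not wenet's actual code:

```python
import torch

conv = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv.weight.fill_(1.0)  # sum over a 3-frame window

x = torch.zeros(1, 1, 8)
x[:, :, :5] = 1.0          # 5 valid frames, 3 padded zeros
mask = torch.zeros(1, 1, 8)
mask[:, :, :5] = 1.0

# global-CMVN-style normalization: padded zeros become (0 - mean) / std != 0
mean, std = 0.5, 2.0
cmvn = (x - mean) / std

leaky = conv(cmvn)          # padded position 5 leaks into valid frame 4
clean = conv(cmvn * mask)   # padding zeroed after CMVN, as proposed

print(leaky[0, 0, 4].item(), clean[0, 0, 4].item())
```

The outputs at fully interior valid frames agree, but the last valid frame differs because its window includes a padded position that CMVN shifted away from zero.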
