[bugfix] firered conv2dsubsampling4 and transformer cmvn for padded inputs #2806
Open
shen9712 wants to merge 5 commits into wenet-e2e:main
Current implementation:
This slicing-based mask downsampling is inaccurate: it does not reflect the actual output lengths of the conv2d subsampling.
I propose to recompute the mask from sequence lengths instead:
This matches the real length transformation of conv2dsubsampling4 and avoids accumulated alignment errors.
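A minimal sketch of recomputing the mask from lengths, not the code in this PR. It assumes two stacked convolutions along time, each with kernel size 3, stride 2, and no time-axis padding; the FireRed variant may use different values. The helper names (`conv_out_len`, `subsampled_lengths`, `lengths_to_mask`) are hypothetical, and plain Python lists stand in for tensors:

```python
def conv_out_len(length, kernel_size=3, stride=2):
    """Output length of one unpadded strided convolution along time."""
    if length < kernel_size:
        return 0
    return (length - kernel_size) // stride + 1


def subsampled_lengths(lengths, num_layers=2, kernel_size=3, stride=2):
    """Apply the per-layer length formula once per conv layer (assumed: 2 layers)."""
    out = []
    for n in lengths:
        for _ in range(num_layers):
            n = conv_out_len(n, kernel_size, stride)
        out.append(n)
    return out


def lengths_to_mask(lengths, max_len):
    """Boolean mask per utterance: True for valid frames, False for padding."""
    return [[t < n for t in range(max_len)] for n in lengths]


# Valid (unpadded) frame counts for a batch of three utterances.
lengths = [100, 57, 7]
new_lengths = subsampled_lengths(lengths)
# The padded batch has max(lengths) frames, so this is the output time dimension.
max_len_out = subsampled_lengths([max(lengths)])[0]
mask = lengths_to_mask(new_lengths, max_len_out)
```

The mask is derived from the same formula the convolutions follow, so it stays consistent with the output if the kernel size, stride, or number of layers changes.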
Current code:
Here CMVN is also applied to padded frames, so padding that was zero generally becomes non-zero after normalization. Because the embedding module contains convolutions with right context, these padded positions leak into the last valid frames and produce incorrect features.
Either CMVN should skip padded positions, or padded frames should be explicitly zeroed after CMVN and before the convolution.
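A minimal sketch of the second option, zeroing padded frames after CMVN. It is not the code in this PR: `apply_cmvn` and `masked_cmvn` are hypothetical helpers, CMVN is assumed to be `(x - mean) * istd` per feature dimension, and plain Python lists stand in for tensors:

```python
def apply_cmvn(frames, mean, istd):
    """Per-dimension normalization (x - mean) * istd for every frame."""
    return [[(v - m) * s for v, m, s in zip(frame, mean, istd)]
            for frame in frames]


def masked_cmvn(frames, length, mean, istd):
    """Apply CMVN, then reset frames at index >= length (padding) to zeros."""
    normed = apply_cmvn(frames, mean, istd)
    return [frame if t < length else [0.0] * len(frame)
            for t, frame in enumerate(normed)]


# Two valid frames followed by one zero-padded frame (assumed toy values).
frames = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mean, istd = [1.0, 2.0], [0.5, 0.25]

plain = apply_cmvn(frames, mean, istd)       # padded frame becomes non-zero
fixed = masked_cmvn(frames, 2, mean, istd)   # padded frame is zero again
```

In `plain`, the padded frame is `(0 - mean) * istd`, which is what a convolution with right context would mix into the last valid frame. In `fixed`, that frame is zero before it reaches the convolution.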