Hi,
Thank you for sharing this project! I have a few questions about the implementation details and would appreciate your clarification:
- In the code, I noticed that `get_enhance_weight()` seems to act like the “temperature” parameter mentioned in the paper, but I couldn’t find any reference in the paper to multiplying `mean_scores.mean()` by `num_frames`:

  ```python
  enhance_scores = mean_scores.mean() * (num_frames + get_enhance_weight())
  ```
  - Why is `mean_scores.mean()` used here? What exactly is being averaged, and what does it represent?
  - How does this averaging influence the final enhancement result?
  - What is the rationale behind multiplying by `num_frames`, and how does this relate to the method described in the paper?
- I also noticed that `enhance_scores` is multiplied by the attention output, and it is mentioned that this improves temporal consistency. Could you explain the theoretical or intuitive reasoning behind this? Specifically, why does scaling the attention output with `enhance_scores` contribute to better temporal consistency?
Any insights you can share would be greatly appreciated. Thank you in advance for your time!
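For context, here is a minimal sketch of how I currently read the computation. The shapes, function name, and the final scaling step are my assumptions for illustration, not the repo's actual API:

```python
import numpy as np

def compute_enhance_scores(mean_scores: np.ndarray, num_frames: int,
                           enhance_weight: float) -> float:
    """My reading of the line in question (shapes are assumed):
    mean_scores is presumably a per-frame-pair average of the
    cross-frame attention map, e.g. shape (num_frames, num_frames).
    """
    # Global average over all frame pairs: a single scalar that
    # summarizes how strongly frames attend to one another.
    avg = mean_scores.mean()
    # The scalar is then scaled by (num_frames + enhance_weight),
    # which is the part I could not map back to the paper.
    return avg * (num_frames + enhance_weight)

# Toy usage: a uniform cross-frame attention map over 4 frames.
F = 4
scores = np.full((F, F), 1.0 / F)  # each row sums to 1
enhance = compute_enhance_scores(scores, F, enhance_weight=2.0)
# avg = 0.25, so enhance = 0.25 * (4 + 2) = 1.5
print(enhance)

# As I understand it, this scalar is then multiplied into the
# attention output, e.g.:
attn_out = np.ones((F, 8))          # placeholder attention output
scaled_out = enhance * attn_out     # the step said to help consistency
```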