Conversation
Ritesh1905
reviewed
Sep 20, 2025
allenwang28
reviewed
Sep 20, 2025
Contributor
allenwang28
left a comment
Prompt: What is the capital of Japan?
Responses: ['Aardvark', 'Durian', 'Tokyo']
Generation Results:
================================================================================
Sample 1
Evaluation: 3
--------------------------------------------------------------------------------
Sample 2
Evaluation: 3
--------------------------------------------------------------------------------
Sample 3
Evaluation: 3
--------------------------------------------------------------------------------
Sample 4
Evaluation: 3
--------------------------------------------------------------------------------
lol is this working correctly?
Contributor
Author
I wrote this prompt from the deep archives of my mind and I'm also shocked that the prompting worked.
Jack-Khuu
commented
Sep 24, 2025
allenwang28
reviewed
Oct 6, 2025
joecummings
reviewed
Oct 6, 2025
Contributor
Author
Update (10/10): This is a much-simplified version of the previous iterations of this PR. It provides just a basic example of LLM Judges for GRPO. Future work leveraging structured decoding is out of scope; it requires additional investigation into how to configure it with CoT models (which need to think).
Judges can be used as either "Verifiers" or "Graders". This PR adds a CorrectnessJudge to the sandbox, as an example of how an LLM Judge can be used in GRPO (note that this PR does not integrate it).
It takes as input a (prompt + response) pair generated from a model, and returns whether the judge model thinks the response accurately answers the prompt. Results can then be used to make decisions during GRPO whitening.
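For readers who want a feel for the flow described above, here is a minimal sketch of a correctness judge. It is not the PR's implementation: the names `build_judge_prompt`, `parse_verdict`, `judge_responses`, and the `generate` callable are all hypothetical, and `fake_generate` is a stand-in for a real LLM call. Reading only the last non-empty line of the judge output is one assumed way to tolerate CoT models that think before answering.

```python
from typing import Callable, List

# Hypothetical judge prompt; the wording used in the actual PR may differ.
JUDGE_TEMPLATE = (
    "You are grading a model's answer.\n"
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    "Does the response answer the prompt accurately? "
    "Think if you need to, then end with a final line of YES or NO."
)


def build_judge_prompt(prompt: str, response: str) -> str:
    """Combine the original prompt and the policy's response into one judge input."""
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)


def parse_verdict(judge_output: str) -> bool:
    """Return True only if the last non-empty line starts with YES.

    Looking at the last line lets a CoT judge reason first (assumption).
    """
    lines = [ln.strip() for ln in judge_output.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].upper().startswith("YES")


def judge_responses(
    prompt: str,
    responses: List[str],
    generate: Callable[[str], str],
) -> List[bool]:
    """Judge each response independently with the supplied generate function."""
    return [parse_verdict(generate(build_judge_prompt(prompt, r))) for r in responses]


def fake_generate(judge_prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs without a model."""
    if "Response: Tokyo" in judge_prompt:
        return "Tokyo is the capital of Japan.\nYES"
    return "That is not a city in Japan.\nNO"


verdicts = judge_responses(
    "What is the capital of Japan?",
    ["Aardvark", "Durian", "Tokyo"],
    fake_generate,
)
# Map verdicts to scalar rewards that a GRPO loop could consume.
rewards = [1.0 if v else 0.0 for v in verdicts]
print(verdicts, rewards)
```

In a real setup, `generate` would wrap whatever inference service hosts the judge model, and the rewards would feed into the group-level processing that the description refers to.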