feat: make reward functions support parallel computation #398
0x404 wants to merge 4 commits into huggingface:main
Conversation
```python
def _single_thread_call(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
    results = []
    for idx, completion in enumerate(completions):
        # prepare per-completion kwargs
        per_completion_kwargs = {}
        for key, value in kwargs.items():
            if isinstance(value, list):
                per_completion_kwargs[key] = value[idx]
            else:
                per_completion_kwargs[key] = value
        results.append(self.reward_on_single_completion(completion, **per_completion_kwargs))
    return results
```
The reason for implementing _single_thread_call is that some reward evaluations are not thread-safe. When max_workers==1, we use _single_thread_call to sequentially compute the reward for each example.
Currently, math_verify.parse is not thread-safe: huggingface/Math-Verify#22
This sequential execution mode ensures correct reward computation for non-thread-safe evaluators, even though it doesn't take advantage of parallel processing capabilities.
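As a hedged sketch of this design (not the PR's actual code), the parallel counterpart could dispatch completions to a thread pool while keeping the sequential path for `max_workers == 1`; the names `parallel_call` and `_split_kwargs` below are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def _split_kwargs(idx: int, kwargs: dict) -> dict:
    # pick the idx-th element of any list-valued kwarg, pass scalars through
    return {k: v[idx] if isinstance(v, list) else v for k, v in kwargs.items()}


def parallel_call(reward_fn, completions: List[Dict[str, str]], max_workers: int = 4, **kwargs) -> List[float]:
    """Compute per-completion rewards, sequentially when max_workers == 1."""
    if max_workers == 1:
        # sequential path for non-thread-safe evaluators (e.g. math_verify.parse)
        return [reward_fn(c, **_split_kwargs(i, kwargs)) for i, c in enumerate(completions)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(reward_fn, c, **_split_kwargs(i, kwargs)) for i, c in enumerate(completions)]
        return [f.result() for f in futures]  # results collected in input order
```

Collecting futures in submission order keeps the returned reward list aligned with the input completions regardless of which worker finishes first.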
I'll update to multiprocessing instead of multithreading if you find this PR acceptable. (huggingface/Math-Verify#22 (comment))
edbeeching left a comment
Thanks for adding this. Can you add a test that compares single- vs multi-thread/process execution to ensure they both return the same results?
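A parity test along these lines could satisfy this request (a minimal sketch with a toy deterministic reward, not the PR's actual test):

```python
from concurrent.futures import ThreadPoolExecutor


def reward(completion: str) -> float:
    # toy deterministic reward: fraction of uppercase characters
    return sum(ch.isupper() for ch in completion) / max(len(completion), 1)


def test_single_vs_parallel_parity():
    completions = ["Hello World", "all lower", "MIXED case TEXT", ""]
    serial = [reward(c) for c in completions]
    with ThreadPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(reward, completions))
    # both execution modes must agree in values and ordering
    assert serial == parallel
```

Using a pure, deterministic reward keeps the comparison exact; rewards with randomness would need a fixed seed per completion for this equality to hold.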
Sure
Hey @0x404 I am looking at GRPO training with code rewards today, so I will try and get this PR cleaned up and merged.
Sure, thanks. I've been busy with other things for the past couple of days and haven't had time to update this PR.
Hi @0x404 thanks for the PR! Aside from code execution with E2B, do you know which (if any) of the reward functions are slow to execute? If it's just the E2B sandbox, I'm wondering if it's better to use their native async sandbox instead of enforcing multi-threading on all rewards.
Hi @lewtun, I think currently only the E2B rewards are relatively slow.
Thanks, let me take a stab at making it async first since I am quite partial to the simplicity of having simple functions per reward. |
Resolved with async by #484?
yes |
Motivation:
Current reward functions generally take a list of completions and compute the reward for each example serially. In some cases, computing the reward for a single example can take a long time, for example when running code evaluation in E2B. In such situations, computing the reward scores for the examples in parallel would improve training speed.
Approach:
This PR abstracts out a `BaseRewardFunction` class; each reward function inherits from this class and implements its own `reward_on_single_completion`. This method receives the completion for each example to be evaluated, along with its corresponding kwargs. This refactor should make the code clearer and help us support more reward functions in the future.
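The abstraction described above might look roughly like the following (a hedged sketch reconstructed from the PR text; only `BaseRewardFunction` and `reward_on_single_completion` are named in the PR, and the `LengthReward` subclass is a toy illustration):

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseRewardFunction(ABC):
    """Base class: handles batching; subclasses score one completion at a time."""

    def __call__(self, completions: List[Dict[str, str]], **kwargs) -> List[float]:
        results = []
        for idx, completion in enumerate(completions):
            # route the idx-th element of any list-valued kwarg to this completion
            per_kwargs = {k: (v[idx] if isinstance(v, list) else v) for k, v in kwargs.items()}
            results.append(self.reward_on_single_completion(completion, **per_kwargs))
        return results

    @abstractmethod
    def reward_on_single_completion(self, completion: Dict[str, str], **kwargs) -> float:
        """Score a single completion; subclasses implement the actual reward logic."""


class LengthReward(BaseRewardFunction):
    # hypothetical subclass for illustration: rewards shorter completions
    def reward_on_single_completion(self, completion: Dict[str, str], target_len: int = 100, **kwargs) -> float:
        return max(0.0, 1.0 - len(completion["content"]) / target_len)
```

Centralizing the per-completion kwarg splitting in the base class is what makes it possible to swap the batching loop for a thread or process pool without touching the individual reward implementations.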