Code Review
This pull request significantly enhances the data annotation framework by introducing advanced example selection strategies (similarity, diversity, and quality-based), task-specific guidance, and specialized workflows like self-consistency and multi-turn dialogs. It also implements batch processing for improved efficiency. The review feedback highlights several improvement opportunities, including replacing hardcoded absolute paths with relative ones to ensure portability, optimizing tokenizer initialization, correcting function type hints, and increasing the flexibility of the batch annotation utility.
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
parser.add_argument('--max_input_length', type=int, default=10_000,
                    help='Maximum input length for the model.')
parser.add_argument('--log_path_prefix', type=str,
                    default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
The default value for --log_path_prefix is a hardcoded absolute path. A relative path such as './outputs/' would let the script run in other environments without modification.
Suggested change:
- default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
+ default='./outputs/',
                    default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
                    default='/root/flagos/Qwen3-4B')
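Applying the same principle to every option, the parser could look like the sketch below. The relative `'./Qwen3-4B'` tokenizer default is illustrative only: the checkpoint location is deployment-specific, so it could equally come from an environment variable.

```python
import argparse

# Sketch of the argument parser with portable, relative defaults.
# './Qwen3-4B' is a placeholder, not the PR's actual default.
parser = argparse.ArgumentParser()
parser.add_argument('--max_input_length', type=int, default=10_000,
                    help='Maximum input length for the model.')
parser.add_argument('--log_path_prefix', type=str, default='./outputs/',
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str, default='./Qwen3-4B')

args = parser.parse_args([])  # empty argv: every option falls back to its default
```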
lines = text.strip().split('\n')
tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
The tokenizer is being re-initialized from a hardcoded absolute path every time select_examples is called. This is inefficient and makes the code dependent on a specific file system structure. Consider passing the tokenizer as an argument to the function or using a global cache to store the initialized tokenizer.
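One way to do this, sketched below: a module-level cache keyed by path, so `AutoTokenizer.from_pretrained` runs at most once per checkpoint. The `loader` parameter is an illustrative injection hook, not part of the original code.

```python
_TOKENIZER_CACHE = {}

def get_tokenizer(tokenizer_path, loader=None):
    """Return a cached tokenizer, loading it at most once per path."""
    if tokenizer_path not in _TOKENIZER_CACHE:
        if loader is None:
            # Deferred import: only paid on the first (cache-miss) call.
            from transformers import AutoTokenizer
            loader = lambda p: AutoTokenizer.from_pretrained(p, trust_remote_code=True)
        _TOKENIZER_CACHE[tokenizer_path] = loader(tokenizer_path)
    return _TOKENIZER_CACHE[tokenizer_path]
```

`select_examples` would then call `get_tokenizer(args.tokenizer_path)` instead of re-reading the checkpoint from disk on every invocation, and the path would come from the CLI rather than being baked in.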
def count_answer(text: str) -> tuple[list, dict]:
    return cleaned_text
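If the `tuple[list, dict]` annotation and the `return cleaned_text` line belong to the same function (which this excerpt suggests but does not prove), the hint contradicts the returned value. A hedged sketch of the corrected signature, with a placeholder body standing in for the PR's real cleaning logic:

```python
def count_answer(text: str) -> str:
    # Placeholder body; the point is that the return annotation
    # matches what the function actually returns.
    cleaned_text = text.strip()
    return cleaned_text
```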
def annotate_nvidia(input_prompt: str) -> list[str]:
results = [None] * len(prompts)

with ThreadPoolExecutor(max_workers=num_workers) as executor:
    future_to_idx = {executor.submit(annotate_ascend, p, max_tokens, use_count_answer, task_id): i for i, p in enumerate(prompts)}
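The flexibility concern from the review summary could be addressed by passing the backend annotator in as a parameter instead of hardcoding `annotate_ascend`. A sketch (the `annotate_batch` name and `annotate_fn` parameter are illustrative, not from the PR):

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_batch(prompts, annotate_fn, num_workers=8, **annotate_kwargs):
    """Run annotate_fn over prompts concurrently, preserving input order."""
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        future_to_idx = {
            executor.submit(annotate_fn, p, **annotate_kwargs): i
            for i, p in enumerate(prompts)
        }
        # Each result is written back to the slot of its originating prompt.
        for future, idx in future_to_idx.items():
            results[idx] = future.result()
    return results
```

With this shape, `annotate_batch(prompts, annotate_ascend, max_tokens=..., ...)` and `annotate_batch(prompts, annotate_nvidia)` share one utility instead of duplicating the executor boilerplate per backend.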
Signed-off-by: peyiran peyiran@proton.me