Code Review
This pull request significantly enhances the data annotation pipeline by implementing batch processing, self-consistency checks, and advanced example selection strategies using TF-IDF similarity, diversity clustering, and quality filtering. It also introduces task-specific guidance and Chain-of-Thought reasoning. The review feedback identifies several opportunities to improve code portability and efficiency, such as replacing hardcoded absolute paths with relative ones, passing the pre-loaded tokenizer to functions to avoid redundant loading, correcting function type hints, and making model names configurable.
```python
examples_str = select_examples(icl_examples, task_description, text2annotate,
                               is_code_generation=is_code_generation,
                               use_task_aware=True, task_id=task_id,
                               use_quality_filter=True, quality_threshold=0.5,
                               use_diversity=False, use_similarity=False,
                               use_cot=use_cot)
```

The qwen_tokenizer instance already loaded in main.py should be passed to select_examples. This avoids the overhead of re-loading the tokenizer from disk and ensures that the tokenizer path provided via command-line arguments is actually used.

Suggested change:

```diff
 examples_str = select_examples(icl_examples, task_description, text2annotate,
+                               tokenizer=qwen_tokenizer,
                                is_code_generation=is_code_generation,
                                use_task_aware=True, task_id=task_id,
                                use_quality_filter=True, quality_threshold=0.5,
                                use_diversity=False, use_similarity=False,
                                use_cot=use_cot)
```
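The load-once-and-pass-down pattern this comment asks for can be sketched as follows. The StubTokenizer class is a stand-in for the real `AutoTokenizer.from_pretrained(args.tokenizer_path, trust_remote_code=True)` call (which needs the model files on disk); all names here are illustrative, not from the PR:

```python
class StubTokenizer:
    """Stand-in for a HuggingFace tokenizer; the real script would call
    AutoTokenizer.from_pretrained(...) once in main() and pass it down."""
    load_count = 0  # counts how many times a tokenizer was "loaded from disk"

    def __init__(self):
        StubTokenizer.load_count += 1

    def encode(self, text: str) -> list[str]:
        return text.split()  # crude whitespace tokenization, enough for the sketch

def select_examples(examples: list[str], tokenizer) -> list[str]:
    # The tokenizer arrives as an argument; no re-loading inside the function.
    return [e for e in examples if len(tokenizer.encode(e)) <= 4]

def main() -> int:
    tokenizer = StubTokenizer()  # loaded exactly once
    select_examples(["a b c", "a b c d e"], tokenizer)
    select_examples(["x y"], tokenizer)
    return StubTokenizer.load_count  # 1, no matter how many calls were made
```

Every call site shares the single instance, so the disk load happens once regardless of how many batches are annotated.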
```python
return examples_str, i
return examples_str
```
Update the function signature to accept the tokenizer as an argument. This allows the function to use the tokenizer already loaded in the main script, improving efficiency and removing the dependency on a hardcoded path.

Suggested change:

```diff
-def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
+def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, tokenizer: AutoTokenizer, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
```
Remove this line as the tokenizer should now be passed as an argument to the function. Hardcoding the path here makes the function ignore the user-specified tokenizer path.

Suggested change:

```diff
-tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
+# Tokenizer is now passed as an argument
```
Hardcoding absolute paths like /root/flagos/... makes the code non-portable and environment-dependent. It is better to use paths relative to the script location or environment variables to ensure the project can be run on different machines.

Suggested change:

```diff
-DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
-OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'data')
+OUTPUT_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'outputs')
```
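A variant of the suggested fix that also allows an environment-variable override can be sketched like this. The `resolve_dirs` helper and the `ICL_DATA_DIR` / `ICL_OUTPUT_DIR` variable names are hypothetical, not part of the PR:

```python
import os

def resolve_dirs(script_path: str, env=None) -> tuple[str, str]:
    """Compute data/output dirs relative to the calling script, with optional
    environment overrides. Helper and variable names are illustrative."""
    env = os.environ if env is None else env
    root = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    data_dir = env.get("ICL_DATA_DIR", os.path.join(root, "data"))
    output_dir = env.get("ICL_OUTPUT_DIR", os.path.join(root, "outputs"))
    return data_dir, output_dir

# In the script itself:
# DATA_DIR, OUTPUT_DIR = resolve_dirs(__file__)
```

Taking the environment mapping as a parameter keeps the helper testable without touching the real process environment.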
```python
    default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
    default='/root/flagos/Qwen3-4B')
```
```python
try:
    tfidf_matrix = vectorizer.fit_transform(texts)
except:
    tfidf_matrix = vectorizer.transform(texts)
```

The try-except block here is fragile and potentially hides errors. Since vectorizer is a global instance that is re-fitted on a combined list of texts in every call to ensure a common feature space, you should simply use fit_transform. If it fails, it indicates an issue with the input data that should be addressed rather than ignored.

Suggested change:

```diff
-try:
-    tfidf_matrix = vectorizer.fit_transform(texts)
-except:
-    tfidf_matrix = vectorizer.transform(texts)
+tfidf_matrix = vectorizer.fit_transform(texts)
```
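The reason a single fit_transform suffices is that the caller fits on examples and query together, so every vector lands in one shared feature space. A dependency-free miniature of that fit-once-on-the-combined-corpus idea (a simplified sketch, not sklearn's actual implementation, which additionally smooths the idf and L2-normalizes):

```python
import math
from collections import Counter

def tfidf_fit_transform(texts: list[str]) -> tuple[list[str], list[list[float]]]:
    """Fit TF-IDF on the combined corpus so all vectors share one vocabulary.
    Uses plain idf = log(n / df); sklearn also smooths and normalizes."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(1 for d in docs if w in d) for w in vocab}  # document frequency
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because the vocabulary is built from all texts at once, the query vector and every example vector have identical dimensions, and cosine similarity between them is always well-defined.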
```python
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:9010/v1/"
model = "Qwen3-4B-ascend-flagos"
```
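The review summary also calls for making the model name configurable. One hedged sketch using argparse, where the `--api-base` and `--model-name` flag names are illustrative rather than from the PR, and the defaults are the current hardcoded constants:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Expose the hardcoded endpoint and model name as CLI flags.
    Flag names are hypothetical; defaults mirror the existing constants."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-base", type=str, default="http://localhost:9010/v1/")
    parser.add_argument("--model-name", type=str, default="Qwen3-4B-ascend-flagos")
    return parser
```

Existing behavior is preserved when no flags are passed, while a different serving endpoint or model can be selected without editing the source.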
```python
return prediction, whole_result
```

The return type hint for annotate_batch is incorrect. The implementation returns a list of tuples (prediction, whole_result) from annotate_ascend, not a list of strings.

Suggested change:

```diff
-def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
```
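To see why list[tuple] is the right hint, here is a minimal sketch of the batch pattern with a stand-in for annotate_ascend. Note that `executor.map` preserves input order, so each output tuple lines up with its prompt:

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_one(prompt: str) -> tuple[str, dict]:
    """Stand-in for annotate_ascend: returns (prediction, whole_result)."""
    return prompt.upper(), {"prompt": prompt, "tokens": len(prompt.split())}

def annotate_batch(prompts: list[str], num_workers: int = 4) -> list[tuple]:
    # executor.map keeps results in input order, unlike as_completed.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(annotate_one, prompts))
```

Each element of the returned list is the (prediction, whole_result) pair for the corresponding prompt, which is exactly what a `list[tuple]` (or more precisely `list[tuple[str, dict]]`) hint describes.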