Conversation
Signed-off-by: jwtmd <jwtmd@email.com>
Code Review
This pull request significantly enhances the data annotation pipeline by introducing batch processing for improved efficiency, task-specific guidance for the LLM, and advanced example selection strategies, including similarity-, diversity-, and quality-based filtering. Feedback primarily addresses hardcoded absolute paths that limit portability, inefficient file I/O operations inside loops, and redundant or unreachable code in `method.py`. Improvements are also suggested for tokenizer management, to avoid expensive re-initialization, and for a type hint mismatch in the batch processing function.
```python
tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
```
```python
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
```
Hardcoded absolute paths (`/root/flagos/...`) make the script non-portable and environment-dependent. Consider using relative paths or environment variables to define `DATA_DIR` and `OUTPUT_DIR`.
Suggested change:

```diff
-DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
-OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+DATA_DIR = './data'
+OUTPUT_DIR = './outputs'
```
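Going one step further than the relative-path suggestion, the directories could be read from environment variables with repository-relative fallbacks. This is a minimal sketch; the variable names `ICL_DATA_DIR` and `ICL_OUTPUT_DIR` are hypothetical, not part of the PR:

```python
import os
from pathlib import Path

# Hypothetical env var names; fall back to relative defaults when unset.
DATA_DIR = Path(os.environ.get("ICL_DATA_DIR", "./data"))
OUTPUT_DIR = Path(os.environ.get("ICL_OUTPUT_DIR", "./outputs"))
```

Users on a different machine can then point the script at their own data with `ICL_DATA_DIR=/mnt/data python annotate.py`, with no source edits.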
```python
                    default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
                    default='/root/flagos/Qwen3-4B')
```
The default values for `--log_path_prefix` and `--tokenizer_path` contain absolute paths specific to the current environment. These should be changed to relative paths or generic identifiers to allow the script to run on other systems without manual modification.
Suggested change:

```python
parser.add_argument('--log_path_prefix', type=str,
                    default='./outputs/',
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
                    default='Qwen/Qwen2-7B-Instruct')
```

```python
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')
```
Opening the output file in append mode inside a loop for every sample in a batch is inefficient due to repeated I/O overhead. It is better to open the file once and write all results from the batch at once.
Suggested change:

```diff
-for sid, (pred, _) in zip(sample_ids_batch, results):
-    test_record = {'test_sample_id': sid, 'prediction': pred}
-    with open(output_file, 'a') as f:
-        f.write(json.dumps(test_record)+'\n')
+with open(output_file, 'a') as f:
+    for sid, (pred, _) in zip(sample_ids_batch, results):
+        test_record = {'test_sample_id': sid, 'prediction': pred}
+        f.write(json.dumps(test_record) + '\n')
```
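The batched-write idea can also be factored into a small helper that takes an already-open file handle, so both the main loop and the remainder loop share one code path. This is a sketch, not the PR's code; `write_batch` is a hypothetical name, demonstrated here against an in-memory buffer:

```python
import io
import json

def write_batch(f, sample_ids_batch, results):
    """Write one JSONL record per sample using an already-open file handle."""
    lines = []
    for sid, (pred, _) in zip(sample_ids_batch, results):
        lines.append(json.dumps({'test_sample_id': sid, 'prediction': pred}))
    # A single write per batch instead of one open()+write() per sample.
    f.write('\n'.join(lines) + '\n')

buf = io.StringIO()  # stands in for open(output_file, 'a')
write_batch(buf, ['s1', 's2'], [('A', None), ('B', None)])
```

In the real script the caller would do `with open(output_file, 'a') as f: write_batch(f, ...)` once per batch, amortizing the open/close cost.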
```python
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')
```
Similar to the batch processing loop, opening the file for each remaining sample is inefficient. Consider opening the file once to write the remaining results.
Suggested change:

```diff
-for sid, (pred, _) in zip(sample_ids_batch, results):
-    test_record = {'test_sample_id': sid, 'prediction': pred}
-    with open(output_file, 'a') as f:
-        f.write(json.dumps(test_record)+'\n')
+with open(output_file, 'a') as f:
+    for sid, (pred, _) in zip(sample_ids_batch, results):
+        test_record = {'test_sample_id': sid, 'prediction': pred}
+        f.write(json.dumps(test_record) + '\n')
```
```python
    """
    Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
    Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
    """
    prompt = (
        "### Role Definition\n"
        "You are a professional data annotation expert specializing in long-context text labeling. "
        "Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

        "### Core Annotation Task\n"
        f"{task_description}\n\n"

        "### Non-Negotiable Annotation Rules (Highest Priority)\n"
        "1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
        "2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
        "3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
        "4. **Prohibited Outputs**: \n"
        "   - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
        "   - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
        "   - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

        "### Correct vs. Incorrect Examples\n"
        "✅ Correct Example 1: <label>answer</label>\n"
        "✅ Correct Example 2: <label>Bad Review</label>\n"
        "❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
        "❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
        "❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

        "### Reference Annotation Examples\n"
        "{EXAMPLES}\n\n"

        "### Text to Annotate\n"
        f"{text2annotate}\n\n"

        "### Final Output Command (Re-emphasized)\n"
        "You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
        "Annotation Result: "
    )
    return prompt
```
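Note that the prompt builder interpolates `task_description` and `text2annotate` via f-strings but leaves `{EXAMPLES}` as a literal marker to be filled later. A hedged sketch of that second step, assuming a simple `str.replace`-based helper (the name `fill_examples` is hypothetical):

```python
def fill_examples(prompt_template: str, examples_str: str) -> str:
    # The template keeps a literal "{EXAMPLES}" marker; str.replace is used
    # instead of str.format, which would trip over other braces in the prompt.
    return prompt_template.replace("{EXAMPLES}", examples_str)

filled = fill_examples(
    "### Reference Annotation Examples\n{EXAMPLES}\n",
    "# good product <label> Good Review </label>\n",
)
```

Keeping the substitution separate lets the same prompt template be reused while the selected examples vary per test sample.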
```python
def build_prompt_backup(task_description: str, text2annotate: str) -> str:
    """
    Construct the prompt for annotation based on the task description.
    task_description:
        The description of the annotation task.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    prompt = (
        "You are a data annotation assistant. "
        "Your task is to label the given texts according to the task description "
        "and annotation guidelines provided below.\n\n"
        f"[Task Description]\n {task_description}\n\n"
        "[Examples]\n {EXAMPLES}\n\n"
        "Please follow these instructions when labeling:\n"
        "1. **Output Format**: Annotate the text directly by wrapping each labeled "
        "span with <label> tags in the following format: <label> annotation result </label>.\n"
        # "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
        f"[Task Description (repeat)] \n {task_description}\n\n"
        f"[Input Texts]\n {text2annotate}\n\n"
        "Please output the annotation results: "
    )
    return prompt
```
```python
def select_examples_backup(all_examples: list[dict], task_description: str, text2annotate: str) -> str:
    """
    Select examples from all_examples to fit into the target context length.
    all_examples:
        A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
        For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79}``
    task_description:
        The description of the annotation task which may be used for example evaluation.
        For example, ``Given an English language product review,
        determine if it is a Good Review or a Bad Review.``
    text2annotate:
        The text that needs to be annotated which may be used for example retrieval.
        For example, ``My son received this book as a gift. I was extremely disappointed.``
    """
    # Notice that the maximum context length is restricted.
    target_length = 10_000

    input_list = [example['input'] for example in all_examples]
    output_list = [example['output'][0] for example in all_examples]
    length_list = [example['length'] for example in all_examples]

    # <label> have 2 tokens; </label> have 3 tokens; \n have 1 token; # have 1 token.
    examples_str, token_num = "", 0
    for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
        if length + token_num <= target_length:
            token_num += (length + 2 + 3 + 1 + 1)
            example_str = f"# {input_text} <label> {output_text} </label>\n"
            examples_str += example_str
        else:
            return examples_str, i
    return examples_str
```
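One concrete instance of the "redundant or unreachable code" concern: `select_examples_backup` returns a `(examples_str, i)` tuple when the budget is hit but a bare string when the loop completes, so callers get inconsistent shapes. A hedged sketch of a consistent variant (a suggestion, not the PR's code; the simplified `select_examples` name and per-example cost model are assumptions carried over from the comments above):

```python
def select_examples(all_examples: list[dict], target_length: int = 10_000) -> tuple[str, int]:
    """Greedily pack examples until the token budget is hit.

    Always returns (examples_str, num_examples_used), unlike the backup
    version, whose two branches return different shapes.
    """
    examples_str, token_num = "", 0
    for i, ex in enumerate(all_examples):
        # Rough overhead per example: <label> ~2 tokens, </label> ~3 tokens,
        # '#' ~1 token, '\n' ~1 token (per the original comment).
        cost = ex['length'] + 2 + 3 + 1 + 1
        if token_num + cost > target_length:
            return examples_str, i
        token_num += cost
        examples_str += f"# {ex['input']} <label> {ex['output'][0]} </label>\n"
    return examples_str, len(all_examples)

demo = [{'input': 'great', 'output': ['Good Review'], 'length': 5},
        {'input': 'awful', 'output': ['Bad Review'], 'length': 5}]
s, n = select_examples(demo, target_length=12)  # budget fits only the first example
```

A uniform return type also makes the function straightforward to annotate as `-> tuple[str, int]`, which ties into the type hint feedback below on `annotate_batch`.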
```python
    return prediction, whole_result
```
```python
def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
```
The return type hint `list[str]` is incorrect. The function actually returns a list of `(prediction, whole_result)` tuples, as produced by `annotate_ascend`.
Suggested change:

```diff
-def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
```
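Beyond the bare `list[tuple]` suggestion, the tuple's element types could be spelled out with a type alias. This is a hypothetical sketch using a stub in place of the real batched call; the assumption that both elements are strings comes from the `(prediction, whole_result)` naming, not from the PR itself:

```python
# Hypothetical alias naming the per-sample result shape.
AnnotationResult = tuple[str, str]  # (prediction, whole_result) — element types assumed

def annotate_batch_stub(prompts: list[str]) -> list[AnnotationResult]:
    # Stand-in for the real batched call; returns one (prediction, raw) pair per prompt.
    return [("label", "raw output") for _ in prompts]

out = annotate_batch_stub(["p1", "p2"])
```

A named alias keeps the signature readable and lets static checkers flag callers that forget to unpack the tuple.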