
Add files via upload #190

Open

jwtmd wants to merge 1 commit into FlagAI-Open:main from jwtmd:jwtmd

Conversation


@jwtmd jwtmd commented Apr 16, 2026

No description provided.

Signed-off-by: jwtmd <jwtmd@email.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the data annotation pipeline by introducing batch processing for improved efficiency, task-specific guidance for the LLM, and advanced example selection strategies, including similarity-, diversity-, and quality-based filtering. Feedback primarily addresses the use of hardcoded absolute paths, which limit portability, inefficient file I/O inside loops, and redundant or unreachable code in method.py. Additionally, improvements are suggested for tokenizer management, to avoid expensive re-initialization, and for correcting a type hint mismatch in the batch processing function.




tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)


Severity: high

The tokenizer is being re-loaded from a hardcoded absolute path on every call to select_examples. This is computationally expensive and non-portable. Since the tokenizer is already loaded in main.py, it should be passed as an argument to this function instead of being re-initialized here.
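
One way to address this is to accept the tokenizer as a parameter and create it once at the call site. The following is only a rough sketch: it borrows the token-budget logic from select_examples_backup in this diff, and the actual select_examples signature in the PR may differ.

from transformers import AutoTokenizer

def select_examples(all_examples, task_description, text2annotate, tokenizer):
    """Build the in-context examples string using a tokenizer supplied by the caller."""
    examples_str, token_num = "", 0
    for example in all_examples:
        candidate = f"# {example['input']} <label> {example['output'][0]} </label>\n"
        candidate_len = len(tokenizer(candidate)['input_ids'])
        if token_num + candidate_len > 10_000:  # stay within the context budget
            break
        examples_str += candidate
        token_num += candidate_len
    return examples_str

# In main.py (illustrative): load the tokenizer once and pass it down.
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, trust_remote_code=True)
examples_str = select_examples(all_examples, task_description, text2annotate, tokenizer)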

Comment on lines +13 to +14
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'


Severity: medium

Hardcoded absolute paths (/root/flagos/...) make the script non-portable and environment-dependent. Consider using relative paths or environment variables to define DATA_DIR and OUTPUT_DIR.

Suggested change
- DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
- OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+ DATA_DIR = './data'
+ OUTPUT_DIR = './outputs'
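
If the directories still need to be configurable per environment, an environment-variable override with a relative default is another option. This is a sketch only; the ICL_DATA_DIR and ICL_OUTPUT_DIR variable names are made up for illustration.

import os

# Fall back to relative paths when the environment variables are not set.
DATA_DIR = os.environ.get('ICL_DATA_DIR', './data')
OUTPUT_DIR = os.environ.get('ICL_OUTPUT_DIR', './outputs')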

Comment on lines +34 to +37
                        default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str,
                        default='/root/flagos/Qwen3-4B')


Severity: medium

The default values for --log_path_prefix and --tokenizer_path contain absolute paths specific to the current environment. These should be changed to relative paths or generic identifiers to allow the script to run on other systems without manual modification.

Suggested change
-     parser.add_argument('--log_path_prefix', type=str,
-                         default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
-                         help='Prefix path to save the evaluation logs.')
-     parser.add_argument('--tokenizer_path', type=str,
-                         default='/root/flagos/Qwen3-4B')
+     parser.add_argument('--log_path_prefix', type=str,
+                         default='./outputs/',
+                         help='Prefix path to save the evaluation logs.')
+     parser.add_argument('--tokenizer_path', type=str,
+                         default='Qwen/Qwen2-7B-Instruct')

Comment on lines +99 to +102
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')


Severity: medium

Opening the output file in append mode inside a loop for every sample in a batch is inefficient due to repeated I/O overhead. It is better to open the file once and write all results from the batch at once.

Suggested change
- for sid, (pred, _) in zip(sample_ids_batch, results):
-     test_record = {'test_sample_id': sid, 'prediction': pred}
-     with open(output_file, 'a') as f:
-         f.write(json.dumps(test_record)+'\n')
+ with open(output_file, 'a') as f:
+     for sid, (pred, _) in zip(sample_ids_batch, results):
+         test_record = {'test_sample_id': sid, 'prediction': pred}
+         f.write(json.dumps(test_record) + '\n')

Comment on lines +109 to +112
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')


Severity: medium

Similar to the batch processing loop, opening the file for each remaining sample is inefficient. Consider opening the file once to write the remaining results.

Suggested change
- for sid, (pred, _) in zip(sample_ids_batch, results):
-     test_record = {'test_sample_id': sid, 'prediction': pred}
-     with open(output_file, 'a') as f:
-         f.write(json.dumps(test_record)+'\n')
+ with open(output_file, 'a') as f:
+     for sid, (pred, _) in zip(sample_ids_batch, results):
+         test_record = {'test_sample_id': sid, 'prediction': pred}
+         f.write(json.dumps(test_record) + '\n')

Comment on lines +441 to +540
"""
Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
"""
prompt = (
"### Role Definition\n"
"You are a professional data annotation expert specializing in long-context text labeling. "
"Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

"### Core Annotation Task\n"
f"{task_description}\n\n"

"### Non-Negotiable Annotation Rules (Highest Priority)\n"
"1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
"2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
"3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
"4. **Prohibited Outputs**: \n"
" - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
" - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
" - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

"### Correct vs. Incorrect Examples\n"
"✅ Correct Example 1: <label>answer</label>\n"
"✅ Correct Example 2: <label>Bad Review</label>\n"
"❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
"❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
"❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

"### Reference Annotation Examples\n"
"{EXAMPLES}\n\n"

"### Text to Annotate\n"
f"{text2annotate}\n\n"

"### Final Output Command (Re-emphasized)\n"
"You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
"Annotation Result: "
)
return prompt

def build_prompt_backup(task_description:str, text2annotate:str)->str:
"""
Construct the prompt for annotation based on the task description.
task_description:
The description of the annotation task.
For example, ``Given an English language product review,
determine if it is a Good Review or a Bad Review.``
text2annotate:
The text that needs to be annotated.
For example, ``My son received this book as a gift. I was extremely disappointed.``
"""
prompt = (
"You are a data annotation assistant. "
"Your task is to label the given texts according to the task description "
"and annotation guidelines provided below.\n\n"
f"[Task Description]\n {task_description}\n\n"
"[Examples]\n {EXAMPLES}\n\n"
"Please follow these instructions when labeling:\n"
"1. **Output Format**: Annotate the text directly by wrapping each labeled "
"span with <label> tags in the following format: <label> annotation result </label>.\n"
# "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
f"[Task Description (repeat)] \n {task_description}\n\n"
f"[Input Texts]\n {text2annotate}\n\n"
"Please output the annotation results: "
)
return prompt

def select_examples_backup(all_examples:list[dict], task_description:str, text2annotate:str)->str:
"""
Select examples from all_examples to fit into the target context length.
all_examples:
A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79``},
task_description:
The description of the annotation task which may be used for example evaluation.
For example, ``Given an English language product review,
determine if it is a Good Review or a Bad Review.``
text2annotate:
The text that needs to be annotated which may be used for example retrieval.
For example, ``My son received this book as a gift. I was extremely disappointed.``

"""
# Notice that the maximum context length is restricted.
target_length = 10_000

input_list = [example['input'] for example in all_examples]
output_list = [example['output'][0] for example in all_examples]
length_list = [example['length'] for example in all_examples]

# <label> have 2 tokens; </label> have 3 tokens; \n have 1 token; # have 1 token.
examples_str, token_num = "", 0
for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
if length + token_num <= target_length:
token_num += (length + 2 + 3 + 1 + 1)
example_str = f"# {input_text} <label> {output_text} </label>\n"
examples_str += example_str
else:
return examples_str, i
return examples_str



Severity: medium

This section contains unreachable code (lines 441-478) following a return statement, as well as redundant backup functions (build_prompt_backup, select_examples_backup). These should be removed to improve code quality and maintainability.

    return prediction, whole_result


def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:


Severity: medium

The return type hint list[str] is incorrect. The function actually returns a list of tuples (prediction, whole_result) as returned by annotate_ascend.

Suggested change
- def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+ def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
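
If the element types are stable, the hint could be narrowed further. This assumes annotate_ascend returns a (prediction, whole_result) pair of strings; adjust to the actual types if they differ.

def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple[str, str]]: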
