
Add files via upload #190

Open

jwtmd wants to merge 1 commit into FlagAI-Open:main from jwtmd:jwtmd

Conversation


@jwtmd jwtmd commented Apr 16, 2026

No description provided.

Signed-off-by: jwtmd <jwtmd@email.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the data annotation pipeline by introducing batch processing for improved efficiency, task-specific guidance for the LLM, and advanced example selection strategies, including similarity-, diversity-, and quality-based filtering. Feedback primarily addresses the use of hardcoded absolute paths, which limit portability, inefficient file I/O inside loops, and redundant or unreachable code in method.py. Additionally, improvements are suggested for tokenizer management, to avoid expensive re-initialization, and for correcting a type hint mismatch in the batch processing function.




tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)


Severity: high

The tokenizer is being re-loaded from a hardcoded absolute path on every call to select_examples. This is computationally expensive and non-portable. Since the tokenizer is already loaded in main.py, it should be passed as an argument to this function instead of being re-initialized here.
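
One way to address this is to accept the tokenizer as a parameter and create it once at the call site. The following is only a rough sketch: it borrows the token-budget logic from select_examples_backup in this diff, and the actual select_examples signature in the PR may differ.

from transformers import AutoTokenizer

def select_examples(all_examples, task_description, text2annotate, tokenizer):
    """Build the in-context examples string using a tokenizer supplied by the caller."""
    examples_str, token_num = "", 0
    for example in all_examples:
        candidate = f"# {example['input']} <label> {example['output'][0]} </label>\n"
        candidate_len = len(tokenizer(candidate)['input_ids'])
        if token_num + candidate_len > 10_000:  # stay within the context budget
            break
        examples_str += candidate
        token_num += candidate_len
    return examples_str

# In main.py (illustrative): load the tokenizer once and pass it down.
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, trust_remote_code=True)
examples_str = select_examples(all_examples, task_description, text2annotate, tokenizer)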

Comment on lines +13 to +14
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'


Severity: medium

Hardcoded absolute paths (/root/flagos/...) make the script non-portable and environment-dependent. Consider using relative paths or environment variables to define DATA_DIR and OUTPUT_DIR.

Suggested change
- DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
- OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+ DATA_DIR = './data'
+ OUTPUT_DIR = './outputs'
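
If the directories still need to be configurable per environment, an environment-variable override with a relative default is another option. This is a sketch only; the ICL_DATA_DIR and ICL_OUTPUT_DIR variable names are made up for illustration.

import os

# Fall back to relative paths when the environment variables are not set.
DATA_DIR = os.environ.get('ICL_DATA_DIR', './data')
OUTPUT_DIR = os.environ.get('ICL_OUTPUT_DIR', './outputs')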

Comment on lines +34 to +37
                        default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
                        help='Prefix path to save the evaluation logs.')
    parser.add_argument('--tokenizer_path', type=str,
                        default='/root/flagos/Qwen3-4B')


Severity: medium

The default values for --log_path_prefix and --tokenizer_path contain absolute paths specific to the current environment. These should be changed to relative paths or generic identifiers to allow the script to run on other systems without manual modification.

Suggested change
-     parser.add_argument('--log_path_prefix', type=str,
-                         default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
-                         help='Prefix path to save the evaluation logs.')
-     parser.add_argument('--tokenizer_path', type=str,
-                         default='/root/flagos/Qwen3-4B')
+     parser.add_argument('--log_path_prefix', type=str,
+                         default='./outputs/',
+                         help='Prefix path to save the evaluation logs.')
+     parser.add_argument('--tokenizer_path', type=str,
+                         default='Qwen/Qwen2-7B-Instruct')

Comment on lines +99 to +102
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')


Severity: medium

Opening the output file in append mode inside a loop for every sample in a batch is inefficient due to repeated I/O overhead. It is better to open the file once and write all results from the batch at once.

Suggested change
- for sid, (pred, _) in zip(sample_ids_batch, results):
-     test_record = {'test_sample_id': sid, 'prediction': pred}
-     with open(output_file, 'a') as f:
-         f.write(json.dumps(test_record)+'\n')
+ with open(output_file, 'a') as f:
+     for sid, (pred, _) in zip(sample_ids_batch, results):
+         test_record = {'test_sample_id': sid, 'prediction': pred}
+         f.write(json.dumps(test_record) + '\n')

Comment on lines +109 to +112
for sid, (pred, _) in zip(sample_ids_batch, results):
    test_record = {'test_sample_id': sid, 'prediction': pred}
    with open(output_file, 'a') as f:
        f.write(json.dumps(test_record)+'\n')


Severity: medium

Similar to the batch processing loop, opening the file for each remaining sample is inefficient. Consider opening the file once to write the remaining results.

Suggested change
- for sid, (pred, _) in zip(sample_ids_batch, results):
-     test_record = {'test_sample_id': sid, 'prediction': pred}
-     with open(output_file, 'a') as f:
-         f.write(json.dumps(test_record)+'\n')
+ with open(output_file, 'a') as f:
+     for sid, (pred, _) in zip(sample_ids_batch, results):
+         test_record = {'test_sample_id': sid, 'prediction': pred}
+         f.write(json.dumps(test_record) + '\n')

Comment on lines +441 to +540
"""
Build a high-precision English prompt for long-context data annotation (optimized for Qwen3-4B).
Core requirement: Final answer MUST be wrapped in <label> tags (no extra content outside tags).
"""
prompt = (
"### Role Definition\n"
"You are a professional data annotation expert specializing in long-context text labeling. "
"Your work must strictly comply with the following rules, with the highest priority given to output format accuracy.\n\n"

"### Core Annotation Task\n"
f"{task_description}\n\n"

"### Non-Negotiable Annotation Rules (Highest Priority)\n"
"1. **Final Output Mandate**: Your annotation result MUST be wrapped in <label> tags — NO text, symbols, spaces, or explanations are allowed outside the tags.\n"
"2. **Internal Reasoning Permission**: You may perform logical reasoning, text analysis, or context comprehension internally (in your thought process), but NONE of these thoughts may appear in the final output.\n"
"3. **Label Format Strictness**: <label> is the opening tag and </label> is the closing tag — they must appear in pairs, with NO extra spaces or characters inside the tags (e.g., <label> Good Review </label> is invalid).\n"
"4. **Prohibited Outputs**: \n"
" - ❌ Prohibited: 'After analysis, this is a positive review: <label>Good Review</label>' (extra text outside tags)\n"
" - ❌ Prohibited: 'Bad Review' (missing <label> tags entirely)\n"
" - ❌ Prohibited: '<label>Bad Review' (unpaired/closing tag missing)\n\n"

"### Correct vs. Incorrect Examples\n"
"✅ Correct Example 1: <label>answer</label>\n"
"✅ Correct Example 2: <label>Bad Review</label>\n"
"❌ Incorrect Example 1: I think this review is negative → <label>Bad Review</label>\n"
"❌ Incorrect Example 2: <label> Neutral Review </label> (extra spaces inside tags)\n"
"❌ Incorrect Example 3: Neutral Review (no label tags)\n\n"

"### Reference Annotation Examples\n"
"{EXAMPLES}\n\n"

"### Text to Annotate\n"
f"{text2annotate}\n\n"

"### Final Output Command (Re-emphasized)\n"
"You may complete any internal reasoning process, but your FINAL OUTPUT MUST consist solely of the annotation result wrapped in <label> tags (no other content whatsoever).\n"
"Annotation Result: "
)
return prompt

def build_prompt_backup(task_description:str, text2annotate:str)->str:
"""
Construct the prompt for annotation based on the task description.
task_description:
The description of the annotation task.
For example, ``Given an English language product review,
determine if it is a Good Review or a Bad Review.``
text2annotate:
The text that needs to be annotated.
For example, ``My son received this book as a gift. I was extremely disappointed.``
"""
prompt = (
"You are a data annotation assistant. "
"Your task is to label the given texts according to the task description "
"and annotation guidelines provided below.\n\n"
f"[Task Description]\n {task_description}\n\n"
"[Examples]\n {EXAMPLES}\n\n"
"Please follow these instructions when labeling:\n"
"1. **Output Format**: Annotate the text directly by wrapping each labeled "
"span with <label> tags in the following format: <label> annotation result </label>.\n"
# "2. Do not add any extra text, explanations, or commentary in the labeled spans.\n\n"
f"[Task Description (repeat)] \n {task_description}\n\n"
f"[Input Texts]\n {text2annotate}\n\n"
"Please output the annotation results: "
)
return prompt

def select_examples_backup(all_examples:list[dict], task_description:str, text2annotate:str)->str:
"""
Select examples from all_examples to fit into the target context length.
all_examples:
A list of examples, where each example is a dict with keys 'input', 'output', and 'length'.
For example, ``{"input": "The material is good and looks great.", "output": "Good Review", "length": 79``},
task_description:
The description of the annotation task which may be used for example evaluation.
For example, ``Given an English language product review,
determine if it is a Good Review or a Bad Review.``
text2annotate:
The text that needs to be annotated which may be used for example retrieval.
For example, ``My son received this book as a gift. I was extremely disappointed.``

"""
# Notice that the maximum context length is restricted.
target_length = 10_000

input_list = [example['input'] for example in all_examples]
output_list = [example['output'][0] for example in all_examples]
length_list = [example['length'] for example in all_examples]

# <label> have 2 tokens; </label> have 3 tokens; \n have 1 token; # have 1 token.
examples_str, token_num = "", 0
for i, (input_text, output_text, length) in enumerate(zip(input_list, output_list, length_list)):
if length + token_num <= target_length:
token_num += (length + 2 + 3 + 1 + 1)
example_str = f"# {input_text} <label> {output_text} </label>\n"
examples_str += example_str
else:
return examples_str, i
return examples_str



Severity: medium

This section contains unreachable code (lines 441-478) following a return statement, as well as redundant backup functions (build_prompt_backup, select_examples_backup). These should be removed to improve code quality and maintainability.

    return prediction, whole_result


def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:


Severity: medium

The return type hint list[str] is incorrect. The function actually returns a list of tuples (prediction, whole_result) as returned by annotate_ascend.

Suggested change
- def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+ def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
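
If the element types are stable, the hint could be narrowed further. This assumes annotate_ascend returns a (prediction, whole_result) pair of strings; adjust to the actual types if they differ.

def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple[str, str]]: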
