
submit#194

Open
peyiran wants to merge 1 commit into FlagAI-Open:main from peyiran:peyiran-patch-1

Conversation

@peyiran commented Apr 19, 2026

Signed-off-by: peyiran <peyiran@proton.me>

@gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the data annotation framework by introducing advanced example selection strategies (similarity, diversity, and quality-based), task-specific guidance, and specialized workflows like self-consistency and multi-turn dialogs. It also implements batch processing for improved efficiency. The review feedback highlights several improvement opportunities, including replacing hardcoded absolute paths with relative ones to ensure portability, optimizing tokenizer initialization, correcting function type hints, and increasing the flexibility of the batch annotation utility.

Comment on lines +15 to +16
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'

medium

The paths DATA_DIR and OUTPUT_DIR are hardcoded to absolute paths on a specific machine (/root/flagos/...). This makes the script non-portable and likely to fail in other environments. Consider using relative paths or environment variables to define these directories.
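The reviewer's suggestion could look like the following sketch; the `ICL_DATA_DIR`/`ICL_OUTPUT_DIR` environment variable names are hypothetical, not part of the PR:

```python
import os

# Hypothetical sketch: default to paths relative to the working directory,
# overridable via environment variables for deployments that need
# absolute locations.
DATA_DIR = os.environ.get('ICL_DATA_DIR', './data')
OUTPUT_DIR = os.environ.get('ICL_OUTPUT_DIR', './outputs')
```

This keeps the script runnable on any machine while still letting a specific deployment pin its own directories.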

parser.add_argument('--max_input_length', type=int, default=10_000,
help='Maximum input length for the model.')
parser.add_argument('--log_path_prefix', type=str,
default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',

medium

The default value for --log_path_prefix is a hardcoded absolute path. It is better to use a relative path like './outputs/' to ensure the script works across different environments without modification.

Suggested change
default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
default='./outputs/',

default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
default='/root/flagos/Qwen3-4B')

medium

The default value for --tokenizer_path is a hardcoded absolute path. Consider using a more generic default or requiring the user to provide this path via the command line.
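One way to apply this suggestion is to make the argument required; a sketch, where the Hub-style id passed in the example invocation is an assumption rather than something from the PR:

```python
import argparse

parser = argparse.ArgumentParser()
# Require the path explicitly instead of baking in a machine-specific
# default; a local directory or a Hugging Face Hub model id both work.
parser.add_argument('--tokenizer_path', type=str, required=True,
                    help='Local path or Hub id of the tokenizer.')

# Example invocation (model id is illustrative):
args = parser.parse_args(['--tokenizer_path', 'Qwen/Qwen3-4B'])
```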

Comment on lines +44 to +50






lines = text.strip().split('\n')

medium

There are several unnecessary empty lines at the beginning of the parse_category_and_clue function. Removing them would improve code readability and maintainability.

Suggested change
lines = text.strip().split('\n')
lines = text.strip().split('\n')




tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)

medium

The tokenizer is being re-initialized from a hardcoded absolute path every time select_examples is called. This is inefficient and makes the code dependent on a specific file system structure. Consider passing the tokenizer as an argument to the function or using a global cache to store the initialized tokenizer.
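A minimal sketch of the caching idea; the function and cache names are hypothetical, and the loader is injectable here only so the pattern can be demonstrated without model files:

```python
_TOKENIZER_CACHE = {}

def get_tokenizer(path, loader=None):
    """Return the tokenizer for `path`, initializing it at most once."""
    if path not in _TOKENIZER_CACHE:
        if loader is None:
            # Deferred heavy import; mirrors the call in the diff.
            from transformers import AutoTokenizer
            loader = lambda p: AutoTokenizer.from_pretrained(
                p, trust_remote_code=True)
        _TOKENIZER_CACHE[path] = loader(path)
    return _TOKENIZER_CACHE[path]
```

select_examples would then call `get_tokenizer(args.tokenizer_path)` instead of invoking `AutoTokenizer.from_pretrained` on every call.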




def count_answer(text: str) -> tuple[list, dict]:

medium

The type hint for count_answer is incorrect. It specifies tuple[list, dict], but the function implementation returns a string (str) or None.

Suggested change
def count_answer(text: str) -> tuple[list, dict]:
def count_answer(text: str) -> str | None:

return cleaned_text


def annotate_nvidia(input_prompt:str)->list[str]:

medium

The type hint for annotate_nvidia is incorrect. It specifies list[str], but it returns the result of count_answer, which is a string or None.

Suggested change
def annotate_nvidia(input_prompt:str)->list[str]:
def annotate_nvidia(input_prompt:str) -> str | None:

results = [None] * len(prompts)

with ThreadPoolExecutor(max_workers=num_workers) as executor:
future_to_idx = {executor.submit(annotate_ascend, p, max_tokens, use_count_answer, task_id): i for i, p in enumerate(prompts)}

medium

annotate_batch is hardcoded to use annotate_ascend. This prevents the use of other annotation methods (such as annotate_nvidia) when batch processing is required. Consider passing the annotation function as a parameter to make this utility more flexible.
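The refactor the reviewer describes might look like this sketch; the extra `max_tokens`/`use_count_answer`/`task_id` arguments from the diff are folded into `annotate_fn` for brevity:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional

def annotate_batch(prompts: list[str],
                   annotate_fn: Callable[[str], Optional[str]],
                   num_workers: int = 8) -> list:
    """Run annotate_fn over prompts concurrently, preserving input order."""
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map each future back to its prompt index so results stay ordered.
        future_to_idx = {executor.submit(annotate_fn, p): i
                         for i, p in enumerate(prompts)}
        for future in as_completed(future_to_idx):
            results[future_to_idx[future]] = future.result()
    return results
```

With this shape, `annotate_batch(prompts, annotate_nvidia)` and `annotate_batch(prompts, lambda p: annotate_ascend(p, max_tokens, use_count_answer, task_id))` both reuse the same batching code.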
