
add 2 #191

Open

bitfact1 wants to merge 1 commit into FlagAI-Open:main from bitfact1:bitfact1-patch-1

Conversation

@bitfact1

No description provided.

Signed-off-by: bitfact1 <shanzhi_gentle@163.com>

@gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the data annotation pipeline by implementing batch processing, self-consistency checks, and advanced example selection strategies using TF-IDF similarity, diversity clustering, and quality filtering. It also introduces task-specific guidance and Chain-of-Thought reasoning. The review feedback identifies several opportunities to improve code portability and efficiency, such as replacing hardcoded absolute paths with relative ones, passing the pre-loaded tokenizer to functions to avoid redundant loading, correcting function type hints, and making model names configurable.

Comment on lines +92 to +97

```python
examples_str = select_examples(icl_examples, task_description, text2annotate,
                               is_code_generation=is_code_generation,
                               use_task_aware=True, task_id=task_id,
                               use_quality_filter=True, quality_threshold=0.5,
                               use_diversity=False, use_similarity=False,
                               use_cot=use_cot)
```


Severity: high

The qwen_tokenizer instance already loaded in main.py should be passed to select_examples. This avoids the overhead of re-loading the tokenizer from disk and ensures that the tokenizer path provided via command-line arguments is actually used.

Suggested change:

```diff
 examples_str = select_examples(icl_examples, task_description, text2annotate,
+                               tokenizer=qwen_tokenizer,
                                is_code_generation=is_code_generation,
                                use_task_aware=True, task_id=task_id,
                                use_quality_filter=True, quality_threshold=0.5,
                                use_diversity=False, use_similarity=False,
                                use_cot=use_cot)
```

```python
    return examples_str, i
return examples_str

def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
```


Severity: high

Update the function signature to accept the tokenizer as an argument. This allows the function to use the tokenizer already loaded in the main script, improving efficiency and removing the dependency on a hardcoded path.

Suggested change:

```diff
-def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
+def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, tokenizer: AutoTokenizer, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
```




```python
tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
```


Severity: high

Remove this line as the tokenizer should now be passed as an argument to the function. Hardcoding the path here makes the function ignore the user-specified tokenizer path.

Suggested change:

```diff
-tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
+# Tokenizer is now passed as an argument
```

Comment on lines +14 to +15

```python
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
```


Severity: medium

Hardcoding absolute paths like /root/flagos/... makes the code non-portable and environment-dependent. It is better to use paths relative to the script location or environment variables to ensure the project can be run on different machines.

Suggested change:

```diff
-DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
-OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'data')
+OUTPUT_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'outputs')
```

Comment on lines +35 to +38

```python
default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
                    default='/root/flagos/Qwen3-4B')
```


Severity: medium

The default values for --log_path_prefix and --tokenizer_path contain hardcoded absolute paths. These should be made relative or configurable via environment variables to maintain flexibility across different deployment environments.
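One common pattern for this is to read the defaults from environment variables and fall back to a relative path. The sketch below uses hypothetical variable names (`ICL_LOG_PREFIX`, `ICL_TOKENIZER_PATH`) and a relative fallback tokenizer path; none of these names are part of the PR.

```python
import argparse
import os

# Defaults come from the environment when set, otherwise fall back to
# portable relative values; the env-var names here are illustrative only.
parser = argparse.ArgumentParser()
parser.add_argument('--log_path_prefix', type=str,
                    default=os.environ.get('ICL_LOG_PREFIX', './outputs/'),
                    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
                    default=os.environ.get('ICL_TOKENIZER_PATH', './Qwen3-4B'))

args = parser.parse_args([])  # empty argv for demonstration
print(args.log_path_prefix)
```

Explicit command-line values still win over the environment, so existing invocations keep working unchanged.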

Comment on lines +35 to +40

```python
try:
    tfidf_matrix = vectorizer.fit_transform(texts)
except:
    tfidf_matrix = vectorizer.transform(texts)
```


Severity: medium

The try-except block here is fragile and potentially hides errors. Since vectorizer is a global instance that is re-fitted on a combined list of texts in every call to ensure a common feature space, you should simply use fit_transform. If it fails, it indicates an issue with the input data that should be addressed rather than ignored.

Suggested change:

```python
tfidf_matrix = vectorizer.fit_transform(texts)
```
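The underlying point — that fitting once over the combined texts puts the query and every candidate example into a single shared feature space — can be illustrated with a minimal pure-Python TF-IDF. This is a sketch for intuition, not the project's scikit-learn-based code.

```python
import math
from collections import Counter

def tfidf_rows(texts):
    """Fit TF-IDF over all texts at once so every row shares one vocabulary."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Smoothed IDF, similar in spirit to sklearn's default formula.
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
           for w in vocab}
    return [[Counter(d)[w] * idf[w] for w in vocab] for d in docs], vocab

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

rows, vocab = tfidf_rows([
    "annotate the input text",      # query
    "annotate the input text now",  # near-duplicate example
    "sort a list of numbers",       # unrelated example
])
# The near-duplicate scores higher against the query than the unrelated text.
print(cosine(rows[0], rows[1]) > cosine(rows[0], rows[2]))  # True
```

Because the vocabulary is built from all texts together, the rows are directly comparable — which is exactly why re-fitting on each call makes the `try`/`except` fallback unnecessary.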


```python
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:9010/v1/"
model = "Qwen3-4B-ascend-flagos"
```


Severity: medium

The model name is hardcoded. It is better to define this as a constant at the top of the file or pass it as a parameter to allow for easier model updates and better maintainability.
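A minimal sketch of that refactor: the model name is hoisted to a module-level constant (overridable via an illustrative env var that is not part of the PR) and threaded through as a parameter. The request itself is stubbed so the sketch stays self-contained; the real code would call the OpenAI client.

```python
import os

# Single place to change the model name; the env-var name is illustrative.
DEFAULT_MODEL = os.environ.get("ANNOTATION_MODEL", "Qwen3-4B-ascend-flagos")

def annotate(prompt: str, model: str = DEFAULT_MODEL) -> str:
    # Real code would call openai.chat.completions.create(model=model, ...);
    # stubbed here to keep the example runnable without a server.
    return f"[{model}] {prompt}"

print(annotate("hello"))
```

Callers that need a different model pass it explicitly (`annotate(p, model="other")`); everything else picks up the constant.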

```python
    return prediction, whole_result


def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
```


Severity: medium

The return type hint for annotate_batch is incorrect. The implementation returns a list of tuples (prediction, whole_result) from annotate_ascend, not a list of strings.

Suggested change:

```diff
-def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
```
