Code Review
This pull request significantly enhances the data annotation pipeline by implementing batch processing, self-consistency checks, and advanced example selection strategies using TF-IDF similarity, diversity clustering, and quality filtering. It also introduces task-specific guidance and Chain-of-Thought reasoning. The review feedback identifies several opportunities to improve code portability and efficiency, such as replacing hardcoded absolute paths with relative ones, passing the pre-loaded tokenizer to functions to avoid redundant loading, correcting function type hints, and making model names configurable.
```python
examples_str = select_examples(icl_examples, task_description, text2annotate,
                               is_code_generation=is_code_generation,
                               use_task_aware=True, task_id=task_id,
                               use_quality_filter=True, quality_threshold=0.5,
                               use_diversity=False, use_similarity=False,
                               use_cot=use_cot)
```

The qwen_tokenizer instance already loaded in main.py should be passed to select_examples. This avoids the overhead of re-loading the tokenizer from disk and ensures that the tokenizer path provided via command-line arguments is actually used.

Suggested change:

```diff
 examples_str = select_examples(icl_examples, task_description, text2annotate,
+                               tokenizer=qwen_tokenizer,
                                is_code_generation=is_code_generation,
                                use_task_aware=True, task_id=task_id,
                                use_quality_filter=True, quality_threshold=0.5,
                                use_diversity=False, use_similarity=False,
                                use_cot=use_cot)
```
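The load-once-and-pass-down pattern this comment asks for can be sketched as follows. The StubTokenizer class is a stand-in for the real `AutoTokenizer.from_pretrained(args.tokenizer_path, trust_remote_code=True)` call (which needs the model files on disk); all names here are illustrative, not from the PR:

```python
class StubTokenizer:
    """Stand-in for a HuggingFace tokenizer; the real script would call
    AutoTokenizer.from_pretrained(...) once in main() and pass it down."""
    load_count = 0  # counts how many times a tokenizer was "loaded from disk"

    def __init__(self):
        StubTokenizer.load_count += 1

    def encode(self, text: str) -> list[str]:
        return text.split()  # crude whitespace tokenization, enough for the sketch

def select_examples(examples: list[str], tokenizer) -> list[str]:
    # The tokenizer arrives as an argument; no re-loading inside the function.
    return [e for e in examples if len(tokenizer.encode(e)) <= 4]

def main() -> int:
    tokenizer = StubTokenizer()  # loaded exactly once
    select_examples(["a b c", "a b c d e"], tokenizer)
    select_examples(["x y"], tokenizer)
    return StubTokenizer.load_count  # 1, no matter how many calls were made
```

Every call site shares the single instance, so the disk load happens once regardless of how many batches are annotated.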
```python
return examples_str, i
return examples_str
```
Update the function signature to accept the tokenizer as an argument. This allows the function to use the tokenizer already loaded in the main script, improving efficiency and removing the dependency on a hardcoded path.

Suggested change:

```diff
-def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
+def select_examples(all_examples: list[dict], task_description: str, text2annotate: str, tokenizer: AutoTokenizer, is_code_generation: bool = False, use_diversity: bool = True, use_similarity: bool = False, use_task_aware: bool = False, task_id: int = None, use_quality_filter: bool = False, quality_threshold: float = 0.5, use_cot: bool = False) -> str:
```
Remove this line as the tokenizer should now be passed as an argument to the function. Hardcoding the path here makes the function ignore the user-specified tokenizer path.

Suggested change:

```diff
-tokenizer = AutoTokenizer.from_pretrained("/root/flagos/Qwen3-4B", trust_remote_code=True)
+# Tokenizer is now passed as an argument
```
Hardcoding absolute paths like /root/flagos/... makes the code non-portable and environment-dependent. It is better to use paths relative to the script location or environment variables to ensure the project can be run on different machines.

Suggested change:

```diff
-DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
-OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
+DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'data')
+OUTPUT_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'outputs')
```
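A variant of the suggested fix that also allows an environment-variable override can be sketched like this. The `resolve_dirs` helper and the `ICL_DATA_DIR` / `ICL_OUTPUT_DIR` variable names are hypothetical, not part of the PR:

```python
import os

def resolve_dirs(script_path: str, env=None) -> tuple[str, str]:
    """Compute data/output dirs relative to the calling script, with optional
    environment overrides. Helper and variable names are illustrative."""
    env = os.environ if env is None else env
    root = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    data_dir = env.get("ICL_DATA_DIR", os.path.join(root, "data"))
    output_dir = env.get("ICL_OUTPUT_DIR", os.path.join(root, "outputs"))
    return data_dir, output_dir

# In the script itself:
# DATA_DIR, OUTPUT_DIR = resolve_dirs(__file__)
```

Taking the environment mapping as a parameter keeps the helper testable without touching the real process environment.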
```python
    default='/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs/',
    help='Prefix path to save the evaluation logs.')
parser.add_argument('--tokenizer_path', type=str,
    default='/root/flagos/Qwen3-4B')
```
```python
try:
    tfidf_matrix = vectorizer.fit_transform(texts)
except:
    tfidf_matrix = vectorizer.transform(texts)
```

The try-except block here is fragile and potentially hides errors. Since vectorizer is a global instance that is re-fitted on a combined list of texts in every call to ensure a common feature space, you should simply use fit_transform. If it fails, it indicates an issue with the input data that should be addressed rather than ignored.

Suggested change:

```diff
-try:
-    tfidf_matrix = vectorizer.fit_transform(texts)
-except:
-    tfidf_matrix = vectorizer.transform(texts)
+tfidf_matrix = vectorizer.fit_transform(texts)
```
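The reason a single fit_transform suffices is that the caller fits on examples and query together, so every vector lands in one shared feature space. A dependency-free miniature of that fit-once-on-the-combined-corpus idea (a simplified sketch, not sklearn's actual implementation, which additionally smooths the idf and L2-normalizes):

```python
import math
from collections import Counter

def tfidf_fit_transform(texts: list[str]) -> tuple[list[str], list[list[float]]]:
    """Fit TF-IDF on the combined corpus so all vectors share one vocabulary.
    Uses plain idf = log(n / df); sklearn also smooths and normalizes."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(1 for d in docs if w in d) for w in vocab}  # document frequency
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because the vocabulary is built from all texts at once, the query vector and every example vector have identical dimensions, and cosine similarity between them is always well-defined.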
```python
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:9010/v1/"
model = "Qwen3-4B-ascend-flagos"
```
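The review summary also calls for making the model name configurable. One hedged sketch using argparse, where the `--api-base` and `--model-name` flag names are illustrative rather than from the PR, and the defaults are the current hardcoded constants:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Expose the hardcoded endpoint and model name as CLI flags.
    Flag names are hypothetical; defaults mirror the existing constants."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-base", type=str, default="http://localhost:9010/v1/")
    parser.add_argument("--model-name", type=str, default="Qwen3-4B-ascend-flagos")
    return parser
```

Existing behavior is preserved when no flags are passed, while a different serving endpoint or model can be selected without editing the source.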
```python
return prediction, whole_result
```

The return type hint for annotate_batch is incorrect. The implementation returns a list of tuples (prediction, whole_result) from annotate_ascend, not a list of strings.

Suggested change:

```diff
-def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple]:
```
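To see why list[tuple] is the right hint, here is a minimal sketch of the batch pattern with a stand-in for annotate_ascend. Note that `executor.map` preserves input order, so each output tuple lines up with its prompt:

```python
from concurrent.futures import ThreadPoolExecutor

def annotate_one(prompt: str) -> tuple[str, dict]:
    """Stand-in for annotate_ascend: returns (prediction, whole_result)."""
    return prompt.upper(), {"prompt": prompt, "tokens": len(prompt.split())}

def annotate_batch(prompts: list[str], num_workers: int = 4) -> list[tuple]:
    # executor.map keeps results in input order, unlike as_completed.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(annotate_one, prompts))
```

Each element of the returned list is the (prediction, whole_result) pair for the corresponding prompt, which is exactly what a `list[tuple]` (or more precisely `list[tuple[str, dict]]`) hint describes.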