Dddug patch 2 #192

Open
Dddug wants to merge 2 commits into FlagAI-Open:main from Dddug:Dddug-patch-2

Dddug commented Apr 16, 2026

No description provided.

Dddug added 2 commits April 17, 2026 00:49
Signed-off-by: Dddug <xxx@email.com>

gemini-code-assist bot left a comment

Code Review

This pull request significantly expands the data annotation framework by introducing advanced example selection strategies (similarity, diversity, and quality-based), task-specific guidance, and robust annotation methods like batch processing and self-consistency. It also integrates social media sentiment analysis and Chain-of-Thought (CoT) reasoning. Feedback focuses on improving portability by avoiding hardcoded absolute paths, correcting mismatched type hints in count_answer and annotate_batch, and removing redundant code such as unnecessary try-except blocks, duplicate imports, and unused variables.

Comment on lines +14 to +15
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'

medium

Using hardcoded absolute paths makes the code non-portable and dependent on a specific environment structure. It is better to use relative paths or environment variables to define data and output directories.

DATA_DIR = './data'
OUTPUT_DIR = './outputs'
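A minimal sketch of the environment-variable option mentioned above, offered as one possible shape rather than the project's actual configuration. The variable names `ICL_DATA_DIR` and `ICL_OUTPUT_DIR` and the helper `resolve_dirs` are hypothetical; they do not appear in the PR.

```python
import os
from pathlib import Path


def resolve_dirs(env=None, base=None):
    """Return (data_dir, output_dir) as Path objects.

    Environment variables take precedence; otherwise the directories
    default to `data` and `outputs` under `base` (the current working
    directory if `base` is not given).
    """
    env = os.environ if env is None else env
    base = Path.cwd() if base is None else Path(base)
    # ICL_DATA_DIR / ICL_OUTPUT_DIR are hypothetical names chosen for this sketch.
    data_dir = Path(env.get("ICL_DATA_DIR", base / "data"))
    output_dir = Path(env.get("ICL_OUTPUT_DIR", base / "outputs"))
    return data_dir, output_dir


DATA_DIR, OUTPUT_DIR = resolve_dirs()
```

With this shape, the original machine can still point at its existing layout by exporting the two variables, while other checkouts fall back to paths relative to where the script is run.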

Comment on lines +35 to +40
try:

    tfidf_matrix = vectorizer.fit_transform(texts)
except:

    tfidf_matrix = vectorizer.transform(texts)

medium

The try-except block here is redundant and potentially misleading. fit_transform will always fit the vectorizer to the provided text if the input is valid, making the except block unreachable for its intended purpose (falling back to transform if already fitted). Since this function is intended to compute similarity for a specific set of texts, fit_transform is sufficient.

    tfidf_matrix = vectorizer.fit_transform(texts)
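For illustration, a small self-contained sketch of the simplified form, assuming scikit-learn's `TfidfVectorizer` and `cosine_similarity` are what the surrounding function uses (the PR excerpt shows only the `vectorizer` variable, so this is an assumption). The function name `pairwise_tfidf_similarity` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_tfidf_similarity(texts):
    """Fit a fresh TF-IDF vectorizer on `texts` and return the
    cosine-similarity matrix between all pairs of texts."""
    vectorizer = TfidfVectorizer()
    # A single fit_transform call both learns the vocabulary and
    # produces the document-term matrix; no fallback is needed.
    tfidf_matrix = vectorizer.fit_transform(texts)
    return cosine_similarity(tfidf_matrix)


sims = pairwise_tfidf_similarity(
    ["the cat sat", "the cat ran", "stock prices fell"]
)
```

Because the vectorizer is created and fitted inside the function, there is no earlier fitted state for a `transform` fallback to rely on.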

score *= 0.7


digit_ratio = sum(1 for c in input_text if c.isdigit()) / max(input_len, 1)

medium

The variable digit_ratio is calculated but never used to adjust the quality score. If this was intended to be part of the quality heuristic, it should be applied to the score variable; otherwise, it should be removed.
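If the ratio was meant to feed the heuristic, one way to apply it might look like the sketch below. The function name `quality_score`, the length threshold, the `max_digit_ratio` cutoff, and the 0.8 penalty are all hypothetical values chosen for illustration; only the `digit_ratio` expression and the `score *= 0.7` step come from the diff.

```python
def quality_score(input_text: str, max_digit_ratio: float = 0.5) -> float:
    """Toy quality heuristic in which digit_ratio actually affects the score."""
    score = 1.0
    input_len = len(input_text)

    # Hypothetical length check standing in for the condition that
    # precedes `score *= 0.7` in the PR (not shown in the excerpt).
    if input_len < 10:
        score *= 0.7

    digit_ratio = sum(1 for c in input_text if c.isdigit()) / max(input_len, 1)
    # Penalize inputs that are mostly digits (assumed intent).
    if digit_ratio > max_digit_ratio:
        score *= 0.8

    return score
```

If no such penalty is wanted, deleting the `digit_ratio` line is the simpler fix.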




def count_answer(text: str) -> tuple[list, dict]:

medium

The type hint tuple[list, dict] for count_answer does not match the implementation, which returns a single string (or None). Additionally, the docstring's return description is inconsistent with the actual return value.

Suggested change
- def count_answer(text: str) -> tuple[list, dict]:
+ def count_answer(text: str) -> str | None:

return prediction, whole_result


def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:

medium

The return type hint list[str] is incorrect. The function returns a list of tuples containing the prediction and the whole result from annotate_ascend.

Suggested change
- def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+ def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple[str | None, str | None]]:




import re

medium

The import re statement inside extract_social_media_features is redundant as re is already imported at the top of the file (line 1).
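As a sketch of the cleaned-up layout, the function can rely on the module-level import and, optionally, on patterns compiled once at import time. The specific features extracted below (hashtag and mention counts) and the pattern names are hypothetical; the excerpt does not show what `extract_social_media_features` actually computes.

```python
import re  # single import at the top of the module

# Hypothetical patterns, compiled once and reused across calls.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_social_media_features(text: str) -> dict:
    """Count simple social-media markers in `text` (illustrative features only)."""
    return {
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_mentions": len(MENTION_RE.findall(text)),
    }
```

A repeated `import re` inside the function is harmless at runtime because Python caches modules, so this is a readability fix rather than a behavior change.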
