Conversation
Signed-off-by: Dddug <xxx@email.com>
Code Review
This pull request significantly expands the data annotation framework by introducing advanced example selection strategies (similarity, diversity, and quality-based), task-specific guidance, and robust annotation methods like batch processing and self-consistency. It also integrates social media sentiment analysis and Chain-of-Thought (CoT) reasoning. Feedback focuses on improving portability by avoiding hardcoded absolute paths, correcting mismatched type hints in count_answer and annotate_batch, and removing redundant code such as unnecessary try-except blocks, duplicate imports, and unused variables.
DATA_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data'
OUTPUT_DIR = '/root/flagos/OpenSeek/openseek/competition/LongContext-ICL-Annotation/outputs'
try:
    tfidf_matrix = vectorizer.fit_transform(texts)
except:
    tfidf_matrix = vectorizer.transform(texts)
The try-except block here is redundant and potentially misleading. fit_transform always fits the vectorizer to the provided texts when the input is valid, so the except block is unreachable for its intended purpose (falling back to transform on an already-fitted vectorizer). Since this function computes similarity for a specific set of texts, fit_transform alone is sufficient; the bare except also silently swallows unrelated errors.
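For context, a minimal sketch of the recommended pattern (assuming scikit-learn's TfidfVectorizer and cosine_similarity; the function name is illustrative, not the PR's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_similarity(texts: list[str]):
    # A fresh vectorizer is fitted to exactly these texts, so
    # fit_transform is sufficient and no transform fallback is needed.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Cosine similarity between every pair of documents.
    return cosine_similarity(tfidf_matrix)
```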
tfidf_matrix = vectorizer.fit_transform(texts)

score *= 0.7

digit_ratio = sum(1 for c in input_text if c.isdigit()) / max(input_len, 1)
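For reference, a hedged sketch of how a digit-ratio penalty like the one above could fit into a quality score (the baseline, threshold, and function name are illustrative assumptions, not the PR's actual values):

```python
def quality_score(input_text: str) -> float:
    # Start from a neutral score; illustrative baseline only.
    score = 1.0
    input_len = len(input_text)
    # Fraction of characters that are digits; max() guards against
    # division by zero on empty input.
    digit_ratio = sum(1 for c in input_text if c.isdigit()) / max(input_len, 1)
    # Penalize heavily numeric text, which is often low-quality prose.
    if digit_ratio > 0.3:
        score *= 0.7
    return score
```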
def count_answer(text: str) -> tuple[list, dict]:
The type hint tuple[list, dict] for count_answer does not match the implementation, which returns a single string (or None). Additionally, the docstring's return description is inconsistent with the actual return value.
- def count_answer(text: str) -> tuple[list, dict]:
+ def count_answer(text: str) -> str | None:
return prediction, whole_result
def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
The return type hint list[str] is incorrect. The function returns a list of tuples containing the prediction and the whole result from annotate_ascend.
- def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[str]:
+ def annotate_batch(prompts: list[str], num_workers: int = 4, max_tokens: int = 128, use_count_answer: bool = True, task_id: int = None) -> list[tuple[str | None, str | None]]:
import re
No description provided.