[Bug] SmolInstruct -- Unexpected data format when loading subset property_prediction-sider.jsonl #2440

@zqyz333

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

Any Environment.

Reproduces the problem - code/configuration sample

config.py

from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_nc_0shot_instruct import mini_nc_0shot_instruct_datasets
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_pp_acc_0_shot_instruct import mini_pp_acc_datasets_0shot_instruct
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_rmse_0shot_instruct import mini_pp_rmse_0shot_instruct_datasets
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_fts_0shot_instruct import mini_fts_0shot_instruct_datasets
    from opencompass.configs.datasets.SmolInstruct.smolinstruct_meteor_0shot_instruct import mini_meteor_0shot_instruct_datasets
    from opencompass.configs.models.internlm.internlm_7b import models

mini_smolinstruct_datasets_0shot_instruct = (
    mini_nc_0shot_instruct_datasets
    + mini_pp_rmse_0shot_instruct_datasets
    + mini_pp_acc_datasets_0shot_instruct
    + mini_meteor_0shot_instruct_datasets
    + mini_fts_0shot_instruct_datasets
)

from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner,
                max_num_workers=8,
                task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=10),
    runner=dict(type=LocalRunner,
                max_num_workers=256,
                task=dict(type=OpenICLEvalTask)),
)

work_dir = 'outputs/test/SmolInstruct'

Reproduces the problem - command or script

opencompass /path/to/config.py --reuse

Reproduces the problem - error message

I pointed the dataset path at my locally downloaded and decompressed copy, in opencompass/configs/datasets/SmolInstruct/smolinstruct_fts_0shot_instruct.py (line 63) and in the configs of the other mini subsets. When the evaluation started loading the data, this error was raised:

TypeError: Couldn't cast array of type struct<Hepatobiliary disorders: string, Metabolism and nutrition disorders: string, Eye disorders: string, Musculoskeletal and connective tissue disorders: string, Gastrointestinal disorders: string, Immune system disorders: string, Reproductive system and breast disorders: string, Neoplasms benign, malignant and unspecified (incl cysts and polyps): string, Endocrine disorders: string, Vascular disorders: string, Blood and lymphatic system disorders: string, Skin and subcutaneous tissue disorders: string, Congenital, familial and genetic disorders: string, Respiratory, thoracic and mediastinal disorders: string, Psychiatric disorders: string, Renal and urinary disorders: string, Pregnancy, puerperium and perinatal conditions: string, Ear and labyrinth disorders: string, Cardiac disorders: string, Nervous system disorders: string> to string

Other information

Unlike other datasets where output is a plain string, property_prediction-sider.jsonl in SmolInstruct stores output as a dictionary. Reading all files under the pp directory with a uniform parser therefore causes an error when processing this specific file.

Sample entry from property_prediction-sider.jsonl:

{
  "input": "NC1=CC=C(N=NC2=CC=CC=C2)C(N)=N1",
  "output": {
    "Hepatobiliary disorders": "Yes",
    "Metabolism and nutrition disorders": "No",
    "Eye disorders": "Yes",
    "Musculoskeletal and connective tissue disorders": "No",
    "Gastrointestinal disorders": "Yes",
    "Immune system disorders": "Yes",
    "Reproductive system and breast disorders": "No",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)": "No",
    "Endocrine disorders": "No",
    "Vascular disorders": "No",
    "Blood and lymphatic system disorders": "Yes",
    "Skin and subcutaneous tissue disorders": "Yes",
    "Congenital, familial and genetic disorders": "Yes",
    "Respiratory, thoracic and mediastinal disorders": "No",
    "Psychiatric disorders": "No",
    "Renal and urinary disorders": "Yes",
    "Pregnancy, puerperium and perinatal conditions": "No",
    "Ear and labyrinth disorders": "No",
    "Cardiac disorders": "No",
    "Nervous system disorders": "Yes"
  },
  "task": "property_prediction-sider",
  "split": "test"
}
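To check which files are affected, something like the following stdlib-only sketch could be used. It counts the Python type of the `output` field in every `.jsonl` file of a directory and reports the files that contain anything other than strings. The helper names `output_types` and `find_non_string_outputs` are hypothetical and not part of OpenCompass, and the directory layout is an assumption.

```python
import json
from collections import Counter
from pathlib import Path


def output_types(jsonl_path):
    """Count the Python type names of the `output` field in one JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[type(record.get("output")).__name__] += 1
    return counts


def find_non_string_outputs(directory):
    """Return {file name: type counts} for files with any non-string `output`."""
    report = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        counts = output_types(path)
        if set(counts) - {"str"}:
            report[path.name] = dict(counts)
    return report
```

Calling `find_non_string_outputs(...)` on the local directory that holds the property-prediction files should list `property_prediction-sider.jsonl` if it is the only file with a dictionary-valued `output`.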

To ensure reproducibility across the community, we propose one of the following:

  1. Provide a unified, revised version of the dataset; or
  2. Update the evaluation pipeline's data-loading configuration.
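As a rough illustration of option 2, the sketch below normalizes each record while reading a JSONL file so that `output` is always a string, serializing dictionary values as JSON text. This is only one possible approach, not the existing OpenCompass loader: `normalize_output` and `load_jsonl_normalized` are hypothetical names, and whether the SIDER evaluator would accept a JSON-encoded string as the reference is an assumption that would need checking.

```python
import json


def normalize_output(record):
    """Return a record whose `output` is always a string.

    String outputs are returned unchanged; any other value (such as the
    dictionary used by the SIDER subset) is serialized as JSON text.
    """
    output = record.get("output")
    if isinstance(output, str):
        return record
    normalized = dict(record)  # shallow copy, leave the caller's dict intact
    normalized["output"] = json.dumps(output, ensure_ascii=False)
    return normalized


def load_jsonl_normalized(path):
    """Read a JSONL file and normalize every record (hypothetical helper)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(normalize_output(json.loads(line)))
    return records
```

With every file yielding the same string-typed `output` column, a shared schema can be applied across the subset; the original dictionary can be recovered with `json.loads` where the per-label answers are needed.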
