
Identity pairs are included in the training #620

@SzymonKozl


Currently, training always includes identity pairs (a sentence paired with itself), because shuffle_combinations(iterable: Iterable, replacement: bool = True) sets the replacement parameter to True by default. This is inconsistent with the SetFit paper:

Assuming that a small number (K) of labeled examples are given for a binary classification task, the potential size of the ST fine-tuning set T is derived from the number of unique sentence pairs that can be generated, namely K(K − 1)/2, which is significantly larger than just K.

Alignment with the paper aside, I don't think it makes sense to include these pairs in the dataset, since a sentence is trivially similar to itself and such pairs are already "perfectly fitted".
Is this behaviour intentional? If so, are there any plans to add an option for switching off the inclusion of identity pairs?
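
To make the difference concrete, here is a minimal sketch of the pair counts. It uses itertools rather than setfit's own code, and count_pairs is a hypothetical helper introduced only for illustration. It assumes that replacement=True means an item may be paired with itself, which is how I read the parameter.

```python
from itertools import combinations, combinations_with_replacement


def count_pairs(k: int, replacement: bool) -> int:
    """Count unordered pairs drawn from k items.

    Hypothetical helper for illustration only; not part of setfit.
    With replacement=True, identity pairs (i, i) are included.
    """
    items = range(k)
    if replacement:
        pairs = combinations_with_replacement(items, 2)
    else:
        pairs = combinations(items, 2)
    return sum(1 for _ in pairs)


k = 8
with_identity = count_pairs(k, replacement=True)      # K(K + 1)/2
without_identity = count_pairs(k, replacement=False)  # K(K - 1)/2, as in the paper

# The difference is exactly the K identity pairs.
print(with_identity, without_identity, with_identity - without_identity)
```

Under that assumption, the default yields K(K + 1)/2 pairs instead of the paper's K(K − 1)/2, i.e. K extra identity pairs.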
