Identity pairs are included in the training

Currently, training always includes identity pairs (a sentence paired with itself), because `shuffle_combinations(iterable: Iterable, replacement: bool = True)` defaults the `replacement` parameter to `True`. This is inconsistent with the SetFit paper:

> Assuming that a small number (K) of labeled examples are given for a binary classification task, the potential size of the ST fine-tuning set T is derived from the number of unique sentence pairs that can be generated, namely K(K − 1)/2, which is significantly larger than just K.

Alignment with the paper aside, I don't think it makes sense to include these pairs in the dataset, since a sentence paired with itself is already "perfectly fitted".
Is this behaviour intentional? If so, are there any plans to add an option for switching off the inclusion of identity pairs?
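
To make the difference concrete, here is a minimal sketch using only the standard library. It is not SetFit's actual implementation: `generate_pairs` is a hypothetical helper, and I am assuming its `replacement` flag mirrors the meaning of the parameter on `shuffle_combinations`.

```python
from itertools import combinations, combinations_with_replacement


def generate_pairs(sentences, replacement):
    """Hypothetical helper (not part of SetFit): build unordered sentence pairs.

    replacement=True  -> identity pairs (s, s) are included
    replacement=False -> only pairs of two distinct examples
    """
    make_pairs = combinations_with_replacement if replacement else combinations
    return list(make_pairs(sentences, 2))


sentences = ["a", "b", "c", "d"]  # K = 4 labeled examples

with_identity = generate_pairs(sentences, replacement=True)
without_identity = generate_pairs(sentences, replacement=False)

# Identity pairs are the ones where both elements are the same example.
identity_pairs = [pair for pair in with_identity if pair[0] == pair[1]]

print(len(without_identity))  # K(K - 1)/2 = 6, the count given in the paper
print(len(with_identity))     # K(K + 1)/2 = 10, i.e. K extra identity pairs
print(identity_pairs)
```

With `replacement=False` the number of pairs matches the paper's K(K − 1)/2; with `replacement=True` there are K additional pairs, one per example.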

Identity pairs are included in the training #620
