Issue when resuming from LoRA checkpoint #3701

@conceptofmind

Description

Hi Tom,

I am opening an issue because I am running into problems when trying to resume training from a LoRA checkpoint.

I start by training a model with the following script:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"

from datasets import Dataset, load_dataset
from peft import LoraConfig, TaskType
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers


model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.max_seq_length = 512

peft_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    inference_mode=False,
    target_modules="all-linear",
    r=32,
    lora_alpha=32,
    lora_dropout=0.01,
)
model.add_adapter(peft_config)

dataset = load_dataset("sentence-transformers/gooaq", split="train")
dataset_dict = dataset.train_test_split(test_size=10_000, seed=12)
train_dataset: Dataset = dataset_dict["train"].select(range(1_000_000))
eval_dataset: Dataset = dataset_dict["test"]

loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
loss = losses.MatryoshkaLoss(model, loss, matryoshka_dims=[1024, 768, 512, 384, 256])

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mb-gooaq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=5,
    save_total_limit=2,
    logging_steps=1,
    logging_first_step=True,
    run_name="mb-gooaq"
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
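
Training itself runs fine, and after five steps a checkpoint appears under models/mb-gooaq/checkpoint-5. For reference, the saved keys can be inspected with a minimal sketch like this (assuming the weights land in model.safetensors inside the checkpoint directory):

# Minimal inspection sketch (assumption: the checkpoint stores its
# weights as model.safetensors inside the checkpoint directory).
from safetensors import safe_open

with safe_open("models/mb-gooaq/checkpoint-5/model.safetensors", framework="pt") as f:
    for key in sorted(f.keys())[:5]:
        print(key)

Consistent with the load report below, the LoRA weights are saved under a base_model.model. prefix, e.g. base_model.model.encoder.layer.0.attention.self.query.lora_A.default.weight.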

When I attempt to resume from the checkpoint with:

trainer.train(resume_from_checkpoint=True)

or

trainer.train(resume_from_checkpoint=checkpoint_dir)

I get the following load report:

Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 11448.33it/s]
Loading weights: 0it [00:00, ?it/s]
BertModel LOAD REPORT from: models/mb-gooaq/checkpoint-5
Key                                                                                  | Status     | 
-------------------------------------------------------------------------------------+------------+-
base_model.model.encoder.layer.{0...23}.attention.output.dense.lora_A.default.weight | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.query.lora_B.default.weight   | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.value.lora_B.default.weight   | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.output.dense.lora_A.default.weight           | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.output.dense.lora_B.default.weight           | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.output.dense.lora_B.default.weight | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.intermediate.dense.lora_A.default.weight     | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.key.lora_A.default.weight     | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.query.lora_A.default.weight   | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.key.lora_B.default.weight     | UNEXPECTED | 
base_model.model.pooler.dense.lora_B.default.weight                                  | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.attention.self.value.lora_A.default.weight   | UNEXPECTED | 
base_model.model.encoder.layer.{0...23}.intermediate.dense.lora_B.default.weight     | UNEXPECTED | 
base_model.model.pooler.dense.lora_A.default.weight                                  | UNEXPECTED | 
encoder.layer.{0...23}.attention.self.value.lora_A.default.weight                    | MISSING    | 
encoder.layer.{0...23}.attention.self.query.lora_A.default.weight                    | MISSING    | 
encoder.layer.{0...23}.attention.self.value.lora_B.default.weight                    | MISSING    | 
encoder.layer.{0...23}.attention.output.dense.lora_B.default.weight                  | MISSING    | 
encoder.layer.{0...23}.output.dense.lora_B.default.weight                            | MISSING    | 
encoder.layer.{0...23}.attention.self.key.lora_A.default.weight                      | MISSING    | 
encoder.layer.{0...23}.attention.self.key.lora_B.default.weight                      | MISSING    | 
encoder.layer.{0...23}.attention.self.query.lora_B.default.weight                    | MISSING    | 
encoder.layer.{0...23}.attention.output.dense.lora_A.default.weight                  | MISSING    | 
encoder.layer.{0...23}.intermediate.dense.lora_A.default.weight                      | MISSING    | 
encoder.layer.{0...23}.intermediate.dense.lora_B.default.weight                      | MISSING    | 
encoder.layer.{0...23}.output.dense.lora_A.default.weight                            | MISSING    | 
pooler.dense.lora_B.default.weight                                                   | MISSING    | 
pooler.dense.lora_A.default.weight                                                   | MISSING    | 

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING       :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
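
From the report, every UNEXPECTED key is exactly a MISSING key with a base_model.model. prefix attached: the checkpoint stores the LoRA weights under their PEFT wrapper names, while the freshly built model expects them without that prefix. As a temporary workaround I can strip the prefix and load the weights by hand (a minimal sketch, not a proper fix; the model.safetensors filename and the exact prefix are assumptions based on the report above):

# Workaround sketch: remap the PEFT wrapper prefix and load manually.
# Assumptions: weights live in model.safetensors, and the only mismatch
# is the "base_model.model." prefix shown in the load report.
from safetensors.torch import load_file

state_dict = load_file("models/mb-gooaq/checkpoint-5/model.safetensors")
remapped = {k.removeprefix("base_model.model."): v for k, v in state_dict.items()}

# model[0].auto_model is the underlying BertModel of the SentenceTransformer;
# strict=False so keys that already match are loaded and nothing else breaks.
result = model[0].auto_model.load_state_dict(remapped, strict=False)
print("still missing:", len(result.missing_keys), "| unexpected:", len(result.unexpected_keys))

This only restores the adapter weights, not the optimizer or scheduler state, so it is not a real substitute for resume_from_checkpoint.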

Versions:

    "peft>=0.18.1",
    "transformers>=5.3.0",
    "sentence-transformers>=5.2.2",

Any help would be greatly appreciated.

Thank you,

Enrico
