Hi Tom,
I am opening this issue because I am running into problems when trying to resume training from a LoRA checkpoint.
I start by training a model as follows:
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"

from peft import LoraConfig, TaskType
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers
from datasets import Dataset, load_dataset

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.max_seq_length = 512

# Attach a LoRA adapter to the base model
peft_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    inference_mode=False,
    target_modules="all-linear",
    r=32,
    lora_alpha=32,
    lora_dropout=0.01,
)
model.add_adapter(peft_config)

dataset = load_dataset("sentence-transformers/gooaq", split="train")
dataset_dict = dataset.train_test_split(test_size=10_000, seed=12)
train_dataset: Dataset = dataset_dict["train"].select(range(1_000_000))
eval_dataset: Dataset = dataset_dict["test"]

loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
loss = losses.MatryoshkaLoss(model, loss, matryoshka_dims=[1024, 768, 512, 384, 256])

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mb-gooaq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=5000,
    save_strategy="steps",
    save_steps=5,
    save_total_limit=2,
    logging_steps=1,
    logging_first_step=True,
    run_name="mb-gooaq",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
When I attempt to resume from the checkpoint with:
trainer.train(resume_from_checkpoint=True)
or
trainer.train(resume_from_checkpoint=checkpoint_dir)
I get the following output:
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:00<00:00, 11448.33it/s]
Loading weights: 0it [00:00, ?it/s]
BertModel LOAD REPORT from: models/mb-gooaq/checkpoint-5
Key | Status |
-------------------------------------------------------------------------------------+------------+-
base_model.model.encoder.layer.{0...23}.attention.output.dense.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.query.lora_B.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.value.lora_B.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.output.dense.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.output.dense.lora_B.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.output.dense.lora_B.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.intermediate.dense.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.key.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.query.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.key.lora_B.default.weight | UNEXPECTED |
base_model.model.pooler.dense.lora_B.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.attention.self.value.lora_A.default.weight | UNEXPECTED |
base_model.model.encoder.layer.{0...23}.intermediate.dense.lora_B.default.weight | UNEXPECTED |
base_model.model.pooler.dense.lora_A.default.weight | UNEXPECTED |
encoder.layer.{0...23}.attention.self.value.lora_A.default.weight | MISSING |
encoder.layer.{0...23}.attention.self.query.lora_A.default.weight | MISSING |
encoder.layer.{0...23}.attention.self.value.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.attention.output.dense.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.output.dense.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.attention.self.key.lora_A.default.weight | MISSING |
encoder.layer.{0...23}.attention.self.key.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.attention.self.query.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.attention.output.dense.lora_A.default.weight | MISSING |
encoder.layer.{0...23}.intermediate.dense.lora_A.default.weight | MISSING |
encoder.layer.{0...23}.intermediate.dense.lora_B.default.weight | MISSING |
encoder.layer.{0...23}.output.dense.lora_A.default.weight | MISSING |
pooler.dense.lora_B.default.weight | MISSING |
pooler.dense.lora_A.default.weight | MISSING |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
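From the load report, it looks like the saved keys carry the base_model.model. prefix that PEFT adds, while the resume path tries to load them into the bare BertModel, which expects unprefixed keys (hence the matching UNEXPECTED/MISSING pairs). For reference, here is a minimal, untested sketch of a manual key remap; the assumption that the weights sit in model.safetensors inside the checkpoint directory is mine and may not match how the trainer actually saves the adapter:

from safetensors.torch import load_file, save_file

ckpt = "models/mb-gooaq/checkpoint-5"  # checkpoint directory from the run above

# Load the saved weights and strip the "base_model.model." prefix that PEFT
# prepends, so the keys line up with what the bare BertModel expects.
state_dict = load_file(f"{ckpt}/model.safetensors")
remapped = {k.removeprefix("base_model.model."): v for k, v in state_dict.items()}
save_file(remapped, f"{ckpt}/model.safetensors")

# Afterwards, resuming would be retried with:
# trainer.train(resume_from_checkpoint=ckpt)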
Versions:
- peft>=0.18.1
- transformers>=5.3.0
- sentence-transformers>=5.2.2
Any help would be greatly appreciated.
Thank you,
Enrico