When using model with Pyspark on worker machine #103

@LovAsawa-Draup

The current implementation of OpusMT model loading in the EasyNMT library does the following:
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
This provides no way to specify a custom cache directory for model storage, which becomes a problem when the model is deployed in a distributed environment such as the worker nodes of a Spark cluster. By default, the model is downloaded to the default Hugging Face cache directory (/home/.cache). The master node typically has write permission for this directory, but worker nodes often lack write access to /home/.

As a result, each worker node that initializes the model tries to download it to the same default location and fails with a permission error.

Proposed Solution:
To avoid permission issues and make the model usable on worker nodes, the cache directory should be set explicitly during model initialization. The cache_dir parameter can be passed directly to from_pretrained(), so that models are downloaded to and cached in a specified directory that every node can write to.
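
A minimal sketch of what this could look like, not the EasyNMT implementation itself. The helper names pick_writable_cache_dir and load_opus_mt and the fallback directory under the system temp folder are hypothetical, introduced only for illustration; cache_dir is the existing keyword argument of from_pretrained() in the transformers library.

```python
import os
import tempfile


def pick_writable_cache_dir(candidates=None):
    """Return the first candidate directory that can be created and written to.

    Hypothetical helper: the default candidate (a folder under the system
    temp directory) is an assumption, not something EasyNMT defines.
    """
    if candidates is None:
        candidates = [os.path.join(tempfile.gettempdir(), "hf_cache")]
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            # Cannot create this directory (e.g. no permission); try the next.
            continue
        if os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable cache directory among: %r" % (list(candidates),))


def load_opus_mt(model_name, cache_dir):
    """Load the tokenizer and model, caching downloads in cache_dir."""
    # Imported inside the function so that defining this helper does not
    # require transformers; the import runs wherever the function is called.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir)
    return tokenizer, model
```

On a worker, this might be called as load_opus_mt("Helsinki-NLP/opus-mt-de-en", pick_writable_cache_dir()), where the model name is only an example. Each worker then downloads into a directory it can write to instead of the default location under /home/.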
