When using model with Pyspark on worker machine

The current implementation of OpusMT model loading in the EasyNMT library uses the following approach:

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
```
However, this approach does not allow a custom cache directory to be specified for model storage. The issue arises when the model is deployed in a distributed environment, such as on the worker nodes of a Spark cluster. By default, the model is downloaded to the Hugging Face cache under the user's home directory (`~/.cache/huggingface`). While the master node typically has the necessary permissions for this directory, worker nodes often lack write access to the home directory.

As a result, when the model is initialized on worker nodes, they attempt to download the model to the same default location, leading to permission errors.

**Proposed Solution:**
To avoid permission issues and ensure the model can be loaded on every worker node, the cache directory should be set explicitly during model initialization. The `cache_dir` parameter can be passed directly to `from_pretrained()`, so that models are downloaded and cached in a specified directory that each node can write to.
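
A minimal sketch of what this could look like, not the actual EasyNMT change. The helper names `resolve_cache_dir` and `load_opus_mt`, the `OPUS_MT_CACHE_DIR` environment variable, and the temp-directory fallback are all hypothetical and only illustrate the idea. The only library behavior relied on is that `from_pretrained()` accepts `cache_dir`.

```python
import os
import tempfile


def resolve_cache_dir(env_var="OPUS_MT_CACHE_DIR"):
    """Return a cache directory, creating it if it does not exist.

    OPUS_MT_CACHE_DIR is a hypothetical variable name used for illustration.
    Falls back to a folder under the system temp directory, which is
    assumed to be writable on the worker nodes.
    """
    cache_dir = os.environ.get(env_var) or os.path.join(
        tempfile.gettempdir(), "opus_mt_cache"
    )
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


def load_opus_mt(model_name, cache_dir=None):
    """Load the Marian tokenizer and model, caching files under cache_dir."""
    # Imported inside the function so the import runs on the worker
    # that actually calls it.
    from transformers import MarianMTModel, MarianTokenizer

    cache_dir = cache_dir or resolve_cache_dir()
    tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir)
    return tokenizer, model


# Example call (downloads the model on first use):
# tokenizer, model = load_opus_mt("Helsinki-NLP/opus-mt-de-en")
```

On a Spark cluster, `load_opus_mt` would be called inside the function that runs on the workers (for example, the one passed to `mapPartitions`), so each worker downloads into a directory it can write to. Setting the `HF_HOME` environment variable on the workers may be an alternative that needs no code change.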



When using model with Pyspark on worker machine #103
