The current implementation of OpusMT model loading in the EasyNMT library uses the following approach:
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
However, this approach does not allow specifying a custom cache directory for model storage. The issue arises when deploying the model across a distributed environment, such as the worker nodes of a Spark cluster. By default, the model is downloaded to the Hugging Face cache directory under the user's home directory (~/.cache/huggingface). While the master node typically has write permission for this location, worker nodes often lack write access under /home/.
As a result, when the model is initialized on the worker nodes, each worker attempts to download the model to the same default location and fails with a permission error.
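For context, a typical pattern that triggers this on a Spark cluster is loading the model inside a mapPartitions function, so every executor tries to populate its own copy of the default cache. The sketch below is illustrative only; the rdd variable and the per-partition loading strategy are assumptions, not part of EasyNMT:

from easynmt import EasyNMT

def translate_partition(rows):
    # Runs on every worker: model files are fetched into the default
    # Hugging Face cache, which fails if the executor user cannot write there.
    model = EasyNMT('opus-mt')
    for row in rows:
        yield model.translate(row, target_lang='de')

translated = rdd.mapPartitions(translate_partition)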
Proposed Solution:
To avoid permission issues and ensure proper model distribution across worker nodes, the cache directory should be set explicitly during model initialization. The cache_dir parameter can be passed directly to the from_pretrained() methods, ensuring models are downloaded and cached in a specified directory accessible to all nodes, as sketched below.
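A minimal sketch of the proposed change, assuming a shared path such as /shared/hf_cache that all nodes can write to (the path and the wrapper function are illustrative, not existing EasyNMT code):

from transformers import MarianMTModel, MarianTokenizer

def load_opus_mt(model_name, cache_dir='/shared/hf_cache'):
    # cache_dir is forwarded to from_pretrained(), so downloads land in a
    # directory every node can read and write instead of the default ~/.cache.
    tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir)
    return tokenizer, model

tokenizer, model = load_opus_mt('Helsinki-NLP/opus-mt-en-de')

In EasyNMT itself this would mean accepting a cache directory argument in the OpusMT loader and forwarding it to both from_pretrained() calls, so callers on worker nodes can point it at a writable location.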