fix: update e2e-ml-workflow deps to fix deployment container crash #3847
Open
Chakradhar886 wants to merge 14 commits into main from fix/e2e-ml-workflow-update-deps
Conversation
- Update conda.yaml: python 3.8 -> 3.10, scikit-learn 0.24.2 -> 1.5.1; numpy, scipy, pandas, and azureml-mlflow versions aligned with sklearn-1.5
- Update Environment: base image ubuntu20.04 -> ubuntu22.04; remove hardcoded version to avoid stale cached environments
- Update workflow: python-version 3.8 -> 3.10 (3.8 is EOL, and sklearn-1.5 requires 3.9+)

Root cause: the data prep environment (python 3.8, sklearn 0.24.2) was incompatible with the curated sklearn-1.5 training environment, causing a 502 liveness probe failure on deployment container startup.
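The environment update can be sketched with the azure-ai-ml SDK. This is a minimal sketch, assuming an authenticated `MLClient` named `ml_client`; the image tag and conda file path are illustrative assumptions, not the PR's exact values:

```python
# Sketch of registering the updated custom environment (azure-ai-ml SDK).
# Assumes an authenticated MLClient named ml_client; the image tag and
# conda_file path below are illustrative, not the PR's exact values.
from azure.ai.ml.entities import Environment

custom_env = Environment(
    name="aml-scikit-learn",
    description="Custom environment for the e2e-ml-workflow pipeline",
    # ubuntu22.04 base image replaces the older ubuntu20.04 one
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04",
    conda_file="./dependencies/conda.yaml",
    # No explicit version: the workspace auto-versions, so an updated
    # conda.yaml cannot silently resolve to a stale cached build.
)
ml_client.environments.create_or_update(custom_env)
```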
…ent crash

- Update conda.yaml: python 3.8 -> 3.10, scikit-learn 0.24.2 -> 1.5.1; numpy, scipy, pandas, and azureml-mlflow versions aligned with sklearn-1.5
- Update Environment: base image ubuntu20.04 -> ubuntu22.04; remove hardcoded version to avoid stale cached environments
- Update workflow: python-version 3.8 -> 3.10 (3.8 is EOL, and sklearn-1.5 requires 3.9+)
- Add an ml_client.jobs.stream() call after pipeline submission to wait for pipeline completion before deploying the model

Root cause: the pipeline job was submitted asynchronously, but the notebook proceeded to deploy without waiting. The deployment picked up a stale model from a previous run trained with an incompatible sklearn, causing a 502 liveness probe failure on the inference container.
…Azure/azureml-examples into fix/e2e-ml-workflow-update-deps
- Revert train.py to module-level mlflow (matches the working pipeline.ipynb)
- Change train.yml to use the custom environment (azureml:aml-scikit-learn@latest) instead of the curated sklearn-1.5 environment, to ensure consistency between the data_prep and train pipeline steps
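In a command-component YAML spec, the environment switch is a one-line change. The fragment below is an illustrative sketch of a typical train.yml, not the file's exact contents; only the `environment:` line reflects the change described above:

```yaml
# Illustrative command component spec (train.yml); the environment line
# is the change described above, the rest is an assumed typical shape.
name: train_model
type: command
inputs:
  train_data:
    type: uri_folder
environment: azureml:aml-scikit-learn@latest   # custom env, was curated sklearn-1.5
command: >-
  python train.py --train_data ${{inputs.train_data}}
```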
- Remove inference-schema[numpy-support]==1.3.0 (incompatible with Python 3.10 / numpy 1.26.4, and not needed for the training pipeline)
- Add an mlflow==2.14.3 pin (matching the working azureml-in-a-day tutorial)
- These likely caused the conda environment build failure and the resulting train_job crash
Add missing pip packages (psutil, tqdm, ipykernel, matplotlib) that are present in the working azureml-in-a-day conda.yaml. The environment build may require these for proper initialization.
The e2e-ml-workflow notebook was submitting the pipeline asynchronously and immediately proceeding to deployment. This caused the deployment to use a stale model from a previous run, resulting in a 502 liveness probe failure. Added ml_client.jobs.stream(pipeline_job.name) to wait for pipeline completion before deploying the model.
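A minimal sketch of the ordering fix, assuming an authenticated `MLClient` named `ml_client` and an already-built pipeline job object; the experiment name is an assumption, and the deployment steps are elided:

```python
# Submit the pipeline, then block until it finishes before deploying.
# Assumes an authenticated MLClient (ml_client) and a built pipeline job;
# the experiment name is an illustrative assumption.
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="e2e_ml_workflow"
)

# Without this call the notebook raced ahead and deployed a stale model.
# stream() tails the job logs and returns only when the job terminates.
ml_client.jobs.stream(pipeline_job.name)

# ...only now register the freshly trained model and create the deployment.
```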
…Azure/azureml-examples into fix/e2e-ml-workflow-update-deps
The data_prep_component declared its data input as uri_folder, but the pipeline passes a uri_file (CSV data). Newer AzureML SDK versions enforce the declared type, causing args.data to be a directory path instead of a file path, which makes pd.read_csv(args.data) fail. Changed the declaration to uri_file to match the actual data type being passed, consistent with the quickstart.ipynb tutorial pattern.
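With the azure-ai-ml `command()` builder, the type fix amounts to one word in the input declaration. The sketch below is illustrative: the component name, paths, and command line are assumptions, not the PR's exact code:

```python
# Illustrative sketch: declaring the component's data input as uri_file so
# it matches the CSV file the pipeline actually passes. Component name,
# paths, and command string are assumptions for illustration.
from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes

data_prep_component = command(
    name="data_prep",
    inputs={
        # Was AssetTypes.URI_FOLDER; newer SDKs enforce the declared type,
        # so a uri_file arriving here would surface as a folder path.
        "data": Input(type=AssetTypes.URI_FILE),
        "test_train_ratio": Input(type="number"),
    },
    outputs={"train_data": Output(type=AssetTypes.URI_FOLDER)},
    code="./components/data_prep/",
    command="python data_prep.py --data ${{inputs.data}} "
            "--test_train_ratio ${{inputs.test_train_ratio}}",
    environment="azureml:aml-scikit-learn@latest",
)
```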
- Added version='1.0.0' to Environment() to force a fresh environment build with the updated conda.yaml packages, avoiding potentially stale environment resolution in the shared workspace.
- Reverted the data_prep_component input type back to uri_folder (the original value), since the pipeline was passing before the dependency updates.
…sion

- Added select_first_file() to data_prep.py to properly handle the uri_folder input type. The component declares data as uri_folder, so args.data is a folder path, while pd.read_csv needs a file path. This matches the pattern already used in train.py.
- Removed the explicit version='1.0.0' from Environment() to avoid colliding with existing stale environment versions in the shared workspace; the environment now auto-versions, like the working azureml-in-a-day tutorial.
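The helper described above can be sketched as follows. This is an assumed implementation consistent with the uri_folder-handling pattern in the azureml-examples tutorials, not necessarily the PR's exact code:

```python
import os

def select_first_file(path):
    """Resolve a uri_folder mount to a readable file path.

    If path is a directory (a uri_folder input), return the first file
    inside it; a uri_file input is already a file path and passes through.
    """
    if os.path.isdir(path):
        files = sorted(os.listdir(path))
        return os.path.join(path, files[0])
    return path
```

data_prep.py can then call `pd.read_csv(select_first_file(args.data))` and work whether the input arrives as a file path or as a one-file folder mount.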
mlflow==2.14.3 imports pkg_resources (from setuptools) at module level. With pip=24.0 and Python 3.10, setuptools is no longer bundled by default in conda environments, causing: ModuleNotFoundError: No module named 'pkg_resources'
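One common remedy for this failure mode, shown here as an assumed conda.yaml fragment rather than the PR's exact file, is to declare setuptools explicitly so pkg_resources is guaranteed to be importable:

```yaml
# Illustrative conda.yaml fragment: listing setuptools explicitly so that
# mlflow's module-level `import pkg_resources` succeeds on Python 3.10.
dependencies:
  - python=3.10
  - pip=24.0
  - setuptools        # provides pkg_resources, no longer bundled implicitly
  - pip:
      - mlflow==2.14.3
```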