How to manage your machine learning pipeline with MLflow

Here at Kili we’re always excited to improve our processes, which is why we’re starting to work with MLflow. In this article we share some insights into how it works and how you can set up your machine learning pipeline with it, and we walk through a hands-on example. MLflow is an open source project that aims to help manage the lifecycle of machine learning models, all the way from training to production.

As a data scientist you might have noticed that working with machine learning is, in some respects, considerably more complex than traditional software development. We’re not only working with different versions of code, we’re also working with different versions of data and different parameters.

In any case, it’s generally not very useful to track parameters with git. That’s because in most cases, when you adjust parameters, the code stays essentially the same. You might want to separate your runs into experiments, but you’ll need to create your own convention for those experiments and for where to store their results.

Another very important aspect is deploying the model to production. There is a myriad of tools for creating machine learning models, and the models they produce are often not compatible with each other, yet you want to expose a single unified interface to interact with them. You might, for example, train one model with scikit-learn and another with PyTorch and use both in production for a given task.

MLflow tries to tackle most of those problems with three main concepts: MLflow Tracking, MLflow Projects and MLflow Models.

MLflow Tracking

Tracking tries to solve several problems at once. In general, it’s a unified logging hub where you and your team can manage experiments, versions and so on. This way you don’t have to worry about how to log your results, how to share them with your team or how to access them later, as they’ll be available via the tracking server.

Tracking will be responsible for logging all of your activity during your machine learning experiments. The main information that is tracked is:

  • Code and model versions
  • Timestamps
  • Authors
  • Parameters / Hyperparameters
  • Metrics
  • Artifacts (any kind of file)
  • Models

In MLflow the unit of execution is called a run. Every run contains some of the information listed above, and you can log everything needed to repeat the run. You can think of a run as one execution of a script for whatever purpose: training, testing, plotting results etc.

You should aggregate runs into experiments. An experiment should contain, for example, runs of the same code/model with slight variations in architecture or, more commonly, in parameters. It’s important to note that runs in the same experiment store their large files in the same location; this is explained in more detail below.

The API

When using MLflow Tracking, you use the API to log all of the information you need, including models and artifacts. The example below shows a Python script that logs a parameter, several values of a metric and, at the end, an artifact. For more information, you can refer to https://www.mlflow.org/docs/latest/python_api/index.html.

import os
from random import random, randint

import mlflow
from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    # Point the client at the tracking server
    mlflow.set_tracking_uri("http://localhost:5000")

    # Log a parameter (a key-value pair)
    log_param("param1", randint(0, 100))

    # Log a metric; logging the same key again adds to its history
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    # Write a file, then log the whole directory as artifacts
    os.makedirs("outputs", exist_ok=True)
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs", "mlruns")
    mlflow.end_run()

The backend store

Although you can track your parameters without running a server, the recommended approach is to create an MLflow tracking server. This server gives you access to a graphical interface where you can manage your runs and experiments. Some of the functionality available via the API is also available via the interface, for example creating an experiment.

When running an MLflow server you can specify a backend store. This store is any SQLAlchemy-compatible database, and it is responsible for logging anything that is not an artifact. Concretely, this means that only the large files, models included, are left out of the backend store. At the end of the article we present a simple example of using MLflow where you’ll see how to configure your own backend store.

The artifact store

Like the backend store, the artifact store is another storage location for the MLflow tracking server, and you’ll also need to configure it, as in the example at the end. Its goal is to hold large files that are not suitable for a relational database. In general you’ll specify a cloud bucket. Some examples of artifacts are your models, images (such as plots of metrics) and binaries. Any file can be stored as an artifact.
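To make the relationship between experiments, runs and the artifact store more concrete, here is a small sketch of how artifact locations are typically laid out. It assumes the usual layout in which each run writes under <experiment artifact location>/<run id>/artifacts; the helper name, the bucket and the run ids are hypothetical, and the exact layout may differ between MLflow versions.

```python
import posixpath


def run_artifact_uri(experiment_location, run_id, artifact_path=""):
    """Join an experiment's artifact location, a run id and a relative path."""
    # Assumed layout: <experiment location>/<run id>/artifacts/<artifact path>
    base = experiment_location.rstrip("/")
    return posixpath.join(base, run_id, "artifacts", artifact_path).rstrip("/")


# Hypothetical bucket and run ids, for illustration only.
location = "gs://my-bucket/test_experiment"
print(run_artifact_uri(location, "run_a", "logistic_regression"))
print(run_artifact_uri(location, "run_b"))
```

Under this assumption, two runs of the same experiment share the same bucket prefix and differ only by their run id.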

MLflow Projects

In simple terms, projects describe how to run scripts. Their goal is to make every run reproducible, so that machine learning code is easy to rerun. To achieve that they use two main concepts: entry points and environments. Both are described in an MLproject file (example below).

name: My Project

conda_env: my_env.yaml
# Can have a docker_env instead of a conda_env, e.g.
# docker_env:
#    image:  mlflow-docker-example

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

The project consists of everything within the folder that contains an MLproject file. It’s recommended that you create a git repo for each project, or at least that the project is contained within a git repo, because MLflow uses the commits to track the version of the code that was used to trigger a run.

The entry points

Each entry point corresponds to a script that should be run. Each one is described in the MLproject file and has a unique name. You can also add parameters to the entry points, and they’ll be automatically logged with each run. In the example below we have two entry points, train and test. To run the project you then just need to use the mlflow run <project path> -e <entry point> command. More details are given in the example at the end.

name: test_project

conda_env: conda_env.yaml

entry_points:
  train:
    command: "python3 train.py --gpus=1"
  test:
    command: "python3 test.py --vis-preds"
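The way parameters are substituted into an entry point's command can be pictured with plain Python string formatting. The sketch below only illustrates the idea and is not MLflow's actual implementation; the function name is hypothetical, and the command template and default value mirror the first MLproject example above.

```python
def render_command(template, declared, supplied):
    """Fill an entry point's command template with parameter values.

    declared maps parameter names to a default value (None if required);
    supplied holds the values given by the user.
    """
    values = {}
    for name, default in declared.items():
        if name in supplied:
            values[name] = supplied[name]
        elif default is not None:
            values[name] = default
        else:
            raise ValueError(f"missing required parameter: {name}")
    return template.format(**values)


command = render_command(
    "python train.py -r {regularization} {data_file}",
    declared={"data_file": None, "regularization": 0.1},
    supplied={"data_file": "data.csv"},
)
print(command)  # python train.py -r 0.1 data.csv
```

Here regularization falls back to its declared default, while data_file has no default and must be supplied.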

The environment

The environment describes the dependencies needed to run the code. If your code is complex and needs lots of dependencies you may want to specify them in a Dockerfile. An example is shown below:

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        git \
        curl \
        libglib2.0-0 \
        software-properties-common \
        python3.6-dev \
        python3-pip \
        python3-tk

RUN pip3 install --upgrade pip
RUN pip3 install setuptools
RUN pip3 install matplotlib numpy pandas scipy tqdm pyyaml easydict scikit-image bridson Pillow ninja
RUN pip3 install imgaug mxboard graphviz
RUN pip3 install albumentations==0.5       
RUN pip3 install opencv-python-headless
RUN pip3 install Cython
RUN pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip3 install scikit-learn
RUN pip3 install tensorboard

In this case you just specify what needs to be installed; there’s no need to copy folders or anything else. When you use mlflow run, MLflow builds a new Docker image on top of the image built from this Dockerfile and copies the full project directory tree into it. It’s important to know that the image is only built when needed, not at every run. More details on Docker environments can be found at https://github.com/mlflow/mlflow/tree/master/examples/docker.

Using Docker to build the project gives the most flexibility but might introduce some problems, for instance with memory management. When possible it’s preferable to use conda environments, which define the Python installation and packages needed. Below is an example of a conda environment file.

# Conda environment used for mlflow
name: iseg
dependencies:
  - python=3.8.5
  - pyyaml
  - tqdm
  - pip
  - pip:
    - --find-links https://download.pytorch.org/whl/torch_stable.html
    - google-cloud-storage
    - mlflow
    - albumentations==0.5
    - easydict
    - tensorboard
    - torch==1.5.0+cu101
    - torchvision==0.6.0+cu101

Note that mlflow appears in the list; this dependency is always needed. If you use a cloud artifact store you’ll also need to add a dependency that supports artifact logging. The example above includes google-cloud-storage because the artifacts are stored in Google Cloud Storage buckets.

MLflow Models

Models solve the lack of unification in machine learning code. There’s a myriad of tools and frameworks and using them all at the same time can make it difficult to package those models for production. With models you can have a unified interface to interact and serve models which ultimately speeds up the management of the lifecycle in machine learning.

An MLflow model is essentially a folder of artifacts that contains the information needed to execute a model, alongside an MLmodel file. When saving a model you’ll save an MLmodel file, the model itself and an environment that describes the dependencies used while running the model.

An example of an MLmodel file is shown below:

time_created: 2018-05-25T17:28:53.35

flavors:
  sklearn:
    sklearn_version: 0.19.1
    pickled_model: model.pkl
  python_function:
    loader_module: mlflow.sklearn

To create a model you just need to use the mlflow.<flavor>.log_model method (for example mlflow.sklearn.log_model), which logs all the necessary artifacts.

MLflow already provides log functions for PyTorch models, TensorFlow models, scikit-learn models etc. You can also define your own custom models using the pyfunc flavor. A flavor is a convention that deployment tools use to understand a model, and it is one of the main strengths of MLflow Models, as it standardises the interface of different types of models. For more information go to https://www.mlflow.org/docs/latest/models.html.

Another very important concept is the MLflow Model Registry.

Model Registry

In the tracking interface you’ll see the Models tab, where you can manage the model registry. When you log a model it’s accessible inside the run in the artifacts area; however, it’s useful to version your models independently from the code. The model registry allows you to version and manage the lifecycle of your models.

You can use either the UI or the API to register models. The steps in a model lifecycle are:

  • Registering → adds the model to the model registry or adds a new version
  • Transitioning → moves a version between stages
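Once a model is registered, it can be referenced by name rather than by run. The sketch below builds the two URI forms MLflow uses when loading models: runs:/<run_id>/<path> for a model logged in a run, and models:/<name>/<version or stage> for a registered model. The helper names, the model name and the run id are hypothetical.

```python
def run_model_uri(run_id, artifact_path):
    """URI of a model logged under a run's artifacts."""
    return f"runs:/{run_id}/{artifact_path}"


def registered_model_uri(name, version_or_stage):
    """URI of a registered model, addressed by version number or stage name."""
    return f"models:/{name}/{version_or_stage}"


# Hypothetical identifiers, for illustration only.
print(run_model_uri("abc123", "logistic_regression"))
print(registered_model_uri("iris_classifier", 2))
print(registered_model_uri("iris_classifier", "Production"))
```

Either kind of string can be passed to a loader such as mlflow.sklearn.load_model, as the test script later in this article does with a runs:/ URI.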

[Image: the model registry interface]

A hands on example

In this example we’ll build a setup that allows you to track and persist your tracking information in the cloud. This setup is fully usable if you want to run your machine learning pipelines.

The tracking server

First we’ll prepare the tracking server. It’s useful to set up the server inside a Docker container, which makes it much easier to run on different systems. Below is the Dockerfile for the tracking server:

FROM python:3.7-slim-buster

# Install python packages
RUN pip install mlflow google-cloud-storage psycopg2-binary

ADD app/ /app

WORKDIR /app

CMD ./run.sh

Here, we install google-cloud-storage because we need it for our artifact store, and psycopg2-binary for our backend store (Postgres in Google Cloud SQL). First build the image with docker build -t mlflow-image . and then run the server with docker run -it --rm -p 5000:5000 mlflow-image

Now you’re able to run a tracking server on port 5000. You might have noticed that the command used to run it is in fact ./run.sh; that’s because we’ve also created a folder app which contains this script with the run command. This file should contain the following:

mlflow server \
  -h 0.0.0.0 \
  --backend-store-uri postgresql+psycopg2://$MLFLOW__DATABASE_USER:$MLFLOW__DATABASE_PASSWORD@$MLFLOW__DATABASE_IP/$MLFLOW__DATABASE_NAME \
  --default-artifact-root $MLFLOW__ARTIFACTS_BUCKET

The values starting with $ should be replaced by the details of your artifact store and your backend store. You can find more information at https://docs.sqlalchemy.org/en/14/core/engines.html on how to create the connection for the backend store.
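If the database password contains characters such as @ or /, it has to be URL-escaped before it is placed in the connection string. The sketch below assembles the backend store URI from the same variable names used in run.sh; the function name and the example values are hypothetical placeholders.

```python
import os
from urllib.parse import quote_plus


def backend_store_uri(env=os.environ):
    """Build the PostgreSQL URI passed to --backend-store-uri."""
    user = quote_plus(env["MLFLOW__DATABASE_USER"])
    # Escape the password so characters such as '@' or '/' do not break the URI.
    password = quote_plus(env["MLFLOW__DATABASE_PASSWORD"])
    host = env["MLFLOW__DATABASE_IP"]
    name = env["MLFLOW__DATABASE_NAME"]
    return f"postgresql+psycopg2://{user}:{password}@{host}/{name}"


# Placeholder values, for illustration only.
example = {
    "MLFLOW__DATABASE_USER": "mlflow",
    "MLFLOW__DATABASE_PASSWORD": "p@ss/word",
    "MLFLOW__DATABASE_IP": "10.0.0.5",
    "MLFLOW__DATABASE_NAME": "mlflow_db",
}
print(backend_store_uri(example))
# postgresql+psycopg2://mlflow:p%40ss%2Fword@10.0.0.5/mlflow_db
```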

The final file structure should be:

app/
  run.sh
Dockerfile

At the end of this step, opening http://localhost:5000 in the browser should show the tracking UI.

Creating a simple project

Now we’ll create a simple project that trains a simple logistic regression model. The structure for the project will be:

test_project/
   conda_env.yaml
   create_experiment.py
   MLproject
   test.py
   train.py

We start with the conda_env.yaml file:

# Conda environment used for mlflow
name: test
dependencies:
  - python=3.7
  - scikit-learn
  - pip
  - pip:
    - mlflow
    - google-cloud-storage

This file contains the dependencies needed to run the project: mlflow, google-cloud-storage to store artifacts, and scikit-learn, which is used to train the model.

Next is the file create_experiment.py, which is used to create a new experiment via the API. The first command you run should be python create_experiment.py

import mlflow

def main():
    mlflow.set_tracking_uri('http://localhost:5000')
    mlflow.create_experiment('test_experiment', 'gs://path/to/bucket')

if __name__ == "__main__":
    main()

Now the train.py script. It trains a simple logistic regression on the iris dataset, and it also logs a parameter, three values of a metric, an artifact and, at the end, the trained model.

import os
from random import random, randint

import mlflow
import mlflow.sklearn  # makes mlflow.sklearn.log_model available
from mlflow import log_metric, log_param, log_artifacts
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def main():
    # Logging parameter
    log_param("param1", randint(0, 100))
    
    # Logging metric
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    # Logging artifacts
    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs", "mlruns")
    
    # Logging models
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(random_state=0).fit(X, y)
    mlflow.sklearn.log_model(clf, "logistic_regression")

    mlflow.end_run()

if __name__ == "__main__":
    main()

The test.py script is shown below. It takes one parameter as input, a run id, uses it to fetch the model from the given run, and uses this model to print some predictions.

import click
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris

@click.command()
@click.option('--run_id', required=True)
def main(run_id):
    # Load the scikit-learn model logged by the given run
    mlflow.set_tracking_uri('http://localhost:5000')
    logged_model = f'runs:/{run_id}/logistic_regression'
    loaded_model = mlflow.sklearn.load_model(logged_model)

    X, y = load_iris(return_X_y=True)
    print('===== Predict probas ======')
    print(loaded_model.predict_proba(X[:2, :]))
    print('===========================')

if __name__ == "__main__":
    main()

Finally we have the MLproject file. Here we define the entry points, the parameter for the second script and the environment file.

name: test_project

conda_env: conda_env.yaml

entry_points:
  train:
    command: "python3 train.py" 
  test:
    parameters:
      run_id: string
    command: "python3 test.py --run_id {run_id}"

We set the environment to the one defined in the conda environment file and declare two entry points, one for training and one for testing. To run training we just have to run:

MLFLOW_TRACKING_URI=http://localhost:5000 MLFLOW_EXPERIMENT_NAME=test_experiment GOOGLE_APPLICATION_CREDENTIALS=~/path/to/gcloud-key-file.json mlflow run . -e train

And for testing we do:

MLFLOW_TRACKING_URI=http://localhost:5000 MLFLOW_EXPERIMENT_NAME=test_experiment GOOGLE_APPLICATION_CREDENTIALS=~/path/to/gcloud-key-file.json mlflow run . -e test -P run_id=<run_id>

The main thing to note is that we’ve added three environment variables. When using a tracking server you have to provide its URI to mlflow run as an environment variable. You also need to provide credentials if you log artifacts or models to a bucket, and finally you need to specify the experiment’s name.
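If you prefer to launch these commands from a Python script, the same invocation can be assembled programmatically. The sketch below only builds the argument list and the environment, reusing the values from the commands above; the helper names are hypothetical.

```python
import os


def mlflow_run_command(entry_point, params=None, project_dir="."):
    """Build the argument list for `mlflow run`, one -P flag per parameter."""
    command = ["mlflow", "run", project_dir, "-e", entry_point]
    for key, value in (params or {}).items():
        command += ["-P", f"{key}={value}"]
    return command


def mlflow_run_env(tracking_uri, experiment_name, credentials_path):
    """Copy the current environment and add the three variables used above."""
    env = dict(os.environ)
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    env["MLFLOW_EXPERIMENT_NAME"] = experiment_name
    env["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser(credentials_path)
    return env


command = mlflow_run_command("test", {"run_id": "<run_id>"})
env = mlflow_run_env(
    "http://localhost:5000", "test_experiment", "~/path/to/gcloud-key-file.json"
)
print(" ".join(command))  # mlflow run . -e test -P run_id=<run_id>
```

To actually start the run, pass both values to subprocess.run(command, env=env, check=True) after importing subprocess.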

Triggering a run of a project automatically creates a run on the tracking server, and any parameters used are logged as well; that’s what you’ll see after running the test script.

After running both entry points you’ll see a train run and a test run. For the train run you’ll see the model, and you’ll also be able to register it.
