Date: 2021-09-03 13:19
Read: 7 min

How to manage your machine learning pipeline with MLflow

In this article, we'll give insights into the machine learning process and show how MLflow can help you set up your machine learning pipeline, along with a hands-on example.

Here at Kili we're always excited to improve our processes. That's why we're starting to work with MLflow, an open source project that aims to help with managing the lifecycle of machine learning models, all the way from training to production. In this article we'll give some insights into how it works and how you can use MLflow to set up your machine learning pipeline, which we'll show with a hands-on example.

As a data scientist, you might have noticed that working with machine learning is, in certain respects, more complex than software development, because we're not only working with different versions of code; we're also working with different versions of data and different parameters.

Tracking parameters with Git is generally not very useful, because in most cases the code stays essentially the same when you adjust them. You might want to separate your runs into experiments, but then you'll need to create your own convention for those experiments and for where to store their results.

Another very important aspect is deploying the model in production. There is a myriad of tools available for creating machine learning models, and the models they produce are often not compatible with each other, yet you will still want to expose a single unified interface to interact with them. You might, for example, train one model with scikit-learn and another one with PyTorch and use them both in production for a given task.

MLflow tries to tackle most of those problems with three main concepts: MLflow Tracking, MLflow Projects and MLflow Models.

MLflow Tracking

Tracking is a unified logging hub where you can manage your experiments as a team, track versions, and so on. This way, you won't have to worry about how to log your results, how to share them with your team or how to access them later on, as they'll be available via the tracking server.

Tracking will be responsible for logging all of your activity during your machine learning experiments. The main information that is tracked is:

  • Code and model versions

  • Timestamps

  • Authors

  • Parameters / Hyperparameters

  • Metrics

  • Artifacts (any kind of file)

  • Models

In MLflow the unit of execution is called a “run”. Every run contains some of the information listed above, and you can log everything needed to repeat the run. A run typically corresponds to one execution of a script, for whatever reason: training, testing, plotting results, etc.

You should group runs into experiments. An experiment should contain, for example, runs of the same code/model with a slight variation in architecture or, more commonly, in parameters. It's important to note that runs in the same experiment store their large files in the same location. This is explained in more detail below.
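
To make the relationship between runs and experiments concrete, here is a minimal conceptual sketch in plain Python. It is not MLflow's implementation: the `Run` and `Experiment` classes are hypothetical, and the `<artifact_location>/<run_id>/artifacts` layout is an assumption used only to illustrate that runs of one experiment share a storage location.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import uuid


@dataclass
class Run:
    """One execution of a script: its parameters, metric history and artifact location."""
    run_id: str
    artifact_uri: str
    params: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, List[float]] = field(default_factory=dict)

    def log_param(self, key: str, value: object) -> None:
        # A parameter is stored once per run, as a string.
        self.params[key] = str(value)

    def log_metric(self, key: str, value: float) -> None:
        # Logging the same metric again appends to its history.
        self.metrics.setdefault(key, []).append(value)


@dataclass
class Experiment:
    """A group of related runs that share one artifact location."""
    name: str
    artifact_location: str
    runs: List[Run] = field(default_factory=list)

    def start_run(self) -> Run:
        run_id = uuid.uuid4().hex
        # Assumed layout: each run gets its own folder under the shared location.
        run = Run(run_id, f"{self.artifact_location}/{run_id}/artifacts")
        self.runs.append(run)
        return run


experiment = Experiment("test_experiment", "gs://path/to/bucket")
run_a = experiment.start_run()
run_b = experiment.start_run()
run_a.log_param("param1", 42)
run_a.log_metric("foo", 0.5)
run_a.log_metric("foo", 1.5)
```

Both runs end up under the same bucket, while their parameters and metrics stay separate, which is the split between artifact store and backend store described in the following sections.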

The API

When using MLflow Tracking, you use the API to log all of the information you need, including models and artifacts. The example below shows a Python script that logs a parameter, three values of a metric and, at the end, an artifact. For more information, you can refer to https://www.mlflow.org/docs/latest/python_api/index.html.

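import os
from random import random, randint

import mlflow
from mlflow import log_metric, log_param, log_artifacts

if __name__ == "__main__":
    # Point the client at the tracking server
    mlflow.set_tracking_uri('http://localhost:5000')

    # Log a parameter (key-value pair)
    log_param("param1", randint(0, 100))

    # Log a metric; logging the same key again adds to its history
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    # Write a file and log the whole directory as artifacts
    os.makedirs("outputs", exist_ok=True)
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    # The second argument is the destination folder inside the run's artifact location
    log_artifacts("outputs", "mlruns")
    mlflow.end_run()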


The backend store

Although you can track your parameters without running a server, the recommended approach is to set up an MLflow tracking server. This server gives you access to a graphical interface where you can manage your runs and experiments. Some of the functionality available via the API is also available via the interface; creating an experiment is one example. When running an MLflow server you can specify a backend store: any SQLAlchemy-compatible database, which is responsible for storing everything that is not an artifact. Concretely, this means that only large files, including models, are left out of the backend store. The hands-on example at the end shows how to configure your own backend store.

The artifact store

Like the backend store, the artifact store is a storage location for the MLflow tracking server, and you'll need to configure it as in the example at the end. Its goal is to hold large files that are not suitable for a relational database. In general you'll specify a cloud bucket. Some examples of artifacts are your models, images (such as plots of metrics) and binaries. Any file can be stored as an artifact.

MLflow Projects

In simple terms, projects describe how to run scripts. Their goal is to make every run, and therefore your machine learning code, easily reproducible. To achieve that they use two main concepts: entry points and environments. Both are described in an MLproject file (example below).

name: My Project

conda_env: my_env.yaml
# Can have a docker_env instead of a conda_env, e.g.
# docker_env:
#    image:  mlflow-docker-example

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

The project consists of everything inside the folder that contains the MLproject file. It's recommended that you create a Git repo for each project, or at least that the project is contained within a Git repo, because MLflow uses the commits to track the version of the code that was used to trigger a run.

The entry points

Each entry point corresponds to a script that should be run. Each one is described in the MLproject file and has a unique name. You can also add parameters to the entry points, and they'll be logged automatically with each run. In the example below we have two entry points, train and test. To run the project you just need to use the mlflow run . -e <entry point> command. More details are in the example at the end.

name: test_project

conda_env: conda_env.yaml

entry_points:
  train:
    command: "python3 train.py --gpus=1"
  test:
    command: "python3 test.py --vis-preds"
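
To illustrate how a parameterized entry point turns into a concrete command, here is a small sketch of the idea in plain Python. It mirrors the main entry point of the first MLproject example; the `ENTRY_POINT` dictionary and the `build_command` helper are hypothetical and only approximate what happens when you pass values with -P to mlflow run.

```python
import shlex
from typing import Dict, Optional

# Hypothetical in-memory version of the "main" entry point from the first MLproject example.
ENTRY_POINT = {
    "command": "python train.py -r {regularization} {data_file}",
    "defaults": {"regularization": 0.1},  # data_file has no default, so it is required
}


def build_command(entry_point: dict, params: Optional[Dict[str, object]] = None) -> str:
    """Fill the command template with defaults overridden by user-supplied values."""
    values = {**entry_point["defaults"], **(params or {})}
    # Quote each value so that a path containing spaces stays a single shell argument.
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    try:
        return entry_point["command"].format(**quoted)
    except KeyError as missing:
        raise ValueError(f"Missing required parameter: {missing.args[0]}") from None


command = build_command(ENTRY_POINT, {"data_file": "data/train set.csv"})
print(command)
```

The default for regularization is used unless a value is supplied, while leaving out data_file raises an error because the template cannot be completed.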


The environment

The environment describes the dependencies needed to run the code. If your code is complex and needs lots of dependencies you may want to specify them in a Dockerfile. An example is shown below:

# Note: a complete Dockerfile starts with a FROM <base image> line, which this excerpt omits.
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        git \
        curl \
        libglib2.0-0 \
        software-properties-common \
        python3.6-dev \
        python3-pip \
        python3-tk

RUN pip3 install --upgrade pip
RUN pip3 install setuptools
RUN pip3 install matplotlib numpy pandas scipy tqdm pyyaml easydict scikit-image bridson Pillow ninja
RUN pip3 install imgaug mxboard graphviz
RUN pip3 install albumentations==0.5
RUN pip3 install opencv-python-headless
RUN pip3 install Cython
RUN pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip3 install scikit-learn
RUN pip3 install tensorboard


In this case you just specify what needs to be installed; there's no need to copy folders or anything else. When you use mlflow run, MLflow builds a new Docker image that uses this one as its base image and copies the full project directory tree into it. It's important to know that it only builds the image when needed, not at every run. More details on Docker environments can be found at https://github.com/mlflow/mlflow/tree/master/examples/docker.

Using Docker to build the project gives us the most flexibility but might introduce some problems, for instance with memory management. When possible it's preferable to use conda environments, which define the Python packages needed. You'll find below an example of a conda environment file.

# Conda environment used for mlflow
name: iseg
dependencies:
  - python=3.8.5
  - pyyaml
  - tqdm
  - pip
  - pip:
    - --find-links https://download.pytorch.org/whl/torch_stable.html
    - google-cloud-storage
    - mlflow
    - albumentations==0.5
    - easydict
    - tensorboard
    - torch==1.5.0+cu101
    - torchvision==0.6.0+cu101


Note that mlflow is listed: this dependency is always needed. If you use a cloud artifact store you'll also need to add a dependency to support artifact logging. The example above includes google-cloud-storage because the artifacts will be stored in Google Cloud Storage buckets.

MLflow Models

Models solve the lack of unification in machine learning code. There's a myriad of tools and frameworks, and using several of them at the same time can make it difficult to package models for production. With MLflow Models you get a unified interface to interact with and serve models, which ultimately speeds up the management of the machine learning lifecycle.

An MLflow model is essentially a folder of artifacts that contains the information needed to execute a model, alongside an MLmodel file. When saving a model you save an MLmodel file, the model itself and an environment that describes the dependencies used to run the model.

An example of an MLmodel file is shown below:

time_created: 2018-05-25T17:28:53.35

flavors:
  sklearn:
    sklearn_version: 0.19.1
    pickled_model: model.pkl
  python_function:
    loader_module: mlflow.sklearn


To create a model you just need to use the mlflow.<flavor>.log_model method, which logs all the necessary artifacts.

MLflow provides different log functions for PyTorch models, TensorFlow models, scikit-learn models, etc. You can also define your own custom models using the pyfunc flavor. A flavor is a convention used by deployment tools for deploying models and is one of the main strengths of MLflow Models, as it allows you to standardise the interface of different types of models. For more information go to https://www.mlflow.org/docs/latest/models.html.
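
The idea of standardising the interface can be sketched without MLflow at all. The snippet below is a conceptual illustration only, not the pyfunc API: `Predictor`, `CallableAdapter`, `ThresholdModel` and `mean_model` are hypothetical names standing in for models from two different frameworks that get wrapped behind one predict() method.

```python
from typing import Callable, List, Protocol, Sequence


class Predictor(Protocol):
    """The single interface that serving code relies on."""
    def predict(self, rows: Sequence[Sequence[float]]) -> List[float]: ...


class ThresholdModel:
    """Stand-in for a model from one framework, with its own native method name."""
    def classify(self, row: Sequence[float]) -> int:
        return int(sum(row) > 1.0)


def mean_model(row: Sequence[float]) -> float:
    """Stand-in for a model from another framework: a plain function."""
    return sum(row) / len(row)


class CallableAdapter:
    """Wraps any per-row function behind the common predict() interface."""
    def __init__(self, row_fn: Callable[[Sequence[float]], float]) -> None:
        self._row_fn = row_fn

    def predict(self, rows: Sequence[Sequence[float]]) -> List[float]:
        return [float(self._row_fn(row)) for row in rows]


def serve(model: Predictor, rows: Sequence[Sequence[float]]) -> List[float]:
    # The serving code never needs to know which framework produced the model.
    return model.predict(rows)


models = [CallableAdapter(ThresholdModel().classify), CallableAdapter(mean_model)]
batch = [[0.25, 0.25], [1.0, 0.5]]
outputs = [serve(model, batch) for model in models]
```

A flavor plays a similar role: deployment tools only need to understand the convention, not every framework behind it.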

Another very important concept is the MLflow Model Registry.

Model Registry

In the tracking interface you'll see the Models tab, where you can manage the model registry. When you log a model it's accessible inside the run in the artifacts area; however, it's useful to version your models independently from the code. The model registry allows you to version your models and manage their lifecycle.

You can use either the UI or the API to register models. The steps in a model lifecycle are:

  • Registering → adds the model to the model registry or adds a new version

  • Transitioning → moves a version between stages
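
The two lifecycle steps can be pictured with a toy in-memory registry. This is a sketch under stated assumptions rather than MLflow's code: the `Registry` and `ModelVersion` classes and the model name iris_classifier are hypothetical, and the stage names are the ones the Model Registry used when this article was written.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage names assumed from the Model Registry as it existed when the article was written.
STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    version: int
    source: str  # where the model artifacts live, e.g. a runs:/ URI
    stage: str = "None"


@dataclass
class Registry:
    """Toy registry: each model name maps to an ordered list of versions."""
    models: Dict[str, List[ModelVersion]] = field(default_factory=dict)

    def register(self, name: str, source: str) -> ModelVersion:
        # The first registration creates the model; later ones add a new version.
        versions = self.models.setdefault(name, [])
        model_version = ModelVersion(version=len(versions) + 1, source=source)
        versions.append(model_version)
        return model_version

    def transition(self, name: str, version: int, stage: str) -> ModelVersion:
        if stage not in STAGES:
            raise ValueError(f"Unknown stage: {stage}")
        model_version = self.models[name][version - 1]
        model_version.stage = stage
        return model_version


registry = Registry()
v1 = registry.register("iris_classifier", "runs:/<run_id>/logistic_regression")
v2 = registry.register("iris_classifier", "runs:/<other_run_id>/logistic_regression")
registry.transition("iris_classifier", 1, "Archived")
registry.transition("iris_classifier", 2, "Production")
```

Registering the same name twice yields versions 1 and 2, and each version carries its own stage, independently of the code version that produced it.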

You can see the interface for the model registry below:

A hands-on example

In this example we'll develop a setup that allows you to track your experiments and persist your tracking information in the cloud. This setup is fully usable if you want to run your machine learning pipelines.

The tracking server

First we'll prepare the tracking server. It's useful to set up the server inside a Docker container, as this makes it much easier to run on different systems. You'll find below the Dockerfile for the tracking server:

FROM python:3.7-slim-buster

# Install python packages
RUN pip install mlflow google-cloud-storage psycopg2-binary

ADD app/ /app

WORKDIR /app

CMD ./run.sh


Here, we install google-cloud-storage because we need it for our artifact store, and psycopg2-binary for our backend store (Postgres in Google Cloud SQL). First build the image with docker build -t mlflow-image . and then run the server with docker run -it --rm -p 5000:5000 mlflow-image.

You're now able to run a tracking server on port 5000. You might have noticed that the command used to run it is in fact ./run.sh; that's because we've also created an app folder which contains this script with the run command. This file should contain the following:

mlflow server \
  -h 0.0.0.0 \
  --backend-store-uri postgresql+psycopg2://$MLFLOW__DATABASE_USER:$MLFLOW__DATABASE_PASSWORD@$MLFLOW__DATABASE_IP/$MLFLOW__DATABASE_NAME \
  --default-artifact-root $MLFLOW__ARTIFACTS_BUCKET


The values prefixed with $ should be replaced with the details of your artifact store and your backend store. You can find more information on how to build the connection URL for the backend store at https://docs.sqlalchemy.org/en/14/core/engines.html.
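
If the database password contains characters such as @ or /, it has to be percent-encoded before it goes into the connection URL. The sketch below shows one way to assemble the URL with the standard library. The `backend_store_uri` helper is hypothetical, MLFLOW__DATABASE_PASSWORD is an assumed variable name, and the fallback values are placeholders.

```python
import os
from urllib.parse import quote_plus


def backend_store_uri(user: str, password: str, host: str, database: str) -> str:
    """Build a SQLAlchemy URL for PostgreSQL with the psycopg2 driver."""
    # Percent-encode the credentials so characters such as "@" or "/" cannot break the URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# MLFLOW__DATABASE_PASSWORD is an assumed variable name; the fallbacks are placeholders.
uri = backend_store_uri(
    user=os.environ.get("MLFLOW__DATABASE_USER", "mlflow"),
    password=os.environ.get("MLFLOW__DATABASE_PASSWORD", "p@ss/word"),
    host=os.environ.get("MLFLOW__DATABASE_IP", "10.0.0.5"),
    database=os.environ.get("MLFLOW__DATABASE_NAME", "mlflow_db"),
)
print(uri)
```

The resulting string is what you would pass to --backend-store-uri.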

The final file structure should be:

app/
  run.sh
Dockerfile


At the end of this step, you should see the tracking UI when you open http://localhost:5000 in your browser.

Creating a simple project

Now we'll create a simple project that trains a simple logistic regression model. The structure for the project will be:

test_project/
   conda_env.yaml
   create_experiment.py
   MLproject
   test.py
   train.py


We start with the conda_env.yaml file:

# Conda environment used for mlflow
name: test
dependencies:
  - python=3.7
  - scikit-learn
  - pip
  - pip:
    - mlflow
    - google-cloud-storage


This file contains the dependencies needed to run the project. We need mlflow, but also google-cloud-storage to store artifacts and scikit-learn, which will be used to train the model.

Next comes the create_experiment.py file, which is used to create a new experiment via the API. The first command you run should be python create_experiment.py.

import mlflow


def main():
    mlflow.set_tracking_uri('http://localhost:5000')
    # Second argument: the artifact location shared by all runs of this experiment
    mlflow.create_experiment('test_experiment', 'gs://path/to/bucket')


if __name__ == "__main__":
    main()


Now the train.py script. It trains a simple logistic regression on the iris dataset and logs a parameter, three values of a metric, an artifact and, at the end, the trained model.

import os
from random import random, randint

import mlflow
import mlflow.sklearn
from mlflow import log_metric, log_param, log_artifacts
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def main():
    # Logging parameter
    log_param("param1", randint(0, 100))

    # Logging metric
    log_metric("foo", random())
    log_metric("foo", random() + 1)
    log_metric("foo", random() + 2)

    # Logging artifacts
    os.makedirs("outputs", exist_ok=True)
    with open("outputs/test.txt", "w") as f:
        f.write("hello world!")

    log_artifacts("outputs", "mlruns")

    # Logging models
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(random_state=0).fit(X, y)
    mlflow.sklearn.log_model(clf, "logistic_regression")

    mlflow.end_run()


if __name__ == "__main__":
    main()


The test.py script is shown below. It takes one parameter as input, a run ID, which it uses to fetch the model from the given run; it then uses this model to print some predictions.

import click
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris


@click.command()
@click.option('--run_id', required=True)
def main(run_id):
    # Load the model logged by the given run as a scikit-learn model
    mlflow.set_tracking_uri('http://localhost:5000')
    logged_model = f'runs:/{run_id}/logistic_regression'
    loaded_model = mlflow.sklearn.load_model(logged_model)

    X, y = load_iris(return_X_y=True)
    print('===== Predict probas ======')
    print(loaded_model.predict_proba(X[:2, :]))
    print('===========================')


if __name__ == "__main__":
    main()
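
The model URI in test.py follows the pattern runs:/<run_id>/<artifact_path>, where the artifact path must match the name that train.py passed to log_model. The helpers below, `make_runs_uri` and `parse_runs_uri`, are hypothetical and only illustrate that format; the run ID is made up.

```python
from typing import NamedTuple


class RunsUri(NamedTuple):
    run_id: str
    artifact_path: str


def make_runs_uri(run_id: str, artifact_path: str) -> str:
    """Build the URI of an artifact logged under a run."""
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def parse_runs_uri(uri: str) -> RunsUri:
    """Split a runs:/ URI back into its run ID and artifact path."""
    prefix = "runs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a runs:/ URI: {uri}")
    run_id, _, artifact_path = uri[len(prefix):].partition("/")
    if not run_id or not artifact_path:
        raise ValueError(f"Expected runs:/<run_id>/<artifact_path>, got: {uri}")
    return RunsUri(run_id, artifact_path)


# "abc123" is a made-up run ID; "logistic_regression" is the name train.py passes to log_model.
uri = make_runs_uri("abc123", "logistic_regression")
parsed = parse_runs_uri(uri)
```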


Finally we have the MLproject file. Here we define the entry points and the parameter for the second script, as well as the environment file.

name: test_project

conda_env: conda_env.yaml

entry_points:
  train:
    command: "python3 train.py"
  test:
    parameters:
      run_id: string
    command: "python3 test.py --run_id {run_id}"


We set the environment to the one defined in the conda environment file and define two entry points, one for training and one for testing. Now, to run the training we just have to run:

MLFLOW_TRACKING_URI=http://localhost:5000 MLFLOW_EXPERIMENT_NAME=test_experiment GOOGLE_APPLICATION_CREDENTIALS=~/path/to/gcloud-key-file.json mlflow run . -e train


And for testing we do:

MLFLOW_TRACKING_URI=http://localhost:5000 MLFLOW_EXPERIMENT_NAME=test_experiment GOOGLE_APPLICATION_CREDENTIALS=~/path/to/gcloud-key-file.json mlflow run . -e test -P run_id=<run_id>


Note the three environment variables in these commands. When you use a tracking server, you have to pass its URI to mlflow run as an environment variable. You also need to provide credentials if you log artifacts or models to a bucket, and finally you need to specify the experiment's name.
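
If you prefer to launch these commands from Python rather than from a shell, the same three variables can be passed through the process environment. The following is a minimal sketch: the `mlflow_run_invocation` helper is hypothetical and only assembles the arguments and environment, leaving the actual call to subprocess as a comment.

```python
import os
import shlex
from typing import Dict, List, Optional, Tuple


def mlflow_run_invocation(
    entry_point: str,
    tracking_uri: str,
    experiment_name: str,
    credentials_path: str,
    params: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the argument list and environment for an `mlflow run` call."""
    command = ["mlflow", "run", ".", "-e", entry_point]
    for key, value in (params or {}).items():
        command += ["-P", f"{key}={value}"]  # one -P flag per entry point parameter

    env = dict(os.environ)  # keep PATH and friends so the mlflow executable is found
    env.update(
        MLFLOW_TRACKING_URI=tracking_uri,
        MLFLOW_EXPERIMENT_NAME=experiment_name,
        # Expand "~" ourselves, since no shell is involved when using subprocess.
        GOOGLE_APPLICATION_CREDENTIALS=os.path.expanduser(credentials_path),
    )
    return command, env


command, env = mlflow_run_invocation(
    entry_point="test",
    tracking_uri="http://localhost:5000",
    experiment_name="test_experiment",
    credentials_path="~/path/to/gcloud-key-file.json",
    params={"run_id": "<run_id>"},
)
print(shlex.join(command))
# To launch it for real: subprocess.run(command, env=env, check=True)
```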

Triggering a run of a project automatically creates a run on the tracking server, and any parameters passed to the entry point are logged with it. You'll see this after running the test entry point, which logs its run_id parameter.

After running both entry points, you'll be able to see a train run and a test run. For the train run you'll see the model, and you'll also be able to register it.
