What Exactly Is a Model in MLflow?
When starting to learn about machine learning and data science, most of us ML developers envision the learned parameters or weights from the training process as the model. We learn that a machine learning model is a mathematical construct trained on data to make predictions or classifications. Which is correct (well… I mean partially correct)! In hands-on courses, you've likely saved models as binary files using packages like pickle, especially if you're a Pythonista. This single file holds the parameters learned during training, ready for predictions on new data. Which is totally fine!
Beyond Just “the Model”
However, as we dive into the challenges (and excitement 😉) of putting that model into production (a.k.a. MLOps 😎), we start realizing that the concept of a "model" can be much broader!
In this post, our focus is within the context of MLflow, where the model encompasses a broader set of components (we will refer to them as elements) optimized for production. By broader components, I mean that the model in MLflow extends beyond just the saved weights to include environments, metadata, and configurations.
So now, the model is a comprehensive package or bundle that encapsulates everything needed to reproduce predictions reliably in various environments! That is what I am doing here… breaking this bundle down into "elements".
Initially, understanding these elements and their structure can be challenging (it was for me, at least!), but with some exploration, it will all start to make sense. I thought it might be beneficial to share what I've learned about these components in this post.
Element 1: The Heart of It All — Model Binary!
This is the central piece: the actual saved model weights or parameters. As mentioned earlier, it's what many think of as "the model." These weights represent the knowledge your model has gained during training, allowing it to make predictions. They are commonly stored in binary format using packages like pickle or joblib, or through framework-specific methods like torch.save() in PyTorch. We will see later which part of your code asks MLflow to do this.
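For instance, in PyTorch the common pattern is a single call that writes the learned parameters to disk (a minimal sketch, assuming a trained model object):

import torch

# Persist only the learned parameters (the state dict), not the whole object
torch.save(model.state_dict(), "model.pt")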
Element 2: Additional Files — Auxiliary Binaries!
Alongside the main model binary file, some models need a few extra helpers to work, which are stored as additional auxiliary binaries. Think of this as anything beyond your main model weights that could be important or useful.
One common example is scalers for preprocessing, used to normalize or standardize data before it reaches the model. If the model expects normalized data, you should save this scaler as a binary file.
Another example is tokenizers in NLP. Basically, a tokenizer's task is to convert words into numerical values, or tokens, which the model can then process. Without one, text data would remain as raw characters, making it hard for the model to work with.
One more could be when working with clustering algorithms: you might need to save, for example, your k-means centroids to keep track of how data points were grouped during training and use that for inference.
A side note: without MLflow, to save or store these two elements (the main model binary and the auxiliary binaries), we can use something like the following code (with pickle):
# Save model and scaler (example of auxiliary binary file) without MLflow
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
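And at inference time, you would load both binaries back yourself (a minimal sketch; new_data is a placeholder for whatever raw input arrives):

# Load the model and scaler back without MLflow
import pickle

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# new_data stands in for incoming raw features
predictions = model.predict(scaler.transform(new_data))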
Element 3: Pre-loaded Code — Custom Logic!
In addition to the binary files, in some cases, models also require custom code to be loaded in the inference environment. Three obvious examples are (if these aren't obvious to you, don't worry 😅):
- Preprocessing: Before the model makes predictions, the data often needs to be transformed. For example, you might handle missing values or scale features using the auxiliary file saved in Element 2. In NLP models, custom tokenization scripts (also saved as auxiliary files) are used to prepare the text data. Remember, we are talking code here 😅; Elements 1 and 2 were objects.
- Postprocessing: After the model predicts, the raw output may need adjustments. For instance, in image classification models, you might map predicted indices to class names using dictionaries (the model outputs numbers, not labels like cats or dogs 😅). You could also apply confidence thresholds to filter predictions’ probabilities.
- Custom Logic: This includes calculations or business rules. Sometimes, the raw predictions aren’t exactly what you’re after! For example, in a financial model, predictions might be adjusted based on recent market trends by incorporating real-time data from APIs. By pulling in current market prices or news sentiment, the model can refine its predictions to better reflect the present market conditions.
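To make this concrete, below is a minimal sketch of how such custom logic can travel with the model, using MLflow's mlflow.pyfunc.PythonModel wrapper. The class name IrisWrapper and the label mapping are my own illustrative choices, not anything MLflow prescribes:

import pickle
import mlflow.pyfunc

class IrisWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps names to local file paths of logged binaries
        with open(context.artifacts["model"], "rb") as f:
            self.model = pickle.load(f)
        with open(context.artifacts["scaler"], "rb") as f:
            self.scaler = pickle.load(f)

    def predict(self, context, model_input):
        # Preprocessing: scale raw features with the saved scaler (Element 2)
        scaled = self.scaler.transform(model_input)
        # Prediction with the main model binary (Element 1)
        preds = self.model.predict(scaled)
        # Postprocessing: map numeric indices to human-readable labels
        class_names = {0: "setosa", 1: "versicolor", 2: "virginica"}
        return [class_names[int(p)] for p in preds]

Logging it with mlflow.pyfunc.log_model(artifact_path="model", python_model=IrisWrapper(), artifacts={"model": "model.pkl", "scaler": "scaler.pkl"}) ships the custom code and the binaries together, so the inference environment gets everything in one bundle.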
Element 4: Dependencies — Building the Right Environment!
Now we have the model binaries and the necessary custom code (Elements 1, 2, and 3 🎉), but we are still missing a crucial element! That is the environment where this model will operate! This includes specific library versions and dependencies essential for ensuring that the model functions as expected.
For instance, running inference with an older TensorFlow version (v2.3) when you trained with a newer one (v2.8) can lead to compatibility issues due to changes in the API.
Without matching the environment, even the best model and code can fail to execute properly. That’s where library dependencies come into play, so everything runs smoothly and consistently.
Of course, you don't need the libraries' code itself to be saved 🤪, just the names and specific versions. These dependencies are typically saved in a requirements.txt file or a conda.yaml file, listing all the necessary packages and their versions. The aim is to allow you to recreate the exact environment later during inference by installing these packages.
## example of requirements.txt
tensorflow==2.3.0
pandas==1.1.5
numpy==1.18.5
scikit-learn==0.24.2

## example of conda.yaml
name: my_environment
dependencies:
  - tensorflow=2.3.0
  - pandas=1.1.5
  - numpy=1.18.5
  - scikit-learn=0.24.2
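Recreating the environment later is then a one-liner with either file:

> pip install -r requirements.txt

or, with conda:

> conda env create -f conda.yaml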
Element 5: Metadata — The Model’s Story!
In a professional ML environment, you may be required to follow strict organizational policies on model governance and auditing. This may involve keeping detailed records of metadata, which ensures the model's history is transparent and makes it easier to reproduce results, audit processes, and maintain accountability throughout the model's lifecycle.
Metadata includes crucial information about the model’s lineage, such as who trained the model, with what code, when, and where.
Yes, metadata is important! It provides a clear trail for auditing, essential for compliance in regulated industries. It supports model governance by tracking versions, performance metrics, and deployment status. It also aids in debugging by offering context about the training conditions, making troubleshooting more efficient!
Examples of vital metadata could be:
- Training Data Source: To identify the dataset used.
- Model Version: To track changes and updates.
- Training Code: To link to the specific scripts or notebooks used (I love how MLflow makes this natural!).
- Hyperparameters: Mainly listing parameters like learning rate, batch size, etc.
You may write your own code to save this metadata as a JSON file, for example:
{
  "training_data_source": "path/to/dataset.csv",
  "model_version": "1.0",
  "training_code": "train_model.py",
  "hyperparameters": {
    "learning_rate": 0.01,
    "batch_size": 32
  },
  "performance_metrics": {
    "accuracy": 0.95,
    "loss": 0.05
  }
}
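For comparison, MLflow can record the same information with a couple of calls during a run; a minimal sketch (the values mirror the JSON above):

import mlflow

with mlflow.start_run():
    # Hyperparameters and metrics are first-class citizens in MLflow
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.05)
    # Anything else can be stored as a JSON artifact
    mlflow.log_dict({"training_data_source": "path/to/dataset.csv"}, "metadata.json")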
But before seeing the full MLflow code, it's worth mentioning that MLflow stores two other elements that are somewhat specific to the MLflow ecosystem (I'm not sure I've encountered them outside MLflow). These are:
Element 6: Signature — Your Model’s Blueprint!
This “signature” defines the expected input and output schema of your model, acting as a blueprint that ensures consistency across different environments.
This signature helps validate that the data fed into the model during inference is correctly formatted, preventing errors and making integration smoother. We will see it in action in the coding section later.
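For a quick taste, here is a minimal sketch with infer_signature, which builds the schema from sample data (the printed reprs in the comments are approximate and depend on your MLflow version and input types):

from mlflow.models.signature import infer_signature

# Infer input/output schema from training data and the model's predictions
signature = infer_signature(X_train, model.predict(X_train))
print(signature.inputs)   # e.g., [Tensor('float64', (-1, 4))] for four numeric features
print(signature.outputs)  # e.g., [Tensor('int64', (-1,))]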
Element 7: Input Example — A Glimpse of What’s to Come!
Here we are kind of trying to give a sneak peek into what the model should expect. This is a real data sample!
It is used for validating the model after deployment, ensuring it functions correctly with the expected data format.
I see it as a reference point; it helps me understand the kind of input the model requires. Yes, the signature might be sufficient, but providing this example is even more helpful! And yes, it is optional!
The MLflow Way: All Your Model Elements Under One Roof!
Now we can let the magic of MLflow handle all these tasks with very simple and neat code!
We’ll use the popular Iris dataset for classification, focusing on a basic model to avoid any distracting complications.
It is impressive how MLflow manages all model elements seamlessly. Here's how MLflow helps us: I will show the main code first, then break it down to show how these elements are served and how they are stored.
The code all together:
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from mlflow.models.signature import infer_signature
# Load Iris dataset
data = load_iris()
X, y = data.data, data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_scaled, y_train)
# Create an input example and infer signature
input_example = pd.DataFrame(X_test_scaled[:1], columns=data.feature_names)
signature = infer_signature(X_test_scaled, model.predict(X_test_scaled))
# Start a new MLflow run
with mlflow.start_run():
    # Log the RandomForest model with the sklearn flavor
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        signature=signature,
        input_example=input_example
    )
    # Log the scaler as an artifact
    mlflow.sklearn.log_model(scaler, artifact_path="scaler")
    # Log parameters and metrics
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", model.score(X_test_scaled, y_test))
Code Breakdown (by element):
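- Element 1 (model binary): mlflow.sklearn.log_model(sk_model=model, ...) serializes the trained RandomForest and stores it under the "model" artifact path.
- Element 2 (auxiliary binaries): the fitted scaler is logged separately under artifact_path="scaler".
- Element 3 (custom code): not needed in this simple example, but it could be bundled with a pyfunc wrapper like the sketch shown earlier.
- Element 4 (dependencies): log_model automatically generates the environment files (conda.yaml and requirements.txt, plus python_env.yaml in recent versions) alongside the model binary.
- Element 5 (metadata): mlflow.log_param() and mlflow.log_metric() record the hyperparameters and performance, and the run itself tracks when and where everything happened.
- Elements 6 and 7 (signature and input example): passed straight into log_model via the signature and input_example arguments.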
Elements on MLflow UI:
If you copy this code and run it locally on your machine, you'll notice a folder named mlruns created in your working directory. Of course, you can explore it in your coding IDE (I'm using VSCode), but the MLflow UI makes that much easier. On your command line, run the following and open the localhost link:
> mlflow ui
Then you will see these elements (the elements we discussed) laid out in the run's artifacts.
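As a rough map (the exact layout varies a bit across MLflow versions), the stored files look something like this:

mlruns/
└── <experiment_id>/
    └── <run_id>/
        └── artifacts/
            ├── model/
            │   ├── MLmodel              <- flavor info, signature, metadata (Elements 5 & 6)
            │   ├── model.pkl            <- the model binary (Element 1)
            │   ├── conda.yaml           <- dependencies (Element 4)
            │   ├── requirements.txt     <- dependencies (Element 4)
            │   └── input_example.json   <- the input example (Element 7)
            └── scaler/                  <- the auxiliary binary (Element 2)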
Conclusion and Further Readings:
In this post, I tried to explore with you how MLflow expands the traditional concept of a model from just a binary file to include vital components for inference like environments, configurations, and metadata.
These elements ensure that our models are not only functional but also optimized for production, making them robust and reproducible in diverse environments. By leveraging the full capabilities of MLflow, we can transform our machine learning projects into scalable, production-ready solutions.
Lastly, I've linked our MLflow blog here for great reads, so you can stay up to date with MLflow's new features!