Model Deployment with KServe
Deploy your trained model from Kubeflow Pipelines to production using KServe with MinIO integration
What You'll Learn
Take a model trained with Kubeflow Pipelines, stored as an artifact in MinIO, and deploy it as a production-ready KServe InferenceService that serves real-time predictions with autoscaling.
Model Packaging
Package your model with KServe Python server
MinIO Integration
Connect to Kubeflow artifacts stored in MinIO
Auto-scaling
Production-ready with automatic scaling
Step 1: Prerequisites
Ensure you have the required components and verify your MinIO service is running.
Prerequisites Checklist
Required Components:
- A working Kubernetes cluster
- Kubeflow Pipelines 2.x (installed in the kubeflow namespace)
- A trained model saved in the mlpipeline MinIO bucket
- Docker
Verify MinIO Service:
kubectl get svc minio -n kubeflow
You should see:
minio ClusterIP 10.xx.xx.xx 9000/TCP ...
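Depending on your Kubeflow Pipelines installation, the MinIO service may be named minio or minio-service; note the name and port you see here, since the S3 endpoint annotation in Step 3 must match it. If you also want to confirm that your trained model artifact exists in the mlpipeline bucket, one option is a quick check with the MinIO client (mc) over a port-forward. This is an optional sketch that assumes the default Kubeflow MinIO credentials (minio / minio123) and that mc is installed locally:
# Forward the MinIO API port to your machine
kubectl port-forward -n kubeflow svc/minio 9000:9000
# In another terminal: register the endpoint and list your pipeline artifacts
mc alias set kf http://localhost:9000 minio minio123
mc ls -r kf/mlpipeline/v2/artifacts/ | head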
Step 2: Install KServe
Install the KServe controller, which handles model deployments and autoscaling, along with Knative Serving and Istio, which KServe's serverless deployment mode builds on.
Install KServe Controller
# Install Knative Serving and the Knative Istio integration (net-istio)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-core.yaml
kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/net-istio.yaml
kubectl --namespace istio-system get service istio-ingressgateway
# Enable Istio sidecar injection for the knative-serving namespace
kubectl label namespace knative-serving istio-injection=enabled
# Install KServe
KSERVE_VERSION=v0.15.0
kubectl apply --server-side -f "https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve.yaml"
kubectl apply --server-side -f "https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve-cluster-resources.yaml"
# Wait for KServe to be ready
kubectl wait --for=condition=ready pod -l control-plane=kserve-controller-manager -n kserve --timeout=300s
What KServe Provides:
- Model deployment and management
- Automatic scaling based on traffic
- Multiple model format support (scikit-learn, PyTorch, TensorFlow)
- Production-ready inference endpoints
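Beyond waiting for the controller pod, a quick way to sanity-check the installation is to confirm that the KServe CRDs and the cluster-wide serving runtimes (installed by kserve-cluster-resources.yaml) are present. A minimal check, assuming the default installation above:
# The CRDs should include inferenceservices.serving.kserve.io
kubectl get crd | grep serving.kserve.io
# Cluster-wide model serving runtimes
kubectl get clusterservingruntimes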
Step 3: Setup MinIO Access
Configure a service account and credentials so KServe can access your model artifacts in MinIO.
MinIO Credentials
apiVersion: v1
kind: Secret
metadata:
  name: minio-creds
  namespace: inference
  annotations:
    serving.kserve.io/s3-endpoint: "minio-service.kubeflow:9000" # Replace with your MinIO service endpoint
    serving.kserve.io/s3-usehttps: "0" # Set to "0" for insecure connections, "1" for HTTPS
    serving.kserve.io/s3-region: "us-east-1" # Can be any value for MinIO
stringData:
  AWS_ACCESS_KEY_ID: "minio"
  AWS_SECRET_ACCESS_KEY: "minio123"
type: Opaque
Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-sa
  namespace: inference
secrets:
- name: minio-creds
Create Resources
Save the two manifests above as minio-creds-secret.yaml and kserve-sa.yaml, then create the resources:
kubectl create namespace inference
kubectl apply -f minio-creds-secret.yaml
kubectl apply -f kserve-sa.yaml
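Before moving on, it is worth verifying that both resources landed in the inference namespace and that the service account references the secret:
kubectl get secret minio-creds -n inference
kubectl get serviceaccount kserve-sa -n inference -o jsonpath='{.secrets[*].name}'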
Step 4: Package Your Model
Create a Python server that loads your model from MinIO and serves predictions via KServe.
Model Server Code
Create taxiprice_server.py:
import logging
import sys
import time
import traceback

import joblib
import numpy as np
from prometheus_client import Counter, Histogram, Gauge, start_http_server
from kserve import Model, ModelServer

logging.basicConfig(level=logging.INFO)

# Custom Prometheus metrics exposed by the model server
PREDICTION_COUNTER = Counter('model_predictions_total', 'Total predictions')
PREDICTION_ERRORS = Counter('model_prediction_errors_total', 'Total errors')
PREDICTION_LATENCY = Histogram('model_prediction_latency_seconds', 'Prediction latency')
MODEL_ACCURACY = Gauge('model_accuracy', 'Model accuracy')


class TaxiModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        # KServe's storage initializer downloads the STORAGE_URI artifact into /mnt/models
        self.model = joblib.load('/mnt/models/model')
        start_http_server(8000)  # Expose the Prometheus metrics endpoint on port 8000
        MODEL_ACCURACY.set(0.93)  # Current model accuracy baseline
        self.ready = True

    def predict(self, request: dict, headers: dict = None) -> dict:
        logging.info(f"Received request: {request}")
        start = time.time()
        try:
            instance = request["instances"][0]
            start_location = instance["start_location"]
            end_location = instance["end_location"]
            X = np.array([[start_location, end_location]])
            price = self.model.predict(X)[0]
            PREDICTION_COUNTER.inc()
            PREDICTION_LATENCY.observe(time.time() - start)
            return {"predictions": [{"predicted_price": float(price)}]}
        except Exception:
            PREDICTION_ERRORS.inc()
            raise


if __name__ == "__main__":
    logging.info("Starting model server")
    model = TaxiModel("taxi-model")
    try:
        ModelServer().start([model])
    except Exception:
        # Log the full stack trace if startup or the readiness check fails,
        # then exit with a non-zero status so the pod is restarted.
        logging.error("Fatal error during model server startup")
        traceback.print_exc()
        sys.exit(1)
Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
RUN pip install kserve joblib scikit-learn numpy pandas prometheus-client
# Copy model server
COPY taxiprice_server.py .
# Expose the serving port (8080) and the Prometheus metrics port (8000)
EXPOSE 8080 8000
# Run the server
CMD ["python", "-u", "taxiprice_server.py"]
Build and Push Docker Image
# Build Docker image
docker build -t <your-registry>/taxi-price-model:latest .
# Push to registry
docker push <your-registry>/taxi-price-model:latest
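Before handing the image to KServe, you can smoke-test it locally. In the cluster, the storage initializer downloads the artifact referenced by STORAGE_URI into /mnt/models; locally you can imitate that by mounting a directory that contains a copy of the trained model file saved as model. A rough sketch (the local ./models directory and model copy are assumptions):
# Run the server locally with a local copy of the model mounted at /mnt/models/model
docker run --rm -p 8080:8080 -v "$(pwd)/models:/mnt/models" <your-registry>/taxi-price-model:latest
# In another terminal, send a test request against the local container
curl -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:8080/v1/models/taxi-model:predict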
Step 5: Deploy with KServe
Create KServe InferenceService that connects to your model artifacts in MinIO.
KServe Deployment YAML
Update the artifact path (STORAGE_URI) and the Docker image (the image you built and pushed in Step 4).
Save this as taxi_price_service.yaml:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "taxi-service-1"
namespace: inference
annotations:
serving.kserve.io/enable-prometheus-scraping: "true"
prometheus.io/scrape-protocol: "http"
spec:
predictor:
serviceAccountName: kserve-sa
containers:
- name: kserve-container # Name must be 'kserve-container' for automatic injection of env var/sidecar
image: tugamidi/taxi-model:5
ports:
- containerPort: 8080
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1
memory: 1Gi
env:
# Specify the storage URI here
- name: STORAGE_URI
value: "s3://mlpipeline/v2/artifacts/chicago-taxi-fare-pipeline/99d4b75c-ffab-485e-9c07-7de5b536ee13/train-fare-model/e6363970-33ac-47e6-80d9-7f778f3bc05c/model"
Replace placeholders:
- <your-registry> with your Docker registry
- The STORAGE_URI value with the path to your own model artifact (the pipeline run and task IDs will differ)
Find your run ID: kubectl get pipelines.pipelines.kubeflow.org -n kubeflow -o yaml (the run ID also appears on the run details page of the Kubeflow Pipelines UI)
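If you set up the mc alias in Step 1, you can also browse the artifact tree directly to locate the exact model path for STORAGE_URI. A sketch, assuming the kf alias from Step 1 and this tutorial's pipeline and task names:
mc ls -r kf/mlpipeline/v2/artifacts/chicago-taxi-fare-pipeline/ | grep train-fare-model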
Deploy Your Model
# Deploy the model
kubectl apply -f taxi_price_service.yaml
# Check deployment status
kubectl get inferenceservices -n inference
# Check readiness
kubectl get inferenceservice taxi-service-1 -n inference -o jsonpath='{.status.conditions[*].status}'
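If the InferenceService does not become ready, the most useful places to look are its events and the predictor pod's containers; the storage-initializer init container in particular will surface MinIO credential or path problems. For example:
# Inspect conditions and events
kubectl describe inferenceservice taxi-service-1 -n inference
# Logs from the model server container
kubectl logs -n inference -l serving.kserve.io/inferenceservice=taxi-service-1 -c kserve-container
# Logs from the init container that downloads the model from MinIO
kubectl logs <pod-name> -n inference -c storage-initializer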
Step 6: Test Prediction API
Test your deployed model with real prediction requests.
Get Model URL
# Get the predictor pod name
kubectl get pods -n inference
# Port-forward the pod so the model is reachable locally (the container listens on 8080)
kubectl port-forward pod/<pod-name> 3000:8080 -n inference
Send Test Request
curl -v -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:3000/v1/models/taxi-model:predict
Expected Output:
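The response body follows the shape returned by the predict method in taxiprice_server.py; the price value below is only illustrative and depends on your trained model:
{"predictions": [{"predicted_price": 12.34}]}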
Step 7: Monitor & Auto-scaling
Monitor your model performance and verify automatic scaling capabilities.
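Two quick checks are worth doing here. The custom Prometheus metrics defined in taxiprice_server.py are served on port 8000 of the predictor pod, and because KServe runs the predictor as a Knative service, sustained traffic should cause extra pods to appear and then scale back down once the load stops. A rough sketch, assuming the port-forward from Step 6 is still running and the hey load generator is installed:
# Scrape the custom model metrics (model_predictions_total, model_prediction_latency_seconds, ...)
kubectl port-forward pod/<pod-name> 8000:8000 -n inference
# In another terminal
curl http://localhost:8000/metrics | grep model_
# Generate sustained load and watch the predictor pods scale
hey -z 60s -c 20 -m POST -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:3000/v1/models/taxi-model:predict
kubectl get pods -n inference -w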
Model Successfully Deployed!
Your model is now running in production with auto-scaling capabilities. Let's learn about monitoring and CI/CD for ML systems.