Model Deployment with KServe
Deploy your trained model from Kubeflow Pipelines to production using KServe with MinIO integration
What You'll Learn
Take a model trained with Kubeflow Pipelines, stored as an artifact in MinIO, and deploy it as a production-ready KServe InferenceService that serves real-time predictions with autoscaling.
Model Packaging
Package your model with KServe Python server
MinIO Integration
Connect to Kubeflow artifacts stored in MinIO
Auto-scaling
Production-ready with automatic scaling
Step 1: Prerequisites
Ensure you have the required components and verify your MinIO service is running.
Prerequisites Checklist
Required Components:
- A working Kubernetes cluster
- Kubeflow Pipelines 2.x (installed in the kubeflow namespace)
- A trained model saved in the mlpipeline MinIO bucket
- Docker
Verify MinIO Service:
kubectl get svc minio -n kubeflow
You should see:
minio ClusterIP 10.xx.xx.xx 9000/TCP ...
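Depending on your Kubeflow Pipelines installation, the MinIO service may be named minio or minio-service; note the name and port you see here, since the S3 endpoint annotation in Step 3 must match it. If you also want to confirm that your trained model artifact exists in the mlpipeline bucket, one option is a quick check with the MinIO client (mc) over a port-forward. This is an optional sketch that assumes the default Kubeflow MinIO credentials (minio / minio123) and that mc is installed locally:
# Forward the MinIO API port to your machine
kubectl port-forward -n kubeflow svc/minio 9000:9000
# In another terminal: register the endpoint and list your pipeline artifacts
mc alias set kf http://localhost:9000 minio minio123
mc ls -r kf/mlpipeline/v2/artifacts/ | head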
Step 2: Install KServe
Install the KServe controller, which handles model deployments and autoscaling, along with Knative Serving and Istio, which KServe's serverless deployment mode builds on.
Install KServe Controller
# Install Knative Serving and the Knative Istio integration (net-istio)
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.20.0/serving-core.yaml
kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.20.1/net-istio.yaml
kubectl --namespace istio-system get service istio-ingressgateway
# Enable Istio sidecar injection for the knative-serving namespace
kubectl label namespace knative-serving istio-injection=enabled
# Install KServe
KSERVE_VERSION=v0.15.0
kubectl apply --server-side -f "https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve.yaml"
kubectl apply --server-side -f "https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve-cluster-resources.yaml"
# Wait for KServe to be ready
kubectl wait --for=condition=ready pod -l control-plane=kserve-controller-manager -n kserve --timeout=300s
What KServe Provides:
- Model deployment and management
- Automatic scaling based on traffic
- Multiple model format support (scikit-learn, PyTorch, TensorFlow)
- Production-ready inference endpoints
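Beyond waiting for the controller pod, a quick way to sanity-check the installation is to confirm that the KServe CRDs and the cluster-wide serving runtimes (installed by kserve-cluster-resources.yaml) are present. A minimal check, assuming the default installation above:
# The CRDs should include inferenceservices.serving.kserve.io
kubectl get crd | grep serving.kserve.io
# Cluster-wide model serving runtimes
kubectl get clusterservingruntimes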
Step 3: Setup MinIO Access
Configure a service account and credentials so KServe can access your model artifacts in MinIO.
MinIO Credentials
apiVersion: v1
kind: Secret
metadata:
  name: minio-creds
  namespace: inference
  annotations:
    serving.kserve.io/s3-endpoint: "minio-service.kubeflow:9000" # Replace with your MinIO service endpoint
    serving.kserve.io/s3-usehttps: "0" # Set to "0" for insecure connections, "1" for HTTPS
    serving.kserve.io/s3-region: "us-east-1" # Can be any value for MinIO
stringData:
  AWS_ACCESS_KEY_ID: "minio"
  AWS_SECRET_ACCESS_KEY: "minio123"
type: Opaque
Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-sa
  namespace: inference
secrets:
- name: minio-creds
Create Resources
Save the two manifests above as minio-creds-secret.yaml and kserve-sa.yaml, then create the resources:
kubectl create namespace inference
kubectl apply -f minio-creds-secret.yaml
kubectl apply -f kserve-sa.yaml
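Before moving on, it is worth verifying that both resources landed in the inference namespace and that the service account references the secret:
kubectl get secret minio-creds -n inference
kubectl get serviceaccount kserve-sa -n inference -o jsonpath='{.secrets[*].name}'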
Step 4: Package Your Model
Create a Python server that loads your model from MinIO and serves predictions via KServe.
Model Server Code
Create taxiprice_server.py:
import logging
import sys
import time
import traceback

import joblib
import numpy as np
from prometheus_client import Counter, Histogram, Gauge, start_http_server
from kserve import Model, ModelServer

logging.basicConfig(level=logging.INFO)

# Custom Prometheus metrics exposed by the model server
PREDICTION_COUNTER = Counter('model_predictions_total', 'Total predictions')
PREDICTION_ERRORS = Counter('model_prediction_errors_total', 'Total errors')
PREDICTION_LATENCY = Histogram('model_prediction_latency_seconds', 'Prediction latency')
MODEL_ACCURACY = Gauge('model_accuracy', 'Model accuracy')


class TaxiModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        # KServe's storage initializer downloads the STORAGE_URI artifact into /mnt/models
        self.model = joblib.load('/mnt/models/model')
        start_http_server(8000)  # Expose the Prometheus metrics endpoint on port 8000
        MODEL_ACCURACY.set(0.93)  # Current model accuracy baseline
        self.ready = True

    def predict(self, request: dict, headers: dict = None) -> dict:
        logging.info(f"Received request: {request}")
        start = time.time()
        try:
            instance = request["instances"][0]
            start_location = instance["start_location"]
            end_location = instance["end_location"]
            X = np.array([[start_location, end_location]])
            price = self.model.predict(X)[0]
            PREDICTION_COUNTER.inc()
            PREDICTION_LATENCY.observe(time.time() - start)
            return {"predictions": [{"predicted_price": float(price)}]}
        except Exception:
            PREDICTION_ERRORS.inc()
            raise


if __name__ == "__main__":
    logging.info("Starting model server")
    model = TaxiModel("taxi-model")
    try:
        ModelServer().start([model])
    except Exception:
        # Log the full stack trace if startup or the readiness check fails,
        # then exit with a non-zero status so the pod is restarted.
        logging.error("Fatal error during model server startup")
        traceback.print_exc()
        sys.exit(1)
Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
RUN pip install kserve joblib scikit-learn numpy pandas prometheus-client
# Copy model server
COPY taxiprice_server.py .
# Expose the serving port (8080) and the Prometheus metrics port (8000)
EXPOSE 8080 8000
# Run the server
CMD ["python", "-u", "taxiprice_server.py"]
Build and Push Docker Image
# Build Docker image
docker build -t <your-registry>/taxi-price-model:latest .
# Push to registry
docker push <your-registry>/taxi-price-model:latest
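Before handing the image to KServe, you can smoke-test it locally. In the cluster, the storage initializer downloads the artifact referenced by STORAGE_URI into /mnt/models; locally you can imitate that by mounting a directory that contains a copy of the trained model file saved as model. A rough sketch (the local ./models directory and model copy are assumptions):
# Run the server locally with a local copy of the model mounted at /mnt/models/model
docker run --rm -p 8080:8080 -v "$(pwd)/models:/mnt/models" <your-registry>/taxi-price-model:latest
# In another terminal, send a test request against the local container
curl -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:8080/v1/models/taxi-model:predict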
Step 5: Deploy with KServe
Create KServe InferenceService that connects to your model artifacts in MinIO.
KServe Deployment YAML
Update the artifact path (STORAGE_URI) and the Docker image (the image you built and pushed in Step 4).
Save this as taxi_price_service.yaml:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "taxi-service-1"
namespace: inference
annotations:
serving.kserve.io/enable-prometheus-scraping: "true"
prometheus.io/scrape-protocol: "http"
spec:
predictor:
serviceAccountName: kserve-sa
containers:
- name: kserve-container # Name must be 'kserve-container' for automatic injection of env var/sidecar
image: tugamidi/taxi-model:5
ports:
- containerPort: 8080
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1
memory: 1Gi
env:
# Specify the storage URI here
- name: STORAGE_URI
value: "s3://mlpipeline/v2/artifacts/chicago-taxi-fare-pipeline/99d4b75c-ffab-485e-9c07-7de5b536ee13/train-fare-model/e6363970-33ac-47e6-80d9-7f778f3bc05c/model"
Replace placeholders:
- <your-registry> with your Docker registry
- The STORAGE_URI value with the path to your own model artifact (the pipeline run and task IDs will differ)
Find your run ID: kubectl get pipelines.pipelines.kubeflow.org -n kubeflow -o yaml (the run ID also appears on the run details page of the Kubeflow Pipelines UI)
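If you set up the mc alias in Step 1, you can also browse the artifact tree directly to locate the exact model path for STORAGE_URI. A sketch, assuming the kf alias from Step 1 and this tutorial's pipeline and task names:
mc ls -r kf/mlpipeline/v2/artifacts/chicago-taxi-fare-pipeline/ | grep train-fare-model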
Deploy Your Model
# Deploy the model
kubectl apply -f taxi_price_service.yaml
# Check deployment status
kubectl get inferenceservices -n inference
# Check readiness
kubectl get inferenceservice taxi-service-1 -n inference -o jsonpath='{.status.conditions[*].status}'
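If the InferenceService does not become ready, the most useful places to look are its events and the predictor pod's containers; the storage-initializer init container in particular will surface MinIO credential or path problems. For example:
# Inspect conditions and events
kubectl describe inferenceservice taxi-service-1 -n inference
# Logs from the model server container
kubectl logs -n inference -l serving.kserve.io/inferenceservice=taxi-service-1 -c kserve-container
# Logs from the init container that downloads the model from MinIO
kubectl logs <pod-name> -n inference -c storage-initializer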
Step 6: Test Prediction API
Test your deployed model with real prediction requests.
Get Model URL
# Get the predictor pod name
kubectl get pods -n inference
# Port-forward the pod so the model is reachable locally (the container listens on 8080)
kubectl port-forward pod/<pod-name> 3000:8080 -n inference
Send Test Request
curl -v -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:3000/v1/models/taxi-model:predict
Expected Output:
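The response body follows the shape returned by the predict method in taxiprice_server.py; the price value below is only illustrative and depends on your trained model:
{"predictions": [{"predicted_price": 12.34}]}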
Step 7: Monitor & Auto-scaling
Monitor your model performance and verify automatic scaling capabilities.
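Two quick checks are worth doing here. The custom Prometheus metrics defined in taxiprice_server.py are served on port 8000 of the predictor pod, and because KServe runs the predictor as a Knative service, sustained traffic should cause extra pods to appear and then scale back down once the load stops. A rough sketch, assuming the port-forward from Step 6 is still running and the hey load generator is installed:
# Scrape the custom model metrics (model_predictions_total, model_prediction_latency_seconds, ...)
kubectl port-forward pod/<pod-name> 8000:8000 -n inference
# In another terminal
curl http://localhost:8000/metrics | grep model_
# Generate sustained load and watch the predictor pods scale
hey -z 60s -c 20 -m POST -H "Content-Type: application/json" \
  -d '{"instances":[{"start_location":1,"end_location":10}]}' \
  http://localhost:3000/v1/models/taxi-model:predict
kubectl get pods -n inference -w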
Model Successfully Deployed!
Your model is now running in production with auto-scaling capabilities. Let's learn about monitoring and CI/CD for ML systems.