Kubernetes Dev Stack

Local kind cluster mirroring the EKS deployment from hotosm/k8s-infra. Use this when you want production parity (real ZenML Kubernetes orchestrator, helmfile-managed charts, in-cluster networking). For everyday development, the Docker Compose stack from Getting Started is faster and lighter.

Quickstart

Prerequisites

kind, kubectl, helm, helmfile, mc (MinIO client), Docker Engine (Linux) or colima (macOS), plus uv and just for the Python side.

For GPU support add: nvkind, the NVIDIA driver, and nvidia-container-toolkit. See GPU Support below.

Source: infra/

Bring up the kind stack

just setup            # root: install Python deps (skip if already done)
cd infra
just up               # create cluster (if missing), helmfile apply, port-forwards
just status           # cluster + pod health
just down             # stop port-forwards (cluster stays)
just tear             # destroy the cluster

just up registers a k8s ZenML stack (Kubernetes orchestrator, in-cluster MinIO and MLflow) and sets it active. From the repo root you can then run any of the example pipelines against the cluster:

cd ..
uv run --group example python examples/segmentation/run.py

Verifying results

After the pipeline completes

What	URL
ZenML dashboard (pipelines, steps, artifacts)	http://localhost:8080 (login: `default` / empty password)
STAC collections (registered & promoted models)	http://localhost:8082/collections
MLflow experiments (training metrics, model registry)	http://localhost:5000
MinIO browser (raw S3 objects)	http://localhost:9001 (login: `minioadmin` / `minioadmin`)
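
To check all four endpoints at once, a small probe script can help. The following is a minimal sketch using only the Python standard library; the URLs come from the table above, while the helper names `http_status` and `check_endpoints` are hypothetical and not part of the repository. Any HTTP response (including a redirect or an auth error) is treated as "reachable", since it shows the port-forward is up.

```python
from typing import Callable, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# URLs copied from the table above.
ENDPOINTS = {
    "ZenML dashboard": "http://localhost:8080",
    "STAC collections": "http://localhost:8082/collections",
    "MLflow": "http://localhost:5000",
    "MinIO browser": "http://localhost:9001",
}


def http_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        # The server answered, just not with a 2xx status.
        return exc.code
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None


def check_endpoints(
    probe: Callable[[str], Optional[int]] = http_status,
) -> dict:
    """Map each service name to True when the probe got any HTTP response."""
    return {name: probe(url) is not None for name, url in ENDPOINTS.items()}
```

Calling `check_endpoints()` after `just up` returns a dict such as `{"MLflow": True, ...}`; a `False` entry usually means the corresponding port-forward is not running. The `probe` parameter exists so the logic can be exercised without a live cluster.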

Architecture

All services run in namespace fair on a kind cluster.

Cluster topology (namespace: fair)

postgres (PG 17 + PostGIS)           zenml (ghcr.io/hotosm/zenml-postgres:0.93.3)
  DBs: zenml, fair_models, mlflow      Official Helm chart, OCI registry
        |                               |
        +--- stac-fastapi-pgstac        +--- mlflow (community-charts/mlflow)
        |    eoapi-k8s chart                 PG backend + S3 artifacts
        |                               |
        +--- minio (s3://fair-data, s3://mlflow, s3://zenml)

Port-forwards (managed by just up / just down)

Service	Local	Cluster
ZenML	localhost:8080	zenml.fair.svc:80
STAC API	localhost:8082	stac-stac.fair.svc:8080
MinIO API	localhost:9000	minio.fair.svc:9000
MinIO Console	localhost:9001	minio-console.fair.svc:9001
MLflow	localhost:5000	mlflow.fair.svc:80
Postgres	localhost:5432	postgres.fair.svc:5432
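
If a single forward needs to be restarted by hand, the table translates directly into `kubectl port-forward` invocations. This is a sketch under assumptions: the service names and ports are read off the Cluster column above, but the exact commands the justfile runs are not shown here and may differ. `port_forward_cmd` is a hypothetical helper.

```python
NAMESPACE = "fair"

# (service name, local port, cluster port), taken from the table above.
FORWARDS = [
    ("zenml", 8080, 80),
    ("stac-stac", 8082, 8080),
    ("minio", 9000, 9000),
    ("minio-console", 9001, 9001),
    ("mlflow", 5000, 80),
    ("postgres", 5432, 5432),
]


def port_forward_cmd(service: str, local: int, remote: int,
                     namespace: str = NAMESPACE) -> list:
    """Build the argv for forwarding localhost:local to svc/service:remote."""
    return [
        "kubectl", "port-forward",
        "-n", namespace,
        f"svc/{service}",
        f"{local}:{remote}",
    ]


def all_port_forward_cmds() -> list:
    """Return one shell-ready command string per row of the table."""
    return [" ".join(port_forward_cmd(*row)) for row in FORWARDS]
```

For example, `" ".join(port_forward_cmd("mlflow", 5000, 80))` yields `kubectl port-forward -n fair svc/mlflow 5000:80`.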

GPU Support (optional)

Follow the nvkind prerequisites and setup guide to install the NVIDIA driver, nvidia-container-toolkit, and nvkind on your host. Once nvkind is on $PATH, just up handles the rest.

What just up does

kind-config.yaml labels workers as inference and train, with the train node getting extraMounts that signal GPU presence to nvkind. The cluster creation step runs nvkind (installs toolkit inside the node, configures containerd). The infra step creates the nvidia RuntimeClass, labels the GPU node, and deploys the device plugin.

Caveats

PatchProcDriverNvidia may fail on non-MIG single-GPU hosts; this is non-critical and the justfile tolerates it.
nvkind restarts containerd on the GPU node, briefly disrupting colocated pods.
The device plugin is installed with --set deviceDiscoveryStrategy=nvml, because the default (auto) fails inside kind.

Configuration

Label domain

Node labels and taints use the fair.dev prefix (hardcoded in the dev config files). For production (dok8s), the label domain comes from the domain OpenTofu variable in infra/dok8s/terraform.tfvars (exposed as the fair_domain output and consumed by the infra/ justfile recipes).

The runtime default in fair/zenml/config.py can be overridden via the FAIR_LABEL_DOMAIN environment variable.
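
The override pattern can be sketched as follows. This is illustrative only and does not reproduce fair/zenml/config.py: the `fair.dev` default and the `FAIR_LABEL_DOMAIN` variable come from the text above, while the function names and the `role` label are hypothetical. Joining prefix and name with `/` follows the standard Kubernetes label-key form.

```python
import os
from typing import Mapping, Optional

DEFAULT_LABEL_DOMAIN = "fair.dev"  # dev prefix named in the text above


def label_domain(env: Optional[Mapping[str, str]] = None) -> str:
    """Return FAIR_LABEL_DOMAIN if set and non-empty, else the dev default."""
    env = os.environ if env is None else env
    return env.get("FAIR_LABEL_DOMAIN") or DEFAULT_LABEL_DOMAIN


def label_key(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build a prefixed Kubernetes label key, e.g. '<domain>/<name>'."""
    return f"{label_domain(env)}/{name}"
```

With no override, `label_key("role")` gives `fair.dev/role`; with `FAIR_LABEL_DOMAIN=example.org` exported it gives `example.org/role`.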

Decisions

kind over minikube/k3s: hotosm/k8s-infra runs upstream K8s (EKS). kind also runs upstream K8s, inside Docker containers, so the API surface matches. It is lightweight and needs no VM.

Single PostgreSQL, three databases: ZenML, pgstac, and MLflow all need Postgres. One StatefulSet runs init SQL that creates the zenml, fair_models, and mlflow databases. This mirrors production, where CloudNativePG hosts the databases the same way.
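
As a rough illustration of the init step, the statements for the three databases could be generated like this. The database names come from the text above; the `init_sql` helper is hypothetical, and the actual init script shipped in infra/ may be written differently.

```python
DATABASES = ("zenml", "fair_models", "mlflow")


def init_sql(databases=DATABASES) -> str:
    """Return one CREATE DATABASE statement per name, newline-separated."""
    for name in databases:
        # Guard against names that would need quoting or could inject SQL.
        if not name.isidentifier():
            raise ValueError(f"unsafe database name: {name!r}")
    return "\n".join(f"CREATE DATABASE {name};" for name in databases)
```

`print(init_sql())` emits three statements, one each for zenml, fair_models, and mlflow.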

MLflow over W&B: Apache 2.0, uses Postgres (same engine as everything else), has a mature Helm chart, and has first-class ZenML support via --flavor=mlflow. W&B self-hosted requires MySQL + Redis + commercial license.

eoAPI for STAC: Dev uses the upstream eoAPI chart from https://devseed.com/eoapi-k8s/ (eoapi/eoapi, pinned to 0.12.2 in infra/helmfile.yaml.gotmpl) with external-plaintext DB.

ZenML Postgres patch: OSS ZenML only supports MySQL/SQLite. The patched server image at ghcr.io/hotosm/zenml-postgres replaces MySQL-dialect constructs (MEDIUMTEXT) with Postgres equivalents. The client side is handled automatically by fair-py-ops: a .pth startup hook (fair/_patch_zenml.py) adds the POSTGRESQL enum variant to ServerDatabaseType at interpreter startup, before any ZenML import. No manual client patching is needed.

StacBackend Protocol: StacCatalogManager writes local JSON files. PgStacBackend writes to pgstac via pypgstac. Both conform to the StacBackend Protocol (structural subtyping). Passing --stac-api-url to run.py selects pgstac; omit it to use local files.
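
Structural subtyping here means neither backend inherits from a shared base class; each only needs to provide the methods the Protocol declares. The sketch below shows the pattern with stand-in classes. Everything in it is hypothetical: the `publish_item` method, `LocalJsonBackend`, `RecordingApiBackend`, and `select_backend` are illustrative names, and the real StacBackend method set is not shown in this document.

```python
import json
from pathlib import Path
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class StacBackend(Protocol):
    """Anything with a matching publish_item method satisfies this Protocol."""

    def publish_item(self, collection_id: str, item: dict) -> None: ...


class LocalJsonBackend:
    """Stand-in for StacCatalogManager: writes one JSON file per item."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def publish_item(self, collection_id: str, item: dict) -> None:
        path = self.root / collection_id / f"{item['id']}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(item))


class RecordingApiBackend:
    """Stand-in for PgStacBackend: records items instead of calling pypgstac."""

    def __init__(self, api_url: str) -> None:
        self.api_url = api_url
        self.items: dict = {}

    def publish_item(self, collection_id: str, item: dict) -> None:
        self.items[(collection_id, item["id"])] = item


def select_backend(stac_api_url: Optional[str], local_root: Path) -> StacBackend:
    """Mirror the --stac-api-url switch: a URL selects the API-backed backend."""
    if stac_api_url:
        return RecordingApiBackend(stac_api_url)
    return LocalJsonBackend(local_root)
```

Pipeline code that accepts a `StacBackend` works with either class unchanged, which is what lets the same example run against local files or the in-cluster STAC API.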

PgStacBackend reads via pystac-client: The eoAPI chart injects --root-path=/stac by default, which breaks self-links under direct port-forwarding. Dev values set stac.overrideRootPath: "" to remove it, so pystac-client works correctly against http://localhost:8082.

GPU scheduling from STAC metadata: mlm:accelerator and mlm:accelerator_count in stac-item.json drive nvidia.com/gpu resource requests. config.py reads these and emits pod settings only when the orchestrator is Kubernetes.
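
The mapping from STAC metadata to a GPU request can be sketched as below. This is not the code in config.py: the function name `gpu_pod_resources` is hypothetical, the fields are assumed to sit under the item's `properties`, and treating only the `cuda` accelerator value as an NVIDIA GPU request is an assumption. The output is a plain Kubernetes-style resources dict rather than a ZenML settings object.

```python
def gpu_pod_resources(stac_item: dict, orchestrator_flavor: str) -> dict:
    """Return a container 'resources' dict, or {} when no GPU should be requested."""
    # Only a Kubernetes orchestrator can act on pod-level resource settings.
    if orchestrator_flavor != "kubernetes":
        return {}

    props = stac_item.get("properties", {})  # assumed location of the mlm:* fields
    accelerator = props.get("mlm:accelerator")
    count = int(props.get("mlm:accelerator_count") or 0)

    # Assumption: only 'cuda' maps to the nvidia.com/gpu extended resource.
    if accelerator != "cuda" or count < 1:
        return {}

    # Extended resources are set under limits; Kubernetes uses the limit
    # as the request when no request is given.
    return {"limits": {"nvidia.com/gpu": count}}
```

For an item declaring `"mlm:accelerator": "cuda"` and `"mlm:accelerator_count": 1`, the Kubernetes case yields `{"limits": {"nvidia.com/gpu": 1}}`, while any other orchestrator flavor yields an empty dict so local runs are unaffected.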

Known issues

eoAPI root_path (resolved)

The chart's deployment template injects --root-path={{ .Values.stac.ingress.path }} (defaults to /stac) into the uvicorn command when an ingress class is set. Dev values set stac.overrideRootPath: "" which removes the arg entirely, so pystac-client works via direct port-forwarding.
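
Why the extra root path breaks a port-forwarded client can be seen in a deliberately simplified model: the server builds self-links as host + root path + route, but the port-forward exposes the application at `/`, so a link containing `/stac` points at a path nothing serves. Both helpers below are hypothetical and only model the URL arithmetic, not uvicorn or stac-fastapi behavior.

```python
from urllib.parse import urlsplit


def self_link(base_url: str, root_path: str, route: str) -> str:
    """Build the link a server would advertise for route under root_path."""
    return f"{base_url.rstrip('/')}{root_path.rstrip('/')}{route}"


def served_by_port_forward(link: str, routes: set) -> bool:
    """True when the link's path is one the app actually serves at '/'."""
    return urlsplit(link).path in routes


ROUTES = {"/", "/collections"}
BASE = "http://localhost:8082"

with_root = self_link(BASE, "/stac", "/collections")
without_root = self_link(BASE, "", "/collections")
```

Here `with_root` is `http://localhost:8082/stac/collections`, which the port-forwarded app does not serve, while `without_root` is `http://localhost:8082/collections`, which it does.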