Kubernetes Dev Stack
Local kind cluster mirroring the EKS deployment from hotosm/k8s-infra. Use this when you want production parity (real ZenML Kubernetes orchestrator, helmfile-managed charts, in-cluster networking). For everyday development, the Docker Compose stack from Getting Started is faster and lighter.
Quickstart
Prerequisites
kind, kubectl, helm, helmfile, mc (MinIO client), and Docker Engine (Linux) or colima (macOS), plus uv and just for the Python side.
For GPU support add: nvkind, the NVIDIA driver, and nvidia-container-toolkit. See GPU Support below.
Source: infra/
just setup # root: install Python deps (skip if already done)
cd infra
just up # create cluster (if missing), helmfile apply, port-forwards
just status # cluster + pod health
just down # stop port-forwards (cluster stays)
just tear # destroy the cluster
just up registers a k8s ZenML stack (Kubernetes orchestrator, in-cluster MinIO and MLflow) and sets it active. From the repo root you can then run any of the example pipelines against the cluster:
Verifying results
After the pipeline completes, inspect the results at these endpoints:
| What | URL |
|---|---|
| ZenML dashboard (pipelines, steps, artifacts) | http://localhost:8080 (login: default / empty password) |
| STAC collections (registered & promoted models) | http://localhost:8082/collections |
| MLflow experiments (training metrics, model registry) | http://localhost:5000 |
| MinIO browser (raw S3 objects) | http://localhost:9001 (login: minioadmin / minioadmin) |
Architecture
All services run in namespace fair on a kind cluster.
postgres (PG 17 + PostGIS)              zenml (ghcr.io/hotosm/zenml-postgres:0.93.3)
  DBs: zenml, fair_models, mlflow         Official Helm chart, OCI registry
  |                                       |
  +--- stac-fastapi-pgstac                +--- mlflow (community-charts/mlflow)
  |      eoapi-k8s chart                         PG backend + S3 artifacts
  |                                       |
  +--- minio (s3://fair-data, s3://mlflow, s3://zenml)
Port-forwards (managed by just up / just down)
| Service | Local | Cluster |
|---|---|---|
| ZenML | localhost:8080 | zenml.fair.svc:80 |
| STAC API | localhost:8082 | stac-stac.fair.svc:8080 |
| MinIO API | localhost:9000 | minio.fair.svc:9000 |
| MinIO Console | localhost:9001 | minio-console.fair.svc:9001 |
| MLflow | localhost:5000 | mlflow.fair.svc:80 |
| Postgres | localhost:5432 | postgres.fair.svc:5432 |
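If a URL from the tables above does not respond, it can help to check whether the port-forwards are actually listening. The following is a minimal sketch, not part of the repository: it only attempts a TCP connection to each local port from the table, and the function names are illustrative.

```python
import socket

# Local ports copied from the port-forward table above.
PORT_FORWARDS = {
    "ZenML": 8080,
    "STAC API": 8082,
    "MinIO API": 9000,
    "MinIO Console": 9001,
    "MLflow": 5000,
    "Postgres": 5432,
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_port_forwards(ports: dict = PORT_FORWARDS) -> dict:
    """Map each service name to whether its local port accepts connections."""
    return {name: is_listening(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_port_forwards().items():
        print(f"{name:<14} {'up' if ok else 'DOWN'}")
```

A port reported as DOWN usually means the port-forward is not running; re-running just up from infra/ should restore it. An open port only proves that something is listening, not that the service behind it is healthy, so just status remains the authoritative check.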
GPU Support (optional)
Follow the nvkind prerequisites and setup guide to install the NVIDIA driver, nvidia-container-toolkit, and nvkind on your host. Once nvkind is on $PATH, just up handles the rest.
What just up does
- kind-config.yaml labels workers as inference and train, with the train node getting extraMounts that signal GPU presence to nvkind.
- The cluster creation step runs nvkind (installs toolkit inside the node, configures containerd).
- The infra step creates the nvidia RuntimeClass, labels the GPU node, and deploys the device plugin.
Caveats
- PatchProcDriverNvidia may fail on non-MIG single-GPU hosts; non-critical, the justfile tolerates it.
- nvkind restarts containerd on the GPU node, briefly disrupting colocated pods.
- Device plugin uses --set deviceDiscoveryStrategy=nvml (default auto fails inside kind).
Configuration
Label domain
Node labels and taints use the fair.dev prefix (hardcoded in the dev config files).
For production (dok8s), the label domain comes from the domain OpenTofu variable in infra/dok8s/terraform.tfvars (exposed as the fair_domain output and consumed by the infra/ justfile recipes).
The runtime default in fair/zenml/config.py can be overridden via the FAIR_LABEL_DOMAIN environment variable.
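The override can be pictured with a short sketch. This is not the code in fair/zenml/config.py, which is not reproduced here: the default value fair.dev is assumed from the dev prefix named above, and the helper names and the role label key are hypothetical.

```python
import os

# Assumption: the runtime default matches the dev prefix described above.
DEFAULT_LABEL_DOMAIN = "fair.dev"


def label_domain(env=None) -> str:
    """Return FAIR_LABEL_DOMAIN if set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("FAIR_LABEL_DOMAIN") or DEFAULT_LABEL_DOMAIN


def node_label(key: str, env=None) -> str:
    """Build a fully qualified label key such as 'fair.dev/role'.

    The 'role' key used in the examples is hypothetical.
    """
    return f"{label_domain(env)}/{key}"


if __name__ == "__main__":
    print(node_label("role"))
    print(node_label("role", {"FAIR_LABEL_DOMAIN": "example.org"}))
```

Passing the environment as an argument keeps the lookup easy to test without mutating os.environ.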
Decisions
kind over minikube/k3s : hotosm/k8s-infra runs upstream K8s (EKS). kind runs
upstream K8s in Docker containers with guaranteed API compatibility. Lightweight, no VM.
Single PostgreSQL, three databases : ZenML, pgstac, and MLflow all need Postgres.
One StatefulSet with init SQL (one CREATE DATABASE statement each for zenml,
fair_models, and mlflow). Mirrors production where CloudNativePG hosts databases
the same way.
MLflow over W&B : Apache 2.0, uses Postgres (same engine as everything else),
mature Helm chart, first-class ZenML support via --flavor=mlflow. W&B self-hosted
requires MySQL + Redis + commercial license.
eoAPI for STAC : Dev uses the upstream eoAPI chart from
https://devseed.com/eoapi-k8s/ (eoapi/eoapi, pinned to 0.12.2 in
infra/helmfile.yaml.gotmpl) with external-plaintext DB.
ZenML Postgres patch : OSS ZenML only supports MySQL/SQLite. The patched server
image at ghcr.io/hotosm/zenml-postgres replaces MySQL-specific dialect constructs
(such as MEDIUMTEXT) with Postgres equivalents. The client side is
handled automatically by fair-py-ops: a .pth startup hook
(fair/_patch_zenml.py) adds the POSTGRESQL enum variant to
ServerDatabaseType at interpreter startup, before any ZenML import. No manual
client patching is needed.
StacBackend Protocol : StacCatalogManager writes local JSON files.
PgStacBackend writes to pgstac via pypgstac. Both conform to the StacBackend
Protocol (structural subtyping). run.py --stac-api-url selects pgstac; omit for local.
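The structural-subtyping idea behind the StacBackend Protocol can be sketched as follows. The method name publish_item and both classes are hypothetical stand-ins, not the project's StacCatalogManager or PgStacBackend; the stand-in for the remote backend only records items instead of calling pypgstac.

```python
import json
from pathlib import Path
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class StacBackend(Protocol):
    """Anything with a matching publish_item method satisfies the protocol."""

    def publish_item(self, collection_id: str, item: dict) -> str: ...


class LocalJsonBackend:
    """Hypothetical stand-in for the local JSON path: one file per item."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def publish_item(self, collection_id: str, item: dict) -> str:
        target = self.root / collection_id / f"{item['id']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(item, indent=2))
        return str(target)


class RecordingBackend:
    """Hypothetical stand-in for a remote backend; it records items in memory."""

    def __init__(self, api_url: str) -> None:
        self.api_url = api_url.rstrip("/")
        self.items: list = []

    def publish_item(self, collection_id: str, item: dict) -> str:
        self.items.append((collection_id, item))
        return f"{self.api_url}/collections/{collection_id}/items/{item['id']}"


def select_backend(stac_api_url: Optional[str], local_root: Path) -> StacBackend:
    """Mirror the CLI behaviour described above: a URL selects the remote path."""
    if stac_api_url:
        return RecordingBackend(stac_api_url)
    return LocalJsonBackend(local_root)
```

Neither class inherits from StacBackend. A type checker accepts both because their method signatures match, which is what lets the pipeline code stay unaware of which backend is in use.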
PgStacBackend reads via pystac-client : The eoAPI chart injects
--root-path=/stac by default, which breaks self-links under direct port-forwarding.
Dev values set stac.overrideRootPath: "" to remove it, so pystac-client works
correctly against http://localhost:8082.
GPU scheduling from STAC metadata : mlm:accelerator and mlm:accelerator_count
in stac-item.json drive nvidia.com/gpu resource requests. config.py reads these
and emits pod settings only when the orchestrator is Kubernetes.
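A minimal sketch of the mapping from STAC metadata to GPU requests is shown below. It is not the logic in config.py: it assumes the mlm: fields live under the item's properties, that only the cuda accelerator value maps to nvidia.com/gpu, that a missing count means one GPU, and that the orchestrator is identified by a flavor string. The returned dictionary layout is likewise illustrative.

```python
from typing import Any


def gpu_pod_settings(stac_item: dict, orchestrator_flavor: str) -> dict:
    """Return GPU resource settings, or {} when none should be emitted."""
    # Assumption: pod settings are only meaningful on the Kubernetes orchestrator.
    if orchestrator_flavor != "kubernetes":
        return {}

    props: dict = stac_item.get("properties", {})
    # Assumption: only the "cuda" accelerator maps to an NVIDIA GPU request.
    if props.get("mlm:accelerator") != "cuda":
        return {}

    raw_count: Any = props.get("mlm:accelerator_count")
    count = 1 if raw_count is None else int(raw_count)  # assumed default of 1
    if count < 1:
        return {}

    # Kubernetes requires requests == limits for extended resources.
    gpu = {"nvidia.com/gpu": str(count)}
    return {"resources": {"requests": dict(gpu), "limits": dict(gpu)}}


if __name__ == "__main__":
    item = {"properties": {"mlm:accelerator": "cuda", "mlm:accelerator_count": 2}}
    print(gpu_pod_settings(item, "kubernetes"))
    print(gpu_pod_settings(item, "local"))
```

Returning an empty dictionary for non-Kubernetes orchestrators keeps the same stac-item.json usable on the Docker Compose stack, where GPU resource requests have no meaning.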
Known issues
eoAPI root_path (resolved)
The chart's deployment template injects
--root-path={{ .Values.stac.ingress.path }} (defaults to /stac) into
the uvicorn command when an ingress class is set. Dev values set
stac.overrideRootPath: "" which removes the arg entirely, so pystac-client
works via direct port-forwarding.
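The effect of the root path on generated links can be illustrated with a simplified model. This is not how stac-fastapi builds links internally; it only assumes that a self-link is the origin, followed by the root path, followed by the route, and the helper name is hypothetical.

```python
def self_link(origin: str, root_path: str, route: str) -> str:
    """Join origin, optional root path, and route into one URL."""
    parts = [p.strip("/") for p in (root_path, route)]
    path = "/".join(p for p in parts if p)
    return f"{origin.rstrip('/')}/{path}"


if __name__ == "__main__":
    # With the chart default, links carry a /stac prefix that nothing
    # strips when the service is reached through a plain port-forward.
    print(self_link("http://localhost:8082", "/stac", "/collections"))
    # With stac.overrideRootPath: "" the prefix disappears.
    print(self_link("http://localhost:8082", "", "/collections"))
```

Under this model, a client that follows the first link requests a path the port-forwarded service does not serve, while the second link resolves directly.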