Kubernetes Dev Stack

Local kind cluster mirroring the EKS deployment from hotosm/k8s-infra.

Quickstart

Prerequisites

kind, kubectl, helm, helmfile, mc (MinIO client), and colima (macOS) or Docker Engine (Linux). Running just setup in k8s mode checks that all of these are on $PATH before proceeding. For GPU support: nvkind, the NVIDIA driver, and nvidia-container-toolkit. See GPU Support below.

The dev infra files live in infra/dev.

Setup and cluster lifecycle
just k8s              # switch to k8s mode (sticky, one-time)
just setup            # install deps + k8s extras + verify CLI tools
cd infra/dev
just up               # creates cluster if missing, deploys infra, starts port-forwards
just status           # show cluster, pods, port-forward health
just down             # stop port-forwards (cluster stays for fast restart)
just tear             # destroy everything
Run pipelines
just example           # E2E with local orchestrator against k8s infra (from repo root)
just run-example-k8s   # E2E with k8s orchestrator (from infra/dev)

Verifying results

After just example (or just run-example-k8s) completes, inspect outputs at:

What                                                    URL
ZenML dashboard (pipelines, steps, artifacts)           http://localhost:8080 (login: default / empty password)
STAC collections (registered & promoted models)         http://localhost:8082/collections
MLflow experiments (training metrics, model registry)   http://localhost:5000
MinIO browser (raw S3 objects)                          http://localhost:9000 (login: minioadmin / minioadmin)
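As a quick programmatic check, the STAC API's /collections endpoint can be queried directly. This is a minimal sketch using only the stdlib; the response shape ({"collections": [...]}) follows the STAC API specification, and the URL matches the dev port-forward above.

```python
# Sketch: list collection ids from the STAC API /collections response.
import json
import urllib.request


def collection_ids(payload: dict) -> list[str]:
    """Extract collection ids from a STAC /collections payload."""
    return [c["id"] for c in payload.get("collections", [])]


def fetch_collection_ids(url: str = "http://localhost:8082/collections") -> list[str]:
    """Fetch and parse the live endpoint (requires `just up` port-forwards)."""
    with urllib.request.urlopen(url) as resp:
        return collection_ids(json.load(resp))
```

With the port-forwards running, fetch_collection_ids() should list the registered model collections.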

ZenML Stacks

just up registers two stacks:

default (local process orchestrator)
  S3 endpoint   localhost:9000
  MLflow        localhost:5000
  Use for       local runs via port-forward (just example)

k8s_orchestrator (in-cluster orchestrator)
  S3 endpoint   minio.fair.svc:9000
  MLflow        mlflow.fair.svc:80
  Use for       in-cluster jobs (just run-example-k8s)

Architecture

All services run in namespace fair on a 3-node kind cluster (1 control-plane node + 2 workers).

Cluster topology (namespace: fair)
postgres (PG 17 + PostGIS)           zenml (ghcr.io/hotosm/zenml-postgres:0.93.3)
  DBs: zenml, fair_models, mlflow      Official Helm chart, OCI registry
        |                               |
        +--- stac-fastapi-pgstac        +--- mlflow (community-charts/mlflow)
        |    eoapi-k8s chart                 PG backend + S3 artifacts
        |                               |
        +--- minio (s3://fair-data, s3://mlflow, s3://zenml)
Port-forwards (managed by just up / just down)
Service    Local            Cluster
ZenML      localhost:8080   zenml.fair.svc:80
STAC API   localhost:8082   stac-stac.fair.svc:8080
MinIO      localhost:9000   minio.fair.svc:9000
MLflow     localhost:5000   mlflow.fair.svc:80
Postgres   localhost:5432   postgres.fair.svc:5432
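The port-forward table above can be verified with a plain TCP connect to each local port, which is roughly the kind of health check `just status` performs. A minimal stdlib sketch (the port map mirrors the table; the helper names are illustrative):

```python
# Sketch: check whether each port-forward from the table above is live
# by attempting a TCP connection to the local port.
import socket

PORT_FORWARDS = {
    "ZenML": 8080,
    "STAC API": 8082,
    "MinIO": 9000,
    "MLflow": 5000,
    "Postgres": 5432,
}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report() -> dict[str, bool]:
    """Map each service name to whether its local port accepts connections."""
    return {name: port_open(port) for name, port in PORT_FORWARDS.items()}
```

With `just up` running, report() should return True for every service.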

GPU Support (optional)

Follow the nvkind prerequisites and setup guide to install the NVIDIA driver, nvidia-container-toolkit, and nvkind on your host. Once nvkind is on $PATH, just up handles the rest.

What just up does

kind-config.yaml labels workers as inference and train, with the train node getting extraMounts that signal GPU presence to nvkind. The cluster creation step runs nvkind (installs toolkit inside the node, configures containerd). The infra step creates the nvidia RuntimeClass, labels the GPU node, and deploys the device plugin.

Caveats

  • PatchProcDriverNvidia may fail on non-MIG single-GPU hosts; this is non-critical, and the justfile tolerates it.
  • nvkind restarts containerd on the GPU node, briefly disrupting colocated pods.
  • Device plugin uses --set deviceDiscoveryStrategy=nvml (default auto fails inside kind).

Configuration

Label domain

Node labels and taints use the fair.dev prefix (hardcoded in all dev/CI config files). For production (dok8s), the label domain comes from FAIR_DOMAIN in .env.

The runtime default in fair/zenml/config.py can be overridden via FAIR_LABEL_DOMAIN env var.

Where the label domain appears
  • infra/dev/kind-config.yaml : node labels (fair.dev/role) and taints (fair.dev/workload)
  • infra/ci/kind-config.yaml : same, single-node CI variant
  • infra/dev/postgres/statefulset.yaml : nodeSelector fair.dev/role: infra
  • stacks/k8s.yaml / stacks/ci-k8s.yaml : pod node_selectors and tolerations
  • fair/zenml/config.py : reads FAIR_LABEL_DOMAIN at runtime (default fair.dev) for pipeline pod scheduling
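The override behaviour described above amounts to an environment lookup with a hardcoded default, from which label and selector keys are derived. A minimal sketch (the helper names are hypothetical; see fair/zenml/config.py for the real implementation):

```python
# Sketch: resolve the label domain (FAIR_LABEL_DOMAIN override, fair.dev
# default) and build node-selector keys from it.
import os

DEFAULT_LABEL_DOMAIN = "fair.dev"


def label_domain() -> str:
    """FAIR_LABEL_DOMAIN wins if set; otherwise the dev default."""
    return os.environ.get("FAIR_LABEL_DOMAIN", DEFAULT_LABEL_DOMAIN)


def node_selector(role: str) -> dict[str, str]:
    """Build a nodeSelector like {'fair.dev/role': 'infra'}."""
    return {f"{label_domain()}/role": role}
```

For example, with FAIR_LABEL_DOMAIN unset, node_selector("infra") yields the same key/value pair used in infra/dev/postgres/statefulset.yaml.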
Decisions

kind over minikube/k3s : hotosm/k8s-infra runs upstream K8s (EKS). kind runs upstream K8s in Docker containers with guaranteed API compatibility. Lightweight, no VM. (This may be revisited: Talos is recommended in our docs, and kind was chosen mainly because of Talos's learning curve.)

Single PostgreSQL, three databases : ZenML, pgstac, and MLflow all need Postgres. One StatefulSet with init SQL creating the zenml, fair_models, and mlflow databases. Mirrors production, where CloudNativePG hosts the databases the same way.

MLflow over W&B : Apache 2.0, uses Postgres (same engine as everything else), mature Helm chart, ZenML first-class --flavor=mlflow support. W&B self-hosted requires MySQL + Redis + commercial license.

eoAPI for STAC : Production deploys eoAPI at stac.ai.hotosm.org (k8s-infra/apps/fair/eoapi/values.yaml). Dev uses the same chart (v0.12.0) with external-plaintext DB.

ZenML Postgres patch : OSS ZenML only supports MySQL/SQLite. The patched server image at ghcr.io/hotosm/zenml-postgres replaces MySQL dialect (MEDIUMTEXT) with Postgres equivalents. The client side is handled automatically by fair-py-ops: a .pth startup hook (fair/_patch_zenml.py) adds the POSTGRESQL enum variant to ServerDatabaseType at interpreter startup, before any ZenML import. No manual client patching is needed.
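To illustrate the startup-hook mechanism (not ZenML's actual code — the real hook lives in fair/_patch_zenml.py and targets ZenML's real enum), here is a toy sketch of registering an extra member on an existing Enum class before any consumer sees it. The stand-in class and helper are assumptions; only the technique is the point.

```python
# Toy illustration of the .pth-hook idea: add a POSTGRESQL member to an
# already-defined Enum at interpreter startup. The class below is a
# stand-in, NOT ZenML's ServerDatabaseType.
import enum


class ServerDatabaseType(enum.Enum):  # stand-in for ZenML's enum
    MYSQL = "mysql"
    SQLITE = "sqlite"


def extend_enum(cls, name: str, value):
    """Register a new member on an existing Enum class (relies on CPython
    Enum internals; fragile across versions, shown for illustration only)."""
    member = object.__new__(cls)
    member._name_ = name
    member._value_ = value
    setattr(cls, name, member)  # must precede the map updates below
    cls._member_map_[name] = member
    cls._value2member_map_[value] = member
    cls._member_names_.append(name)
    return member


extend_enum(ServerDatabaseType, "POSTGRESQL", "postgresql")
```

After the hook runs, ServerDatabaseType("postgresql") resolves normally, which is why no manual client patching is needed.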

StacBackend Protocol : StacCatalogManager writes local JSON files. PgStacBackend writes to pgstac via pypgstac. Both conform to the StacBackend Protocol (structural subtyping). run.py --stac-api-url selects pgstac; omit for local.
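The structural-subtyping arrangement can be sketched as follows. The method name write_item and the backend class are assumptions for illustration; the real Protocol in the codebase defines its own signatures.

```python
# Sketch: a Protocol that both backends satisfy structurally, with no
# inheritance required. Names here are illustrative, not the real API.
from typing import Protocol, runtime_checkable


@runtime_checkable
class StacBackend(Protocol):
    def write_item(self, collection_id: str, item: dict) -> None: ...


class LocalJsonBackend:
    """Stand-in for StacCatalogManager-style local JSON writes."""

    def __init__(self) -> None:
        self.written: list[tuple[str, dict]] = []

    def write_item(self, collection_id: str, item: dict) -> None:
        self.written.append((collection_id, item))


# Conformance is structural: LocalJsonBackend never names StacBackend.
backend: StacBackend = LocalJsonBackend()
```

A PgStacBackend-style class with the same method shape would satisfy the Protocol identically, which is what lets run.py swap backends based on --stac-api-url.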

PgStacBackend reads via pystac-client : The eoAPI chart injects --root-path=/stac by default, which breaks self-links under direct port-forwarding. Dev values set stac.overrideRootPath: "" to remove it, so pystac-client works correctly against http://localhost:8082.

GPU scheduling from STAC metadata : mlm:accelerator and mlm:accelerator_count in stac-item.json drive nvidia.com/gpu resource requests. config.py reads these and emits pod settings only when the orchestrator is Kubernetes.
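The metadata-to-resources mapping can be sketched as below. The function name is hypothetical and the real config.py additionally gates on the orchestrator being Kubernetes; this only shows how mlm:accelerator fields could translate into an nvidia.com/gpu request.

```python
# Sketch: derive a pod GPU resource request from a STAC item's
# mlm:accelerator / mlm:accelerator_count properties.
def gpu_resources(stac_item: dict) -> dict[str, str]:
    """Return a resource-request fragment, or {} when no CUDA GPU is asked for."""
    props = stac_item.get("properties", {})
    if props.get("mlm:accelerator") != "cuda":
        return {}
    count = props.get("mlm:accelerator_count", 1)
    return {"nvidia.com/gpu": str(count)}
```

An item declaring "mlm:accelerator": "cuda" with "mlm:accelerator_count": 2 would thus request two GPUs, while CPU-only items produce no GPU request at all.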

Dev -> Prod delta

Environment comparison

Dev (kind)              Prod (EKS)
PG StatefulSet          CloudNativePG cluster
MinIO                   AWS S3
eoAPI dev values        k8s-infra/apps/fair/eoapi/values.yaml
ZenML Helm (same OCI)   TBF
MLflow dev values       TBF
kind kubeconfig         TBF

Known issues

eoAPI root_path (resolved)

The chart's deployment template injects --root-path={{ .Values.stac.ingress.path }} (defaults to /stac) into the uvicorn command when an ingress class is set. Dev values set stac.overrideRootPath: "" which removes the arg entirely, so pystac-client works via direct port-forwarding.

References