Kubernetes Dev Stack¶
Local kind cluster mirroring the EKS deployment from hotosm/k8s-infra.
Quickstart¶
Prerequisites
kind, kubectl, helm, helmfile, mc (MinIO client), and colima (macOS) or Docker Engine (Linux).
just setup in k8s mode checks all of these are on $PATH before proceeding.
For GPU support: nvkind, NVIDIA driver,
nvidia-container-toolkit.
See GPU Support below.
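The actual check lives in a justfile recipe; a minimal Python sketch of the same $PATH verification (tool list from above, the function name is an assumption):

```python
import shutil

REQUIRED_TOOLS = ["kind", "kubectl", "helm", "helmfile", "mc"]

def missing_tools(tools=tuple(REQUIRED_TOOLS)):
    """Return the tools not found on $PATH (an empty list means ready to go)."""
    return [t for t in tools if shutil.which(t) is None]
```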
The dev infra source files live in infra/dev.
```shell
just k8s              # switch to k8s mode (sticky, one-time)
just setup            # install deps + k8s extras + verify CLI tools
cd infra/dev
just up               # creates cluster if missing, deploys infra, starts port-forwards
just status           # show cluster, pods, port-forward health
just down             # stop port-forwards (cluster stays for fast restart)
just tear             # destroy everything
just example          # E2E with local orchestrator against k8s infra (from repo root)
just run-example-k8s  # E2E with k8s orchestrator (from infra/dev)
```
Verifying results¶
After `just run-example` completes, inspect the outputs at:
| What | URL |
|---|---|
| ZenML dashboard (pipelines, steps, artifacts) | http://localhost:8080 (login: default / empty password) |
| STAC collections (registered & promoted models) | http://localhost:8082/collections |
| MLflow experiments (training metrics, model registry) | http://localhost:5000 |
| MinIO browser (raw S3 objects) | http://localhost:9000 (login: minioadmin / minioadmin) |
ZenML Stacks¶
just up registers two stacks:
| | Local stack | K8s stack |
|---|---|---|
| Orchestrator | default (local process) | k8s_orchestrator |
| S3 Endpoint | localhost:9000 | minio.fair.svc:9000 |
| MLflow | localhost:5000 | mlflow.fair.svc:80 |
| Use | Local runs via port-forward (just run-example) | In-cluster jobs (just run-example-k8s) |
Architecture¶
All services run in namespace fair on a 3-node kind cluster (1 CP + 2 workers).
```text
postgres (PG 17 + PostGIS)            zenml (ghcr.io/hotosm/zenml-postgres:0.93.3)
  DBs: zenml, fair_models, mlflow       Official Helm chart, OCI registry
  |                                     |
  +--- stac-fastapi-pgstac              +--- mlflow (community-charts/mlflow)
  |      eoapi-k8s chart                       PG backend + S3 artifacts
  |
  +--- minio (s3://fair-data, s3://mlflow, s3://zenml)
```
Port-forwards (managed by just up / just down)
| Service | Local | Cluster |
|---|---|---|
| ZenML | localhost:8080 | zenml.fair.svc:80 |
| STAC API | localhost:8082 | stac-stac.fair.svc:8080 |
| MinIO | localhost:9000 | minio.fair.svc:9000 |
| MLflow | localhost:5000 | mlflow.fair.svc:80 |
| Postgres | localhost:5432 | postgres.fair.svc:5432 |
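`just up` manages these forwards; if one dies, the equivalent manual command can be rebuilt from the table above. A sketch (the helper name is hypothetical):

```python
def port_forward_cmd(svc: str, local: int, remote: int, namespace: str = "fair") -> str:
    """Build the kubectl command equivalent to one managed port-forward."""
    return f"kubectl -n {namespace} port-forward svc/{svc} {local}:{remote}"

# e.g. the ZenML row of the table above:
print(port_forward_cmd("zenml", 8080, 80))
# kubectl -n fair port-forward svc/zenml 8080:80
```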
GPU Support (optional)¶
Follow the nvkind prerequisites and setup guide to install the NVIDIA driver, nvidia-container-toolkit, and nvkind on your host. Once nvkind is on $PATH, just up handles the rest.
What just up does
kind-config.yaml labels workers as inference and train, with the train
node getting extraMounts that signal GPU presence to nvkind. The cluster
creation step runs nvkind (installs toolkit inside the node, configures containerd).
The infra step creates the nvidia RuntimeClass, labels the GPU node, and
deploys the device plugin.
Caveats
- PatchProcDriverNvidia may fail on non-MIG single-GPU hosts; non-critical, the justfile tolerates it.
- nvkind restarts containerd on the GPU node, briefly disrupting colocated pods.
- Device plugin uses `--set deviceDiscoveryStrategy=nvml` (the default `auto` fails inside kind).
Configuration¶
Label domain¶
Node labels and taints use the fair.dev prefix (hardcoded in all dev/CI config files).
For production (dok8s), the label domain comes from FAIR_DOMAIN in .env.
The runtime default in fair/zenml/config.py can be overridden via FAIR_LABEL_DOMAIN env var.
Where the label domain appears
- infra/dev/kind-config.yaml: node labels (fair.dev/role) and taints (fair.dev/workload)
- infra/ci/kind-config.yaml: same, single-node CI variant
- infra/dev/postgres/statefulset.yaml: nodeSelector fair.dev/role: infra
- stacks/k8s.yaml / stacks/ci-k8s.yaml: pod node_selectors and tolerations
- fair/zenml/config.py: reads FAIR_LABEL_DOMAIN at runtime (default fair.dev) for pipeline pod scheduling
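The runtime half of this is small; a sketch of the env-var fallback (the real code lives in fair/zenml/config.py, this helper name is an assumption):

```python
import os

def label_domain(default: str = "fair.dev") -> str:
    """FAIR_LABEL_DOMAIN overrides the built-in default at runtime."""
    return os.environ.get("FAIR_LABEL_DOMAIN", default)

# Labels and taints are then built from the domain, e.g. f"{label_domain()}/role"
```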
Decisions
kind over minikube/k3s: hotosm/k8s-infra runs upstream K8s (EKS). kind runs
upstream K8s in Docker containers with guaranteed API compatibility; lightweight, no VM. (This may be revisited now that Talos is recommended in our docs; kind remains for now mainly because of Talos's learning curve.)
Single PostgreSQL, three databases: ZenML, pgstac, and MLflow all need Postgres.
One StatefulSet with init SQL (a CREATE DATABASE each for zenml, fair_models, and mlflow). Mirrors
production, where CloudNativePG hosts the databases the same way.
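The init SQL amounts to something like the following (a sketch; the actual script ships with the StatefulSet manifest):

```sql
-- One database per consumer, single Postgres instance
CREATE DATABASE zenml;
CREATE DATABASE fair_models;
CREATE DATABASE mlflow;
```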
MLflow over W&B: Apache 2.0, uses Postgres (same engine as everything else),
mature Helm chart, ZenML first-class --flavor=mlflow support. W&B self-hosted
requires MySQL + Redis + commercial license.
eoAPI for STAC: Production deploys eoAPI at stac.ai.hotosm.org
(k8s-infra/apps/fair/eoapi/values.yaml). Dev uses the same chart (v0.12.0)
with external-plaintext DB.
ZenML Postgres patch: OSS ZenML only supports MySQL/SQLite. The patched server
image at ghcr.io/hotosm/zenml-postgres
replaces MySQL dialect (MEDIUMTEXT) with Postgres equivalents. The client side is
handled automatically by fair-py-ops: a .pth startup hook
(fair/_patch_zenml.py) adds the POSTGRESQL enum variant to
ServerDatabaseType at interpreter startup, before any ZenML import. No manual
client patching is needed.
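The .pth mechanism works because CPython executes `import` lines found in site-packages `*.pth` files during site initialization, i.e. before any user code runs; the hook file is shaped roughly like this (the filename is an assumption):

```text
# <site-packages>/_fair_patch_zenml.pth
import fair._patch_zenml
```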
StacBackend Protocol: StacCatalogManager writes local JSON files.
PgStacBackend writes to pgstac via pypgstac. Both conform to the StacBackend
Protocol (structural subtyping). run.py --stac-api-url selects pgstac; omit for local.
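A minimal sketch of the structural-subtyping arrangement described above (the method name and the stand-in class are assumptions, not the real API):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class StacBackend(Protocol):
    """Structural interface: any class with matching methods conforms."""
    def write_item(self, item: dict) -> None: ...

class LocalJsonBackend:
    """Stands in for the local-JSON path (hypothetical sketch)."""
    def __init__(self):
        self.items = []
    def write_item(self, item: dict) -> None:
        self.items.append(item)

# No inheritance required; conformance is checked structurally.
assert isinstance(LocalJsonBackend(), StacBackend)
```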
PgStacBackend reads via pystac-client: The eoAPI chart injects
--root-path=/stac by default, which breaks self-links under direct port-forwarding.
Dev values set stac.overrideRootPath: "" to remove it, so pystac-client works
correctly against http://localhost:8082.
GPU scheduling from STAC metadata : mlm:accelerator and mlm:accelerator_count
in stac-item.json drive nvidia.com/gpu resource requests. config.py reads these
and emits pod settings only when the orchestrator is Kubernetes.
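A sketch of that mapping (the `mlm:` field names are from the text; the function and the exact settings shape are assumptions, not the real config.py code):

```python
def gpu_pod_settings(stac_item: dict, orchestrator: str) -> dict:
    """Translate MLM accelerator metadata into k8s GPU resource requests.

    Emits settings only when the orchestrator is Kubernetes, mirroring
    the behaviour described above (sketch, not the real implementation).
    """
    props = stac_item.get("properties", {})
    count = props.get("mlm:accelerator_count", 0)
    if orchestrator != "kubernetes" or not count:
        return {}
    return {"resources": {"limits": {"nvidia.com/gpu": count}}}
```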
Dev -> Prod delta¶
Environment comparison
| Dev (kind) | Prod (EKS) |
|---|---|
| PG StatefulSet | CloudNativePG cluster |
| MinIO | AWS S3 |
| eoAPI dev values | k8s-infra/apps/fair/eoapi/values.yaml |
| ZenML Helm (same OCI) | TBF |
| MLflow dev values | TBF |
| kind kubeconfig | TBF |
Known issues¶
eoAPI root_path (resolved)
The chart's
deployment template
injects --root-path={{ .Values.stac.ingress.path }} (defaults to /stac) into
the uvicorn command when an ingress class is set. Dev values set
stac.overrideRootPath: "" which removes the arg entirely, so pystac-client
works via direct port-forwarding.
References¶
Further reading