Contributing a Model¶

Guide for contributing a base model to fAIr. A base model is a reusable ML blueprint that users can finetune on their own datasets through the fAIr platform.

Reference Implementations¶

Model	Task	Architecture	Directory
UNet segmentation	Semantic segmentation	UNet (torchgeo)	`models/unet_segmentation/`
ResNet18 classification	Binary classification	ResNet18 (torchvision)	`models/resnet18_classification/`
YOLOv11n detection	Object detection	YOLOv11 nano (ultralytics)	`models/yolo11n_detection/`

Model Scope¶

fAIr targets feature extraction from very high resolution (VHR) aerial and satellite imagery ; typically ~ > 30 cm ground sample distance (GSD), RGB only. All imagery is sourced from OpenAerialMap.

Supported Tasks¶

Task	STAC value	Label mapping	Typical output
Semantic segmentation	`semantic-segmentation`	`segmentation`	polygons
Instance segmentation	`instance-segmentation`	`segmentation`	polygons
Object detection	`object-detection`	`detection`	boxes or polygons
Classification	`classification`	`classification`	existing geometries with attributes

Your mlm:tasks must use one or more of these exact values. CI rejects anything else.

Supported Feature Categories¶

fAIr is a humanitarian mapping platform. Models should prioritise features that support disaster response, infrastructure mapping, and environmental monitoring. Core categories:

Keyword	Examples
`building`	Residential, commercial, industrial footprints; damaged vs. undamaged assessment
`road`	Highway classification (primary, secondary, tertiary); paved vs. unpaved surface detection
`tree`	Individual canopy, tree cover areas
`water`	Rivers, lakes, ponds, reservoirs

Other OpenStreetMap feature categories (landuse, bridge, etc.) are welcome as long as they are compatible with the platform's RGB input and vector output constraints. To add a new keyword, include it in keywords.json as part of your PR.

Input Requirements¶

RGB only

All models receive 3-band RGB GeoTIFF chips as input. The platform does not accept non-RGB inputs (e.g. multispectral, SAR, DEM).

Field	Value
Bands	`red`, `green`, `blue` (3 channels, RGB)
Shape	`[-1, 3, H, W]` where H and W are the chip size
Dimension order	`["batch", "bands", "height", "width"]`

Models must normalize the uint8 pixel values (0-255) in their preprocess function.

Output Requirements¶

fAIr only supports vector output. Your model's final output must produce GeoJSON geometries of one of these types:

Geometry type	Keyword	Typical task
`Polygon`	`polygon`	Building footprints, land parcels
`LineString`	`line`	Roads, waterways
`Point`	`point`	Tree detection, POI extraction

Your stac-item.json must declare exactly which geometry type the model produces via the keywords array. CI enforces that at least one of polygon, line, or point is present.

Raster-only output (e.g. raw segmentation masks without vectorization) is acceptable as an intermediate step, but the post_processing_function must ultimately convert to one of the supported geometry types for downstream consumption.

Sample Data Layout¶

data/sample/

data/sample/
  train/
    oam/             # RGB GeoTIFF chips (OAM-{x}-{y}-{z}.tif, ≥30cm GSD)
    osm/             # GeoJSON labels (osm_features_*.geojson)
  predict/
    oam/             # Input chips for inference
    predictions/     # Output directory (model writes here)

Chip filenames follow the pattern OAM-{x}-{y}-{z}.tif where x, y, z are tile coordinates. Your model must accept these as input during both training and inference.

Prerequisites¶

Before starting, ensure you have:

A working ML model for geospatial feature extraction (buildings, roads, trees, etc.)
Pretrained weights that are publicly downloadable or distributable
Familiarity with Docker and Python packaging

License¶

Required: Open-source license

Your model must use one of these open-source licenses:

License	SPDX identifier
GNU AGPL v3	`AGPL-3.0-only`
MIT	`MIT`
Apache 2.0	`Apache-2.0`
BSD 3-Clause	`BSD-3-Clause`

The license is declared in your stac-item.json under properties.license. CI rejects any other license value.

Directory Structure¶

Create a subdirectory under models/ named after your model (lowercase, hyphens for spaces):

Model directory structure

models/your-model/
  pipeline.py          # ZenML pipeline with training + inference
  Dockerfile           # Self-contained runtime environment
  stac-item.json       # STAC MLM item (model metadata)
  README.md            # Model overview, limitations, citation

pipeline.py¶

This is the core of your contribution. It must export two @pipeline-decorated functions that the platform discovers and dispatches automatically.

Required Exports¶

pipeline.py

from zenml import pipeline, step

@pipeline
def training_pipeline(...) -> None:
    """Finetune the model on a dataset."""
    ...

@pipeline
def inference_pipeline(...) -> None:
    """Run prediction on input imagery."""
    ...

@step
def split_dataset(...) -> Annotated[dict[str, Any], "split_info"]:
    """Split data into train/val sets and log split metadata."""
    ...

CI validates these exports via AST parsing (scripts/validate_model.py). Both @pipeline functions and the @step split_dataset function are required. No runtime dependencies are needed for the check to pass.

Required Functions¶

Your pipeline.py must also define:

Function	Role	Referenced by
`preprocess`	Normalize/transform input data before the model	`stac-item.json` `mlm:input[].pre_processing_function`
`postprocess`	Convert raw model output to usable predictions	`stac-item.json` `mlm:output[].post_processing_function`

These are referenced as Python entrypoints in the STAC item (e.g. models.your_model.pipeline:preprocess). The platform calls them dynamically ; your model owns its own pre/post processing logic entirely.

Training Pipeline¶

The training_pipeline must follow this step sequence:

Required pipeline shape

@pipeline
def training_pipeline(
    base_model_weights: str,
    dataset_chips: str,
    dataset_labels: str,
    num_classes: int,
    hyperparameters: dict[str, Any],
) -> None:
    split_info = split_dataset(dataset_chips, dataset_labels, hyperparameters)
    trained_model = train_model(..., split_info=split_info)
    evaluate_model(trained_model, ..., split_info=split_info)
    export_onnx(trained_model, ...)

Step	Required	Purpose
`split_dataset`	Yes	Split data into train/val, log split metadata to ZenML
`train_model`	Yes	Train on train split only
`evaluate_model`	Yes	Evaluate on val split only
`export_onnx`	Yes	Export ONNX model with `onnx.checker.check_model()` validation

The split_info dict returned by split_dataset is passed as a dependency to both train_model and evaluate_model. This enforces that both steps use the same split and that the split step runs first.

The pipeline receives its parameters from a generated YAML config (STAC mlm:hyperparameters merged with user overrides via fair.zenml.config.generate_training_config). Typical parameters:

dataset_chips / dataset_labels : S3 or local paths to training data
base_model_weights : pretrained weight reference (URL, enum, local path)
epochs, batch_size, learning_rate, weight_decay : training hyperparameters
chip_size, num_classes : model-specific configuration
val_ratio, split_seed : train/val split configuration

All hyperparameters must have validation constraints in the function signature (see Hyperparameters). The platform rejects invalid values at submission time, before any pod is scheduled.

Use mlflow.log_params() and mlflow.log_metrics() for experiment tracking. Use zenml.log_metadata() to attach metrics to the ZenML model version.

Auto-injected Parameters¶

The platform automatically injects several parameters into your training_pipeline from the STAC items. Your function signature must accept them, but you do not declare them in mlm:hyperparameters:

Parameter	Source	Description
`model_name`	User input	ZenML model name for the finetuned model
`base_model_id`	Base model STAC item ID	Identifies which base model is being finetuned
`dataset_id`	Dataset STAC item ID	Identifies which dataset is used
`num_classes`	`len(classification:classes)`	Extracted from STAC output spec
`class_names`	`classification:classes[].name`	Class name list from STAC output spec
`chip_size`	`mlm:input[0].input.shape[-1]`	Chip dimension from STAC input spec
`dataset_chips`	Dataset `chips` asset href	Path to training images
`dataset_labels`	Dataset `labels` asset href	Path to training labels

Train/Val Split (split_dataset step)¶

Every training pipeline must include a split_dataset step. This step is the single source of truth for how data is divided into training and validation sets. CI enforces its presence via AST parsing.

Your split_dataset step must:

Accept dataset_chips, dataset_labels, and hyperparameters
Read val_ratio and split_seed from hyperparameters
Perform the split (strategy is model-specific)
Log split metadata to ZenML via log_metadata(metadata={"fair/split": split_info})
Return a split_info dict

The split_info dict must contain:

Key	Type	Description
`strategy`	string	Split strategy: `"random"`, `"spatial"`, or custom
`val_ratio`	float	Actual validation ratio used
`seed`	int	Random seed for reproducibility
`train_count`	int	Number of training samples
`val_count`	int	Number of validation samples
`description`	string	Human-readable explanation of how the split works

Example implementation:

split_dataset step

@step
def split_dataset(
    dataset_chips: str,
    dataset_labels: str,
    hyperparameters: dict[str, Any],
) -> Annotated[dict[str, Any], "split_info"]:
    val_ratio = hyperparameters.get("val_ratio", 0.2)
    seed = hyperparameters.get("split_seed", 42)

    # Your split logic here
    train_samples, val_samples = do_split(dataset_chips, dataset_labels, val_ratio, seed)

    split_info = {
        "strategy": "random",
        "val_ratio": val_ratio,
        "seed": seed,
        "train_count": len(train_samples),
        "val_count": len(val_samples),
        "description": "Random split by seeded shuffle of sorted filenames",
    }
    log_metadata(metadata={"fair/split": split_info})
    return split_info

The split metadata flows through the promotion pipeline into the local model STAC item as fair:split, giving users full visibility into how each finetuned model was trained.

Train on train, evaluate on val

train_model must only see training data. evaluate_model must only see validation data. Both steps receive split_info and must use it to reconstruct the same split deterministically. Evaluating on training data produces inflated metrics that do not reflect real-world performance.

Non-serializable Model Pattern (YOLO)¶

Some ML frameworks produce model objects that are not pickle-serializable (e.g. ultralytics YOLO). In these cases, your train_model step should return the file path to the saved checkpoint instead of the model object itself. ZenML will materialize the .pt file into the artifact store.

See models/yolo11n_detection/pipeline.py for a working example of this pattern.

Inference Pipeline¶

The inference_pipeline loads weights and runs prediction. It must support both base model weights (pretrained) and finetuned weights (from ZenML artifact store). Use fair.zenml.steps.load_model to load finetuned weights:

Inference pipeline example

from fair.zenml.steps import load_model

@pipeline
def inference_pipeline(
    model_uri: str,
    input_images: str,
    chip_size: int,
    num_classes: int,
    zenml_artifact_version_id: str = "",
    use_base_model: bool = False,
) -> None:
    if use_base_model:
        model = load_base_model(model_uri=model_uri, num_classes=num_classes)
    else:
        model = load_model(model_uri=model_uri, zenml_artifact_version_id=zenml_artifact_version_id)
    run_inference(model=model, input_images=input_images, chip_size=chip_size, num_classes=num_classes)

Data Resolution¶

Training data lives in S3 (production) or local filesystem (dev). Use the helpers from fair.utils.data to handle both transparently:

Data resolution helpers

from fair.utils.data import resolve_directory, resolve_path

local_chips = str(resolve_directory(chips_path, "OAM-*.tif"))
local_labels = resolve_path(labels_path)

Warning

Never hardcode paths. Never bake data into Docker images.

Dockerfile¶

Your Dockerfile must be self-contained: building and running the image alone should be sufficient to execute both training and inference pipelines. No external dependencies beyond what is installed in the image.

Requirements:

Multi-stage build recommended (builder + slim runtime)
Install all Python dependencies including fair-py-ops and your ML framework
Copy your model code into the image at models/your_model/
Set ENTRYPOINT ["/usr/local/bin/python"]

Reference Dockerfile structure:

Dockerfile

FROM ghcr.io/astral-sh/uv:python3.13-trixie-slim AS builder
ENV UV_SYSTEM_PYTHON=1 UV_LINK_MODE=copy

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential libgdal-dev && rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install \
    your-ml-framework \
    fair-py-ops 

FROM python:3.13-slim-trixie
WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    libexpat1 libgdal36 && rm -rf /var/lib/apt/lists/*

COPY --from=builder /usr/local/lib/python3.13/site-packages /usr/local/lib/python3.13/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin

COPY models/your_model models/your_model
ENTRYPOINT ["/usr/local/bin/python"]

The image is built from the repository root (not from models/your_model/), so paths in COPY are relative to the repo root.

stac-item.json¶

The STAC item is your model's metadata card. It follows the STAC MLM Extension v1.5.1 and is validated by CI against the platform's requirements schema.

Required Extensions

"stac_extensions": [
    "https://stac-extensions.github.io/mlm/v1.5.1/schema.json",
    "https://stac-extensions.github.io/version/v1.2.0/schema.json",
    "https://stac-extensions.github.io/classification/v2.0.0/schema.json",
    "https://stac-extensions.github.io/file/v2.1.0/schema.json",
    "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
]

Required Properties¶

Property	Type	Description
`title`	string	Human-readable model name (shown in catalog UI)
`description`	string	One-paragraph summary of the model and its intended use
`mlm:name`	string	Model identifier (matches directory name)
`mlm:architecture`	string	Architecture name (e.g. `UNet`, `YOLOv8`)
`mlm:tasks`	string[]	One or more of: `semantic-segmentation`, `instance-segmentation`, `object-detection`, `classification`
`mlm:framework`	string	`PyTorch` or `TensorFlow`
`mlm:framework_version`	string	Framework version
`mlm:pretrained`	boolean	Whether pretrained weights are used
`mlm:pretrained_source`	string	URL of the paper or dataset the weights come from
`mlm:input`	object[]	Input specification with `pre_processing_function`
`mlm:output`	object[]	Output specification with `post_processing_function` and `classification:classes`
`mlm:hyperparameters`	object	Default training hyperparameters
`keywords`	string[]	Feature tags + task + output geometry type
`version`	string	Semantic version (start with `"1"`)
`license`	string	SPDX license identifier
`fair:metrics_spec`	object[]	Evaluation metrics vocabulary (see below)
`fair:split_spec`	object	Train/val split specification (see below)

fair:metrics_spec¶

The MLM extension does not define evaluation metrics semantics. fair:metrics_spec fills this gap by declaring the meaning and storage location of each evaluation metric your model produces during evaluate_model. Users need this to understand what "accuracy" means (pixel accuracy? per-class? mean IoU?).

Each entry must declare:

Field	Type	Description
`key`	string	Property key where the metric is stored on the local model STAC item (e.g. `fair:accuracy`)
`name`	string	Human-readable metric name
`description`	string	Precise definition including averaging strategy

Example:

fair:metrics_spec example

"fair:metrics_spec": [
    {
        "key": "fair:accuracy",
        "name": "Pixel Accuracy",
        "description": "Fraction of correctly classified pixels across all classes"
    },
    {
        "key": "fair:mean_iou",
        "name": "Mean IoU (macro)",
        "description": "Macro-averaged IoU across classes; each class weighted equally"
    },
    {
        "key": "fair:per_class_iou",
        "name": "Per-class IoU",
        "description": "IoU per class, stored as object keyed by class name from classification:classes"
    }
]

When evaluate_model logs metrics via log_metadata(infer_model=True), the platform copies those values to the promoted local model STAC item. Class IoU keys use the classification:classes names, e.g. iou_background, iou_building (not numeric indices like iou_class_0).

fair:split_spec¶

The fair:split_spec property declares how your model expects training data to be split into train and validation sets. This is a required property on base model STAC items. CI validates its presence and structure.

Field	Type	Description
`strategy`	string	Split strategy: `"random"`, `"spatial"`, or custom
`default_ratio`	float	Recommended validation ratio (0 < ratio < 1)
`seed`	int	Default random seed for reproducibility
`description`	string	Explanation of how the split works for this model

The split strategy depends on the task type:

Task	Strategy	Description
Classification	`random`	Seeded shuffle of sorted filenames, split at ratio boundary
Segmentation	`spatial`	`RandomGeoSampler` for train, `GridGeoSampler` for val (non-overlapping tiles)
Detection	`random`	Last N% of sorted image IDs held out for validation

Example:

fair:split_spec example

"fair:split_spec": {
    "strategy": "random",
    "default_ratio": 0.2,
    "seed": 42,
    "description": "Random split by seeded shuffle of sorted filenames. Deterministic given the same seed."
}

Contributors can define custom split strategies as long as they document the approach in description and implement the corresponding split_dataset step. The val_ratio and split_seed hyperparameters allow users to override the defaults at finetuning time.

Keywords¶

The keywords array must include:

At least one feature keyword: building, road, tree, water, landuse
At least one task keyword: matches mlm:tasks values
Exactly one geometry type: polygon, line, or point

Example: ["building", "semantic-segmentation", "polygon"]

Hyperparameters¶

The mlm:hyperparameters object in your STAC item declares the default training configuration. When users finetune your model, the platform reads these defaults and merges any user overrides into a generated YAML config (via fair.zenml.config.generate_training_config). This YAML is then passed to your training_pipeline.

Every key in mlm:hyperparameters becomes a pipeline parameter. Your training_pipeline signature must accept all of them and apply validation constraints using typing.Annotated and typing.Literal:

Hyperparameter validation example

from typing import Annotated, Literal
from annotated_types import Ge, Le

@pipeline
def training_pipeline(
    # ...dataset and model params...
    epochs: Annotated[int, Ge(1), Le(1000)],
    batch_size: Annotated[int, Ge(1), Le(64)],
    learning_rate: Annotated[float, Ge(1e-6), Le(1.0)],
    weight_decay: Annotated[float, Ge(0.0), Le(1.0)],
    chip_size: Annotated[int, Ge(64), Le(2048)],
    num_classes: Annotated[int, Ge(2), Le(256)],
    optimizer: Literal["Adam", "AdamW", "SGD"] = "AdamW",
    loss: Literal["CrossEntropyLoss", "BCEWithLogitsLoss"] = "CrossEntropyLoss",
) -> None:
    ...

This serves two purposes:

ZenML validates inputs at submission time — invalid overrides are rejected before any pod is scheduled
STAC item documents the contract — users and the platform know exactly what hyperparameters your model accepts and their valid ranges

In addition to model-specific hyperparameters, you must include these split and training parameters:

Parameter	Type	Required	Description
`val_ratio`	float	Yes	Fraction of data held out for validation (default 0.2)
`split_seed`	int	Yes	Random seed for reproducible train/val split (default 42)
`scheduler`	string	Recommended	LR scheduler: `"cosine"` or `"none"`
`max_grad_norm`	float	Recommended	Maximum gradient norm for clipping (default 1.0)

Example mlm:hyperparameters:

mlm:hyperparameters example

"mlm:hyperparameters": {
    "epochs": 15,
    "batch_size": 4,
    "learning_rate": 0.0001,
    "weight_decay": 0.0001,
    "chip_size": 512,
    "optimizer": "AdamW",
    "loss": "CrossEntropyLoss",
    "val_ratio": 0.2,
    "split_seed": 42,
    "scheduler": "cosine",
    "max_grad_norm": 1.0
}

The platform auto-extracts chip_size from mlm:input[0].input.shape[-1] and num_classes from classification:classes length, so those don't need to be duplicated in mlm:hyperparameters unless your defaults differ.

Input Specification

Each entry in mlm:input must declare exactly 3 RGB bands and include a pre_processing_function with format and expression fields:

mlm:input example

"mlm:input": [{
    "name": "RGB chips",
    "bands": [{"name": "red"}, {"name": "green"}, {"name": "blue"}],
    "input": {
        "shape": [-1, 3, 512, 512],
        "dim_order": ["batch", "bands", "height", "width"],
        "data_type": "float32"
    },
    "pre_processing_function": {
        "format": "python",
        "expression": "models.your_model.pipeline:preprocess"
    }
}]

Output Specification

Each entry in mlm:output must include post_processing_function and classification:classes:

mlm:output example

"mlm:output": [{
    "name": "segmentation mask",
    "tasks": ["semantic-segmentation"],
    "result": {
        "shape": [-1, 2, 512, 512],
        "dim_order": ["batch", "channel", "height", "width"],
        "data_type": "float32"
    },
    "classification:classes": [
        {"name": "background", "value": 0},
        {"name": "building", "value": 1}
    ],
    "post_processing_function": {
        "format": "python",
        "expression": "models.your_model.pipeline:postprocess"
    }
}]

Required Assets¶

Asset key	Purpose	Required fields
`model`	Pretrained weights	`mlm:artifact_type` (e.g. `torch.save`, `onnx`)
`source-code`	Link to model source code (git URL)	`mlm:entrypoint` (e.g. `models.your_model.pipeline:training_pipeline`)
`mlm:training`	Training Docker image	`href` = Docker image reference
`mlm:inference`	Inference Docker image	`href` = Docker image reference
`readme`	Model documentation (README.md)	(none)

Model weights must be downloadable from the model asset href. This can be a direct URL, S3 path, or a framework-specific weight enum (e.g. torchgeo.models.Unet_Weights.OAM_RGB_RESNET50_TCD). Your pipeline.py is responsible for resolving and loading the weights at runtime.

The readme asset href must be an absolute URL to the raw file, not a relative path. Use the GitHub raw URL pattern:

"readme": {
    "href": "https://raw.githubusercontent.com/hotosm/fAIr-models/refs/heads/main/models/your_model/README.md",
    "type": "text/markdown",
    "roles": ["metadata"],
    "title": "Model README"
}

Relative paths such as ./README.md are not accessible from deployed STAC catalogs and will be flagged by validation.

The source-code asset href must point to the git repository (or tree URL) where the model's source code lives. This is validated by CI and displayed on the model's catalog page.

cite-as Link¶

If your model or its pretrained weights come from a published paper, add a cite-as link pointing to the canonical DOI or arXiv URL:

cite-as link example

{
    "rel": "cite-as",
    "href": "https://arxiv.org/abs/2407.11743",
    "type": "text/html",
    "title": "Paper title"
}

This link is displayed in the catalog UI. Use the canonical DOI URL when available (https://doi.org/...).

README.md¶

Every model must include a README.md in its directory. This is the human-readable documentation for your model ; it covers context that the STAC MLM item cannot express.

The README is referenced as a readme asset in stac-item.json with an absolute raw GitHub URL (see Required Assets above). Validation checks that the README file exists locally and that the asset is present in the STAC item.

What to include¶

Section	Content
Overview	One-paragraph summary: what the model does, target geography, intended use
Architecture	Model type, backbone, input/output shapes, key design choices
Pretrained source	Training dataset, paper reference, data license
Limitations	Known failure modes, geographic bias, resolution constraints
Usage	How to run training/inference locally, example commands
Citation	BibTeX or reference if the model or weights come from published work
License	License name (must match `properties.license` in `stac-item.json`)

Keep it concise. The STAC item already captures hyperparameters, input/output specs, and keywords ; the README is for everything else.

PR Checklist¶

Before opening a PR, make sure:

models/your-model/ includes pipeline.py, Dockerfile, stac-item.json, and README.md
README.md explains the model clearly enough for another developer to use it
just validate passes for the model and STAC item
training and inference both run on the sample data ( if sample data doesn't match , consider adding one )

The full requirements are described in the sections above, especially the STAC metadata, pipeline structure, assets, and README guidance. CI checks the detailed metadata, pipeline exports, Docker build, and consistency rules for you.

CI Checks¶

On PR submission, CI will:

Validate pipeline exports : scripts/validate_model.py checks for training_pipeline and inference_pipeline (@pipeline) and split_dataset (@step) via AST parsing
Validate STAC item : scripts/validate_stac_items.py checks all required properties (including fair:split_spec), extensions, assets, keywords (including geometry type), and license
Build Docker image -- verifies the Dockerfile builds successfully
Run tests with sample data -- executes against data/sample/

All checks must pass before the PR is reviewed.

Local Development¶

Local dev workflow

just setup                             # Install deps + ZenML init
just validate                          # Validate STAC items + model pipelines
just test                              # Run tests
just example                           # Run full example pipeline

Reference¶

STAC MLM Extension v1.5.1 -- MLM fields spec
MLM Best Practices
UNet segmentation model -- segmentation reference
ResNet18 classification model -- classification reference
YOLOv11n detection model -- detection reference
UNet STAC item -- STAC item template