Machine Learning

Warning

This API and its documentation are experimental, under heavy development, and subject to change.

Added in version 0.10.0.

Random Forest based Classification and Regression

openEO defines three processes for random forest based machine learning for Earth Observation applications:

  • fit_class_random_forest for training a random forest based classification model

  • fit_regr_random_forest for training a random forest based regression model

  • predict_random_forest for inference/prediction

The openEO Python Client library provides the necessary functionality to set up and execute training and inference workflows.

Training

Let’s focus on training a classification model, where we try to predict a class like a land cover type or crop type based on predictors we derive from EO data. For example, assume we have a GeoJSON FeatureCollection of sample points, each with a corresponding classification target value, as follows:

feature_collection = {"type": "FeatureCollection", "features": [
    {
        "type": "Feature",
        "properties": {"id": "b3dw-wd23", "target": 3},
        "geometry": {"type": "Point", "coordinates": [3.4, 51.1]}
    },
    {
        "type": "Feature",
        "properties": {"id": "r8dh-3jkd", "target": 5},
        "geometry": {"type": "Point", "coordinates": [3.6, 51.2]}
    },
    ...
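
If the sample points start out in tabular form, a FeatureCollection of this shape can be assembled with plain Python. The following is a minimal sketch: the helper name `build_feature_collection` and the tuple layout `(id, lon, lat, target)` are assumptions for illustration, and only the two sample points shown above are used.

```python
import json


def build_feature_collection(samples):
    """Turn (id, lon, lat, target) tuples into a GeoJSON FeatureCollection."""
    features = []
    for sample_id, lon, lat, target in samples:
        features.append({
            "type": "Feature",
            "properties": {"id": sample_id, "target": target},
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
        })
    return {"type": "FeatureCollection", "features": features}


# The two sample points from the example above.
samples = [
    ("b3dw-wd23", 3.4, 51.1, 3),
    ("r8dh-3jkd", 3.6, 51.2, 5),
]
feature_collection = build_feature_collection(samples)
print(json.dumps(feature_collection, indent=2))
```

Keeping the target value in the `properties` of each point means the same structure can later serve both as sampling geometry and as training labels.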

Note

Confusingly, the term “feature” has somewhat conflicting meanings for different audiences. GIS/EO people use “feature” to refer to the “rows” in this feature collection. For the machine learning community, however, the properties (the “columns”) are the features. To avoid confusion in this discussion, we will avoid the term “feature” and instead use “sample point” for the former and “predictor” for the latter.

We first build a datacube of “predictor” bands. For simplicity, we will just use the raw B02/B03/B04 band values here and use the temporal mean to eliminate the time dimension:

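# `connection` is an openEO connection (e.g. created with openeo.connect());
# `start`, `end` and `bbox` stand for your own temporal and spatial extent.
cube = connection.load_collection(
    "SENTINEL2",
    temporal_extent=[start, end],
    spatial_extent=bbox,
    bands=["B02", "B03", "B04"]
)
# Average over time, which removes the "t" dimension.
cube = cube.reduce_dimension(dimension="t", reducer="mean")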
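
The snippet above leaves `start`, `end` and `bbox` undefined. As one possible way to fill them in, the sketch below derives a bounding box from the sample points, padded by a small margin, in the `west`/`south`/`east`/`north` dictionary form that openEO accepts as a spatial extent (interpreted as longitude/latitude degrees when no CRS is given). The helper name `bbox_from_points`, the 0.1 degree margin and the date range are assumptions for illustration, not values from the original workflow.

```python
def bbox_from_points(feature_collection, margin=0.1):
    """Bounding box (in degrees) around all Point features, padded by `margin`."""
    lons, lats = [], []
    for feature in feature_collection["features"]:
        lon, lat = feature["geometry"]["coordinates"][:2]
        lons.append(lon)
        lats.append(lat)
    return {
        "west": min(lons) - margin,
        "south": min(lats) - margin,
        "east": max(lons) + margin,
        "north": max(lats) + margin,
    }


# Stand-in for the sample points (only the two shown earlier).
sample_points = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"id": "b3dw-wd23", "target": 3},
     "geometry": {"type": "Point", "coordinates": [3.4, 51.1]}},
    {"type": "Feature", "properties": {"id": "r8dh-3jkd", "target": 5},
     "geometry": {"type": "Point", "coordinates": [3.6, 51.2]}},
]}

bbox = bbox_from_points(sample_points)
# Hypothetical temporal extent; choose dates that match your samples.
start, end = "2021-01-01", "2021-12-31"
```

Padding the box slightly keeps sample points away from the edge of the loaded cube; the size of the margin is a judgment call.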

We now use aggregate_spatial to sample this raster data cube at the sample points and get a vector cube where we have the temporal mean of the B02/B03/B04 bands as predictor values:

predictors = cube.aggregate_spatial(feature_collection, reducer="mean")

We can now train a Random Forest model by calling the fit_class_random_forest() method on the predictor vector cube and passing the original target class data:

model = predictors.fit_class_random_forest(
    target=feature_collection,
)
# Save the model as a batch job result asset
# so that we can load it in another job.
model = model.save_ml_model()
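
The call above relies on default hyperparameters. As a hedged sketch, the same steps can be wrapped in a small helper that makes the tuning knobs explicit. It assumes that the client's `fit_class_random_forest()` accepts `num_trees`, `max_variables` and `seed` keyword arguments, mirroring the parameters of the openEO process of the same name; check the API reference of your client version before relying on this. The helper name `train_classifier` and the chosen values are illustrative only.

```python
def train_classifier(predictors, target, num_trees=200, max_variables=None, seed=42):
    """Fit a random forest classifier on a predictor vector cube and wrap
    the result in save_ml_model() so a batch job stores it as an asset.

    `predictors` is expected to be the vector cube returned by
    aggregate_spatial(); `target` is the GeoJSON FeatureCollection with
    the class labels. The keyword arguments are assumed to be supported
    by the installed client version.
    """
    model = predictors.fit_class_random_forest(
        target=target,
        num_trees=num_trees,          # number of trees in the forest
        max_variables=max_variables,  # split variables per node (None: use the default)
        seed=seed,                    # fixed seed for reproducible training
    )
    return model.save_ml_model()
```

With the objects from the text, this would be used as `model = train_classifier(predictors, feature_collection)`.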

Finally, execute this whole training flow as a batch job:

training_job = model.create_job()
training_job.start_and_wait()

Inference

When the batch job finishes successfully, the trained model can then be used with the predict_random_forest process on the raster data cube (or another cube with the same band structure) to classify all the pixels.

Technically, the openEO predict_random_forest process has to be used as a reducer function inside a reduce_dimension call. The openEO Python client library simplifies this by providing a predict_random_forest() method directly on the DataCube class, so that you can just do:

predicted = cube.predict_random_forest(
    model=training_job.job_id,
    dimension="bands"
)

predicted.download("predicted.GTiff")
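
To make the reducer remark above more concrete, the sketch below writes out, as a plain Python dictionary, roughly what the corresponding part of an openEO process graph could look like: a `reduce_dimension` node whose reducer is a single `predict_random_forest` call, with the model handed over through the `context` parameter. The node identifiers and the `load_ml_model` node with a placeholder id are assumptions for illustration; the graph actually generated by the client may differ in detail.

```python
import json


def predict_random_forest_graph(data_node, model_node, dimension="bands"):
    """Build a `reduce_dimension` node that applies `predict_random_forest`
    to the values along `dimension` (illustrative sketch)."""
    reducer = {
        "process_graph": {
            "predict1": {
                "process_id": "predict_random_forest",
                "arguments": {
                    # Values along the reduced dimension, e.g. for one pixel.
                    "data": {"from_parameter": "data"},
                    # The model arrives through reduce_dimension's context.
                    "model": {"from_parameter": "context"},
                },
                "result": True,
            }
        }
    }
    return {
        "process_id": "reduce_dimension",
        "arguments": {
            "data": {"from_node": data_node},
            "dimension": dimension,
            "reducer": reducer,
            "context": {"from_node": model_node},
        },
        "result": True,
    }


graph = {
    # Placeholder nodes: a real graph would carry full arguments here.
    "cube1": {"process_id": "load_collection", "arguments": {"id": "SENTINEL2"}},
    "model1": {"process_id": "load_ml_model", "arguments": {"id": "<job id>"}},
    "predict": predict_random_forest_graph("cube1", "model1"),
}
print(json.dumps(graph["predict"], indent=2))
```

Reducing along the "bands" dimension is what turns the per-pixel vector of predictor values into a single predicted value.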

We specified the model here by batch job id (a string), but it can also be specified in other ways: as a BatchJob instance, as a URL to the corresponding STAC Item that implements the ml-model extension, or as an MlModel instance (e.g. loaded through load_ml_model()).