DataCube construction

The load_collection process

The most straightforward way to start building your openEO data cube is through the load_collection process. As mentioned earlier, this is provided by the load_collection() method on a Connection object, which produces a DataCube instance. For example:

cube = connection.load_collection("SENTINEL2_TOC")

While this should cover the majority of use cases, there are some cases where one wants to build a DataCube object from something other than, or something more than, a simple load_collection process.

Construct DataCube from process

Through user-defined processes one can encapsulate one or more load_collection processes and additional processing steps in a single reusable user-defined process. For example, imagine a user-defined process “masked_s2” that loads an openEO collection “SENTINEL2_TOC” and applies some kind of cloud masking. The implementation details of the cloud masking are not important here, but let’s assume there is a parameter “dilation” to fine-tune the cloud mask. Also note that the collection id “SENTINEL2_TOC” is hardcoded in the user-defined process.

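As a sketch of how such a user-defined process might be created (the process graph content and the Parameter helper usage here are illustrative assumptions, not the actual "masked_s2" implementation):

from openeo.api.process import Parameter

# Hypothetical flat process graph for "masked_s2": the collection id
# "SENTINEL2_TOC" is hardcoded, while "dilation" is exposed as a parameter.
masked_s2_graph = {
    "lc": {"process_id": "load_collection", "arguments": {"id": "SENTINEL2_TOC"}},
    # ... cloud masking nodes that reference {"from_parameter": "dilation"} ...
}

connection.save_user_defined_process(
    user_defined_process_id="masked_s2",
    process_graph=masked_s2_graph,
    parameters=[Parameter.number("dilation", description="Cloud mask dilation factor")],
)
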
We can now construct a data cube from this user-defined process with datacube_from_process() as follows:

cube = connection.datacube_from_process("masked_s2", dilation=10)

# Further processing of the cube:
cube = cube.filter_temporal("2020-09-01", "2020-09-10")

Note that datacube_from_process() can be used with all kinds of processes, not only user-defined processes. For example, while this is not exactly a real EO data use case, it will produce a valid openEO process graph that can be executed:

>>> cube = connection.datacube_from_process("mean", data=[2, 3, 5, 8])
>>> cube.execute()
4.5

Construct a DataCube from JSON

openEO process graphs are typically stored and published in JSON format. Most notably, user-defined processes are transferred between openEO client and back-end in a JSON structure roughly like the following example:

{
  "id": "evi",
  "parameters": [
    {"name": "red", "schema": {"type": "number"}},
    {"name": "blue", "schema": {"type": "number"}},
    ...
  ],
  "process_graph": {
    "sub": {"process_id": "subtract", "arguments": {"x": {"from_parameter": "nir"}, "y": {"from_parameter": "red"}}},
    "p1": {"process_id": "multiply", "arguments": {"x": 6, "y": {"from_parameter": "red"}}},
    "div": {"process_id": "divide", "arguments": {"x": {"from_node": "sub"}, "y": {"from_node": "sum"}}},
    ...
  }
}

It is possible to construct a DataCube object that corresponds to this process graph with the Connection.datacube_from_json method. It can be given one of:

  • a raw JSON string,

  • a path to a local JSON file,

  • a URL that points to a JSON resource.

The JSON structure should be one of:

  • a mapping (dictionary) like the example above with at least a "process_graph" item, and optionally a "parameters" item.

  • a "flat graph" representation: a mapping (dictionary) with {"process_id": ...} items, like the value of a "process_graph" item.

Some examples

Load a DataCube from a raw JSON string, containing a simple “flat graph” representation:

raw_json = '''{
    "lc": {"process_id": "load_collection", "arguments": {"id": "SENTINEL2_TOC"}},
    "ak": {"process_id": "apply_kernel", "arguments": {"data": {"from_node": "lc"}, "kernel": [[1,2,1],[2,5,2],[1,2,1]]}, "result": true}
}'''
cube = connection.datacube_from_json(raw_json)

Load from a raw JSON string, containing a mapping with “process_graph” and “parameters”:

raw_json = '''{
    "parameters": [
        {"name": "kernel", "schema": {"type": "array"}, "default": [[1,2,1], [2,5,2], [1,2,1]]}
    ],
    "process_graph": {
        "lc": {"process_id": "load_collection", "arguments": {"id": "SENTINEL2_TOC"}},
        "ak": {"process_id": "apply_kernel", "arguments": {"data": {"from_node": "lc"}, "kernel": {"from_parameter": "kernel"}}, "result": true}
    }
}'''
cube = connection.datacube_from_json(raw_json)

Load directly from a local file or URL containing these kinds of JSON representations:

# Local file
cube = connection.datacube_from_json("path/to/my_udp.json")

# URL
cube = connection.datacube_from_json("https://example.com/my_udp.json")

Parameterization

When the process graph uses parameters, you must specify the desired parameter values at the time of calling Connection.datacube_from_json.

For example, consider this toy process graph that takes the sum of 5 and a parameter "increment":

raw_json = '''{"add": {
    "process_id": "add",
    "arguments": {"x": 5, "y": {"from_parameter": "increment"}},
    "result": true
}}'''

Trying to build a DataCube from it without specifying parameter values will fail like this:

>>> cube = connection.datacube_from_json(raw_json)
ProcessGraphVisitException: No substitution value for parameter 'increment'.

Instead, specify the parameter value:

>>> cube = connection.datacube_from_json(
...    raw_json,
...    parameters={"increment": 4},
... )
>>> cube.execute()
9

Parameters can also be defined with default values, which will be used when they are not specified in the Connection.datacube_from_json call:

raw_json = '''{
    "parameters": [
        {"name": "increment", "schema": {"type": "number"}, "default": 100}
    ],
    "process_graph": {
        "add": {"process_id": "add", "arguments": {"x": 5, "y": {"from_parameter": "increment"}}, "result": true}
    }
}'''

cube = connection.datacube_from_json(raw_json)
result = cube.execute()
# result will be 105

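Explicitly specified parameter values still take precedence over such defaults:

cube = connection.datacube_from_json(raw_json, parameters={"increment": 4})
result = cube.execute()
# result will be 9
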
Re-parameterization

TODO

Building process graphs with multiple result nodes

Note

Multi-result support was added in version 0.35.0.

Most openEO use cases are just about building a single result data cube, which is readily covered in the openEO Python client library through classes like DataCube and VectorCube. It is straightforward to create a batch job from these, or execute/download them synchronously.

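For instance, given a single-result DataCube cube, a synchronous download or a batch job is a one-liner (the output file name here is just a placeholder):

# Synchronously execute and download a single result:
cube.download("result.nc")

# Or create a batch job for the same cube:
job = cube.create_job()
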
The openEO API also allows multiple result nodes in a single process graph, for example to persist intermediate results or to produce results in different output formats. To support this, the openEO Python client library provides the MultiResult class, which makes it possible to group multiple DataCube and VectorCube objects into a single entity that can be used to create or run batch jobs. For example:

from openeo import MultiResult

cube1 = ...
cube2 = ...
multi_result = MultiResult([cube1, cube2])
job = multi_result.create_job()

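As a concrete variation on the placeholders above, the grouped results could be the same cube persisted in two different output formats (assuming the back-end supports these format names):

cube = connection.load_collection("SENTINEL2_TOC")
result1 = cube.save_result(format="GTiff")
result2 = cube.save_result(format="netCDF")
multi_result = MultiResult([result1, result2])
job = multi_result.create_job()
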
Moreover, it is not necessary to explicitly create such a MultiResult object, as the Connection.create_job() method directly supports passing multiple data cube objects in a list, which will be automatically grouped as a multi-result:

cube1 = ...
cube2 = ...
job = connection.create_job([cube1, cube2])

Important

Only a single Connection can be in play when grouping multiple results like this. As everything has to be merged into a single process graph and sent to a single back-end, it is not possible to mix cubes created from different connections.