Finding and loading data

As illustrated in the basic concepts, most openEO scripts start with load_collection, but this skips the step of actually finding out which collection to load. This section dives a bit deeper into finding the right data, and some more advanced data loading use cases.

Data discovery

To explore data in a given back-end, it is recommended to use a more visual tool like the openEO Hub (http://hub.openeo.org/). This shows available collections, and metadata in a user-friendly manner.

Next to that, the client also offers various Connection methods to explore collections and their metadata:

When using these methods inside a Jupyter notebook, you should notice that the output is rendered in a user friendly way.

In a regular script, these methods can be used to programmatically find a collection that matches specific criteria.

As a user, make sure to carefully read the documentation for a given collection, as there can be important differences. You should also be aware of the data retention policy of a given collection: some data archives only retain the last 3 months for instance, making them only suitable for specific types of analysis. Such differences can have an impact on the reproducibility of your openEO scripts.

Also note that the openEO metadata may use links to point to much more information for a particular collection. For instance technical specification on how the data was preprocessed, or viewers that allow you to visually explore the data. This can drastically improve your understanding of the dataset.

Finally, licensing information is important to keep an eye on: not all data is free and open.

Initial exploration of an openEO collection

A common question from users is about very specific details of a collection, we’d like to list some examples and solutions here:

  • The collection data type, and range of values, can be determined by simply downloading a sample of data, as NetCDF or Geotiff. This can in fact be done at any point in the design of your script, to get a good idea of intermediate results.

  • Data availability, and available timestamps can be retrieved by computing average values for your area of interest. Just construct a polygon, and retrieve those statistics. For optical data, this can also be used to get an idea on cloud statistics.

  • Most collections have a native projection system, again a simple download will give you this information if its not clear from the metadata.

Loading a data cube from a collection

Many examples already illustrate the basic openEO load_collection process through a Connection.load_collection() call, with filters on space, time and bands. For example:

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 3.75, "east": 4.08, "south": 51.29, "north": 51.39},
    temporal_extent=["2021-05-07", "2021-05-14"],
    bands=["B04", "B03", "B02"],
)

The purpose of these filters in load_collection is to reduce the amount of raw data that is loaded (and processed) by the back-end. This is essential to get a response to your processing request in reasonable time and keep processing costs low. It’s recommended to start initial exploration with a small spatio-temporal extent and gradually increase the scope once initial tests work out.

Next to specifying filters inside the load_collection process, there are also possibilities to filter with separate filter processes, e.g. at a later stage in your process graph. For most openEO back-ends, the following example snippet should be equivalent to the previous:

cube = connection.load_collection("SENTINEL2_L2A")
cube = cube.filter_bbox(west=3.75, east=4.08, south=51.29, north=51.39)
cube = cube.filter_temporal("2021-05-07", "2021-05-14")
cube = cube.filter_bands(["B04", "B03", "B02"])

Another nice feature is that processes that work with geometries or vector features (e.g. aggregated statistics for a polygon, or masking by polygon) can also be used by a back-end to automatically infer an appropriate spatial extent. This way, you do not need to explicitly set these filters yourself.

In the following sections, we want to dive a bit into details, and more advanced cases.

Filter on spatial extent

A spatial extent is a bounding box that specifies the minimum and and maximum longitude and latitude of the region of interest you want to process.

By default these latitude and longitude values are expressed in the standard Coordinate Reference System for the world, which is EPSG:4326, also known as “WGS 84”, or just “lat-long”.

connection.load_collection(
    ...,
    spatial_extent={"west": 5.14, "south": 51.17, "east": 5.17, "north": 51.19},
)

Filter on temporal extent

Usually you don’t need the complete time range provided by a collection and you should specify an appropriate time window to load as a temporal_extent pair containing a start and end date:

connection.load_collection(
    ...,
    temporal_extent=["2021-05-07", "2021-05-14"],
)

In most use cases, day-level granularity is enough and you can just express the dates as strings in the format "yyyy-mm-dd". You can also pass datetime.date objects (from Python standard library) if you already have your dates in that format.

Note

When you need finer, time-level granularity, you can pass datetime.datetime objects. Or, when passed as a string, the openEO API requires date and time to be provided in RFC 3339 format. For example for for 2020-03-17 at 12:34:56 in UTC:

"2020-03-17T12:34:56Z"

Left-closed intervals: start included, end excluded

Time ranges in openEO processes like load_collection and filter_temporal are handled as left-closed (“half-open”) temporal intervals: the start instant is included in the interval, but the end instant is excluded from the interval.

For example, the interval defined by ["2020-03-05", "2020-03-15"] covers observations from 2020-03-05 up to (and including) 2020-03-14 (just before midnight), but does not include observations from 2020-03-15.

          2020-03-05                             2020-03-14   2022-03-15
________|____________|_________________________|____________|____________|_____

        [--------------------------------------------------(O
    including                                           excluding
2020-03-05 00:00:00.000                             2020-03-15 00:00:00.000

While this might look unintuitive at first, working with half-open intervals avoids common and hard to discover pitfalls when combining multiple intervals, like unintended window overlaps or double counting observations at interval borders.

Year/month shorthand notation

Note

Year/month shorthand notation handling is available since version 0.23.0.

Rounding down periods to dates

The openEO Python Client Library supports some shorthand notations for the temporal extent, which come in handy if you work with year/month based temporal intervals. Date strings that only consist of a year or a month will be automatically “rounded down” to the first day of that period. For example:

"2023"    -> "2023-01-01"
"2023-08" -> "2023-08-01"

This approach fits best with left-closed interval handling.

For example, the following two load_collection calls are equivalent:

# Filter for observations in 2021 (left-closed interval).
connection.load_collection(temporal_extent=["2021", "2022"], ...)
# The above is shorthand for:
connection.load_collection(temporal_extent=["2021-01-01", "2022-01-01"], ...)

The same applies for filter_temporal(), which has a couple of additional call forms. All these calls are equivalent:

# Filter for March, April and May (left-closed interval)
cube = cube.filter_temporal("2021-03", "2021-06")
cube = cube.filter_temporal(["2021-03", "2021-06"])
cube = cube.filter_temporal(start_date="2021-03", end_date="2021-06")
cube = cube.filter_temporal(extent=("2021-03", "2021-06"))

# The above are shorthand for:
cube = cube.filter_temporal("2021-03-01", "2022-06-01")

Single string temporal extents

Apart from rounding down year or month string, the openEO Python Client Library provides an additional extent handling feature in methods like Connection.load_collection(temporal_extent=...) and DataCube.filter_temporal(extent=...). Normally, the extent argument should be a list or tuple containing start and end date, but if a single string is given, representing a year, month (or day) period, it is automatically expanded to the appropriate interval, again following the left-closed interval principle. For example:

extent="2022"        ->  extent=("2022-01-01", "2023-01-01")
extent="2022-05"     ->  extent=("2022-05-01", "2022-06-01")
extent="2022-05-17"  ->  extent=("2022-05-17", "2022-05-18")

The following snippet shows some examples of equivalent calls:

connection.load_collection(temporal_extent="2022", ...)
# The above is shorthand for:
connection.load_collection(temporal_extent=("2022-01-01", "2023-01-01"), ...)


cube = cube.filter_temporal(extent="2021-03")
# The above are shorthand for:
cube = cube.filter_temporal(extent=("2021-03-01", "2022-04-01"))

Filter on collection properties

Although openEO presents data in a data cube, a lot of collections are still backed by a product based catalog. This allows filtering on properties of that catalog.

A very common use case is to pre-filter Sentinel-2 products on cloud cover. This avoids loading clouded data unnecessarily and increases performance. Connection.load_collection() provides a dedicated max_cloud_cover argument (shortcut for the eo:cloud_cover property) for that:

connection.load_collection(
    "SENTINEL2_L2A",
    ...,
    max_cloud_cover=80,
)

For more general cases, you can use the properties argument to filter on any collection property. For example, to filter on the relative orbit number of SAR data:

connection.load_collection(
    "SENTINEL1_GRD",
    ...,
    properties={
        "relativeOrbitNumber": lambda x: x==116
    },
)

Version 0.26.0 of the openEO Python Client Library adds collection_property() which makes defining such property filters more user-friendly by avoiding the lambda construct:

import openeo

connection.load_collection(
    "SENTINEL1_GRD",
    ...,
    properties=[
        openeo.collection_property("relativeOrbitNumber") == 116,
    ],
)

Note that property names follow STAC metadata conventions, but some collections can have different names.

Property filters in openEO are also specified by small process graphs, that allow the use of the same generic processes defined by openEO. This is the ‘lambda’ process that you see in the property dictionary. Do note that not all processes make sense for product filtering, and can not always be properly translated into the query language of the catalog. Hence, some experimentation may be needed to find a filter that works.

One important caveat in this example is that ‘relativeOrbitNumber’ is a catalog specific property name. Meaning that different archives may choose a different name for a given property, and the properties that are available can depend on the collection and the catalog that is used by it. This is not a problem caused by openEO, but by the limited standardization between catalogs of EO data.

Handling large vector data sets

For simple use cases, it is common to directly embed geometries (vector data) in your openEO process graph. Unfortunately, with large vector data sets this leads to very large process graphs and you might hit certain limits, resulting in HTTP errors like 413 Request Entity Too Large or 413 Payload Too Large.

This problem can be circumvented by first uploading your vector data to a file sharing service (like Google Drive, DropBox, GitHub, …) and use its public URL in the process graph instead through Connection.vectorcube_from_paths. For example, as follows:

# Load vector data from URL
url = "https://github.com/Open-EO/openeo-python-client/raw/master/tests/data/example_aoi.pq"
parcels = connection.vectorcube_from_paths([url], format="parquet")

# Use the parcel vector data, for example to do aggregation.
cube = connection.load_collection(
    "SENTINEL2_L2A",
    bands=["B04", "B03", "B02"],
    temporal_extent=["2021-05-12", "2021-06-01"],
)
aggregations = cube.aggregate_spatial(
    geometries=parcels,
    reducer="mean",
)

Note that while openEO back-ends typically support multiple vector formats, like GeoJSON and GeoParquet, it is usually recommended to use a compact format like GeoParquet, instead of GeoJSON. The list of supported formats is also advertised by the backend, and can be queried with Connection.list_file_formats.