Adding and using data with the Virtual Ecosystem

A Virtual Ecosystem simulation requires data to run. That includes the loading of initial forcing data for the model - things like air temperature, elevation and photosynthetically active radiation - but also includes the storage of internal variables calculated by the various models running within the simulation. The data handling for simulations is managed by the data module and the Data class, which provides the data loading and storage functions for the Virtual Ecosystem. The data system is extendable to provide support for different file formats and axis validation (see the module API docs) but that is beyond the scope of this document.

A Virtual Ecosystem simulation will have one instance of the Data class to provide access to the different forcing and internal variables used in the simulation. As they are loaded, all variables are validated and then added to an xarray.Dataset object, which provides consistent indexing and data manipulation for the underlying arrays of data.

In many cases, a user will simply provide a configuration file to set up the data that will be validated and loaded when a simulation runs, but the main functionality for working with data using Python is shown below.

Validation

One of the main functions of the data module is to automatically validate data before it is added to the Data instance. Validation is applied along a set of core axes used in the simulation. For a given core axis:

  • The dimension names of a dataset are used to identify if data should be validated on that axis. For example, a dataset with x and y dimensions will be validated on the spatial core axis.

  • The axis will have a set of defined validators, which are provided to handle different possible data configurations. For example, there is a specific spatial validator used to handle a dataset with x and y dimensions but no coordinate values.

  • When a dataset is checked against a core axis, the validation checks to see that one of those validators applies to the actual configuration of the data, and then runs the specific validation for that configuration.

The validation process is primarily intended to check that the sizes or coordinates of the dimensions of provided datasets are congruent with the configuration of a particular simulation. Validators may also standardise or subset input datasets to map them onto a particular axis configuration.

For more details on the different core axes and the alternative mappings applied by validators see the core axis documentation.
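The dispatch logic described above can be sketched as follows. This is a schematic illustration only, using hypothetical class and method names rather than the real virtual_ecosystem validator API:

```python
class AxisValidator:
    """Base class: one subclass per supported data configuration (hypothetical)."""

    def can_validate(self, dims):
        raise NotImplementedError

    def run_validation(self, data_dims):
        raise NotImplementedError


class SpatialXYDimsValidator(AxisValidator):
    """Applies to data with 'x' and 'y' dimensions."""

    def can_validate(self, dims):
        return {"x", "y"} <= set(dims)

    def run_validation(self, data_dims):
        # A real validator would check sizes or coordinates against the
        # configured grid and map the data onto the axis; here we just
        # report that a matching configuration was found.
        return "validated on the spatial axis"


def validate_on_axis(data_dims, validators):
    """Run the first validator that applies to this data configuration."""
    for validator in validators:
        if validator.can_validate(data_dims):
            return validator.run_validation(data_dims)
    raise ValueError("No validator applies to this data configuration")


result = validate_on_axis(["x", "y"], [SpatialXYDimsValidator()])
print(result)  # validated on the spatial axis
```

The key design point is that each core axis holds a set of validators, and the first one whose can_validate test matches the data's dimensions is the one that runs.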

Creating a Data instance

A Data instance is created using information about the core configuration of the simulation. At present, this is just the spatial grid being used.

from pathlib import Path

import numpy as np
from xarray import DataArray

from virtual_ecosystem.core.grid import Grid
from virtual_ecosystem.core.config import Config
from virtual_ecosystem.core.data import Data
from virtual_ecosystem.core.axes import *
from virtual_ecosystem.core.readers import load_to_dataarray

# Create a grid with square 100m2 cells in a 10 by 10 lattice and a Data instance
grid = Grid(grid_type='square', cell_area=100, cell_nx=10, cell_ny=10)
data = Data(grid=grid)

data
Data: no variables loaded
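As a quick side note on the geometry (plain arithmetic, not part of the Data API): square cells with cell_area=100 have 10 m sides, so a 10 by 10 lattice spans 100 m along each axis and contains 100 cells in total.

```python
import math

cell_area = 100  # m2 per square cell, as configured above
cell_nx = cell_ny = 10

cell_side = math.sqrt(cell_area)  # edge length of each cell: 10.0 m
extent = cell_side * cell_nx      # total extent along one axis: 100.0 m
n_cells = cell_nx * cell_ny       # total number of cells: 100

print(cell_side, extent, n_cells)  # 10.0 100.0 100
```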

Adding data to a Data instance

Data can be added to a Data instance using one of two methods:

  1. An existing DataArray object can be added to a Data instance just using the standard dictionary assignment: data['var_name'] = data_array. The Virtual Ecosystem readers module provides the function load_to_dataarray() to read data into a DataArray from supported file formats. This can then be added directly to a Data instance:

data['var_name'] = load_to_dataarray('path/to/file.nc', var_name='temperature')
  2. The load_data_config() method takes a loaded Data configuration - which is a set of named variables and source files - and uses load_to_dataarray() to load each one in turn.

Adding a data array directly

Adding a DataArray to a Data instance uses the built-in validation to match the data onto the core axes. So, for example, the grid used above has a spatial resolution and size:

grid
CoreGrid(square, A=100, nx=10, ny=10, n=100, bounds=(0.0, 0.0, 100.0, 100.0))

One of the validation routines for the core spatial axis takes a DataArray with x and y coordinates and checks that the data covers all the cells in a square grid:

temperature_data = DataArray(
    np.random.normal(loc=20.0, size=(10, 10)),
    name="temperature",
    coords={"y": np.arange(5, 100, 10), "x": np.arange(5, 100, 10)},
)

temperature_data.plot();
[Heatmap of the random temperature data across the 10 by 10 grid]

That data array can then be added to the Data instance, where it is validated as it is loaded:

data["temperature"] = temperature_data
[INFO] - data - __setitem__(210) - Adding data array for 'temperature'

The representation of the virtual_ecosystem.core.data.Data instance now shows the loaded variables:

data
Data: ['temperature']

A variable can be accessed from the data object using the variable name as a key, and the data is returned as an xarray.DataArray object.

Note that the x and y coordinates have been mapped onto the internal cell_id dimension used to label the different grid cells (see the Grid documentation for details).

# Get the temperature data
loaded_temp = data["temperature"]

print(loaded_temp)
<xarray.DataArray 'temperature' (cell_id: 100)> Size: 800B
array([19.18255646, 19.73863043, 19.96400293, 18.91293738, 19.87738782,
       19.49066972, 19.12166306, 20.15735771, 19.50391667, 19.28538099,
       21.03782659, 19.48875264, 21.32990701, 20.50858601, 19.61896617,
       18.04978447, 19.62633464, 19.61596411, 19.44502263, 20.26316314,
       19.82133621, 21.17377578, 21.15139286, 19.80548545, 19.60960675,
       20.58700652, 20.39757136, 20.79703481, 20.52482362, 19.58477842,
       21.29632687, 20.18228689, 19.92637715, 19.63235711, 19.01064588,
       19.71071319, 19.72569102, 19.97585873, 20.03574546, 21.44086283,
       20.02144154, 19.24625673, 20.9070336 , 21.30304389, 20.45518546,
       21.00569206, 17.29817076, 19.5088585 , 19.03466613, 20.1780125 ,
       20.98288897, 18.72524624, 20.13500642, 20.74694571, 19.15080049,
       18.5445083 , 20.33568856, 18.94084468, 20.57520206, 16.94567768,
       19.71077916, 20.04220263, 19.70543969, 19.70917819, 18.78140545,
       19.81206259, 20.29082848, 18.52166707, 21.37534011, 20.52353058,
       20.3822667 , 18.13982302, 21.36175962, 21.10536143, 20.31490153,
       19.86325391, 21.9698908 , 19.23854365, 19.80296407, 19.11467738,
       20.45259368, 20.12273163, 20.85982004, 19.95109577, 21.89446113,
       20.13530813, 16.76533195, 20.64286659, 20.51492633, 18.44443909,
       21.42845175, 18.51042375, 19.91759208, 21.94120635, 19.6612575 ,
       18.84158397, 18.66492004, 18.97374123, 20.83488106, 19.59019387])
Coordinates:
    y        (cell_id) int64 800B 95 95 95 95 95 95 95 95 95 ... 5 5 5 5 5 5 5 5
    x        (cell_id) int64 800B 5 15 25 35 45 55 65 ... 35 45 55 65 75 85 95
Dimensions without coordinates: cell_id

You can check whether a particular variable has been validated on a given core axis using the on_core_axis() method:

data.on_core_axis("temperature", "spatial")
True
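The x and y coordinates used for temperature_data above are the cell-centre coordinates of this grid: with 10 m cells starting at 0, the centres fall at 5, 15, ..., 95. A quick standalone check, using only numpy and independent of the Data class:

```python
import numpy as np

cell_side = 10.0  # m, cell edge length for the configured grid
n_cells = 10      # cells along each axis

# The centre of each cell sits half a cell in from its lower edge
centres = np.arange(n_cells) * cell_side + cell_side / 2.0

print(centres)  # [ 5. 15. 25. 35. 45. 55. 65. 75. 85. 95.]
```

These are exactly the values produced by np.arange(5, 100, 10) in the example above, which is why the spatial validator accepts that data array as covering the whole grid.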

Loading data from a file

Data can be loaded directly from a file by providing a path to a supported file format and the name of a variable stored in the file. In the example below, the NetCDF file contains a variable temp with dimensions x and y, both of length 10: it contains a 10 by 10 grid that maps onto the shape of the configured grid.

# Load data from a file
file_path = Path("../../data/xy_dim.nc")
data['temp'] = load_to_dataarray(file_path, var_name="temp")
[INFO] - readers - load_to_dataarray(167) - Loading variable 'temp' from file: ../../data/xy_dim.nc
[INFO] - data - __setitem__(210) - Adding data array for 'temp'
data
Data: ['temperature', 'temp']
data.on_core_axis("temp", "spatial")
True

Loading data from a configuration

The configuration files for a Virtual Ecosystem simulation can include a data configuration section. This can be used to automatically load multiple datasets into a Data object. The configuration file is TOML formatted and should contain an entry like the example below for each variable to be loaded.

[[core.data.variable]]
file="../../data/xy_dim.nc"
var_name="temp"

NOTE: At the moment, core.data.variable tags cannot be used across multiple TOML config files without causing ConfigurationError: Duplicated entries in config files: core.data.variable to be raised. This means that all variables need to be combined in one config file.
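For example, two variables would be declared together in a single configuration file by repeating the core.data.variable table. The second file name here is hypothetical, purely to illustrate the layout:

```toml
[[core.data.variable]]
file="../../data/xy_dim.nc"
var_name="temp"

[[core.data.variable]]
file="../../data/elevation.nc"
var_name="elevation"
```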

To load configuration data, you will typically use the cfg_paths argument to pass one or more TOML formatted configuration files to create a Config object. You can also use a string containing TOML formatted text, or a list of TOML strings, to create a configuration object:

data_toml = '''[[core.data.variable]]
file="../../data/xy_dim.nc"
var_name="temp"
'''

config = Config(cfg_strings=data_toml)
[INFO] - config - load_config_toml_string(372) - Config TOML loaded from config strings
[INFO] - config - build_config(440) - Config built from config string
[INFO] - registry - register_module(104) - Registering module: virtual_ecosystem.core
[INFO] - registry - register_module(154) - Schema registered for virtual_ecosystem.core: /home/docs/checkouts/readthedocs.org/user_builds/virtual-rainforest/checkouts/latest/virtual_ecosystem/core/module_schema.json 
[INFO] - registry - register_module(176) - Constants class registered for virtual_ecosystem.core: CoreConsts 
[INFO] - config - build_schema(482) - Validation schema for configuration built.
[INFO] - config - validate_config(516) - Configuration validated

The Config object can then be passed to the load_data_config method:

data.load_data_config(config)
[INFO] - data - load_data_config(294) - Loading data from configuration
[INFO] - readers - load_to_dataarray(167) - Loading variable 'temp' from file: ../../data/xy_dim.nc
[INFO] - data - __setitem__(212) - Replacing data array for 'temp'
data
Data: ['temperature', 'temp']

Data output

The entire contents of the Data object can be output using the save_to_netcdf() method:

data.save_to_netcdf(output_file_path)

Alternatively, a smaller NetCDF file can be output containing only the variables of interest, by passing a list of the required variable names:

variables_to_save = ["variable_a", "variable_b"]
data.save_to_netcdf(output_file_path, variables_to_save)