Adding and using data with the Virtual Ecosystem
A Virtual Ecosystem simulation requires data to run. That includes the loading of
initial forcing data for the model - things like air temperature, elevation and
photosynthetically active radiation - but also the storage of internal
variables calculated by the various models running within the simulation. The data
handling for simulations is managed by the `data` module and the `Data` class, which
provide the data loading and storage functions for the Virtual Ecosystem. The data
system is extendable to provide support for different file formats and axis validation
(see the module API docs), but that is beyond the scope of this document.
A Virtual Ecosystem simulation has a single instance of the `Data` class to provide
access to the different forcing and internal variables used in the simulation. As they
are loaded, all variables are validated and then added to an `xarray.Dataset` object,
which provides consistent indexing and data manipulation for the underlying arrays of
data.
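As a minimal standalone sketch of what that shared indexing provides, the snippet below builds a small `xarray.Dataset` directly (the variable names are illustrative, not part of the Virtual Ecosystem API) and selects across all variables using one dimension label:

```python
import numpy as np
import xarray as xr

# Illustrative only: two variables sharing a named 'cell_id' dimension,
# so both can be indexed with the same label-based selection.
ds = xr.Dataset(
    {
        "temperature": ("cell_id", np.full(100, 20.0)),
        "elevation": ("cell_id", np.linspace(0, 99, 100)),
    }
)

# Select the first grid cell across all variables at once.
first_cell = ds.isel(cell_id=0)
print(first_cell["temperature"].item())  # 20.0
print(first_cell["elevation"].item())  # 0.0
```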
In many cases, a user will simply provide a configuration file to set up the data, which
is then validated and loaded when a simulation runs, but the main functionality for
working with data using Python is shown below.
Validation
One of the main functions of the `data` module is to automatically validate data before
it is added to the `Data` instance. Validation is applied along a set of core axes used
in the simulation. For a given core axis:
- The dimension names of a dataset are used to identify whether data should be validated
  on that axis. For example, a dataset with `x` and `y` dimensions will be validated on
  the `spatial` core axis.
- The axis has a set of defined validators, which are provided to handle different
  possible data configurations. For example, there is a specific `spatial` validator
  used to handle a dataset with `x` and `y` dimensions but no coordinate values.
- When a dataset is checked against a core axis, the validation checks that one of those
  validators applies to the actual configuration of the data, and then runs the specific
  validation for that configuration.
The validation process is primarily intended to check that the sizes or coordinates of the dimensions of provided datasets are congruent with the configuration of a particular simulation. Validators may also standardise or subset input datasets to map them onto a particular axis configuration.
For more details on the different core axes and the alternative mappings applied by validators see the core axis documentation.
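To make the idea concrete, here is a small standalone sketch of the kind of congruence check a spatial validator performs. This is not the library's actual implementation: `covers_grid` is a hypothetical helper, written for the 10 by 10 square grid with cell centres at 5, 15, ..., 95 used in the examples below.

```python
import numpy as np

# Illustrative sketch only (not the library's validator code): check that a
# dataset's x and y coordinates cover every cell centre of a 10 by 10 grid
# with 10 m cells, i.e. centres at 5, 15, ..., 95.
expected = np.arange(5, 100, 10)

def covers_grid(x_coords, y_coords):
    """Return True if the coordinates cover every cell centre of the grid."""
    return bool(
        np.array_equal(np.sort(x_coords), expected)
        and np.array_equal(np.sort(y_coords), expected)
    )

print(covers_grid(np.arange(5, 100, 10), np.arange(5, 100, 10)))  # True
print(covers_grid(np.arange(0, 100, 10), np.arange(5, 100, 10)))  # False
```

A real validator also has to standardise the result - for example, mapping the matched `x` and `y` values onto the grid's internal cell ordering - rather than just returning a boolean.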
Creating a Data instance
A `Data` instance is created using information on the core configuration of the
simulation. At present, this is just the spatial grid being used.
```python
from pathlib import Path

import numpy as np
from xarray import DataArray

from virtual_ecosystem.core.grid import Grid
from virtual_ecosystem.core.config import Config
from virtual_ecosystem.core.data import Data
from virtual_ecosystem.core.axes import *
from virtual_ecosystem.core.readers import load_to_dataarray

# Create a grid with square 100 m2 cells in a 10 by 10 lattice and a Data instance
grid = Grid(grid_type='square', cell_area=100, cell_nx=10, cell_ny=10)
data = Data(grid=grid)

data
```

```
Data: no variables loaded
```
Adding data to a Data instance
Data can be added to a `Data` instance using one of two methods:

1. An existing DataArray object can be added to a `Data` instance using the standard
   dictionary assignment: `data['var_name'] = data_array`. The Virtual Ecosystem
   `readers` module provides the function `load_to_dataarray()` to read data into a
   DataArray from supported file formats. This can then be added directly to a `Data`
   instance: `data['var_name'] = load_to_dataarray('path/to/file.nc', var_name='temperature')`.
2. The `load_data_config()` method takes a loaded data configuration - which is a set of
   named variables and source files - and then simply uses `load_to_dataarray()` to try
   and load each one.
Adding a data array directly
Adding a DataArray directly to a `Data` instance uses the built-in validation to match
the data onto core axes. So, for example, the grid used above has a spatial resolution
and size:
```python
grid
```

```
CoreGrid(square, A=100, nx=10, ny=10, n=100, bounds=(0.0, 0.0, 100.0, 100.0))
```
One of the validation routines for the core spatial axis takes a DataArray with `x` and
`y` coordinates and checks that the data covers all the cells in a square grid:
```python
temperature_data = DataArray(
    np.random.normal(loc=20.0, size=(10, 10)),
    name="temperature",
    coords={"y": np.arange(5, 100, 10), "x": np.arange(5, 100, 10)},
)

temperature_data.plot();
```
That data array can then be added to the `Data` instance, where it is validated and
loaded:
```python
data["temperature"] = temperature_data
```

```
[INFO] - data - __setitem__(210) - Adding data array for 'temperature'
```
The representation of the `Data` instance now shows the loaded variables:

```python
data
```

```
Data: ['temperature']
```
A variable can be accessed from the `data` object using the variable name as a key, and
the data is returned as an `xarray.DataArray` object. Note that the `x` and `y`
coordinates have been mapped onto the internal `cell_id` dimension used to label the
different grid cells (see the `Grid` documentation for details).
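The flattening itself can be illustrated with a standalone numpy sketch. The authoritative mapping is defined by the `Grid` class; this sketch simply assumes that `cell_id` runs row by row starting from the top of the grid, which matches the coordinate order in the printed output below.

```python
import numpy as np

# Standalone illustration (the real mapping is defined by the Grid class):
# flatten a (y, x) array onto a 1-D cell_id dimension, assuming cell_id runs
# row by row from the top of the grid.
values = np.arange(100).reshape(10, 10)  # stand-in (y, x) data, y ascending

# Reverse the y axis so the top row comes first, then flatten row-major.
flat = values[::-1, :].ravel()
print(flat.shape)  # (100,)
print(flat[0])  # 90 -> the first element comes from the top (last) row
```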
```python
# Get the temperature data
loaded_temp = data["temperature"]
print(loaded_temp)
```

```
<xarray.DataArray 'temperature' (cell_id: 100)> Size: 800B
array([19.18255646, 19.73863043, 19.96400293, 18.91293738, 19.87738782,
       19.49066972, 19.12166306, 20.15735771, 19.50391667, 19.28538099,
       21.03782659, 19.48875264, 21.32990701, 20.50858601, 19.61896617,
       18.04978447, 19.62633464, 19.61596411, 19.44502263, 20.26316314,
       19.82133621, 21.17377578, 21.15139286, 19.80548545, 19.60960675,
       20.58700652, 20.39757136, 20.79703481, 20.52482362, 19.58477842,
       21.29632687, 20.18228689, 19.92637715, 19.63235711, 19.01064588,
       19.71071319, 19.72569102, 19.97585873, 20.03574546, 21.44086283,
       20.02144154, 19.24625673, 20.9070336 , 21.30304389, 20.45518546,
       21.00569206, 17.29817076, 19.5088585 , 19.03466613, 20.1780125 ,
       20.98288897, 18.72524624, 20.13500642, 20.74694571, 19.15080049,
       18.5445083 , 20.33568856, 18.94084468, 20.57520206, 16.94567768,
       19.71077916, 20.04220263, 19.70543969, 19.70917819, 18.78140545,
       19.81206259, 20.29082848, 18.52166707, 21.37534011, 20.52353058,
       20.3822667 , 18.13982302, 21.36175962, 21.10536143, 20.31490153,
       19.86325391, 21.9698908 , 19.23854365, 19.80296407, 19.11467738,
       20.45259368, 20.12273163, 20.85982004, 19.95109577, 21.89446113,
       20.13530813, 16.76533195, 20.64286659, 20.51492633, 18.44443909,
       21.42845175, 18.51042375, 19.91759208, 21.94120635, 19.6612575 ,
       18.84158397, 18.66492004, 18.97374123, 20.83488106, 19.59019387])
Coordinates:
    y        (cell_id) int64 800B 95 95 95 95 95 95 95 95 95 ... 5 5 5 5 5 5 5 5
    x        (cell_id) int64 800B 5 15 25 35 45 55 65 ... 35 45 55 65 75 85 95
Dimensions without coordinates: cell_id
```
You can check whether a particular variable has been validated on a given core axis
using the `on_core_axis()` method:

```python
data.on_core_axis("temperature", "spatial")
```

```
True
```
Loading data from a file
Data can be loaded directly from a file by providing a path to a supported file format
and the name of a variable stored in the file. In the example below, the NetCDF file
contains a variable `temp` with dimensions `x` and `y`, both of which are of length 10:
it contains a 10 by 10 grid that maps onto the shape of the configured grid.
```python
# Load data from a file
file_path = Path("../../data/xy_dim.nc")
data['temp'] = load_to_dataarray(file_path, var_name="temp")
```

```
[INFO] - readers - load_to_dataarray(167) - Loading variable 'temp' from file: ../../data/xy_dim.nc
[INFO] - data - __setitem__(210) - Adding data array for 'temp'
```

```python
data
```

```
Data: ['temperature', 'temp']
```

```python
data.on_core_axis("temp", "spatial")
```

```
True
```
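The `xy_dim.nc` file used above is simply a NetCDF file holding a `temp` variable with `x` and `y` dimensions of length 10. As a sketch of how such an input file could be generated with xarray (the values here are synthetic, not the contents of the actual file):

```python
import numpy as np
import xarray as xr

# Build a synthetic 10 x 10 'temp' variable with y and x dimensions, matching
# the shape expected for the configured 10 by 10 grid. Values are illustrative.
temp = xr.DataArray(
    np.random.normal(loc=20.0, size=(10, 10)),
    name="temp",
    dims=("y", "x"),
)
dataset = temp.to_dataset()
print(dataset["temp"].dims)  # ('y', 'x')

# Writing the file requires a NetCDF backend (netCDF4 or scipy):
# dataset.to_netcdf("xy_dim.nc")
```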
Loading data from a configuration
The configuration files for a Virtual Ecosystem simulation can include a data
configuration section. This can be used to automatically load multiple datasets into a
`Data` object. The configuration file is TOML formatted and should contain an entry like
the example below for each variable to be loaded.

```toml
[[core.data.variable]]
file = "../../data/xy_dim.nc"
var_name = "temp"
```
NOTE: At the moment, `core.data.variable` tags cannot be used across multiple TOML
config files without causing `ConfigurationError: Duplicated entries in config files:
core.data.variable` to be raised. This means that all variables need to be combined in
one config file.
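Given that restriction, a single config file loading several variables would look something like the sketch below (the second file and variable name are hypothetical, included only to show the repeated array-of-tables syntax):

```toml
# All data variables gathered in one config file, one [[core.data.variable]]
# entry per variable.
[[core.data.variable]]
file = "../../data/xy_dim.nc"
var_name = "temp"

[[core.data.variable]]
file = "../../data/elevation.nc"  # hypothetical second data file
var_name = "elevation"
```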
To load configuration data, you will typically use the `cfg_paths` argument to pass one
or more TOML formatted configuration files to create a `Config` object. You can also use
a string containing TOML formatted text, or a list of TOML strings, to create a
configuration object:

```python
data_toml = '''[[core.data.variable]]
file="../../data/xy_dim.nc"
var_name="temp"
'''

config = Config(cfg_strings=data_toml)
```
```
[INFO] - config - load_config_toml_string(372) - Config TOML loaded from config strings
[INFO] - config - build_config(440) - Config built from config string
[INFO] - registry - register_module(104) - Registering module: virtual_ecosystem.core
[INFO] - registry - register_module(154) - Schema registered for virtual_ecosystem.core: /home/docs/checkouts/readthedocs.org/user_builds/virtual-rainforest/checkouts/latest/virtual_ecosystem/core/module_schema.json
[INFO] - registry - register_module(176) - Constants class registered for virtual_ecosystem.core: CoreConsts
[INFO] - config - build_schema(482) - Validation schema for configuration built.
[INFO] - config - validate_config(516) - Configuration validated
```
The `Config` object can then be passed to the `load_data_config()` method:

```python
data.load_data_config(config)
```

```
[INFO] - data - load_data_config(294) - Loading data from configuration
[INFO] - readers - load_to_dataarray(167) - Loading variable 'temp' from file: ../../data/xy_dim.nc
[INFO] - data - __setitem__(212) - Replacing data array for 'temp'
```

```python
data
```

```
Data: ['temperature', 'temp']
```
Data output
The entire contents of the `Data` object can be output using the `save_to_netcdf()`
method:

```python
data.save_to_netcdf(output_file_path)
```
Alternatively, a smaller NetCDF file can be output containing only variables of
interest, by also providing a list of the variable names to save:

```python
variables_to_save = ["variable_a", "variable_b"]
data.save_to_netcdf(output_file_path, variables_to_save)
```