API documentation for the data module
The data module handles the population and storage of data sources used to run Virtual Ecosystem simulations.
The Data class
The core Data class is used to store data for the variables used in a simulation. It can be used both for data from external sources (for example, data used to set the initial environment or time series of inputs) and for internal variables used in the simulation. The class behaves like a dictionary, so data can be retrieved and set using data_object['varname'], but it also provides validation for data being added to the object.
All data added to the class is stored in a Dataset object, and data extracted from the object will be a DataArray. The Dataset can also be accessed directly using the data attribute of the class instance to use any of the Dataset class methods.
When data is added to a Data instance, it is automatically validated against the configuration of a simulation before being added to the data attribute. The validation process also stores information that allows models to confirm that a given variable has been successfully validated.
The core of the Data class is the __setitem__() method. This method provides the following functionality:
- It allows a DataArray to be added to a Data instance using the data['varname'] = data_array syntax.
- It applies the validation step using the validate_dataarray() function. See the axes module for the details of the validation process, including the AxisValidator class and the concept of core axes.
- It inserts the data into the Dataset instance stored in the data attribute.
- Lastly, it records the data validation details in the variable_validation attribute.
The Data class also provides three shorthand methods to get information and data from an instance.
- The __contains__() method tests if a named variable is included in the internal Dataset instance.
# Equivalent code
'varname' in data
'varname' in data.data
- The __getitem__() method is used to retrieve a named variable from the internal Dataset instance.
# Equivalent code
data['varname']
data.data['varname']
- The on_core_axis() method queries the variable_validation attribute to confirm that a named variable has been validated on a named axis.
# Test that the temperature variable has been validated on the spatial axis
data.on_core_axis('temperature', 'spatial')
Adding data from a file
The general solution for programmatically adding data from a file is to:
- manually open a data file using an appropriate reader package for the format,
- coerce the data into a properly structured DataArray object, and then
- use the __setitem__() method to validate and add it to a Data instance.
The load_to_dataarray() function implements data loading to a DataArray for some known file formats, using file reader functions described in the readers module. See the details of that module for supported formats and for extending the system to additional file formats.
# Load temperature data from a supported file
from virtual_ecosystem.core.readers import load_to_dataarray
data['temp'] = load_to_dataarray(
'/path/to/supported/format.nc', var_name='temperature'
)
Using a data configuration
A Data instance can also be populated using the load_data_config() method. This method expects a properly validated configuration object, typically created from TOML files (see Config). The expected structure of the data configuration section within those TOML files is as follows:
[[core.data.variable]]
file="/path/to/file.nc"
var_name="precip"
[[core.data.variable]]
file="/path/to/file.nc"
var_name="temperature"
[[core.data.variable]]
var_name="elev"
Data configurations must not contain repeated data variable names. NOTE: At the moment, `core.data.variable` tags cannot be used across multiple TOML config files without causing `ConfigurationError: Duplicated entries in config files: core.data.variable` to be raised. This means that all variables need to be combined in one `config` file.
# Load configured datasets
data.load_data_config(config)
Classes:
Data | The Virtual Ecosystem data object.
DataGenerator | Generate artificial data.
Functions:
merge_continuous_data_files | Merge all continuous data files in a folder into a single file.
- class virtual_ecosystem.core.data.Data(grid: Grid)
The Virtual Ecosystem data object.
This class holds data for a Virtual Ecosystem simulation. It functions like a dictionary but the class extends the dictionary methods to provide common methods for data validation etc and to hold key attributes, such as the underlying spatial grid.
- Parameters:
grid – The Grid instance that will be used for simulation.
- Raises:
TypeError – when grid is not a Grid object
Methods:
__contains__(key) | Check if a given data variable is present in a Data instance.
__getitem__(key) | Get a given data variable from a Data instance.
__repr__() | Returns a representation of a Data instance.
__setitem__(key, value) | Load a data array into a Data instance.
add_from_dict(output_dict) | Update data object from dictionary of variables.
load_data_config(config) | Set up the simulation data from a user configuration.
on_core_axis(var_name, axis_name) | Check core axis validation.
output_current_state(variables_to_save, ...) | Method to output the current state of the data object.
save_timeslice_to_netcdf(output_file_path, ...) | Save specific variables from current state of data as a NetCDF file.
save_to_netcdf(output_file_path[, ...]) | Save the contents of the data object as a NetCDF file.
Attributes:
data | The Dataset used to store data.
grid | The configured Grid to be used in a simulation.
variable_validation | Records validation details for loaded variables.
- __contains__(key: str) → bool
Check if a given data variable is present in a Data instance.
This method provides the var_name in data_instance functionality for a Data instance. This is just a shortcut: var in data_instance is the same as var in data_instance.data.
- Parameters:
key – A data variable name
- __getitem__(key: str) → DataArray
Get a given data variable from a Data instance.
This method looks for the provided key in the data variables saved in the data attribute and returns the DataArray for that variable. Note that this is just a shortcut: data_instance['var'] is the same as data_instance.data['var'].
- Parameters:
key – The name of the data variable to get
- Raises:
KeyError – if the data variable is not present
- __setitem__(key: str, value: DataArray) → None
Load a data array into a Data instance.
This method takes an input DataArray object and then matches the dimension and coordinates signature of the array to find a loading routine given the grid used in the Data instance. That routine is used to validate the DataArray and then add the DataArray to the Dataset object or replace the existing DataArray under that key.
Note that the DataArray name is expected to match the standard internal variable names used in Virtual Ecosystem.
- Parameters:
key – The name to store the data under
value – The DataArray to be stored
- Raises:
TypeError – when the value is not a DataArray.
- add_from_dict(output_dict: dict[str, DataArray]) → None
Update data object from dictionary of variables.
This function takes a dictionary of updated variables to replace the corresponding variables in the data object. If a variable is not in data, it is added. This will need to be reassessed as the model evolves. TODO: we might want to split the function into strict ‘replace’ and ‘add’ functionalities.
- Parameters:
output_dict – dictionary of variables from submodule
- Returns:
an updated data object for the current time step
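The replace-or-add behaviour described above can be sketched with a plain dictionary. This is a minimal illustration only: `update_from_dict` is a hypothetical function, lists of floats stand in for `DataArray` objects, and the sketch makes no claim about whether the real method validates the incoming arrays.

```python
def update_from_dict(
    store: dict[str, list[float]], output_dict: dict[str, list[float]]
) -> None:
    """Hypothetical sketch: replace existing variables and add new ones in place."""
    for name, values in output_dict.items():
        # Plain assignment covers both cases: an existing key is overwritten
        # and a missing key is created.
        store[name] = values


store = {"temperature": [20.0, 21.0], "precip": [1.2, 0.0]}
update_from_dict(
    store, {"temperature": [22.0, 23.0], "soil_moisture": [0.3, 0.4]}
)
print(sorted(store))
```

Here `temperature` is replaced, `soil_moisture` is added, and `precip` is left untouched because it does not appear in the update dictionary.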
- load_data_config(config: Config) → None
Set up the simulation data from a user configuration.
This method is used to validate a provided user data configuration and populate the Data instance object from the provided data sources. The data_config dictionary can contain a ‘variable’ key containing an array of dictionaries providing the path to the file (file) and the name of the variable within the file (var_name).
- Parameters:
config – A validated Virtual Ecosystem model configuration object.
- on_core_axis(var_name: str, axis_name: str) → bool
Check core axis validation.
This function checks if a given variable loaded into a Data instance has been validated on one of the core axes.
- Parameters:
var_name – The name of a variable
axis_name – The core axis name
- Returns:
A boolean indicating if the variable was validated on the named axis.
- Raises:
ValueError – If the variable name or the core axis name is unknown, or if the variable validation data in the Data instance does not include the variable, which would be an internal programming error.
- output_current_state(variables_to_save: list[str], data_options: dict[str, Any], time_index: int) → Path
Method to output the current state of the data object.
This function outputs all variables stored in the data object, except for any data with a “time_index” dimension defined (at present only climate input data has this). This data can either be saved as a new file or appended to an existing file.
- Parameters:
variables_to_save – List of variables to save
data_options – Set of options concerning what to output and where
time_index – The index representing the current time step in the data object.
- Raises:
ConfigurationError – If the final output directory doesn’t exist, isn’t a directory, or the final output file already exists (when in new file mode). If the file to append to is missing (when not in new file mode).
- Returns:
A path to the file that the current state is saved in
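The error conditions listed above can be read as a pre-flight check on the output location. The following sketch shows one way such a check could look; it is not the package's code. `check_output_file`, the `new_file` flag and the file name are hypothetical, and `ConfigurationError` is redefined locally as a stand-in so the example runs on its own.

```python
import tempfile
from pathlib import Path


class ConfigurationError(Exception):
    """Local stand-in for the package's ConfigurationError."""


def check_output_file(out_dir: Path, file_name: str, new_file: bool) -> Path:
    """Hypothetical pre-flight check mirroring the documented error conditions."""
    if not out_dir.exists():
        raise ConfigurationError(f"Output directory does not exist: {out_dir}")
    if not out_dir.is_dir():
        raise ConfigurationError(f"Output path is not a directory: {out_dir}")
    out_file = out_dir / file_name
    # New file mode: refuse to overwrite an existing file.
    if new_file and out_file.exists():
        raise ConfigurationError(f"Output file already exists: {out_file}")
    # Append mode: the file to append to must already be there.
    if not new_file and not out_file.exists():
        raise ConfigurationError(f"File to append to is missing: {out_file}")
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    target = check_output_file(Path(tmp), "continuous_state.nc", new_file=True)
    print(target.name)
```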
- save_timeslice_to_netcdf(output_file_path: Path, variables_to_save: list[str], time_index: int) → None
Save specific variables from current state of data as a NetCDF file.
At present, this function saves each time step individually. In future, this function might be altered to append multiple time steps at once, as this could improve performance significantly.
- Parameters:
output_file_path – Path location to save NetCDF file to.
variables_to_save – List of variables to save in the file
time_index – The time index of the slice being saved
- Raises:
ConfigurationError – If the file to save to can’t be found
- save_to_netcdf(output_file_path: Path, variables_to_save: list[str] | None = None) → None
Save the contents of the data object as a NetCDF file.
Either the whole contents of the data object or specific variables of interest can be saved using this function.
- Parameters:
output_file_path – Path location to save the Virtual Ecosystem model state.
variables_to_save – List of variables to be saved. If not provided then all variables are saved.
- variable_validation: dict[str, dict[str, str | None]]
Records validation details for loaded variables.
The validation details for each variable are stored in this dictionary using the variable name as a key. The validation details are a dictionary, keyed using core axis names, of the AxisValidator subclass applied to that axis. If no validator was applied, the entry for that core axis will be None.
- class virtual_ecosystem.core.data.DataGenerator(spatial_axis: str, temporal_axis: str, temporal_interpolation: timedelta64, seed: int | None, method: str, **kwargs: Any)
Generate artificial data.
Currently just a signature sketch.
- virtual_ecosystem.core.data.merge_continuous_data_files(data_options: dict[str, Any], continuous_data_files: list[Path]) → None
Merge all continuous data files in a folder into a single file.
This function deletes all of the continuous output files it has been asked to merge once the combined output is saved.
- Parameters:
data_options – Set of options concerning what to output and where
continuous_data_files – Files containing previously output continuous data
- Raises:
ConfigurationError – If the output folder doesn’t exist or if the output file already exists
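The merge-then-delete behaviour can be sketched with plain text files standing in for the continuous output files. This is an illustration of the ordering only: `merge_then_delete` and the file names are hypothetical, text concatenation replaces the real merging of data files, and the built-in `FileExistsError` is used in place of `ConfigurationError`.

```python
import tempfile
from pathlib import Path


def merge_then_delete(input_files: list[Path], output_file: Path) -> None:
    """Hypothetical sketch: write a combined file, then remove the inputs."""
    if output_file.exists():
        raise FileExistsError(f"Output file already exists: {output_file}")
    # Sort so the merged content is in a reproducible order.
    combined = "".join(path.read_text() for path in sorted(input_files))
    output_file.write_text(combined)
    # Inputs are only deleted once the combined output has been written.
    for path in input_files:
        path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    parts = [folder / f"continuous_{i}.txt" for i in range(3)]
    for i, part in enumerate(parts):
        part.write_text(f"step {i}\n")
    merged = folder / "all_continuous.txt"
    merge_then_delete(parts, merged)
    merged_text = merged.read_text()
    remaining = sorted(p.name for p in folder.iterdir())

print(remaining)
```

Deleting the inputs only after the combined file has been written means that a failure during the merge leaves the original files in place.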