A main driver in the CliMAF design is to let users define what is called a ‘CliMAF operator’ by interfacing with any user-developed diagnostic, be it an external script, an external binary or a Python function (hereafter called a ‘diagnostic’), and to combine it with other processing stages. CliMAF provides a number of services to the diagnostics, which lets their development focus on the science.
The present section explains the basics of such an interfacing. The way to use operators is, for now, mainly described in the Examples section.
The main principles for a diagnostic are that:

- it may implement either a simple or a complex function; while simple functions are more re-usable, complex ones may be more cost-effective
- the script (or function) calling sequence is registered with CliMAF before use, using a dedicated syntax which allows mapping CliMAF-managed objects to script (or function) arguments (see cscript(), and the syntax explanation below)
- all types of diagnostics interface through string arguments on the command line (or in the function call) for providing diagnostic computation parameters; this applies to all arguments except the main input and output datasets
- for the main input and output datasets:
  - Regarding script-type diagnostics:

    - they interface with CliMAF using:

      - NetCDF files or OpenDAP dataset URLs (see below) for data input
      - NetCDF or PNG files for data output

    - input and output dataset filenames are provided by CliMAF as script arguments (at the location required by the script)
    - NetCDF files must be CF-compliant

  - Regarding Python function-type diagnostics: they interface with CliMAF using MaskedArrays (to be confirmed: Masked Variables may apply)
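As an illustration of the function-type interface, here is a minimal sketch of what such a diagnostic could look like, assuming CliMAF hands it a numpy MaskedArray; the function name, signature and reference value are purely illustrative, not part of the CliMAF API:

```python
import numpy.ma as ma

def temperature_anomaly(tas, ref_value=287.0):
    """Illustrative function-type diagnostic: subtract a reference
    value from a masked temperature field. numpy.ma arithmetic
    preserves the mask, so masked (e.g. land/sea-masked) cells
    stay masked in the result."""
    return tas - ref_value

# Toy input: a 2x2 field with one masked cell
field = ma.masked_array([[288.0, 290.0], [285.0, 0.0]],
                        mask=[[False, False], [False, True]])
anom = temperature_anomaly(field)
```

The point is only that the function receives and returns masked arrays; CliMAF takes care of reading and writing the corresponding NetCDF data.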
CliMAF can provide a number of data pre-processing services upstream of the script, which can simplify the script design: fetching data through OpenDAP, slicing the data in time and space, aggregating NetCDF files in time, and re-mapping data to a regular lat-lon grid.
Because CliMAF manages a cache of such pre-processed data, it should be cost-effective to let it handle these operations.
Data location
On input, CliMAF deals with knowing where the data is, and will provide its path or URL; the script does not need to care about that, as it receives data paths/URLs; scripts declared as non-OpenDAP-capable will receive only file paths.
For script outputs, CliMAF will provide the script with filename(s) in an existing directory, with write permission.
Time slicing / aggregating
Scripts may be able, or unable, to read one variable from multiple files, where each file covers only part of the time period to process; CliMAF manages both cases; it can:

- either aggregate the files into a single file covering exactly the time period to work on
- or provide the script with the list of filenames sufficient to cover the time period to process, plus a specification of the time period as a string argument; this case is more cost-effective for very long datasets, which can hardly fit in a single file
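In the second case, the script itself must intersect the requested period with the files it receives. A minimal sketch of that logic, with all names and the toy catalog purely illustrative (real CliMAF passes the file list and the period as script arguments):

```python
def files_covering_period(files, period):
    """Keep only the files whose (start, end) years intersect the
    requested period, given as a 'YYYY-YYYY' string, as in the
    CliMAF examples. `files` maps a filename to the (first, last)
    year it contains; this catalog is a stand-in for whatever
    metadata the script reads from the files themselves."""
    first, last = (int(y) for y in period.split('-'))
    selected = []
    for fname, (fstart, fend) in files.items():
        if fstart <= last and fend >= first:   # year intervals overlap
            selected.append(fname)
    return selected

# Toy catalog: filename -> (first year, last year) in the file
catalog = {
    'tas_1970-1979.nc': (1970, 1979),
    'tas_1980-1989.nc': (1980, 1989),
    'tas_1990-1999.nc': (1990, 1999),
}
kept = files_covering_period(catalog, '1980-1981')
```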
Selecting a variable to process
Primary datafiles may be multi-variable datafiles; on the other hand, some scripts may wish to be relieved of variable selection; to accommodate both cases, every script can:

- either ‘declare’ to CliMAF that it can select a variable in a multi-variable NetCDF file; this is the most cost-effective option
- or let CliMAF do the variable selection upstream; in that case, the script must be able to identify which NetCDF variable it should work on (i.e. to set aside coordinate variables)
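The rule for identifying the variable to work on can be sketched as follows; the function and the naming heuristics (coordinate variables share a dimension name, bounds variables end in `_bnds`) are an assumption about typical CF files, not a CliMAF service:

```python
def main_variable(variables, dimensions):
    """Pick the geophysical variable in a single-variable-of-interest
    NetCDF file by setting aside coordinate variables (those named
    after a dimension) and their bounds. Purely illustrative of the
    rule described above."""
    coords = set(dimensions)
    candidates = [v for v in variables
                  if v not in coords and not v.endswith('_bnds')]
    if len(candidates) != 1:
        raise ValueError('cannot identify a single main variable')
    return candidates[0]

# Toy content of a CF-style file after CliMAF's upstream selection
var_names = ['time', 'lat', 'lon', 'time_bnds', 'tas']
dim_names = ['time', 'lat', 'lon']
```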
Aliasing and re-scaling
Some primary datafiles may be inconsistent with expected standards (such as the CF convention) regarding the names of geophysical variables and/or their scaling. Generic services will soon be provided to the scripts in order to deal with such cases.
Chunking over time or space
Data chunking is a technique for dealing with very large datasets that raise memory-size issues: for instance, a space average is computed by looping over time periods which are small enough to fit in memory. CliMAF does not yet provide automated chunking.
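Since chunking is not automated, a diagnostic can implement the space-average-by-time-chunks idea itself. A toy numpy sketch (in a real script each slab would be read from NetCDF rather than sliced from an in-memory array):

```python
import numpy as np

def space_average_chunked(data, chunk=100):
    """Compute the spatial mean at each time step while only holding
    `chunk` time steps' worth of data at once. Illustrates the
    chunking idea described above; not a CliMAF service."""
    means = []
    for t0 in range(0, data.shape[0], chunk):
        slab = data[t0:t0 + chunk]            # one memory-sized chunk
        means.append(slab.mean(axis=(1, 2)))  # average over lat/lon
    return np.concatenate(means)

series = np.ones((250, 4, 4))   # 250 time steps on a toy 4x4 grid
avg = space_average_chunked(series, chunk=100)
```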
Declare an operator mycdo based on an off-the-shelf script/binary (cdo):
>>> cscript('mycdo','cdo ${operator} ${in} ${out}')
Use the defined operator in CliMAF: define a dataset tas_ds and apply mycdo to it, providing it with value timavg for argument operator:
>>> tas_ds = ds(project='example', simulation='AMIPV6ALB2G', variable='tas', period='1980-1981')
>>> tas_avg = mycdo(tas_ds,operator='timavg')
The script/binary is actually called e.g. when requesting a file with the content of object tas_avg, as in:
>>> filen = cfile(tas_avg)
which returns the filename:
/home/my/tmp/climaf_cache/4e/4.nc
...while the actual system call launched behind the curtain by CliMAF would look like:
$ cdo timavg /home/my/data/AMIP/AMIP_tas.nc /home/my/tmp/climaf_cache/4e/4.nc
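The step from the registered pattern to that system call is a keyword substitution: ${in} and ${out} are filled with the dataset and cache filenames, the other keywords with the arguments of the call. A sketch of that principle using Python's string.Template (the real mechanism is internal to CliMAF; this only mimics it on the example's own values):

```python
from string import Template

# Calling-sequence pattern as registered with cscript above
pattern = 'cdo ${operator} ${in} ${out}'

# Illustrative substitution: 'operator' comes from the mycdo() call,
# 'in' and 'out' are the input dataset file and the cache file
command = Template(pattern).substitute({
    'operator': 'timavg',
    'in': '/home/my/data/AMIP/AMIP_tas.nc',
    'out': '/home/my/tmp/climaf_cache/4e/4.nc',
})
```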
A diagnostic script is declared to CliMAF using function cscript with two arguments:

- one for the name of the ‘diagnostic operator’ to define (which is also the name of the python function that will be used in CliMAF for applying the script), and
- a second one providing a script calling-sequence pattern string,

such as in:

>>> cscript(<operator_name>, <calling_sequence_pattern>)
The script calling sequence syntax is documented with the cscript class.
While a basic script interfacing example is shown in Example for interfacing a diagnostic script with CliMAF, module standard_operators.py includes the actual, commented declarations of all standard operators defined in the current CliMAF version.
TBD
Please follow e.g. the documentation template.