Functions for data definition and access

Except for the first three paragraphs, this section is for advanced use. As a first step, you should consider using the built-in data definitions described at projects. You may need to come back to this section for reference.

ds : define a dataset object (actually a front-end for cdataset)

climaf.classes.ds(*args, **kwargs)[source]

Returns a dataset from its full Climate Reference Syntax string. Example

>>> ds('[1980].global.monthly.CNRM-CM5.r1i1p1.mon.Amon.atmos.last')

Also a shortcut for cdataset(), when used with only keyword arguments. Example

>>> cdataset(project='CMIP5', model='CNRM-CM5', experiment='historical', frequency='monthly', simulation='r2i3p9', domain=[40,60,-10,20], variable='tas', period='1980-1989', version='last')

For details, refer to the documentation of cdataset()

cdataset : define a dataset object

class climaf.classes.cdataset(**kwargs)[source]

Create a CliMAF dataset.

A CliMAF dataset is a description of what the data is (rather than the data itself or a file). It is basically a set of attribute-value pairs. The list of attributes actually used to describe a dataset is defined by the project it refers to.

To display the attributes you may use for a given project, type e.g.:

>>> cprojects["CMIP5"]

For further details on projects, see cproject

None of the project’s attributes are mandatory arguments, because all attributes default to the value set by cdef() (which also applies when None is provided as the value of an attribute)

Some attributes have a special format or processing :

  • period : see init_period()

  • domain : allowed values are either ‘global’ or a list for latlon corners ordered as in : [ latmin, latmax, lonmin, lonmax ]

  • variable : name of the geophysical variable ; this should be :

    • either a variable actually included in the datafiles,
    • or a ‘derived’ variable (see derive() ),
    • or, an aliased variable name (see alias() )
  • in project CMIP5, for the attributes (frequency, simulation, period, table): if any is ‘fx’ (or ‘r0i0p0’ for simulation), the others are forced to ‘fx’ (resp. ‘r0i0p0’) too.
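
The CMIP5 rule in the last item can be pictured with a small stand-alone function. This is only a sketch of the rule as stated, not CliMAF's own code; the helper name force_fx is hypothetical.

```python
def force_fx(attributes):
    """Return a copy of `attributes` with the CMIP5 'fx' rule applied.

    Hypothetical helper: it only mirrors the rule stated in the text.
    """
    fx_keys = ("frequency", "period", "table")
    result = dict(attributes)
    is_fixed = (any(result.get(key) == "fx" for key in fx_keys)
                or result.get("simulation") == "r0i0p0")
    if is_fixed:
        # One 'fx'-like value forces all the related attributes
        for key in fx_keys:
            result[key] = "fx"
        result["simulation"] = "r0i0p0"
    return result


orog = force_fx({"frequency": "fx", "simulation": "r1i1p1",
                 "period": "1980-1989", "table": "Amon"})
print(orog)
```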

Example, using no default value, and addressing some CMIP5 data

>>> cdataset(project='CMIP5', model='CNRM-CM5', experiment='historical', frequency='monthly',
...          simulation='r2i3p9', domain=[40,60,-10,20], variable='tas', period='1980-1989', version='last')

You may use wildcards (‘*’) in attribute values, and then call explore() to have CliMAF match these attributes against the available data

cdataset.explore: explore data and periods, and match wildcard attributes

cdataset.explore(option='check_and_store', sort_periods_on=None)[source]

Versatile datafile exploration for a dataset which possibly has wildcards (* and ? ) in attributes.

option can be :

  • ‘choices’ for returning a dict whose keys are wildcard attributes and whose entries are lists of values
  • ‘resolve’ for returning a NEW DATASET with instantiated attributes (if each wildcard has a unique match)
  • ‘ensemble’ for returning AN ENSEMBLE based on multiple possible values of a single attribute
  • ‘check_and_store’ (or missing) for just identifying and storing the list of dataset files (while checking that wildcard attributes are not ambiguous)

This feature works only for projects whose organization is of type ‘generic’

Attribute ‘period’ can use a wildcard only if its whole value is ‘*’; in that case, the period of all matching files will be either :

  • aggregated among all instances of all attributes with wildcards (default)
  • or aggregated after being sorted on attribute sort_periods_on, if provided
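
To make the ‘choices’ and ‘resolve’ options more concrete, here is a minimal sketch using shell-style matching from the standard fnmatch module. It is an illustration only: CliMAF scans data files, whereas this sketch assumes a hypothetical in-memory catalogue of attribute dicts, and the function names are invented for the example.

```python
from fnmatch import fnmatchcase


def choices(request, catalogue):
    """Map each wildcard attribute of `request` to its matching values."""
    wildcard_keys = [k for k, v in request.items() if "*" in v or "?" in v]
    matching = [entry for entry in catalogue
                if all(fnmatchcase(entry.get(k, ""), v)
                       for k, v in request.items())]
    return {k: sorted({entry.get(k, "") for entry in matching})
            for k in wildcard_keys}


def resolve(request, catalogue):
    """Return `request` with wildcards replaced, if each match is unique."""
    found = choices(request, catalogue)
    ambiguous = {k: v for k, v in found.items() if len(v) != 1}
    if ambiguous:
        raise ValueError("cannot resolve uniquely: %r" % ambiguous)
    resolved = dict(request)
    resolved.update({k: v[0] for k, v in found.items()})
    return resolved


# Hypothetical catalogue, standing for what a file scan could find
catalogue = [
    {"model": "CNRM-CM6-1", "experiment": "piControl"},
    {"model": "CNRM-ESM2-1", "experiment": "piControl"},
    {"model": "CNRM-CM6-1", "experiment": "piClim-control"},
]
found = choices({"model": "*", "experiment": "piControl"}, catalogue)
single = resolve({"model": "CNRM-ESM*", "experiment": "piControl"}, catalogue)
```

With more than one possible value for a wildcard, resolve raises an error, which mirrors the uniqueness condition of option ‘resolve’.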

Toy example

>>> rst=ds(project="example", simulation="*", variable="rst", period="1980-1981")
>>> rst

>>> rst.explore('choices')
{'simulation': ['AMIPV6ALB2G']}

>>> instanciated_dataset=rst.explore('resolve')
>>> instanciated_dataset

>>> my_ensemble=rst.explore('ensemble')
error    : "Creating an ensemble does not make sense because all wildcard attributes have a single possible value ({'simulation': ['AMIPV6ALB2G']})"

Real life example for options choices and ensemble

>>> rst=ds(project="CMIP6", model='*', experiment="*ontrol*", realization="r1i1p1f*", table="Amon", variable="rsut", period="1980-1981")
>>> clog('info')
>>> rst.explore('choices')
info     : Attribute institute has matching value CNRM-CERFACS
info     : Attribute experiment has multiple values : set(['piClim-control', 'piControl'])
info     : Attribute grid has matching value gr
info     : Attribute realization has matching value r1i1p1f2
info     : Attribute mip has multiple values : set(['CMIP', 'RFMIP'])
info     : Attribute model has multiple values : set(['CNRM-ESM2-1', 'CNRM-CM6-1'])
{'institute': ['CNRM-CERFACS'], 'experiment': ['piClim-control', 'piControl'], 'grid': ['gr'],
'realization': ['r1i1p1f2'], 'mip': ['CMIP', 'RFMIP'], 'model': ['CNRM-ESM2-1', 'CNRM-CM6-1']}

# Let us further select by setting experiment=piControl
>>> mrst=ds(project="CMIP6", model='*', experiment="piControl", realization="r1i1p1f*", table="Amon", variable="rsut", period="1980-1981")
>>> mrst.explore('choices')
{'institute': ['CNRM-CERFACS'], 'mip': ['CMIP'], 'model': ['CNRM-ESM2-1', 'CNRM-CM6-1'], 'grid': ['gr'], 'realization': ['r1i1p1f2']}
>>> small_ensemble=mrst.explore('ensemble')
>>> small_ensemble
      'CNRM-CM6-1' :ds('CMIP6%%rsut%1980-1981%global%/cnrm/cmip%CNRM-CM6-1%CNRM-CERFACS%CMIP%Amon%piControl%r1i1p1f2%gr%latest')

Identify period covered by data, and versions

>>> d=ds(project="CMIP6",experiment="piControl", realization='r1i1p1f2', variable="so",
... table="*", period="*" , model="*",version="*")
>>> clog('info')
>>> d.explore('choices')
info     : Attribute institute='*' has matching value 'CNRM-CERFACS'
info     : Attribute perios='*' has matching value [1850-2349]
info     : Attribute version='*' has multiple values : ['v0', 'v20180720', 'latest']
info     : Attribute grid='g*' has matching value 'gn'
info     : Attribute mip='*' has matching value 'CMIP'
info     : Attribute table='*' has matching value 'Omon'
info     : Attribute model='*' has multiple values : ['CNRM-ESM2-1', 'CNRM-CM6-1']
{'institute': 'CNRM-CERFACS', 'period': [1850-2349], 'version': ['v0', 'v20180720', 'latest'], 'grid': 'gn', 'table': 'Omon', 'mip': 'CMIP', 'model': ['CNRM-ESM2-1', 'CNRM-CM6-1']}

Analyze available periods for each value of a given attribute

>>> rsut=ds(project="CMIP6", model='*', experiment="piControl*", realization="r1i1p1f*", table="Amon", variable="rsut", period="*")
>>> rsut.explore('choices','model')
{'institute': 'CNRM-CERFACS', 'period': {'CNRM-ESM2-1': [1850-2349], 'CNRM-CM6-1': [1850-2349]},
   'experiment': 'piControl', 'grid': 'gr', 'realization': 'r1i1p1f2', 'mip': 'CMIP',
   'model': ['CNRM-ESM2-1', 'CNRM-CM6-1']}

# Could also be written : rsut.explore(option='choices',sort_periods_on='model')

cdataset.check: check time consistency of a dataset


Check time consistency of the first variable of a dataset or of ensemble members:

  • check if the first data time interval is consistent with the dataset frequency
  • check if file data have a gap
  • check if the period covered by data files actually includes the whole of the dataset period

Returns: True if the period of data files includes the dataset period, False otherwise.
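
The three checks can be sketched for the monthly case with plain Python. This is an assumption-laden illustration, not the method's implementation: time steps are given as hypothetical (year, month) pairs, and the function name check_monthly is invented.

```python
def month_index(year, month):
    """Number of months since year 0, so that consecutive months differ by 1."""
    return year * 12 + (month - 1)


def check_monthly(months, period):
    """Return (first_step_ok, no_gap, covers_period) for sorted monthly steps.

    months: list of (year, month) pairs found in the data files
    period: ((first_year, first_month), (last_year, last_month)), inclusive
    """
    index = [month_index(y, m) for y, m in months]
    steps = [b - a for a, b in zip(index, index[1:])]
    first_step_ok = not steps or steps[0] == 1   # first interval is one month
    no_gap = all(step == 1 for step in steps)    # no missing month anywhere
    first, last = (month_index(y, m) for y, m in period)
    covers = bool(index) and index[0] <= first and index[-1] >= last
    return first_step_ok, no_gap, covers


complete = check_monthly([(1980, 1), (1980, 2), (1980, 3)],
                         ((1980, 1), (1980, 3)))
with_gap = check_monthly([(1980, 1), (1980, 2), (1980, 4)],
                         ((1980, 1), (1980, 6)))
```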


>>> # Dataset with monthly frequency
>>> tas=ds(project='example', simulation='AMIPV6ALB2G', variable='tas',period='1980-1981')
>>> res1=tas.check()
>>> # Ensemble with monthly frequency
>>> j0=ds(project='example',simulation='AMIPV6ALB2G', variable='tas', frequency='monthly', period='1980')
>>> j1=ds(project='example',simulation='AMIPV6ALB2G', variable='tas', frequency='monthly', period='1981')
>>> ens=cens({'1980':j0, '1981':j1})
>>> res2=ens.check()
>>> # Define a new project for 'em' data with 3 hours frequency in particular
>>> cproject('em_3h','root','group','realm','frequency',separator='|')
>>> path='/cnrm/cmip/cnrm/simulations/${group}/${realm}/Regu/${frequency}/${simulation}/${variable}_??'
>>> dataloc(project='em_3h', organization='generic', url=path)
>>> # Dataset with 3h frequency for 'tas' variable (instant)
>>> tas_3h=ds(project='em_3h',variable='tas',group='AR4',realm='Atmos',frequency='3Hourly', simulation='A1B',period='2050-2100')
>>> res3=tas_3h.check()
>>> # Dataset with 3h frequency for 'pr' variable (time mean)
>>> pr_3h=ds(project='em_3h',variable='pr',group='AR4',realm='Atmos',frequency='3Hourly', simulation='A1B',period='2050-2100')
>>> res4=pr_3h.check()

cdataset.listfiles: returns the list of (local) files of a dataset


Returns the list of (local or remote) files which include the data for the dataset

Use cached value unless called with arg force=True
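
The caching behaviour can be pictured as follows. This is a generic sketch of the pattern, assuming a hypothetical scan function; the class name CachedFileList is invented and is not part of CliMAF.

```python
class CachedFileList:
    """Remember the result of a costly file scan until a refresh is forced."""

    def __init__(self, scan):
        self._scan = scan      # callable returning the list of files
        self._files = None     # nothing cached yet

    def listfiles(self, force=False):
        if force or self._files is None:
            self._files = self._scan()
        return self._files


calls = []


def fake_scan():
    """Stand-in for a real file system or remote scan."""
    calls.append(1)
    return ["tas_1980.nc", "tas_1981.nc"]


lister = CachedFileList(fake_scan)
lister.listfiles()             # scans
lister.listfiles()             # served from the cache
lister.listfiles(force=True)   # scans again
```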

cdef : define some default values for datasets attributes

climaf.classes.cdef(attribute, value=None, project=None)[source]

Set or get the default value for a CliMAF dataset attribute or facet (such as e.g. ‘model’, ‘simulation’ ...), for use by next calls to cdataset() or to ds()

Argument ‘project’ restricts the use/query of the default value to the context of the given ‘project’. One can also set the (global) default value for attribute ‘project’

There is no actual check that ‘attribute’ is a valid keyword for a call to ds or cdataset


>>> cdef('project','OCMPI5')
>>> cdef('frequency','monthly',project='OCMPI5')
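
A possible way to picture how defaults are stored and looked up is sketched below. It assumes that a project-specific default takes precedence over the global one, which the text does not state explicitly; all function names are hypothetical.

```python
_defaults = {}   # {project name or None (global): {attribute: value}}


def set_default(attribute, value, project=None):
    """Record a default, either global (project=None) or project-specific."""
    _defaults.setdefault(project, {})[attribute] = value


def get_default(attribute, project=None):
    """Look up the project-specific default first, then the global one."""
    for scope in (project, None):
        if attribute in _defaults.get(scope, {}):
            return _defaults[scope][attribute]
    return None


def fill_defaults(project=None, **attributes):
    """Replace missing or None attribute values by their defaults."""
    names = (set(attributes) | set(_defaults.get(project, {}))
             | set(_defaults.get(None, {})))
    return {name: attributes[name] if attributes.get(name) is not None
            else get_default(name, project)
            for name in names}


set_default("model", "CNRM-CM5")
set_default("frequency", "monthly", project="OCMPI5")
filled = fill_defaults(project="OCMPI5", variable="tas", model=None)
```

Here an explicit None is treated like a missing value, as described for cdataset().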

eds : define an ensemble of datasets


Create a dataset ensemble using the same calling sequence as cdataset(), except that one of the facets is a list, which defines the ensemble members; this facet must be among the facets authorized for ensemble in the (single) project involved


>>> cdef("frequency","monthly") ;  cdef("project","CMIP5"); cdef("model","CNRM-CM5")
>>> cdef("variable","tas"); cdef("period","1860")
>>> ens=eds(experiment="historical", simulation=["r1i1p1","r2i1p1"])
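
The idea of turning one list-valued facet into ensemble members can be sketched in plain Python. The helper expand_ensemble is hypothetical, and using the facet values as member labels is an assumption made for the illustration.

```python
def expand_ensemble(**facets):
    """Build {label: facets of one member} from a single list-valued facet."""
    listed = [name for name, value in facets.items()
              if isinstance(value, list)]
    if len(listed) != 1:
        raise ValueError("exactly one facet must be a list, got %r" % listed)
    key = listed[0]
    # Each member gets the same facets, except for the list-valued one
    return {value: dict(facets, **{key: value}) for value in facets[key]}


members = expand_ensemble(experiment="historical",
                          simulation=["r1i1p1", "r2i1p1"])
print(sorted(members))
```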

cens : define an ensemble of objects

class climaf.classes.cens(dic={}, order=None, sortfunc=None)[source]

Function cens creates a CliMAF object of class cens, i.e. a dict of objects, whose keys are member labels, and whose members are ordered, using method set_order

In some cases, ensembles of datasets from the same project can also be built easily using eds()

When applying an operator to an ensemble, CliMAF will know, from the operator’s declaration (see cscript()), whether the operator ‘wishes’ to get the ensemble or, conversely, is not ‘ensemble-capable’ :

  • if the operator is ensemble-capable, CliMAF will deliver the ensemble to it :
    • if it is a script : with a string composed by concatenating the corresponding input files; it will also provide the labels list to the script if its declaration calls for it with keyword ${labels} (see cscript())
    • if it is a Python function : with the dict of corresponding objects
  • if the operator is ‘ensemble-dumb’, CliMAF will loop applying it on each member, and will form a new ensemble with the results.
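
The two dispatch cases can be summarised by a short sketch. It only covers operators written as Python functions and is not CliMAF's driver code; apply_operator and its ensemble_capable flag are hypothetical names.

```python
from collections import OrderedDict


def apply_operator(operator, ensemble, ensemble_capable=False):
    """Apply `operator` to a whole ensemble, or to each member in turn."""
    if ensemble_capable:
        # The operator receives the dict of members in one call
        return operator(ensemble)
    # 'Ensemble-dumb' operator: loop on members and build a new ensemble
    return OrderedDict((label, operator(member))
                       for label, member in ensemble.items())


ensemble = OrderedDict([("1980", 1.0), ("1981", 3.0)])
doubled = apply_operator(lambda member: 2 * member, ensemble)
mean = apply_operator(lambda members: sum(members.values()) / len(members),
                      ensemble, ensemble_capable=True)
```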

The dict keys must be label strings, which describe what basically differs among members. They are usually used by plot scripts to provide a caption that identifies each dataset/object, e.g. using various colors.

Examples (see also ../examples/) :

>>> cdef('project','example'); cdef('simulation',"AMIPV6ALB2G");
>>> cdef('variable','tas');cdef('frequency','monthly')
>>> #
>>> ds1980=ds(period="1980")
>>> ds1981=ds(period="1981")
>>> #
>>> myens=cens({'1980':ds1980 , '1981':ds1981 })
>>> ncview(myens)  # will launch ncview once per member
>>> myens=cens({'1980':ds1980 , '1981':ds1981 }, order=['1981','1980'])
>>> myens.set_order(['1981','1980'])
>>> # Add a member
>>> myens['abcd']=ds(period="1982")

Limitations : Even if an ensemble is a dict, some dict methods are not properly implemented (popitem, fromkeys) and function iteritems does not use member order
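
As an illustration of a dict whose members keep an explicit order, here is a minimal stand-alone class. It is not the cens implementation: the class name, the default order (sorted labels) and the choice to append new members at the end are all assumptions.

```python
class OrderedEnsemble(dict):
    """A dict of members with an explicit list giving the member order."""

    def __init__(self, members=None, order=None):
        self.order = []
        super().__init__(members or {})
        # Assumption: without an explicit order, labels are sorted
        self.order = list(order) if order is not None else sorted(self)

    def set_order(self, order):
        if sorted(order) != sorted(self):
            raise ValueError("order must list each member label once")
        self.order = list(order)

    def __setitem__(self, label, member):
        super().__setitem__(label, member)
        if label not in self.order:
            self.order.append(label)   # assumption: new members go last

    def ordered_items(self):
        return [(label, self[label]) for label in self.order]


ens = OrderedEnsemble({"1980": "ds1980", "1981": "ds1981"},
                      order=["1981", "1980"])
ens["abcd"] = "ds1982"
print(ens.ordered_items())
```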

You can write an ensemble to a file using function efile()

fds : define a dataset from a data file

climaf.classes.fds(filename, simulation=None, variable=None, period=None, model=None)[source]

fds stands for FileDataSet; it creates a dataset simply from a filename and, optionally, a simulation name, a variable name, a period and a model name.

For dataset attributes which are not provided, these defaults apply :

  • simulation : the filename basename (without suffix ‘.nc’)
  • variable : the set of variables in the data file
  • period : the period actually covered by the data file (if it has time_bnds)
  • model : the ‘model_id’ attribute if it exists, otherwise : ‘no_model’
  • project : ‘file’ (with separator = ‘|’)
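
Some of these defaults are easy to picture in code. The sketch below covers only the simulation, model and project defaults; reading variables and periods requires opening the NetCDF file and is left out. The helper name and the global_attributes argument (a dict standing for the file's global attributes) are hypothetical.

```python
import os


def default_attributes(filename, global_attributes):
    """Derive default dataset attributes from a file name and its metadata."""
    base = os.path.basename(filename)
    # Strip the '.nc' suffix, if any, to get the default simulation name
    simulation = base[:-3] if base.endswith(".nc") else base
    return {
        "simulation": simulation,
        "model": global_attributes.get("model_id", "no_model"),
        "project": "file",
    }


attrs = default_attributes("/data/tas_AMIPV6ALB2G.nc", {})
print(attrs)
```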

The following restriction applies to such datasets :

Results are unpredictable if all variables do not have the same time axis

Examples : See

cproject : declare a new project and its non-standard attributes/facets

class climaf.classes.cproject(name, *args, **kwargs)[source]

Declare a project and its facets/attributes in CliMAF (see below)

  • name (string) – project name; do not use the chosen separator in it (see below)
  • args (strings) – attribute names; they are free; do not use the chosen separator in them (see below); CliMAF will in any case add attributes : project, simulation, variable, period, and domain
  • kwargs (dict) –

    can only be used with keywords :

    • sep or separator for indicating the symbol separating facets in the dataset syntax. Defaults to ‘.’.
    • ensemble for declaring a list of attribute names which are allowed for defining an ensemble in this project (‘simulation’ is automatically allowed)

Returns : a cproject object, whose string representation is the pattern later used in CliMAF Reference Syntax for representing datasets in this project

A ‘cproject’ is the definition of a set of attributes, or facets, whose values will completely define a ‘dataset’ as managed by CliMAF. Its name is one of the possible keys for describing data locations (see dataloc)

For instance, cproject CMIP5, after its Data Reference Syntax, has attributes : model, simulation (used for rip), experiment, variable, frequency, realm, table, version

A number of projects are built-in. See projects

A dataset in a cproject declared as

>>> cproject('MINE','myfreq','myfacet',sep='_')

will return


and will have datasets represented as e.g.:


while an example for built-in cproject CMIP5 will be:


The attributes list should include all facets which are useful for distinguishing datasets from each other, and for computing datafile pathnames in the ‘generic’ organization (see dataloc)
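
How a facet list and a separator combine into a pattern can be sketched with the standard string.Template class. The facet order used here (the five attributes CliMAF always adds, then the declared ones) is an assumption modelled on the default pattern ${project}.${simulation}.${variable}.${period}.${domain}; build_pattern is a hypothetical helper, not the cproject code.

```python
from string import Template

# Attributes that CliMAF adds to every project, according to the text
STANDARD_FACETS = ["project", "simulation", "variable", "period", "domain"]


def build_pattern(*facets, sep="."):
    """Join facet placeholders with the project's separator."""
    names = STANDARD_FACETS + [f for f in facets if f not in STANDARD_FACETS]
    return sep.join("${%s}" % name for name in names)


pattern = build_pattern("myfreq", "myfacet", sep="_")
text = Template(pattern).substitute(
    project="MINE", simulation="s1", variable="tas", period="1980",
    domain="global", myfreq="monthly", myfacet="x")
print(pattern)
print(text)
```

The sketch also shows why facet names and values should not contain the separator: the rendered string could no longer be split back into facets unambiguously.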

A default value for a given facet can be specified, by providing a tuple (facet_name,default_value) instead of the facet name. This default value is however of lower priority than the value set using cdef()

A project can be declared as having non-standard variable names in datafiles, or variables that should undergo re-scaling; see calias()

A project can be declared as having non-standard frequency names (this is used when accessing datafiles); see cfreqs()

cprojects : dictionary of known projects

climaf.classes.cprojects = {None: ${project}.${simulation}.${variable}.${period}.${domain}}

Dictionary of declared projects (type is cproject)

dataloc : describe data locations for a series of simulations

class climaf.dataloc.dataloc(project='*', organization='generic', url=None, model='*', simulation='*', realm='*', table='*', frequency='*')[source]

Create an entry in the data locations dictionary for an ensemble of datasets.

  • project (str,optional) – project name
  • model (str,optional) – model name
  • simulation (str,optional) – simulation name
  • frequency (str,optional) – frequency
  • organization (str) – name of the organization type, among those handled by selectFiles()
  • url (list of strings) – list of URLs for the data root directories, local or remote

Each entry in the dictionary stores :

  • a list of paths or URLs (local or remote), which are root paths for finding some sets of datafiles which share a file organization scheme.

    • For remote data:

      url is expected in the format ‘protocol:user@host:path’, where ‘protocol’ and ‘user’ are optional. So, url can also be ‘user@host:path’ or ‘protocol:host:path’ or ‘host:path’. ftp is the default protocol (and the only one handled so far).

      If ‘user’ is given:

      • if ‘host’ is in the $HOME/.netrc file, CliMAF checks whether the corresponding ‘login’ equals ‘user’. If it does, CliMAF gets the associated password; otherwise it prompts the user for the password;
      • if ‘host’ is not present in the $HOME/.netrc file, CliMAF prompts the user for the password.

      If ‘user’ is not given:

      • if ‘host’ is in the $HOME/.netrc file, CliMAF takes the corresponding ‘login’ as ‘user’ and also gets the associated password;
      • if ‘host’ is not present in the $HOME/.netrc file, CliMAF prompts the user for ‘user’ and ‘password’.

      Remark: The .netrc file contains the login and password used by the auto-login process. It generally resides in the user’s home directory ($HOME/.netrc). It is highly recommended to supply this information in the .netrc file, so as not to have to enter a password for every request.

      Warning: the Python netrc module does not handle multiple entries for a single host. So, if the netrc file has two entries for the same host, the netrc module only returns the last entry.

      We distinguish two kinds of hosts: hosts with evolving files, e.g. ‘beaufix’, and the others.

      For any file returned by function listfiles() which is found in cache:

      • in case of hosts with evolving files, the file is transferred only if its date on the server is more recent than that found in cache;
      • for other hosts, the file found in cache is used
  • the name for the corresponding data files organization scheme. The current set of known schemes is :

    • CMIP5_DRS : any datafile organized after the CMIP5 data reference syntax, such as on IPSL’s Ciclad and CNRM’s Lustre
    • EM : CNRM-CM post-processed outputs as organized using EM (please provide a list containing any single string for argument url)
    • generic : a data organization described by the user, using patterns such as described for selectGenericFiles(). This is the default

    Please ask the CliMAF dev team for implementing further organizations. It is quite quick for data which are on the filesystem. Organizations considered for future implementations are :

    • NetCDF model outputs as available during an ECLIS or libIGCM simulation
    • ESGF
  • the set of attribute values identifying the simulations whose data are stored at these URLs and with that organization

    For remote files, the filename pattern must include ${varname}, which is instantiated by the variable name or by filenameVar (given via calias()), for the sake of efficiency. Please complain if this is inadequate

For the sake of brevity, each attribute can have the ‘*’ wildcard value; when using the dictionary, the most specific entries will be used (which means : the entry (or entries) with the lowest number of wildcards)
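
The remote URL format ‘protocol:user@host:path’ described above can be split with a few string operations. This sketch assumes that neither the host nor the path contains a colon; parse_remote_url is a hypothetical helper, not a CliMAF function.

```python
def parse_remote_url(url, default_protocol="ftp"):
    """Split 'protocol:user@host:path' into (protocol, user, host, path).

    'protocol' and 'user' are optional; user is None when it is absent.
    """
    parts = url.split(":")
    if len(parts) == 3:
        protocol, host_part, path = parts
    elif len(parts) == 2:
        protocol = default_protocol
        host_part, path = parts
    else:
        raise ValueError("unexpected remote URL format: %r" % url)
    user, _, host = host_part.rpartition("@")
    return protocol, (user or None), host, path


full = parse_remote_url("ftp:vignonl@hendrix:/home/vignonl")
short = parse_remote_url("beaufix:/home/gmgec")
```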
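
The ‘most specific entry wins’ rule can be sketched as: keep the entries compatible with the request, then keep those with the fewest wildcards. The entries below are plain dicts standing for dataloc declarations; the function name and the restricted set of keys are assumptions made for this illustration.

```python
KEYS = ("project", "model", "simulation", "frequency")


def most_specific(entries, **request):
    """Return the matching entries which have the lowest number of '*'."""
    def matches(entry):
        return all(entry.get(key, "*") in ("*", request.get(key))
                   for key in KEYS)

    def wildcard_count(entry):
        return sum(entry.get(key, "*") == "*" for key in KEYS)

    candidates = [entry for entry in entries if matches(entry)]
    if not candidates:
        return []
    best = min(wildcard_count(entry) for entry in candidates)
    return [entry for entry in candidates if wildcard_count(entry) == best]


entries = [
    {"project": "PRE_CMIP6", "model": "IPSLCM-Z-HR",
     "organization": "CMIP6_DRS"},
    {"project": "PRE_CMIP6", "model": "IPSLCM-Z-HR", "simulation": "my_exp",
     "organization": "EM"},
]
general = most_specific(entries, project="PRE_CMIP6", model="IPSLCM-Z-HR",
                        simulation="other_exp")
exception = most_specific(entries, project="PRE_CMIP6", model="IPSLCM-Z-HR",
                          simulation="my_exp")
```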

Example :

  • Declaring that all IPSLCM-Z-HR data for project PRE_CMIP6 are stored under a single root path and follow the organization named CMIP6_DRS:

    >>> dataloc(project='PRE_CMIP6', model='IPSLCM-Z-HR', organization='CMIP6_DRS', url=['/prodigfs/esg/'])
  • and declaring an exception for one simulation (here, both location and organization are supposed to be different):

    >>> dataloc(project='PRE_CMIP6', model='IPSLCM-Z-HR', simulation='my_exp', organization='EM', url=['~/tmp/my_exp_data'])
  • and declaring a project to access remote data (on multiple servers):

    >>> cproject('MY_REMOTE_DATA', ('frequency', 'monthly'), separator='|')
    >>> dataloc(project='MY_REMOTE_DATA', organization='generic',url=['beaufix:/home/gmgec/mrgu/vignonl/*/${simulation}',
    ... 'ftp:vignonl@hendrix:/home/vignonl/${model}/${variable}_1m_YYYYMM_YYYYMM_${model}.nc'])
    >>> calias('MY_REMOTE_DATA','tas','tas',filenameVar='2T')
    >>> tas=ds(project='MY_REMOTE_DATA', simulation='AMIPV6ALB2G', variable='tas', frequency='monthly', period='198101')

Please refer to the example section of the documentation for an example with each organization scheme

derive : define a variable as computed from other variables

climaf.operators.derive(project, derivedVar, Operator, *invars, **params)[source]

Define that ‘derivedVar’ is a derived variable in ‘project’, computed by applying ‘Operator’ to input streams, which are datasets whose variable names take the values in *invars, with the parameters/arguments of Operator taking the values in **params

‘project’ may be the wildcard : ‘*’

Example, assuming that operator ‘minus’ has been defined as

>>> cscript('minus','cdo sub ${in_1} ${in_2} ${out}')

which means that minus uses CDO for subtracting the two datasets; you may define, for a given project ‘CMIP5’, a new variable e.g. for cloud radiative effect at the surface, named ‘rscre’, using the difference of values of all-sky and clear-sky net radiation at the surface by:

>>> derive('CMIP5', 'rscre','minus','rs','rscs')

You may then use this variable name at any location you would use any other variable name
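
The mechanism can be pictured as a registry that maps a derived variable to an operator and its inputs. The sketch below works on plain numbers and Python functions, whereas CliMAF works on datasets and declared scripts; every name in it is hypothetical.

```python
derived_variables = {}   # {project or '*': {variable: (operator, invars, params)}}
operators = {"minus": lambda a, b: a - b}   # stand-in for declared operators


def declare_derived(project, derived_var, operator, *invars, **params):
    """Record how `derived_var` is computed in `project`."""
    derived_variables.setdefault(project, {})[derived_var] = (
        operator, invars, params)


def find_derivation(project, variable):
    """Look in the given project first, then in the wildcard project '*'."""
    for scope in (project, "*"):
        if variable in derived_variables.get(scope, {}):
            return derived_variables[scope][variable]
    return None


def evaluate(project, variable, data):
    """Return the value of `variable`, computing it if it is derived."""
    derivation = find_derivation(project, variable)
    if derivation is None:
        return data[variable]          # a variable actually present in data
    operator, invars, params = derivation
    inputs = [evaluate(project, name, data) for name in invars]
    return operators[operator](*inputs, **params)


declare_derived("CMIP5", "rscre", "minus", "rs", "rscs")
rscre = evaluate("CMIP5", "rscre", {"rs": 150.0, "rscs": 170.0})
```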

Note : you may use wildcard ‘*’ for the project

Another example is rescaling or renaming some variable; here, let us define how variable ‘ta’ can be derived from ERAI variable ‘t’ :

>>> derive('erai', 'ta','rescale', 't', scale=1., offset=0.)

However, this is not the most efficient way to do that. See calias()

Expert use : argument ‘derivedVar’ may be a dictionary, whose keys are derived variable names and whose values are script output names; example

>>> cscript('vertical_interp', ' ${in} surface_pressure=${in_2} ${out_l500} ${out_l850} method=${opt}')
>>> derive('*', {'z500' : 'l500' , 'z850' : 'l850'},'vertical_interp', 'zg', 'ps', opt='log')

calias : define a variable as computed, in a project, from another, single, variable

climaf.classes.calias(project, variable, fileVariable=None, scale=1.0, offset=0.0, units=None, missing=None, filenameVar=None)[source]

Declare that in project, variable is to be computed by reading fileVariable, and applying scale and offset;

Arg filenameVar tells which fake variable name should be used when computing the filename for this variable in this project (for optimisation purposes);

Arg missing tells that a given constant must be interpreted as a missing value

variable may be a list. In that case, fileVariable and filenameVar, if provided, should be parallel lists

variable can be a comma-separated list of variables, in which case this tells how variables are grouped in files (it makes sense to use filenameVar in that case, as this is a way to provide the label which is unique to this grouping of variables); scale, offset and missing args must be the same for all variables in that case


>>> calias('erai','tas','t2m',filenameVar='2T')
>>> calias('erai','tas_degC','t2m',scale=1., offset=-273.15)  # scale and offset may be provided
>>> calias('EM',[ 'sic', 'sit', 'sim', 'snd', 'ialb', 'tsice'], missing=1.e+20)
>>> calias('data_CNRM','so,thetao',filenameVar='grid_T_table2.2')
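
The effect of scale, offset and missing on values read from a file can be sketched as below. The formula value * scale + offset is an assumption, suggested by the ‘tas_degC’ example (offset=-273.15); the helper name is hypothetical, and None is used here to flag missing values.

```python
def apply_alias(values, scale=1.0, offset=0.0, missing=None):
    """Rescale file values, turning the `missing` constant into None."""
    return [None if (missing is not None and value == missing)
            else value * scale + offset
            for value in values]


# Kelvin to degrees Celsius, with 1.e+20 flagged as a missing value
celsius = apply_alias([273.15, 1.e+20], scale=1., offset=-273.15,
                      missing=1.e+20)
print(celsius)
```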

NB: A wrapper with the same name as this function is defined in climaf.driver.calias(), and it is the one exported by module climaf.api. It allows using a list of variables.

climaf.driver.calias(project, variable, fileVariable=None, **kwargs)[source]

See climaf.classes.calias()

Declare that in project, variable is to be computed by reading fileVariable. It allows using a list of variables, given as a string where the variable names are separated by commas

cfreqs : declare non-standard frequency names, for a project

climaf.classes.cfreqs(project, dic)[source]

Declare a dictionary specific to project for matching normalized frequency values to project-specific frequency values

Normalized frequency values are :
decadal, yearly, monthly, daily, 6h, 3h, fx and annual_cycle

When defining a dataset, any reference to a non-standard frequency will be left unchanged both in the dataset’s CRS and when trying to access corresponding datafiles


>>> cfreqs('CMIP5',{'monthly':'mon' , 'daily':'day' })
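
The translation rule, including the pass-through of non-standard values, fits in a dictionary lookup with a fallback. This is a sketch with hypothetical names, not the cfreqs implementation.

```python
project_frequencies = {}   # {project: {normalized value: project value}}


def declare_frequencies(project, mapping):
    """Record the project-specific names of normalized frequencies."""
    project_frequencies[project] = dict(mapping)


def project_frequency(project, frequency):
    """Translate a frequency, leaving unknown values unchanged."""
    return project_frequencies.get(project, {}).get(frequency, frequency)


declare_frequencies("CMIP5", {"monthly": "mon", "daily": "day"})
translated = project_frequency("CMIP5", "monthly")
unchanged = project_frequency("CMIP5", "3hr_custom")
```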

crealms : declare non-standard realm names, for a project

climaf.classes.crealms(project, dic)[source]

Declare a dictionary specific to project for matching normalized realm names to project-specific realm names

Normalized realm names are :
atmos, ocean, land, seaice

When defining a dataset, any reference to a non-standard realm will be left unchanged both in the dataset’s CRS and when trying to access corresponding datafiles


>>> crealms('CMIP5',{'atmos':'ATM' , 'ocean':'OCE' })