Pre-defined projects and datasets

CliMAF knows a bunch of datasets. The package projects is devoted to that:

The package projects declares a number of ‘projects’, and the data locations for these projects, at CNRM or on Ciclad, when they exist. All its modules are automatically loaded when importing climaf.api or when launching CliMAF with climaf.

The concept of a ‘project’ in CliMAF is explained with function cproject(). It allows declaring non-standard variable names, scaling parameters…

Please note that, for some combinations of observation ‘projects’ and variables (namely ‘snm’ in erai, ‘pr’ in gpcp and cruts3), CliMAF provides a flux variable, while the original data provides a monthly accumulation. In that case, the conversion to rates assumes a fixed month length of 30.3 days (to ensure minimal bias at the yearly scale).
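As a rough illustration of that conversion, here is a minimal sketch. It assumes the monthly accumulation is expressed in kg m-2 (equivalently mm of water) and that the target flux is in kg m-2 s-1; the function name is hypothetical and is not part of CliMAF.

```python
SECONDS_PER_DAY = 86400.0
FIXED_MONTH_LENGTH_DAYS = 30.3  # fixed month length quoted in the text


def monthly_accumulation_to_flux(accumulation_kg_m2):
    """Convert a monthly accumulation (kg m-2) to a mean flux (kg m-2 s-1).

    Hypothetical helper: it only illustrates the fixed-month-length assumption.
    """
    return accumulation_kg_m2 / (FIXED_MONTH_LENGTH_DAYS * SECONDS_PER_DAY)


# 100 mm accumulated over one month, whatever the actual month
flux = monthly_accumulation_to_flux(100.0)
print(flux)
```

With a fixed length, February values are slightly underestimated and 31-day months slightly overestimated, which is the price paid for a small bias over a full year.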

To list the declared projects and their specifics from the Python prompt, type e.g.:

>>> import climaf
>>> dir(climaf.projects)
>>> help(climaf.projects.cmip5)

To find out the specifics of variables for a given project (e.g. re-scaling), type:

>>> from climaf.api import *
>>> aliases["erai"]

and interpret a result such as:

'erai': {'clt': ('tcc', 1.0, 0.0, None, 'TCC', None),
         'das': ('d2m', 1.0, 0.0, None, '2D', None),
...

by: in project ‘erai’, standard variable ‘clt’ is read from data variable ‘tcc’ with scaling=1, offset=0, and no change in units name; ‘TCC’ is the variable name used in computing the data filename; and there is no special missing value in addition to the one duly declared in the data file.
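The reading of such a tuple can be sketched as follows. The field names are illustrative only (they are not CliMAF identifiers), and the linear form scale * value + offset is an assumption about how the scaling and offset are applied.

```python
from collections import namedtuple

# Field order follows the interpretation given in the text; names are illustrative.
Alias = namedtuple(
    "Alias", "file_variable scale offset units filename_variable missing")


def apply_alias(raw_value, alias):
    """Rescale a value read from the data file (assumed: scale * value + offset)."""
    return alias.scale * raw_value + alias.offset


# The 'clt' entry of the 'erai' example
clt = Alias(*('tcc', 1.0, 0.0, None, 'TCC', None))
print(clt.file_variable, clt.filename_variable)
print(apply_alias(0.75, clt))
```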

cmip6

This module declares locations for searching data for CMIP6 outputs organized according to CMIP6 DRS

Attributes for CMIP6 datasets are: model, experiment, table, realization, grid, version, institute, mip, root

Syntax for these attributes is described in the CMIP6 DRS document

Example for a CMIP6 dataset declaration

>>> tas1pc=ds(project='CMIP6', model='CNRM-CM6-1', experiment='1pctCO2', variable='tas', table='Amon',
...           realization='r3i1p1f2', period='1860-1861')

cmip5

This module declares locations for searching data for CMIP5 outputs produced by libIGCM or Eclis for all frequencies.

Attributes for CMIP5 datasets are: model, experiment, table, realization, grid, version, institute, mip, root

Syntax for these attributes is described in the CMIP5 DRS document (http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf)

Example for a CMIP5 dataset declaration:

>>> tas1pc = ds(project='CMIP5', model='CNRM-CM6-1', experiment='1pctCO2', variable='tas', table='Amon',
...             realization='r3i1p1f2', period='1860-1861')

ocmip5

This module declares how to access OCMIP5 data on Ciclad.

Use attributes ‘model’ and ‘frequency’

Example of a path: /prodigfs/project/OCMIP5/OUTPUT/IPSL/IPSL-CM4/CTL/mon/CACO3/CACO3_IPSL_IPSL-CM4_CTL_1860-1869.nc

Example

>>> cdef('model','IPSL-CM4')
>>> cdef('frequency','monthly')
>>> cactl=ds(project='OCMIP5_Ciclad', simulation='CTL', variable='CACO3', period='1860-1861')

ref_climatos_and_ts

This module declares two ‘projects’ for a set of reference products as managed by J. Servonnat at IPSL:

  • ‘ref_climatos’, for the climatological annual cycles
  • ‘ref_ts’, for the ‘time series’ (one variable evolving with time)

This archive is available on Ciclad (IPSL), Curie (TGCC) and Ada (IDRIS), under /cnrm, and at Cerfacs.

The specific attributes are:

  • product (default:’*’): name of the observation or reanalysis product (example: ERAI, GPCP…)
  • clim_period (for climatologies only): a character string; there is no period selection mechanism such as the one available with ‘period’

Default values of the attributes for climatologies (ref_climatos):

  • product : ‘*’
  • variable : ‘*’
  • period : ‘fx’
  • frequency : ‘annual_cycle’

It is possible to pass a list of products to ‘product’ to define an ensemble of climatologies with eds() as in:

>>> dat_ens = eds(project='ref_climatos', product=['ERAI','NCEP'],...)

Default values of the attributes for time_series (ref_ts):

  • product : ‘*’
  • period : ‘1900-2050’
  • frequency : ‘monthly’

Example of a ‘ref_ts’ project dataset declaration

>>> cdef('project','ref_ts')
>>> d=ds(variable='tas',period='198001'....)

igcm_out

This module declares locations for searching data for IGCM outputs produced by libIGCM for all frequencies, on Ciclad and at TGCC.

The project IGCM_OUT offers many possible keywords (facets) to determine the dataset precisely and make data location as efficient as possible. We have chosen to provide wildcards (*) as default values for many keywords. This way, ds() has a greater chance of returning a result to the user (even if it includes too many simulations), even if the user specifies just a few keywords.

Three projects are available to access the IGCM_OUT outputs; they are aimed at dealing with the diversity of variable names seen among the IPSL outputs (which can vary with time and users). They all provide aliases to the CMIP variable names and to the old names (taking advantage of the mechanisms linked with calias):

  • IGCM_OUT : corresponds to the most up-to-date combination of variable names (mix of CMIP and old names)
  • IGCM_OUT_old : links with the old variable names
  • IGCM_OUT_CMIP : simply uses calias to provide the scale, offset and filenameVar

The attributes are:
  • root : path (without the login) to the top of the IGCM_OUT tree
  • login : login of the producer of the simulation
  • model : explicit
  • experiment : piControl, historical, amip…
  • status : DEVT, PROD, TEST
  • simulation : name of the numerical simulation (JobName in the IGCM syntax)
  • DIR : ATM, OCE, SRF…
  • OUT : Analyse, Output
  • frequency : monthly, daily, annual_cycle (equivalent to ‘seasonal’)
  • ave_length : MO, DA (optional, but can reduce the time ds() needs to locate the data)
  • period : explicit
  • variable : explicit
  • clim_period : a character string; there is no period selection mechanism such as the one available with ‘period’
  • clim_period_length : can be set to ‘_50Y’ or ‘_100Y’ to access the annual cycles averaged over 50-year or 100-year periods

Default values of the attributes:
  • root : ‘/ccc/store/cont003/dsm’ (at TGCC)
  • login : ‘*’
  • model : ‘*’
  • experiment : ‘*’
  • status : ‘*’
  • simulation : ‘*’
  • DIR : ‘*’
  • OUT : ‘*’
  • frequency : ‘monthly’
  • ave_length : ‘*’
  • period : ‘fx’
  • variable : ‘*’
  • clim_period : ‘????_????’
  • clim_period_length : ‘*’
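
The wildcard defaults above read like shell patterns: ‘*’ stands for any string and ‘?’ for exactly one character. The following minimal sketch uses Python's standard fnmatch module to show what the default clim_period pattern accepts; it assumes that CliMAF's matching follows the same shell-like rules, and it does not reproduce the actual file search performed by ds().

```python
from fnmatch import fnmatchcase

default_clim_period = '????_????'  # default value listed above

# Only a string made of 4 characters, an underscore, and 4 characters matches
for candidate in ['1850_1899', '1850-1899', '185_1899']:
    print(candidate, fnmatchcase(candidate, default_clim_period))
```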

Example 1: on Curie, access a ‘time series’ dataset of the variable tas, providing values for all facets:

>>> dat1 = ds(project='IGCM_OUT',
              root='/ccc/store/cont003/dsm',
              login ='p86mart',
              model='IPSLCM6',
              experiment='piControl',
              status='DEVT',
              simulation='O1T09V04',
              DIR='ATM',
              OUT='Analyse',
              frequency='monthly',
              ave_length='MO',
              period='1850-1900',
              variable='tas'
              )

Note that the following request returns the same files (but takes more time):

>>> dat1 = ds(project='IGCM_OUT',
              model='IPSLCM6',
              simulation='O1T09V04',
              period='1850-1900',
              variable='tas'
              )

Example 2: on Curie, access a ‘SE_50Y’ dataset of the variable tas, providing values for all facets. Note that we set frequency to ‘seasonal’ (or ‘annual_cycle’), and specify clim_period and clim_period_length (either ‘_50Y’ or ‘_100Y’):

>>> dat2 = ds(project='IGCM_OUT',
              login ='p86mart',
              model='IPSLCM6',
              experiment='piControl',
              status='DEVT',
              simulation='O1T09V04',
              DIR='ATM',
              OUT='Analyse',
              frequency='seasonal',
              clim_period='1850_1899',
              clim_period_length='_50Y',
              variable='tas'
              )

The attributes ‘model’, ‘simulation’ and ‘clim_period’ can be used to define ensembles with eds().

Example 3: on Curie, define an ensemble with simulations ‘O1T09V01’, ‘O1T09V02’, ‘O1T09V03’:

>>> dat_ens = eds(project='IGCM_OUT',
                  model='IPSLCM6',
                  simulation=['O1T09V01','O1T09V02','O1T09V03'],
                  clim_period='1850_1859',
                  variable='tas'
                  )

Contact: jerome.servonnat@lsce.ipsl.fr

em

This module declares project em, based on data organization ‘generic’

EM (Experiment Manager) is a tool used at CNRM for moving simulation post-processed data from the HPSS to the local filesystem, and to organize it in a file hierarchy governed by a few configuration files

Simulation names (or ‘EXPIDs’) are assumed to be unique in the namespace defined by the user’s configuration file, which may include shared simulations

Specific facets are:
  • root : root directory for private data files as declared to EM
  • group : group of the simulation (as declared to ECLIS)
  • frequency : for now, only monthly is managed; it is the default
  • realm : to speed up data search, and to resolve ambiguities. Usable values are ‘A’, ‘Atmos’, ‘O’, ‘Ocean’, ‘I’, ‘SeaIce’, ‘L’, ‘Land’. Unfortunately, for now, you have to know whether your data is in a private directory (use e.g. ‘A’) or a shared one (use e.g. ‘Atmos’). Default is ‘*’ (costly).

Examples for defining an EM dataset:

>>> tas= ds(project='em', simulation='GSAGNS1', variable='tas', period='1975-1976', realm="(A|Atmos)")
>>> pr = ds(project='em', simulation="C1P60", group="SC", variable="pr", period="1850", realm="(O|Ocean)")

See other examples in examples/data_em.py

The location of ocean variables in the various grid_XX files matches the case with : T_table_2.2, T_table_2.5, T_table_2.7, U_table_2.3, U_table_2.8, W_table2.3 … Other cases should be described by another ‘project’

WARNING REGARDING OCEAN DATA : for a number of old simulations, there is an issue with the name of time coordinates, which leads to some nav_lat/nav_lon coordinates being discarded during CDO processing. You can tell CliMAF to deal with that automatically, at the expense of computing time, by setting and exporting the environment variable CLIMAF_FIX_NEMO_TIME to any value except ‘no’, ‘0’ and ‘None’ BEFORE launching CliMAF. What CliMAF does in that case is shown in ../scripts/mcdo.py (see function nemo_timefix())
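
The rule on accepted values can be sketched as below. This is not the code CliMAF uses: the function name is hypothetical, and treating an unset variable as "fix not requested" is an assumption.

```python
import os

# Values which, according to the text, do NOT activate the fix
DISABLING_VALUES = ('no', '0', 'None')


def nemo_time_fix_requested(environ=os.environ):
    """Return True if CLIMAF_FIX_NEMO_TIME is set to an enabling value.

    Assumption: an unset variable means the fix is not requested.
    """
    value = environ.get('CLIMAF_FIX_NEMO_TIME')
    return value is not None and value not in DISABLING_VALUES


# Plain dicts stand in for the process environment
print(nemo_time_fix_requested({'CLIMAF_FIX_NEMO_TIME': 'yes'}))
print(nemo_time_fix_requested({'CLIMAF_FIX_NEMO_TIME': '0'}))
print(nemo_time_fix_requested({}))
```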

A number of Seaice fields are duly described with 1.e+20 as missing value (which is ill described in data files); see code for details

example

This module declares project example and its data location for the standard CliMAF distro

Only one additional attribute: frequency (but the data sample actually includes only frequency=‘monthly’)

Example of an ‘example’ dataset definition

>>> dg=ds(project='example', simulation='AMIPV6ALB2G', variable='tas', period='1980-1981', frequency='monthly')

erai

This module declares ERA Interim data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

Also declares how to derive CMIP5 variables from the original ERAI variables set (aliasing)

Attributes are ‘grid’, and ‘frequency’

Various grids are available. The original grid is written grid='_'. Other grids are written e.g. grid='T42' or grid='T127'

Example of an ‘erai’ project dataset declaration

>>> cdef('project','erai')
>>> d=ds(variable='tas',period='198001',grid='_', frequency='monthly')
>>> d2=ds(variable='tas',period='198001',grid='T42',frequency='daily')

erai-land

This module declares ERA Interim land data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

Also declares how to derive CMIP5 variables from the original ERAI-land variables set

Attribute is ‘grid’

Various grids are available. The original grid is written grid='_'. Other grids are written e.g. grid='T127'

Most variables for ERAI-LAND have no CMIP5 counterpart : only CMIP5 ‘snd’ is aliased to ERAI-LAND ‘sd’; see doc for the other, original, ERAI-LAND variables

Example of an ‘erai-land’ project dataset declaration

>>> cdef('project','erai-land')
>>> d=ds(variable='snd',period='198001',grid='_')
>>> d2=ds(variable='snd',period='198001',grid='T127')

ceres

This module declares CERES data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

No attributes in addition to standard ones; and ‘simulation’ is not used

Version of dataset is implicitly the latest, through symbolic links managed by Sophie. Please complain to climaf at cnrm dot fr if this does not fit the needs

Example of a ‘ceres’ project dataset declaration

>>> d = ds(project='ceres', variable='rlds', period='198001', domain=[40.,60.,-10.,+20.])

cruts3

This module declares CRUTS3 data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

Also declares how to derive CMIP5 variables from the original CRUTS3 variables set

Attribute is ‘grid’

Various grids are available. The original grid is written grid=''. Other grids are written e.g. grid='T127'

Example of a ‘cruts3’ project dataset declaration

>>> cdef('project','cruts3')
>>> d=ds(variable='tas',period='198001',grid='')
>>> d2=ds(variable='tas',period='198001',grid='T127')

gpcc

This module declares GPCC data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

Also declares how to derive CMIP5 variables from the original GPCC variables set

Attribute is ‘grid’

Various grids are available. Grids are written e.g. grid='05d', grid='1d' and grid='T127'

Example of a ‘gpcc’ project dataset declaration

>>> cdef('project','gpcc')
>>> d=ds(variable='pr',period='198001',grid='05d')
>>> d2=ds(variable='pr',period='198001',grid='1d')
>>> d3=ds(variable='pr',period='198001',grid='T127')

gpcp

This module declares GPCP data organization and specifics, as managed by Sophie T. at CNRM; see file:///cnrm/amacs/DATA/OBS/netcdf/

Also declares how to derive CMIP5 variables from the original GPCP variables set (aliasing/scaling)

Attributes are ‘grid’, and ‘frequency’.

Various grids are available. Grids are written e.g. grid='1d', grid='2.5d', grid='T42' and grid='T127'

Only two variables are available: the original ‘precip’ (mm/day) and ‘pr’ (kg m-2 s-1)

Example of a ‘gpcp’ project dataset declaration

>>> cdef('project','gpcp')
>>> d=ds(variable='pr',period='198001',grid='2.5d', frequency='monthly')
>>> d2=ds(variable='pr',period='198001',grid='1d',frequency='daily')

obs4mips

This module declares locations for searching data for project OBS4MIPS at CNRM (VDR), for all frequencies; see file:///cnrm/amacs/DATA/Obs4MIPs/doc/

Additional attribute for OBS4MIPS datasets : ‘frequency’

Example for an OBS4MIPS CMIP5 dataset declaration

>>> pr_obs=ds(project='OBS4MIPS', variable='pr', simulation='GPCP-SG', frequency='monthly', period='1979-1980')

cami

This module declares how to access observation datasets organized ‘a la CAMI’ at CNRM, at /cnrm/est/COMMON/cami/V1.8/climlinks/

Example

>>> pr_gpcp=ds(project='CAMIOBS', simulation='GPCP2.5d', variable='pr', period='1979-1980')

optimize

Optimize searching datasets files when some facets are shell-like wildcards (i.e. include * or ?)

For now limited to project CMIP6 and active only if env.environment.optimize_cmip6_wildcards is True (which is the default). See doc for cmip6_optimize_wildcards()

climaf.projects.optimize.clear_tables(pattern=None)

Clear all search optimization tables that include a given pattern (e.g. ‘CMIP6’), or all tables if no pattern is given

In order to identify the pattern for a given table :

  • tables are stored in your CliMAF cache (whose name is displayed at the beginning of your session)
  • table names are self-explanatory; e.g. ‘CMIP6_mip_experiment_model2realization_7367d567.json’ stands for the table which allows deriving the list of realizations from the values of mip, experiment and model. The last part is a hash code for the root directory of the CMIP6 data

climaf.projects.optimize.cmip6_facets(path, root, *fields)

Returns a tuple of facets values for the ranks in FIELDS, derived from PATH after removing prefix ROOT and assuming that the path matches CMIP6 DRS, at least up to the max depth in FIELDS

Example:

>>> institute, model, experiment, realization, table, variable = cmip6_facets(path, root, 2, 3, 4, 5, 6, 7)
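
The rank convention used in this example can be illustrated with a small stand-alone sketch. It is a hypothetical stand-in, not the CliMAF implementation: it assumes that rank 0 is the first path segment after ROOT (here ‘CMIP6’), and the root directory and institute name are made up for the demonstration.

```python
import posixpath


def facets_at_ranks(path, root, *fields):
    """Return the path segments found at the given ranks, relative to ROOT.

    Hypothetical stand-in for illustration; rank 0 is the first segment after ROOT.
    """
    relative = posixpath.relpath(path, root)
    segments = relative.split('/')
    return tuple(segments[rank] for rank in fields)


root = '/data'  # made-up root directory
path = '/data/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/1pctCO2/r3i1p1f2/Amon/tas'
institute, model, experiment, realization, table, variable = facets_at_ranks(
    path, root, 2, 3, 4, 5, 6, 7)
print(institute, model, experiment, realization, table, variable)
```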

climaf.projects.optimize.cmip6_optimize_check_paths(paths)

Check that paths patterns in PATHS fit (at least some of) the requirements for optimizing data search

climaf.projects.optimize.cmip6_optimize_wildcards(kwargs)

Optimizes CMIP6 data search by analyzing CMIP6 keyword values in KWARGS, and replacing some patterns using * or ? with the list of their possible values, obtained by querying the file system

It is automatically activated and used when env.environment.optimize_cmip6_wildcards is True.

It assumes that all CMIP6 data are organized using CMIP6 canonical DRS with a pattern like : ${root}/CMIP6/${mip}/${institute}/${model}/${experiment}/${realization}/${table}/

It uses tables which are built automatically, stored in the CliMAF cache, and can be refreshed by clearing them. See clear_tables()

First principle is to focus on facets which are high in the DRS hierarchy, and so in the directories hierarchy. For such facets, in order to speed-up search when the facet value includes a wildcard, and when another facet allows to reduce significantly the number of values of the wildcard facet, we build a look-up table.

This is for instance the case in CMIP6 when facet ‘mip’ is * and facet ‘experiment’ is known. Or when ‘institute’ is * and ‘model’ is known.

Next principle is to build a list of valid paths segment after segment, by testing which wildcard segment values among the possible ones actually lead to an existing path. For some cases, when there is no way to guess a limited list of values (as e.g. for ‘version’), glob.glob is used

Keyword PERIOD is not processed at that level.

Returns a list of non-wildcard KWARGS which match actually existing leaf directories, and which is used in later search (see selectFiles())

climaf.projects.optimize.cmip6_path2dict(path, root)

Returns a dict of facet/value pairs derived from PATH after removing prefix ROOT and assuming that the path matches CMIP6 DRS

climaf.projects.optimize.dirnames_for_one_case(case_name, glob_pattern, split_index, case_value, key_index=-1, reset=False, value_pattern=None, root=None)

Returns the ensemble of directories which have files matching a given GLOB_PATTERN, which is supposed to end with a “/*” which corresponds to CASE_VALUE. The directory names are extracted from glob() return at a hierarchy level indicated by SPLIT_INDEX.

Method: uses an entry CASE_NAME in a global lookup table. Tries to read it from file if not present. If that fails, builds (stores, and writes) it by globbing according to the pattern

If arg RESET is True, performs the globbing anyway and rewrites the table on disk

See examples of use in cmip6_optimize_wildcards()

climaf.projects.optimize.listdirs(parent, pattern, test_exists=False)

List directories which may actually exist, by complementing path PARENT with a single level of sub-directories, which match PATTERN

If the pattern includes no wildcard, simply complement with PATTERN, and test existence only if TEST_EXISTS is True

Otherwise, use glob.glob to find existing sub-directories
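
A minimal sketch of that behaviour, under stated assumptions: the function below is a hypothetical re-creation written for this illustration (only * and ? are treated as wildcards, and returning an empty list for a missing directory is an assumption), not the CliMAF source.

```python
import glob
import os
import tempfile


def list_subdirs(parent, pattern, test_exists=False):
    """Complement PARENT with one level of sub-directories matching PATTERN."""
    candidate = os.path.join(parent, pattern)
    if '*' not in pattern and '?' not in pattern:
        # No wildcard: trust the name unless asked to check it
        if test_exists and not os.path.isdir(candidate):
            return []
        return [candidate]
    # Wildcard: ask the file system which sub-directories exist
    return sorted(d for d in glob.glob(candidate) if os.path.isdir(d))


with tempfile.TemporaryDirectory() as parent:
    for name in ('Amon', 'Omon', 'day'):
        os.mkdir(os.path.join(parent, name))
    monthly = [os.path.basename(d) for d in list_subdirs(parent, '?mon')]
    unchecked = list_subdirs(parent, 'fx')
    checked = list_subdirs(parent, 'fx', test_exists=True)
    print(monthly, len(unchecked), checked)
```

Skipping the existence test for wildcard-free patterns avoids one file system access per segment, which matters when many candidate paths are built segment after segment.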

climaf.projects.optimize.possible_values(project, tag, root, key, value_pattern)

For a given PROJECT, returns the list of possible values for a facet (here called the value facet) given the value (KEY) of another facet (here called the key facet). Returns only values that match VALUE_PATTERN. Returns [] if none is found

If VALUE_PATTERN has no wildcard, just return it as result (in a list)

Values are searched based on additional information TAG, which carries two pieces of information : which facet has its value (KEY) provided, and which facet is the one whose values are searched

Current implementation is based on globbing and uses TAG to derive three items :

  • pattern to use for globbing the filesystem
  • index of the value facet in the file hierarchy matching the pattern
  • index (or indices) of the key facet(s) in the file hierarchy matching the pattern

It then calls function dirnames_for_one_case which implements the globbing, and which caches its results in a json file.