How to handle CliMAF operators that concatenate data over time

SS - Feb 2020

This kind of operator needs a special declaration

Let us start with an empty cache and two small datasets (which are included in the CliMAF distribution)

In [1]:
from climaf.api import *
craz()
LC debug : False
CliMAF install => /home/stephane/tout/stejase/technique/climaf
python => /home/stephane/miniconda2/envs/py27/bin/python
---
Required softwares to run CliMAF => you are using the following versions/installations:
ncl 6.6.2 => /home/stephane/miniconda2/envs/py27/bin/ncl
cdo 1.9.6 => /home/stephane/miniconda2/envs/py27/bin/cdo
nco (ncks) 4.8.1 => /home/stephane/miniconda2/envs/py27/bin/ncks
ncdump 4.6.2 => /home/stephane/miniconda2/envs/py27/bin/ncdump
Check stamping requirements
nco (ncatted) found -> /home/stephane/miniconda2/envs/py27/bin/ncatted
convert found -> /usr/bin/convert
pdftk found -> /usr/bin/pdftk
exiv2 not available, can not stamp eps files
At least one stamping requirement is not fulfilled, turn it to None.
---
CliMAF version = 1.2.13
Cache directory set to : ~/tmp/climaf_cache (use $CLIMAF_CACHE if set) 
Cache directory for remote data set to : ~/tmp/climaf_cache/remote_data (use $CLIMAF_REMOTE_CACHE if set) 
error    : 'defining curl_tau_atm : command ferret is not executable'
warning  : Binary cdftools not found. Some operators won't work
Available macros read from ~/.climaf.macros are : []
In [2]:
tas1=ds(project="example", simulation="AMIPV6ALB2G", variable="tas", period="1980")
tas2=ds(project="example", simulation="AMIPV6ALB2G", variable="tas", period="1981")

Define an operator for time-concatenating datasets

In [3]:
cscript('cat', "ncrcat ${in_1} ${in_2} ${out}")
tas=cat(tas1,tas2)
tas
Out[3]:
cat(ds('example|AMIPV6ALB2G|tas|1980|global|monthly'),ds('example|AMIPV6ALB2G|tas|1981|global|monthly'))

Let CliMAF actually build a file representing the dataset, and check that its time length is correct

In [4]:
f=cfile(tas)
! ncdump -v time {f} | grep "time = .*currently"
	time = UNLIMITED ; // (24 currently)
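
The same check can be scripted in Python instead of relying on grep. The following is only a minimal sketch: `time_length` is a hypothetical helper (not part of CliMAF) that parses header text as printed by ncdump and returns the current size of the unlimited time dimension. The sample string simply reproduces the output line shown above.

```python
import re

def time_length(ncdump_text):
    """Return the current length of the unlimited 'time' dimension
    found in ncdump header text, or None if no such line is present."""
    match = re.search(r"time = UNLIMITED ; // \((\d+) currently\)", ncdump_text)
    return int(match.group(1)) if match else None

# Sample line copied from the ncdump output above
sample = "\ttime = UNLIMITED ; // (24 currently)\n"
n_steps = time_length(sample)
```

In an IPython session, the header could be captured with `header = !ncdump -h {f}` and passed as `time_length("\n".join(header))`.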

Check which data have been generated in cache

In [5]:
cls()
Content of CliMAF cache
Out[5]:
["ds('example|AMIPV6ALB2G|tas|1980|global|monthly')",
 "cat(ds('example|AMIPV6ALB2G|tas|1980|global|monthly'),ds('example|AMIPV6ALB2G|tas|1981|global|monthly'))",
 "ds('example|AMIPV6ALB2G|tas|1981|global|monthly')"]

CliMAF has copied each component dataset into the cache! This is quite costly, but can be fixed this way:

In [6]:
cdrop(tas1)
cdrop(tas2)
cls()
Content of CliMAF cache
Out[6]:
["cat(ds('example|AMIPV6ALB2G|tas|1980|global|monthly'),ds('example|AMIPV6ALB2G|tas|1981|global|monthly'))"]
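
With more than a couple of inputs, spotting the component copies by eye becomes tedious. As a rough sketch, the list of strings displayed by cls() can be filtered to keep only the entries that are plain datasets. `plain_dataset_entries` is a hypothetical helper, not a CliMAF function, and the listing below is copied from the first cache listing above.

```python
def plain_dataset_entries(cache_listing):
    """Return the cache entries that are plain datasets, i.e. whose
    string representation starts with 'ds(' rather than an operator name."""
    return [crs for crs in cache_listing if crs.startswith("ds(")]

# Cache content as displayed right after building the concatenated file
listing = [
    "ds('example|AMIPV6ALB2G|tas|1980|global|monthly')",
    "cat(ds('example|AMIPV6ALB2G|tas|1980|global|monthly'),"
    "ds('example|AMIPV6ALB2G|tas|1981|global|monthly'))",
    "ds('example|AMIPV6ALB2G|tas|1981|global|monthly')",
]
copies = plain_dataset_entries(listing)
```

Here `copies` holds the two component datasets, which are the entries removed with cdrop() above.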

An alternative is to pass the argument select=False when declaring the cat script

In [7]:
cscript('cat_dont_select', "ncrcat ${in_1} ${in_2} ${out}", select=False)
tas_dont_select=cat_dont_select(tas1,tas2)
craz()
g=cfile(tas_dont_select)
In [8]:
cls()
Content of CliMAF cache
Out[8]:
["cat_dont_select(ds('example|AMIPV6ALB2G|tas|1980|global|monthly'),ds('example|AMIPV6ALB2G|tas|1981|global|monthly'))"]

That's better, but only for this use case, where the component datasets did not need any selection on variable, date or domain, nor any aliasing or setting of missing values

Now, let us check what CliMAF knows about this kind of dataset

In [9]:
from climaf.driver import timePeriod
print timePeriod(tas)
1980

It only knows the time period of the first component dataset!

This is an issue, because other operators, such as the standard ones, are declared as needing a selection on time period, which CliMAF applies automatically:

In [10]:
avg=ccdo(tas,operator='fldavg')
avg_file=cfile(avg)
! ncdump -v time {avg_file} | grep "time = .*currently"
	time = UNLIMITED ; // (12 currently)
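
The truncation to 12 time steps can be pictured with a toy model. This is only an illustration in plain Python, not CliMAF's actual selection code: `select_period` is a hypothetical function that keeps the monthly steps falling inside a period given as 'YYYY' or 'YYYY-YYYY'.

```python
def select_period(time_steps, period):
    """Keep only the (year, month) steps whose year lies within 'period',
    given as 'YYYY' or 'YYYY-YYYY'."""
    years = [int(y) for y in period.split("-")]
    first, last = years[0], years[-1]
    return [(y, m) for (y, m) in time_steps if first <= y <= last]

# 24 monthly steps, as in the concatenated 1980 + 1981 file
steps = [(year, month) for year in (1980, 1981) for month in range(1, 13)]

# Selecting with the period of the first input only drops half of the data
truncated = select_period(steps, "1980")
# Selecting with a period covering both inputs keeps everything
complete = select_period(steps, "1980-1981")
```

In this model `truncated` has 12 steps and `complete` has 24, which mirrors the behaviour observed with the real operator.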

This can be fixed by declaring that the operator does concatenate data over time

In [12]:
cscript('cat_dont_select_do_cat_time', "ncrcat ${in_1} ${in_2} ${out}", select=False, doCatTime=True)
tas=cat_dont_select_do_cat_time(tas1,tas2)
print timePeriod(tas)
1980-1981
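
The period reported here is the span of the input periods. A minimal sketch of such a merge is given below, assuming periods written as 'YYYY' or 'YYYY-YYYY'. `merge_periods` is a hypothetical name and this is not CliMAF's implementation. In particular, it does not check that the inputs are contiguous.

```python
def merge_periods(periods):
    """Return the period spanning all given periods, each written
    as 'YYYY' or 'YYYY-YYYY'. Gaps between inputs are not detected."""
    years = []
    for period in periods:
        years.extend(int(y) for y in period.split("-"))
    first, last = min(years), max(years)
    return str(first) if first == last else "%d-%d" % (first, last)

merged = merge_periods(["1980", "1981"])
```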

Applying a further operator that potentially selects on time now works correctly

In [13]:
avg=ccdo(tas,operator='fldavg')
avg_file=cfile(avg)
! ncdump -v time {avg_file} | grep "time = .*currently"
	time = UNLIMITED ; // (24 currently)

Summary: when declaring a CliMAF operator using cscript()

- argument select=False tells CliMAF to skip its smart handling of period, domain, aliasing, etc. when selecting/extracting data from files. If applicable to your use case, this can save disk space in the cache. But it is a permanent property of the operator

- if the operator actually concatenates data files over time, this must be declared with doCatTime=True so that operators chained on top of it work correctly