ssb_timeseries.dataset¶
The ssb_timeseries.dataset module and its Dataset class are the very core of the ssb_timeseries package, defining most of the key functionality.
The dataset is the unit of analysis for both the information model and workflow integration, and performance benefits from linear algebra with sets as matrices consisting of series column vectors.
As described in the Information model, time series datasets may consist of any number of series of the same SeriesType.
The series types are defined by dimensionality characteristics:
Versioning (NONE, AS_OF, NAMED)
Temporality (valid AT a point in time, or FROM and TO for a duration)
The type of the value. For now only scalar values are supported.
Additional type determinants (sparsity, irregular frequencies, non-numeric or non-scalar values, …) are conceivable and may be introduced later. The types are crucial because they are reflected in the physical storage structure. That in turn has practical implications for how the series can be interacted with, and for methods working on the data.
See also
The ssb_timeseries.catalog module, for tools to search for datasets or series by name or metadata.
- class Dataset(name, data_type=None, as_of_tz=None, repository='', load_data=True, **kwargs)¶
Bases:
object
Datasets are containers for series of the same SeriesType with origin from the same process. That generally implies some common denominator in terms of descriptive metadata, but more importantly, it allows the Dataset to become a core unit of analysis for workflow. It becomes a natural chunk of data for reads, writes and calculations.
For all the series in a dataset to be of the same SeriesType means they share the dimensionality characteristics Versioning and Temporality, and any other schema information that has technical implications for how the data is handled. See the Information model documentation for more about that.
The descriptive commonality is not enforced, but some aspects have technical implications. In particular, it is strongly encouraged to make sure that the resolutions of the series in a dataset are the same, and to minimize the number of gaps in the series. Sparse data is a strong indication that a dataset is not well defined and that the series in the set have different origins. ‘Gaps’ in this context means any representation of undefined values: None, null, NaN or “not a number” values, as opposed to the number zero. The number zero is a gray area - it can be perfectly valid, but can also be an indication that not all the series should be part of the same set.
- Variables:
name (str) – The name of the set.
data_type (SeriesType) – The type of the contents of the set.
as_of_tz (datetime) – The version datetime, if applicable to the data_type.
data (Dataframe) – A dataframe or table structure with one or more datetime columns defined by data_type and a column per series in the set.
tags (dict) – A dictionary with metadata describing both the dataset itself and the series in the set.
- Parameters:
name (str)
data_type (SeriesType)
as_of_tz (datetime)
repository (str)
load_data (bool)
kwargs (Any)
Maintaining tags
There are several ways to maintain metadata (tags). See the tagging guide for detailed information.
Tagging functions
Automatic tagging with series_names_to_tags() is convenient when series names are constructed from metadata parts with a uniform pattern. Then tags may be derived from series names by mapping name parts to attributes, either by splitting on a separator or by regex.
Manually tagging a dataset with tag_dataset() will tag the set and propagate tags to all series in the set, while tag_series() may be used to tag individual series. If corrections need to be made, tags can be replaced with replace_tags() or removed with detag_dataset() and detag_series().
- __add__(other)¶
Add two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __eq__(other)¶
Check equality of two datasets or a dataset and a dataframe, numpy array or scalar.
- Parameters:
other (Self | DataFrame | Series | int | float)
- Return type:
Any
- __floordiv__(other)¶
Floor divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __getitem__(criteria='', **kwargs)¶
Access Dataset.data.columns via Dataset[list[column_names] | pattern | tags].
- Parameters:
criteria – Either a string pattern or a dict of tags.
kwargs – If criteria is empty, this is passed to filter().
- Returns:
Self | None
- Raises:
TypeError – If filter() returns another type than Dataset.
- Return type:
Self | None
- __gt__(other)¶
Check greater than for two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __init__(name, data_type=None, as_of_tz=None, repository='', load_data=True, **kwargs)¶
Initialising a dataset object either retrieves an existing set or prepares a new one.
When preparing a new set, data_type must be specified. If data_type versioning is specified as AS_OF, a datetime with timezone should be provided. Providing an AS_OF date has no effect if versioning is NONE. If none is provided but data is passed, as_of_tz() defaults to the current time. For all dates, if no timezone is provided, CET is assumed.
The data parameter accepts a dataframe with one or more date columns and one column per series. Initially only Pandas was supported, but this dependency is about to be relaxed to include other implementations of the same data structure. Beyond Polars and Pyarrow, notable options under consideration include Ibis, DuckDB and Narwhals.
Data is kept in memory and not stored before an explicit call to save(). The data is stored in parquet files, with JSON formatted metadata in the header.
Metadata will always be read if the set exists. When loading existing sets, load_data = False will suppress reading large amounts of data. For data_types with AS_OF versioning, not providing the AS_OF date will have the same effect.
If series names can be mapped to metadata, the keyword arguments attributes, separator and regex will, if provided, be passed through to series_names_to_tags. If series names are not easily translated to tags, tag_dataset and tag_series and their siblings retag and detag can be used for manual metadata maintenance.
- Keyword Arguments:
attributes (list[str]) – Attribute names for use with series_names_to_tags in combination with either separator or regex.
separator (str) – Character(s) separating attributes, for use with series_names_to_tags.
regex (str) – Regular expression with capture groups corresponding to attributes. Used instead of the separator to match more complicated name patterns in series_names_to_tags.
- Parameters:
name (str)
data_type (SeriesType)
as_of_tz (datetime)
repository (str)
load_data (bool)
kwargs (Any)
- Return type:
None
Examples
import ssb_timeseries as ts

df = ts.sample_data.xyz_at()
print(df)
x = ts.dataset.Dataset(
    name='mydataset',
    data_type=ts.properties.SeriesType.simple(),
    data=df,
)
- __lt__(other)¶
Check less than for two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __mod__(other)¶
Modulo of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __mul__(other)¶
Multiply two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __pow__(other)¶
Power of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __radd__(other)¶
Right add two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __repr__()¶
Returns a machine readable string representation of Dataset, ideally sufficient to recreate object.
- Return type:
str
- __rfloordiv__(other)¶
Right floor divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rmod__(other)¶
Right modulo of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rmul__(other)¶
Right multiply two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rpow__(other)¶
Right power of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rsub__(other)¶
Right subtract two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rtruediv__(other)¶
Right divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __str__()¶
Returns a human readable string representation of the Dataset.
- Return type:
str
- __sub__(other)¶
Subtract two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __truediv__(other)¶
Divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
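Taken together, these operators make a Dataset behave like a matrix of series column vectors. A minimal plain-Python sketch of the dispatch described above; the function name apply_elementwise and the 'valid_at' column are assumptions for illustration, not the library's internals:

```python
from datetime import date

def apply_elementwise(data: dict, other, op):
    """Apply op to numeric series columns; datetime columns pass through."""
    result = {}
    for col, values in data.items():
        if col == "valid_at":  # datetime columns are preserved as-is
            result[col] = values
        elif isinstance(other, (int, float)):  # dataset-to-scalar
            result[col] = [op(v, other) for v in values]
        else:  # dataset-to-dataset/dataframe: align by column name
            result[col] = [op(v, w) for v, w in zip(values, other[col])]
    return result

data = {"valid_at": [date(2024, 1, 1), date(2024, 2, 1)], "x": [1.0, 2.0]}
doubled = apply_elementwise(data, 2, lambda a, b: a * b)
```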
- aggregate(attributes, taxonomies, functions, sep='_')¶
Aggregate dataset by taxonomy hierarchies.
- Parameters:
attributes (list[str]) – The attributes to aggregate by.
taxonomies (list[int | meta.Taxonomy | dict[str, str] | PathStr]) – Value definitions for attributes. Can be either meta.Taxonomy objects, or klass_ids, data dictionaries or paths that can be used to retrieve or construct them.
functions (list[str|F] | set[str|F]) – Optional function name (or list of function names) to apply (mean | count | sum | …). Defaults to sum.
sep (str) – Optional separator used when joining multiple attributes into names of aggregated series. Defaults to ‘_’.
- Returns:
A dataset object with the aggregated data. If the taxonomy object has a hierarchical structure, aggregate series are calculated for parent nodes at all levels. If the taxonomy is a flat list, only a single total aggregate series is calculated.
- Return type:
Self
- Raises:
TypeError – If any of the taxonomy identifiers are of unexpected types.
Examples
To calculate the 10th and 90th percentiles and the median for a dataset where codes from KLASS 157 (energy_balance) distinguish between series in the set:
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>> from ssb_timeseries.meta import Taxonomy
>>>
>>> klass157 = Taxonomy(klass_id=157)
>>> klass157_leaves = [n.name for n in klass157.structure.root.leaves]
>>> tag_permutation_space = {"A": klass157_leaves, "B": ["q"], "C": ["z"]}
>>> series_names: list[list[str]] = [value for value in tag_permutation_space.values()]
>>> sample_df = create_df(*series_names, start_date="2024-01-01", end_date="2024-12-31", freq="MS")
>>> sample_set = Dataset(name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=sample_df,
>>>     attributes=["A", "B", "C"],
>>> )
>>>
>>> def perc10(x):
>>>     return x.quantile(.1, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> def perc90(x):
>>>     return x.quantile(.9, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> percentiles = sample_set.aggregate(["energy_balance"], [157], [perc10, 'median', perc90])
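The hierarchy roll-up itself can be illustrated with a small stand-alone sketch: parent nodes are the per-period sums of their children. The hierarchy and leaves structures below are made up for illustration; the real method derives them from a KLASS taxonomy.

```python
# Hard-coded toy hierarchy (illustration only; the real method resolves
# this from a meta.Taxonomy object or a klass_id).
hierarchy = {"total": ["total.a", "total.b"], "total.a": ["x", "y"], "total.b": ["z"]}
leaves = {"x": [1, 2], "y": [10, 20], "z": [100, 200]}

def aggregate_node(code):
    if code in leaves:  # leaf series: return its values unchanged
        return leaves[code]
    children = [aggregate_node(child) for child in hierarchy[code]]
    return [sum(vals) for vals in zip(*children)]  # per-period sum of children

totals = aggregate_node("total")
```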
- all()¶
Check if all values in series columns evaluate to true.
- Return type:
bool
- any()¶
Check if any values in series columns evaluate to true.
- Return type:
bool
- copy(new_name, **kwargs)¶
Create a copy of the Dataset.
The copy needs to get a new name, but unless other information is specified, it will be created with the same data_type, as_of_tz, data, and tags.
- Return type:
Self
- Parameters:
new_name (str)
kwargs (Any)
- datetime_columns(*comparisons)¶
Get names of datetime columns (valid_at, valid_from, valid_to).
- Parameters:
*comparisons (Self | pd.DataFrame) – Objects to compare with. If provided, returns the intersection of self and all comparisons.
- Returns:
The (common) datetime column names of self (and comparisons).
- Return type:
list[str]
- Raises:
ValueError – If comparisons are not of type Self or pd.DataFrame.
- default_tags()¶
Return default tags for set and series.
- Return type:
dict[str, dict[str, str | list[str]] | dict[str, dict[str, str | list[str]]]]
- detag_dataset(*args, **kwargs)¶
Detag selected attributes of the set.
Tags to be removed may be provided as a list of attribute names or as kwargs with attribute-value pairs.
- Parameters:
args (str)
kwargs (Any)
- Return type:
None
- detag_series(*args, **kwargs)¶
Detag selected attributes of series in the set.
Tags to be removed may be specified by args or kwargs. Attributes listed in args will be removed from all series.
For kwargs, attributes will be removed from the series if the value matches exactly. If the value is a list, the matching value is removed. If kwargs contain all=True, all attributes except defaults are removed.
- Parameters:
args (str)
kwargs (Any)
- Return type:
None
- filter(pattern='', tags=None, regex='', output='dataset', new_name='', **kwargs)¶
Filter dataset.data by textual pattern, regex or metadata tag dictionary, or a combination.
- Parameters:
pattern (str) – Text pattern for ‘like’ search in column names. Defaults to ‘’.
regex (str) – Expression for regex search in column names. Defaults to ‘’.
tags (dict) – Dictionary with tags to search for. Defaults to None. All tags in the dict must be satisfied for the same series (tags are combined by AND). If a list of values is provided for a tag, the criteria is satisfied by either of them (OR). Support for list(dict) is planned, but not yet implemented, to satisfy alternative sets of criteria (the dicts will be combined by OR).
output (str) – Output type - dataset or dataframe. Defaults to ‘dataset’. Short forms ‘df’ or ‘ds’ are accepted.
new_name (str) – Name of the new Dataset. If not provided, a new name is generated.
**kwargs – If provided, passed to the init of the new set.
- Returns:
By default a new Dataset (a deep copy of self). If output=”dataframe” or “df”, a dataframe. TODO: Explore shallow copy / nocopy options.
- Return type:
Dataset | Dataframe
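The three criteria can be pictured with a small stand-alone sketch. The helper names below are made up for illustration; only the matching semantics mirror the description above:

```python
import re

# Toy column names and per-series tag dictionaries (illustration only).
columns = ["x_a", "y_b", "z_a"]
series_tags = {"x_a": {"ABC": "a"}, "y_b": {"ABC": "b"}, "z_a": {"ABC": "a"}}

def match_pattern(pattern):  # 'like' search: substring match in column names
    return [c for c in columns if pattern in c]

def match_regex(expr):  # regex search in column names
    return [c for c in columns if re.search(expr, c)]

def match_tags(tags):  # all tags must match (AND); list values count as OR
    def ok(col):
        return all(
            series_tags[col].get(attr) in (val if isinstance(val, list) else [val])
            for attr, val in tags.items()
        )
    return [c for c in columns if ok(c)]
```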
- groupby(freq, func='auto', *args, **kwargs)¶
Group dataset data by specified frequency and function.
Returns a new Dataset.
- Return type:
Self
- Parameters:
freq (str)
func (str)
args (Any)
kwargs (Any)
- math(other, func)¶
Generic helper making math functions work on the numeric, non-date columns: dataframe to dataframe, matrix to matrix, matrix to vector and matrix to scalar.
Although the purpose was to limit “boilerplate” for core linear algebra functions, it also extends to other operations that follow the same differentiation pattern.
- Parameters:
other (dataframe | series | matrix | vector | scalar) – One (or more?) pandas (polars to come) dataframe or series, numpy matrix or vector, or a scalar value.
func (_type_) – The function to be applied as self.func(**other:Self) or (in some cases) with infix notation self f other. Note that one or more date columns of the self / lefthand side argument are preserved, ie data shifting operations are not supported.
- Raises:
ValueError – “Unsupported operand type”
ValueError – “Incompatible shapes.”
- Returns:
Depending on the inputs: A new dataset / vector / scalar with the result. For datasets, the name of the new set is derived from inputs and the functions applied.
- Return type:
Any
- moving_average(start=0, stop=0, nan_rows='return')¶
Returns a new Dataset with moving averages for all series.
The average is calculated over a time window defined by the start and stop period offsets. Negative values denote periods before the current one, positive values after. Both default to 0, ie the current period, so at least one of them should be used.
- Return type:
Self
- Parameters:
start (int)
stop (int)
nan_rows (str)
>>> x.moving_average(start=-3, stop=-1)  # xdoctest: +SKIP
signifies the average over the three periods before (not including) the current one.
The offset parameters will overflow the date range at the beginning and/or end, where moving averages cannot be calculated. Set the parameter nan_rows to control the behaviour in such cases: ‘return’ to return rows with all NaN values (default), or ‘remove’ to remove these rows from both ends.
TO DO: Add parameter to choose returned time window? TO DO: Add more NaN handling options? TO DO: Add parameter to ensure/alter sampling frequency before calculating.
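The window semantics can be sketched in plain Python; this is an illustration under the stated assumptions, not the actual implementation. With start=-3 and stop=-1, each output value is the mean of the three preceding periods, and windows that overflow the date range yield NaN, as with nan_rows='return':

```python
def moving_average(values, start, stop):
    """Mean over the window [i+start, i+stop] for each position i."""
    out = []
    for i in range(len(values)):
        lo, hi = i + start, i + stop
        if lo < 0 or hi >= len(values):  # window overflows the date range
            out.append(float("nan"))     # nan_rows='return' behaviour
        else:
            window = values[lo : hi + 1]
            out.append(sum(window) / len(window))
    return out

ma = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], start=-3, stop=-1)
```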
- numeric_columns()¶
Get names of all numeric series columns (ie columns that are not datetime).
- Return type:
list[str]
- plot(*args, **kwargs)¶
Plot dataset data.
Convenience wrapper around Dataframe.plot() with sensible defaults.
- Return type:
Any
- Parameters:
args (Any)
kwargs (Any)
- rename(new_name)¶
Rename the Dataset.
For use by .copy, and on very rare other occasions. Does not move or rename any previously stored data.
- Return type:
None
- Parameters:
new_name (str)
- replace_tags(*args)¶
Retag selected attributes of series in the set.
The tags to be replaced and their replacements should be specified in tuple(s) of tag dictionaries; each argument in *args should be on the form ({<old_tags>}, {<new_tags>}).
- Return type:
None
- Parameters:
args (tuple[dict[str, str | list[str]], dict[str, str | list[str]]])
Both old and new TagDict can contain multiple tags. Each tuple is evaluated independently for each series in the set.
If the tag dict to be replaced contains multiple tags, all must match for tags to be replaced.
If the new tag dict contains multiple tags, all are added where there is a match.
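A sketch of that matching rule with a hypothetical helper (not the library's code): all old tags must match before they are removed and the new tags added.

```python
def replace_tags_in(series_tags, old, new):
    """For each series: if every (attribute, value) in old matches, swap in new."""
    for tags in series_tags.values():
        if all(tags.get(attr) == value for attr, value in old.items()):
            for attr in old:       # all old tags matched: remove them...
                tags.pop(attr)
            tags.update(new)       # ...and add all the new tags
    return series_tags

tags = {"s1": {"A": "x", "B": "1"}, "s2": {"A": "y", "B": "1"}}
replace_tags_in(tags, {"A": "x", "B": "1"}, {"A": "x2"})
```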
Maintaining tags
There are several ways to maintain metadata (tags). See the tagging guide for detailed information.
Tagging functions Automatic tagging withseries_names_to_tags()
is convenient when series names are constructed from metadata parts with a uniform pattern. Then tags may be derived from series names by mappping name parts toattributes
either by splitting on aseparator
orregex
.
Manually tagging a dataset withtag_dataset()
will tag the set and propagate tags to all series in the set, whiletag_series()
may be used to tag individual series. If corrections need to be made, tags can be replaced withreplace_tags()
or removed withdetag_dataset()
anddetag_series()
.
- resample(freq, func, *args, **kwargs)¶
Alter frequency of dataset data.
- Return type:
Self
- Parameters:
freq (str)
func (Callable | str)
args (Any)
kwargs (Any)
- save(as_of_tz=None)¶
Persist the Dataset.
- Parameters:
as_of_tz (datetime) – Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (utc dates under the hood).
- Return type:
None
- property series: list[str]¶
Get series names.
- series_names_to_tags(attributes=None, separator='', regex='')¶
Tag all series in the dataset based on a list of ‘attributes’, ie attributes matching positions in the series names when split on ‘separator’.
Alternatively, a regular expression with groups that match the attributes may be provided. Ideally, attributes rely on KLASS, ie a KLASS taxonomy defines the possible attribute values.
Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.
Examples
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
Tag using attributes and default separator:
Let us create some data where the series names are formed by the values [‘x’, ‘y’, ‘z’] separated from [‘a’, ‘b’, ‘c’] by an underscore:
>>> some_data = create_df(
>>>     ["x_a", "y_b", "z_c"],
>>>     start_date="2024-01-01",
>>>     end_date="2024-12-31",
>>>     freq="MS",
>>> )
Then put it into a dataset and tag:
>>> p = Dataset(
>>>     name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data,
>>> )
>>> p.series_names_to_tags(attributes=['XYZ', 'ABC'])
>>> p.tags
The above approach may be used at any time to add tags for an existing dataset, but the same arguments can also be provided when initialising the set:
>>> z = Dataset(
>>>     name="copy_of_sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data,
>>>     attributes=['XYZ', 'ABC'],
>>> )
Best practice is to do this only in the process that writes data to the set. For a finite number of series, it does not need to be repeated. If, on the other hand, the number of series can change over time, doing so at the time of writing ensures all series are tagged.
Tag using attributes and regex:
If series names are less well formed, a regular expression with groups matching the attribute list can be provided instead of the separator parameter.
>>> more_data = create_df(
>>>     ["x_1,,a", "y...b..", "z..1.1-23..c"],
>>>     start_date="2024-01-01",
>>>     end_date="2024-12-31",
>>>     freq="MS",
>>> )
>>> x = Dataset(
>>>     name="bigger_sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=more_data,
>>> )
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'], regex=r'([a-z])*([a-z])')
- Parameters:
attributes (list[str] | None)
separator (str)
regex (str)
- Return type:
None
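In principle, the mapping works as in this stand-alone sketch (the function name is an assumption for illustration): names are split on the separator, or matched by regex capture groups, and the parts are zipped with the attribute list.

```python
import re

def names_to_tags(names, attributes, separator="_", regex=""):
    """Map each series name to a tag dict keyed by the attribute names."""
    tags = {}
    for name in names:
        if regex:  # capture groups correspond to the attributes
            parts = re.match(regex, name).groups()
        else:      # positional parts, split on the separator
            parts = name.split(separator)
        tags[name] = dict(zip(attributes, parts))
    return tags

tags = names_to_tags(["x_a", "y_b"], attributes=["XYZ", "ABC"])
```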
- property series_tags: dict[str, dict[str, str | list[str]]]¶
Get series tags.
- snapshot(as_of_tz=None)¶
Copy data snapshot to immutable processing stage bucket and shared buckets.
- Parameters:
as_of_tz (datetime) – Optional. Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (utc dates under the hood).
- Return type:
None
- tag_dataset(tags=None, **kwargs)¶
Tag the set.
Tags may be provided as a dictionary of tags, or as kwargs.
In both cases they take the form of attribute-value pairs.
Attribute (str): Attribute identifier. Ideally, attributes rely on KLASS, ie a KLASS taxonomy defines the possible attribute values.
Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.
Note that while no such restrictions are enforced, it is strongly recommended that both attribute names (keys) and values are standardised. The best way to ensure that is to use taxonomies (for SSB: KLASS code lists). However, custom controlled vocabularies can also be maintained in files.
Examples
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> x = Dataset(name='sample_dataset',
>>>     data_type=SeriesType.simple(),
>>>     data=create_df(['x','y','z'],
>>>         start_date='2024-01-01',
>>>         end_date='2024-12-31',
>>>         freq='MS',)
>>> )
>>>
>>> x.tag_dataset(tags={'country': 'Norway', 'about': 'something_important'})
>>> x.tag_dataset(another_attribute='another_value')
- Parameters:
tags (dict[str, str | list[str]])
kwargs (str | list[str] | set[str])
- Return type:
None
- tag_series(names='*', tags=None, **kwargs)¶
Tag the series identified by names with the provided tags.
Tags may be provided as a dictionary of tags, or as kwargs.
In both cases they take the form of attribute-value pairs.
Attribute (str): Attribute identifier. Ideally, attributes rely on KLASS, ie a KLASS taxonomy defines the possible attribute values.
Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.
If series names follow the same pattern of attribute values in the same order, separated by the same character sequence, tags can be propagated accordingly by specifying the attributes and separator parameters. The separator defaults to underscore if not provided. Note that propagation by pattern will affect all series in the set, not only the ones identified by names.
Examples
Dependencies
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> some_data = create_df(['x', 'y', 'z'], start_date='2024-01-01', end_date='2024-12-31', freq='MS')
Tag by kwargs
>>> x = Dataset(name='sample_set', data_type=SeriesType.simple(), data=some_data)
>>> x.tag_series(example_1='string_1', example_2=['a', 'b', 'c'])
Tag by dict
>>> x = Dataset(name='sample_set', data_type=SeriesType.simple(), data=some_data)
>>> x.tag_series(tags={'example_1': 'string_1', 'example_2': ['a', 'b', 'c']})
- Parameters:
names (str | list[str])
tags (dict[str, str | list[str]])
kwargs (str | list[str])
- Return type:
None
- vectors(pattern='')¶
Get vectors with names equal to column names from Dataset.data.
- Parameters:
pattern (str) – Optional pattern for simple filtering of column names containing pattern. Defaults to ‘’.
- Return type:
None
Warning
Caution! This (re)assigns variables in the scope of the calling function by way of stack inspection, and hence risks reassigning objects, functions, or variables if they happen to have the same name.
- versions(**kwargs)¶
Get list of all series version markers (as_of dates or version names).
By default as_of dates will be returned in local timezone. Provide return_type = ‘utc’ to return in UTC, ‘raw’ to return as-is.
- Return type:
list[datetime | str]
- Parameters:
kwargs (Any)
- class IO(*args, **kwargs)¶
Bases:
Protocol
Interface for IO operations.
- save()¶
Save the dataset.
- Return type:
None
- snapshot()¶
Save a snapshot of the dataset.
- Return type:
None
- column_aggregate(df, method)¶
Helper function to calculate aggregate over dataframe columns.
- Return type:
Series | Any
- Parameters:
df (DataFrame)
method (str | Callable)
- search(pattern='*', as_of_tz=None, repository='', require_unique=False)¶
Search for datasets by name matching pattern.
- Returns:
The dataset for a single match, a list for no or multiple matches.
- Return type:
list[io.SearchResult] | Dataset | list[None]
- Raises:
ValueError – If require_unique = True and a unique result is not found.
- Parameters:
pattern (str)
as_of_tz (datetime)
repository (str)
require_unique (bool)
- select_repository(name='')¶
Select a named or default repository from the configuration.
If there is only one repo, the choice is easy and the criteria do not matter. Otherwise, if a name is provided, only that is checked. If no name is provided, the first item marked with ‘default’: True is picked. If no item is identified by name or marked as default, the last item is returned. (This behaviour is questionable - it may be turned into an error.)
- Return type:
Any
- Parameters:
name (str)