ssb_timeseries.dataset

The ssb_timeseries.dataset module and its Dataset class are the core of the ssb_timeseries package and define most of its key functionality.

The dataset is the unit of analysis for both the information model and workflow integration, and performance benefits from linear algebra on sets as matrices of series column vectors.

As described in the Information model, time series datasets may consist of any number of series of the same SeriesType. The series types are defined by dimensionality characteristics:

  • Versioning (NONE, AS_OF, NAMED)

  • Temporality (Valid AT point in time, or FROM and TO for duration)

  • The type of the value. For now only scalar values are supported.

Additional type determinants (sparsity, irregular frequencies, non-numeric or non-scalar values, …) are conceivable and may be introduced later. The types are crucial because they are reflected in the physical storage structure. That in turn has practical implications for how the series can be interacted with, and for methods working on the data.
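
For orientation, a minimal sketch of how a series type is obtained. Only the simple() shorthand is confirmed by the examples in this document; the comments about other combinations are assumptions:

>>> from ssb_timeseries.properties import SeriesType
>>>
>>> # Versioning NONE, Temporality AT: the shorthand used throughout this page
>>> simple = SeriesType.simple()
>>> # Other Versioning/Temporality combinations correspond to other SeriesType
>>> # instances; their constructor names are not shown in this document.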

See also

The ssb_timeseries.catalog module for tools for searching for datasets or series by names or metadata.

class Dataset(name, data_type=None, as_of_tz=None, repository='', load_data=True, **kwargs)

Bases: object

Datasets are containers for series of the same SeriesType with origin from the same process.

That generally implies some common denominator in terms of descriptive metadata, but more importantly, it allows the Dataset to become a core unit of analysis for workflow. It becomes a natural chunk of data for reads, writes, and calculations.

For all the series in a dataset to be of the same SeriesType means they share the dimensionality characteristics Versioning and Temporality and any other schema information that has technical implications for how the data is handled. See the Information model documentation for more about that.

The descriptive commonality is not enforced, but some aspects have technical implications. In particular, it is strongly encouraged to make sure that the resolutions of the series in a dataset are the same, and to minimize the number of gaps in the series. Sparse data is a strong indication that a dataset is not well defined and that the series in the set have different origins. A 'gap' in this context is any representation of undefined values: None, null, NaN or "not a number" values, as opposed to the number zero. The number zero is a gray area: it can be perfectly valid, but can also be an indication that not all the series should be part of the same set.

Variables:
  • name (str) – The name of the set.

  • data_type (SeriesType) – The type of the contents of the set.

  • as_of_tz (datetime) – The version datetime, if applicable to the data_type.

  • data (Dataframe) – A dataframe or table structure with one or more datetime columns defined by the data type and a column per series in the set.

  • tags (dict) – A dictionary with metadata describing both the dataset itself and the series in the set.

Parameters:
  • name (str)

  • data_type (SeriesType)

  • as_of_tz (datetime)

  • repository (str)

  • load_data (bool)

  • kwargs (Any)

__add__(other)

Add two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)
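
A brief sketch of how the operators compose (set and series names are illustrative; the exact return types follow the math() helper documented further down):

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> a = Dataset(
>>>     name='set_a',
>>>     data_type=SeriesType.simple(),
>>>     data=create_df(['p', 'q'], start_date='2024-01-01', end_date='2024-06-30', freq='MS'),
>>> )
>>> doubled = a * 2  # dataset and scalar
>>> total = a + a    # dataset and dataset of matching shape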

__eq__(other)

Check equality of two datasets or a dataset and a dataframe, numpy array or scalar.

Parameters:

other (Self | DataFrame | Series | int | float)

Return type:

Any

__floordiv__(other)

Floor divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__getitem__(criteria='', **kwargs)

Access Dataset.data.columns via Dataset[list[column_names] | pattern | tags].

Parameters:
  • criteria – Either a string pattern or a dict of tags.

  • kwargs – If criteria is empty, this is passed to filter().

Returns:

Self | None

Raises:

TypeError – If filter() returns another type than Dataset.

Return type:

Self | None
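
A sketch of the access styles (names and tags are illustrative, assuming a set like the ones created in the examples below):

>>> x['p']                        # pattern: columns with names matching 'p'
>>> x[['p', 'q']]                 # explicit list of column names
>>> x[{'example_1': 'string_1'}]  # tag dictionary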

__gt__(other)

Check greater than for two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__init__(name, data_type=None, as_of_tz=None, repository='', load_data=True, **kwargs)

Initialising a dataset object either retrieves an existing set or prepares a new one.

When preparing a new set, data_type must be specified. If the data_type has AS_OF versioning, a datetime with timezone should be provided; if it is not provided but data is passed, as_of_tz defaults to the current time. Providing an AS_OF date has no effect if versioning is NONE. For all dates, CET is assumed if no timezone is provided.

The data parameter accepts a dataframe with one or more date columns and one column per series. Initially only Pandas was supported, but this dependency is about to be relaxed to include other implementations of the same data structure. Beyond Polars and Pyarrow, notable options under consideration include Ibis, DuckDB and Narwhals.

Data is kept in memory and not stored before an explicit call to save(). The data is stored in parquet files, with JSON formatted metadata in the header.

Metadata will always be read if the set exists. When loading existing sets, load_data = False will suppress reading large amounts of data. For data_types with AS_OF versioning, not providing the AS_OF date will have the same effect.

If series names can be mapped to metadata, the keyword arguments attributes, separator and regex will, if provided, be passed through to series_names_to_tags. If series names are not easily translated to tags, tag_dataset and tag_series and their siblings retag and detag can be used for manual metadata maintenance.

Keyword Arguments:
  • attributes (list[str]) – Attribute names for use with series_names_to_tags in combination with either separator or regex.

  • separator (str) – Character(s) separating attributes for use with series_names_to_tags.

  • regex (str) – Regular expression with capture groups corresponding to attributes. Used instead of the separator to match more complicated name patterns in series_names_to_tags.

Parameters:
  • name (str)

  • data_type (SeriesType)

  • as_of_tz (datetime)

  • repository (str)

  • load_data (bool)

  • kwargs (Any)

Return type:

None

import ssb_timeseries as ts

# sample dataframe; the name suggests x, y, z series valid AT points in time
df = ts.sample_data.xyz_at()
print(df)

# wrap the dataframe in a dataset; nothing is persisted until save() is called
x = ts.dataset.Dataset(
    name='mydataset',
    data_type=ts.properties.SeriesType.simple(),
    data=df
)
__lt__(other)

Check less than for two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__mod__(other)

Modulo of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__mul__(other)

Multiply two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__pow__(other)

Power of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__radd__(other)

Right add two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__repr__()

Returns a machine readable string representation of the Dataset, ideally sufficient to recreate the object.

Return type:

str

__rfloordiv__(other)

Right floor divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rmod__(other)

Right modulo of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rmul__(other)

Right multiply two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rpow__(other)

Right power of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rsub__(other)

Right subtract two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rtruediv__(other)

Right divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__str__()

Returns a human readable string representation of the Dataset.

Return type:

str

__sub__(other)

Subtract two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__truediv__(other)

Divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

aggregate(attributes, taxonomies, functions, sep='_')

Aggregate dataset by taxonomy hierarchies.

Parameters:
  • attributes (list[str]) – The attributes to aggregate by.

  • taxonomies (list[int | meta.Taxonomy | dict[str, str] | PathStr]) – Value definitions for the attributes. Can be meta.Taxonomy objects, or klass_ids, data dictionaries or paths from which they can be retrieved or constructed.

  • functions (list[str|F] | set[str|F]) – Optional function or list of functions (or function names) to apply (mean | count | sum | …). Defaults to sum.

  • sep (str) – Optional separator used when joining multiple attributes into names of aggregated series. Defaults to ‘_’.

Returns:

A dataset object with the aggregated data. If the taxonomy object has hierarchical structure, aggregate series are calculated for parent nodes at all levels. If the taxonomy is a flat list, only a single total aggregate series is calculated.

Return type:

Self

Raises:

TypeError – If any of the taxonomy identifiers are of unexpected types.

Examples

To calculate the 10th and 90th percentiles and the median for a dataset where codes from KLASS 157 (energy_balance) distinguish between the series in the set:

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>> from ssb_timeseries.meta import Taxonomy
>>>
>>> klass157 = Taxonomy(klass_id=157)
>>> klass157_leaves = [n.name for n in klass157.structure.root.leaves]
>>> tag_permutation_space = {"A": klass157_leaves, "B": ["q"], "C": ["z"]}
>>> series_names: list[list[str]] = [value for value in tag_permutation_space.values()]
>>> sample_df = create_df(*series_names, start_date="2024-01-01", end_date="2024-12-31", freq="MS",)
>>> sample_set = Dataset(name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=sample_df,
>>>     attributes=["energy_balance", "B", "C"],
>>> )
>>>
>>> def perc10(x):
>>>     return x.quantile(.1, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> def perc90(x):
>>>     return x.quantile(.9, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> percentiles = sample_set.aggregate(["energy_balance"], [157], [perc10, 'median', perc90])
all()

Check if all values in series columns evaluate to true.

Return type:

bool

any()

Check if any values in series columns evaluate to true.

Return type:

bool
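
Combined with the comparison operators above, these support simple validation checks (a sketch, assuming comparisons return an object exposing all() and any()):

>>> all_positive = (x > 0).all()
>>> any_large = (x > 1000).any()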

copy(new_name, **kwargs)

Create a copy of the Dataset.

The copy needs a new name, but unless other information is specified, it will be created with the same data_type, as_of_tz, data, and tags.

Return type:

Self

Parameters:
  • new_name (str)

  • kwargs (Any)

datetime_columns(*comparisons)

Get names of datetime columns (valid_at, valid_from, valid_to).

Parameters:

*comparisons (Self | pd.DataFrame) – Objects to compare with. If provided, returns the intersection of self and all comparisons.

Returns:

The (common) datetime column names of self (and comparisons).

Return type:

list[str]

Raises:

ValueError – If comparisons are not of type Self or pd.DataFrame.
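
A sketch (the returned names depend on the temporality of the set):

>>> x.datetime_columns()   # eg ['valid_at'] for an AT type set
>>> x.datetime_columns(y)  # names common to x and another dataset or dataframe y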

default_tags()

Return default tags for set and series.

Return type:

dict[str, dict[str, str | list[str]] | dict[str, dict[str, str | list[str]]]]

detag_dataset(*args, **kwargs)

Detag selected attributes of the set.

Tags to be removed may be provided as a list of attribute names or as kwargs with attribute-value pairs.

Parameters:
  • args (str)

  • kwargs (Any)

Return type:

None

detag_series(*args, **kwargs)

Detag selected attributes of series in the set.

Tags to be removed may be specified by args or kwargs. Attributes listed in args will be removed from all series.

For kwargs, attributes will be removed from the series if the value matches exactly. If the value is a list, the matching value is removed. If kwargs contain all=True, all attributes except defaults are removed.

Parameters:
  • args (str)

  • kwargs (Any)

Return type:

None
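
A sketch of the variants (attribute names and values are illustrative):

>>> x.detag_series('quality')      # remove the 'quality' attribute from all series
>>> x.detag_series(example_2='b')  # remove the value 'b' where it matches
>>> x.detag_series(all=True)       # remove everything except default tags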

filter(pattern='', tags=None, regex='', output='dataset', new_name='', **kwargs)

Filter dataset.data by a text pattern, a regex, a metadata tag dictionary, or any combination of these.

Parameters:
  • pattern (str) – Text pattern for search ‘like’ in column names. Defaults to ‘’.

  • regex (str) – Expression for regex search in column names. Defaults to ‘’.

  • tags (dict) – Dictionary with tags to search for. Defaults to None. All tags in the dict must be satisfied for the same series (tags are combined by AND). If a list of values is provided for a tag, any one of them satisfies the criteria (OR). Support for a list of dicts, to express alternative sets of criteria (the dicts combined by OR), is planned but not yet implemented.

  • output (str) – Output type: 'dataset' or 'dataframe'. Defaults to 'dataset'. Short forms 'df' or 'ds' are accepted.

  • new_name (str) – Name of new Dataset. If not provided, a new name is generated.

  • **kwargs – If provided, passed into the init of the new set.

Returns:

By default a new Dataset (a deep copy of self). If output=”dataframe” or “df”, a dataframe. TODO: Explore shallow copy / nocopy options.

Return type:

Dataset | Dataframe
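
A sketch of the options (tag names are illustrative):

>>> by_pattern = x.filter(pattern='x')                      # column names containing 'x'
>>> by_regex = x.filter(regex='^x')                         # column names matching a regex
>>> by_tags = x.filter(tags={'A': ['a1', 'a2'], 'B': 'q'})  # AND between keys, OR within lists
>>> as_df = x.filter(pattern='x', output='df')              # return a dataframe instead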

groupby(freq, func='auto', *args, **kwargs)

Group dataset data by specified frequency and function.

Returns a new Dataset.

Return type:

Self

Parameters:
  • freq (str)

  • func (str)

  • args (Any)

  • kwargs (Any)
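
A sketch, assuming pandas style frequency strings and aggregation function names:

>>> quarterly = x.groupby('QS', 'sum')  # sum monthly data into quarters
>>> yearly = x.groupby('YS', 'mean')    # average into years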

math(other, func)

Generic helper that makes math functions work on the numeric, non-date columns: dataframe to dataframe, matrix to matrix, matrix to vector and matrix to scalar.

Although the purpose was to limit boilerplate for core linear algebra functions, it also extends to other operations that follow the same differentiation pattern.

Parameters:
  • other (dataframe | series | matrix | vector | scalar) – One (or more?) pandas (polars to come) dataframe or series, numpy matrix or vector, or a scalar value.

  • func (Callable) – The function to be applied as self.func(other) or (in some cases) with infix notation self f other. Note that one or more date columns of the self / lefthand side argument are preserved, ie data shifting operations are not supported.

Raises:
  • ValueError – “Unsupported operand type”

  • ValueError – “Incompatible shapes.”

Returns:

Depending on the inputs: A new dataset / vector / scalar with the result. For datasets, the name of the new set is derived from inputs and the functions applied.

Return type:

Any

moving_average(start=0, stop=0, nan_rows='return')

Returns a new Dataset with moving averages for all series.

The average is calculated over a time window defined by the start and stop period offsets. Negative values denote periods before the current one, positive values after. Both default to 0, ie the current period, so at least one of them should be set.

Return type:

Self

Parameters:
  • start (int)

  • stop (int)

  • nan_rows (str)

>>> x.moving_average(start= -3, stop= -1) # xdoctest: +SKIP
signifies the average over the three periods before (not including the current).

The offset parameters will overflow the date range at the beginning and/or end, where moving averages cannot be calculated.

Set the parameter nan_rows to control the behaviour in such cases: 'return' to return rows with all NaN values (default), 'remove' to remove these rows from both ends.

TO DO: Add parameter to choose returned time window? TO DO: Add more NaN handling options? TO DO: Add parameter to ensure/alter sampling frequency before calculating.

numeric_columns()

Get names of all numeric series columns (ie columns that are not datetime).

Return type:

list[str]

plot(*args, **kwargs)

Plot dataset data.

Convenience wrapper around Dataframe.plot() with sensible defaults.

Return type:

Any

Parameters:
  • args (Any)

  • kwargs (Any)

rename(new_name)

Rename the Dataset.

For use by .copy, and on very rare other occasions. Does not move or rename any previously stored data.

Return type:

None

Parameters:

new_name (str)

replace_tags(*args)

Retag selected attributes of series in the set.

The tags to be replaced and their replacements should be specified in tuple(s) of tag dictionaries; each argument in *args should be of the form ({<old_tags>}, {<new_tags>}).

Return type:

None

Parameters:

args (tuple[dict[str, str | list[str]], dict[str, str | list[str]]])

Both old and new TagDict can contain multiple tags.
  • Each tuple is evaluated independently for each series in the set.

  • If the tag dict to be replaced contains multiple tags, all must match for tags to be replaced.

  • If the new tag dict contains multiple tags, all are added where there is a match.
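
A sketch (tags are illustrative):

>>> x.replace_tags(({'example_1': 'string_1'}, {'example_1': 'string_2'}))
>>> # multiple replacements in one call, evaluated independently per series:
>>> x.replace_tags(
>>>     ({'A': 'a1'}, {'A': 'a2'}),
>>>     ({'B': 'q', 'C': 'z'}, {'B': 'p'}),
>>> )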

resample(freq, func, *args, **kwargs)

Alter frequency of dataset data.

Return type:

Self

Parameters:
  • freq (str)

  • func (Callable | str)

  • args (Any)

  • kwargs (Any)
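
A sketch, assuming pandas style frequency strings:

>>> quarterly_sums = x.resample('QS', 'sum')
>>> quarterly_means = x.resample('QS', 'mean')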

save(as_of_tz=None)

Persist the Dataset.

Parameters:

as_of_tz (datetime) – Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (UTC dates under the hood).

Return type:

None
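
A sketch; the explicit as_of_tz is only meaningful for sets with AS_OF versioning:

>>> x.save()  # persist with the as_of date of the set (if versioned)
>>>
>>> from datetime import datetime, timezone
>>> x.save(as_of_tz=datetime(2024, 7, 1, tzinfo=timezone.utc))  # write a new version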

property series: list[str]

Get series names.

series_names_to_tags(attributes=None, separator='', regex='')

Tag all series in the dataset based on 'attributes', a list of attribute names matching positions in the series names when split on 'separator'.

Alternatively, a regular expression with capture groups matching the attributes may be provided. Ideally the attributes rely on KLASS, ie a KLASS taxonomy defines the possible attribute values.

Examples

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df

Tag using attributes and default separator:

Let us create some data where the series names are formed by the values ['x', 'y', 'z'] separated from ['a', 'b', 'c'] by an underscore:

>>> some_data = create_df(
>>>     ["x_a", "y_b", "z_c"],
>>>     start_date="2024-01-01",
>>>     end_date="2024-12-31",
>>>     freq="MS",
>>> )

Then put it into a dataset and tag:

>>> p = Dataset(
>>>     name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data,
>>> )
>>> p.series_names_to_tags(attributes=['XYZ', 'ABC'])
>>> p.tags

The above approach may be used at any time to add tags for an existing dataset, but the same arguments can also be provided when initialising the set:

>>> z = Dataset(
>>>     name="copy_of_sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data,
>>>     attributes=['XYZ', 'ABC'],
>>> )

Best practice is to do this only in the process that writes data to the set. For a finite number of series, it does not need to be repeated.

If, on the other hand, the number of series can change over time, tagging at the time of writing ensures all series are tagged.

Tag using attributes and regex:

If series names are less well formed, a regular expression with groups matching the attribute list can be provided instead of the separator parameter.

>>> more_data = create_df(
>>>     ["x_1,,a", "y...b..", "z..1.1-23..c"],
>>>     start_date="2024-01-01",
>>>     end_date="2024-12-31",
>>>     freq="MS",
>>> )
>>> x = Dataset(
>>>     name="bigger_sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=more_data,
>>> )
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'], regex=r'([a-z]).*([a-z])')
Parameters:
  • attributes (list[str] | None)

  • separator (str)

  • regex (str)

Return type:

None

property series_tags: dict[str, dict[str, str | list[str]]]

Get series tags.

snapshot(as_of_tz=None)

Copy data snapshot to immutable processing stage bucket and shared buckets.

Parameters:

as_of_tz (datetime) – Optional. Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (utc dates under the hood).

Return type:

None

tag_dataset(tags=None, **kwargs)

Tag the set.

Tags may be provided as dictionary of tags, or as kwargs.

In both cases they take the form of attribute-value pairs.

Attribute (str): Attribute identifier. Ideally attributes relies on KLASS, ie a KLASS taxonomy defines the possible attribute values.

Value (str): Element identifier, unique within the taxonomy. Ideally KLASS code.

Note that while no such restrictions are enforced, it is strongly recommended that both attribute names (keys) and values are standardised. The best way to ensure that is to use taxonomies (for SSB: KLASS code lists). However, custom controlled vocabularies can also be maintained in files.

Examples

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> x = Dataset(name='sample_dataset',
>>>         data_type=SeriesType.simple(),
>>>         data=create_df(['x','y','z'],
>>>             start_date='2024-01-01',
>>>             end_date='2024-12-31',
>>>             freq='MS',)
>>> )
>>>
>>> x.tag_dataset(tags={'country': 'Norway', 'about': 'something_important'})
>>> x.tag_dataset(another_attribute='another_value')
Parameters:
  • tags (dict[str, str | list[str]])

  • kwargs (str | list[str] | set[str])

Return type:

None

tag_series(names='*', tags=None, **kwargs)

Tag the series identified by names with provided tags.

Tags may be provided as dictionary of tags, or as kwargs.

In both cases they take the form of attribute-value pairs.

Attribute (str): Attribute identifier. Ideally attributes relies on KLASS, ie a KLASS taxonomy defines the possible attribute values.

Value (str): Element identifier, unique within the taxonomy. Ideally KLASS code.

If series names follow the same pattern of attribute values in the same order separated by the same character sequence, tags can be propagated accordingly by specifying the attributes and separator parameters. The separator defaults to underscore if not provided. Note that propagation by pattern will affect all series in the set, not only the ones identified by names.

Examples

Dependencies

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> some_data = create_df(['x', 'y', 'z'], start_date='2024-01-01', end_date='2024-12-31', freq='MS')

Tag by kwargs

>>> x = Dataset(name='sample_set',data_type=SeriesType.simple(),data=some_data)
>>> x.tag_series(example_1='string_1', example_2=['a', 'b', 'c'])

Tag by dict

>>> x = Dataset(name='sample_set',data_type=SeriesType.simple(),data=some_data)
>>> x.tag_series(tags={'example_1': 'string_1', 'example_2': ['a', 'b', 'c']})
Parameters:
  • names (str | list[str])

  • tags (dict[str, str | list[str]])

  • kwargs (str | list[str])

Return type:

None

vectors(pattern='')

Get vectors with names equal to column names from Dataset.data.

Parameters:

pattern (str) – Optional pattern for simple filtering of column names containing pattern. Defaults to ‘’.

Return type:

None

Warning

Caution! This (re)assigns variables in the scope of the calling function by way of stack inspection, and hence risks reassigning objects, functions, or variables if they happen to have the same name.
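
A sketch illustrating both the convenience and the risk:

>>> x.vectors()     # one variable per series column appears in the calling scope
>>> x.vectors('p')  # only columns whose names contain 'p'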

versions(**kwargs)

Get list of all series version markers (as_of dates or version names).

By default as_of dates will be returned in local timezone. Provide return_type = ‘utc’ to return in UTC, ‘raw’ to return as-is.

Return type:

list[datetime | str]

Parameters:

kwargs (Any)

class IO(*args, **kwargs)

Bases: Protocol

Interface for IO operations.

save()

Save the dataset.

Return type:

None

snapshot()

Save a snapshot of the dataset.

Return type:

None

column_aggregate(df, method)

Helper function to calculate aggregate over dataframe columns.

Return type:

Series | Any

Parameters:
  • df (DataFrame)

  • method (str | Callable)

search(pattern='*', as_of_tz=None, repository='', require_unique=False)

Search for datasets by name matching pattern.

Returns:

The dataset for a single match, a list for no or multiple matches.

Return type:

list[io.SearchResult] | Dataset | list[None]

Raises:

ValueError – If require_unique = True and a unique result is not found.

Parameters:
  • pattern (str)

  • as_of_tz (datetime)

  • repository (str)

  • require_unique (bool)
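
A sketch, assuming the default '*' indicates glob style patterns:

>>> from ssb_timeseries.dataset import search
>>> matches = search('sample*')                      # a list of results, or a single Dataset
>>> one = search('sample_set', require_unique=True)  # raises ValueError unless exactly one match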

select_repository(name='')

Select a named or default repository from the configuration.

If there is only one repository, the choice is easy and the criteria do not matter. Otherwise, if a name is provided, only that is checked. If no name is provided, the first item marked with 'default': True is picked. If no item is identified by name or marked as default, the last item is returned. (This behaviour is questionable - it may be turned into an error.)

Return type:

Any

Parameters:

name (str)