ssb_timeseries.dataset

The ssb_timeseries.dataset module and its Dataset class are the very core of the ssb_timeseries package, defining most of the key functionality.

The dataset is the unit of analysis for both the information model and workflow integration, and performance benefits from doing linear algebra on sets as matrices of series column vectors.

As described in the Information model, time series datasets may consist of any number of series of the same type. Series types are defined by properties.Versioning and properties.Temporality; see properties.SeriesType.

It is also strongly encouraged to make sure that the resolutions of the series in a dataset are the same, and to minimize the number of gaps in the series. Very sparse data is a strong indication that a dataset is not well defined: it may indicate that the series in the set have different origins. What counts as ‘gaps’ in this context is any representation of undefined values: None, null, NaN or “not a number” values, as opposed to the number zero. The number zero is a gray area: it can be perfectly valid, but it can also be an indication that not all the series should be part of the same set.

See also

See documentation for the ssb_timeseries.catalog module for tools for searching for datasets or series by names or metadata.

class Dataset(name, data_type=None, as_of_tz=None, load_data=True, **kwargs)

Bases: object

Datasets are the core unit of analysis for workflow and data storage.

A dataset is a logical collection of data and metadata stemming from the same process origin. All the series in a dataset must be of the same type.

The type defines
  • versioning (NONE, AS_OF, NAMED)

  • temporality (Valid AT point in time, or FROM and TO for duration)

  • value (for now only scalars)

Initialising a dataset object can either retrieve data and metadata for an existing set or prepare a new one.

If data_type versioning is specified as AS_OF, a datetime with timezone should be provided. If it is not, but data is passed, as_of_tz defaults to the current time. Providing an AS_OF date has no effect if versioning is NONE.

When loading existing sets, load_data=False can be set in order to suppress reading large amounts of data. For data types with AS_OF versioning, not providing the AS_OF date has the same effect.

Metadata will always be read. Data is currently represented as Pandas dataframes, but Polars lazyframes or PyArrow tables are likely to be better choices. The initial implementation stores data in Parquet files; Feather files and various database options are considered for later.

Support for additional “type” features/flags (behaviours like sparse data) may be added later, if needed.

Data is kept in memory and not stored before an explicit call to .save.

Parameters:
  • name (str)

  • data_type (SeriesType)

  • as_of_tz (datetime)

  • load_data (bool)

  • kwargs (Any)
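A minimal sketch of both modes of initialisation, using the sample helpers from the examples later in this section. SeriesType.estimate() is assumed here to denote an AS_OF type; substitute the SeriesType your data requires:

>>> from datetime import datetime, timezone
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> # prepare a new versioned set; AS_OF versioning wants a timezone aware datetime
>>> new_set = Dataset(name='my_new_set',
>>>     data_type=SeriesType.estimate(),
>>>     as_of_tz=datetime(2024, 6, 1, tzinfo=timezone.utc),
>>>     data=create_df(['p', 'q'], start_date='2024-01-01', end_date='2024-06-01', freq='MS'),
>>> )
>>> # retrieve an existing set, metadata only
>>> existing = Dataset(name='my_new_set', load_data=False)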

__add__(other)

Add two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)
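The arithmetic dunder methods below all follow this same pattern and delegate to math(). A short sketch, assuming a dataset x constructed as in the examples later in this section:

>>> y = x + 100    # add a scalar to every numeric column
>>> z = x - y      # elementwise on two sets with matching shapes and date columns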

__eq__(other)

Check equality of two datasets or a dataset and a dataframe, numpy array or scalar.

Parameters:

other (Self | DataFrame | Series | int | float)

Return type:

Any

__floordiv__(other)

Floor divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__getitem__(criteria='', **kwargs)

Access Dataset.data.columns via Dataset[list[column_names] | pattern | tags].

Parameters:
  • criteria – Either a string pattern or a dict of tags.

  • kwargs – If criteria is empty, this is passed to filter().

Returns:

Self | None

Raises:

TypeError – If filter() returns another type than Dataset.

Return type:

Self | None
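A brief sketch of the three access styles, assuming series named and tagged as in the examples later in this section:

>>> x['x_a']              # single column name
>>> x[['x_a', 'y_b']]     # list of column names
>>> x[{'A': 'x'}]         # tag dictionary, passed on to filter()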

__gt__(other)

Check greater than for two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__init__(name, data_type=None, as_of_tz=None, load_data=True, **kwargs)

Initialising a dataset object can either retrieve data and metadata for an existing set or prepare a new one.

If data_type versioning is specified as AS_OF, a datetime with timezone should be provided. If it is not, but data is passed, as_of_tz defaults to the current time. Providing an AS_OF date has no effect if versioning is NONE.

When loading existing sets, load_data=False can be set in order to suppress reading large amounts of data. For data types with AS_OF versioning, not providing the AS_OF date has the same effect.

Metadata will always be read. Data is currently represented as Pandas dataframes, but Polars lazyframes or PyArrow tables are likely to be better choices. The initial implementation stores data in Parquet files; Feather files and various database options are considered for later.

Support for additional “type” features/flags (behaviours like sparse data) may be added later, if needed.

Data is kept in memory and not stored before an explicit call to .save.

Parameters:
  • name (str)

  • data_type (SeriesType)

  • as_of_tz (datetime)

  • load_data (bool)

  • kwargs (Any)

Return type:

None

__lt__(other)

Check less than for two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__mod__(other)

Modulo of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__mul__(other)

Multiply two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__pow__(other)

Power of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__radd__(other)

Right add two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__repr__()

Returns a machine readable string representation of the Dataset, ideally sufficient to recreate the object.

Return type:

str

__rfloordiv__(other)

Right floor divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rmod__(other)

Right modulo of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rmul__(other)

Right multiply two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rpow__(other)

Right power of two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rsub__(other)

Right subtract two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__rtruediv__(other)

Right divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__str__()

Returns a human readable string representation of the Dataset.

Return type:

str

__sub__(other)

Subtract two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

__truediv__(other)

Divide two datasets or a dataset and a dataframe, numpy array or scalar.

Return type:

Any

Parameters:

other (Self | DataFrame | Series | int | float)

aggregate(attributes, taxonomies, functions, sep='_')

Aggregate dataset by taxonomy hierarchies.

Parameters:
  • attributes (list[str]) – The attributes to aggregate by.

  • taxonomies (list[int | meta.Taxonomy | dict[str, str] | PathStr]) – Value definitions for the attributes. These can be meta.Taxonomy objects, or KLASS ids, data dictionaries or paths that can be used to retrieve or construct them.

  • functions (list[str|F] | set[str|F]) – Optional function name or callable, or a list/set of them, to apply (mean | count | sum | …). Defaults to sum.

  • sep (str) – Optional separator used when joining multiple attributes into names of aggregated series. Defaults to ‘_’.

Returns:

A dataset object with the aggregated data. If the taxonomy object has hierarchical structure, aggregate series are calculated for parent nodes at all levels. If the taxonomy is a flat list, only a single total aggregate series is calculated.

Return type:

Self

Raises:

TypeError – If any of the taxonomy identifiers are of unexpected types.

Examples

To calculate the 10th and 90th percentiles and the median for the dataset x, where codes from KLASS 157 (energy_balance) distinguish between the series in the set:

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>> from ssb_timeseries.meta import Taxonomy
>>>
>>> klass157 = Taxonomy(klass_id=157)
>>> klass157_leaves = [n.name for n in klass157.structure.root.leaves]
>>> tag_permutation_space = {"A": klass157_leaves, "B": ["q"], "C": ["z"]}
>>> series_names: list[list[str]] = [value for value in tag_permutation_space.values()]
>>> sample_df = create_df(*series_names, start_date="2024-01-01", end_date="2024-12-31", freq="MS",)
>>> sample_set = Dataset(name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=sample_df,
>>>     name_pattern=["A", "B", "C"],
>>> )
>>>
>>> def perc10(x):
>>>     return x.quantile(.1, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> def perc90(x):
>>>     return x.quantile(.9, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> percentiles = sample_set.aggregate(["energy_balance"], [157], [perc10, 'median', perc90])

all()

Check if all values in series columns evaluate to true.

Return type:

bool

any()

Check if any values in series columns evaluate to true.

Return type:

bool

copy(new_name, **kwargs)

Create a copy of the Dataset.

The copy needs to get a new name, but unless other information is specified, it will be created with the same data_type, as_of_tz, data, and tags.

Return type:

Self

Parameters:
  • new_name (str)

  • kwargs (Any)
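For example (a sketch, assuming a dataset x as in the examples in this section; overriding attributes via kwargs follows the “unless other information is specified” behaviour above):

>>> y = x.copy('sample_set_copy')                  # same type, data and tags
>>> z = x.copy('sample_set_v2', data=other_df)     # hypothetical data override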

datetime_columns(*comparisons)

Get names of datetime columns (valid_at, valid_from, valid_to).

Parameters:

*comparisons (Self | pd.DataFrame) – Objects to compare with. If provided, returns the intersection of self and all comparisons.

Returns:

The (common) datetime column names of self (and comparisons).

Return type:

list[str]

Raises:

ValueError – If comparisons are not of type Self or pd.DataFrame.
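For example, a set with temporality AT has a single datetime column (a sketch; other_set is a hypothetical second dataset):

>>> x.datetime_columns()              # ['valid_at'] for an AT type set
>>> x.datetime_columns(other_set)     # datetime names common to x and other_set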

default_tags()

Return default tags for set and series.

Return type:

dict[str, dict[str, str | list[str]] | dict[str, dict[str, str | list[str]]]]

detag_dataset(*args, **kwargs)

Detag selected attributes of the set.

Tags to be removed may be provided as list of attribute names or as kwargs with attribute-value pairs.

Return type:

None

Parameters:
  • args (str)

  • kwargs (Any)

detag_series(*args, **kwargs)

Detag selected attributes of series in the set.

Tags to be removed may be specified by args or kwargs. Attributes listed in args will be removed from all series.

For kwargs, attributes will be removed from the series if the value matches exactly. If the value is a list, the matching value is removed. If kwargs contain all=True, all attributes except defaults are removed.

Parameters:
  • args (str)

  • kwargs (Any)

Return type:

None
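A short sketch, assuming series tagged as in the tag_series examples later in this section:

>>> x.detag_series('example_2')             # remove the attribute from all series
>>> x.detag_series(example_1='string_1')    # remove only where the value matches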

filter(pattern='', tags=None, regex='', output='dataset', new_name='', **kwargs)

Filter dataset.data by textual pattern, regex or metadata tag dictionary, or a combination of these.

Parameters:
  • pattern (str) – Text pattern for search ‘like’ in column names. Defaults to ‘’.

  • regex (str) – Expression for regex search in column names. Defaults to ‘’.

  • tags (dict) – Dictionary with tags to search for. Defaults to None. All tags in the dict must be satisfied for the same series (tags are combined by AND). If a list of values is provided for a tag, the criterion is satisfied by any one of them (OR). Support for list(dict) is planned but not yet implemented, to allow alternative sets of criteria (the dicts will be combined by OR).

  • output (str) – Output type: 'dataset' or 'dataframe'. Defaults to 'dataset'. The short forms 'ds' and 'df' are accepted.

  • new_name (str) – Name of new Dataset. If not provided, a new name is generated.

  • **kwargs – if provided, goes into the init of the new set.

Returns:

By default a new Dataset (a deep copy of self). If output='dataframe' or 'df', a dataframe. TODO: Explore shallow copy / no-copy options.

Return type:

Dataset | DataFrame
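A short sketch of the main filtering styles, assuming series named and tagged as in the examples later in this section:

>>> subset = x.filter(pattern='x_')                # names containing 'x_'
>>> subset = x.filter(regex='^x')                  # regex match on names
>>> df = x.filter(tags={'A': 'x'}, output='df')    # tag match, dataframe output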

groupby(freq, func='auto', *args, **kwargs)

Group dataset data by specified frequency and function.

Returns a new Dataset.

Return type:

Self

Parameters:
  • freq (str)

  • func (str)

  • args (Any)

  • kwargs (Any)
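A sketch, assuming pandas-style frequency strings and aggregation function names:

>>> quarterly = x.groupby('QS', 'sum')    # group monthly data by quarter start, summed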

math(other, func)

Generic helper making math functions work on the numeric, non-date columns: dataframe to dataframe, matrix to matrix, matrix to vector and matrix to scalar.

Although the purpose was to limit “boilerplate” for core linear algebra functions, it also extends to other operations that follow the same differentiation pattern.

Parameters:
  • other (dataframe | series | matrix | vector | scalar) – One (or more?) pandas (polars to come) dataframe or series, numpy matrix or vector, or a scalar value.

  • func (Callable) – The function to be applied as self.func(**other:Self) or (in some cases) with infix notation self f other. Note that one or more date columns of the self / left-hand side argument are preserved, i.e. data shifting operations are not supported.

Raises:
  • ValueError – “Unsupported operand type”

  • ValueError – “Incompatible shapes.”

Returns:

Depending on the inputs: A new dataset / vector / scalar with the result. For datasets, the name of the new set is derived from inputs and the functions applied.

Return type:

Any
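Most users will not call math() directly; the operators above delegate to it. A sketch of a direct call, assuming numpy ufuncs are accepted as in the operator implementations (other_set is a hypothetical second dataset):

>>> import numpy as np
>>> total = x.math(other_set, np.add)    # equivalent to x + other_set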

numeric_columns()

Get names of all numeric series columns (i.e. columns that are not datetime).

Return type:

list[str]

plot(*args, **kwargs)

Plot dataset data.

Convenience wrapper around DataFrame.plot() with sensible defaults.

Return type:

Any

Parameters:
  • args (Any)

  • kwargs (Any)

rename(new_name)

Rename the Dataset.

For use by .copy, and on very rare other occasions. Does not move or rename any previously stored data.

Return type:

None

Parameters:

new_name (str)

replace_tags(*args)

Retag selected attributes of series in the set.

Return type:

None

Parameters:

args (tuple[dict[str, str | list[str]], dict[str, str | list[str]]])

The tags to be replaced and their replacements should be specified as tuple(s) of dictionaries for (old_tags, new_tags). Both can contain multiple tags.
  • Each tuple is evaluated independently for each series in the set.

  • If the tag dict to be replaced contains multiple tags, all must match for tags to be replaced.

  • If the new tag dict contains multiple tags, all are added where there is a match.
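A short sketch, as noted in the list above, assuming series tagged as in the tag_series examples later in this section:

>>> x.replace_tags(({'example_1': 'string_1'}, {'example_1': 'string_2'}))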

resample(freq, func, *args, **kwargs)

Alter frequency of dataset data.

Return type:

Self

Parameters:
  • freq (str)

  • func (Callable | str)

  • args (Any)

  • kwargs (Any)
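A sketch, assuming pandas-style frequency strings and aggregation function names (daily_set is a hypothetical dataset with daily resolution):

>>> monthly = daily_set.resample('MS', 'sum')    # aggregate daily data to monthly sums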

save(as_of_tz=None)

Persist the Dataset.

Parameters:

as_of_tz (datetime) – Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (UTC dates under the hood).

Return type:

None
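For example (a sketch; the explicit as_of_tz only matters for AS_OF versioned sets, with datetime and timezone imported as in the initialisation example above):

>>> x.save()                                                      # keep the current as_of
>>> x.save(as_of_tz=datetime(2024, 7, 1, tzinfo=timezone.utc))    # create a new version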

property series: list[str]

Get series names.

series_names_to_tags(attributes=None, separator='', regex='')

Tag all series in the dataset based on ‘attributes’, a list of attributes matching positions in the series names when split on ‘separator’.

Alternatively, a regular expression with groups that match the attributes may be provided. Ideally the attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.

Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.

Example

Dependencies

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df

Tag using name_pattern

If all series names follow a uniform pattern where attribute values are separated by the same character sequence:

>>> some_data = create_df(["x_a", "y_b", "z_c"], start_date="2024-01-01", end_date="2024-12-31", freq="MS",)
>>> x = Dataset(name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data)
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'])

Tag by regex

If series names are less well formed, a regular expression with groups matching the attribute list can be provided instead of the separator parameter.

>>> more_data = create_df(["x_1,,a", "y...b..", "z..1.1-23..c"], start_date="2024-01-01", end_date="2024-12-31", freq="MS")
>>> x = Dataset(name="sample_set",data_type=SeriesType.simple(),data=more_data,)
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'], regex=r'([a-z])*([a-z])')

The above approach may be used to add tags for an existing dataset, but the same arguments can also be provided when initialising the set:

>>> z = Dataset(name="sample_set",
>>>     data_type=SeriesType.simple(),
>>>     data=some_data,
>>>     name_pattern=['XYZ', 'ABC'])

Best practice is to do this only in the process that writes data to the set. For a fixed set of series, the tagging does not need to be repeated.

If, on the other hand, the number of series can change over time, tagging at the time of writing ensures that all series are tagged.

Parameters:
  • attributes (list[str] | None)

  • separator (str)

  • regex (str)

Return type:

None

property series_tags: dict[str, dict[str, str | list[str]]]

Get series tags.

snapshot(as_of_tz=None)

Copy data snapshot to immutable processing stage bucket and shared buckets.

Parameters:

as_of_tz (datetime) – Optional. Provide a timezone sensitive as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (utc dates under the hood).

Return type:

None

tag_dataset(tags=None, **kwargs)

Tag the set.

Tags may be provided as a dictionary of tags, or as kwargs.

In both cases they take the form of attribute-value pairs.

Attribute (str): Attribute identifier. Ideally the attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.

Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.

Return type:

None

Parameters:
  • tags (dict[str, str | list[str]])

  • kwargs (str | list[str] | set[str])

Examples

Dependencies

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> x = Dataset(name='sample_dataset',
>>>         data_type=SeriesType.simple(),
>>>         data=create_df(['x','y','z'],
>>>             start_date='2024-01-01',
>>>             end_date='2024-12-31',
>>>             freq='MS',)
>>> )
>>>
>>> x.tag_dataset(tags={'country': 'Norway', 'about': 'something_important'})
>>> x.tag_dataset(another_attribute='another_value')

Note that while no such restrictions are enforced, it is strongly recommended that both attribute names (keys) and values are standardised. The best way to ensure that is to use taxonomies (for SSB: KLASS code lists). However, custom controlled vocabularies can also be maintained in files.

tag_series(identifiers=None, name_pattern=None, separator='_', tags=None, **kwargs)

Tag the series identified by identifiers with provided tags.

Tags may be provided as a dictionary of tags, or as kwargs.

In both cases they take the form of attribute-value pairs.

Attribute (str): Attribute identifier. Ideally the attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.

Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.

If series names follow the same pattern of attribute values, in the same order and separated by the same character sequence, tags can be propagated accordingly by specifying the name_pattern and separator parameters. The separator defaults to underscore if not provided. Note that propagation by pattern will affect all series in the set, not only the ones identified by identifiers.

Return type:

None

Parameters:
  • identifiers (str | list[str] | None)

  • name_pattern (list[str] | None)

  • separator (str)

  • tags (dict[str, str | list[str]])

  • kwargs (str | list[str])

Examples

Dependencies

>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> some_data = create_df(['x', 'y', 'z'], start_date='2024-01-01', end_date='2024-12-31', freq='MS')

Tag by kwargs

>>> x = Dataset(name='sample_set',data_type=SeriesType.simple(),data=some_data)
>>> x.tag_series(example_1='string_1', example_2=['a', 'b', 'c'])

Tag by dict

>>> x = Dataset(name='sample_set',data_type=SeriesType.simple(),data=some_data)
>>> x.tag_series(tags={'example_1': 'string_1', 'example_2': ['a', 'b', 'c']})

vectors(pattern='')

Get vectors with names equal to column names from Dataset.data.

Parameters:

pattern (str) – Optional pattern for simple filtering of column names containing pattern. Defaults to ‘’.

Return type:

None

Warning

Caution! This (re)assigns variables in the scope of the calling function by way of stack inspection, and hence risks reassigning objects, functions, or variables if they happen to have the same name.
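Given the warning above, use with care. A sketch, assuming series named as in the name-pattern examples earlier in this section:

>>> x.vectors()        # assigns x_a, y_b, z_c as variables in the calling scope
>>> x.vectors('x_')    # only columns whose names contain 'x_'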

versions(**kwargs)

Get list of all series version markers (as_of dates or version names).

By default, as_of dates are returned in the local timezone. Provide return_type='utc' to return them in UTC, or 'raw' to return them as-is.

Return type:

list[datetime | str]

Parameters:

kwargs (Any)
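For example (a sketch, using the return_type keyword described above):

>>> x.versions()                     # as_of dates in the local timezone
>>> x.versions(return_type='utc')    # as_of dates in UTC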

class IO(*args, **kwargs)

Bases: Protocol

Interface for IO operations.

save()

Save the dataset.

Return type:

None

snapshot()

Save a snapshot of the dataset.

Return type:

None

Search across datasets by tag patterns.

Return type:

list[SearchResult] | Dataset | list[None]

Parameters:
  • pattern (dict[str, str | list[str]])

  • as_of_tz (datetime)

  • object_type (str | list[str])

column_aggregate(df, method)

Helper function to calculate an aggregate over dataframe columns.

Return type:

Series | Any

Parameters:
  • df (DataFrame)

  • method (str | Callable)

search(pattern='*', as_of_tz=None)

Search for datasets by name matching pattern.

Return type:

list[SearchResult] | Dataset | list[None]

Parameters:
  • pattern (str)

  • as_of_tz (datetime)