ssb_timeseries.dataset
¶
The ssb_timeseries.dataset module and its Dataset class are the very core of the ssb_timeseries package, defining most of the key functionality.
The dataset is the unit of analysis for both the information model and workflow integration, and performance benefits from linear algebra treating sets as matrices of series column vectors.
As described in the Information model, time series datasets may consist of any number of series of the same type. Series types are defined by properties.Versioning and properties.Temporality; see properties.SeriesType.
It is also strongly encouraged to make sure that the resolutions of the series in a dataset are the same, and to minimize the number of gaps in the series. Very sparse data is a strong indication that a dataset is not well defined; it may indicate that the series in the set have different origins. What counts as 'gaps' in this context is any representation of undefined values: None, null, NaN or "not a number" values, as opposed to the number zero. The number zero is a gray area: it can be perfectly valid, but it can also be an indication that not all the series should be part of the same set.
See also
See the documentation for the ssb_timeseries.catalog module for tools to search for datasets or series by name or metadata.
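A minimal sketch of the basic workflow, using the Dataset, SeriesType and create_df helpers that appear in the examples further down; the set and series names are illustrative:
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> # Illustrative names; three series of the same type and resolution in one set
>>> df = create_df(['p', 'q', 'r'], start_date='2024-01-01', end_date='2024-12-31', freq='MS')
>>> x = Dataset(name='intro_example_set', data_type=SeriesType.simple(), data=df)
>>> x.save()  # data is kept in memory until an explicit save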
- class Dataset(name, data_type=None, as_of_tz=None, load_data=True, **kwargs)¶
Bases:
object
Datasets are the core unit of analysis for workflow and data storage.
A dataset is a logical collection of data and metadata stemming from the same process origin. All the series in a dataset must be of the same type.
- The type defines
versioning (NONE, AS_OF, NAMED)
temporality (Valid AT point in time, or FROM and TO for duration)
value (for now only scalars)
Initialising a dataset object can either retrieve data and metadata for an existing set or prepare a new one.
If data_type versioning is specified as AS_OF, a datetime with timezone should be provided. If none is given but data is passed, as_of_tz() defaults to the current time. Providing an AS_OF date has no effect if versioning is NONE.
When loading existing sets, load_data=False can be set in order to suppress reading large amounts of data. For data types with AS_OF versioning, not providing the AS_OF date has the same effect.
Metadata will always be read. Data is represented as Pandas dataframes for now, but Polars lazyframes or Pyarrow tables are likely to be better. The initial implementation stores data in Parquet files; Feather files and various database options are considered for later.
Support for additional "type" features/flag behaviours, like sparse data, may be added later (if needed).
Data is kept in memory and not stored before an explicit call to .save.
- Parameters:
name (str)
data_type (SeriesType)
as_of_tz (datetime)
load_data (bool)
kwargs (Any)
- __add__(other)¶
Add two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
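Example
A short sketch of set-level arithmetic, assuming a dataset x built as in the examples further down; the names of the resulting sets are generated by the library:
>>> y = x * 100        # scale every series by a scalar
>>> z = x - x          # subtract two datasets of the same shape
>>> w = (x + 1) / 2    # operators can be chained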
- __eq__(other)¶
Check equality of two datasets or a dataset and a dataframe, numpy array or scalar.
- Parameters:
other (Self | DataFrame | Series | int | float)
- Return type:
Any
- __floordiv__(other)¶
Floor divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __getitem__(criteria='', **kwargs)¶
Access Dataset.data.columns via Dataset[list[column_names] | pattern | tags].
- Parameters:
criteria – Either a string pattern or a dict of tags.
kwargs – If criteria is empty, this is passed to filter().
- Returns:
Self | None
- Raises:
TypeError – If filter() returns another type than Dataset.
- Return type:
Self | None
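Example
A sketch of item access, assuming a dataset x with series names and tags as in the examples further down:
>>> subset = x['x']                        # column names matching the pattern 'x'
>>> tagged = x[{'example_1': 'string_1'}]  # columns matching a tag dictionary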
- __gt__(other)¶
Check greater than for two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __init__(name, data_type=None, as_of_tz=None, load_data=True, **kwargs)¶
Initialising a dataset object can either retrieve data and metadata for an existing set or prepare a new one.
If data_type versioning is specified as AS_OF, a datetime with timezone should be provided. If none is given but data is passed, as_of_tz() defaults to the current time. Providing an AS_OF date has no effect if versioning is NONE.
When loading existing sets, load_data=False can be set in order to suppress reading large amounts of data. For data types with AS_OF versioning, not providing the AS_OF date has the same effect.
Metadata will always be read. Data is represented as Pandas dataframes for now, but Polars lazyframes or Pyarrow tables are likely to be better. The initial implementation stores data in Parquet files; Feather files and various database options are considered for later.
Support for additional "type" features/flag behaviours, like sparse data, may be added later (if needed).
Data is kept in memory and not stored before an explicit call to .save.
- Parameters:
name (str)
data_type (SeriesType)
as_of_tz (datetime)
load_data (bool)
kwargs (Any)
- Return type:
None
- __lt__(other)¶
Check less than for two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __mod__(other)¶
Modulo of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __mul__(other)¶
Multiply two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __pow__(other)¶
Power of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __radd__(other)¶
Right add two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __repr__()¶
Returns a machine-readable string representation of the Dataset, ideally sufficient to recreate the object.
- Return type:
str
- __rfloordiv__(other)¶
Right floor divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rmod__(other)¶
Right modulo of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rmul__(other)¶
Right multiply two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rpow__(other)¶
Right power of two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rsub__(other)¶
Right subtract two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __rtruediv__(other)¶
Right divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __str__()¶
Returns a human-readable string representation of the Dataset.
- Return type:
str
- __sub__(other)¶
Subtract two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- __truediv__(other)¶
Divide two datasets or a dataset and a dataframe, numpy array or scalar.
- Return type:
Any
- Parameters:
other (Self | DataFrame | Series | int | float)
- aggregate(attributes, taxonomies, functions, sep='_')¶
Aggregate dataset by taxonomy hierarchies.
- Parameters:
attributes (list[str]) – The attributes to aggregate by.
taxonomies (list[int | meta.Taxonomy | dict[str, str] | PathStr]) – Value definitions for the attributes. These can be either meta.Taxonomy objects, or klass_ids, data dictionaries or paths that can be used to retrieve or construct them.
functions (list[str|F] | set[str|F]) – Optional name (or list of names) of the function(s) to apply (mean | count | sum | …). Defaults to sum.
sep (str) – Optional separator used when joining multiple attributes into names of aggregated series. Defaults to ‘_’.
- Returns:
A dataset object with the aggregated data. If the taxonomy object has hierarchical structure, aggregate series are calculated for parent nodes at all levels. If the taxonomy is a flat list, only a single total aggregate series is calculated.
- Return type:
Self
- Raises:
TypeError – If any of the taxonomy identifiers are of unexpected types.
Examples
To calculate the 10th and 90th percentiles and the median for the dataset, where codes from KLASS 157 (energy_balance) distinguish between the series in the set:
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>> from ssb_timeseries.meta import Taxonomy
>>>
>>> klass157 = Taxonomy(klass_id=157)
>>> klass157_leaves = [n.name for n in klass157.structure.root.leaves]
>>> tag_permutation_space = {"A": klass157_leaves, "B": ["q"], "C": ["z"]}
>>> series_names: list[list[str]] = [value for value in tag_permutation_space.values()]
>>> sample_df = create_df(*series_names, start_date="2024-01-01", end_date="2024-12-31", freq="MS")
>>> sample_set = Dataset(
...     name="sample_set",
...     data_type=SeriesType.simple(),
...     data=sample_df,
...     name_pattern=["A", "B", "C"],
... )
>>>
>>> def perc10(x):
...     return x.quantile(0.1, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> def perc90(x):
...     return x.quantile(0.9, axis=1, numeric_only=True, interpolation="linear")
>>>
>>> percentiles = sample_set.aggregate(["energy_balance"], [157], [perc10, "median", perc90])
- all()¶
Check if all values in series columns evaluate to true.
- Return type:
bool
- any()¶
Check if any values in series columns evaluate to true.
- Return type:
bool
- copy(new_name, **kwargs)¶
Create a copy of the Dataset.
The copy needs to get a new name, but unless other information is specified, it will be created with the same data_type, as_of_tz, data, and tags.
- Return type:
Self
- Parameters:
new_name (str)
kwargs (Any)
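Example
A minimal sketch, assuming an existing dataset x; the new name is illustrative:
>>> x_copy = x.copy('sample_set_copy')  # same data_type, as_of_tz, data and tags, new name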
- datetime_columns(*comparisons)¶
Get names of datetime columns (valid_at, valid_from, valid_to).
- Parameters:
*comparisons (Self | pd.DataFrame) – Objects to compare with. If provided, returns the intersection of self and all comparisons.
- Returns:
The (common) datetime column names of self (and comparisons).
- Return type:
list[str]
- Raises:
ValueError – If comparisons are not of type Self or pd.DataFrame.
- default_tags()¶
Return default tags for set and series.
- Return type:
dict[str, dict[str, str | list[str]] | dict[str, dict[str, str | list[str]]]]
- detag_dataset(*args, **kwargs)¶
Detag selected attributes of the set.
Tags to be removed may be provided as a list of attribute names or as kwargs with attribute-value pairs.
- Return type:
None
- Parameters:
args (str)
kwargs (Any)
- detag_series(*args, **kwargs)¶
Detag selected attributes of series in the set.
Tags to be removed may be specified by args or kwargs. Attributes listed in args will be removed from all series.
For kwargs, attributes will be removed from a series if the value matches exactly. If the value is a list, the matching value is removed. If kwargs contain all=True, all attributes except the defaults are removed.
- Parameters:
args (str)
kwargs (Any)
- Return type:
None
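Example
A sketch of detagging, assuming series tagged as in the tag_series examples further down; the attribute names are illustrative:
>>> x.detag_series('example_1')    # remove the attribute from all series in the set
>>> x.detag_series(example_2='b')  # remove the value 'b' where it matches a list-valued attribute
>>> x.detag_dataset('about')       # remove a set-level attribute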
- filter(pattern='', tags=None, regex='', output='dataset', new_name='', **kwargs)¶
Filter dataset.data by a text pattern, a regex, a metadata tag dictionary, or any combination of these.
- Parameters:
pattern (str) – Text pattern for search ‘like’ in column names. Defaults to ‘’.
regex (str) – Expression for regex search in column names. Defaults to ‘’.
tags (dict | list(dict)) – Dictionary with tags to search for. Defaults to None. All tags in the dict must be satisfied for the same series (tags are combined by AND). If a list of values is provided for a tag, the criterion is satisfied by any of them (OR). Support for list(dict), to satisfy alternative sets of criteria (the dicts combined by OR), is planned but not yet implemented.
output (str) – Output type: 'dataset' or 'dataframe'. Defaults to 'dataset'. The short forms 'ds' and 'df' are accepted.
new_name (str) – Name of the new Dataset. If not provided, a new name is generated.
**kwargs – If provided, passed into the init of the new set.
- Returns:
By default a new Dataset (a deep copy of self). If output='dataframe' or 'df', a dataframe. TODO: Explore shallow copy / no-copy options.
- Return type:
Dataset | Dataframe
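Example
A sketch combining the filtering options, assuming a dataset x with series named and tagged as in the examples further down:
>>> by_pattern = x.filter(pattern='x')                  # 'like' search in column names
>>> by_regex = x.filter(regex='^x', output='df')        # regex search, returned as a dataframe
>>> by_tags = x.filter(tags={'example_2': ['a', 'b']})  # list values are combined by OR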
- groupby(freq, func='auto', *args, **kwargs)¶
Group dataset data by specified frequency and function.
Returns a new Dataset.
- Return type:
Self
- Parameters:
freq (str)
func (str)
args (Any)
kwargs (Any)
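Example
A minimal sketch, assuming a monthly dataset x as in the examples further down; the frequency strings follow pandas conventions, and the function names are assumptions following the mean | count | sum convention listed under aggregate():
>>> quarterly = x.groupby('QS', 'sum')  # group monthly data by quarter start and sum
>>> yearly = x.groupby('YE', 'mean')    # group by year end and average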
- math(other, func)¶
Generic helper making math functions work on the numeric, non-date columns: dataframe to dataframe, matrix to matrix, matrix to vector and matrix to scalar.
Although the purpose was to limit "boilerplate" for core linear algebra functions, it also extends to other operations that follow the same differentiation pattern.
- Parameters:
other (dataframe | series | matrix | vector | scalar) – One (or more?) pandas (Polars to come) dataframe or series, numpy matrix or vector, or a scalar value.
func (_type_) – The function to be applied, as self.func(other) or (in some cases) with infix notation self f other. Note that the date columns of the self / lefthand side argument are preserved, i.e. data shifting operations are not supported.
- Raises:
ValueError – “Unsupported operand type”
ValueError – “Incompatible shapes.”
- Returns:
Depending on the inputs: A new dataset / vector / scalar with the result. For datasets, the name of the new set is derived from inputs and the functions applied.
- Return type:
Any
- numeric_columns()¶
Get the names of all numeric series columns (i.e. columns that are not datetime).
- Return type:
list[str]
- plot(*args, **kwargs)¶
Plot dataset data.
Convenience wrapper around Dataframe.plot() with sensible defaults.
- Return type:
Any
- Parameters:
args (Any)
kwargs (Any)
- rename(new_name)¶
Rename the Dataset.
For use by .copy, and on very rare other occasions. Does not move or rename any previously stored data.
- Return type:
None
- Parameters:
new_name (str)
- replace_tags(*args)¶
Retag selected attributes of series in the set.
- Return type:
None
- Parameters:
args (tuple[dict[str, str | list[str]], dict[str, str | list[str]]])
- The tags to be replaced and their replacements should be specified as tuple(s) of dictionaries (old_tags, new_tags). Both can contain multiple tags.
Each tuple is evaluated independently for each series in the set.
If the tag dict to be replaced contains multiple tags, all must match for the tags to be replaced.
If the new tag dict contains multiple tags, all are added where there is a match.
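Example
A sketch of retagging, assuming series tagged as in the tag_series examples further down; each tuple pairs the tags to match with their replacements:
>>> x.replace_tags(({'example_1': 'string_1'}, {'example_1': 'string_2'}))
>>> # Several tuples are evaluated independently for each series:
>>> x.replace_tags(
...     ({'example_2': 'a'}, {'example_2': 'alpha'}),
...     ({'example_2': 'b'}, {'example_2': 'beta'}),
... )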
- resample(freq, func, *args, **kwargs)¶
Alter frequency of dataset data.
- Return type:
Self
- Parameters:
freq (str)
func (Callable | str)
args (Any)
kwargs (Any)
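Example
A minimal sketch, assuming a monthly dataset x; as in groupby(), func accepts a callable or a string name (the names used here are assumptions following the mean | count | sum convention under aggregate()):
>>> quarterly = x.resample('QS', 'sum')  # downsample months to quarters
>>> yearly = x.resample('YE', 'mean')    # downsample months to years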
- save(as_of_tz=None)¶
Persist the Dataset.
- Parameters:
as_of_tz (datetime) – Provide a timezone-aware as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (UTC dates under the hood).
- Return type:
None
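Example
A minimal sketch of versioned saving, assuming a dataset x with AS_OF versioning; the date is illustrative and must be timezone-aware, as stated above:
>>> from datetime import datetime, timezone
>>> x.save()  # saves with Dataset.as_of_utc
>>> x.save(as_of_tz=datetime(2024, 7, 1, tzinfo=timezone.utc))  # creates another version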
- property series: list[str]¶
Get series names.
- series_names_to_tags(attributes=None, separator='', regex='')¶
Tag all series in the dataset based on a list of 'attributes', i.e. attributes matching positions in the series names when split on 'separator'.
Alternatively, a regular expression with groups that match the attributes may be provided. Ideally the attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.
Example
Dependencies
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
Tag using name_pattern
If all series names follow a uniform pattern where attribute values are separated by the same character sequence:
>>> some_data = create_df(["x_a", "y_b", "z_c"], start_date="2024-01-01", end_date="2024-12-31", freq="MS")
>>> x = Dataset(
...     name="sample_set",
...     data_type=SeriesType.simple(),
...     data=some_data,
... )
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'])
Tag by regex
If series names are less well formed, a regular expression with groups matching the attribute list can be provided instead of the separator parameter.
>>> more_data = create_df(["x_1,,a", "y...b..", "z..1.1-23..c"], start_date="2024-01-01", end_date="2024-12-31", freq="MS")
>>> x = Dataset(name="sample_set", data_type=SeriesType.simple(), data=more_data)
>>> x.series_names_to_tags(attributes=['XYZ', 'ABC'], regex=r'([a-z])*([a-z])')
The above approach may be used to add tags for an existing dataset, but the same arguments can also be provided when initialising the set:
>>> z = Dataset(
...     name="sample_set",
...     data_type=SeriesType.simple(),
...     data=some_data,
...     name_pattern=['XYZ', 'ABC'],
... )
Best practice is to do this only in the process that writes data to the set. For a finite number of series, it does not need to be repeated.
If, on the other hand, the number of series can change over time, doing so at the time of writing ensures all series are tagged.
- Parameters:
attributes (list[str] | None)
separator (str)
regex (str)
- Return type:
None
- property series_tags: dict[str, dict[str, str | list[str]]]¶
Get series tags.
- snapshot(as_of_tz=None)¶
Copy data snapshot to immutable processing stage bucket and shared buckets.
- Parameters:
as_of_tz (datetime) – Optional. Provide a timezone-aware as_of date in order to create another version. The default is None, which will save with Dataset.as_of_utc (UTC dates under the hood).
- Return type:
None
- tag_dataset(tags=None, **kwargs)¶
Tag the set.
Tags may be provided as dictionary of tags, or as kwargs.
In both cases they take the form of attribute-value pairs.
Attribute (str): Attribute identifier. Ideally, attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.
Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.
- Return type:
None
- Parameters:
tags (dict[str, str | list[str]])
kwargs (str | list[str] | set[str])
Examples
Dependencies
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> x = Dataset(
...     name='sample_dataset',
...     data_type=SeriesType.simple(),
...     data=create_df(
...         ['x', 'y', 'z'],
...         start_date='2024-01-01',
...         end_date='2024-12-31',
...         freq='MS',
...     ),
... )
>>>
>>> x.tag_dataset(tags={'country': 'Norway', 'about': 'something_important'})
>>> x.tag_dataset(another_attribute='another_value')
Note that while no such restrictions are enforced, it is strongly recommended that both attribute names (keys) and values are standardised. The best way to ensure that is to use taxonomies (for SSB: KLASS code lists). However, custom controlled vocabularies can also be maintained in files.
- tag_series(identifiers=None, name_pattern=None, separator='_', tags=None, **kwargs)¶
Tag the series identified by identifiers with the provided tags.
Tags may be provided as a dictionary of tags or as kwargs.
In both cases they take the form of attribute-value pairs.
Attribute (str): Attribute identifier. Ideally, attributes rely on KLASS, i.e. a KLASS taxonomy defines the possible attribute values.
Value (str): Element identifier, unique within the taxonomy. Ideally a KLASS code.
If series names follow the same pattern of attribute values in the same order, separated by the same character sequence, tags can be propagated accordingly by specifying the name_pattern and separator parameters. The separator defaults to underscore if not provided. Note that propagation by pattern will affect all series in the set, not only the ones identified by identifiers.
- Return type:
None
- Parameters:
identifiers (str | list[str] | None)
name_pattern (list[str] | None)
separator (str)
tags (dict[str, str | list[str]])
kwargs (str | list[str])
Examples
Dependencies
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> some_data = create_df(['x', 'y', 'z'], start_date='2024-01-01', end_date='2024-12-31', freq='MS')
Tag by kwargs
>>> x = Dataset(name='sample_set', data_type=SeriesType.simple(), data=some_data)
>>> x.tag_series(example_1='string_1', example_2=['a', 'b', 'c'])
Tag by dict
>>> x = Dataset(name='sample_set', data_type=SeriesType.simple(), data=some_data)
>>> x.tag_series(tags={'example_1': 'string_1', 'example_2': ['a', 'b', 'c']})
- vectors(pattern='')¶
Get vectors with names equal to column names from Dataset.data.
- Parameters:
pattern (str) – Optional pattern for simple filtering of column names containing pattern. Defaults to ‘’.
- Return type:
None
Warning
Caution! This (re)assigns variables in the scope of the calling function by way of stack inspection, and hence risks reassigning objects, functions, or variables if they happen to have the same name.
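Example
A sketch illustrating both the convenience and the risk described in the warning, using the helpers from the examples above; note that any existing variables named x, y or z in the calling scope would be overwritten:
>>> from ssb_timeseries.dataset import Dataset
>>> from ssb_timeseries.properties import SeriesType
>>> from ssb_timeseries.sample_data import create_df
>>>
>>> ds = Dataset(name='sample_set', data_type=SeriesType.simple(),
...              data=create_df(['x', 'y', 'z'], start_date='2024-01-01', end_date='2024-12-31', freq='MS'))
>>> ds.vectors()       # (re)assigns x, y and z in the calling scope
>>> total = x + y + z  # the injected series can be used directly
>>> ds.vectors('z')    # only columns whose names contain 'z'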
- versions(**kwargs)¶
Get list of all series version markers (as_of dates or version names).
By default, as_of dates will be returned in the local timezone. Provide return_type='utc' to return them in UTC, or 'raw' to return them as-is.
- Return type:
list[datetime | str]
- Parameters:
kwargs (Any)
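Example
A minimal sketch, assuming a set saved with AS_OF versioning as above; return_type is the kwarg described in the docstring:
>>> x.versions()                   # as_of dates in the local timezone
>>> x.versions(return_type='utc')  # the same dates in UTC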
- class IO(*args, **kwargs)¶
Bases:
Protocol
Interface for IO operations.
- save()¶
Save the dataset.
- Return type:
None
- snapshot()¶
Save a snapshot of the dataset.
- Return type:
None
- catalog_search(pattern, as_of_tz=None, object_type='dataset')¶
Search across datasets by tags pattern.
- Return type:
list[SearchResult] | Dataset | list[None]
- Parameters:
pattern (dict[str, str | list[str]])
as_of_tz (datetime)
object_type (str | list[str])
- column_aggregate(df, method)¶
Helper function to calculate an aggregate over dataframe columns.
- Return type:
Series | Any
- Parameters:
df (DataFrame)
method (str | Callable)
- search(pattern='*', as_of_tz=None)¶
Search for datasets by name matching pattern.
- Return type:
list[SearchResult] | Dataset | list[None]
- Parameters:
pattern (str)
as_of_tz (datetime)
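Example
A minimal sketch of the module-level search helpers; the patterns are illustrative:
>>> from ssb_timeseries import dataset
>>>
>>> by_name = dataset.search(pattern='sample*')              # match dataset names against a pattern
>>> by_tags = dataset.catalog_search({'country': 'Norway'})  # match datasets by tag pattern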