SSB Timeseries


Statistics Norway is the national statistical agency of Norway. We collect, produce and communicate official statistics related to the economy, population and society at national, regional and local levels and conduct extensive research and analysis activities.

Time series are an integral part of statistics production.

The requirements for a complete time series system are diverse:

At the core is storage with performant read and write, search and filtering of time series data
A wide selection of math and statistics methods and libraries for calculations and models
Versioning is essential in tracking development of estimates and previously published data
Descriptive metadata is key to findability and understanding
Workflow integration with automation and process monitoring makes life easier, but also ensures consistency and quality
Visualisations are important not only to present the final results, but also for routine and ad hoc inspection and quality control
Data lineage and process metadata are instrumental for quality control, but also provide the transparency required for building trust

By design, ssb-timeseries is not a complete solution. It is a glue layer that sits between the overall data platform and the statistics production code. Its main responsibility is to interface with other systems and components and enforce consistency with our process and information models.

It provides some functionality of its own, but mainly useful abstractions and convenience methods that reduce boilerplate and bring several best-of-breed technologies for data analysis together in a single package.

Openness is a goal in itself, both in terms of transparency and licensing, and in providing compatibility with relevant frameworks and standards.

The workflow of Statistics Norway is mostly batch oriented. Consequently, the ssb-timeseries library is dataset-centric:

One or more series form a set.
All series in a set must be of the same type.
A set should typically be read and written at the same time.
All series in a set should stem from the same process.
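The rules above can be illustrated with a minimal sketch. It is not the library's implementation: the `SeriesSet` class, its fields and the type labels `"estimate"` and `"forecast"` are hypothetical names chosen for this example, and what "type" means in practice is defined by the library's information model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SeriesSet:
    """Hypothetical container: one type and one process per set."""

    name: str
    series_type: str  # assumed label; shared by every series in the set
    process: str      # assumed label; the process the series stem from
    series: Dict[str, List[Optional[float]]] = field(default_factory=dict)

    def add(self, series_name: str, values, series_type: str) -> None:
        # Reject series whose type differs from the type of the set.
        if series_type != self.series_type:
            raise ValueError(
                f"{series_name!r} has type {series_type!r}, "
                f"but set {self.name!r} holds {self.series_type!r}"
            )
        self.series[series_name] = list(values)


demo = SeriesSet(name="demo_set", series_type="estimate", process="demo_process")
demo.add("x", [1.0, 2.0, None], series_type="estimate")  # None marks a missing value

try:
    demo.add("y", [1.0, 2.0, 3.0], series_type="forecast")
    rejected = False
except ValueError:
    rejected = True  # a series of another type is not accepted into the set
```

Checking the type when a series is added keeps every later operation on the set free of per-series special cases.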

All series in a set being of the same type and otherwise complying with the underlying information model simplifies the implementation of storage, descriptive metadata and search and enables key calculation features:

Since each series is represented as a column vector in a dataset matrix, linear algebra is readily available. Datasets can be added, subtracted, multiplied and divided with each other and with dataframes, matrices, vectors (untested) and scalars according to the normal rules.
Time algebra features allow up- and downsampling that make use of the date columns. Basic time aggregation: Dataset.groupby('quarter', 'sum'|'mean'|'auto')
Metadata calculations use the descriptions of the individual series for calculations ranging from simple things like unit conversions to using relations between entities in tag values to group series for aggregation.
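The three kinds of calculation can be sketched with plain pandas dataframes standing in for datasets. This is an illustration of the ideas only and does not use the ssb-timeseries API; the series names, values and the `region` tags are made up for the example.

```python
import pandas as pd

# A stand-in "dataset": one column per series, one row per month.
index = pd.date_range("2024-01-01", periods=6, freq="MS")
a = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "z": [100.0] * 6,
    },
    index=index,
)
b = pd.DataFrame(1.0, index=index, columns=a.columns)

# Linear algebra: element-wise arithmetic with another dataset or a scalar.
total = a + b
scaled = a * 1000

# Time algebra: aggregate monthly observations to quarters.
quarters = a.index.to_period("Q")
quarterly_sum = a.groupby(quarters).sum()
quarterly_mean = a.groupby(quarters).mean()

# Metadata calculation: a (hypothetical) tag value per series decides
# which series are summed together.
region = {"x": "north", "y": "north", "z": "south"}
by_region = a.T.groupby(region).sum().T
```

Because all series in a set share the same shape and date columns, each of these operations applies to the whole set at once rather than series by series.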

Nulls are allowed, so it is not a strict requirement that all series in a set are written at the same time, but managing workflows becomes much simpler if they are.

The io module connects the dataset to helper classes that take care of reading and writing data. This structure abstracts away the IO mechanics, so that the user does not need to know about implementation details, only what the choices made mean in terms of the information model. Also, although the current implementation uses pyarrow and parquet data structures under the hood, a database could be used instead by replacing the io module.
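The idea of a replaceable IO layer can be sketched as follows. This is a simplified illustration, not the actual io module: the `DatasetIO` interface, the two backends and the `round_trip` helper are hypothetical, and JSON files stand in for parquet so that the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol

Data = Dict[str, List[float]]  # series name -> values (simplified)


class DatasetIO(Protocol):
    """Hypothetical interface that every storage backend must satisfy."""

    def write(self, set_name: str, data: Data) -> None: ...
    def read(self, set_name: str) -> Data: ...


class InMemoryIO:
    """Keeps sets in a dictionary; useful for tests."""

    def __init__(self) -> None:
        self._store: Dict[str, Data] = {}

    def write(self, set_name: str, data: Data) -> None:
        self._store[set_name] = {k: list(v) for k, v in data.items()}

    def read(self, set_name: str) -> Data:
        return {k: list(v) for k, v in self._store[set_name].items()}


class JsonFileIO:
    """Stores one file per set; a parquet or database backend would
    expose the same two methods."""

    def __init__(self, directory: Path) -> None:
        self._directory = Path(directory)

    def write(self, set_name: str, data: Data) -> None:
        (self._directory / f"{set_name}.json").write_text(json.dumps(data))

    def read(self, set_name: str) -> Data:
        return json.loads((self._directory / f"{set_name}.json").read_text())


def round_trip(io: DatasetIO, set_name: str, data: Data) -> Data:
    # Calling code depends only on the interface, not on the backend.
    io.write(set_name, data)
    return io.read(set_name)


example = {"x": [1.0, 2.0], "y": [3.0, 4.0]}
from_memory = round_trip(InMemoryIO(), "demo_set", example)
with tempfile.TemporaryDirectory() as tmp:
    from_file = round_trip(JsonFileIO(Path(tmp)), "demo_set", example)
```

Both backends return the same data for the same call, which is what allows the storage technology to change without touching the statistics production code.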

The data are highly varied, and while volumes are substantial, they are not enormous. Data resolution and publishing frequencies are typically low: monthly, quarterly and yearly are the most common.

Quality and reliability are far more important than latency. Our mission comes with strict requirements for transparency and data quality. Some are mandated by law, others stem from commitment to international standards and best practices. This shifts the focus towards process and data control.