Architecture¶
This document provides a high-level overview of the ssb-timeseries library’s internal architecture and its core design principles.
Core User Interfaces¶
A user primarily interacts with the library through two main entry points, designed to separate working with data from finding data.
- The Dataset class: This is the central point of interaction. It represents a single time series dataset, bringing together its data and metadata. It provides a rich API for I/O, calculations, and data manipulation.
- The Catalog module: This provides functions for discovering and searching for datasets across all configured storage locations (repositories).
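A minimal sketch of this division of labour follows. The import paths and call signatures below are illustrative assumptions, not a verbatim API reference:

```python
from ssb_timeseries.dataset import Dataset  # working with data
from ssb_timeseries import catalog          # finding data

# Hypothetical calls; the function name and arguments are assumptions.
hits = catalog.search("commodity*")        # discover matching datasets
prices = Dataset(name="commodity_prices")  # open a single dataset
scaled = prices * 100                      # rich API for calculations
```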
Key Helper Modules¶
While Dataset and Catalog are the main interfaces, several helper modules provide the foundation for the library’s flexibility and robustness.
- config: This module is the library’s central nervous system. It loads all environment-specific settings from a JSON file, defining where data is stored and which backend handlers to use (a hypothetical example is sketched after this list). This decouples the library’s logic from hardcoded paths and storage implementations.
- meta: This module manages structured metadata. Beyond simple key-value tags, it handles taxonomies (like those from SSB’s KLASS) and provides the logic for advanced, metadata-driven operations like hierarchical aggregation.
- io: This module acts as the single gateway for all storage operations. As a strict facade, it ensures that all parts of the library read and write data through a consistent, high-level API.
- dates: This module provides utility functions for standardizing all time-related operations, ensuring consistent handling of timezones, frequencies, and formats throughout the library.
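To make the config module’s role concrete, the snippet below sketches the general shape of such an environment configuration. Every key and value here is invented for illustration; the authoritative schema is described in the Configure I/O guide.

```python
# Invented example of the kind of settings the config module loads
# from JSON; none of these keys reflect the library's actual schema.
example_config = {
    "repositories": {
        "prod": {
            "directory": "/data/timeseries",  # where data is stored
            "handler": "local_filesystem",    # which backend handler to use
        },
        "shared": {
            "directory": "gs://example-bucket/series",
            "handler": "cloud_bucket",
        },
    },
}
```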
Data Handling: The Interoperable Data Model¶
The library’s approach to data handling is guided by a core conceptual model that directly influences its choice of technologies and commitment to interoperability.
The Concept: Datasets as Matrices¶
A key design feature is the interpretation of datasets as mathematical matrices of series vectors, all aligned by a common date axis. The library aims to provide easy and intuitive use of linear algebra for calculating derived data.
To accomplish this, the basic data structure is a table where each time series is a column vector, and the Dataset object itself exposes a rich set of mathematical operations (e.g., +, -, *, /).
This allows for natural, expressive code, such as new_dataset = (dataset_a + dataset_b) / 2.
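Because the four basic operators are all overloaded, derived datasets follow directly, assuming dataset_a and dataset_b are Dataset instances aligned on a common date axis:

```python
# Element-wise dataset algebra on aligned series matrices.
average = (dataset_a + dataset_b) / 2  # mean of two datasets
spread = dataset_a - dataset_b         # difference between them
percent = dataset_a * 100              # scaling by a scalar
```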
The Implementation: An Opinionated, High-Performance Stack¶
This conceptual model naturally leads to an opinionated selection of high-performance, column-oriented technologies:
- Apache Parquet is the standard for permanent storage. Its columnar format is highly efficient for the analytical queries typical in time series analysis.
- Apache Arrow is the preferred format for in-memory data. Its columnar layout and zero-copy read capabilities ensure high performance and seamless data sharing between processes.
- NumPy serves as the powerful and reliable engine for all linear algebra calculations. When you perform a mathematical operation on a Dataset, the numeric data is typically converted to NumPy arrays to execute the computation.
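Because a Dataset can convert itself into a NumPy array (via the __array__ protocol described below), the matrix view is also available directly for custom calculations; my_dataset below is assumed to be an existing Dataset instance:

```python
import numpy as np

matrix = np.asarray(my_dataset)     # series as column vectors on a date axis
column_means = matrix.mean(axis=0)  # one aggregate per series
```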
The Principle: Openness and Abstraction¶
While the core stack is opinionated, a primary goal is to avoid creating a “walled garden.” The library is designed to be a good citizen in the PyData ecosystem.
This is achieved through Narwhals, a lightweight abstraction layer that provides a unified API over multiple dataframe backends. This means the library’s internal logic works seamlessly whether the in-memory data is a Pandas DataFrame, a Polars DataFrame, or a PyArrow Table, offering maximum flexibility to users.
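The sketch below shows the general pattern, written directly against the Narwhals API rather than taken from the library’s internals:

```python
import narwhals as nw

def column_means(native_frame):
    """Backend-agnostic logic: native_frame may be a Pandas DataFrame,
    a Polars DataFrame, or a PyArrow Table."""
    frame = nw.from_native(native_frame)   # wrap in the unified API
    means = frame.select(nw.all().mean())  # one expression, any backend
    return means.to_native()               # return the caller's own type
```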
A Commitment to Interoperability¶
To guarantee that the Dataset object can be used by other libraries, it adheres to several standard protocols:
- The NumPy __array__ protocol: A Dataset can be passed directly to most NumPy functions (e.g., np.mean(my_dataset)), as it knows how to convert itself into a NumPy array.
- The DataFrame Interchange Protocol (__dataframe__): This allows a Dataset to be converted into other dataframe types (like Pandas or Polars) with minimal overhead.
- The Arrow C Data Interface (__arrow_c_stream__): This enables efficient, zero-copy data sharing with other Arrow-native libraries and even other programming languages like R or Julia.
- Standard Python operators: By overloading operators like __add__ and __mul__, the Dataset object can be used directly in mathematical expressions. The motive is to provide a natural and highly expressive syntax, allowing users to write code like new_dataset = (dataset_a + dataset_b) / 2.
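In practice these protocols are consumed implicitly by other libraries. Assuming my_dataset is a Dataset instance and recent versions of pandas and pyarrow that support these protocols, conversions like the following should need no glue code:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

mean = np.mean(my_dataset)                                  # __array__
as_pandas = pd.api.interchange.from_dataframe(my_dataset)   # __dataframe__
as_arrow = pa.table(my_dataset)                             # __arrow_c_stream__
```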
Metadata Handling: From Concept to Implementation¶
The library’s approach to metadata is central to its design. It begins with a conceptual model and is realized through a specific technical implementation.
The Concept: Rich, Structured Descriptions¶
At its core, every Dataset and Series is described by a collection of attributes, or “tags”.
Rather than being simple key-value pairs, these attributes are designed to take their values from well-defined taxonomies (such as those from SSB’s KLASS).
This ensures that metadata is structured, consistent, and meaningful.
This conceptual model is detailed further in the Information Model.
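For illustration, a tag set for a single series might look like the dictionary below. The attribute names and code values are hypothetical; the point is that values come from controlled taxonomies rather than free text:

```python
# Hypothetical tags for one series; not the library's actual schema.
series_tags = {
    "dataset": "example_dataset",
    "frequency": "M",        # code from a periodicity taxonomy
    "municipality": "0301",  # a KLASS municipality code (Oslo)
    "unit": "NOK_million",
}
```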
The Implementation: A Dual Storage Approach¶
The technical implementation is designed to satisfy two core requirements: data portability and centralized discoverability. This is achieved with a dual storage approach:
- Embedded for Portability: All descriptive tags are embedded directly into the header of the Parquet file. This ensures that the data and metadata are always connected, making each file a self-contained artifact that can be moved or shared without losing its context.
- Indexed for Discoverability: To fulfill the requirement for a central data catalog, the metadata is also duplicated into an indexed JSON catalog. This provides the crucial performance benefit of enabling fast, efficient searches across all datasets in a repository without needing to read the large data files themselves.
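One practical consequence of the embedded tags is that they can be inspected with plain PyArrow, without going through the library at all, assuming the tags are stored as standard Parquet key-value metadata; the file name below is a placeholder:

```python
import pyarrow.parquet as pq

# Read only the schema and its embedded metadata, not the data itself.
schema = pq.read_schema("example_dataset.parquet")
tags = {k.decode(): v.decode() for k, v in (schema.metadata or {}).items()}
print(tags)
```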
The Decoupled I/O System¶
The library’s ability to adapt to different storage environments is based on a decoupled I/O system that follows a classic Facade and Strategy design pattern.
- The Facade (ssb_timeseries.io): As mentioned, this module is the single entry point for all storage operations. It presents a simple API (e.g., read_data, save) to the rest of the application.
- Pluggable Handlers (The Strategy): The facade reads the project’s configuration to dynamically load the appropriate I/O handler for a given task. These handlers are the concrete “strategies” for different backends (e.g., local files, cloud buckets) and are defined in a single JSON file, as detailed in the Configure I/O guide. This design allows custom handlers to be supplied from outside the core library.
- The Contract (protocols.py): The methods required for any I/O handler are formally defined in protocols.py using typing.Protocol, ensuring that any custom handler is compatible with the library’s I/O system.
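A simplified sketch of such a contract is shown below. Only read_data and save are named in this document; the class name and all signatures here are assumptions, and the authoritative definitions live in protocols.py:

```python
from typing import Any, Protocol

class IoHandler(Protocol):
    """Illustrative I/O handler contract; signatures are assumptions."""

    def read_data(self, *args: Any, **kwargs: Any) -> Any:
        """Read a dataset's data from the backing storage."""
        ...

    def save(self, *args: Any, **kwargs: Any) -> None:
        """Persist data and embedded metadata to the backing storage."""
        ...
```

Because typing.Protocol uses structural typing, a custom handler only needs to implement matching methods; it does not have to inherit from any base class in the core library.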