# Architecture

This document provides a high-level overview of the `ssb-timeseries` library's internal architecture and its core design principles.

## Core User Interfaces

A user primarily interacts with the library through two main entry points, which are designed to separate the acts of *working with* data from *finding* data.

- The **`Dataset` class**: This is the central point of interaction. It represents a single time series dataset, bringing together its data and metadata. It provides a rich API for I/O, calculations, and data manipulation.
- The **`Catalog` module**: This provides functions for discovering and searching for datasets across all configured storage locations (repositories).

## Key Helper Modules

While `Dataset` and `Catalog` are the main interfaces, several helper modules provide the foundation for the library's flexibility and robustness.

- **`config`**: This module is the library's central nervous system. It loads all environment-specific settings from a JSON file, defining where data is stored and which backend handlers to use. This decouples the library's logic from hardcoded paths and storage implementations.
- **`meta`**: This module manages structured metadata. Beyond simple key-value tags, it handles taxonomies (like those from SSB's KLASS) and provides the logic for advanced, metadata-driven operations like hierarchical aggregation.
- **`io`**: This module acts as the single gateway for all storage operations. As a strict **facade**, it ensures that all parts of the library read and write data through a consistent, high-level API.
- **`dates`**: This module provides utility functions for standardizing all time-related operations, ensuring consistent handling of timezones, frequencies, and formats throughout the library.

## Data Handling: The Interoperable Data Model

The library's approach to data handling is guided by a core conceptual model that directly influences its choice of technologies and commitment to interoperability.

### The Concept: Datasets as Matrices

A key design feature is the interpretation of datasets as mathematical matrices of series vectors, all aligned by a common date axis. The library aims to provide easy and intuitive use of linear algebra for calculating derived data. To accomplish this, the basic data structure is a table where each time series is a column vector, and the `Dataset` object itself exposes a rich set of mathematical operations (e.g., `+`, `-`, `*`, `/`). This allows for natural, expressive code, such as `new_dataset = (dataset_a + dataset_b) / 2`.

### The Implementation: An Opinionated, High-Performance Stack

This conceptual model naturally leads to an opinionated selection of high-performance, column-oriented technologies:

- **[Apache Parquet](https://parquet.apache.org/)** is the standard for permanent storage. Its columnar format is highly efficient for the analytical queries typical in time series analysis.
- **[Apache Arrow](https://arrow.apache.org/)** is the preferred format for in-memory data. Its columnar layout and zero-copy read capabilities ensure high performance and seamless data sharing between processes.
- **[NumPy](https://numpy.org/)** serves as the powerful and reliable engine for all linear algebra calculations. When you perform a mathematical operation on a `Dataset`, the numeric data is typically converted to NumPy arrays to execute the computation (a toy sketch follows after this list).
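
The matrix view can be illustrated with plain NumPy. The following is a toy sketch, not the library's API: it assumes each dataset is represented as a matrix whose columns are series vectors aligned on a shared date axis, so that deriving new data is plain element-wise matrix algebra.

```python
import numpy as np

# Toy example: each dataset is a matrix where every column is one series,
# and rows share a common date axis (here: three periods, two series).
dataset_a = np.array(
    [
        [100.0, 10.0],
        [110.0, 12.0],
        [121.0, 15.0],
    ]
)
dataset_b = np.array(
    [
        [90.0, 8.0],
        [100.0, 9.0],
        [105.0, 11.0],
    ]
)

# Because the matrices are aligned, the average of the two datasets is
# simple element-wise arithmetic over the whole table at once.
new_dataset = (dataset_a + dataset_b) / 2
print(new_dataset)
```

In the library itself, the same expression is written with whole `Dataset` objects rather than raw arrays, with the operator overloading described above delegating the numeric work to NumPy.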
### The Principle: Openness and Abstraction

While the core stack is opinionated, a primary goal is to avoid creating a "walled garden." The library is designed to be a good citizen in the PyData ecosystem. This is achieved through **[Narwhals](https://narwhals-dev.github.io/narwhals/)**, a lightweight abstraction layer that provides a unified API over multiple dataframe backends. This means the library's internal logic works seamlessly whether the in-memory data is a Pandas DataFrame, a Polars DataFrame, or a PyArrow Table, offering maximum flexibility to users.

### A Commitment to Interoperability

To guarantee that the `Dataset` object can be used by other libraries, it adheres to several standard protocols:

- **The NumPy `__array__` Protocol**: A `Dataset` can be passed directly to most NumPy functions (e.g., `np.mean(my_dataset)`), as it knows how to convert itself into a NumPy array.
- **The DataFrame Interchange Protocol (`__dataframe__`)**: This allows a `Dataset` to be converted into other dataframe types (like Pandas or Polars) with minimal overhead.
- **The Arrow C Data Interface (`__arrow_c_stream__`)**: This enables efficient, zero-copy data sharing with other Arrow-native libraries and even other programming languages like R or Julia.
- **Standard Python Operators**: By overloading operators like `__add__` and `__mul__`, the `Dataset` object can be used directly in mathematical expressions. The aim is a natural and highly expressive syntax, allowing users to write code like `new_dataset = (dataset_a + dataset_b) / 2`.

## Metadata Handling: From Concept to Implementation

The library's approach to metadata is central to its design. It begins with a conceptual model and is realized through a specific technical implementation.

### The Concept: Rich, Structured Descriptions

At its core, every `Dataset` and `Series` is described by a collection of attributes, or "tags". Rather than being simple key-value pairs, these attributes are designed to take their values from well-defined taxonomies (such as those from SSB's KLASS). This ensures that metadata is structured, consistent, and meaningful. This conceptual model is detailed further in the {doc}`info-model`.

### The Implementation: A Dual Storage Approach

The technical implementation is designed to satisfy two core requirements: data portability and centralized discoverability. This is achieved with a dual storage approach:

1. **Embedded for Portability**: All descriptive tags are embedded directly into the header of the Parquet file (see the sketch after this list). This ensures that the data and metadata are always connected, making each file a self-contained artifact that can be moved or shared without losing its context.
2. **Indexed for Discoverability**: To fulfill the requirement for a central data catalog, the metadata is also duplicated into an indexed **JSON catalog**. This provides the crucial performance benefit of enabling fast, efficient searches across all datasets in a repository without needing to read the large data files themselves.
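
The embedding mechanism can be illustrated with plain PyArrow. The snippet below is a minimal, generic sketch of storing tags as key-value metadata in a Parquet file's schema; the key name `tags`, the column names, and the tag structure are illustrative, not the library's actual conventions.

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: one date column plus two series columns.
table = pa.table(
    {
        "valid_at": ["2024-01-31", "2024-02-29", "2024-03-31"],
        "series_a": [100.0, 110.0, 121.0],
        "series_b": [10.0, 12.0, 15.0],
    }
)

# Descriptive tags for the set and its series, serialized to JSON.
tags = {
    "name": "example_set",
    "series": {"series_a": {"unit": "index"}, "series_b": {"unit": "index"}},
}

# Embed the tags as key-value metadata in the file schema (keeping any
# metadata that is already present), then write the self-contained file.
existing = table.schema.metadata or {}
table = table.replace_schema_metadata({**existing, b"tags": json.dumps(tags).encode()})
pq.write_table(table, "example_set.parquet")

# The tags can be read back from the schema without scanning the data pages.
restored = json.loads(pq.read_schema("example_set.parquet").metadata[b"tags"])
```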
## The Decoupled I/O System

The library's ability to adapt to different storage environments is based on a decoupled I/O system that follows a classic **Facade** and **Strategy** design pattern.

- **The Facade (`ssb_timeseries.io`)**: As mentioned, this module is the single entry point for all storage operations. It presents a simple API (e.g., `read_data`, `save`) to the rest of the application.
- **Pluggable Handlers (The Strategy)**: The facade reads the project's configuration to dynamically load the appropriate **I/O handler** for a given task. These handlers are the concrete "strategies" for different backends (e.g., local files, cloud buckets) and are defined in a single JSON file, as detailed in the {doc}`configure-io` guide. This design allows custom handlers to be specified from outside the core library.
- **The Contract (`protocols.py`)**: The methods required for any I/O handler are formally defined in `protocols.py` using `typing.Protocol`, ensuring that any custom handler is compatible with the library's I/O system (a sketch of the pattern follows this list).
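
To make the contract idea concrete, here is a minimal sketch of the pattern using `typing.Protocol`. It assumes a hypothetical handler with `read_data` and `save` methods and a configuration dictionary with `module`, `class_name`, and `options` keys; the library's actual protocol definitions in `protocols.py` and its configuration schema may differ.

```python
import importlib
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class IoHandler(Protocol):
    """Contract sketch: the methods a storage backend must provide."""

    def read_data(self, set_name: str) -> Any:
        """Return the stored data for a named dataset."""
        ...

    def save(self, set_name: str, data: Any) -> None:
        """Persist the data for a named dataset."""
        ...


def handler_for(repository: dict) -> IoHandler:
    """Facade-side dispatch: load the handler class named in the configuration."""
    module = importlib.import_module(repository["module"])
    handler_class = getattr(module, repository["class_name"])
    return handler_class(**repository.get("options", {}))
```

Because the handler is looked up by name at runtime, a custom backend only needs to satisfy the protocol and be referenced from the configuration file; no change to the core library is required.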