SSB Parquedit¶

A Python package for manually editing tabular data stored as Parquet files on DaplaLab — Statistics Norway’s cloud data platform. Built on top of DuckDB and the DuckLake catalog, it provides a clean Python interface for creating tables, inserting data, querying results and editing rows directly from Google Cloud Storage (GCS). Intended for single-table editing. Does not support primary- and foreign keys.

Table of Contents¶

Features¶

Auto-configuration — reads Dapla environment variables to build connection config automatically
DuckLake catalog integration — metadata stored in PostgreSQL, data stored in GCS
Create tables from a pandas DataFrame, a JSON Schema dict, or an existing GCS Parquet file
Insert data from a pandas DataFrame or a gs:// Parquet path — rows are automatically assigned a unique rowid within a table
Edit data - Update value(s) in a single row by its rowid.
Query tables with where-conditions, column selection, sorting, pagination, and multiple output formats (pandas, polars, pyarrow)
Find edits Retrieve historical column-level edits for a specified table
Count rows
Check table existence safely
Partition tables by one or more columns

Requirements¶

Python >=3.12
Access to a DaplaLab environment
A PostgreSQL instance reachable at localhost for DuckLake metadata storage
A GCS bucket following the naming convention ssb-{team-name}-data-produkt-{environment}

Python dependencies¶

Package	Version
`duckdb`	`==1.5.2`
`pandas`	`>=3.0.0, <4.0.0`
`polars`	`>=1.38.1, <2.0.0`
`pyarrow`	`>=23.0.1, <24.0.0`
`gcsfs`	`>=2026.1.0, <2027.0.0`
`click`	`>=8.0.1`
`tenacity`	`>=9.1.4,<10.0.0`

Installation¶

poetry add ssb-parquedit

Usage¶

Basic setup¶

ParquEdit reads its connection configuration automatically from Dapla-environment variables.

from ssb_parquedit import ParquEdit

# Auto-configure from environment
con = ParquEdit()

Creating a table¶

Tables can be created from a DataFrame schema, a JSON Schema dict, or an existing Parquet file.

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# Option 1: Create from DataFrame (empty — schema only)
con.create_table(table_name="my_table_1",
                 source=df,
                 product_name="my-product",
                 user_defined_id=["name"])

# Option 2: Create and immediately populate with data
con.create_table(table_name="my_table_2",
                 source=df,
                 product_name="my-product",
                 user_defined_id=["name"],
                 fill=True)

# Option 3: Create from a JSON Schema
schema = {
    "properties": {
        "name": {"type": "string"},
        "age":  {"type": "integer"},
    }
}
con.create_table(table_name="my_table_3",
                 source=schema,
                 product_name="my-product",
                 user_defined_id=["name"])

# Option 4: Create from an existing GCS Parquet file (schema inferred from file)
con.create_table(table_name="my_table_4",
                 source="gs://my-bucket/path/to/file.parquet",
                 product_name="my-product",
                 user_defined_id=["id", "year"])

# Option 5: Create with partitioning and immediately populate with data
con.create_table(table_name="my_table_5",
                 source=df,
                 product_name="my-product",
                 part_columns=["age"],
                 user_defined_id=["name"],
                 fill=True)

Notes:

product_name is required and is stored as a comment on the table.

table_name must be lowercase, start with a letter or underscore, contain only lowercase letters, numbers, and underscores, and be at most 20 characters.

user_defined_id — a list of columns that together uniquely identify a row in a table, used to mimic a primary key.

Inserting data in an existing table¶

# Insert from a DataFrame
con.insert_data(table_name="my_table_1",
                 source=df)

# Insert from a GCS Parquet file
con.insert_data(table_name="my_table_4",
                 source="gs://my-bucket/path/to/file.parquet")

Each inserted row is automatically assigned a unique rowid within the table

Editing a row¶

edit() updates exactly one row — identified by its rowid — and logs the change reason and comment to the DuckLake snapshot.

# First look up the rowid of the row you want to edit
result = con.view(table_name="my_table_1",
                  where="name = 'Alice'")
rowid = result["rowid"].iloc[0]

# Then edit it
con.edit(
    table_name="my_table_1",
    rowid=rowid,
    changes={"name":"Alice B", "age": 33},
    change_event_reason="REVIEW",
    change_comment="Corrected name and age after data review",
)

changes is a dict of {column_name: new_value} pairs.

change_event_reason must be one of: OTHER_SOURCE, REVIEW, OWNER, MARGINAL_UNIT, DUPLICATE, OTHER

Querying data¶

# View all rows (returns pandas DataFrame by default)
result = con.view(table_name="my_table_1")

# Filter with a WHERE clause
result = con.view(table_name="my_table_1", where="age > 25")
result = con.view(table_name="my_table_1", where="name = 'Alice' AND age >= 30")

# Limit and offset (pagination)
result = con.view(table_name="my_table_1",
                  limit=10,
                  offset=2)

# Select specific columns
result = con.view(table_name="my_table_1",
                  columns=["name", "age"])

# Sort results
result = con.view(table_name="my_table_1",
                   order_by="age DESC")

# Return as polars or pyarrow
result = con.view(table_name="my_table_1",
                   output_format="polars")

result = con.view(table_name="my_table_1",
                   output_format="pyarrow")

Counting rows¶

total = con.count(table_name="my_table_1",
                   where="name='Alice'")

Checking table existence¶

if con.exists(table_name="my_table_1"):
    print("Table found")

List all tables¶

con.list_tables()

List edits¶

get_edits() - Retrieves the full changelog for a table by reading DuckLake snapshot metadata. Each row represents a single edit, with columns for who made the change, when, the reason, which row was affected (identified by its unique key), and the old and new values for all modified columns.

Optionally filter by table name, or omit it to get the changelog for all tables.

# All edits for a specific table
df = con.get_edits(table_name="my_table")

# All edits across all tables
df = con.get_edits()

The returned DataFrame includes these changelog columns:

Column	Description
`changed_by`	User who made the edit
`change_event_reason`	Reason code (e.g. `REVIEW`, `OWNER`)
`change_comment`	Free-text comment from the editor
`table_name`	Table the edit was made on
`rowid`	Internal row identifier
`user_defined_id`	Business key values identifying the row
`old_values`	Dict of column → old value for changed columns
`new_values`	Dict of column → new value for changed columns
`product_name`	Product name the table belongs to

Drop table¶

drop_table() - Drops a table from the DuckLake catalog. By default, only removes the table from the catalog. DuckLake preserves data files and snapshot history, so edit history remains accessible via get_edits() after a normal drop. When purge=True, additionally expires snapshots and deletes GCS data files. This permanently destroys all history and cannot be undone.

# Removes the table from the catalog
con.drop_table(table_name="my_table")

# Removes the table from the catalog, expires snapshots and deletes data files
con.drop_table(table_name="my_table", purge=True)

Maintenance¶

Flush inlined data¶

Flush inlined data materializes inlined rows into Parquet files for a table. This is a maintenance operation for workloads with frequent small writes, helping keep storage layout efficient and query performance stable. It does not change table values, only how data is physically stored. The operation is safe to run repeatedly: Running it when nothing is pending has no effect.

# Flushes inlined data for table 'my_table'
con.flush_inlined_table(table_name="my_table")

Merge adjacent files¶

Merge adjacent files compacts a table’s small Parquet files into fewer, larger files. This is a maintenance step for tables that receive many small writes, improving scan efficiency and reducing file-management overhead. It preserves table data and history semantics, changing only physical file layout. The operation is safe to run repeatedly: Running it when nothing is mergeable has no effect.

# Merge adjacent files for table 'my_table'
con.merge_adjacent_files(table_name="my_table")

Advanced¶

Accessing the raw DuckDB connection¶

ParquEdit wraps a DuckDBConnection, which exposes the underlying duckdb.DuckDBPyConnection via its .raw property. This is useful when integrating with libraries that require a native DuckDB connection, such as Ibis.

import ibis
from ssb_parquedit import ParquEdit

con = ParquEdit()
raw = con._get_connection().raw  # duckdb.DuckDBPyConnection

ibis_conn = ibis.duckdb.connect(conn=raw)
table = ibis_conn.table("my_table_1")

Notes:

_get_connection() is an internal method. The raw connection shares state with ParquEdit — closing either will affect both. Do not close the raw connection manually while ParquEdit is still in use.

When using the raw connection, the user is resposible to provide the required information that ParquEdit-methods gives. E.g when creating and editing tables.

Setting up local connection¶

Create a ParquEdit instance backed by a persistent local SQLite catalog. Useful for local development and testing without GCS or PostgreSQL access. The catalog and data files are stored at path and persist across sessions. The directory is created if it does not already exist.

con = ParquEdit().local(path="/home/onyxia/work/")

Project structure¶

src/ssb_parquedit/
├── parquedit.py      # ParquEdit facade — main public API
├── connection.py     # DuckDB + DuckLake catalog connection management
├── ddl.py            # DDL operations (CREATE TABLE, partitioning)
├── dml.py            # DML operations (INSERT, EDIT)
├── query.py          # Query operations (SELECT, COUNT, EXISTS)
├── maintenance.py    # Maintenance operations (flush inlined data, merge adjacent files)
├── functions.py      # Environment helpers (Dapla config auto-detection)
├── local.py          # Local DuckDB connection backed by SQLite (dev/testing)
└── utils.py          # Schema utilities and SQL sanitization

Contributing¶

Contributions are very welcome. To learn more, see the Contributor Guide.

License¶

Distributed under the terms of the MIT license. SSB Parquedit is free and open source software.

Issues¶

If you encounter any problems, please file an issue along with a detailed description.

Credits¶

This project was generated from Statistics Norway’s SSB PyPI Template. Maintained by Team Fellesfunksjoner at Statistics Norway (Data Enablement Department 724).