Reference

ssb_parquedit package

ssb_parquedit.parquedit module

ParquEdit - Clean facade for DuckDB table management with DuckLake catalog.

class ParquEdit(config=None)

Bases: object

A class for managing DuckDB tables with DuckLake catalog integration.

This facade provides a unified interface to DDL, DML, and Query operations. Each method opens and closes its own connection automatically.

Parameters: config (dict[str, str] | None)

close()

Close the connection explicitly.

Return type: None

count(table_name, where=None)

Count the number of rows in a table.

Parameters:

table_name (str) – The name of the table to count rows in.
where (str | None) – Optional SQL WHERE clause to filter results. Defaults to None.

Returns:

The number of rows in the table.

Return type:

int

create_table(table_name, source, product_name=None, user_defined_id=None, part_columns=None, fill=False)

Create a new table in the DuckLake catalog.

Parameters:

table_name (str) – Name of the table to create. Must be lowercase, start with a letter or underscore, and contain only lowercase letters, numbers, and underscores. Maximum 20 characters.
source (DataFrame | dict[str, Any] | str) – Source for the table schema. Can be a pd.DataFrame (the table structure is created from the DataFrame schema), a dict (a JSON Schema specification defining the table structure), or a str (a GCS path, gs://, to a Parquet file to infer the schema from).
product_name (str | None) – Label identifying the product this table belongs to. Stored as a comment on the table. Must not be None or empty.
user_defined_id (list[str] | None) – A list of columns that together uniquely identify a row, used to mimic a primary key. Defaults to None.
part_columns (list[str] | None) – Optional list of column names to partition the table by.
fill (bool) – If True, inserts data from source into the table immediately after creation. Defaults to False.

Raises:

ValueError – If product_name is None or empty.

Return type:

None
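The naming rules for table_name can be checked before calling create_table. The following is a minimal sketch derived only from the rules stated above; the helper name is_valid_table_name and the regular expression are assumptions for illustration, not part of the ssb_parquedit API, and the library's own validation may differ in detail.

```python
import re

# Assumed pattern built from the documented rules (not taken from the library source):
# starts with a lowercase letter or underscore, then lowercase letters, digits, underscores.
_TABLE_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")
MAX_TABLE_NAME_LENGTH = 20  # documented maximum length


def is_valid_table_name(name: str) -> bool:
    """Return True if `name` satisfies the documented table_name rules."""
    if len(name) > MAX_TABLE_NAME_LENGTH:
        return False
    # fullmatch rejects the empty string and any name with a disallowed character.
    return _TABLE_NAME_RE.fullmatch(name) is not None


for candidate in ["cities", "_staging_2024", "Cities", "2024_data", "a" * 21]:
    print(f"{candidate!r}: {is_valid_table_name(candidate)}")
```

Running a check like this up front gives a clearer error location than waiting for table creation to fail.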

drop_table(table_name, purge=False)

Drop a table from the DuckLake catalog.

By default, only removes the table from the catalog. DuckLake preserves data files and snapshot history, so edit history remains accessible via get_edits() after a normal drop.

When purge=True, additionally expires snapshots and deletes GCS data files. This permanently destroys all history and cannot be undone.

Parameters:

table_name (str) – Name of the table to drop.
purge (bool) – If True, expire snapshots and delete GCS data files. Defaults to False. History is permanently lost when True.

Return type:

None

Example

>>> con = ParquEdit()
>>> con.drop_table("my_table")             # History preserved
>>> con.drop_table("my_table", purge=True) # Full deletion, history lost

edit(table_name, rowid, changes, change_event_reason, change_comment)

Edit a single row in a table by its row ID.

Parameters:

table_name (str) – The name of the table to edit.
rowid (int) – The ID of the row to update.
changes (dict[str, Any]) – A dictionary mapping column names to their new values.
change_event_reason (str) – A short reason code describing the type of change event.
change_comment (str) – A human-readable comment describing the change.

Return type:

None
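To make the relationship between the edit arguments and the changelog concrete, the sketch below builds a JSON payload for one edited row. It is an illustration only: build_change_record is a hypothetical helper, and the payload layout is an assumption based on the column names reported by get_edits (change_event_reason, old_values, new_values), not the library's actual storage format.

```python
import json
from typing import Any


def build_change_record(
    old_row: dict[str, Any],
    changes: dict[str, Any],
    change_event_reason: str,
    change_comment: str,
) -> str:
    """Serialize a hypothetical changelog payload for one edited row."""
    # Reject edits that name columns the row does not have.
    unknown = set(changes) - set(old_row)
    if unknown:
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    record = {
        "change_event_reason": change_event_reason,
        "change_comment": change_comment,
        # Record only the columns that are being edited.
        "old_values": {col: old_row[col] for col in changes},
        "new_values": dict(changes),
    }
    return json.dumps(record, sort_keys=True)


row = {"name": "Oslo", "population": 700000}
payload = build_change_record(
    row, {"population": 717710}, "correction", "Updated from latest register"
)
print(payload)
```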

exists(table_name)

Check if a table exists in the database.

Parameters: table_name (str) – The name of the table to check for existence.
Returns: True if the table exists, False otherwise.
Return type: bool

flush_inlined_table(table_name)

Flush a table’s inlined data to Parquet storage.

Materializes any inlined inserts and deletes for the given table from the metadata catalog into Parquet files. Tables with no inlined data are left untouched.

Parameters: table_name (str) – Name of the table to flush.
Return type: None

classmethod from_connection(connection, db_config=None)

Create a ParquEdit instance from an existing connection.

Useful for testing and advanced use cases where the connection is created and configured outside of ParquEdit.

Parameters:

connection (DuckDBConnection) – An already established DuckDBConnection.
db_config (dict[str, str] | None) – Optional configuration. If None (the default), an empty configuration {} is used.

Returns:

An instance connected to the given connection.

Return type:

ParquEdit

get_edits(table_name=None)

Retrieve changelog entries from DuckLake snapshots.

Fetches all snapshots with non-null commit metadata, parses the JSON payload in 'commit_extra_info' into separate columns, and optionally filters by table name.

Parameters: table_name (str | None) – If provided, only returns edits for the given table. If None, returns edits for all tables.
Returns: A DataFrame with snapshot data and parsed changelog columns, including change_event_reason, changed_by, user_defined_id, old_values, new_values, and more.
Return type: DataFrame
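The parsing step described above (skip snapshots without commit metadata, expand the JSON payload into columns, optionally filter by table) can be sketched with the standard library. This is not the library's implementation: parse_snapshots is a hypothetical helper, snapshots are modelled as plain dicts rather than a DataFrame, and the snapshot_id and table_name keys are assumptions made for the example.

```python
import json
from typing import Any, Optional


def parse_snapshots(
    snapshots: list[dict[str, Any]], table_name: Optional[str] = None
) -> list[dict[str, Any]]:
    """Expand the JSON in 'commit_extra_info' into top-level keys."""
    rows = []
    for snap in snapshots:
        raw = snap.get("commit_extra_info")
        if raw is None:
            continue  # only snapshots with commit metadata are kept
        # Copy the snapshot fields, then add the parsed payload as extra columns.
        row = {k: v for k, v in snap.items() if k != "commit_extra_info"}
        row.update(json.loads(raw))
        # 'table_name' inside the payload is an assumed key used for filtering.
        if table_name is not None and row.get("table_name") != table_name:
            continue
        rows.append(row)
    return rows


snapshots = [
    {"snapshot_id": 1, "commit_extra_info": None},
    {
        "snapshot_id": 2,
        "commit_extra_info": json.dumps(
            {"table_name": "cities", "change_event_reason": "correction"}
        ),
    },
]
print(parse_snapshots(snapshots, table_name="cities"))
```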

insert_data(table_name, source)

Insert data into a table.

Parameters:

table_name (str) – The name of the table to insert data into.
source (DataFrame | dict[str, Any] | str) – The data to insert. Can be a pandas DataFrame, a dictionary mapping column names to values, or a string file path to a data file.

Return type:

None

list_tables()

List all tables in the current catalog.

Returns: A list of table names in the catalog, sorted alphabetically.
Return type: list[str]

classmethod local(path=PosixPath('/home/runner/.parquedit'))

Create a ParquEdit instance backed by a persistent local SQLite catalog.

Useful for local development without GCS or PostgreSQL access. The catalog and data files are stored at path and persist across sessions. The directory is created if it does not already exist.

Parameters: path (str | Path) – Directory for the SQLite catalog and Parquet data files. Defaults to ~/.parquedit, that is, .parquedit in the current user's home directory.
Returns: An instance backed by a local SQLite DuckLake catalog.
Return type: ParquEdit

Example

>>> pe = ParquEdit.local()                       # uses ~/.parquedit
>>> pe = ParquEdit.local("/tmp/my_dev_catalog")  # custom path
>>> pe.create_table("cities", source=df, product_name="dev")
>>> pe.close()
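The directory handling described for local() (expand the path, create the directory if it is missing, reuse it if it exists) can be sketched with pathlib. ensure_catalog_dir is a hypothetical helper written for this example; whether the library also creates missing parent directories is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Union


def ensure_catalog_dir(path: Union[str, Path] = "~/.parquedit") -> Path:
    """Expand `path` and create the directory if it does not exist yet."""
    directory = Path(path).expanduser()
    # parents=True is an assumption; exist_ok=True makes repeated calls safe.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Use a temporary directory so the example does not touch the home directory.
with tempfile.TemporaryDirectory() as tmp:
    catalog_dir = ensure_catalog_dir(Path(tmp) / "dev" / "catalog")
    print(catalog_dir.is_dir())
    ensure_catalog_dir(catalog_dir)  # second call reuses the existing directory
```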

merge_adjacent_files(table_name)

Compact a table’s small Parquet files into fewer, larger ones.

Merges the adjacent Parquet files for the given table without expiring snapshots, preserving time travel and the data change feed. Old files are not deleted by this call; run a cleanup afterwards to remove them.

Parameters: table_name (str) – Name of the table to compact.
Return type: None

view(table_name, where=None, limit=None, offset=0, columns=None, order_by=None, output_format='pandas')

View the contents of a table.

Parameters:

table_name (str) – The name of the table to query.
where (str | None) – Optional SQL WHERE clause to filter results. Defaults to None.
limit (int | None) – Maximum number of rows to return. Defaults to None.
offset (int) – Number of rows to skip before returning results. Defaults to 0.
columns (list[str] | None) – List of column names to include. Defaults to None, which returns all columns.
order_by (str | None) – Column name to sort results by. Defaults to None.
output_format (str) – Format of the returned data. Defaults to 'pandas'.

Returns:

Query results in the specified output format.

Return type:

Any
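One way to read the view parameters is as the parts of a single SELECT statement. The sketch below assembles such a statement as a string; build_select is a hypothetical helper, not the library's query builder, and it assumes that where holds the condition without the WHERE keyword. The values are interpolated as-is, so untrusted input should not be passed to code like this.

```python
from typing import Optional


def build_select(
    table_name: str,
    where: Optional[str] = None,
    limit: Optional[int] = None,
    offset: int = 0,
    columns: Optional[list[str]] = None,
    order_by: Optional[str] = None,
) -> str:
    """Assemble a SELECT statement from view-style arguments."""
    column_sql = ", ".join(columns) if columns else "*"  # None means all columns
    parts = [f"SELECT {column_sql} FROM {table_name}"]
    if where:
        parts.append(f"WHERE {where}")
    if order_by:
        parts.append(f"ORDER BY {order_by}")
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    if offset:
        parts.append(f"OFFSET {offset}")
    return " ".join(parts)


print(build_select("cities"))
# SELECT * FROM cities
print(
    build_select(
        "cities",
        where="population > 100000",
        limit=10,
        offset=20,
        columns=["name", "population"],
        order_by="population",
    )
)
# SELECT name, population FROM cities WHERE population > 100000 ORDER BY population LIMIT 10 OFFSET 20
```

Combining order_by with limit and offset gives stable pagination; without an ordering, the rows returned for a given offset are generally not guaranteed to be the same between calls.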