Reference

ssb_parquedit package

ssb_parquedit.parquedit module

ParquEdit - Clean facade for DuckDB table management with DuckLake catalog.

class ParquEdit(config=None)

Bases: object

A class for managing DuckDB tables with DuckLake catalog integration.

This facade provides a unified interface to DDL, DML, and Query operations. Each method opens and closes its own connection automatically.

Parameters:

config (dict[str, str] | None)

count(table_name, filters=None)

Count the number of rows in a table.

Parameters:
  • table_name (str) – The name of the table to count rows in.

  • filters (dict[str, Any] | list[dict[str, Any]] | None) – Filter conditions to apply before counting. Can be a single filter dict or a list of filter dicts. Defaults to None.

Returns:

The number of rows matching the given filters.

Return type:

int

create_table(table_name, source, product_name=None, part_columns=None, fill=False)

Create a new table in the DuckLake catalog.

Parameters:
  • table_name (str) – Name of the table to create. Must be lowercase, start with a letter or underscore, and contain only lowercase letters, numbers, and underscores. Maximum 20 characters.

  • source (DataFrame | dict[str, Any] | str) – Source for the table schema. Can be:
      - pd.DataFrame: creates the table structure from the DataFrame schema.
      - dict: a JSON Schema specification defining the table structure.
      - str: a GCS path (gs://) to a Parquet file to infer the schema from.

  • product_name (str | None) – Label identifying the product this table belongs to. Stored as a comment on the table. Although the signature defaults to None, a non-empty value is required: passing None or an empty string raises ValueError.

  • part_columns (list[str] | None) – Optional list of column names to partition the table by.

  • fill (bool) – If True, inserts data from source into the table immediately after creation. Defaults to False.

Raises:

ValueError – If product_name is None or empty.

Return type:

None
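The naming rule for table_name and the product_name requirement can be checked before calling create_table. The following is a minimal sketch, not part of ssb_parquedit: the helper names is_valid_table_name and check_create_args are hypothetical, "numbers" is assumed to mean the ASCII digits 0-9, and raising ValueError for a bad table name is this sketch's choice (the reference only documents ValueError for product_name).

```python
from __future__ import annotations

import re

# Pattern derived from the documented rule: the first character is a lowercase
# letter or underscore, the remaining characters are lowercase letters, digits,
# or underscores, and the total length is at most 20 characters.
_TABLE_NAME_RE = re.compile(r"[a-z_][a-z0-9_]{0,19}")


def is_valid_table_name(name: str) -> bool:
    """Return True if name satisfies the documented table-name rule."""
    return _TABLE_NAME_RE.fullmatch(name) is not None


def check_create_args(table_name: str, product_name: str | None) -> None:
    """Hypothetical pre-flight check mirroring the documented constraints."""
    if not is_valid_table_name(table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    if not product_name:  # covers both None and the empty string
        raise ValueError("product_name must not be None or empty")
```

Calling check_create_args(table_name, product_name) before create_table would surface a naming problem without opening a connection.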

drop_table(table_name, cleanup=True)

Drop a table from the DuckLake catalog with optional cleanup.

Table deletion is only allowed in the TEST environment to prevent accidental data loss in production. In PROD or other environments, this method will raise a PermissionError.

Optionally performs comprehensive cleanup:
  - Expires snapshots (removes old transaction logs from metadata)
  - Cleans the GCS bucket (removes orphaned Parquet files)

Parameters:
  • table_name (str) – Name of the table to drop.

  • cleanup (bool) – If True, expire snapshots and clean GCS files. Defaults to True.

Return type:

None

Example

>>> con = ParquEdit()
>>> con.drop_table("temporary_table")  # Drop with cleanup
>>> con.drop_table("temp_table", cleanup=False)  # Drop only

exists(table_name)

Check if a table exists in the database.

Parameters:

table_name (str) – The name of the table to check for existence.

Returns:

True if the table exists, False otherwise.

Return type:

bool

insert_data(table_name, source)

Insert data into a table.

Parameters:
  • table_name (str) – The name of the table to insert data into.

  • source (DataFrame | dict[str, Any] | str) – The data to insert. Can be a pandas DataFrame, a dictionary mapping column names to values, or a string file path to a data file.

Return type:

None

list_tables()

List all tables in the current catalog.

Returns:

A list of table names in the catalog, sorted alphabetically.

Return type:

list[str]

view(table_name, limit=None, offset=0, columns=None, filters=None, order_by=None, output_format='pandas')

View the contents of a table.

Parameters:
  • table_name (str) – The name of the table to query.

  • limit (int | None) – Maximum number of rows to return. Defaults to None.

  • offset (int) – Number of rows to skip before returning results. Defaults to 0.

  • columns (list[str] | None) – List of column names to include. Defaults to None, which returns all columns.

  • filters (dict[str, Any] | list[dict[str, Any]] | None) – Filter conditions to apply. Can be a single filter dict or a list of filter dicts. Defaults to None.

  • order_by (str | None) – Column name to sort results by. Defaults to None.

  • output_format (str) – Format of the returned data. Defaults to “pandas”.

Returns:

Query results in the specified output format.

Return type:

Any