Reference

dapla package

Subpackages

dapla.backports module

details(gcs_path)

Backported dapla function to support a detailed listing of files for a given GCS path.

Parameters:

gcs_path (str) – the path for which you want to list file details

Return type:

list[dict[str, str]]

Returns:

A list of dicts containing file details

show(gcs_path)

Backported dapla function to recursively show all folders below a given GCS path.

Parameters:

gcs_path (str) – the path from which you want to list all folders

Return type:

list[str]

Returns:

A simplified list of files or folders
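The backported helpers behave like familiar recursive directory listings. As a rough local analogue (using pathlib on a temporary directory instead of GCS — an illustrative stand-in, not the real implementation):

```python
# Local analogue of dapla.backports.show(): recursively list everything below a
# root path. The real function performs the equivalent listing against GCS.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "2023" / "01").mkdir(parents=True)
(root / "2024").mkdir()

# Simplified list of folders, comparable in spirit to what show() returns.
folders = sorted(
    p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
)
print(folders)  # ['2023', '2023/01', '2024']
```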

dapla.collector module

class CollectorClient(collector_url)

Bases: object

Client for working with DataCollector.

Parameters:

collector_url (str)

running_tasks()

Get all running collector tasks.

Return type:

Response

start(specification)

Start a new collector task.

Parameters:

specification (dict[str, Any]) – The JSON object of the collector specification

Returns:

The “requests.Response” object from the API call

Return type:

Response

stop(task_id)

Stop a running collector task.

Parameters:

task_id (int) – The id of the task to stop.

Returns:

The “requests.Response” object from the API call

Return type:

Response

dapla.converter module

class ConverterClient(converter_url)

Bases: object

Client for working with DataConverter.

Parameters:

converter_url (str)

get_job_summary(job_id)

Retrieve the execution summary for a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_report(job_id)

Get a report with details about how pseudonymization is being applied for a specific job.

The report includes:
  • matched pseudo rules

  • unmatched pseudo rules

  • pseudo rule to field match map – rule names and their corresponding matches

  • field matches to pseudo rule map – field names and the corresponding pseudo rule that covers each field

  • metrics

  • a textual schema hierarchy that illustrates how the pseudo rules are being applied

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_schema(job_id)

Get hierarchical schema representation that details how pseudo rules are being applied.

This is a smaller version of get_pseudo_report.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

start(job_config)

Schedule a new converter job.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

start_simulation(job_config)

Start a simulated converter job.

Useful for testing job configurations or diagnosing pseudonymization issues.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

stop_job(job_id)

Stop a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

dapla.doctor module

class Doctor

Bases: object

Class of functions that perform checks on Dapla.

Checks whether the user is authenticated, whether the Keycloak token is valid, and whether the user has access to GCS. Each method can be run individually or collectively with the ‘health’ method.

static bucket_access()

Checks whether the user has access to a common Google Cloud Storage bucket.

Return type:

bool

static gcs_credentials_valid()

Checks whether the user’s Google Cloud Storage token is valid by accessing a GCS service.

Return type:

bool

classmethod health()

Runs a series of checks to determine the health of the Dapla setup.

Return type:

None

static keycloak_token_valid()

Checks whether the Keycloak token is valid by attempting to access a Keycloak-token-protected service.

Return type:

bool
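The individual checks all return booleans, and health() runs them collectively. A minimal sketch of that aggregate-and-report pattern, with stub functions standing in for the real methods (the stubs are assumptions for illustration; the real checks contact Keycloak and GCS):

```python
# Sketch of the check-aggregation pattern behind Doctor.health(). Each check
# returns a bool; an aggregator runs them all and reports pass/fail.
# These stubs always pass; the real checks call live services.
def keycloak_token_valid() -> bool:
    return True  # stand-in for Doctor.keycloak_token_valid()

def gcs_credentials_valid() -> bool:
    return True  # stand-in for Doctor.gcs_credentials_valid()

def bucket_access() -> bool:
    return True  # stand-in for Doctor.bucket_access()

checks = {
    "keycloak token": keycloak_token_valid,
    "gcs credentials": gcs_credentials_valid,
    "bucket access": bucket_access,
}
results = {name: check() for name, check in checks.items()}
for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'FAILED'}")
```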

dapla.files module

class FileClient

Bases: object

Client for working with buckets and files on Google Cloud Storage.

This class should not be instantiated, only the static methods should be used.

static cat(gcs_path)

Get string content of a file from GCS.

Parameters:

gcs_path (str) – The GCS path to a file.

Return type:

str

Returns:

utf-8 decoded string content of the given file

static gcs_open(gcs_path, mode='r')

Open a file in GCS; works like regular Python open().

Parameters:
  • gcs_path (str) – The GCS path to a file.

  • mode (str) – File open mode. Defaults to ‘r’

Return type:

TextIOWrapper | AbstractBufferedFile

Returns:

A file-like object.

static get_gcs_file_system(**kwargs)

Return a pythonic file-system for Google Cloud Storage - initialized with a personal Google Identity token.

Parameters:

kwargs (Any) – Additional arguments to pass to the underlying GCSFileSystem.

Return type:

GCSFileSystem

Returns:

A GCSFileSystem instance.

See https://gcsfs.readthedocs.io/en/latest for advanced usage

static get_versions(bucket_name, file_path)

Get all versions of a file in a bucket.

Parameters:
  • bucket_name (str) – Bucket name where the file is located.

  • file_path (str) – Path to the file.

Return type:

Any

Returns:

List of versions of the file.

static load_csv_to_pandas(gcs_path, **kwargs)

Reads a CSV file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_csv().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_json_to_pandas(gcs_path, **kwargs)

Reads a JSON file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_json().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_xml_to_pandas(gcs_path, **kwargs)

Reads an XML file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_xml().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static ls(gcs_path, detail=False, **kwargs)

List the contents of a GCS bucket path.

Parameters:
  • gcs_path (str) – The GCS path to a directory.

  • detail (bool) – Whether to return detailed information about the files.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘ls()’ method.

Return type:

Any

Returns:

List of strings if detail is False, or list of directory information dicts if detail is True.

static restore_version(source_bucket_name, source_file_name, source_generation_id, **kwargs)

Restores a soft-deleted or non-current (object-versioned) file to the live version.

Parameters:
  • source_bucket_name (str) – source bucket name where the file is located.

  • source_file_name (str) – non-current file name.

  • source_generation_id (str) – generation_id of the non-current file version.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘copy_blob()’ method.

Return type:

Any

Returns:

A new blob with new generation id.

static save_pandas_to_csv(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a CSV file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_csv().

Return type:

None

static save_pandas_to_json(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a JSON file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_json().

Return type:

None

static save_pandas_to_xml(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to an XML file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_xml().

Return type:

None
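The load_*/save_* helpers delegate to the corresponding pandas readers and writers, so keyword arguments pass straight through. A local sketch of that delegation, using a temporary local file instead of a gs:// path (the path is illustrative; the real methods take GCS paths):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
path = os.path.join(tempfile.mkdtemp(), "data.csv")

# save_pandas_to_csv forwards kwargs such as sep and index to DataFrame.to_csv().
df.to_csv(path, sep=";", index=False)

# load_csv_to_pandas forwards kwargs such as sep to pandas.read_csv().
df2 = pd.read_csv(path, sep=";")

assert df2.equals(df)
```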

dapla.gcs module

class GCSFileSystem(**kwargs)

Bases: GCSFileSystem

GCSFileSystem is a wrapper around gcsfs.GCSFileSystem.

isdir(path)

Check if path is a directory.

Return type:

bool

Parameters:

path (str)

dapla.git module

repo_root_dir(directory=None)

Find the root directory of a git repo, searching upwards from a given path.

Parameters:

directory (Union[Path, str, None]) – The path to search from, defaults to the current working directory. The directory can be of type string or of type pathlib.Path.

Return type:

Path

Returns:

Path to the git repo’s root directory.

Raises:

RuntimeError – If no .git directory is found when searching upwards.

Example:

>>> import dapla as dp
>>> import tomli
>>>
>>> config_file = dp.repo_root_dir() / "pyproject.toml"
>>> with open(config_file, mode="rb") as fp:
...     config = tomli.load(fp)

dapla.guardian module

class GuardianClient

Bases: object

Client for interacting with the Maskinporten Guardian.

static call_api(api_endpoint_url, maskinporten_client_id, scopes, keycloak_token=None)

Call an external API using Maskinporten Guardian.

Parameters:
  • api_endpoint_url (str) – URL to the target API

  • maskinporten_client_id (str) – the Maskinporten client id

  • scopes (str) – the Maskinporten scopes

  • keycloak_token (Optional[str]) – the user’s personal Keycloak token. An automatic fetch will be attempted if left empty.

Raises:

RuntimeError – If the API call fails

Return type:

Any

Returns:

The endpoint json response

static get_guardian_token(guardian_endpoint, keycloak_token, body)

Retrieve access token from Maskinporten Guardian.

Parameters:
  • guardian_endpoint (str) – URL to the Maskinporten Guardian

  • keycloak_token (str) – the user’s Keycloak token

  • body (dict[str, str]) – the Maskinporten request body

Raises:

RuntimeError – If the Guardian token request fails

Return type:

str

Returns:

The Maskinporten access token

static get_guardian_url()

Get the Guardian URL for the current environment.

Return type:

str

dapla.pandas module

class SupportedFileFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

A collection of supported file formats.

CSV = 'csv'
EXCEL = 'excel'
FWF = 'fwf'
JSON = 'json'
PARQUET = 'parquet'
SAS7BDAT = 'sas7bdat'
XML = 'xml'

read_pandas(gcs_path, file_format='parquet', columns=None, filters=None, **kwargs)

Convenience method for reading a dataset from a given GCS path and converting it to a Pandas DataFrame.

Parameters:
  • gcs_path (str | list[str]) – Path or paths to the directory or file you want to read. Multiple paths are supported when reading the parquet format.

  • file_format (Optional[str]) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like read_json, read_csv, etc. Defaults to “parquet”.

  • columns (Optional[list[str]]) – Choose specific columns to read. Defaults to None.

  • filters (Union[list[tuple[Any] | list[tuple[Any]]], Expression, None]) – Add row filter to process when reading parquet. The filter should follow pyarrow methods. See examples in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset. Defaults to None.

  • kwargs (Any) – Additional arguments to pass to the underlying Pandas “read_*()” method.

Raises:

ValueError – If multiple paths are provided for non-parquet formats.

Return type:

Union[DataFrame, Series]

Returns:

A Pandas DataFrame containing the selected dataset.
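The filters argument takes pyarrow-style predicates such as ("year", ">=", 2023). A local sketch of what such a predicate selects, applied with plain pandas instead of a GCS-backed parquet dataset (an illustrative stand-in, not the pyarrow code path itself):

```python
import operator

import pandas as pd

df = pd.DataFrame({"year": [2022, 2023, 2024], "value": [1, 2, 3]})

# The tuple shape read_pandas forwards to pyarrow for parquet row filtering.
filters = [("year", ">=", 2023)]

# Apply the same predicate with plain pandas to show the rows it selects.
ops = {">=": operator.ge, "<=": operator.le, "==": operator.eq}
col, op, val = filters[0]
selected = df[ops[op](df[col], val)]
print(list(selected["value"]))  # [2, 3]
```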

write_pandas(df, gcs_path, file_format='parquet', **kwargs)

Convenience method for writing a Pandas DataFrame to a given GCS path.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to write to file.

  • gcs_path (str) – The GCS path to the destination file. Must have an extension that corresponds to the file_format

  • file_format (str) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like to_json, to_csv, etc. Defaults to “parquet”.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas “to_*()” method.

Raises:
  • ValueError – If the file format is invalid.

  • ValueError – If the path does not have an extension that corresponds to the file format.

Return type:

None
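Per the Raises section above, write_pandas rejects paths whose extension does not correspond to file_format. A simplified sketch of that validation rule (the helper name and the exact mapping are assumptions — e.g. the real check presumably maps “excel” to .xlsx rather than requiring the literal format name):

```python
from pathlib import Path

def extension_matches(gcs_path: str, file_format: str) -> bool:
    """Hypothetical helper illustrating the extension check write_pandas performs.

    Simplified assumption: the suffix must equal the format name verbatim.
    """
    return Path(gcs_path).suffix.lstrip(".") == file_format

assert extension_matches("gs://bucket/data.parquet", "parquet")
assert not extension_matches("gs://bucket/data.csv", "parquet")  # would raise ValueError
```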

dapla.pubsub module

exception EmptyListError

Bases: Exception

Empty list error.

trigger_source_data_processing(project_id, source_name, folder_prefix, kuben=False)

Triggers a source data processing service with every file that has a given prefix.

Parameters:
  • project_id (str) – The ID of the Google Cloud project containing the Pub/Sub topic; this is normally the standard project.

  • source_name (str) – The name of the source that should process the files.

  • folder_prefix (str) – The folder prefix of the files to be processed.

  • kuben (bool) – Whether the team is on kuben or legacy.

Return type:

None