Reference

dapla package

Subpackages

dapla.backports module

details(gcs_path)

Backported dapla function to support a detailed listing of files for a given GCS path.

Parameters:

gcs_path (str) – the path for which you want to list file details

Return type:

list[dict[str, str]]

Returns:

A list of dicts containing file details

show(gcs_path)

Backported dapla function to recursively show all folders below a given GCS path.

Parameters:

gcs_path (str) – the path from which you want to list all folders

Return type:

list[str]

Returns:

A simplified list of files or folders
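The backported helpers behave like familiar recursive directory listings. As a rough local analogue (using pathlib on a temporary directory instead of GCS — an illustrative stand-in, not the real implementation):

```python
# Local analogue of dapla.backports.show(): recursively list everything below a
# root path. The real function performs the equivalent listing against GCS.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "2023" / "01").mkdir(parents=True)
(root / "2024").mkdir()

# Simplified list of folders, comparable in spirit to what show() returns.
folders = sorted(
    p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
)
print(folders)  # ['2023', '2023/01', '2024']
```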

dapla.collector module

class CollectorClient(collector_url)

Bases: object

Client for working with DataCollector.

Parameters:

collector_url (str)

running_tasks()

Get all running collector tasks.

Return type:

Response

start(specification)

Start a new collector task.

Parameters:

specification (dict[str, Any]) – The JSON object of the collector specification

Returns:

The “requests.Response” object from the API call

Return type:

Response

stop(task_id)

Stop a running collector task.

Parameters:

task_id (int) – The id of the task to stop.

Returns:

The “requests.Response” object from the API call

Return type:

Response

dapla.converter module

class ConverterClient(converter_url)

Bases: object

Client for working with DataConverter.

Parameters:

converter_url (str)

get_job_summary(job_id)

Retrieve the execution summary for a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_report(job_id)

Get a report with details about how pseudonymization is being applied for a specific job.

The report includes:
  • matched pseudo rules

  • unmatched pseudo rules

  • pseudo rule to field match map – rule names and their corresponding matches

  • field matches to pseudo rule map – field names and the corresponding pseudo rule that covers each field

  • metrics

  • a textual schema hierarchy that illustrates how the pseudo rules are being applied

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_schema(job_id)

Get hierarchical schema representation that details how pseudo rules are being applied.

This is a smaller version of get_pseudo_report.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

start(job_config)

Schedule a new converter job.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

start_simulation(job_config)

Start a simulated converter job.

Useful for testing job configurations or diagnosing pseudonymization issues.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

stop_job(job_id)

Stop a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

dapla.doctor module

class Doctor

Bases: object

Class of functions that perform checks on Dapla.

Checks whether the user is authenticated, whether the Keycloak token is valid, and whether the user has access to GCS. Each method can be run individually or collectively with the ‘health’ method.

static bucket_access()

Checks whether the user has access to a common Google Cloud Storage bucket.

Return type:

bool

static gcs_credentials_valid()

Checks whether the user’s Google Cloud Storage token is valid by accessing a GCS service.

Return type:

bool

classmethod health()

Runs a series of checks to determine the health of the Dapla setup.

Return type:

None

static keycloak_token_valid()

Checks whether the Keycloak token is valid by attempting to access a Keycloak-token-protected service.

Return type:

bool
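The individual checks all return booleans, and health() runs them collectively. A minimal sketch of that aggregate-and-report pattern, with stub functions standing in for the real methods (the stubs are assumptions for illustration; the real checks contact Keycloak and GCS):

```python
# Sketch of the check-aggregation pattern behind Doctor.health(). Each check
# returns a bool; an aggregator runs them all and reports pass/fail.
# These stubs always pass; the real checks call live services.
def keycloak_token_valid() -> bool:
    return True  # stand-in for Doctor.keycloak_token_valid()

def gcs_credentials_valid() -> bool:
    return True  # stand-in for Doctor.gcs_credentials_valid()

def bucket_access() -> bool:
    return True  # stand-in for Doctor.bucket_access()

checks = {
    "keycloak token": keycloak_token_valid,
    "gcs credentials": gcs_credentials_valid,
    "bucket access": bucket_access,
}
results = {name: check() for name, check in checks.items()}
for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'FAILED'}")
```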

dapla.files module

class FileClient

Bases: object

Client for working with buckets and files on Google Cloud Storage.

This class should not be instantiated, only the static methods should be used.

static cat(gcs_path)

Get string content of a file from GCS.

Parameters:

gcs_path (str) – The GCS path to a file.

Return type:

str

Returns:

utf-8 decoded string content of the given file

static gcs_open(gcs_path, mode='r')

Open a file in GCS; works like regular Python open().

Parameters:
  • gcs_path (str) – The GCS path to a file.

  • mode (str) – File open mode. Defaults to ‘r’

Return type:

TextIOWrapper | AbstractBufferedFile

Returns:

A file-like object.

static get_gcs_file_system(**kwargs)

Return a pythonic file-system for Google Cloud Storage - initialized with a personal Google Identity token.

Parameters:

kwargs (Any) – Additional arguments to pass to the underlying GCSFileSystem.

Return type:

GCSFileSystem

Returns:

A GCSFileSystem instance.

See https://gcsfs.readthedocs.io/en/latest for advanced usage

static get_versions(bucket_name, file_path)

Get all versions of a file in a bucket.

Parameters:
  • bucket_name (str) – Bucket name where the file is located.

  • file_path (str) – Path to the file.

Return type:

Any

Returns:

List of versions of the file.

static load_csv_to_pandas(gcs_path, **kwargs)

Reads a CSV file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_csv().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_json_to_pandas(gcs_path, **kwargs)

Reads a JSON file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_json().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_xml_to_pandas(gcs_path, **kwargs)

Reads an XML file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_xml().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static ls(gcs_path, detail=False, **kwargs)

List the contents of a GCS bucket path.

Parameters:
  • gcs_path (str) – The GCS path to a directory.

  • detail (bool) – Whether to return detailed information about the files.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘ls()’ method.

Return type:

Any

Returns:

List of strings if detail is False, or list of directory information dicts if detail is True.

static restore_version(source_bucket_name, source_file_name, source_generation_id, **kwargs)

Restores a soft-deleted or non-current (object-versioned) file to the live version.

Parameters:
  • source_bucket_name (str) – source bucket name where the file is located.

  • source_file_name (str) – non-current file name.

  • source_generation_id (str) – generation_id of the non-current file version.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘copy_blob()’ method.

Return type:

Any

Returns:

A new blob with new generation id.

static save_pandas_to_csv(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a CSV file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_csv().

Return type:

None

static save_pandas_to_json(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a JSON file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_json().

Return type:

None

static save_pandas_to_xml(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to an XML file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_xml().

Return type:

None
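The load_*/save_* helpers delegate to the corresponding pandas readers and writers, so keyword arguments pass straight through. A local sketch of that delegation, using a temporary local file instead of a gs:// path (the path is illustrative; the real methods take GCS paths):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
path = os.path.join(tempfile.mkdtemp(), "data.csv")

# save_pandas_to_csv forwards kwargs such as sep and index to DataFrame.to_csv().
df.to_csv(path, sep=";", index=False)

# load_csv_to_pandas forwards kwargs such as sep to pandas.read_csv().
df2 = pd.read_csv(path, sep=";")

assert df2.equals(df)
```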

dapla.gcs module

class GCSFileSystem(**kwargs)

Bases: GCSFileSystem

GCSFileSystem is a wrapper around gcsfs.GCSFileSystem.

isdir(path)

Check if path is a directory.

Return type:

bool

Parameters:

path (str)

dapla.git module

repo_root_dir(directory=None)

Find the root directory of a git repo, searching upwards from a given path.

Parameters:

directory (Union[Path, str, None]) – The path to search from, defaults to the current working directory. The directory can be of type string or of type pathlib.Path.

Return type:

Path

Returns:

Path to the git repo’s root directory.

Raises:

RuntimeError – If no .git directory is found when searching upwards.

Example:

>>> import dapla as dp
>>> import tomli
>>>
>>> config_file = dp.repo_root_dir() / "pyproject.toml"
>>> with open(config_file, mode="rb") as fp:
...     config = tomli.load(fp)

dapla.guardian module

class GuardianClient

Bases: object

Client for interacting with the Maskinporten Guardian.

static call_api(api_endpoint_url, maskinporten_client_id, scopes, keycloak_token=None)

Call an external API using Maskinporten Guardian.

Parameters:
  • api_endpoint_url (str) – URL to the target API

  • maskinporten_client_id (str) – the Maskinporten client id

  • scopes (str) – the Maskinporten scopes

  • keycloak_token (Optional[str]) – the user’s personal Keycloak token. An automatic fetch will be attempted if left empty.

Raises:

RuntimeError – If the API call fails

Return type:

Any

Returns:

The endpoint json response

static get_guardian_token(guardian_endpoint, keycloak_token, body)

Retrieve access token from Maskinporten Guardian.

Parameters:
  • guardian_endpoint (str) – URL to the Maskinporten Guardian

  • keycloak_token (str) – the user’s Keycloak token

  • body (dict[str, str]) – the Maskinporten request body

Raises:

RuntimeError – If the Guardian token request fails

Return type:

str

Returns:

The Maskinporten access token

static get_guardian_url()

Get the Guardian URL for the current environment.

Return type:

str

dapla.pandas module

class SupportedFileFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

A collection of supported file formats.

CSV = 'csv'
EXCEL = 'excel'
FWF = 'fwf'
JSON = 'json'
PARQUET = 'parquet'
SAS7BDAT = 'sas7bdat'
XML = 'xml'

read_pandas(gcs_path, file_format='parquet', columns=None, filters=None, **kwargs)

Convenience method for reading a dataset from a given GCS path and converting it to a Pandas DataFrame.

Parameters:
  • gcs_path (str | list[str]) – Path or paths to the directory or file you want to read. Multiple paths are supported when reading the parquet format.

  • file_format (Optional[str]) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like read_json, read_csv, etc. Defaults to “parquet”.

  • columns (Optional[list[str]]) – Choose specific columns to read. Defaults to None.

  • filters (Union[list[tuple[Any] | list[tuple[Any]]], Expression, None]) – Add row filter to process when reading parquet. The filter should follow pyarrow methods. See examples in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset. Defaults to None.

  • kwargs (Any) – Additional arguments to pass to the underlying Pandas “read_*()” method.

Raises:

ValueError – If multiple paths are provided for non-parquet formats.

Return type:

Union[DataFrame, Series]

Returns:

A Pandas DataFrame containing the selected dataset.
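The filters argument takes pyarrow-style predicates such as ("year", ">=", 2023). A local sketch of what such a predicate selects, applied with plain pandas instead of a GCS-backed parquet dataset (an illustrative stand-in, not the pyarrow code path itself):

```python
import operator

import pandas as pd

df = pd.DataFrame({"year": [2022, 2023, 2024], "value": [1, 2, 3]})

# The tuple shape read_pandas forwards to pyarrow for parquet row filtering.
filters = [("year", ">=", 2023)]

# Apply the same predicate with plain pandas to show the rows it selects.
ops = {">=": operator.ge, "<=": operator.le, "==": operator.eq}
col, op, val = filters[0]
selected = df[ops[op](df[col], val)]
print(list(selected["value"]))  # [2, 3]
```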

write_pandas(df, gcs_path, file_format='parquet', **kwargs)

Convenience method for writing a Pandas DataFrame to a given GCS path.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to write to file.

  • gcs_path (str) – The GCS path to the destination file. Must have an extension that corresponds to the file_format

  • file_format (str) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like to_json, to_csv, etc. Defaults to “parquet”.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas “to_*()” method.

Raises:
  • ValueError – If the file format is invalid.

  • ValueError – If the path does not have an extension that corresponds to the file format.

Return type:

None
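Per the Raises section above, write_pandas rejects paths whose extension does not correspond to file_format. A simplified sketch of that validation rule (the helper name and the exact mapping are assumptions — e.g. the real check presumably maps “excel” to .xlsx rather than requiring the literal format name):

```python
from pathlib import Path

def extension_matches(gcs_path: str, file_format: str) -> bool:
    """Hypothetical helper illustrating the extension check write_pandas performs.

    Simplified assumption: the suffix must equal the format name verbatim.
    """
    return Path(gcs_path).suffix.lstrip(".") == file_format

assert extension_matches("gs://bucket/data.parquet", "parquet")
assert not extension_matches("gs://bucket/data.csv", "parquet")  # would raise ValueError
```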

dapla.pubsub module

exception EmptyListError

Bases: Exception

Empty list error.

trigger_source_data_processing(project_id, source_name, folder_prefix, kuben=False)

Triggers a source data processing service with every file that has a given prefix.

Parameters:
  • project_id (str) – The ID of the Google Cloud project containing the Pub/Sub topic; this is normally the standard project.

  • source_name (str) – The name of the source that should process the files.

  • folder_prefix (str) – The folder prefix of the files to be processed.

  • kuben (bool) – Whether the team is on kuben or legacy.

Return type:

None