Reference

dapla package

Submodules

dapla.auth module

class AuthClient

Bases: object

Client for retrieving authentication information.

static fetch_email_from_credentials()

Retrieves the e-mail address associated with the current Google credentials. This may trigger a Google API call.

Return type:

Optional[str]

static fetch_google_credentials(force_token_exchange=False)

Fetches the Google credentials for the current user.

Parameters:

force_token_exchange (bool) – Forces authentication by token exchange.

Raises:

AuthError – If fetching credentials fails.

Return type:

Credentials

Returns:

The Google “Credentials” object.
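
For example, the credentials can be handed to a Google client library. A minimal sketch, assuming the google-cloud-storage package is installed:

>>> from dapla.auth import AuthClient
>>> from google.cloud import storage
>>>
>>> credentials = AuthClient.fetch_google_credentials()
>>> client = storage.Client(credentials=credentials)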

static fetch_google_token(request=None, scopes=None, from_jupyterhub=False)

Fetches the Google token for the current user.

The scopes argument is ignored, but is kept for compatibility with the Credentials refresh handler method signature.

Parameters:
  • request (Optional[Request]) – The GoogleAuthRequest object.

  • scopes (Optional[Sequence[str]]) – The scopes to request.

  • from_jupyterhub (bool) – Whether the Google token should be exchanged via JupyterHub. If false, the token is exchanged via an OIDC endpoint determined by OIDC_TOKEN_EXCHANGE_URL.

Raises:

AuthError – If the token exchange fails.

Return type:

tuple[str, datetime]

Returns:

A tuple of the Google token and its expiry.

static fetch_google_token_from_jupyter()

Fetches the personal access token for the current user.

Raises:

AuthError – If the token exchange request to JupyterHub fails.

Return type:

str

Returns:

The personal access token.

static fetch_google_token_from_oidc_exchange(request, _scopes)

Fetches the Google token by exchanging an OIDC token.

Parameters:
  • request (Request) – The GoogleAuthRequest object.

  • _scopes (Sequence[str]) – The scopes to request.

Raises:

AuthError – If the request to the OIDC token exchange endpoint fails.

Return type:

tuple[str, datetime]

Returns:

A tuple of (google-token, expiry).

static fetch_local_user_from_jupyter()

Retrieves user information, most notably access tokens for use in authentication.

Raises:

AuthError – If the request to the user endpoint fails.

Return type:

dict[str, Any]

Returns:

The user data from the JupyterHub user endpoint.

static fetch_personal_token()

If the Dapla Region is Dapla Lab, retrieves the OIDC (Keycloak) token from the environment.

Returns:

The OIDC token.

Return type:

str

Raises:

MissingConfigurationException – If the OIDC_TOKEN environment variable is missing or is not set.

If the Dapla Region is BIP, retrieves the Keycloak token from JupyterHub.

Returns:

The personal (Keycloak) token.

Return type:

str

Raises:

AuthError – If the token request to JupyterHub fails.

static get_dapla_region()

Gets the current Dapla Region, if one is configured.

Return type:

Optional[DaplaRegion]

exception AuthError

Bases: Exception

This exception class is used when the communication with the custom auth handler fails.

This is normally due to a stale auth session.

exception MissingConfigurationException(variable_name)

Bases: Exception

Exception raised when a required environment variable or configuration is missing.

Parameters:

variable_name (str)

Return type:

None

dapla.backports module

details(gcs_path)

Backported dapla function to support detailed list of files for a given GCS path.

Parameters:

gcs_path (str) – the GCS path for which you want a detailed file listing

Return type:

list[dict[str, str]]

Returns:

A list of dicts containing file details

show(gcs_path)

Backported dapla function to recursively show all folders below a given GCS path.

Parameters:

gcs_path (str) – the path from which you want to list all folders

Return type:

list[str]

Returns:

A simplified list of files or folders
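
A usage sketch; the bucket path and the exact dict keys are illustrative only:

>>> from dapla.backports import details, show
>>>
>>> show("gs://my-bucket/data")           # e.g. ['/data/2023', '/data/2024']
>>> details("gs://my-bucket/data/2024")   # a list of dicts with file details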

dapla.collector module

class CollectorClient(collector_url)

Bases: object

Client for working with DataCollector.

Parameters:

collector_url (str)

running_tasks()

Get all running collector tasks.

Return type:

Response

start(specification)

Start a new collector task.

Parameters:

specification (dict[str, Any]) – The JSON object of the collector specification

Returns:

The “requests.Response” object from the API call

Return type:

Response

stop(task_id)

Stop a running collector task.

Parameters:

task_id (int) – The id of the task to stop.

Returns:

The “requests.Response” object from the API call

Return type:

Response
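
A usage sketch; the collector URL and the specification contents are hypothetical:

>>> from dapla.collector import CollectorClient
>>>
>>> client = CollectorClient("https://collector.example.com/tasks")
>>> client.start({"specificationId": "my-spec"})  # hypothetical specification
>>> client.running_tasks()
>>> client.stop(42)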

dapla.converter module

class ConverterClient(converter_url)

Bases: object

Client for working with DataConverter.

Parameters:

converter_url (str)

get_job_summary(job_id)

Retrieve the execution summary for a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_report(job_id)

Get a report with details about how pseudonymization is being applied for a specific job.

The report includes:

  • matched pseudo rules

  • unmatched pseudo rules

  • pseudo rule to field match map – rule names and their corresponding matches

  • field matches to pseudo rule map – field names and the corresponding pseudo rule that covers each field

  • metrics

  • a textual schema hierarchy that illustrates how the pseudo rules are being applied

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

get_pseudo_schema(job_id)

Get hierarchical schema representation that details how pseudo rules are being applied.

This is a smaller version of the get_pseudo_report response.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call

start(job_config)

Schedule a new converter job.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

start_simulation(job_config)

Start a simulated converter job.

Useful for testing job configurations or diagnosing pseudonymization issues.

Parameters:

job_config (dict[str, Any]) – The JSON object of the job configuration

Return type:

Response

Returns:

The “requests.Response” object from the API call

stop_job(job_id)

Stop a specific converter job.

Parameters:

job_id (str) – The ID of the job

Return type:

Response

Returns:

The “requests.Response” object from the API call
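
A usage sketch; the converter URL, the job configuration contents, and the "jobId" response field are assumptions:

>>> from dapla.converter import ConverterClient
>>>
>>> client = ConverterClient("https://converter.example.com")
>>> job_config = {"source": "...", "target": "..."}  # hypothetical configuration
>>> response = client.start(job_config)
>>> job_id = response.json()["jobId"]                # hypothetical response field
>>> client.get_job_summary(job_id)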

dapla.doctor module

class Doctor

Bases: object

Class of functions that perform checks on Dapla.

Checks whether the user is authenticated, whether the Keycloak token is valid, and whether the user has access to GCS. Each check can be run individually, or all of them together via the ‘health’ method.

static bucket_access()

Checks whether user has access to a common google bucket.

Return type:

bool

static gcs_credentials_valid()

Checks whether the user’s Google Cloud Storage token is valid by accessing a GCS service.

Return type:

bool

classmethod health()

Runs a series of checks to determine the health of Dapla setup.

Return type:

None

static jupyterhub_auth_valid()

Checks whether the user is logged in and authenticated to JupyterHub or Dapla Lab.

Return type:

bool

static keycloak_token_valid()

Checks whether the Keycloak token is valid by attempting to access a Keycloak-protected service.

Return type:

bool
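
For example, to run all checks at once, or a single one:

>>> from dapla.doctor import Doctor
>>>
>>> Doctor.health()         # runs every check
>>> Doctor.bucket_access()  # a single check; returns True or False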

dapla.files module

class FileClient

Bases: object

Client for working with buckets and files on Google Cloud Storage.

This class should not be instantiated, only the static methods should be used.

static cat(gcs_path)

Get string content of a file from GCS.

Parameters:

gcs_path (str) – The GCS path to a file.

Return type:

str

Returns:

The UTF-8 decoded string content of the given file.

static gcs_open(gcs_path, mode='r')

Open a file in GCS. Works like Python’s built-in open().

Parameters:
  • gcs_path (str) – The GCS path to a file.

  • mode (str) – File open mode. Defaults to ‘r’

Return type:

TextIOWrapper | AbstractBufferedFile

Returns:

A file-like object.
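
A sketch of reading file contents both ways; the path is hypothetical:

>>> from dapla.files import FileClient
>>>
>>> content = FileClient.cat("gs://my-bucket/data/file.txt")
>>> with FileClient.gcs_open("gs://my-bucket/data/file.txt", mode="r") as f:
...     first_line = f.readline()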

static get_gcs_file_system(**kwargs)

Return a pythonic file system for Google Cloud Storage, initialized with a personal Google identity token.

Parameters:

kwargs (Any) – Additional arguments to pass to the underlying GCSFileSystem.

Return type:

GCSFileSystem

Returns:

A GCSFileSystem instance.

See https://gcsfs.readthedocs.io/en/latest for advanced usage.
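
A sketch; the bucket path is hypothetical:

>>> from dapla.files import FileClient
>>>
>>> fs = FileClient.get_gcs_file_system()
>>> fs.ls("gs://my-bucket/data")
>>> with fs.open("gs://my-bucket/data/file.json") as f:
...     raw = f.read()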

static get_versions(bucket_name, file_path)

Get all versions of a file in a bucket.

Parameters:
  • bucket_name (str) – Bucket name where the file is located.

  • file_path (str) – Path to the file.

Return type:

Any

Returns:

List of versions of the file.

static load_csv_to_pandas(gcs_path, **kwargs)

Reads a CSV file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_csv().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_json_to_pandas(gcs_path, **kwargs)

Reads a JSON file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_json().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static load_xml_to_pandas(gcs_path, **kwargs)

Reads an XML file from Google Cloud Storage into a Pandas DataFrame.

Parameters:
  • gcs_path (str) – The GCS path to a .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas read_xml().

Return type:

DataFrame

Returns:

A Pandas DataFrame.

static ls(gcs_path, detail=False, **kwargs)

List the contents of a GCS bucket path.

Parameters:
  • gcs_path (str) – The GCS path to a directory.

  • detail (bool) – Whether to return detailed information about the files.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘ls()’ method.

Return type:

Any

Returns:

List of strings if detail is False, or list of directory information dicts if detail is True.

static restore_version(source_bucket_name, source_file_name, source_generation_id, **kwargs)

Restores a soft-deleted or non-current (versioned) file to the live version.

Parameters:
  • source_bucket_name (str) – The source bucket name where the file is located.

  • source_file_name (str) – The name of the non-current file.

  • source_generation_id (str) – The generation ID of the non-current file version.

  • kwargs (Any) – Additional arguments to pass to the underlying ‘copy_blob()’ method.

Return type:

Any

Returns:

A new blob with a new generation ID.
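
A sketch of inspecting versions and restoring one; the bucket, file name, and generation ID are hypothetical:

>>> from dapla.files import FileClient
>>>
>>> versions = FileClient.get_versions("my-bucket", "data/file.csv")
>>> FileClient.restore_version(
...     source_bucket_name="my-bucket",
...     source_file_name="data/file.csv",
...     source_generation_id="1712345678901234",  # hypothetical generation ID
... )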

static save_pandas_to_csv(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a CSV file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .csv file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_csv().

Return type:

None

static save_pandas_to_json(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to a JSON file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .json file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_json().

Return type:

None

static save_pandas_to_xml(df, gcs_path, **kwargs)

Write the contents of a Pandas DataFrame to an XML file in a bucket.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to save to file.

  • gcs_path (str) – The GCS path to the destination .xml file.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas to_xml().

Return type:

None
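
A round trip with the Pandas helpers; paths and keyword arguments are illustrative:

>>> from dapla.files import FileClient
>>>
>>> df = FileClient.load_csv_to_pandas("gs://my-bucket/data/input.csv", sep=";")
>>> FileClient.save_pandas_to_csv(df, "gs://my-bucket/data/output.csv", index=False)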

dapla.gcs module

class GCSFileSystem(**kwargs)

Bases: GCSFileSystem

GCSFileSystem is a wrapper around gcsfs.GCSFileSystem.

isdir(path)

Check if path is a directory.

Return type:

bool

Parameters:

path (str)
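
A minimal sketch; the path is hypothetical:

>>> from dapla.gcs import GCSFileSystem
>>>
>>> fs = GCSFileSystem()
>>> fs.isdir("gs://my-bucket/data")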

dapla.git module

repo_root_dir(directory=None)

Find the root directory of a git repo, searching upwards from a given path.

Parameters:

directory (Union[Path, str, None]) – The path to search from, defaults to the current working directory. The directory can be of type string or of type pathlib.Path.

Return type:

Path

Returns:

Path to the git repo’s root directory.

Raises:

RuntimeError – If no .git directory is found when searching upwards.

Example:

>>> import dapla as dp
>>> import tomli
>>>
>>> config_file = dp.repo_root_dir() / "pyproject.toml"
>>> with open(config_file, mode="rb") as fp:
...     config = tomli.load(fp)

dapla.guardian module

class GuardianClient

Bases: object

Client for interacting with the Maskinporten Guardian.

static call_api(api_endpoint_url, maskinporten_client_id, scopes, guardian_endpoint_url='http://maskinporten-guardian.dapla.svc.cluster.local/maskinporten/access-token', keycloak_token=None)

Call an external API using Maskinporten Guardian.

Parameters:
  • api_endpoint_url (str) – URL to the target API

  • maskinporten_client_id (str) – the Maskinporten client id

  • scopes (str) – the Maskinporten scopes

  • guardian_endpoint_url (str) – URL to the Maskinporten Guardian

  • keycloak_token (Optional[str]) – the user’s personal Keycloak token. Automatic fetch attempt will be made if left empty.

Raises:

RuntimeError – If the API call fails

Return type:

Any

Returns:

The endpoint json response

static get_guardian_token(guardian_endpoint, keycloak_token, body)

Retrieve access token from Maskinporten Guardian.

Parameters:
  • guardian_endpoint (str) – URL to the maskinporten guardian

  • keycloak_token (str) – the user’s Keycloak token

  • body (dict[str, str]) – maskinporten request body

Raises:

RuntimeError – If the Guardian token request fails

Return type:

str

Returns:

The maskinporten access token
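
A usage sketch; the target API URL, client id, and scope are hypothetical:

>>> from dapla.guardian import GuardianClient
>>>
>>> response = GuardianClient.call_api(
...     api_endpoint_url="https://api.example.com/resource",
...     maskinporten_client_id="my-client-id",
...     scopes="some:scope",
... )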

dapla.jupyterhub module

generate_api_token(expires_in=3600, description='Generated API token from Dapla Toolbelt')

Generate a new API token for the logged-in JupyterHub user.

Such tokens can be used by third party applications to connect to JupyterHub running remotely, for example from IDEs like VSCode or PyCharm.

Parameters:
  • expires_in (int) – number of seconds until the token expires

  • description (str) – optional description of the token

Return type:

dict[str, str]

Returns:

A dict that contains the token value and the token URL.
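
A usage sketch; the key names in the returned dict are assumptions:

>>> from dapla.jupyterhub import generate_api_token
>>>
>>> token_info = generate_api_token(expires_in=7200, description="Token for my IDE")
>>> token_info["token"]  # hypothetical dict key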

dapla.pandas module

class SupportedFileFormat(value)

Bases: Enum

A collection of supported file formats.

CSV = 'csv'
EXCEL = 'excel'
FWF = 'fwf'
JSON = 'json'
PARQUET = 'parquet'
SAS7BDAT = 'sas7bdat'
XML = 'xml'
read_pandas(gcs_path, file_format='parquet', columns=None, filters=None, **kwargs)

Convenience method for reading a dataset from a given GCS path and converting it to a Pandas DataFrame.

Parameters:
  • gcs_path (str | list[str]) – Path or paths to the directory or file you want to get the contents of. Multiple paths are supported when reading the parquet format.

  • file_format (Optional[str]) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like read_json, read_csv, etc. Defaults to “parquet”.

  • columns (Optional[list[str]]) – Choose specific columns to read. Defaults to None.

  • filters (Union[list[tuple[Any] | list[tuple[Any]]], Expression, None]) – Add row filter to process when reading parquet. The filter should follow pyarrow methods. See examples in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset. Defaults to None.

  • kwargs (Any) – Additional arguments to pass to the underlying Pandas “read_*()” method.

Raises:

ValueError – If multiple paths are provided for non-parquet formats.

Return type:

Union[DataFrame, Series]

Returns:

A Pandas DataFrame containing the selected dataset.

write_pandas(df, gcs_path, file_format='parquet', **kwargs)

Convenience method for writing a Pandas DataFrame to a given GCS path.

Parameters:
  • df (DataFrame) – The Pandas DataFrame to write to file.

  • gcs_path (str) – The GCS path to the destination file. Must have an extension that corresponds to the file_format

  • file_format (str) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like to_json, to_csv, etc. Defaults to “parquet”.

  • **kwargs (Any) – Additional arguments to pass to the underlying Pandas “to_*()” method.

Raises:
  • ValueError – If the file format is invalid.

  • ValueError – If the path does not have an extension that corresponds to the file format.

Return type:

None
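
A read/write round trip; the paths, columns, and filter are illustrative:

>>> import dapla as dp
>>>
>>> df = dp.read_pandas(
...     "gs://my-bucket/dataset",
...     columns=["id", "value"],
...     filters=[("year", "=", 2024)],
... )
>>> dp.write_pandas(df, "gs://my-bucket/output/result.parquet")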

dapla.pubsub module

exception EmptyListError

Bases: Exception

Empty list error.

trigger_source_data_processing(project_id, source_name, folder_prefix, kuben=False)

Triggers a source data processing service with every file that has a given prefix.

Parameters:
  • project_id (str) – The ID of the Google Cloud project containing the Pub/Sub topic. This is normally the standard project.

  • source_name (str) – The name of the source that should process the files.

  • folder_prefix (str) – The folder prefix of the files to be processed.

  • kuben (bool) – Whether the team is on kuben (as opposed to legacy).

Return type:

None
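
A usage sketch; the project, source name, and prefix are hypothetical:

>>> from dapla.pubsub import trigger_source_data_processing
>>>
>>> trigger_source_data_processing(
...     project_id="my-project-id",
...     source_name="my-source",
...     folder_prefix="inndata/2024",
... )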