Reference¶
dapla package¶
Subpackages¶
dapla.auth module¶
- class AuthClient¶
Bases: object
Client for retrieving authentication information.
- static fetch_email_from_credentials()¶
Retrieves an e-mail based on current Google Credentials. Potentially makes a Google API call.
- Return type:
Optional[str]
- static fetch_google_credentials(force_token_exchange=False)¶
Fetches the Google credentials for the current user.
- Parameters:
force_token_exchange (bool) – Forces authentication by token exchange.
- Raises:
AuthError – If fetching credentials fails.
- Return type:
Credentials
- Returns:
The Google “Credentials” object.
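A minimal usage sketch: the fetched credentials can be handed to any Google client library. The google-cloud-storage client below is only an illustration.
>>> from dapla import AuthClient
>>> from google.cloud import storage
>>>
>>> credentials = AuthClient.fetch_google_credentials()
>>> client = storage.Client(credentials=credentials)  # pass to any Google client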
- static fetch_google_token(request=None, scopes=None, from_jupyterhub=False)¶
Fetches the Google token for the current user.
The scopes argument is ignored but kept for compatibility with the Credentials refresh handler method signature.
- Parameters:
request (Optional[Request]) – The GoogleAuthRequest object.
scopes (Optional[Sequence[str]]) – The scopes to request.
from_jupyterhub (bool) – Whether the Google token should be exchanged from JupyterHub. If false, exchange from an OIDC endpoint decided by OIDC_TOKEN_EXCHANGE_URL.
- Raises:
AuthError – If the token exchange fails.
- Return type:
tuple[str, datetime]
- Returns:
The Google token.
- static fetch_google_token_from_jupyter()¶
Fetches the personal access token for the current user.
- Raises:
AuthError – If the token exchange request to JupyterHub fails.
- Return type:
str
- Returns:
The personal access token.
- static fetch_google_token_from_oidc_exchange(request, _scopes)¶
Fetches the Google token by exchanging an OIDC token.
- Parameters:
request (Request) – The GoogleAuthRequest object.
_scopes (Sequence[str]) – The scopes to request.
- Raises:
AuthError – If the request to the OIDC token exchange endpoint fails.
- Return type:
tuple[str, datetime]
- Returns:
A tuple of (google-token, expiry).
- static fetch_local_user_from_jupyter()¶
Retrieves user information, most notably access tokens for use in authentication.
- Raises:
AuthError – If the request to the user endpoint fails.
- Return type:
dict[str, Any]
- Returns:
The user data from the token.
- static fetch_personal_token()¶
If Dapla Region is Dapla Lab, retrieve the OIDC token/Keycloak token from the environment.
- Returns:
The OIDC token.
- Return type:
str
- Raises:
MissingConfigurationException – If the OIDC_TOKEN environment variable is missing or not set.
If Dapla Region is BIP, retrieve the Keycloak token from JupyterHub.
- Returns:
personal/keycloak token.
- Return type:
str
- Raises:
AuthError – If the token request fails.
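A minimal sketch of how the personal token might be used, for instance as a bearer token in a manual HTTP call (the header construction is illustrative):
>>> from dapla import AuthClient
>>>
>>> token = AuthClient.fetch_personal_token()
>>> headers = {"Authorization": f"Bearer {token}"}  # attach to a request that expects the Keycloak token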
- static get_dapla_region()¶
Gets the current Dapla Region, if it can be determined.
- Return type:
Optional[DaplaRegion]
- exception AuthError¶
Bases: Exception
This exception class is used when the communication with the custom auth handler fails.
This is normally due to a stale auth session.
- exception MissingConfigurationException(variable_name)¶
Bases: Exception
Exception raised when a required environment variable or configuration is missing.
- Parameters:
variable_name (str)
- Return type:
None
dapla.backports module¶
- details(gcs_path)¶
Backported dapla function to support a detailed list of files for a given GCS path.
- Parameters:
gcs_path (str) – the path from which you want to list all files
- Return type:
list[dict[str, str]]
- Returns:
A list of dicts containing file details
- show(gcs_path)¶
Backported dapla function to recursively show all folders below a given GCS path.
- Parameters:
gcs_path (str) – the path from which you want to list all folders
- Return type:
list[str]
- Returns:
A simplified list of files or folders
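A minimal sketch of the two backported functions; the bucket and folder names are illustrative, and the package-level re-exports are assumed to work as in the dapla.git example below:
>>> import dapla as dp
>>>
>>> folders = dp.show("gs://my-bucket/folder")   # recursive list of folders below the path
>>> files = dp.details("gs://my-bucket/folder")  # list of dicts with per-file details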
dapla.collector module¶
- class CollectorClient(collector_url)¶
Bases: object
Client for working with DataCollector.
- Parameters:
collector_url (str)
- running_tasks()¶
Get all running collector tasks.
- Return type:
Response
- start(specification)¶
Start a new collector task.
- Parameters:
specification (dict[str, Any]) – The JSON object of the collector specification
- Returns:
The “requests.Response” object from the API call
- Return type:
Response
- stop(task_id)¶
Stop a running collector task.
- Parameters:
task_id (int) – The id of the task to stop.
- Returns:
The “requests.Response” object from the API call
- Return type:
Response
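A minimal usage sketch; the collector URL, the my_specification dict, and the task id are illustrative placeholders:
>>> from dapla import CollectorClient
>>>
>>> client = CollectorClient("https://collector.example.com")
>>> response = client.start(my_specification)  # my_specification: dict with the collector specification
>>> client.running_tasks().json()              # inspect currently running tasks
>>> client.stop(task_id=123)                   # stop a task by id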
dapla.converter module¶
- class ConverterClient(converter_url)¶
Bases: object
Client for working with DataConverter.
- Parameters:
converter_url (str)
- get_job_summary(job_id)¶
Retrieve the execution summary for a specific converter job.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- get_pseudo_report(job_id)¶
Get a report with details about how pseudonymization is being applied for a specific job.
The report includes:
- matched pseudo rules
- unmatched pseudo rules
- pseudo rule to field match map - rule names and their corresponding matches
- field matches to pseudo rule map - field names and the corresponding pseudo rule that covers each field
- metrics
- a textual schema hierarchy that illustrates how the pseudo rules are applied
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- get_pseudo_schema(job_id)¶
Get hierarchical schema representation that details how pseudo rules are being applied.
This is a smaller version of get_pseudo_report.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- start(job_config)¶
Schedule a new converter job.
- Parameters:
job_config (dict[str, Any]) – The JSON object of the job configuration
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- start_simulation(job_config)¶
Start a simulated converter job.
Useful for testing job configurations or diagnosing pseudonymization issues.
- Parameters:
job_config (dict[str, Any]) – The JSON object of the job configuration
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- stop_job(job_id)¶
Stop a specific converter job.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
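A minimal usage sketch; the converter URL, the job_config dict, and job_id are illustrative placeholders:
>>> from dapla import ConverterClient
>>>
>>> client = ConverterClient("https://converter.example.com")
>>> response = client.start(job_config)       # job_config: dict with the job configuration
>>> summary = client.get_job_summary(job_id)  # job_id: the ID of a scheduled job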
dapla.doctor module¶
- class Doctor¶
Bases: object
Class of functions that perform checks on Dapla.
Checks whether the user is authenticated, whether the Keycloak token is valid, and whether the user has access to GCS. Each method can be run individually or collectively with the ‘health’ method.
- static bucket_access()¶
Checks whether the user has access to a common Google bucket.
- Return type:
bool
- static gcs_credentials_valid()¶
Checks whether the user’s Google Cloud Storage token is valid by accessing a GCS service.
- Return type:
bool
- classmethod health()¶
Runs a series of checks to determine the health of the Dapla setup.
- Return type:
None
- static jupyterhub_auth_valid()¶
Checks whether the user is logged in and authenticated to JupyterHub or Dapla Lab.
- Return type:
bool
- static keycloak_token_valid()¶
Checks whether the Keycloak token is valid by attempting to access a Keycloak-token protected service.
- Return type:
bool
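The checks can be run collectively or individually; a minimal sketch, assuming Doctor is importable from the package root like the other clients:
>>> from dapla import Doctor
>>>
>>> Doctor.health()         # run all checks
>>> Doctor.bucket_access()  # or run a single check; returns a bool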
dapla.files module¶
- class FileClient¶
Bases: object
Client for working with buckets and files on Google Cloud Storage.
This class should not be instantiated; only the static methods should be used.
- static cat(gcs_path)¶
Get string content of a file from GCS.
- Parameters:
gcs_path (str) – The GCS path to a file.
- Return type:
str
- Returns:
utf-8 decoded string content of the given file
- static gcs_open(gcs_path, mode='r')¶
Open a file in GCS; works like the regular Python open().
- Parameters:
gcs_path (str) – The GCS path to a file.
mode (str) – File open mode. Defaults to ‘r’.
- Return type:
TextIOWrapper | AbstractBufferedFile
- Returns:
A file-like object.
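A minimal sketch, analogous to the built-in open(); the path is illustrative:
>>> from dapla import FileClient
>>>
>>> with FileClient.gcs_open("gs://my-bucket/data/notes.txt") as f:
...     content = f.read()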
- static get_gcs_file_system(**kwargs)¶
Return a pythonic file-system for Google Cloud Storage - initialized with a personal Google Identity token.
- Parameters:
kwargs (Any) – Additional arguments to pass to the underlying GCSFileSystem.
- Return type:
GCSFileSystem
- Returns:
A GCSFileSystem instance.
See https://gcsfs.readthedocs.io/en/latest for advanced usage
- static get_versions(bucket_name, file_path)¶
Get all versions of a file in a bucket.
- Parameters:
bucket_name (str) – Bucket name where the file is located.
file_path (str) – Path to the file.
- Return type:
Any
- Returns:
List of versions of the file.
- static load_csv_to_pandas(gcs_path, **kwargs)¶
Reads a CSV file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .csv file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_csv().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
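A minimal sketch; the path is illustrative and the sep keyword is simply forwarded to Pandas read_csv():
>>> from dapla import FileClient
>>>
>>> df = FileClient.load_csv_to_pandas("gs://my-bucket/data.csv", sep=";")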
- static load_json_to_pandas(gcs_path, **kwargs)¶
Reads a JSON file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .json file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_json().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
- static load_xml_to_pandas(gcs_path, **kwargs)¶
Reads an XML file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .xml file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_xml().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
- static ls(gcs_path, detail=False, **kwargs)¶
List the contents of a GCS bucket path.
- Parameters:
gcs_path (str) – The GCS path to a directory.
detail (bool) – Whether to return detailed information about the files.
kwargs (Any) – Additional arguments to pass to the underlying ‘ls()’ method.
- Return type:
Any
- Returns:
List of strings if detail is False, or list of directory information dicts if detail is True.
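A minimal sketch of both listing modes; the path is illustrative:
>>> from dapla import FileClient
>>>
>>> names = FileClient.ls("gs://my-bucket/folder")               # list of paths
>>> infos = FileClient.ls("gs://my-bucket/folder", detail=True)  # list of info dicts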
- static restore_version(source_bucket_name, source_file_name, source_generation_id, **kwargs)¶
Restores a soft-deleted or versioned (non-current) version of a file to the live version.
- Parameters:
source_bucket_name (str) – source bucket name where the file is located.
source_file_name (str) – non-current file name.
source_generation_id (str) – generation id of the non-current version.
kwargs (Any) – Additional arguments to pass to the underlying ‘copy_blob()’ method.
- Return type:
Any
- Returns:
A new blob with new generation id.
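A minimal sketch pairing get_versions with restore_version; the bucket, file path, and generation id are illustrative:
>>> from dapla import FileClient
>>>
>>> versions = FileClient.get_versions("my-bucket", "folder/data.parquet")
>>> FileClient.restore_version(
...     source_bucket_name="my-bucket",
...     source_file_name="folder/data.parquet",
...     source_generation_id="1700000000000000",
... )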
- static save_pandas_to_csv(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to a CSV file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .csv file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_csv().
- Return type:
None
- static save_pandas_to_json(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to a JSON file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .json file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_json().
- Return type:
None
- static save_pandas_to_xml(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to an XML file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .xml file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_xml().
- Return type:
None
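A minimal sketch for the save_pandas_to_* family; the destination path is illustrative and index=False is simply forwarded to Pandas to_csv():
>>> import pandas as pd
>>> from dapla import FileClient
>>>
>>> df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
>>> FileClient.save_pandas_to_csv(df, "gs://my-bucket/out/data.csv", index=False)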
dapla.gcs module¶
dapla.git module¶
- repo_root_dir(directory=None)¶
Find the root directory of a git repo, searching upwards from a given path.
- Parameters:
directory (Union[Path, str, None]) – The path to search from, defaults to the current working directory. The directory can be of type string or of type pathlib.Path.
- Return type:
Path
- Returns:
Path to the git repo’s root directory.
- Raises:
RuntimeError – If no .git directory is found when searching upwards.
Example:¶
>>> import dapla as dp
>>> import tomli
>>>
>>> config_file = dp.repo_root_dir() / "pyproject.toml"
>>> with open(config_file, mode="rb") as fp:
...     config = tomli.load(fp)
dapla.guardian module¶
- class GuardianClient¶
Bases: object
Client for interacting with the Maskinporten Guardian.
- static call_api(api_endpoint_url, maskinporten_client_id, scopes, guardian_endpoint_url='http://maskinporten-guardian.dapla.svc.cluster.local/maskinporten/access-token', keycloak_token=None)¶
Call an external API using Maskinporten Guardian.
- Parameters:
api_endpoint_url (str) – URL to the target API
maskinporten_client_id (str) – the Maskinporten client id
scopes (str) – the Maskinporten scopes
guardian_endpoint_url (str) – URL to the Maskinporten Guardian
keycloak_token (Optional[str]) – the user’s personal Keycloak token. Automatic fetch attempt will be made if left empty.
- Raises:
RuntimeError – If the API call fails
- Return type:
Any
- Returns:
The endpoint json response
- static get_guardian_token(guardian_endpoint, keycloak_token, body)¶
Retrieve access token from Maskinporten Guardian.
- Parameters:
guardian_endpoint (str) – URL to the Maskinporten Guardian
keycloak_token (str) – the user’s Keycloak token
body (dict[str, str]) – Maskinporten request body
- Raises:
RuntimeError – If the Guardian token request fails
- Return type:
str
- Returns:
The maskinporten access token
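A minimal sketch of call_api; the endpoint URL, client id, and scope are illustrative placeholders, and the Keycloak token is fetched automatically when omitted:
>>> from dapla import GuardianClient
>>>
>>> data = GuardianClient.call_api(
...     "https://api.example.com/resource",
...     maskinporten_client_id="my-client-id",
...     scopes="example:scope",
... )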
dapla.jupyterhub module¶
- generate_api_token(expires_in=3600, description='Generated API token from Dapla Toolbelt')¶
Generate a new API token for the logged-in JupyterHub user.
Such tokens can be used by third-party applications to connect to JupyterHub running remotely. Examples are IDEs like VSCode or PyCharm.
- Parameters:
expires_in (int) – number of seconds until the token expires
description (str) – optional description of the token
- Return type:
dict[str, str]
- Returns:
a dict that contains the token value and token URL
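A minimal sketch, assuming the function is re-exported at package level; the description is illustrative:
>>> from dapla import generate_api_token
>>>
>>> token_info = generate_api_token(expires_in=600, description="Token for my IDE")
>>> # token_info is a dict holding the token value and the token URL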
dapla.pandas module¶
- class SupportedFileFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases: Enum
A collection of supported file formats.
- CSV = 'csv'¶
- EXCEL = 'excel'¶
- FWF = 'fwf'¶
- JSON = 'json'¶
- PARQUET = 'parquet'¶
- SAS7BDAT = 'sas7bdat'¶
- XML = 'xml'¶
- read_pandas(gcs_path, file_format='parquet', columns=None, filters=None, **kwargs)¶
Convenience method for reading a dataset from a given GCS path and converting it to a Pandas dataframe.
- Parameters:
gcs_path (str | list[str]) – Path or paths to the directory or file you want to get the contents of. Supports multiple paths if reading parquet format.
file_format (Optional[str]) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like read_json, read_csv, etc. Defaults to “parquet”.
columns (Optional[list[str]]) – Choose specific columns to read. Defaults to None.
filters (Union[list[tuple[Any] | list[tuple[Any]]], Expression, None]) – Add row filter to process when reading parquet. The filter should follow pyarrow methods. See examples in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset. Defaults to None.
kwargs (Any) – Additional arguments to pass to the underlying Pandas “read_*()” method.
- Raises:
ValueError – If multiple paths are provided for non-parquet formats.
- Return type:
Union[DataFrame, Series]
- Returns:
A Pandas DataFrame containing the selected dataset.
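A minimal sketch reading a parquet dataset with column selection and a pyarrow-style row filter; the path, column names, and filter are illustrative:
>>> import dapla as dp
>>>
>>> df = dp.read_pandas(
...     "gs://my-bucket/dataset",
...     columns=["id", "value"],
...     filters=[("year", "=", 2024)],
... )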
- write_pandas(df, gcs_path, file_format='parquet', **kwargs)¶
Convenience method for writing a Pandas DataFrame to a given GCS path.
- Parameters:
df (DataFrame) – The Pandas DataFrame to write to file.
gcs_path (str) – The GCS path to the destination file. Must have an extension that corresponds to the file_format.
file_format (str) – The expected file format. All file formats other than “parquet” are delegated to Pandas.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas “to_*()” method.
- Raises:
ValueError – If the file format is invalid.
ValueError – If the path does not have an extension that corresponds to the file format.
- Return type:
None
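A minimal sketch, assuming df is a Pandas DataFrame; note that the path extension must match the file format:
>>> import dapla as dp
>>>
>>> dp.write_pandas(df, "gs://my-bucket/dataset/data.parquet", file_format="parquet")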
dapla.pubsub module¶
- exception EmptyListError¶
Bases: Exception
Empty list error.
- trigger_source_data_processing(project_id, source_name, folder_prefix, kuben=False)¶
Triggers a source data processing service with every file that has a given prefix.
- Parameters:
project_id (str) – The ID of the Google Cloud project containing the Pub/Sub topic; this is normally the standard project.
folder_prefix (str) – The folder prefix of the files to be processed.
source_name (str) – The name of the source that should process the files.
kuben (bool) – Whether the team is on kuben or legacy.
- Return type:
None
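A minimal sketch; the project id, source name, and folder prefix are illustrative placeholders:
>>> from dapla.pubsub import trigger_source_data_processing
>>>
>>> trigger_source_data_processing(
...     project_id="my-team-project",
...     source_name="mysource",
...     folder_prefix="inndata",
...     kuben=True,
... )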