Reference¶
dapla package¶
Subpackages¶
dapla.backports module¶
- details(gcs_path)¶
Backported dapla function that provides a detailed listing of files for a given GCS path.
- Parameters:
gcs_path (str) – the path from which you want to list all folders
- Return type:
list[dict[str, str]]
- Returns:
A list of dicts containing file details
- show(gcs_path)¶
Backported dapla function to recursively show all folders below a given GCS path.
- Parameters:
gcs_path (str) – the path from which you want to list all folders
- Return type:
list[str]
- Returns:
A simplified list of files or folders
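A minimal usage sketch (the bucket path is hypothetical; the import assumes these functions are exposed by the dapla.backports module documented here):
>>> from dapla.backports import details, show
>>> show("gs://my-bucket/raw")     # hypothetical path; recursive folder listing
>>> details("gs://my-bucket/raw")  # the same path, with file details as dicts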
dapla.collector module¶
- class CollectorClient(collector_url)¶
Bases:
object
Client for working with DataCollector.
- Parameters:
collector_url (str)
- running_tasks()¶
Get all running collector tasks.
- Return type:
Response
- start(specification)¶
Start a new collector task.
- Parameters:
specification (dict[str, Any]) – The JSON object of the collector specification
- Returns:
The “requests.Response” object from the API call
- Return type:
Response
- stop(task_id)¶
Stop a running collector task.
- Parameters:
task_id (int) – The id of the task to stop.
- Returns:
The “requests.Response” object from the API call
- Return type:
Response
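A sketch of a full client round trip (URL, specification, and task id are placeholders, not real values):
>>> from dapla.collector import CollectorClient
>>> client = CollectorClient("https://collector.example.invalid")  # placeholder URL
>>> spec = {"specificationId": "demo-spec"}  # placeholder collector specification
>>> client.start(spec).status_code
>>> client.running_tasks().json()
>>> client.stop(task_id=42)  # placeholder task id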
dapla.converter module¶
- class ConverterClient(converter_url)¶
Bases:
object
Client for working with DataConverter.
- Parameters:
converter_url (str)
- get_job_summary(job_id)¶
Retrieve the execution summary for a specific converter job.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- get_pseudo_report(job_id)¶
Get a report with details about how pseudonymization is being applied for a specific job.
The report includes:
- matched pseudo rules
- unmatched pseudo rules
- pseudo rule to field match map – rule names and their corresponding matches
- field matches to pseudo rule map – field names and the corresponding pseudo rule that covers each field
- metrics
- a textual schema hierarchy that illustrates how the pseudo rules are being applied
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- get_pseudo_schema(job_id)¶
Get hierarchical schema representation that details how pseudo rules are being applied.
This is a smaller version of get_pseudo_report.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- start(job_config)¶
Schedule a new converter job.
- Parameters:
job_config (dict[str, Any]) – The JSON object of the job configuration
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- start_simulation(job_config)¶
Start a simulated converter job.
Useful for testing job configurations or diagnosing pseudonymization issues.
- Parameters:
job_config (dict[str, Any]) – The JSON object of the job configuration
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
- stop_job(job_id)¶
Stop a specific converter job.
- Parameters:
job_id (str) – The ID of the job
- Return type:
Response
- Returns:
The “requests.Response” object from the API call
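For illustration, a simulated run followed by a real job might look like this (URL and configuration are placeholders; the shape of the start response is an assumption made for the sketch):
>>> from dapla.converter import ConverterClient
>>> client = ConverterClient("https://converter.example.invalid")  # placeholder URL
>>> job_config = {"jobName": "demo"}  # placeholder job configuration
>>> client.start_simulation(job_config)  # dry run to validate the configuration
>>> response = client.start(job_config)
>>> job_id = response.json()["jobId"]  # assumed response key, for illustration only
>>> client.get_job_summary(job_id).json()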
dapla.doctor module¶
- class Doctor¶
Bases:
object
Class of functions that perform checks on Dapla.
Checks whether the user is authenticated, whether the Keycloak token is valid, and whether the user has access to GCS. Each check can be run individually, or all of them collectively with the ‘health’ method.
- static bucket_access()¶
Checks whether the user has access to a common Google bucket.
- Return type:
bool
- static gcs_credentials_valid()¶
Checks whether the user’s Google Cloud Storage token is valid by accessing a GCS service.
- Return type:
bool
- classmethod health()¶
Runs a series of checks to determine the health of the Dapla setup.
- Return type:
None
- static keycloak_token_valid()¶
Checks whether the Keycloak token is valid by attempting to access a Keycloak-protected service.
- Return type:
bool
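For example, the checks can be run collectively or one at a time:
>>> from dapla.doctor import Doctor
>>> Doctor.health()  # runs all checks
>>> Doctor.keycloak_token_valid()  # or run a single check; returns a bool
>>> Doctor.bucket_access()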
dapla.files module¶
- class FileClient¶
Bases:
object
Client for working with buckets and files on Google Cloud Storage.
This class should not be instantiated; only the static methods should be used.
- static cat(gcs_path)¶
Get string content of a file from GCS.
- Parameters:
gcs_path (str) – The GCS path to a file.
- Return type:
str
- Returns:
utf-8 decoded string content of the given file
- static gcs_open(gcs_path, mode='r')¶
Open a file in GCS; works like the regular Python open().
- Parameters:
gcs_path (str) – The GCS path to a file.
mode (str) – File open mode. Defaults to ‘r’
- Return type:
TextIOWrapper | AbstractBufferedFile
- Returns:
A file-like object.
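A small sketch of reading file contents (the GCS path is hypothetical):
>>> from dapla.files import FileClient
>>> FileClient.cat("gs://my-bucket/data/readme.txt")  # whole file as a utf-8 string
>>> with FileClient.gcs_open("gs://my-bucket/data/readme.txt") as f:
...     first_line = f.readline()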
- static get_gcs_file_system(**kwargs)¶
Return a pythonic file system for Google Cloud Storage, initialized with a personal Google Identity token.
- Parameters:
kwargs (Any) – Additional arguments to pass to the underlying GCSFileSystem.
- Return type:
GCSFileSystem
- Returns:
A GCSFileSystem instance.
See https://gcsfs.readthedocs.io/en/latest for advanced usage
- static get_versions(bucket_name, file_path)¶
Get all versions of a file in a bucket.
- Parameters:
bucket_name (str) – Bucket name where the file is located.
file_path (str) – Path to the file.
- Return type:
Any
- Returns:
List of versions of the file.
- static load_csv_to_pandas(gcs_path, **kwargs)¶
Reads a CSV file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .csv file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_csv().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
- static load_json_to_pandas(gcs_path, **kwargs)¶
Reads a JSON file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .json file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_json().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
- static load_xml_to_pandas(gcs_path, **kwargs)¶
Reads an XML file from Google Cloud Storage into a Pandas DataFrame.
- Parameters:
gcs_path (str) – The GCS path to a .xml file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas read_xml().
- Return type:
DataFrame
- Returns:
A Pandas DataFrame.
- static ls(gcs_path, detail=False, **kwargs)¶
List the contents of a GCS bucket path.
- Parameters:
gcs_path (str) – The GCS path to a directory.
detail (bool) – Whether to return detailed information about the files.
kwargs (Any) – Additional arguments to pass to the underlying ‘ls()’ method.
- Return type:
Any
- Returns:
List of strings if detail is False, or list of directory information dicts if detail is True.
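For instance (hypothetical path):
>>> from dapla.files import FileClient
>>> FileClient.ls("gs://my-bucket/data")               # names only
>>> FileClient.ls("gs://my-bucket/data", detail=True)  # list of info dicts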
- static restore_version(source_bucket_name, source_file_name, source_generation_id, **kwargs)¶
Restores a soft-deleted or versioned (non-current) file version to the live version.
- Parameters:
source_bucket_name (str) – source bucket name where the file is located.
source_file_name (str) – non-current file name.
source_generation_id (str) – generation id of the non-current file version.
kwargs (Any) – Additional arguments to pass to the underlying ‘copy_blob()’ method.
- Return type:
Any
- Returns:
A new blob with new generation id.
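A hedged sketch of listing versions and restoring one (bucket, path, and generation id are placeholders):
>>> from dapla.files import FileClient
>>> FileClient.get_versions("my-bucket", "data/file.csv")
>>> FileClient.restore_version(
...     source_bucket_name="my-bucket",
...     source_file_name="data/file.csv",
...     source_generation_id="1712345678901234",  # placeholder generation id
... )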
- static save_pandas_to_csv(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to a CSV file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .csv file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_csv().
- Return type:
None
- static save_pandas_to_json(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to a JSON file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .json file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_json().
- Return type:
None
- static save_pandas_to_xml(df, gcs_path, **kwargs)¶
Write the contents of a Pandas DataFrame to an XML file in a bucket.
- Parameters:
df (DataFrame) – The Pandas DataFrame to save to file.
gcs_path (str) – The GCS path to the destination .xml file.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas to_xml().
- Return type:
None
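A round-trip sketch combining the save and load helpers (the destination path is hypothetical; extra keyword arguments such as index=False are passed through to Pandas):
>>> import pandas as pd
>>> from dapla.files import FileClient
>>> df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
>>> FileClient.save_pandas_to_csv(df, "gs://my-bucket/demo.csv", index=False)
>>> df2 = FileClient.load_csv_to_pandas("gs://my-bucket/demo.csv")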
dapla.gcs module¶
dapla.git module¶
- repo_root_dir(directory=None)¶
Find the root directory of a git repo, searching upwards from a given path.
- Parameters:
directory (Union[Path, str, None]) – The path to search from; defaults to the current working directory. The directory can be a string or a pathlib.Path.
- Return type:
Path
- Returns:
Path to the git repo’s root directory.
- Raises:
RuntimeError – If no .git directory is found when searching upwards.
Example:¶
>>> import dapla as dp
>>> import tomli
>>>
>>> config_file = dp.repo_root_dir() / "pyproject.toml"
>>> with open(config_file, mode="rb") as fp:
...     config = tomli.load(fp)
dapla.guardian module¶
- class GuardianClient¶
Bases:
object
Client for interacting with the Maskinporten Guardian.
- static call_api(api_endpoint_url, maskinporten_client_id, scopes, keycloak_token=None)¶
Call an external API using Maskinporten Guardian.
- Parameters:
api_endpoint_url (str) – URL to the target API
maskinporten_client_id (str) – the Maskinporten client id
scopes (str) – the Maskinporten scopes
keycloak_token (Optional[str]) – the user’s personal Keycloak token. An automatic fetch attempt will be made if left empty.
- Raises:
RuntimeError – If the API call fails
- Return type:
Any
- Returns:
The endpoint json response
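A minimal sketch, with placeholder endpoint, client id, and scopes:
>>> from dapla.guardian import GuardianClient
>>> data = GuardianClient.call_api(
...     "https://api.example.invalid/v1/data",  # placeholder target API
...     maskinporten_client_id="00000000-0000-0000-0000-000000000000",  # placeholder
...     scopes="example:scope",  # placeholder scope string
... )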
- static get_guardian_token(guardian_endpoint, keycloak_token, body)¶
Retrieve access token from Maskinporten Guardian.
- Parameters:
guardian_endpoint (str) – URL to the maskinporten guardian
keycloak_token (str) – the user’s Keycloak token
body (dict[str, str]) – maskinporten request body
- Raises:
RuntimeError – If the Guardian token request fails
- Return type:
str
- Returns:
The maskinporten access token
- static get_guardian_url()¶
Get the Guardian URL for the current environment.
- Return type:
str
dapla.pandas module¶
- class SupportedFileFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)¶
Bases:
Enum
A collection of supported file formats.
- CSV = 'csv'¶
- EXCEL = 'excel'¶
- FWF = 'fwf'¶
- JSON = 'json'¶
- PARQUET = 'parquet'¶
- SAS7BDAT = 'sas7bdat'¶
- XML = 'xml'¶
- read_pandas(gcs_path, file_format='parquet', columns=None, filters=None, **kwargs)¶
Convenience method for reading a dataset from a given GCS path and converting it to a Pandas dataframe.
- Parameters:
gcs_path (str | list[str]) – Path or paths to the directory or file you want to get the contents of. Supports multiple paths when reading parquet format.
file_format (Optional[str]) – The expected file format. All file formats other than “parquet” are delegated to Pandas methods like read_json, read_csv, etc. Defaults to “parquet”.
columns (Optional[list[str]]) – Choose specific columns to read. Defaults to None.
filters (Union[list[tuple[Any] | list[tuple[Any]]], Expression, None]) – Add row filter to process when reading parquet. The filter should follow pyarrow methods. See examples in the docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset. Defaults to None.
kwargs (Any) – Additional arguments to pass to the underlying Pandas “read_*()” method.
- Raises:
ValueError – If multiple paths are provided for non-parquet formats.
- Return type:
Union[DataFrame, Series]
- Returns:
A Pandas DataFrame containing the selected dataset.
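For instance, reading a column subset of a parquet dataset with a pyarrow-style row filter (path and column names are placeholders):
>>> from dapla.pandas import read_pandas
>>> df = read_pandas(
...     "gs://my-bucket/dataset",       # placeholder path
...     columns=["id", "amount"],       # placeholder column names
...     filters=[("year", "=", 2023)],  # placeholder pyarrow filter
... )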
- write_pandas(df, gcs_path, file_format='parquet', **kwargs)¶
Convenience method for writing a Pandas DataFrame to a given GCS path.
- Parameters:
df (DataFrame) – The Pandas DataFrame to write to file.
gcs_path (str) – The GCS path to the destination file. Must have an extension that corresponds to the file_format.
file_format (str) – The expected file format. All file formats other than “parquet” are delegated to Pandas “to_*()” methods. Defaults to “parquet”.
**kwargs (Any) – Additional arguments to pass to the underlying Pandas “to_*()” method.
- Raises:
ValueError – If the file format is invalid.
ValueError – If the path does not have an extension that corresponds to the file format.
- Return type:
None
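For example (placeholder path; the .parquet extension matches the default file_format):
>>> import pandas as pd
>>> from dapla.pandas import write_pandas
>>> df = pd.DataFrame({"year": [2023], "amount": [100]})
>>> write_pandas(df, "gs://my-bucket/output/data.parquet")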
dapla.pubsub module¶
- exception EmptyListError¶
Bases:
Exception
Empty list error.
- trigger_source_data_processing(project_id, source_name, folder_prefix, kuben=False)¶
Triggers a source data processing service with every file that has a given prefix.
- Parameters:
project_id (str) – The ID of the Google Cloud project containing the pubsub topic; this is normally the standard project.
source_name (str) – The name of the source that should process the files.
folder_prefix (str) – The folder prefix of the files to be processed.
kuben (bool) – Whether the team is on kuben or legacy.
- Return type:
None
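A hedged invocation sketch (all argument values are placeholders):
>>> from dapla.pubsub import trigger_source_data_processing
>>> trigger_source_data_processing(
...     project_id="my-team-project-1234",  # placeholder project id
...     source_name="mysource",             # placeholder source name
...     folder_prefix="inbox/2024",         # placeholder prefix
...     kuben=True,
... )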