Reference

dapla_pseudo package

Subpackages

dapla_pseudo.constants module

This module defines constants that are referenced throughout the codebase.

class Env(*values)

Bases: str, Enum

Environment variable keys.

PSEUDO_CLIENT_MAX_TOTAL_PARTITIONS = 'PSEUDO_CLIENT_MAX_TOTAL_PARTITIONS'
PSEUDO_CLIENT_ROWS_PER_PARTITION = 'PSEUDO_CLIENT_ROWS_PER_PARTITION'
PSEUDO_SERVICE_AUTH_TOKEN = 'PSEUDO_SERVICE_AUTH_TOKEN'
PSEUDO_SERVICE_URL = 'PSEUDO_SERVICE_URL'

class MapFailureStrategy(*values)

Bases: str, Enum

MapFailureStrategy defines what is returned when a value cannot be mapped: null (RETURN_NULL) or the original value (RETURN_ORIGINAL).

RETURN_NULL = 'RETURN_NULL'
RETURN_ORIGINAL = 'RETURN_ORIGINAL'

class PredefinedKeys(*values)

Bases: str, Enum

Names of ‘global keys’ that the Dapla Pseudo Service is familiar with.

PAPIS_COMMON_KEY_1 = 'papis-common-key-1'
SSB_COMMON_KEY_1 = 'ssb-common-key-1'
SSB_COMMON_KEY_2 = 'ssb-common-key-2'

class PseudoFunctionTypes(*values)

Bases: str, Enum

Names of well known pseudo functions.

DAEAD = 'daead'
FF31 = 'ff31'
MAP_SID = 'map-sid-ff31'
REDACT = 'redact'

class PseudoOperation(*values)

Bases: str, Enum

Pseudo operation.

DEPSEUDONYMIZE = 'depseudonymize'
PSEUDONYMIZE = 'pseudonymize'
REPSEUDONYMIZE = 'repseudonymize'

class UnknownCharacterStrategy(*values)

Bases: str, Enum

UnknownCharacterStrategy defines how encryption/decryption should handle non-alphabet characters.

DELETE = 'delete'
FAIL = 'fail'
REDACT = 'redact'
SKIP = 'skip'
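
Because each of these classes mixes in str, its members compare equal to their plain string values and can be looked up by value. The sketch below re-declares a local copy of PseudoOperation so that it runs without dapla_pseudo installed; in real code the enum would presumably be imported from dapla_pseudo.constants instead.

```python
from enum import Enum


class PseudoOperation(str, Enum):
    """Local stand-in mirroring the members listed above."""

    DEPSEUDONYMIZE = "depseudonymize"
    PSEUDONYMIZE = "pseudonymize"
    REPSEUDONYMIZE = "repseudonymize"


# Members of a str-based Enum compare equal to their string values.
is_equal = PseudoOperation.PSEUDONYMIZE == "pseudonymize"

# Looking a member up by value returns the canonical member.
looked_up = PseudoOperation("repseudonymize")

# .value gives the plain string, for example when building a payload.
payload_value = PseudoOperation.DEPSEUDONYMIZE.value
```

The same pattern applies to the other enums in this module, since they share the str, Enum bases.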

dapla_pseudo.exceptions module

Common exceptions for the Dapla Pseudo package.

exception ExtensionNotValidError(message)

Bases: Exception

Exception raised when a file extension is invalid.

Parameters:

message (str)

Return type:

None

exception FileInvalidError(message)

Bases: Exception

Exception raised when a file is in an invalid state.

Parameters:

message (str)

Return type:

None

exception MimetypeNotSupportedError(message)

Bases: Exception

Exception raised when a Mimetype is invalid.

Parameters:

message (str)

Return type:

None

exception NoFileExtensionError(message)

Bases: Exception

Exception raised when a file has no file extension.

Parameters:

message (str)

Return type:

None

dapla_pseudo.models module

The models module contains base classes used by other models.

class APIModel(**data)

Bases: BaseModel

APIModel is a base class for models that are used for communicating with the Dapla Pseudo Service.

It provides configuration for converting between camelCase (required by the API) and snake_case (the idiomatic Python convention used by this library). It also provides sensible defaults for serializing a model to JSON.

model_config: ClassVar[ConfigDict] = {'alias_generator': <function camelize>, 'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

to_json()

Convert the model to JSON using camelCase aliases and only including assigned values.

Return type:

str
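
The snake_case to camelCase conversion described above can be illustrated without pydantic. The following is only a standard-library sketch of the idea, not the library's implementation: to_camel and to_camel_json are hypothetical helpers, the field names are invented for illustration, and "only assigned values" is approximated here as "values that are not None".

```python
import json
from typing import Any


def to_camel(name: str) -> str:
    """Convert a snake_case name to camelCase."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_camel_json(fields: dict[str, Any]) -> str:
    """Serialize fields with camelCase keys, skipping values that are None.

    Assumption: "assigned" is approximated as "not None".
    """
    camelized = {to_camel(key): value for key, value in fields.items() if value is not None}
    return json.dumps(camelized)


# Hypothetical field names, used only to show the key conversion.
request_json = to_camel_json(
    {"pseudo_func": "daead", "target_keyset": None, "field_name": "fnr"}
)
```

Here request_json contains the keys pseudoFunc and fieldName, and the unset target_keyset entry is left out.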

dapla_pseudo.types module

Type declarations for dapla-toolbelt-pseudo.

dapla_pseudo.utils module

Utility functions for Dapla Pseudo.

build_pseudo_field_request(pseudo_operation, mutable_df, rules, custom_keyset=None, target_custom_keyset=None, target_rules=None)

Builds a list of field request objects for the given pseudo operation.

Return type:

list[PseudoFieldRequest | DepseudoFieldRequest | RepseudoFieldRequest]

Parameters:
  • pseudo_operation (PseudoOperation)

  • mutable_df (MutableDataFrame)

  • rules (list[PseudoRule])

  • custom_keyset (PseudoKeyset | str | None)

  • target_custom_keyset (PseudoKeyset | str | None)

  • target_rules (list[PseudoRule] | None)

convert_to_date(sid_snapshot_date=None)

Converts the SID version date to the ‘date’ type, if it is a string.

If None, simply passes the None through the function.

Return type:

date | None

Parameters:

sid_snapshot_date (date | str | None)
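
The described behaviour (parse strings, pass None through) might look roughly like the sketch below. The function name convert_to_date_sketch is a stand-in, and the ISO 8601 "YYYY-MM-DD" string format is an assumption; the source does not state which string formats are accepted.

```python
from __future__ import annotations

from datetime import date


def convert_to_date_sketch(sid_snapshot_date: date | str | None = None) -> date | None:
    """Return a date for string input; pass date and None through unchanged."""
    if isinstance(sid_snapshot_date, str):
        # Assumption: the string is an ISO 8601 date such as "2023-05-21".
        return date.fromisoformat(sid_snapshot_date)
    return sid_snapshot_date


parsed = convert_to_date_sketch("2023-05-21")
passed_through = convert_to_date_sketch(None)
```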

encode_datadoc_variables(variables, indent=2)

Encode datadoc variables to a formatted JSON list.

Return type:

str

Parameters:
  • variables (list[Variable])

  • indent (int)

find_multipart_obj(obj_name, multipart_files_tuple)

Find “multipart object” by name.

The requests library specifies multipart file arguments as file-tuples, such as (‘filename’, fileobj, ‘content_type’). This function searches a tuple of such file-tuples, ((file-tuple1),…,(file-tupleN)), and returns the fileobj of the first file-tuple whose filename matches.

Parameters:
  • obj_name (str) – The name of the object

  • multipart_files_tuple (set[Any]) – The multipart tuple

Return type:

Any

Returns:

The fileobject associated with the matched tuple

Example

```
multipart_tuple = (('filename1', fileobj1, 'application/json'), ('filename2', fileobj2, 'application/json'))

find_multipart_obj("filename2", multipart_tuple)  # -> fileobj2
```
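
A self-contained version of the same lookup is sketched below. find_multipart_obj_sketch is a stand-in written for illustration, not the library's code, and returning None when nothing matches is an assumption; the source does not say what happens in that case.

```python
import io
from typing import Any, Iterable, Optional


def find_multipart_obj_sketch(obj_name: str, multipart_files: Iterable[tuple]) -> Optional[Any]:
    """Return the file object of the first file-tuple whose filename matches."""
    for filename, fileobj, *_ in multipart_files:
        if filename == obj_name:
            return fileobj
    return None  # Assumption: no match yields None.


file1 = io.BytesIO(b'{"a": 1}')
file2 = io.BytesIO(b'{"b": 2}')
multipart_tuple = (
    ("filename1", file1, "application/json"),
    ("filename2", file2, "application/json"),
)

match = find_multipart_obj_sketch("filename2", multipart_tuple)
```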

get_file_format_from_file_name(file_path)

Extracts the file format from a file path.

Return type:

SupportedOutputFileFormat

Parameters:

file_path (str | Path)
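
One way to derive a format from a path is to take the suffix and look it up in an enum. The sketch below uses local stand-ins (OutputFormat, NoExtensionError, InvalidExtensionError) that mirror SupportedOutputFileFormat, NoFileExtensionError and ExtensionNotValidError. Whether the real function raises those exceptions is an assumption.

```python
from enum import Enum
from pathlib import Path
from typing import Union


class OutputFormat(Enum):
    """Stand-in mirroring the members of SupportedOutputFileFormat."""

    CSV = "csv"
    JSON = "json"
    PARQUET = "parquet"
    XML = "xml"
    ZIP = "zip"


class NoExtensionError(Exception):
    """Stand-in for NoFileExtensionError."""


class InvalidExtensionError(Exception):
    """Stand-in for ExtensionNotValidError."""


def file_format_from_name(file_path: Union[str, Path]) -> OutputFormat:
    """Map the last suffix of file_path to an OutputFormat member."""
    suffix = Path(file_path).suffix  # ".csv" for "a.csv", "" when absent
    if not suffix:
        raise NoExtensionError(f"'{file_path}' has no file extension")
    try:
        return OutputFormat(suffix[1:].lower())
    except ValueError:
        raise InvalidExtensionError(f"'{suffix}' is not a supported extension") from None


detected = file_format_from_name("data.parquet")
```

Note that only the last suffix is considered, so a name such as archive.tar.gz would be rejected in this sketch.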

redact_field(request)

Perform the redact operation locally.

This is in order to avoid making unnecessary requests to the API.

Return type:

tuple[str, list[str | None], RawPseudoMetadata]

Parameters:

request (PseudoFieldRequest)

running_asyncio_loop()

Returns the asyncio event loop if it exists.

Return type:

AbstractEventLoop | None
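
A common way to get this "loop or None" behaviour is to wrap asyncio.get_running_loop(), which raises RuntimeError when no loop is running. The sketch below shows that approach under the name running_loop_sketch; it is an illustration and not necessarily how the library implements it.

```python
import asyncio
from typing import Optional


def running_loop_sketch() -> Optional[asyncio.AbstractEventLoop]:
    """Return the running event loop, or None when called outside of one."""
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        return None


async def _called_inside_loop() -> bool:
    # Inside a coroutine driven by asyncio.run there is a running loop.
    return running_loop_sketch() is not None


outside_result = running_loop_sketch()
inside_result = asyncio.run(_called_inside_loop())
```

Such a check is useful for deciding whether code can start its own loop or has to schedule work on an existing one, for example in a notebook.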

dapla_pseudo.v1 package

dapla_pseudo.v1.client module

Module that implements a client abstraction that makes it easy to communicate with the Dapla Pseudo Service REST API.

class PseudoClient(pseudo_service_url=None, auth_token=None, rows_per_partition=None, max_total_partitions=None)

Bases: object

Client for interacting with the Dapla Pseudo Service REST API.

Parameters:
  • pseudo_service_url (str | None)

  • auth_token (str | None)

  • rows_per_partition (str | None)

  • max_total_partitions (str | None)
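
All four constructor arguments default to None, and the Env enum lists environment variable keys with matching names. A plausible reading, which is an assumption and not confirmed by this reference, is that unset arguments fall back to those environment variables. The sketch below shows such a fallback with a hypothetical resolve_setting helper and a placeholder URL.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_key: str,
    env: Mapping[str, str] = os.environ,
    default: Optional[str] = None,
) -> Optional[str]:
    """Prefer an explicit argument, then the environment, then a default."""
    if explicit is not None:
        return explicit
    return env.get(env_key, default)


# A fake environment keeps the example independent of the real one.
# The URL is a placeholder, not a real service address.
fake_env = {"PSEUDO_SERVICE_URL": "https://pseudo.example.invalid"}

url = resolve_setting(None, "PSEUDO_SERVICE_URL", env=fake_env)
rows = resolve_setting("5000", "PSEUDO_CLIENT_ROWS_PER_PARTITION", env=fake_env)
token = resolve_setting(None, "PSEUDO_SERVICE_AUTH_TOKEN", env=fake_env)
```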

async static is_json_parseable(response)

Check if response content is JSON parseable.

Return type:

bool

Parameters:

response (ClientResponse)

async post_to_field_endpoint(path, timeout, pseudo_requests)

Post a request to the Pseudo Service field endpoint.

Parameters:
  • path (str) – Full URL to the endpoint

  • timeout (int) – Request timeout

  • pseudo_requests (list[PseudoFieldRequest | DepseudoFieldRequest | RepseudoFieldRequest]) – Pseudo requests

Returns:

A list of tuples of the form (field_name, data, metadata)

Return type:

list[tuple[str, list[str], RawPseudoMetadata]]

dapla_pseudo.v1.depseudo module

Builder for submitting a depseudonymization request.

class Depseudonymize

Bases: object

Starting point for depseudonymization of datasets.

This class should not be instantiated, only the static methods should be used.

dataset: DataFrame

static from_pandas(dataframe)

Initialize a depseudonymization request from a pandas DataFrame.

Return type:

_Depseudonymizer

Parameters:

dataframe (DataFrame)

static from_polars(dataframe)

Initialize a depseudonymization request from a polars DataFrame.

Return type:

_Depseudonymizer

Parameters:

dataframe (DataFrame)

schema: Series | Schema

dapla_pseudo.v1.pseudo module

Builder for submitting a pseudonymization request.

class Pseudonymize

Bases: object

Starting point for pseudonymization of datasets.

This class should not be instantiated, only the static methods should be used.

dataset: DataFrame

static from_pandas(dataframe)

Initialize a pseudonymization request from a Pandas DataFrame.

Parameters:

dataframe (DataFrame) – A Pandas DataFrame

Returns:

An instance of the _Pseudonymizer class.

Return type:

_Pseudonymizer

static from_polars(dataframe)

Initialize a pseudonymization request from a Polars DataFrame.

Parameters:

dataframe (DataFrame) – A Polars DataFrame

Returns:

An instance of the _Pseudonymizer class.

Return type:

_Pseudonymizer

schema: Series | Schema

dapla_pseudo.v1.result module

Common API models for builder packages.

class Result(pseudo_response, pseudo_operation=None, targeted_columns=None, user_provided_metadata=None, schema=None)

Bases: object

Result represents the result of a pseudonymization operation.

Parameters:
  • pseudo_response (PseudoFieldResponse)

  • pseudo_operation (PseudoOperation | None)

  • targeted_columns (list[str] | None)

  • user_provided_metadata (Datadoc | None)

  • schema (Series | Schema | None)

property datadoc: str

Returns the pseudonymization metadata as a formatted JSON string.

Returns:

A JSON-formatted string representing the datadoc metadata.

Return type:

str

Raises:

ValueError – If the list of variables is malformed.

property datadoc_model: dict[str, Any] | list[Any]

Returns the pseudonymization metadata as a dictionary.

Returns:

A dictionary representing the datadoc metadata.

Return type:

dict

Raises:

ValueError – If the list of variables is malformed.

property metadata: dict[str, Any]

Returns the aggregated metadata for all fields as a dictionary.

Returns:

A dictionary containing the pseudonymization metadata, where the keys are field names and the values are corresponding pseudo field metadata. If no metadata is set, returns an empty dictionary.

Return type:

Optional[dict[str, str]]

property metadata_details: dict[str, Any]

Returns the pseudonymization metadata as a dictionary, for each field that has been processed.

Returns:

A dictionary containing the pseudonymization metadata, where the keys are field names and the values are corresponding pseudo field metadata. If no metadata is set, returns an empty dictionary.

Return type:

Optional[dict[str, str]]

to_file(file_path, **kwargs)

Write pseudonymized data to a file, with the metadata being written to the same folder.

Parameters:
  • file_path (str) – The path to the file to be written. If writing to a bucket, use the “gs://” prefix.

  • **kwargs (Any) – Additional keyword arguments to be passed to the Polars writer function if the input data is a DataFrame. The specific writer function depends on the format of the output file, e.g. write_csv() for CSV files.

Raises:

ValueError – If the result is not of type Polars DataFrame or if the output file format does not match the input file format.

Return type:

None

to_pandas(**kwargs)

Output pseudonymized data as a Pandas DataFrame.

Parameters:

**kwargs (Any) – Additional keyword arguments to be passed to the Pandas reader function if the input data is from a file. The specific reader function depends on the format of the input file, e.g. read_csv() for CSV files.

Raises:

ValueError – If the result is not of type Polars DataFrame.

Returns:

A Pandas DataFrame containing the pseudonymized data.

Return type:

pd.DataFrame

to_polars(**kwargs)

Output pseudonymized data as a Polars DataFrame.

Parameters:

**kwargs (Any) – Additional keyword arguments to be passed to the Polars “from_dicts” function if the input data is from a file.

Raises:

ValueError – If the result is not of type Polars DataFrame.

Returns:

A Polars DataFrame containing the pseudonymized data.

Return type:

pl.DataFrame

aggregate_metrics(metadata)

Aggregates logs and metrics. Each unique metric is summarized.

Return type:

dict[str, Any]

Parameters:

metadata (dict[str, dict[str, list[Any]]])
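
The reference does not describe the layout of the metadata argument beyond its type. The sketch below assumes that each field maps to a dictionary with a "logs" list and a "metrics" list of one-entry count dictionaries; those keys, the metric names, and the function name aggregate_metrics_sketch are all assumptions made for illustration. Under that assumption, summarizing each unique metric amounts to summing counts per metric name.

```python
from collections import Counter
from typing import Any


def aggregate_metrics_sketch(metadata: dict[str, dict[str, list[Any]]]) -> dict[str, Any]:
    """Concatenate logs and sum metric counts across all fields."""
    combined_logs: list[Any] = []
    totals: Counter = Counter()
    for field_metadata in metadata.values():
        combined_logs.extend(field_metadata.get("logs", []))
        for metric in field_metadata.get("metrics", []):
            totals.update(metric)  # adds the count of each metric name
    return {"logs": combined_logs, "metrics": dict(totals)}


# Hypothetical per-field metadata, keyed by field name.
example = {
    "fnr": {"logs": ["mapped 2 values"], "metrics": [{"MAPPED_SID": 2}, {"MISSING_SID": 1}]},
    "name": {"logs": [], "metrics": [{"MISSING_SID": 3}]},
}

summary = aggregate_metrics_sketch(example)
```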

dapla_pseudo.v1.supported_file_format module

Classes used to support reading of dataframes from file.

class SupportedOutputFileFormat(*values)

Bases: Enum

SupportedOutputFileFormat contains the supported file formats when outputting the result to a file.

Note that this does NOT describe the valid file extensions of _input_ data when reading from a file.

CSV = 'csv'
JSON = 'json'
PARQUET = 'parquet'
XML = 'xml'
ZIP = 'zip'

read_to_pandas_df(supported_format, df_dataset, **kwargs)

Reads a file with a supported file format to a Pandas Dataframe.

Return type:

DataFrame

Parameters:

read_to_polars_df(supported_format, df_dataset, **kwargs)

Reads a file with a supported file format to a Polars Dataframe.

Return type:

DataFrame

Parameters:

write_from_df(df, supported_format, file_like, **kwargs)

Writes to a file with a supported file format from a Dataframe.

Return type:

None

Parameters:

write_from_dicts(data, supported_format, file_like)

Writes data from a list of dicts to a file of the given format.

Return type:

None

Parameters:

dapla_pseudo.v1.validation module

Builder for submitting a validation request.

class Validator

Bases: object

Starting point for validation of datasets.

This class should not be instantiated, only the static methods should be used.

static from_file(file_path_str, **kwargs)

Initialize a validation request from a pandas dataframe read from file.

Parameters:
  • file_path_str (str) – The path to the file to be read.

  • **kwargs (Any) – Additional keyword arguments to be passed to the file reader.

Raises:

FileNotFoundError – If no file is found at the specified local path.

Returns:

An instance of the _FieldSelector class.

Return type:

_FieldSelector

Examples

    # Read from bucket
    from dapla_pseudo import Validator

    bucket_path = "gs://ssb-staging-dapla-felles-data-delt/felles/smoke-tests/fruits/data.parquet"
    field_selector = Validator.from_file(bucket_path)

    # Read from local filesystem
    from dapla_pseudo import Validator

    local_path = "some_file.csv"
    field_selector = Validator.from_file(local_path)

static from_pandas(dataframe)

Initialize a validation request from a pandas DataFrame.

Return type:

_FieldSelector

Parameters:

dataframe (DataFrame)

static from_polars(dataframe)

Initialize a validation request from a polars DataFrame.

Return type:

_FieldSelector

Parameters:

dataframe (DataFrame)