Dapla (statnorway)

Functions for reading and writing GeoDataFrames in Statistics Norway’s GCS Dapla.

class GCSFileSystem

Bases: object

Placeholder.

check_files(folder, contains=None, within_minutes=None)

Returns DataFrame of files in the folder and subfolders with times and sizes.

Parameters:

folder (str) – Google cloud storage folder.
contains (str | None) – Optional substring that must be in the file path.
within_minutes (int | None) – Optionally include only files that were updated in the last n minutes.

Return type:

DataFrame
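
To illustrate how the two optional filters combine, here is a minimal sketch that applies them to an in-memory listing. The helper `filter_listing`, the tuple layout and the sample paths are hypothetical and not part of sgis; the actual function queries GCS and returns a DataFrame.

```python
from datetime import datetime, timedelta, timezone


def filter_listing(files, contains=None, within_minutes=None, now=None):
    """Filter (path, updated, size) tuples as the parameters describe.

    Hypothetical helper for illustration only; not the sgis implementation.
    """
    now = now or datetime.now(timezone.utc)
    kept = []
    for path, updated, size in files:
        # Skip paths that lack the required substring.
        if contains is not None and contains not in path:
            continue
        # Skip files last updated more than `within_minutes` ago.
        if within_minutes is not None and now - updated > timedelta(minutes=within_minutes):
            continue
        kept.append((path, updated, size))
    return kept


# Made-up listing with a fixed "now" so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
files = [
    ("bucket/a/roads.parquet", now - timedelta(minutes=5), 1024),
    ("bucket/a/lakes.parquet", now - timedelta(minutes=90), 2048),
    ("bucket/b/roads_old.parquet", now - timedelta(days=2), 512),
]
recent_roads = filter_listing(files, contains="roads", within_minutes=60, now=now)
print([path for path, _, _ in recent_roads])
```

Both filters are optional and independent: leaving both as None keeps every file.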

expression_match_path(expression, path)

Check if a file path matches a pyarrow Expression.

Return type:

bool

Parameters:

expression (Expression)
path (str)

Examples:

>>> import pyarrow.compute as pc
>>> path = 'data/file.parquet/x=1/y=10/name0.parquet'
>>> expression = (pc.field("x") == 1) & (pc.field("y") == 10)
>>> expression_match_path(expression, path)
True
>>> expression = (pc.field("x") == 1) & (pc.field("y") == 5)
>>> expression_match_path(expression, path)
False
>>> expression = (pc.field("x") == 1) & (pc.field("z") == 10)
>>> expression_match_path(expression, path)
False
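
The idea behind matching a path can be sketched without pyarrow: read the hive-style `key=value` directory segments from the path and test them against conditions. This sketch swaps the pyarrow Expression for plain `(column, op, value)` tuples; `partition_values` and `conditions_match_path` are hypothetical names and this is not the sgis implementation.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def partition_values(path):
    """Extract hive-style 'key=value' directory segments from a path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def conditions_match_path(conditions, path):
    """Return True if every (column, op, value) condition holds for the path.

    A column that is not a partition key in the path counts as no match.
    """
    values = partition_values(path)
    for column, op, expected in conditions:
        if column not in values:
            return False
        # Partition values are strings in the path; cast to the type of `expected`.
        try:
            actual = type(expected)(values[column])
        except ValueError:
            return False
        if not OPS[op](actual, expected):
            return False
    return True


path = "data/file.parquet/x=1/y=10/name0.parquet"
print(partition_values(path))
print(conditions_match_path([("x", "==", 1), ("y", "==", 10)], path))
print(conditions_match_path([("x", "==", 1), ("y", "==", 5)], path))
print(conditions_match_path([("x", "==", 1), ("z", "==", 10)], path))
```

The three calls mirror the three cases in the example above: all conditions hold, one value differs, and one column is not a partition key in the path.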

get_bounds_series(paths, file_system=None, use_threads=True, pandas_fallback=False)

Get a GeoSeries with file paths as indexes and the file’s bounds as values.

The returned GeoSeries can be used as the first argument of ‘read_geopandas’ along with the ‘mask’ keyword.

Parameters:

paths (list[str | Path] | tuple[str | Path]) – Iterable of file paths in GCS.
file_system (GCSFileSystem | None) – Optional instance of GCSFileSystem. If None, an instance is created within the function. Note that this is slower in long loops.
use_threads (bool) – Default True.
pandas_fallback (bool) – If False (default), an exception is raised if the file has no geo metadata. If True, the geometry value is set to None for this file.

Return type:

GeoSeries

Returns:

A geopandas.GeoSeries with file paths as indexes and bounds as values.

Examples:

>>> import sgis as sg
>>> import dapla as dp
>>> file_system = GCSFileSystem()
>>> all_paths = file_system.ls("...")

Get the bounds of all your file paths, indexed by path.

>>> bounds_series = sg.get_bounds_series(all_paths, file_system)
>>> bounds_series
.../0301.parquet    POLYGON ((273514.334 6638380.233, 273514.334 6...
.../1101.parquet    POLYGON ((6464.463 6503547.192, 6464.463 65299...
.../1103.parquet    POLYGON ((-6282.301 6564097.347, -6282.301 660...
.../1106.parquet    POLYGON ((-46359.891 6622984.385, -46359.891 6...
.../1108.parquet    POLYGON ((30490.798 6551661.467, 30490.798 658...
                                  ...
.../5628.parquet    POLYGON ((1019391.867 7809550.777, 1019391.867...
.../5630.parquet    POLYGON ((1017907.145 7893398.317, 1017907.145...
.../5632.parquet    POLYGON ((1075687.587 7887714.263, 1075687.587...
.../5634.parquet    POLYGON ((1103447.451 7874551.663, 1103447.451...
.../5636.parquet    POLYGON ((1024129.618 7838961.91, 1024129.618 ...
Length: 357, dtype: geometry

Make a grid around the total bounds of the files, and read geometries intersecting with the mask in a loop.

>>> grid = sg.make_grid(bounds_series, 10_000)
>>> for mask in grid.geometry:
...     df = sg.read_geopandas(
...         bounds_series,
...         mask=mask,
...         file_system=file_system,
...     )
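
To make the tiling step concrete, here is a minimal sketch of covering a bounding box with square cells of a given size. The function `grid_cells` is hypothetical, and the real `sg.make_grid` may align or size its cells differently; the bounds used are made up.

```python
import math


def grid_cells(total_bounds, cell_size):
    """Yield (minx, miny, maxx, maxy) squares that cover total_bounds."""
    minx, miny, maxx, maxy = total_bounds
    # Round up so the last row and column extend past the bounds if needed.
    n_cols = max(1, math.ceil((maxx - minx) / cell_size))
    n_rows = max(1, math.ceil((maxy - miny) / cell_size))
    for col in range(n_cols):
        for row in range(n_rows):
            x0 = minx + col * cell_size
            y0 = miny + row * cell_size
            yield (x0, y0, x0 + cell_size, y0 + cell_size)


# A 25 km x 10 km extent tiled with 10 km cells.
cells = list(grid_cells((0, 0, 25_000, 10_000), 10_000))
print(len(cells))
print(cells[0], cells[-1])
```

Each cell then plays the role of `mask` in the loop above, so only the files overlapping that cell need to be read in each iteration.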

read_geopandas(gcs_path, pandas_fallback=False, file_system=None, mask=None, use_threads=True, filters=None, **kwargs)

Reads geoparquet or other geodata from one or more files on GCS.

If the file has 0 rows, the contents will be returned as a pandas.DataFrame, since geopandas does not read and write empty tables.

Note

Does not currently read shapefiles or filegeodatabases.

Parameters:

gcs_path (str | Path | list[str | Path] | tuple[str | Path] | GeoSeries) – Path to one or more files on Google Cloud Storage. Multiple paths are read with threading.
pandas_fallback (bool) – If False (default), an exception is raised if the file cannot be read with geopandas and the number of rows is more than 0. If True, the file will be read with pandas if geopandas fails.
file_system (GCSFileSystem | None) – Optional file system.
mask (GeoSeries | GeoDataFrame | Geometry | tuple | None) – If gcs_path is a partitioned parquet file or an iterable of paths, only files with a bbox intersecting mask will be read. Note that the data is not filtered at the row level, so use clip or sfilter to filter the data after reading.
use_threads (bool) – Defaults to True.
filters (Expression | None) – Filter that the data must satisfy to be read. Either a pyarrow.dataset.Expression, or a list in the structure [[(column, op, val), …],…] where op is one of [==, =, >, >=, <, <=, !=, in, not in]. More details here: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
**kwargs – Additional keyword arguments passed to geopandas’ read_parquet or read_file, depending on the file type.

Return type:

GeoDataFrame | DataFrame

Returns:

A GeoDataFrame if it has rows. If zero rows, a pandas DataFrame is returned.
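
The file-level behaviour of `mask` can be sketched as a bounding-box test per file. This is a simplified illustration that assumes the mask is reduced to its own bounding box; the helper names, file names and bounds are hypothetical, and the library's actual intersection test may differ.

```python
def bboxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def paths_intersecting(file_bounds, mask_bounds):
    """Return the paths whose bounding box intersects mask_bounds."""
    return [
        path
        for path, bounds in file_bounds.items()
        if bboxes_intersect(bounds, mask_bounds)
    ]


# Hypothetical file names and bounds, for illustration only.
file_bounds = {
    "west.parquet": (0, 0, 100, 100),
    "east.parquet": (200, 0, 300, 100),
    "north.parquet": (50, 150, 250, 250),
}
mask_bounds = (80, 50, 220, 120)
print(paths_intersecting(file_bounds, mask_bounds))
```

Here `west.parquet` is selected even though most of its extent lies outside the mask, so every row in it would be read. That is why the data should be clipped or spatially filtered after reading.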

write_geopandas(df, gcs_path, overwrite=True, pandas_fallback=False, file_system=None, partition_cols=None, existing_data_behavior='error', **kwargs)

Writes a GeoDataFrame to the specified format.

Note

Does not currently write to shapefile or filegeodatabase.

Parameters:

df (GeoDataFrame) – The GeoDataFrame to write.
gcs_path (str | Path) – The path to the file you want to write to.
overwrite (bool) – Whether to overwrite the file if it exists. Defaults to True.
pandas_fallback (bool) – If False (default), an exception is raised if the file cannot be written with geopandas and the number of rows is more than 0. If True, the file will be written without geo-metadata if >0 rows.
file_system (GCSFileSystem | None) – Optional file system.
partition_cols – Column(s) to partition by. Only for parquet files.
existing_data_behavior (str) – ‘error’ | ‘overwrite_or_ignore’ | ‘delete_matching’. Defaults to ‘error’. More info: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
**kwargs – Additional keyword arguments passed to parquet.write_table (for parquet) or geopandas’ to_file method (if not parquet).

Return type:

None