Dapla (statnorway)

Functions for reading and writing GeoDataFrames in Statistics Norway’s GCS Dapla.

check_files(folder, contains=None, within_minutes=None)[source]

Returns DataFrame of files in the folder and subfolders with times and sizes.

Parameters:
  • folder (str) – Google cloud storage folder.

  • contains (str | None) – Optional substring that must be in the file path.

  • within_minutes (int | None) – Optionally include only files that were updated in the last n minutes.

Return type:

DataFrame
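
The two optional filters can be illustrated with a minimal, standard-library sketch. It is not the sgis implementation: the helper name filter_files, the (path, updated) record layout and the assumption that both filters must pass are hypothetical, chosen only to show what 'contains' and 'within_minutes' mean.

```python
from datetime import datetime, timedelta, timezone


def filter_files(records, contains=None, within_minutes=None, now=None):
    """Hypothetical helper mimicking the 'contains' and 'within_minutes' filters.

    'records' is a list of (path, updated) tuples, where 'updated' is a
    timezone-aware datetime. Assumes a file must pass both filters.
    """
    now = now or datetime.now(timezone.utc)
    selected = []
    for path, updated in records:
        # Skip paths that lack the required substring.
        if contains is not None and contains not in path:
            continue
        # Skip files last updated more than 'within_minutes' minutes ago.
        if within_minutes is not None and now - updated > timedelta(minutes=within_minutes):
            continue
        selected.append((path, updated))
    return selected


# Made-up records with a fixed 'now' so the result is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    ("bucket/roads/2023.parquet", now - timedelta(minutes=5)),
    ("bucket/roads/2022.parquet", now - timedelta(hours=3)),
    ("bucket/buildings/2023.parquet", now - timedelta(minutes=1)),
]
recent_roads = filter_files(records, contains="roads", within_minutes=10, now=now)
```

Here only the first record is kept: the second is too old and the third does not contain "roads".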

exists(path)[source]

Returns True if the path exists, and False if it doesn’t.

Parameters:

path (str) – The path to the file or directory.

Return type:

bool

Returns:

True if the path exists, False if not.

get_bounds_series(paths, file_system=None, threads=None, pandas_fallback=False)[source]

Get a GeoSeries with file paths as indexes and the file’s bounds as values.

The returned GeoSeries can be used as the first argument of ‘read_geopandas’ along with the ‘mask’ keyword.

Parameters:
  • paths (list[str | Path] | tuple[str | Path]) – Iterable of file paths in gcs.

  • file_system (dp.gcs.GCSFileSystem | None) – Optional instance of dp.gcs.GCSFileSystem. If None, an instance is created within the function. Note that creating a new instance on every call is slower in long loops, so pass an existing instance there.

  • threads (int | None) – Number of threads to use if reading multiple files. Defaults to the number of files to read or the number of available threads (if lower).

  • pandas_fallback (bool) – If False (default), an exception is raised if the file has no geo metadata. If True, the geometry value is set to None for this file.

Return type:

GeoSeries

Returns:

A geopandas.GeoSeries with file paths as indexes and bounds as values.

Examples:

>>> import sgis as sg
>>> import dapla as dp
>>> file_system = dp.FileClient.get_gcs_file_system()
>>> all_paths = file_system.ls("...")

Get the bounds of all your file paths, indexed by path.

>>> bounds_series = sg.get_bounds_series(all_paths, file_system)
>>> bounds_series
.../0301.parquet    POLYGON ((273514.334 6638380.233, 273514.334 6...
.../1101.parquet    POLYGON ((6464.463 6503547.192, 6464.463 65299...
.../1103.parquet    POLYGON ((-6282.301 6564097.347, -6282.301 660...
.../1106.parquet    POLYGON ((-46359.891 6622984.385, -46359.891 6...
.../1108.parquet    POLYGON ((30490.798 6551661.467, 30490.798 658...
                                                                                                        ...
.../5628.parquet    POLYGON ((1019391.867 7809550.777, 1019391.867...
.../5630.parquet    POLYGON ((1017907.145 7893398.317, 1017907.145...
.../5632.parquet    POLYGON ((1075687.587 7887714.263, 1075687.587...
.../5634.parquet    POLYGON ((1103447.451 7874551.663, 1103447.451...
.../5636.parquet    POLYGON ((1024129.618 7838961.91, 1024129.618 ...
Length: 357, dtype: geometry

Make a grid around the total bounds of the files, and read geometries intersecting with the mask in a loop.

>>> grid = sg.make_grid(bounds_series, 10_000)
>>> for mask in grid.geometry:
...     df = sg.read_geopandas(
...         bounds_series,
...         mask=mask,
...         file_system=file_system,
...     )

read_geopandas(gcs_path, pandas_fallback=False, file_system=None, mask=None, threads=None, **kwargs)[source]

Reads geoparquet or other geodata from one or more files on GCS.

If the file has 0 rows, the contents will be returned as a pandas.DataFrame, since geopandas does not read and write empty tables.

Note

Does not currently read shapefiles or filegeodatabases.

Parameters:
  • gcs_path (str | Path | list[str | Path] | tuple[str | Path] | GeoSeries) – Path to one or more files on Google Cloud Storage. Multiple paths are read with threading.

  • pandas_fallback (bool) – If False (default), an exception is raised if the file cannot be read with geopandas and the number of rows is more than 0. If True, the file will be read with pandas if geopandas fails.

  • file_system (dp.gcs.GCSFileSystem | None) – Optional file system.

  • mask (GeoSeries | GeoDataFrame | shapely.Geometry | tuple | None) – Optional geometry mask to keep only intersecting geometries. If ‘gcs_path’ is an iterable of multiple paths, only the files with a bbox that intersects the mask are read, then filtered by location.

  • threads (int | None) – Number of threads to use if reading multiple files. Defaults to the number of files to read or the number of available threads (if lower).

  • **kwargs – Additional keyword arguments passed to geopandas’ read_parquet or read_file, depending on the file type.

Return type:

GeoDataFrame | DataFrame

Returns:

A GeoDataFrame if it has rows. If zero rows, a pandas DataFrame is returned.
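
The file-level part of the 'mask' behaviour and the default for 'threads' can be sketched with the standard library alone. This is an illustration under stated assumptions, not the sgis implementation: the names boxes_intersect and read_intersecting are hypothetical, bounds are plain (minx, miny, maxx, maxy) tuples instead of geometries, os.cpu_count() stands in for "available threads", and the row-level filtering that follows the file selection is left out.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def read_intersecting(bounds_by_path, mask_bounds, read_func, threads=None):
    """Hypothetical sketch: read only the files whose bbox intersects the mask."""
    paths = [
        path
        for path, bounds in bounds_by_path.items()
        if boxes_intersect(bounds, mask_bounds)
    ]
    if not paths:
        return []
    if threads is None:
        # Number of files to read, or the number of CPUs if that is lower.
        threads = min(len(paths), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=threads) as executor:
        # executor.map returns results in the same order as 'paths'.
        return list(executor.map(read_func, paths))


# Made-up bounds; the stand-in reader just returns the path it was given.
bounds_by_path = {
    "a.parquet": (0, 0, 10, 10),
    "b.parquet": (20, 20, 30, 30),
    "c.parquet": (5, 5, 15, 15),
}
files_read = read_intersecting(bounds_by_path, (8, 8, 12, 12), read_func=lambda path: path)
```

In this sketch "b.parquet" is never opened, because its bounding box lies entirely outside the mask.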

write_geopandas(df, gcs_path, overwrite=True, pandas_fallback=False, file_system=None, write_covering_bbox=False, **kwargs)[source]

Writes a GeoDataFrame to the specified format.

Note

Does not currently write to shapefile or filegeodatabase.

Parameters:
  • df (GeoDataFrame) – The GeoDataFrame to write.

  • gcs_path (str | Path) – The path to the file you want to write to.

  • overwrite (bool) – Whether to overwrite the file if it exists. Defaults to True.

  • pandas_fallback (bool) – If False (default), an exception is raised if the file cannot be written with geopandas and the number of rows is more than 0. If True, the file will be written without geo-metadata if it has more than 0 rows.

  • file_system (dp.gcs.GCSFileSystem | None) – Optional file system.

  • write_covering_bbox (bool) – Writes the bounding box column for each row entry with column name “bbox”. Writing a bbox column can be computationally expensive, but allows you to specify a bbox in read_parquet for filtered reading. Note: this bbox column is part of the newer GeoParquet 1.1 specification and should be considered experimental. While writing the column is backwards compatible, using it for filtering may not be supported by all readers.

  • **kwargs – Additional keyword arguments passed to parquet.write_table (for parquet) or geopandas’ to_file method (if not parquet).

Return type:

None
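
What the “bbox” column from 'write_covering_bbox' makes possible can be shown with a small, library-free sketch. Each row stores its own bounds under the field names xmin, ymin, xmax and ymax used by the GeoParquet 1.1 covering column, so a reader can discard rows by comparing four numbers before decoding any geometry. The helper names and the row layout below are hypothetical and only illustrate the idea; they do not reproduce how sgis or geopandas write or read the column.

```python
def covering_bbox(coords):
    """Bounding box of a sequence of (x, y) pairs, as a dict of four numbers."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}


def bbox_overlaps(row_bbox, query):
    """True if a row's bbox overlaps or touches a (minx, miny, maxx, maxy) query."""
    minx, miny, maxx, maxy = query
    return (
        row_bbox["xmin"] <= maxx
        and row_bbox["xmax"] >= minx
        and row_bbox["ymin"] <= maxy
        and row_bbox["ymax"] >= miny
    )


# Made-up rows: each geometry is reduced to its list of coordinates.
rows = [
    {"id": 1, "coords": [(0, 0), (2, 0), (2, 2)]},
    {"id": 2, "coords": [(10, 10), (12, 11)]},
]
for row in rows:
    # Computing this per row is the extra cost paid at write time.
    row["bbox"] = covering_bbox(row["coords"])

# At read time, only the bbox values are compared with the query box.
hits = [row["id"] for row in rows if bbox_overlaps(row["bbox"], (1, 1, 5, 5))]
```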