Reading a .shp and .shx file from an Azure Data Lake/Blob Container


I am using our Azure Data Factory to load a ZIP from a public API. I then unpack that ZIP using a copy activity, which results in a set of .shp/.shx files.

From a Python script, I then want to use the geopandas package to read the data into a variable. To achieve that, I use the following packages:

azure-datalake-store

azure-storage-blob

import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

# Connect to the storage account
storage_connection_string = "myconnectionstring"
blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)


local_path = r"mypath"
local_file_name = "file.shp"

download_file_path = os.path.join(local_path, local_file_name)
container_client = blob_service_client.get_container_client(container="mycontainer")


# Download the blob and write it to a local file
with open(file=download_file_path, mode="wb") as download_file:
    download_file.write(container_client.download_blob("fileonblobstorage").readall())

This correctly downloads the named file to my local storage. However, I would prefer to load it directly into a GeoDataFrame, rather than saving it locally and then reading it back in.

This works with a .csv file, but the .shp file returns an engine error. Since it is stored as a binary file, I assume it is an encoding issue, but I can't figure out a way to solve it. Ultimately, I'd like to end up with something like this:

import geopandas as gpd

gdf = gpd.read_file(container_client)

The above returns a plain error, the same one I got with the .csv file. Wrapping the downloaded bytes in BytesIO solved the issue for the .csv file (read into a pandas DataFrame), but the same approach returns an engine error when reading the .shp file into a GeoDataFrame.
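
For reference, this is roughly the BytesIO pattern that works for the .csv case, reusing the container_client from above (the blob name is a placeholder):

from io import BytesIO
import pandas as pd

# Download the CSV blob and read it entirely in memory -- no local file needed
csv_bytes = container_client.download_blob("myfile.csv").readall()
df = pd.read_csv(BytesIO(csv_bytes))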

Lastly, for the file to load properly, the .shx file accompanying the .shp file also has to be read. geopandas usually handles this automatically when the files sit in the same folder (which they do in the blob container as well), but when reading from memory, that second file would probably have to be passed along too.
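
One idea (untested in this setup) would be to pull the .shp together with its sidecar files into an in-memory ZIP archive and hand that to geopandas, which can read zipped shapefiles from a file-like object; all blob names below are placeholders:

import io
import zipfile
import geopandas as gpd

# Hypothetical blob names: the .shp plus its sidecar files in the same container
parts = ["file.shp", "file.shx", "file.dbf", "file.prj"]

# Build a ZIP archive in memory that contains all shapefile components
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    for name in parts:
        zf.writestr(name, container_client.download_blob(name).readall())
buffer.seek(0)

# geopandas (via fiona/pyogrio) can open a zipped shapefile from a file-like object
gdf = gpd.read_file(buffer)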

EDIT: We don't have Databricks or a Spark engine in our infrastructure. The Python script runs inside a local data warehousing tool on a virtual machine.


There is 1 answer below.

Answered by Venkatesan:

However, I would prefer to load it directly into a GeoDataFrame, rather than saving it locally and then reading it back in

To read .shp files from Azure Blob Storage (private container) without saving them locally, one option is to use the Azure Databricks environment.

First, mount your storage account to Databricks and read the shapefile (.shp).

Portal: (screenshot omitted)

Code for mount:

# Mount the blob container into DBFS using the storage account key
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/blob/",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<Account_key>"})

Now you can use the following code to read the shapefiles in Databricks:

Code:

import geopandas

# Path is /dbfs + mount point + folder + file name
gdf = geopandas.read_file("/dbfs/mnt/blob/map/gadm41_IND_0.shp")
gdf

Output: (screenshot omitted)

Alternatively, to read .shp files from Azure Blob Storage (public container) without saving them locally, you can use the following code:

Code:

import fiona
import geopandas as gpd

# Read the shapefile from the URL using fiona;
# GDAL fetches the accompanying .shx/.dbf from the same location
url = "https://<Storage account name>.blob.core.windows.net/<container-name>/map/gadm41_IND_0.shp"
with fiona.open(url) as src:
    features = list(src)
gdf = gpd.GeoDataFrame.from_features(features)
print(gdf.head())

Output:

                                            geometry GID_0 COUNTRY
0  MULTIPOLYGON (((76.97542 8.38514, 76.97486 8.3...   IND   India
1  POLYGON ((75.07161 32.48296, 75.06268 32.48213...   Z01   India
2  POLYGON ((78.65135 32.09228, 78.65241 32.08826...   Z04   India
3  POLYGON ((80.08794 30.79071, 80.08796 30.79026...   Z05   India
4  POLYGON ((94.19125 27.49632, 94.18690 27.49081...   Z07   India
