I am using our Azure Data Factory to load a ZIP from a public API. I then unpack that ZIP using a Copy activity, which results in a set of .shp/.shx files.
From a Python script, I then want to use the geopandas package to read the data into a variable. To achieve that, I use the following packages:
azure-datalake-store
azure-storage-blob
import os
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

# Connect to the storage account
storage_connection_string = "myconnectionstring"
blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)

# Local target for the download
local_path = r"mypath"
local_file_name = "file.shp"
download_file_path = os.path.join(local_path, local_file_name)

container_client = blob_service_client.get_container_client(container="mycontainer")

# Download the blob and write it to a local file
with open(file=download_file_path, mode="wb") as download_file:
    download_file.write(container_client.download_blob("fileonblobstorage").readall())
This correctly downloads the named file to my local storage. However, I would prefer to load it directly into a GeoDataFrame rather than saving it locally and then reading it back in.
This works with a .csv file, but the .shp file returns an engine error. Since it is stored as a binary file, I assume it is an encoding issue, but I can't seem to figure out a way to solve it. Ultimately, I'd like to get to this:
import geopandas as gpd
gdf = gpd.read_file(container_client)
The above returns a plain error, the same one I got with the .csv file. Wrapping the download in BytesIO solved the issue for the .csv file (read into a pandas DataFrame), but it still returns an engine error when reading the .shp file into a GeoDataFrame.
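For reference, the working .csv path looks roughly like this (the .csv blob name is illustrative):
import io
import pandas as pd

# Download the blob into memory and wrap the bytes for pandas
csv_bytes = container_client.download_blob("myfile.csv").readall()  # illustrative blob name
df = pd.read_csv(io.BytesIO(csv_bytes))

# The same pattern for the shapefile raises the engine error:
# shp_bytes = container_client.download_blob("fileonblobstorage").readall()
# gdf = gpd.read_file(io.BytesIO(shp_bytes))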
Lastly, for the shapefile to load properly, the .shx file accompanying the .shp file also has to be loaded. The geopandas package usually handles this automatically when the files are in the same folder (which is also the case in the blob container). However, the second file would probably have to be parsed as well.
EDIT: We don't have Databricks or a Spark engine in our infrastructure. The Python script runs on local data warehousing software on a virtual machine.
To read .shp files from Azure Blob Storage (private container) without saving them locally, you need to use the Azure Databricks environment. First, mount your storage account in Databricks, then read the shapefile (.shp).
Code for mount:
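A minimal sketch of the mount, assuming account-key authentication with the key stored in a Databricks secret scope; the container, storage account, scope, and key names below are placeholders:
# dbutils is available by default in Databricks notebooks
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/shapefiles",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    }
)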
Now, you can use the following code to read the shapefiles in Databricks:
Code:
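A minimal sketch of the read, assuming geopandas is installed on the cluster and the container was mounted as shown above; the mount point and file name are placeholders:
import geopandas as gpd

# Mounted storage is exposed to local file APIs under /dbfs;
# the accompanying .shx/.dbf files must sit next to the .shp in the same folder
gdf = gpd.read_file("/dbfs/mnt/shapefiles/file.shp")
print(gdf.head())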
Alternatively, to read .shp files from Azure Blob Storage (public container) without saving them locally, you can use the following code:
Code:
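A minimal sketch, assuming the shapefile and its sidecar files (.shx, .dbf, .prj) are packaged in a single ZIP in the public container; the URL below is a placeholder:
import geopandas as gpd

# geopandas (via fiona/pyogrio) can read a zipped shapefile straight from a URL
public_url = "https://<storage-account-name>.blob.core.windows.net/<container-name>/file.zip"
gdf = gpd.read_file(public_url)
print(gdf.head())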