How do I set up the netCDF4 package in a multistage Docker build?


I have an existing Dockerfile that runs a Python program involving netCDF4. Here's a simplified version:

ARG BASE_IMG=python:3.11-slim
ARG VENV="/opt/venv"

# ------------------------------ #
FROM $BASE_IMG
ARG VENV

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y python3-dev libhdf5-dev libnetcdf-dev

RUN python -m venv $VENV
ENV PATH="$VENV/bin:$PATH"

RUN pip install numpy~=1.23.5 netcdf4~=1.6.4 h5py~=3.9.0

COPY test.py test.py

ENTRYPOINT ["python", "-m", "test"]

My full Dockerfile involves some C++ compilation as well, and I want to convert this into a multistage build so the compilation tools don't end up in my final image. While I'm at it, I figured I could also pip install my Python packages in the compile stage, and move the whole venv over to the final stage like so:

ARG BASE_IMG=python:3.11-slim
ARG VENV="/opt/venv"

FROM $BASE_IMG AS compile-image
ARG VENV

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y python3-dev libhdf5-dev libnetcdf-dev

RUN python -m venv $VENV
ENV PATH="$VENV/bin:$PATH"

RUN pip install numpy~=1.23.5 netcdf4~=1.6.4 h5py~=3.9.0

# ------------------------------ #
FROM $BASE_IMG
ARG VENV

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y libhdf5-dev libnetcdf-dev

COPY --from=compile-image $VENV $VENV
ENV PATH="$VENV/bin:$PATH"

COPY test.py test.py

ENTRYPOINT ["python", "-m", "test"]

This works great, except that copying the netCDF4 package over this way results in a large slowdown in netCDF read/write operations. If I make an otherwise identical Dockerfile where I instead install netCDF4 directly in the final stage, I don't see the slowdown, so I suspect netCDF4 depends on some external C library that I also need to copy over. How can I determine whether netCDF4 has linked to all its libraries correctly, or what specifically I need to copy over to make this work?
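One way to investigate is to inspect which shared libraries the compiled netCDF4 extension actually resolves at load time in each image, and diff the two. A sketch, assuming the two builds are tagged `netcdf-single` and `netcdf-multi` (hypothetical names; repeat each command for both tags and compare):

```shell
# List the shared libraries the netCDF4 C extension links against.
# netCDF4._netCDF4 is the compiled extension module; its __file__ is the .so path.
docker run --rm --entrypoint sh netcdf-multi -c \
  'ldd "$(python -c "import netCDF4._netCDF4 as m; print(m.__file__)")"'

# The netCDF and HDF5 library versions the package was built against can also
# be read from the package itself:
docker run --rm --entrypoint python netcdf-multi -c \
  'import netCDF4; print(netCDF4.__netcdf4libversion__, netCDF4.__hdf5libversion__)'
```

If the `ldd` output differs between the two images (for example, libraries resolving to different paths, or showing "not found"), that points to the dependency you still need to install or copy into the final stage.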

Answer from datawookie:

I used test_echam_spectral-deflated.nc as a test file. I'm not sure what you are doing with the data, but my test script loads all variables from the .nc file:

test.py

import time

import numpy as np
from netCDF4 import Dataset

netcdf_file_path = '/data/test_echam_spectral-deflated.nc'

start_time = time.time()

# Open the file and force every variable to be fully read into memory.
dataset = Dataset(netcdf_file_path, mode='r')

for var in dataset.variables:
    np.array(dataset.variables[var][:])

end_time = time.time()

elapsed_time = end_time - start_time

print(f"Time taken to load the NetCDF file: {elapsed_time} seconds")

dataset.close()

The data are shared with a container via a volume mount.
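For reference, this is roughly how I build and run it; the image tag `netcdf-test` and the host `data/` directory are assumptions, not from your setup:

```shell
# Build the image from the Dockerfile in the current directory, then run it
# with the directory containing the .nc test file mounted read-only at /data.
docker build -t netcdf-test .
docker run --rm -v "$PWD/data:/data:ro" netcdf-test
```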

I don't see a significant difference in load times. Here are the timings with your first Dockerfile:

[screenshot: timing output for the single-stage Dockerfile]

And this is the second, multistage Dockerfile:

[screenshot: timing output for the multistage Dockerfile]

Could you please provide more information that can be used to replicate the issue?