I want to set up federated learning with the Python package flwr (https://flower.dev/). For this purpose I use Docker Swarm on Google Cloud (3 VMs: 1 server/manager and 2 clients/workers, all based on the Docker image python:3.9-slim). I use an overlay network for communication between the server and the clients.
To connect the clients to the server I use the Docker service name of the server, e.g. 'fl_server'. In that configuration everything works: the clients can connect to the server, and it also works with certificates.
Now I want to make the communication more secure by encrypting the overlay network with --opt encrypted. But as soon as I do that, the clients can't reach the server anymore.
I tried some gRPC options (Flower is based on gRPC), like GRPC_DNS_RESOLVER="native", but to no avail. I logged everything with GRPC_VERBOSITY="DEBUG" and GRPC_TRACE="all". It looks like some kind of timeout to me.
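For reference, a minimal sketch of one way to set these variables from Python instead of in the service definition. The helper name enable_grpc_debug is hypothetical, and the sketch assumes the variables should be in place before grpc (and therefore flwr) is first imported, which is the safe ordering since gRPC reads them from the environment.

```python
import os

# The three settings mentioned above. Assumption: they are set before
# `grpc`/`flwr` is imported so that gRPC sees them when it starts up.
GRPC_DEBUG_ENV = {
    "GRPC_DNS_RESOLVER": "native",
    "GRPC_VERBOSITY": "DEBUG",
    "GRPC_TRACE": "all",
}


def enable_grpc_debug(env=os.environ):
    """Copy the debug settings into the given environment mapping."""
    env.update(GRPC_DEBUG_ENV)
    return dict(GRPC_DEBUG_ENV)


enable_grpc_debug()
# import flwr as fl  # import only after the variables are set
```

The same values can equally be passed as environment entries of the Swarm service; the effect should be identical as long as they are present when the client process starts.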
ICMP and ESP are allowed in the Google Cloud firewall. I also tried it with all ports open.
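To separate name resolution from plain TCP reachability (independent of gRPC and Flower), a small check like the following could be run inside a client container. It is only a sketch: check_tcp is a hypothetical helper, and the host and port in the usage comment are the ones from this post.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Resolve host, then attempt a plain TCP connect; report each step."""
    result = {"resolved": [], "connected": False, "error": None}
    try:
        # Step 1: name resolution (on the overlay this goes to Docker's DNS).
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    try:
        # Step 2: TCP handshake only, no TLS and no gRPC involved.
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example, run inside a client container:
# print(check_tcp("fl_server", 8080))
```

If the name resolves but the connect only times out on the encrypted network, that would suggest the problem sits below gRPC, in the overlay traffic between the nodes.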
Here's the client code snippet:

    fl.client.start_numpy_client(
        server_address="fl_server:8080",
        root_certificates=root_certificates,
        client=FLClient(
            model,
            X_train, y_train, X_test, y_test,
            model_attributes,
        ),
    )

And here's the server snippet:

    fl.server.start_server(
        server_address="0.0.0.0:8080",
        strategy=strategy,
        config=fl.server.ServerConfig(num_rounds=5),
        certificates=certificates,
    )

Here's the client log: https://pastebin.com/raw/ty2xrXQS
(I don't know what to cut out because I don't know what's important. Sorry)
The question boils down to: why does encrypting the Docker overlay network break gRPC communication?
Does anyone have a clue what to do? Thanks in advance.