I want to set up federated learning with the Python package flwr (https://flower.dev/). For this purpose I use Docker Swarm on Google Cloud (3 VMs: 1 server/manager and 2 clients/workers, all based on the Docker image python:3.9-slim). I use an overlay network for the communication between server and clients.

To connect the clients to the server I use the Docker service name of the server, e.g. 'fl_server'. In that configuration everything works: the clients can connect to the server, also with certificates.

Now I want to make the communication more secure by encrypting the overlay network with --opt encrypted. But as soon as I do that, the clients can't reach the server anymore.

I tried some gRPC options (Flower is based on gRPC), like GRPC_DNS_RESOLVER="native", but to no avail. I logged everything with GRPC_VERBOSITY="DEBUG" and GRPC_TRACE="all". It looks like some kind of timeout to me.
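For reference, the gRPC variables mentioned above can also be set from Python instead of the service definition. This is only a minimal sketch: `apply_grpc_debug_env` is a hypothetical helper name, and it assumes the variables have to be in place before flwr (and with it grpc) is imported.

```python
import os

# The three variables named in the question; the values are the ones I tried.
GRPC_DEBUG_ENV = {
    "GRPC_VERBOSITY": "DEBUG",
    "GRPC_TRACE": "all",
    "GRPC_DNS_RESOLVER": "native",
}


def apply_grpc_debug_env(env=os.environ):
    """Copy the debug settings into the given environment mapping."""
    env.update(GRPC_DEBUG_ENV)
    return env


apply_grpc_debug_env()
# Assumption: import flwr only after this point, so that gRPC sees the
# variables when it initializes.
```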

ICMP and ESP traffic is allowed in the Google Cloud firewall. I also tried it with all ports open.
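One way to narrow this down would be to check name resolution and the plain TCP connect separately from inside a client container, without gRPC involved. The following is a minimal sketch using only the standard library; `check_endpoint` is a hypothetical helper (not part of flwr), and the host and port are the ones from my setup.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> dict:
    """Report DNS resolution and TCP connect as two separate steps."""
    result = {"resolved": None, "connected": False, "error": None}

    # Step 1: does the service name resolve at all?
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result

    # Step 2: can a TCP connection be opened within the timeout?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:  # includes timeouts and refused connections
        result["error"] = f"TCP connect failed: {exc}"
    return result


if __name__ == "__main__":
    print(check_endpoint("fl_server", 8080))
```

If the name resolves but the connect times out only on the encrypted network, that would point at the overlay traffic between the nodes rather than at gRPC or Flower itself.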

Here's the client code snippet:

    fl.client.start_numpy_client(
        server_address="fl_server:8080",  # Docker service name of the server
        root_certificates=root_certificates,
        client=FLClient(model,
                        X_train, y_train, X_test, y_test,
                        model_attributes))

And here's the server snippet:

    fl.server.start_server(
        server_address="0.0.0.0:8080",
        strategy=strategy,
        config=fl.server.ServerConfig(num_rounds=5),
        certificates=certificates
    )

Here's the client log: https://pastebin.com/raw/ty2xrXQS

(I don't know what to cut out because I don't know which parts are important. Sorry.)

The question boils down to: why does enabling encryption on the Docker overlay network break gRPC communication?

Does anyone have a clue what to do? Thanks in advance.
