On-premise Azure Service Fabric Services upgrade fails with no obvious error messages


I am trying to upgrade a 3-node on-premise Service Fabric cluster from 9.0.1553.9590 to 9.1.1883.9590 (it was previously upgraded successfully from 7.x to 8.x and on to 9.0).

I am able to trigger the upgrade as I did for the previous versions, and the first node appears to upgrade successfully, but the health checks before starting the second upgrade domain seem to fail and the upgrade is rolled back.

Noting that:

  • The cluster appears otherwise healthy
  • I have tried doubling the upgrade timeout from 10 minutes to 20 minutes, with the same result
  • The Windows Event Log shows no obvious errors, and the various Service Fabric processes all appear to load correctly, as you might expect given the first point
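
For reference, the monitored upgrade was triggered roughly like this (a sketch; it assumes the 9.1 package has already been copied and registered in the image store, and the timeout values shown are the doubled 20-minute settings):

```powershell
# Connect to the cluster (assumes the default client connection port 19000)
Connect-ServiceFabricCluster -ConnectionEndpoint "localhost:19000"

# Start a monitored code upgrade that rolls back automatically on failure.
# Timeout values are illustrative (1200 s = 20 minutes).
Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion "9.1.1883.9590" `
    -Monitored -FailureAction Rollback `
    -UpgradeTimeoutSec 1200 -UpgradeDomainTimeoutSec 1200
```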

So does anyone have any suggestions on how to debug this, or where to find any detailed upgrade logs?
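
For anyone hitting something similar, the upgrade state and the health evaluations that triggered the rollback can be inspected from PowerShell (a sketch; assumes a connection to the cluster endpoint):

```powershell
# Connect to the cluster (assumes the default client connection port 19000)
Connect-ServiceFabricCluster -ConnectionEndpoint "localhost:19000"

# Shows the current/last upgrade, which upgrade domain it stalled on,
# and any unhealthy evaluations behind the rollback
Get-ServiceFabricClusterUpgrade

# Overall cluster health, including per-node health states
Get-ServiceFabricClusterHealth
```

On a standalone cluster, the detailed trace files typically land under the diagnostics store path configured in the cluster's ClusterConfig.json (the exact location depends on how the cluster was set up).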



Answer by Richard Fennell

The solution, it turned out, was to reboot each of the VMs in the cluster in turn, so that the cluster remained operational, i.e. 2 of the 3 nodes were running at all times.
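
That rolling reboot can be done safely by deactivating each node first, so the cluster knows a restart is coming before the VM goes down (a sketch; the node names are hypothetical, and in practice you would wait for each node to report `Up` before moving to the next):

```powershell
Connect-ServiceFabricCluster -ConnectionEndpoint "localhost:19000"

# Hypothetical node names for a 3-node cluster
foreach ($node in "vm0", "vm1", "vm2") {
    # Tell the cluster this node is going down for a restart
    Disable-ServiceFabricNode -NodeName $node -Intent Restart -Force

    # Reboot the underlying VM (assumes PowerShell remoting is enabled)
    Restart-Computer -ComputerName $node -Wait -Force

    # Re-activate the node before touching the next one,
    # keeping 2 of the 3 nodes running at all times
    Enable-ServiceFabricNode -NodeName $node
}
```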

We then reran the cluster upgrade and it completed successfully. The only odd difference was that the first node in the cluster did an extra reboot during the upgrade; usually the OS is not rebooted during a Service Fabric upgrade. The second and third nodes did not do an extra reboot.

The assumption is that the issue was a lock on some file that was released by the reboot.