Azure VM Scale Set: how to allocate, run a program, and deallocate reliably?

My task is to run several compute-intensive models in a batch. This seems like an ideal cloud use case. I basically want to spin up 100 computers, run a model, and then shut them down. The model might be different each time, so I don't really want the VMs to auto-run anything on start. I want to tell them which program to run.

My solution is to use an Azure file share and a VM scale set. The various programs sit compiled as EXEs on the file share. Model data and file output also sit on the file share.

I want to control the whole process from .NET (F# code). I don't want to use PowerShell or the portal or whatever. I want other tools to be able to fire off the VMs automatically.

I'm trying to use the Azure.ResourceManager API to control the process:

  • power on the computers
  • use RunCommand to run a script that maps the file share to a drive and then launches my .exe
  • power off and deallocate the computers

This works, but strangely and not reliably.

The biggest single issue is that the provisioning is extremely flaky. Starting with all VM scale set instances in the Deallocated state, when I try to power on, many succeed but a meaningful number get stuck in 'Updating (Running)' status for a long time (I never saw one finish within 30 minutes). The whole point of the exercise is to run a model quickly. If it takes an hour to turn on the computers, it defeats the point.

The other odd thing is that the machines that do turn on seem to immediately start running my script (I can see the model output being produced). However my code doesn't attempt to run scripts on the VMs until they are all on - which happens rarely. It looks as though if a VM ever manages to get a script running then next time it's powered on, it remembers the script and runs it without being told to. Is this expected? I.e. across a power-off and deallocation it seems to remember the previous state. I assumed that on allocation the VM started fresh from the initial image.

Questions:

  1. Is there a reliable way to allocate / run program / deallocate VMs?
  2. What is the VM state after allocation / power on? Is it 'blank slate' or does it remember previous state prior to deallocation?
  3. I'm open to other approaches, e.g. Azure Batch or something. But I prefer to keep it as simple as possible. I find the Azure documentation extremely hard to follow. This seemed like a minimally complex solution.

My basic control program (run locally, or indeed anywhere) looks roughly like this:

let vmss = resourceGroup.GetVirtualMachineScaleSet("myScaleSet").Value
let powerOn = vmss.PowerOn(Azure.WaitUntil.Completed)
let vms =
    vmss.GetVirtualMachineScaleSetVms()
    |> Seq.cast<VirtualMachineScaleSetVmResource>
    |> List.ofSeq
// Start the run script on every VM without waiting for it to finish.
let scripts =
    vms
    |> List.map (fun vm ->
        let name = vm.Id.Name
        let command = Models.RunCommandInput("RunPowerShellScript")

        // Remap the file share to S:, then launch the model with the VM name as its argument.
        command.Script.Add(@"net use S: /delete")
        command.Script.Add(@"net use S: \\fileshare etc.")
        command.Script.Add(@"& S:\MyModel.exe " + name)

        Console.WriteLine("    " + name + " starting script")
        vm.RunCommand(Azure.WaitUntil.Started, command)
    )

Console.Write("Waiting for scripts to complete... ")
let results = scripts |> List.map (fun op -> op.WaitForCompletionResponse())

// Some code to check for when the model has run

// Request power-off on every VM, then wait for all of the requests below.
let powerOff =
    vms
    |> List.map (fun vm ->
        Console.WriteLine("    " + vm.Id.Name + " powering off")
        vm.PowerOff(Azure.WaitUntil.Started)
    )
Console.Write("Waiting for power off... ")
powerOff |> List.iter (fun op -> op.WaitForCompletionResponse() |> ignore)
Console.WriteLine("completed")

Console.Write("Deallocating VMs... ")

vmss.Deallocate(Azure.WaitUntil.Completed) |> ignore
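
One way to keep a stuck instance from blocking the whole batch is to start the power-on without waiting for completion and then poll it against a deadline. The following is only a minimal sketch under assumptions, not a tested fix for the provisioning problem: `waitUntil` is a hypothetical helper, the timeout and interval values are placeholders, and the check at the end is a stand-in for a real status query.

```fsharp
open System
open System.Diagnostics
open System.Threading

/// Hypothetical helper: polls `isDone` until it returns true or `timeout` elapses.
/// Returns true if the work finished in time, false if the deadline passed first.
let waitUntil (timeout: TimeSpan) (interval: TimeSpan) (isDone: unit -> bool) : bool =
    let clock = Stopwatch.StartNew()
    let rec loop () =
        if isDone () then true
        elif clock.Elapsed >= timeout then false
        else
            Thread.Sleep interval
            loop ()
    loop ()

// Stand-in check for illustration only: reports completion on the third poll.
let polls = ref 0
let finished =
    waitUntil (TimeSpan.FromSeconds 5.0) (TimeSpan.FromMilliseconds 100.0) (fun () ->
        polls.Value <- polls.Value + 1
        polls.Value >= 3)

printfn "completed in time: %b" finished
```

With the SDK, the same helper could presumably wrap an operation started with `vmss.PowerOn(Azure.WaitUntil.Started)`, using a check such as `fun () -> op.UpdateStatus() |> ignore; op.HasCompleted` (assuming those members behave as they do on other Azure.Core long-running operations). If it returns false, the control program can carry on with whichever instances are already running instead of waiting indefinitely.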

There is 1 best solution below

Paul Whiting On

I found a solution which works for me:

Make a VMSS from a custom image (https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/tutorial-use-custom-image-powershell)

I set up the custom image so that Task Scheduler runs a batch file automatically at power-on (see "Running a Powershell script from Task Scheduler").

Now on powering on the scale set, all the computers reliably run my program.
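
For reference, the Task Scheduler step can also be scripted rather than clicked through. This is a rough sketch, not necessarily how the image above was set up: it uses the built-in `schtasks.exe` command from F#, and the task name `RunModel` and the path `C:\Scripts\run-model.bat` are hypothetical. It would need to run as administrator on the Windows VM before the image is captured.

```fsharp
open System.Diagnostics

/// Builds the schtasks.exe arguments for a task that runs a batch file at system startup.
/// /SC ONSTART triggers at boot, /RU SYSTEM runs without a logged-in user, /F overwrites an existing task.
let startupTaskArgs (taskName: string) (batchFile: string) : string =
    sprintf "/Create /TN \"%s\" /TR \"%s\" /SC ONSTART /RU SYSTEM /F" taskName batchFile

/// Registers the task by launching schtasks.exe; returns true if it exits with code 0.
/// Windows only, and requires an elevated process.
let registerStartupTask (taskName: string) (batchFile: string) : bool =
    use proc = Process.Start("schtasks.exe", startupTaskArgs taskName batchFile)
    proc.WaitForExit()
    proc.ExitCode = 0

// Hypothetical task name and batch file path, printed here rather than executed.
printfn "schtasks.exe %s" (startupTaskArgs "RunModel" @"C:\Scripts\run-model.bat")
```

Calling `registerStartupTask "RunModel" @"C:\Scripts\run-model.bat"` on the image VM would create the task, so every instance built from that image would run the batch file each time it boots.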