Does anyone know how sample_size works in SD's VAE and UNet? All I know is that SD v1.5 was trained at 512*512, so it generates 512*512 most reliably. But when I set the pipeline to something like 384*384 or even 768*768, it seems to generate those as well (just less correctly).
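For reference, this is roughly what I mean (a minimal sketch with diffusers' StableDiffusionPipeline; the checkpoint and prompt are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint/prompt, only to illustrate the question
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = "a photo of an astronaut riding a horse"

image_512 = pipe(prompt).images[0]                         # default size, matches training resolution
image_384 = pipe(prompt, height=384, width=384).images[0]  # also runs, but quality drops
image_768 = pipe(prompt, height=768, width=768).images[0]  # also runs, often with artifacts/duplication
```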
I have searched all around the official GitHub repo, and it looks like the sample_size setting in the UNet and VAE doesn't really matter, since the models don't use it directly (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py#L163C9-L163C20).
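From what I can tell, the value only ends up determining the pipeline's default height/width, e.g. (a small check, again assuming the runwayml/stable-diffusion-v1-5 checkpoint):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The UNet/VAE forward passes are fully convolutional and never read sample_size;
# as far as I can tell the pipeline only uses it to compute the default resolution.
print(pipe.unet.config.sample_size)                          # 64 (latent-space size)
print(pipe.vae_scale_factor)                                 # 8
print(pipe.unet.config.sample_size * pipe.vae_scale_factor)  # 512, the default height/width
```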
I'm wondering whether SD (or LDMs in general) can generalize to different sample sizes, so that inference is possible at any width and height? If so, how does that work in training and inference?