According to the Intel Skylake architecture diagram, one port can be connected to multiple execution units. Can these units work simultaneously?
For example, if an "integer vector multiplication instruction" is issued from port0, it will use the "Vect ALU" unit. This instruction has a latency of 5 cycles according to Agner Fog's instruction tables. On the next cycle, can port0 issue another instruction to the "ALU" unit before the previous vector instruction finishes its execution stage? What about the ALU and FMA units?

Yes, but you'd have a write-back conflict if two uops would produce a result from the same port on the same cycle. For example, if a 1-cycle `add` started on port0 three cycles after an `addps` (4-cycle latency on Skylake), they'd both be ready to produce a result on the same cycle.

I think the scheduler tries to avoid that, and/or something stalls if it happens. With `sqrtsd` latency being slightly variable and long (15-16 cycles), the scheduler can't be perfect, so I assume at least the div/sqrt unit needs a way to stall.

Intel's optimization manual may mention some of this; Andy (Krazy) Glew mentioned in an SO comment having written about some of the complexities in the first version of Intel's compiler-writer's guide for P6.

You can test this if you have a Skylake by running a mix of mostly `add` instructions with the occasional `addps`, and seeing how close you still get to 4 uops per clock.

Or maybe better, shift instructions (p06 only) and `fmul` (p0 only), so you aren't also bumping into the front-end bottleneck of 4 uops / clock. Or `imul` (p1) and `bzhi` (p15).

On Skylake, port 1 is the only port that can handle scalar integer uops with 3-cycle latency; the rest only handle 1-cycle integer uops. That's why `imul`, `lzcnt`, and slow-LEA are all on that port. (It's also the port whose vector ALUs are shut down when there are 512-bit uops in flight, since those presumably work by combining the 256-bit units on p0 and p1 into a 512-bit unit.)
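
To make the write-back conflict concrete, here is a toy model in Python. It is only a sketch of the arithmetic, not of how the real scheduler works: the function name `writeback_conflicts` is made up for illustration, every uop is assumed to be fully pipelined and issued on the same port, and the latencies (4 cycles for `addps`, 1 cycle for `add`) are assumed Skylake figures taken from Agner Fog's tables.

```python
from collections import defaultdict


def writeback_conflicts(uops):
    """Find cycles where several uops on ONE port would write back together.

    uops: list of (name, issue_cycle, latency) tuples, all assumed to be
    issued on the same port. A result is modeled as ready on cycle
    issue_cycle + latency. Returns {cycle: [names]} for every cycle on
    which more than one result would be ready.
    """
    ready = defaultdict(list)
    for name, issue_cycle, latency in uops:
        ready[issue_cycle + latency].append(name)
    return {cycle: names for cycle, names in ready.items() if len(names) > 1}


# Assumed latencies: addps = 4 cycles, add = 1 cycle.
# An add issued 3 cycles after the addps finishes on the same cycle.
clash = [("addps", 0, 4), ("add", 3, 1)]
print(writeback_conflicts(clash))      # {4: ['addps', 'add']}

# Issued back to back, the add's result is ready long before the addps's.
no_clash = [("addps", 0, 4), ("add", 1, 1)]
print(writeback_conflicts(no_clash))   # {}
```

In this model two uops on the same port collide exactly when the gap between their issue cycles equals the difference of their latencies, which is the case a scheduler would have to avoid or resolve with a stall.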