Can execution units that belong to the same port work simultaneously?

72 Views Asked by Frontier_Setter At 22 August 2023 at 13:30

According to Intel's Skylake architecture diagram, one port can be connected to multiple execution units. Can these units work simultaneously?

For example, if an integer vector multiply instruction is dispatched to port0, it will use the "Vect ALU" unit. This instruction has a latency of 5 cycles according to Agner Fog's instruction tables. In the next cycle, can port0 dispatch another instruction to the "ALU" unit before the previous vector instruction finishes its execution stage? What about the ALU and FMA units?


There are 2 best solutions below

Peter Cordes On 22 August 2023 at 19:07

Yes, but you'd have a write-back conflict if two uops on the same port would produce a result in the same cycle. For example, if an add started on port0 two cycles after an addps, they'd both be ready to produce a result the cycle after that.

1                      addps    starts
2                         |
3    add                  v
4   ready (1c lat) |  ready (4c latency)

I think the scheduler tries to avoid that, and/or something stalls if it happens. With sqrtsd latency being slightly variable and long (15-16 cycles), the scheduler can't be perfect so I assume at least the div/sqrt unit needs a way to stall.
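To make the idea of a write-back conflict concrete, here is a minimal toy model in Python. It is only a sketch: the function name `writeback_conflicts` and the tuple layout are made up for illustration, and it assumes the simple convention that a uop issued in cycle t with latency L delivers its result in cycle t + L. Real hardware may count cycles differently by one, and nothing here models what the scheduler actually does about the conflict.

```python
from collections import defaultdict

def writeback_conflicts(uops):
    """Find (port, cycle) slots where more than one result arrives.

    uops: list of (name, port, issue_cycle, latency) tuples.
    Toy convention (an assumption): result cycle = issue_cycle + latency.
    """
    by_slot = defaultdict(list)
    for name, port, issue_cycle, latency in uops:
        by_slot[(port, issue_cycle + latency)].append(name)
    # Keep only the slots that two or more uops want to use.
    return {slot: names for slot, names in by_slot.items() if len(names) > 1}

uops = [
    ("imul", 1, 1, 3),  # 3-cycle uop issued on port 1 in cycle 1
    ("add",  1, 3, 1),  # 1-cycle uop issued on the same port in cycle 3
]
print(writeback_conflicts(uops))  # {(1, 4): ['imul', 'add']}
```

In this model a short-latency uop collides with a longer one whenever it issues exactly (long latency minus short latency) cycles later on the same port, which is the situation a scheduler would have to avoid or resolve with a stall.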

Intel's optimization manual may mention some of this; Andy (Krazy) Glew mentioned in an SO comment having written about some of the complexities in the first version of Intel's compiler-writer's guide for P6.

You can test this if you have a Skylake by running a mix of mostly add instructions with the occasional addps, and seeing how close you still get to 4 uops per clock.

Or maybe better, shift instructions (p06 only) and fmul (p0 only), so you aren't also bumping into the front-end bottleneck of 4 uops / clock. Or imul (p1) and bzhi (p15).
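When running such a test, it helps to know what throughput the port assignments alone would allow, so any shortfall can be attributed to something else (such as write-back conflicts). The following Python sketch brute-forces the best port assignment for one loop iteration. The function name `best_cycles_per_iteration` is hypothetical, and the model is an assumption-laden simplification: it ignores the front end, dependency chains, and write-back conflicts, and it only considers whole-iteration assignments.

```python
from itertools import product

def best_cycles_per_iteration(uops):
    """Lower bound on cycles per iteration from port pressure alone.

    uops: list of tuples, each giving the ports one uop may use.
    Tries every assignment and returns the smallest worst-case port load.
    """
    best = None
    for assignment in product(*uops):
        load = {}
        for port in assignment:
            load[port] = load.get(port, 0) + 1
        worst = max(load.values())
        if best is None or worst < best:
            best = worst
    return best

# Three shifts (p06) plus one fmul (p0 only) per iteration.
loop_body = [(0, 6)] * 3 + [(0,)]
cycles = best_cycles_per_iteration(loop_body)
print(cycles, len(loop_body) / cycles)  # 2 2.0
```

For this mix the bound is 2 uops per clock, well under the 4 uops per clock front-end limit, so a measured rate noticeably below 2 would point at something other than port pressure or the front end.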

On Skylake, port 1 is the only port that can handle scalar integer uops with 3-cycle latency; the other integer ALU ports only handle 1-cycle integer uops. That's why imul, lzcnt, and slow-LEA are all on that port. (It's also the port whose vector ALUs are shut down when there are 512-bit uops in flight, since those presumably work by combining the 256-bit units on p0 and p1 into a 512-bit unit.)

John D McCalpin On 22 August 2023 at 20:03

The "port" only controls the dispatch of uops to the set of functional units behind that port, limiting that dispatch to 1 uop per cycle per port.

Pipelined multi-cycle uops will continue executing in their respective functional units while subsequent uops issued to the same port can go to the same or different functional units. Since only one uop per cycle can come through a port, no two uops can be executing in the same pipeline stage of a functional unit, but otherwise interleaving is common - including uops that belong to different thread contexts.

Non-pipelined multi-cycle uops will block the functional unit until they complete, but will not block the port (and therefore not block the other functional units behind the port).
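The rules described above can be sketched as a small Python simulation of a single port. This is a toy model, not a description of real hardware: the function name `simulate_port`, the unit names, and the latencies (in particular the 4-cycle non-pipelined "div" unit) are illustrative assumptions. The port dispatches at most one uop per cycle, a pipelined unit can accept a new uop every cycle, and a non-pipelined unit is blocked until its current uop completes. Write-back conflicts are deliberately ignored.

```python
def simulate_port(uops, units):
    """Simulate dispatch through one port.

    uops:  list of (name, unit) in age order (oldest first).
    units: dict mapping unit -> (latency, pipelined).
    Returns a list of (name, dispatch_cycle, done_cycle).
    """
    # First cycle at which each unit can accept another uop.
    free_at = {unit: 0 for unit in units}
    pending = list(uops)
    schedule = []
    cycle = 0
    while pending:
        # One dispatch per cycle: pick the oldest uop whose unit is free.
        for i, (name, unit) in enumerate(pending):
            if free_at[unit] <= cycle:
                latency, pipelined = units[unit]
                # A pipelined unit is free again next cycle;
                # a non-pipelined unit is blocked for the full latency.
                free_at[unit] = cycle + (1 if pipelined else latency)
                schedule.append((name, cycle, cycle + latency))
                del pending[i]
                break
        cycle += 1
    return schedule

units = {
    "alu": (1, True),       # 1-cycle, pipelined
    "vec_mul": (5, True),   # 5-cycle, pipelined
    "div": (4, False),      # hypothetical 4-cycle, non-pipelined
}
uops = [("div1", "div"), ("div2", "div"), ("mul1", "vec_mul"),
        ("add1", "alu"), ("mul2", "vec_mul")]
for entry in simulate_port(uops, units):
    print(entry)
```

In this run div2 has to wait until cycle 4 because the divider is occupied, but mul1, add1, and mul2 still dispatch in cycles 1, 2, and 3: the blocked functional unit does not block the port, and the two multiplies overlap inside the pipelined unit.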
