When a team at IBM coined the word "superscalar" in 1987, they wrote a widely cited research paper arguing that a "superscalar" CPU is better than a "vector" CPU (vector CPUs were well known at that time and had been used for more than a decade, especially in supercomputers), and that all vector CPUs should therefore be replaced with superior superscalar CPUs.

Their theory was, in essence, that when you have 16 scalar ALUs it is always better to be able to use them individually than to have all of them driven by a single instruction.

If you have hardware that allows the 16 scalar ALUs to be used individually, by 16 independent instructions, that is obviously much more powerful. It can handle all the cases that can be handled by SIMD (a single instruction driving all the ALUs at once), but also many other cases.

The IBM research paper was extremely influential. Before it, the target for most CPU design teams was a pipelined CPU able to execute 1 instruction per clock cycle; after it, everybody switched to trying to design CPUs with an IPC (instructions per clock cycle) as high as possible. Even Intel, which prioritized CPU production cost over CPU performance, evolved through the 80486 (1989), Pentium (1993) and Pentium Pro (1995), eventually reaching the stage of having a high-performance superscalar CPU.

Nevertheless, around 1995 the initial hype about superscalar CPUs began to dissipate. It became understood that even if a superscalar CPU would always be faster than a vector CPU with the same number of ALUs, the cost of the superscalar CPU increases very quickly, super-linearly, with the number of ALUs. For high enough values, doubling the number of ALUs in a superscalar CPU increases both the area and the power consumption by a factor much greater than 2.

The result of understanding the limitations of superscalar CPUs was a small step back: vector CPUs were resurrected, but in combination with superscalar CPUs. All modern CPUs now use this combination, instead of using only one of the two design variants.

For instance, Zen 5 contains 32 FP64 adders. However, it can do only 4 independent FP64 operations simultaneously.

Therefore it can do 4 scalar FP64 additions per cycle. Or it can gang pairs of ALUs and do 4 additions of length-2 FP64 vectors per cycle. Or it can gang groups of 4 ALUs and do 4 additions of length-4 FP64 vectors per cycle. Or it can gang groups of 8 ALUs and do 4 additions of length-8 FP64 vectors per cycle.

Only in the last case are all 32 FP64 adders in use.
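
To make the arithmetic concrete, here is a small Python sketch that tabulates the four cases. It uses only the two figures quoted above (32 FP64 adders, 4 independent operations per cycle); the names `TOTAL_ADDERS`, `PORTS` and `adders_used` are made up for illustration, and this is a back-of-the-envelope count, not a model of the real pipeline.

```python
# Figures quoted in the text: 32 FP64 adders, 4 independent operations per cycle.
TOTAL_ADDERS = 32
PORTS = 4


def adders_used(vector_length: int) -> int:
    """Adders busy per cycle when every port issues one add of this vector length."""
    return PORTS * vector_length


# One row per case: (vector length, adders busy, fraction of all adders).
rows = [(n, adders_used(n), adders_used(n) / TOTAL_ADDERS) for n in (1, 2, 4, 8)]

for n, used, frac in rows:
    print(f"length-{n} adds: {used:2d} of {TOTAL_ADDERS} adders busy ({frac:.1%})")
```

With purely scalar code only 4 of the 32 adders (12.5%) do useful work per cycle; the fraction doubles with each doubling of the vector length.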

The same holds for all modern CPUs. All have 2 limits: one is the number of ALUs, and the other is the number of simultaneous independent operations, a.k.a. the number of execution ports.

The number of execution ports limits the number of independent scalar operations. The total number of ALUs limits the number of scalar-equivalent operations when the operations are not independent and the ALUs are ganged for vector operations.
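
The two limits can be written down as a one-line rule of thumb: peak scalar-equivalent operations per cycle is the smaller of (ports × vector length) and the ALU count. The Python sketch below is only that simplified rule, with a hypothetical function name and parameters; it ignores everything else that bounds real throughput (decode width, dependencies, memory bandwidth).

```python
def peak_ops_per_cycle(ports: int, alus: int, lanes: int) -> tuple[int, str]:
    """Return (scalar-equivalent ops per cycle, name of the limit that binds).

    ports: simultaneous independent operations (execution ports)
    alus:  total number of scalar ALUs
    lanes: elements per operation (1 for scalar code)
    """
    by_ports = ports * lanes  # each port issues one operation on `lanes` elements
    if by_ports < alus:
        return by_ports, "ports"
    if by_ports > alus:
        return alus, "alus"
    return alus, "both"


# Using the Zen 5 FP64 figures from the text (4 ports, 32 adders):
for lanes in (1, 2, 4, 8):
    ops, limit = peak_ops_per_cycle(ports=4, alus=32, lanes=lanes)
    print(f"lanes={lanes}: {ops} ops/cycle, limited by {limit}")
```

For scalar code the port count is the binding limit; only at the full vector length do both limits coincide.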