|
This is wrong on so many levels that I just can't keep quiet about it... sorry charge-n-go, nothing personal.
| QUOTE | | P4E needs very long pipeline stages in order to ramp up to high clock speed in the future. |
"very long pipeline stages": Prescott, aka P4E, has a long pipeline. A long pipeline has more stages, and each stage does less work, so each stage is shorter. "Very long pipeline stages" is a contradiction.
"P4E needs": "Need" isn't the right word to use here. The Intel designers could have used lots of other strategies to extend the P4 design. They just chose to lengthen the pipeline because it's the easiest way to ramp up the clock.
| QUOTE | | It also uses larger cache to compensate the long pipeline penalty. |
The L2 cache is the second level away from the processor execution core, so any instruction cache miss there means that the 12K-micro-operation trace cache has already missed in the first place, which has probably already caused a significant delay. The expanded 16 KB data cache doesn't help either, since pipeline penalties are instruction-related, not data-related. So expanded caches do little to help or impede performance in a CPU with a long pipeline.
| QUOTE | | It's normal for longer pipeline CPU to perform weaker due to the pipeline latency in every stages, if 1 stage causes 1ns latency, 31 stages causes 31ns latency compare to 20ns latency on northwood. |
Latency is normally measured in clock ticks, not nanoseconds. Furthermore, the P4 "NetBurst" architecture features a double-pumped ALU, so the total in-flight time of an instruction is harder still to calculate.
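To put rough numbers on the ticks-versus-nanoseconds point, here is a minimal sketch. The 3.0 GHz clock is a made-up figure for illustration, not a measured one, and the helper names are mine. It also ignores the double-pumped ALU and assumes every stage takes exactly one clock cycle.

```python
def pipeline_latency_ns(stages, clock_ghz):
    """End-to-end latency of one instruction, assuming 1 stage = 1 clock cycle.

    One cycle at clock_ghz GHz lasts 1 / clock_ghz nanoseconds.
    """
    return stages / clock_ghz


def break_even_clock_ghz(short_stages, short_clock_ghz, long_stages):
    """Clock the longer pipeline needs to match the shorter one's latency in ns."""
    return short_clock_ghz * long_stages / short_stages


# Hypothetical clock speed, chosen only for illustration.
CLOCK_GHZ = 3.0

for stages in (20, 31):
    latency = pipeline_latency_ns(stages, CLOCK_GHZ)
    print(f"{stages} stages at {CLOCK_GHZ} GHz: {stages} ticks = {latency:.2f} ns")

needed = break_even_clock_ghz(20, CLOCK_GHZ, 31)
print(f"31 stages match 20 stages at {CLOCK_GHZ} GHz once clocked at {needed:.2f} GHz")
```

The latency in ticks is fixed by the design, but the latency in nanoseconds shrinks as the clock goes up, which is why "1 ns per stage" isn't a useful way to compare the two chips.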
| QUOTE | | Moreover, longer pipeline means a task can only be accomplished after 31 stages of pipeline, compare to 20 stages of northwood. |
This is only true for the first instruction in an instruction stream... Each following instruction is retired one cycle after the previous one, so long as there are no pipeline stalls in between and no branch mispredictions/pipeline flushes. The whole point behind having a pipeline is that when an instruction "clears" a stage, that stage is ready to accept the next one, like the stations in a production line.
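The production-line argument can be sketched in a few lines. This is an idealised model, not a P4 simulator: it assumes a single-issue pipeline with no stalls and no flushes, and the function name is just something I made up for the example.

```python
def cycles_to_retire(num_instructions, pipeline_depth):
    """Cycles needed to retire a stream on an ideal, stall-free, single-issue pipeline.

    The first instruction needs pipeline_depth cycles to travel through every
    stage; each later instruction retires one cycle after the one before it.
    """
    if num_instructions <= 0:
        return 0
    return pipeline_depth + (num_instructions - 1)


for depth in (20, 31):
    total = cycles_to_retire(1000, depth)
    print(f"{depth}-stage pipeline: {total} cycles for 1000 instructions "
          f"({total / 1000:.3f} cycles per instruction)")
```

With 1000 instructions the 31-stage pipeline needs only 11 more cycles in total than the 20-stage one (1030 vs 1019), so the extra depth is paid once per uninterrupted stream, not once per instruction. The real cost shows up when a misprediction forces a flush and the pipeline has to be refilled.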
| QUOTE | | Main advantage of having long pipeline is to ramp up clock speed , because each task are divided into more stages, where each stage equal to 1 clock cycle. |
This part, you got right.
It's a misconception that every instruction needs to go through all the 20/31 pipeline stages while executing. The whole purpose of having the "trace cache" is to store decoded instructions, so that, in the case of a loop, the whole decode section of the pipeline can be skipped for the duration of the loop. If roughly 1/3 of the pipeline is solely for decoding instructions, that 1/3 can be skipped completely in a highly repeated loop, which is a significant number of cycles saved.
One last point is that the P4 is a superscalar architecture, meaning that after the initial decode stage, the pipeline "splits" into different execution units for integer, floating point, SSE, and MMX instructions. That means the "31 stages" of the pipeline might actually represent the longest path through the pipeline, which is normally the floating point path. Less complicated instructions might fly through the pipeline in fewer stages. I don't have any evidence to back up this claim yet... I'll update later if I find anything to prove or disprove this.
This post has been edited by silkworm: Apr 9 2004, 03:47 PM
|