This guy does NOT like the Pentium 4


  • This guy does NOT like the Pentium 4


    Ouch, he's really steamed
    "That's right fool! Now I'm a flying talking donkey!"

    P4 2.66, 512 MB PC2700, ATI Radeon 9000, Seagate Barracuda IV 80 GB, Acer AL732 17" TFT

  • #2
    This thing is quite old.
    But even though he's a bit of an argumentative guy, he's also a quite gifted programmer and definitely knows what he's talking about. You should especially read the part about the P-IV's missing pipelines to feed all its integer/FPU cores at once - now that's really a triple D'oh for Intel: giving the CPU those extra cores and then scrapping the pipelines needed for them all to really operate at once....
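    Roughly what "feeding all the integer/FPU cores at once" means in practice - a minimal C sketch of my own (nothing from his article): a loop body mixing independent integer and floating-point work, which the CPU can only overlap if it has enough issue pipelines to dispatch to both kinds of units every cycle.

    #include <stddef.h>

    /* Independent integer and FP work in one loop body. Whether the integer add
       and the FP multiply/accumulate actually execute in the same cycle depends
       on the CPU having enough issue ports/pipelines to feed both units. */
    double mixed_work(const double *f, const int *v, size_t n, long *int_sum_out)
    {
        double fp_sum = 0.0;
        long   int_sum = 0;
        for (size_t k = 0; k < n; k++) {
            fp_sum  += f[k] * 1.0001;   /* FP unit */
            int_sum += v[k];            /* integer unit, independent of fp_sum */
        }
        *int_sum_out = int_sum;
        return fp_sum;
    }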
    But we named the *dog* Indiana...
    My System
    2nd System (not for Windows lovers )
    German ATI-forum



    • #3
      Well, I would be pissed too if all my hard work suddenly didn't look so good anymore due to a new technology.
      Here is a guy who has been optimizing his emulators for a specific architecture, and suddenly a new one comes along which is totally different!

      About the claims he makes... there are a lot of different ways of doing/programming things, and somewhere on the Web I have seen an article where he and Tim Sweeney debate ways to optimize things!

      I'm not saying the current P4 is perfect, far from it, but it's a step in the right direction and hopefully Northwood will be another step in the right direction.
      Fear, Makes Wise Men Foolish !
      incentivize transparent paradigms



      • #4
        But no optimization can make up for the fact that Intel equipped the PIV with all those mighty (and costly!) CPU/FPU cores and then afterwards stripped out the pipelines needed to feed them all at once (btw, those pipelines apparently were there in the early tech papers...).

        So if you really optimize your code, making sure that the FPU/INT cores get used in an optimal way, you still might be trapped by the pipeline issues (and pipeline stalls are a MAJOR slowdown on the PIV).
        Again: the current implementation of the PIV seems a bit "beta" to me. Maybe they'll get the real thing out soon.
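        To make the pipeline-stall point concrete, a generic C sketch (my own, not his code): on a long FP pipeline, a loop in which every operation depends on the previous result exposes the full latency on every iteration, while splitting the work across independent accumulators lets the pipeline stay full.

        #include <stddef.h>

        /* Dependent chain: each add must wait for the previous result,
           so a deep FP pipeline stalls on every iteration. */
        double sum_dependent(const double *a, size_t n)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        /* Independent accumulators: four adds can be in flight at once,
           so the pipeline can be kept fed. */
        double sum_unrolled(const double *a, size_t n)
        {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t i;
            for (i = 0; i + 4 <= n; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
            }
            for (; i < n; i++)
                s0 += a[i];
            return (s0 + s1) + (s2 + s3);
        }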
        But we named the *dog* Indiana...
        My System
        2nd System (not for Windows lovers )
        German ATI-forum



        • #5
          If Intel decides to put two FPUs in Northwood instead of the one they currently have in Willamette, things could look very different and pipeline stalls could be less of a problem. If they don't, then I think the only difference could be in the trace cache + a larger L1 + L2.

          And yes... the current P4 looks much like the Pentium Pro, which was replaced by the PII.

          If he so dislikes this issue, then why doesn't he write code to take advantage of SSE2?
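          For what "taking advantage of SSE2" looks like, here is a minimal C sketch using the standard <emmintrin.h> intrinsics (my own toy example, nothing to do with his emulator): two doubles are added per instruction instead of one at a time through the x87 FPU.

          #include <emmintrin.h>   /* SSE2 intrinsics */
          #include <stddef.h>

          /* c[i] = a[i] + b[i], two doubles per SSE2 add.
             Assumes a, b and c are 16-byte aligned; any n is handled. */
          void add_sse2(const double *a, const double *b, double *c, size_t n)
          {
              size_t i;
              for (i = 0; i + 2 <= n; i += 2) {
                  __m128d va = _mm_load_pd(&a[i]);
                  __m128d vb = _mm_load_pd(&b[i]);
                  _mm_store_pd(&c[i], _mm_add_pd(va, vb));
              }
              for (; i < n; i++)       /* scalar tail */
                  c[i] = a[i] + b[i];
          }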
          Fear, Makes Wise Men Foolish !
          incentivize transparent paradigms



          • #6
            SSE2 ("Screaming Sandy Extensions, twice") is not a magic bullet!

            It isn't the glorified solution to all the P4's speed problems!!
            If there's artificial intelligence, there's bound to be some artificial stupidity.

            Jeremy Clarkson "806 brake horsepower..and that on that limp wrist faerie liquid the Americans call petrol, if you run it on the more explosive jungle juice we have in Europe you'd be getting 850 brake horsepower..."



            • #7
              Never said it was, but it can fix the specific problem with the FPU in this case.
              Fear, Makes Wise Men Foolish !
              incentivize transparent paradigms



              • #8
                It's just that this reminds me so much of the K6-2 days:

                "If everybody uses 3DNow!, our K6-2 will be faster than a P2 at the same MHz," said AMD.

                "A real CPU, like ours, has a real FPU," said Intel!

                Joe Average:
                "Duh... I haven't seen one game that has 3DNow optimising or whatever they call it, it's cheating anyway."

                Now Intel comes out with an FPU-weak CPU and says the same thing, but I haven't heard one reviewer make the connection that Intel is actually using the same argument that AMD has stopped using.

                Many (but not all) just scream, "But why don't they use SSE2-optimised applications?"

                Intel condemned that approach at the time but is now embracing it!

                And optimizing for SSE2 would in his case only benefit a very tiny part of his userbase.

                I have a big dislike for "spiffy new instructions that need special compiling or optimising", because it means that I could be left out if I use the wrong CPU!

                And the above is aimed at BOTH Intel and AMD!!
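                The usual way around "being left out on the wrong CPU" is to detect the instruction set at run time and pick a code path - a rough C sketch, assuming a GCC-style <cpuid.h> (the SSE2 flag is bit 26 of EDX in CPUID leaf 1):

                #include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

                /* Returns 1 if the CPU reports SSE2 support, 0 otherwise. */
                static int has_sse2(void)
                {
                    unsigned int eax, ebx, ecx, edx;
                    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                        return 0;
                    return (edx >> 26) & 1;   /* EDX bit 26 = SSE2 */
                }

                An application would call this once at startup and install function pointers to either the SSE2 routines or the plain x87 ones, so the same binary runs on every CPU.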
                If there's artificial intelligence, there's bound to be some artificial stupidity.

                Jeremy Clarkson "806 brake horsepower..and that on that limp wrist faerie liquid the Americans call petrol, if you run it on the more explosive jungle juice we have in Europe you'd be getting 850 brake horsepower..."



                • #9
                  AMD might adopt SSE2 for their next CPUs; then we'll quite likely see SSE2 optimizations.
                  But if you hear game developers talk, they don't care that much about SSE2 ("too complicated to use", "not the instructions needed for games", ...); nearly all of them prefer and really like NVIDIA's programmable vertex shaders.
                  But we named the *dog* Indiana...
                  My System
                  2nd System (not for Windows lovers )
                  German ATI-forum



                  • #10
                    I don't dislike new things... normally new things mean progress.
                    In the case of the FPU vs anything else, the x86 FPU was getting to a point where the only way to make it better was to increase the number of FPUs (3 in AMD's case), so instead *ntel decided to use another approach -> SSE2, which also means that more instructions can be executed simultaneously (under the right circumstances).

                    One of the big differences between RISC and x86 used to be FPU performance, but not anymore.

                    I also remember the 3DNow debate.

                    And yes, game developers are saying that it's not the CPU which is holding a game back but the performance of the GFX card (G550).
                    Fear, Makes Wise Men Foolish !
                    incentivize transparent paradigms



                    • #11
                      The bright side of the Pentium IV is that it's a real evolution in memory bandwidth. I can also dig the extended SIMD instructions, because they are the future. The pure x86 FPU has severe limitations; you'd need a major brute-force approach to get the same power out of the x86 FPU as out of SIMD instructions.

                      What I don't feel so comfortable with is the weak x86 FPU power compared with the Athlon. Legacy support, anyone? It wouldn't hurt to give it roughly the same FPU power as the PIII. What saves the PIV is the insane clock speeds they're running at. But I still find it amusing to see reviewers raving about the PIV 1800 MHz finally scoring better than the Athlon 1400 in some benchmarks. Yes, big thing for a chip running 400 MHz faster and costing 3 times as much.

                      RAMBUS was also a very, very bad choice. Intel is finally getting it, though. Dual-channel DDR SDRAM (a la nForce) for the PIV, anyone?

                      As a side note, why oh why such a small amount of L1 cache??? Unbelievable. How hard could it be to give it at least 32 KB of L1? Not a million more transistors required, for sure.



                      • #12
                        About the L1 cache... the smaller the L1 cache, the LOWER the latency.

                        The P4 can boast THE lowest latency to date on its L1/L2 cache.

                        It's all a matter of design decisions... a larger L1 cache -> higher latency. Fewer cache misses, yes, but it really doesn't matter that much, because the P4 is not as dependent on its x86 decoders as other x86 CPUs, thanks to its trace cache.

                        The P4 L1 cache is not divided into the usual data and instruction parts; instead there is a small data cache and a special trace cache.
                        The trace cache sorts instructions so they lie in sequential order. Furthermore, it doesn't store x86 instructions but micro-ops, and it doesn't store instruction addresses as normal CPUs do - it stores the EXPECTED program flow.
                        The BIG thing here is that the micro-ops stored in the trace cache can be executed again and again without having to be decoded every time. This gives the P4 very good performance in loops compared to the P3. Furthermore, the trace cache makes the number of concurrent instructions the P4 can execute independent of the number of x86 decoders.

                        The L1 cache reads data in 128-bit chunks (P3: 32 bit), which means that the content of the cache changes rapidly. This is a great advantage when processing huge amounts of sequential data, but a disadvantage when the working set changes constantly (128 bits have to be changed at a time).
                        So Intel needed to design an L1 which wouldn't hold the rest of the CPU back, and to do so they needed an L1 with a low latency. They ended up with an 8 KB L1 data cache with a latency of 2 clocks, and since data can be delivered on every clock cycle, the bandwidth of the P4 L1 cache is 48 GB/s at 1.5 GHz. This bandwidth can only be reached when using 128-bit SSE2 instructions; otherwise it's about half of that.
                        The L2 cache has 45 GB/s of bandwidth at 1.4 GHz with a latency of 7, and it can also transfer data on every clock cycle. So L1 latency + L2 latency = 9 clocks, compared to a 20-clock latency on the Athlon.
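                        To make the sequential-vs-random point concrete, a generic C sketch (not tied to any particular cache size): walking an array in order uses every byte of each wide cache fill, while jumping around in it throws most of each fill away and causes far more cache traffic for the same work.

                        #include <stddef.h>

                        /* Sequential walk: every element of each cache fill gets used. */
                        long sum_sequential(const int *a, size_t n)
                        {
                            long s = 0;
                            for (size_t i = 0; i < n; i++)
                                s += a[i];
                            return s;
                        }

                        /* Strided walk over the same data: with a large stride only one
                           element per fill is used before the line is evicted, so the
                           cache content keeps changing and bandwidth is wasted. */
                        long sum_strided(const int *a, size_t n, size_t stride)
                        {
                            long s = 0;
                            if (stride == 0)
                                stride = 1;
                            for (size_t start = 0; start < stride; start++)
                                for (size_t i = start; i < n; i += stride)
                                    s += a[i];
                            return s;
                        }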
                        Last edited by Kosh Naranek; 15 July 2001, 18:08.
                        Fear, Makes Wise Men Foolish !
                        incentivize transparent paradigms



                        • #13
                          Reaching 3 GHz with an L1 cache size of 64 KB is almost impossible without holding the rest of the CPU back.
                          Double pumping... a 1.5 GHz CPU -> 3 GHz ALUs, a 3 GHz CPU -> 6 GHz ALUs.
                          Fear, Makes Wise Men Foolish !
                          incentivize transparent paradigms

