specialized HW is fast, but new CPUs are fast as well...

This is a quick topic: a while ago I was looking at the specs of the Pentium 4 3.06 GHz...

4.2 GB/s of bandwidth to memory

8x AGP

>12 GFLOPS

>6 GIPS ( it should be higher than that; I am assuming a ROUGH metric of 2 integer instructions/cycle... the double-pumped ALUs should be able to push more, but I am trying to be conservative )

512 KB of L2 ( 256-bit bus, up to 256 bits each cycle )


Especially the GFLOPS rating was interesting... the Pentium 4 has a good caching scheme, and more bandwidth than the PS2 ( 1 GB/s more than the PS2's theoretical max of 3.2 GB/s )... so whatever bad things you can say about SSE, the Pentium 4 has enough bandwidth to sustain a good portion of those 12 GFLOPS...
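For reference, a quick sketch of where the >12 GFLOPS figure and the bandwidth gap come from; the 4-results-per-cycle SSE peak is my assumption, as a theoretical best case:

```python
# Rough theoretical-peak arithmetic for the 3.06 GHz Pentium 4 figures
# discussed above. The 4 single-precision results per cycle for SSE is
# an assumed best case, not a measured number.
clock_hz = 3.06e9
flops_per_cycle = 4                       # 128-bit SSE = 4 x 32-bit floats
peak_gflops = clock_hz * flops_per_cycle / 1e9
print(f"peak: {peak_gflops:.2f} GFLOPS")  # ~12.24 GFLOPS

# The memory-bandwidth gap over the PS2 quoted in the post:
p4_bw, ps2_bw = 4.2, 3.2                  # GB/s
print(f"gap: {p4_bw - ps2_bw:.1f} GB/s")  # 1.0 GB/s
```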

The AGP 8x has a top bandwidth of 2.1 GB/s... ok, it is not as efficient as people would like it to be, but again this is 900 MB/s higher than the max theoretical value for the GIF-to-GS bus, which, while being quite efficient ( or so I heard ), tops out at 1.2 GB/s...


How come nobody even from Intel has tried to showcase the 3D capabilities of the latest Pentium 4s ???
 
Doing operations like bilinear texture sampling in x86 asm takes a lot of instructions. You've got to calculate the D level, index the texture 4 times, convert each float texture coordinate to an integer offset, read the texture, blend all four together... and, of course, note that the index-and-read operation isn't well suited to SIMD optimisation.
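The steps listed above can be sketched in scalar code; this is a minimal illustration (a hypothetical flat-array texture, no mipmapping or LOD computation), just to show how much work hides behind "one texture sample":

```python
# A scalar sketch of the bilinear-filtering steps listed above. The
# index-and-read step is inherently scatter/gather, which is why it
# maps poorly onto SSE-style SIMD.
def bilinear_sample(texture, width, height, u, v):
    """texture: flat list of (r, g, b) tuples; u, v in [0, 1]."""
    # Convert float texture coordinates to integer offsets + fractions.
    x = u * (width - 1)
    y = v * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0

    # Four texture reads - the part that resists SIMD.
    t00 = texture[y0 * width + x0]
    t10 = texture[y0 * width + x1]
    t01 = texture[y1 * width + x0]
    t11 = texture[y1 * width + x1]

    # Blend all four together.
    def lerp(a, b, t):
        return tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))
    top = lerp(t00, t10, fx)
    bottom = lerp(t01, t11, fx)
    return lerp(top, bottom, fy)
```

Every line of this becomes several x86 instructions, per texel, per pixel, which is the point being made.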

If you consider that the P4 3GHz is clocked less than 10 times faster than an R9700, and the card generates 8 pixels per clock cycle, you need to generate one pixel per 1.2 clocks on the P4 to match the rate. It ain't gonna happen :)
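The rate math checks out; here is the arithmetic, assuming an R9700 core clock of 325 MHz (my assumption) with its 8 pixel pipes:

```python
# Sanity check of the fill-rate claim above. The 325 MHz R9700 core
# clock is an assumption; the P4 is taken at 3 GHz.
p4_hz = 3.0e9
r9700_hz = 325e6
pixels_per_clock = 8

clock_ratio = p4_hz / r9700_hz               # ~9.2x - "less than 10x"
fill_rate = r9700_hz * pixels_per_clock      # 2.6e9 pixels/s
clocks_per_pixel = p4_hz / fill_rate
print(f"{clocks_per_pixel:.2f} P4 clocks per pixel")  # ~1.15
```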

Don't get me wrong, the P4 would be capable of some pretty decent 3D - but we're talking performance an order of magnitude below even the 'cheap' cards available now.
 
Prescott should also see the new 800 MHz FSB... amongst other news ( bigger cache? something has to justify those 100 million transistors... )

800 MHz FSB can do 8 bytes/cycle * 800 MHz = 6.4 GB/s of FSB bandwidth... plus the L2 cache has a good chance of going up to 1 MB...

32 bytes/cycle * 3 GHz = 96 GB/s of MAX L2 bandwidth ( theoretical )...
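The two theoretical-peak figures above, computed explicitly:

```python
# Theoretical peaks quoted in the post, computed explicitly.
fsb_bw = 8 * 800e6    # 8 bytes per transfer at an effective 800 MHz
l2_bw = 32 * 3e9      # 32 bytes/cycle (256-bit bus) at a 3 GHz core
print(fsb_bw / 1e9)   # 6.4 GB/s
print(l2_bw / 1e9)    # 96.0 GB/s
```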

The bandwidth is there and the speed is there too... I think a very optimized software renderer could achieve impressive results on such a CPU...



P.S.: CPUs can still hang around with GPUs ;) Prescott will have 100+ MTransistors like the new DX9 GPUs, while running at 3+ GHz ( which is 6x the speed of the GeForce FX )...
 
well, if used well, the double-speed ALUs should prove quite useful, as they can do two dependent instructions in 1 clock cycle without creating a bottleneck...


plus, in the age of vertex and pixel shaders using loops and branches ( conditional branches as well ), having stalls here and there, etc... the flexibility of such an aggressive OOO machine ( with advanced branch prediction ) could be put to good use...
 
I am not saying you would not need a rasterizer at all... Intel has one already... push the clock of that rasterizer up, do most of the pixel shading math and the vertex shading with optimized functions on the Pentium 4, and do the texture sampling and filtering on the graphics card...
 
The 'same' CPU is now also facing greater loads on AI & physics compared to, say, software rendering in Half-Life vs. DX7(+ a bit) in UT2003.

Look at Dave Baumann's software T&L vs. hardware T&L comparisons in recent GPU reviews on this site. CPUs just can't replace GPUs/VPUs in performance terms even on T&L, let alone all the other rendering calculations.
 
Panajev:

First off, AGP8x is not a part of the P4's specs, it's a part of the chipset connected to the P4, and not even remotely the same thing.

Second, the P4's FSB may be somewhat impressive compared to previous CPUs (Prescott will supposedly be *200* MHz, not 800. Remember, the P4 bus uses quad-data-rate signalling), but that's not the important bit. Like Dio already told you, it's instruction execution speed which is going to make the chip fall flat on its face, not the bus bandwidth. Anyway, you won't ever come close to that bandwidth in reality, since you can't arrange data with locality in mind in main RAM like you can in the on-board framebuffer memory of a graphics card. There are going to be page breaks all over the place, with lots of first-access penalties (especially since the graphics card and peripheral busmasters like PCI cards, IDE/USB controllers etc. will also be mucking around in main memory), and the 128-byte cacheline burst of a P4 is probably poorly suited to graphics tasks as well.

Furthermore... as for the "double-pumped" (what a stupid marketing term! :)) ALUs, that is of little importance, since the 'double-pumping' only cuts down (slightly) on latency in that incredibly long integer pipe the chip's got; it does not increase instruction throughput at all. An Athlon's ALUs aren't double-pumped, and on a per-clock basis they're MORE capable than a P4's, so just forget all about that, okay? :)

To be honest, I doubt a P4-3GHz could even out-render a TNT2, so why bother at all?


*G*
 
I know how the FSB of the Pentium 4 works ( it doesn't change the fact that, data-wise, it can send up to 32 bytes per base clock cycle... 8 bytes on each edge of the two slightly skewed clocks, achieving effectively a 4x higher clock speed for data transfers - the same as transferring on a single edge of a 4x faster base clock )... and I was thinking about a Pentium 4 system with AGP 8x; I was not including the new AGP 3.0 in the Pentium 4 specs...
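The quad-pumped arithmetic works out the same either way you count it:

```python
# Quad-pumped FSB arithmetic: a 200 MHz base clock carrying four data
# transfers per cycle moves the same bytes as an 800 MHz single-rate bus.
base_hz = 200e6
transfers_per_cycle = 4       # both edges of two skewed clocks
bus_bytes = 8                 # 64-bit data bus
bytes_per_base_cycle = transfers_per_cycle * bus_bytes   # 32 bytes
bw = base_hz * bytes_per_base_cycle
print(bw / 1e9)   # 6.4 GB/s
```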


An Athlon's ALUs aren't double-pumped, and on a per-clock basis they're MORE capable than a P4's

... uhm... 3 fully 32-bit ALUs beating 2 basically sixteen-bit ALUs clock by clock... I picture myself SHOCKED :D hehe ( ... at first I was under the impression that each of the double-pumped ALUs could execute two dependent instructions in 1 external clock cycle, but then I read more about the current ALUs and the new idea of the Prescott ALUs being fully 32-bit, and thought about how they achieve the execution of two dependent instructions in a single clock cycle. Only if each ALU were fully 32-bit and capable of executing 1 ADD/SUB per fast clock pulse could we have a max of 4 instructions executed per cycle counting both ALUs - well, assuming 32-bit ADDs; I do not know if the situation changes when we add/sub 16-bit numbers... I have to clear my mind on this... ) still, it also depends on how the code was compiled...



And again, texture lookups, filtering and rasterization can be done mostly on a separate graphics chip ( Intel has one which, clocked higher, could be a nice performer ); having a fast and optimized T&L engine running on one of those 3.06 GHz Pentium 4's should not be THAT bad... ;) FP and bandwidth heavy? SPECfp2k fits that requirement too, and see how the Pentium 4 aces that one ;)
 
From even a simple perspective, a CPU is a general-purpose processing device that is mostly single-threaded in its execution outside its small FPU.

The P4 has around 44 million transistors - but over half of them are cache. So it ends up having around 22 million transistors devoted to 32- or 64-bit operations. At any time, at peak use, between 5% - 15% of the execution transistors are in operation - say 1-3 million transistors - and remember that small FPU.

A GF FX has 125 million transistors, all execution units, all addressing 64 - 128 bit operations, with very, very deep pipelines giving very powerful parallel processing units - nearly all of which are capable of working simultaneously. I read somewhere that at peak operation between 80% - 95% of a GPU's transistors are in use - so for the GF FX that's around 85 - 110 million transistors all firing in parallel.

You can't even say the CPU is 6 times faster by clock speed than the fastest GPU, so that cuts a 100 : 1 advantage down to a 15 : 1 advantage. Why? Well, a CPU is often data limited - the main difference between a CRAY and a PC is that the CRAY has tremendous computers throwing it masses of data all the time, and it has super fast memory - sort of like a GPU and onboard video RAM, but a lot more :) Look at the design of Sledgehammer - not more FPUs, but more smarts to throw more data to the FPUs it already has.

I would guess that a GF FX doing OpenGL calls all in hardware would be well over 1,000 times faster than the fastest P4 doing the full software equivalents. GPUs also have more powerful instruction sets ( than CPUs' ) geared to solving the 3D equations one typically expects.

Rendering farms vs. next-generation rendering done on the descendants of GF FX and R300 in multi-chip configurations like the simFusion 6000, maybe.

Can anyone estimate, for a game like Quake 3, how many times faster a GF4 is than a 3GHz P4 doing exactly the same level of rendering fully in software?
 
Panajev2001a said:
And again, texture lookups, filtering and rasterization can be done mostly on a separate graphics chip
The majority of the cost of current chips is in the texture lookup, filtering and pixel shading. You wouldn't reduce transistor counts by more than 20-30% at best.

Passing vertex shading off to the host can easily be done, but because it's somewhat slower than doing it in hardware, uses more CPU cycles, and requires massively more data to be transferred across the AGP bus - and because it doesn't cost much to put the hardware on - it's only used for chips where low cost is the key goal.
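A back-of-the-envelope estimate of that AGP traffic cost; both the vertex rate and the 32-byte vertex format here are assumptions for illustration, not figures from any actual chip:

```python
# Rough estimate of the extra AGP traffic when the host does vertex
# shading. Both numbers are illustrative assumptions: a 50 Mverts/s
# throughput and a 32-byte transformed vertex.
verts_per_sec = 50e6
bytes_per_vertex = 32
traffic_gbs = verts_per_sec * bytes_per_vertex / 1e9
agp8x_peak = 2.1                          # GB/s, theoretical
print(f"{traffic_gbs:.1f} of {agp8x_peak} GB/s")  # 1.6 of 2.1 GB/s
```

Most of the bus is consumed before any textures move, which is the "massively more data" point.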
 
well, the key would be saving as many transistors as we can on the GPU if it is an embedded solution...

I believe that at 3+ GHz the multi-threaded P4 ( especially Prescott ) could make a very flexible and powerful vertex shader if optimized very well... a fast CPU ( serial speed ) with integer and FP SIMD capabilities, TONS of bandwidth, and a quite impressive caching scheme doesn't look so bad ;)...

the GPU could be a deferred renderer and only need cheap ( compared to 500 MHz DDR-II :) ) video memory and an SRAM-based tile buffer... this chip could be clocked fairly high ( doing less in parallel compared to chips like R200 or R300, but doing more serially ), as Intel certainly has the technology to make it ( of course it depends on whether they think it's worth the time... ) basically the principle behind the GPU would be a relatively simple but FAST processor ( in terms of clock speed )...

The CPU in this case would have to worry about the rest... A.I. ( in integer code that chip is not "too" slow ;) ok, let's think Pentium 4-friendly code ;) ), physics, input processing and T&L... which I am sure it could do fine... the puppy has a max of 12+ GFLOPS; you tell me it cannot pass the Emotion Engine if pushed equally hard ( I am not saying outclass the EE, I said pass it... that would not be bad for the EE, as that Pentium 4 would be 10x the clock speed with ~9x the transistors )...

I am sure the GPU would not end up costing too much in this design and the CPU, well you would use it decently finally :)





g_day, I will point out one flaw in what you said... the GeForce FX is not 120+ MTransistors dedicated ONLY to execution units... I am sure the occlusion detection HW, the crossbar-based memory controller, and the caches take quite a bit of transistor logic as well...

And look at how each chip handles conditional branches, for example... the GeForce FX follows both paths of the if-then-else branch and selects the result when it is done... the Pentium 4 can in most cases ( the misprediction penalty is high relative to the pipeline depth, but it is not an absolutely huge number... wasn't their branch predictor correct like 95-97% of the time? --> misprediction rate under 5% of the branches predicted ) predict the branch and execute only one path, avoiding idle execution units...
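The trade-off can be modeled with round numbers; the costs below are illustrative assumptions in the ballpark of a P4, not measurements:

```python
# Illustrative comparison of the two branch strategies described above.
# The 95% accuracy and cycle costs are assumed round numbers.
accuracy = 0.95
flush_penalty = 20    # cycles lost on a mispredict (assumed)
path_cost = 10        # cycles to execute one branch path (assumed)

# Predict-and-speculate: one path, plus the amortized mispredict cost.
predicted_avg = path_cost + (1 - accuracy) * flush_penalty
# Execute-both-paths, GeForce FX style: always pay for both.
both_paths = 2 * path_cost

print(f"predicted: {predicted_avg:.1f} cycles avg")  # 11.0
print(f"both paths: {both_paths} cycles")            # 20
```

With these numbers prediction wins on average, but a deeper pipe (bigger flush penalty) or worse accuracy narrows the gap quickly.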

Or maybe we should call Itanium 2 into the equation? naah, not for now, it is a higher-end chip :)
 
Intel is not the only one with an advanced chip... I think the Hammer will bring new value and speed to the desktop much like the Athlon did years ago. Multi-threading in the chip reviews shows little performance gain and is just another marketing gimmick IMO... i.e. Double Pumped... Who runs multiple tasks while playing a game?? Do we have Norton Antivirus and Adaware running in the background scanning our drives while playing a game??
Maybe in high-end applications where the user may have 5 different appz running at the same time, but for us gamers, as proven here... it's of little use.

http://www.guru3d.com/review/intel/pentium4-3060/

CPUs are more limited by the bus than anything, and I would like to see more boards using HyperTransport or other exotic means to increase the data coming out of the AGP bus to the CPU.
 
Who is it that only uses computers for games?

Most reports that I have read say that running WinXP / Win2K on a hyperthreaded CPU is just more "responsive."

I don't care by what means the CPU makers go after more performance: HyperTransport, hyperthreading, whatever... the end results are what matter.
 
My point is that the majority of users on this forum utilize their PC for games. Of course it's used for other things like Word, Excel and professional rendering appz, but as for the gamer's interest in Hyperthreading... the results are not there to say it will improve gameplay.

Again, most users here run one game, and I hope they ensure they are not compressing a 300 MB file while playing... so in real-world use as a gamer the results are mediocre to nothing.

Yes, it's nice to have that feature implemented, but it's not a selling feature to gamers... more important would be an approach similar to the Xbox's...

[image: Chipset.jpg]


To me the above is more important than slight performance increases running a couple of tasks... the HyperTransport traces coming off that lowly PIII 733 are one of the good reasons why X-Box graphics blow away a lot of PC titles.

Let's start seeing some smart and intelligent thinking done on the motherboard side... AMD's solo chipset is the only one I know of that supports HT.

AMD-8151 HyperTransport AGP3.0 Graphics Tunnel with AGP-8X

Notice the Data Traces...

[image: 8151.jpg]
 
Again, most users here run one game, and I hope they ensure they are not compressing a 300 MB file while playing... so in real-world use as a gamer the results are mediocre to nothing.

Or, they are web browsing and running several things in the background at the same time...like me.

Or, they use apps which can specifically benefit from multi-threaded (or multi-cpu in general) configs, like video and photo editing.

Yes, it's nice to have that feature implemented, but it's not a selling feature to gamers...

True, hyperthreading itself is not a likely selling point for those that only do gaming. But then neither is anything else except that which gives faster frame rates. Any improvements in raw processor speed or memory bandwidth are the big selling points... such as Intel's upcoming move to the 200 MHz FSB and DDR400 support.

the HyperTransport traces coming off that lowly PIII 733 are one of the good reasons why X-Box graphics blow away a lot of PC titles.

Disagree. It's because X-Box is a closed platform...and one that's essentially fixed at 640x480 resolution to boot.
 
Or, they are web browsing and running several things in the background at the same time...like me.

Only if the APP has multi-threading support.



Or, they use apps which can specifically benefit from multi-threaded (or multi-cpu in general) configs, like video and photo editing.

Agreed, already stated that... good for them... not a selling point for me. I ensure all background appz like Messenger, Norton etc. are disabled when gaming, so system resources stay optimal.
Aces' conclusion sums up my feelings exactly... yes, it's good... it will probably be of little use to PC gamers...

http://www.aceshardware.com/read.jsp?id=50000332

Disagree. It's because X-Box is a closed platform...and one that's essentially fixed at 640x480 resolution to boot.

You are entitled to your opinion, Joe :LOL: ... Yes, it's a closed platform, but having the ability to transfer 800 Mbit/s on the data side and 6.4 GB/s on the AGP side certainly is helping keep the data pipeline moving.

Link

So to sum up my opinion... yes, it's a decent technology... would I go out and buy an expensive P4 so I can play Dungeon Siege and listen to MP3Z in the background... nope.
 
This should be moved to the "hardware forum" at this point, but I think we're nearly done anyway.

Only if the APP has multi-threading support.

Not true at all.

If the OS ( like XP ) supports multiple CPUs, it will "see" the hyperthreaded CPU as 2 CPUs, and can take advantage of that when more than one app is competing for CPU time. So, when multiple applications are running in a multi-CPU-aware OS, a hyperthreaded CPU will help, whether or not each individual application is multithreaded. This is actually one of the biggest areas where a hyperthreaded CPU will help overall performance.
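The scheduling point can be sketched with a toy model; the costs and the greedy scheduler here are illustrative, not how any real OS works exactly:

```python
# Toy model of the point above: the OS schedules whole processes onto
# logical CPUs, so two single-threaded apps can overlap on a
# hyperthreaded (2-logical-CPU) chip without either app being changed.
def makespan(app_costs, logical_cpus):
    """Time until all apps finish under a simple greedy schedule."""
    load = [0] * logical_cpus
    for cost in sorted(app_costs, reverse=True):
        # Each app goes to the least-loaded logical CPU.
        load[load.index(min(load))] += cost
    return max(load)

apps = [10, 10]   # two single-threaded apps, 10 time units of CPU each
print(makespan(apps, logical_cpus=1))  # 20: serialized on one CPU
print(makespan(apps, logical_cpus=2))  # 10: overlapped, no app changes
```

In reality the two logical CPUs share execution units, so the actual gain is far smaller than this ideal 2x, but the mechanism is the same.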

So to sum up my opinion... yes, it's a decent technology... would I go out and buy an expensive P4 so I can play Dungeon Siege and listen to MP3Z in the background... nope.

Agreed. Nor would I go out and buy a HyperTransport system for the same reason. I would look at each platform and see which one actually delivers the performance in the situations I care about.
 
According to Aces, the applications that received a 15% BOOST were multithreaded appz... the other single-threaded appz got 1-5%... so IMO the only real gain is if the app has native multithreading support... ( 5% is better than nothing though ).

IMO the real problem today is platform bandwidth... modern CPUs are so fast now that ATA-133 drives, SCSI-160, the memory bus and overall data transfer including AGP 3.0 are the bottleneck... and improved throughput like HyperTransport is far more interesting to me, as it helps alleviate some of those issues.
 
According to Aces, the applications that received a 15% BOOST were multithreaded appz... the other single-threaded appz got 1-5%... so IMO the only real gain is if the app has native multithreading support... ( 5% is better than nothing though ).

Again, that is completely expected if you are only running one app at a time... ( and isn't the sort-of purpose of WINDOWS to be able to run more than one app at a time? ) Yes, when you are GAMING, you typically only have one app going for CPU resources. But then again, it's not uncommon to have some background services running that occasionally hit the CPU. I suspect we'll see a lot less "unexplained stuttering" during games with HT CPUs...

I repeat: if you are running a multi-CPU-aware OS, and are running MULTIPLE APPS SIMULTANEOUSLY ( that are competing for CPU resources ), hyperthreading is a major benefit regardless of whether the apps themselves are multithreaded or not.
 
Doomtrooper said:
To me the above is more important than slight performance increases running a couple of tasks... the HyperTransport traces coming off that lowly PIII 733 are one of the good reasons why X-Box graphics blow away a lot of PC titles.

Doom, the P3 in the XB does NOT support hyperthreading! Don't know what the heck gave you that silly idea. :)

Besides, HyperTransport only connects the north and south bridges in the XB, and there are zero high-bandwidth clients in that system; even the hard drive is slow as f*ck.


*G*
 