specialized HW is fast, but new CPUs are fast as well...

I reckon Intel has a market with their double-pumped, hyperthreaded CPUs. In a way Intel is trying to increase IPC rather than go for clock-speed overkill bollox, and for that Intel should be applauded. They took a leaf out of AMD's book ;) j/k
I need some of that hyperthreading love, especially when my system could do with it: converting files to .avi (.mp3? that is so passé!), trying to browse the Beyond3D forums, and having a conversation with a hot chick via IM, all at once.

When I run games I close all the additional tasks that I can and go... even my brain needs to be single-tasking at that point.
 
:LOL:
 
The Xbox uses HyperTransport on the IGP; I never stated it uses Hyperthreading.

Although you wouldn't be too far off. Just not 'Hyperthreading' (an Intel trademark), but the GPU does implement a form of SMT at the VALU stage...
 
Doomtrooper said:
Xbox uses Hypertransport on the IGP, I never stated it uses Hyperthreading.

Aack. I meant hypertransport. Too much silly hype in those stupid marketing names...

P3s have never supported, and will never support, HyperTransport. The only such connection is between the north and south bridges, and they don't even need it. Your argument that this is what made the XB powerful is bogus.

*G*
 
Panajev2001a said:
g_day, I will point one flaw of what you said... the GeForce FX is not 120+ MTransistors dedicated to ONLY execution units... I am sure the occlusion detection HW, the cross-bar based memory controller, the caches take quite a bit of transistor logic as well....

Code or data, execution logic or cache: those are your choices. By mentioning the occlusion detection HW and the cross-bar based memory controller you are arguing my case :) These are logic circuits that control something, they make decisions; they are execution units, not data.

Game, set, match. If you want to mention cache, find an article that says if and how much cache a GF FX has; that'd be interesting. How can you compare a 3 GHz CPU having 12 GFLOPS max with a 500 MHz chip having more like 400 GFLOPS counting the vertex and pixel shaders combined? So clock for clock it is 400/12 * 3/0.5 = 200 times more powerful, and it's doing 128-bit precision maths, not 32-bit precision. Isn't this picture starting to become clear yet?
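Spelling that arithmetic out (and taking the 400 GFLOPS and 12 GFLOPS figures at face value, which later posts dispute), the claim being made is a FLOPs-per-clock ratio:

\[
\frac{400\ \text{GFLOPS} \,/\, 0.5\ \text{GHz}}{12\ \text{GFLOPS} \,/\, 3\ \text{GHz}}
= \frac{800\ \text{FLOPs/clock}}{4\ \text{FLOPs/clock}} = 200
\]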

The CPU logic hardware is generally only 5% busy because it follows a linear, standard von Neumann design: fetch instruction, decode, fetch memory operands, execute, store, etc. It's a powerful general computing device, not an incredibly parallel, specialised chip. A GPU doing 3D processing always has thirty things happening at once in comparison; it's a far more parallel piece of hardware.
 
Yes Grall, and the IGP is the 'effective northbridge' of the X-box... a CPU doesn't need to support HyperTransport... HyperTransport is the bus and doesn't care what's feeding it. Does a T-bird support HyperTransport, before HyperTransport was even released? Yes, as long as it is placed in an nForce motherboard that supports HyperTransport... it doesn't matter what CPU is used.

In both the Xbox and nForce, the north bridge (IGP) and south bridge (MCP-Audio) are connected by an identical-speed HyperTransport point-to-point link.

[Image: streamthru.gif]

[Image: igpoverview.jpg]
 
Doomtrooper said:
Yes Grall, and the IGP is the 'effective northbridge' of the X-box... a CPU doesn't need to support HyperTransport...

Thanks for that lengthy and completely superfluous explanation of things I already knew, DT. :) If you actually go back to the previous page and study what you wrote, you'll see this: "To me the above is more important than slight performance increases running a couple of tasks...the Hyperstransport traces coming off that lowly PIII 733 is one of the good reasons why X-box graphics blow away alot of PC titles."

The hypertransport traces aren't actually coming off the P3, because the P3 LACKS A HYPERTRANSPORT CONTROLLER. What you're seeing in the pic is not the P3, it's the XGPU... :D Hint: P3 lacks RAM controller also, so why would there be DDR chips positioned around the *P3*? Conclusion: pic does not show P3.

Another thing I fail to get is why you credit the hypertransport link for the graphics superiority compared to PC titles, since it has absolutely nothing to do with any graphics processing in the XB at all. The CPU connects to the XGPU via the standard GTL+ bus, and the XGPU handles all the graphics stuff internally. Hypertransport only connects to auxiliary devices like sound and I/O, and they don't even need all that bandwidth.

I'd say the reason XB titles look better is because they actually target DX8-level graphics, which ye average PC title does not. But hey, that's just me of course. ;)


*G*
 
The reason why the XBox architecture 'could' be better than a normal PC architecture is that the video memory and main memory are unified. You don't have to send all the data through the AGP bottleneck.

Of course, now that nForce and other integrated chipsets have come around in the PC world, that is no longer a win for the XBox. However they still lack two conditions that make the XBox architecture 'better': the integrated graphics are still pre-DX8, and the XBox has far more memory bandwidth than the CPU can consume (that P3 is still using a normal 'SDR' 133 MHz bus), so the integrated graphics aren't stealing bandwidth from the CPU all the time, as happens with normal PC integrated chipsets.

Of course, the main reason the XBOX performs better is that it is designed only for games: it doesn't have a bloated Windows OS, nor does it have to go through the DX API and then the drivers, support multiple devices, or stay compatible. The software can be tuned exactly for a fixed, known machine. But that is why it is a console.
 
Grall said:
I'd say the reason XB titles look better is because they actually target DX8-level graphics, which ye average PC title does not. But hey, that's just me of course. ;)


*G*


I also said the same thing in my previous post about it being 'a closed box', which Joe so kindly mentioned :) And I still state that HyperTransport is one of the reasons why the X-box can produce great graphics with what is basically a PIII 733 PC.
All platforms today on the PC are bandwidth limited, and the AMD solo chipset with HyperTransport on the AGP tunnel is one of the features that I'm hoping will alleviate that.
My main reason for posting that was the comment about hyperthreaded CPUs doing vertex operations, when the mainboard would certainly be the bottleneck on today's motherboards... heck, even AGP 3.0 is slow.
So essentially we are saying the same thing, yet you feel HyperTransport is not helping the X-box much... which I disagree with.
 
Doomtrooper:

I still can't fathom how you can credit hypertransport for any kind of graphics advantage in the XB since it has NOTHING to do with graphics. Do the numbers yourself. Assume 256 channels of 16-bit 48 kHz sound going; that's less than 24 MB/s. Hard drive burst speed is certainly no more than, say, 30 MB/s (which is likely a huge overstatement), and in reality seldom even near the theoretical limit: it's a 5400 rpm unit with low areal density, so it won't be very fast. Add Ethernet and USB traffic on top of that (even at max bandwidth utilisation, which will never happen in practice) and it's all WELL within PCI bus limits!

All hypertransport is good for in the XB is reducing pincount, thus cutting cost slightly! In a computer, especially nForce2, you'll actually have use for the hypertransport link, since you'd have a FAST hard drive, PCI devices that suck a lot of bandwidth like TV tuners, RAID adapters, video capture cards, maybe gigabit ethernet, etc. Plus the onboard USB2 and firewire controllers of course... With all of them going at once you could max out and exceed the hypertransport link capacity, at least in one direction, but in the XB you can't even exceed the 90-110 MB-ish practical speed limit of PCI!
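A quick tally of those numbers (a minimal C++ sketch using the rough estimates above; the exact figures are assumptions, not measurements) shows how far under even the practical PCI ceiling the Xbox's auxiliary devices sit:

```cpp
#include <cstdio>

// Back-of-the-envelope tally of the Xbox's auxiliary-device bandwidth, using
// the rough estimates quoted above (all of these figures are assumptions).
int main() {
    const double audio_mb_s = 256.0 * 2.0 * 48000.0 / 1e6; // 256 ch * 16-bit * 48 kHz ~ 24.6 MB/s
    const double hdd_mb_s   = 30.0;                        // generous burst figure for a 5400 rpm drive
    const double enet_mb_s  = 100.0 / 8.0;                 // 100 Mbit Ethernet ~ 12.5 MB/s
    const double usb_mb_s   = 12.0 / 8.0;                  // USB 1.1 ~ 1.5 MB/s
    const double pci_practical_mb_s = 100.0;               // ~90-110 MB/s real-world PCI

    const double total = audio_mb_s + hdd_mb_s + enet_mb_s + usb_mb_s;
    std::printf("aux devices ~%.1f MB/s vs practical PCI ~%.0f MB/s\n",
                total, pci_practical_mb_s);                // ~68 MB/s, well under the PCI ceiling
    return 0;
}
```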


*G*
 
Code or data, execution logic or cache: those are your choices. By mentioning the occlusion detection HW and the cross-bar based memory controller you are arguing my case: these are logic circuits that control something, they make decisions; they are execution units, not data.

How am I arguing your case? Then we should also count all the branch prediction logic, the rename registers and the Re-Order Buffer, the prefetching unit, the cache tag logic, the Control Unit and all the logic that goes with it (which includes micro-code memory for the CISCier instructions... the trace cache receives only a "pointer" to Control Memory for those, to avoid filling the trace cache with u-ops for these very uncommon instructions...) etc... those are logic circuits which control something and make decisions...

Game, set, match.

Tournament ;)

How can you compare a 3 GHz CPU having 12 GFLOPS max with a 500 MHz chip having more like 400 GFLOPS counting the vertex and pixel shaders combined?

First, in order to reach that number you're counting everything, everywhere in the pipeline: from texture fetch to set-up to rasterization to texture filtering, T&L, etc... every single op everywhere...

My comparison is pretty much on the T&L side, as I can see myself that purely software rasterization would bring the Pentium 4 down quite a bit (still, if you did do rasterization with integer based math [maybe FIXED point arithmetic] you should go quite fast... nowhere near as fast as a modern GPU in that regard, as the Pentium 4 was NOT designed for it)... but T&L should be handled pretty well by the Pentium 4, including dynamic tessellation and deferred T&L if you wanted to program it... the Pentium 4 could sort the incoming vertex stream and try to T&L only the visible polygons... if HOS were used you could try to tessellate only the visible patches, by sorting at the control-point level and then tessellating... sorting the incoming stream of vertices or of HOS control points is a good serial task that requires speed and lots of bandwidth... which the Pentium 4 has...

If you manage to code deferred T&L on the CPU you would eliminate up-front all those triangles that are not visible, and the GPU would not need an occlusion detection mechanism beyond a standard Z-buffer (either because the deferred T&L wasn't 100% efficient, or for compatibility with older titles)...
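A minimal sketch of that deferred-T&L idea (my own illustration; the types, the trivial backface test and the names are made up for the example, not taken from any actual engine): cull on the CPU first, then transform and light only what survives.

```cpp
#include <vector>

// Illustration-only types; a real engine's vertex formats would differ.
struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };

static Vec3  sub(Vec3 p, Vec3 q)   { return { p.x - q.x, p.y - q.y, p.z - q.z }; }
static Vec3  cross(Vec3 p, Vec3 q) { return { p.y*q.z - p.z*q.y, p.z*q.x - p.x*q.z, p.x*q.y - p.y*q.x }; }
static float dot(Vec3 p, Vec3 q)   { return p.x*q.x + p.y*q.y + p.z*q.z; }

// Simple backface test: does the face normal point toward the eye?
static bool facesCamera(const Triangle& t, Vec3 eye) {
    Vec3 n = cross(sub(t.b, t.a), sub(t.c, t.a));
    return dot(n, sub(eye, t.a)) > 0.0f;
}

// Deferred T&L on the CPU: cull first, then transform/light only the survivors,
// so the rejected triangles never cost any T&L work or bus bandwidth at all.
template <typename TnLFunc>
std::vector<Triangle> deferredTnL(const std::vector<Triangle>& in, Vec3 eye,
                                  TnLFunc transformAndLight) {
    std::vector<Triangle> out;
    out.reserve(in.size());
    for (const Triangle& tri : in)
        if (facesCamera(tri, eye))
            out.push_back(transformAndLight(tri));
    return out;
}
```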

So clock for clock it is 400/12 * 3/0.5 = 200 times more powerful, and it's doing 128-bit precision maths, not 32-bit precision...

No, it is not doing 128-bit precision math... what you see in the specs is the total width of the pixel pipeline, or numbers related to the vertex shaders...

A vertex usually has four coordinates: x, y, z and w, and each of them is (generally) a single-precision 32-bit FP value...

As far as pixel math is concerned, each of the four components (R, G, B and A) can be 16-bit FP or 32-bit FP... 16 * 4 = 64, and 32 * 4 = 128...

For the same reason the EE is not doing 128-bit math when processing vertices... each VU is SIMD based, using 4 parallel 32-bit FMACs to do the job...

The Pentium 4 can do the same 128-bit math as the VS if we go by your definition... SSE works with 128-bit vectors as well ;)
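To make the "128 bits = 4 x 32-bit lanes" point concrete, here's a tiny SSE sketch (my own illustration; any x86 compiler with SSE intrinsics builds it): one 128-bit register operation is really four independent single-precision multiplies.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    // One 128-bit XMM register holds four independent 32-bit floats (x, y, z, w).
    __m128 v     = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); // a "vertex"
    __m128 scale = _mm_set1_ps(0.5f);

    // A single 128-bit instruction, but the math is four separate FP32 multiplies.
    __m128 r = _mm_mul_ps(v, scale);

    float out[4];
    _mm_storeu_ps(out, r);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```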
 
And I still state that the Hypertransport is one of the reasons why the X-box can produce great graphics with basically a PIII 733 PC .
The Xbox doesn't have to rely on the CPU for graphical computations. Pretty much all the work is done on the NV2A and the CPU is left for physics, AI, etc.

I'm not so sure the HyperTransport bus really helps that much with the graphical prowess. The bus handles I/O and sound. You basically need the high-speed bus for the sound, so that the audio DSPs in the MCPX can encode a DD 5.1 AC3 audio signal in realtime during gameplay with little to no latency.
 
Panajev2001a said:
First, in order to reach that number you're counting everything, everywhere in the pipeline: from texture fetch to set-up to rasterization to texture filtering, T&L, etc... every single op everywhere...

All of that hardware and power is there for a reason. It might not all get used all the time, but in order to change that 400 GFlop figure to a 12 GFlop figure you are assuming that the hardware is only running at about 3% utilisation; anyone who creates 3D hardware that runs at that sort of efficiency is insane, and going out of business real soon.

And that's assuming that the processor runs at 100% utilisation. Unlikely.

My comparison is pretty much on the T&L side, as I can see myself that purely software rasterization would bring the Pentium 4 down quite a bit (still, if you did do rasterization with integer based math [maybe FIXED point arithmetic] you should go quite fast... nowhere near as fast as a modern GPU in that regard, as the Pentium 4 was NOT designed for it)... but T&L should be handled pretty well by the Pentium 4, including dynamic tessellation and deferred T&L if you wanted to program it... the Pentium 4 could sort the incoming vertex stream and try to T&L only the visible polygons... if HOS were used you could try to tessellate only the visible patches, by sorting at the control-point level and then tessellating... sorting the incoming stream of vertices or of HOS control points is a good serial task that requires speed and lots of bandwidth... which the Pentium 4 has...

Yes, you can run geometry pretty fast on a modern CPU: current processors are capable of rates in the same ballpark (within 1 order of magnitude, certainly) as current VPUs, if you just look at calculation throughput. However, efficient data movement is a big problem, and you can easily find yourself limited by data transfer rates.

The Pentium 4 can do the same 128-bit math as the VS if we go by your definition... SSE works with 128-bit vectors as well ;)

Yup, but it's strictly vertical SIMD and never horizontal. By this I mean that it always does x*x, y*y, z*z, w*w (or swizzled versions of this). It doesn't natively support the most useful instruction (a dot product: x*x + y*y + z*z + w*w), although this can be worked around.
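For instance, a 4-component dot product on plain SSE has to be pieced together from a vertical multiply plus shuffles and adds; a rough sketch of the usual workaround (my own code, not taken from any particular engine):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics only

// 4-component dot product using only "vertical" SSE ops: the multiply is
// lane-wise, then shuffles + adds fold the four partial products together.
// (Later SSE revisions added horizontal-add/dot-product instructions.)
float dot4(__m128 a, __m128 b) {
    __m128 prod = _mm_mul_ps(a, b);                                   // x*x, y*y, z*z, w*w
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(prod, shuf);                             // pairwise sums
    shuf        = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    sums        = _mm_add_ps(sums, shuf);                             // full sum in every lane
    return _mm_cvtss_f32(sums);
}
```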

In the general case it's just difficult to come close to the execution efficiency of dedicated hardware vertex shaders, which is why such a huge clock rate advantage is required. In addition you then still need to transfer the transformed data over to whatever rasterizer you are using, so the effective rate of data transfer required overall is now doubled (unless you have some local interface). Since it is easy to become data-limited, this doubling of the transfer bandwidth can be very bad.
 
All of that hardware and power is there for a reason. It might not all get used all the time, but in order to change that 400 GFlop figure to a 12 GFlop figure you are assuming that the hardware is only running at about 3% utilisation; anyone who creates 3D hardware that runs at that sort of efficiency is insane, and going out of business real soon.

And that's assuming that the processor runs at 100% utilisation. Unlikely.

No, but I do not see the point of inflating the GFLOPS rating by counting the FP ops done by the set-up engine, the clipping engine and the rasterizer, if we are focusing more on T&L...


Yes, you can run geometry pretty fast on a modern CPU: current processors are capable of rates in the same ballpark (within 1 order of magnitude, certainly) as current VPUs, if you just look at calculation throughput. However, efficient data movement is a big problem, and you can easily find yourself limited by data transfer rates.

My point is that since we still go ask the CPU for help with HOS even with modern GPUs (I refuse to call them VPUs... that name stands for Vector Processing Unit and not Visual Processing Unit... GPU is enough, after all they do graphics...), if the CPU sorts the HOS data, tessellates only the visible portions and then T&Ls the triangles just created, it should save time compared to doing the same thing but sending the triangles to the GPU for T&L, as that would steal main memory bandwidth and we would need a T&L unit on the GPU...


Yes, I know that efficient data movement is a key topic in modern and future computing (in which moving data around will be a bigger concern than processing speed)... why do you think I like IBM's approach with CELL? Data movement as one of the principal priorities... but looking at current memory, at the CPU's caches and buses, I'd say there is enough to do quite a bit of things ;)


A 200x4 MHz FSB and memory at the same speed (thanks to RAMBUS ;)) yield 6.4 GB/s of total bandwidth (800 MT/s across a 64-bit bus = 6.4 GB/s)...

which is basically the same as the NV2A's rendering bandwidth (Xbox)... plus in this case the CPU would have 1 MB of cache compared to the 128 KB of the XCPU...

Prescott is also rumored to have 1 MB of L2 cache... this can help keep the efficiency up, softening main memory contention by keeping all the data that can benefit from local caching in the cache... this should make a nice pre-T&L vertex cache ;) (1 MB... yummy :))...

In the age of more and more versatile T&L, especially dealing with dynamic flow control (basically conditional branching through predication: execute both paths of the branch and save only the right result), we can make good use of efficient branch handling... and being able to predict branches >95% of the time is not a bad thing, as we need much less logic for those operations... if you used predication the way NV30 does, you need twice the execution units (or more) to deal with a "simple" if-then-else branch... of course a misprediction will hurt the Pentium 4 a lot because of the longer pipeline refill, but the misprediction rate is very small, and when handling the branch correctly (a good deal of the time) we are doing with 1 unit what the NV30 does with 2, as it has to work on both paths of the branch... so the transistor logic disadvantage diminishes a bit...
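A toy sketch of that difference (purely illustrative C++ with made-up names, nothing like NV30's or the P4's actual machine code): the predicated version always pays for both sides and then selects, while the branched version only pays for the path it takes but risks a misprediction penalty.

```cpp
#include <cstdio>

// Predication, as described above: compute BOTH sides of the if/else and
// keep only the one the condition selects. No branch, but twice the work.
float shadePredicated(float x, bool lit) {
    float whenLit   = x * 2.0f + 1.0f;  // "then" path, always executed
    float whenUnlit = x * 0.5f;         // "else" path, always executed
    return lit ? whenLit : whenUnlit;   // conceptually a select, not a jump
}

// A real branch: only the taken path's work is done. On a deeply pipelined
// CPU this is cheap while the branch predictor guesses right, and costly
// (a pipeline refill) on the rare misprediction.
float shadeBranched(float x, bool lit) {
    if (lit)
        return x * 2.0f + 1.0f;
    return x * 0.5f;
}

int main() {
    std::printf("%f %f\n", shadePredicated(1.0f, true), shadeBranched(1.0f, false));
    return 0;
}
```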

Branching through predication, using massive quantities of cheaper logic (it costs quite a bit of money to design and manufacture a 3-4 GHz processor; if you have a small enough process and can afford the extra die area you can use much more serially-slower logic... again, CELL rings a bell...), is another approach... both have their advantages depending on the manufacturer's resources and the requirements of the final product...


If Intel wanted to they could design a kick-ass GPU, but I do not think it would fit their current business model... they have much greater resources than ATI and nVIDIA and much better manufacturing technology and experience in manufacturing high performance chips... they already have a lot of markets to spread themselves over, and they do not see making high-end GPUs as bringing them big revenue... still, they're slowly making steps into the business, as their latest integrated core is not completely awful... and that is a market that makes money...
 
Yup, but it's strictly vertical SIMD and never horizontal. By this I mean that it always does x*x, y*y, z*z, w*w (or swizzled versions of this). It doesn't natively support the most useful instruction (a dot product: x*x + y*y + z*z + w*w), although this can be worked around.

Of course you both are using a pretty explicitly bad example in terms of arch. Practically every other ISA supports saturation math instructions (and in the case of PowerPC, it has two very nice SIMD architectures that support saturation math instructions).

In the general case it's just difficult to come close to the execution efficiency of dedicated hardware vertex shaders, which is why such a huge clock rate advantage is required.

Not really... AltiVec has pretty much the same execution efficiency as vertex shader hardware.

The CPU logic hardware is generally only 5% busy because it follows a linear, standard von Neumann design: fetch instruction, decode, fetch memory operands, execute, store, etc. It's a powerful general computing device, not an incredibly parallel, specialised chip. A GPU doing 3D processing always has thirty things happening at once in comparison; it's a far more parallel piece of hardware.

I think 5% is a bit on the low side. A lot (for both the CPU and the GPU) can depend on what you're actually doing at any moment. Deep pipelining and OOE (and SMT to a lesser extent) largely exist to keep execution resources constantly at work regardless of the instruction mix. Also, the ability to achieve high computational numbers in today's GPUs is largely a matter of data limitations, not really so much of any hardware design philosophy.

One could arguably fashion a CPU in the same manner as, say, an NV30, but it'd be pretty useless, as it would be prohibitively expensive to design the large-scale memory implementation necessary to really exploit its computational abilities. Also, the types of problems one could work on would be relatively limited.

However, we are sort of part of the way there when you look at something like IA-64, where you pretty much have an arch that is mostly parallel execution resources relying on the compiler/programmer to sort, schedule and format data for it in order to achieve any sort of reasonable performance. And arguably cellular computing gets you even closer...
 
How am I arguing your case? Then we should also count all the branch prediction logic, the rename registers and the Re-Order Buffer, the prefetching unit, the cache tag logic, the Control Unit and all the logic that goes with it (which includes micro-code memory for the CISCier instructions... the trace cache receives only a "pointer" to Control Memory for those, to avoid filling the trace cache with u-ops for these very uncommon instructions...) etc... those are logic circuits which control something and make decisions...

Yes, you should think exactly that way: they are logic, not data circuits. I believe they make up part of your 22 million transistor budget that sits in the logic section of the P4.

We both agree on the 4*32: it gives you the 128 bits you are doing precision maths on, and it's throughout the chip, especially in the pixel shaders, not just the vertex shaders.

I have never compared the maths and logic instructions a P4 can do versus a modern GPU, nor how many cycles each takes to complete complex instructions. Don't assume their instruction sets are equal. I remember back in the early '80s a VAX could calculate a polynomial with a single instruction, so remember it's not safe to assume both a P4 and a GPU calculate trig functions with the same throughput per cycle. A GPU would leave a P4 in its dust.
 