specialized HW is fast, but new CPUs are fast as well...

Panajev2001a said:
A 200x4 MHz FSB and memory ( same speed, thanks to RAMBUS ;) ) yield 6.4 GB/s of total bandwidth... which is basically the same as NV2A rendering bandwidth ( Xbox )...
But you have to get that data to the renderer. AGP8x doesn't have 6.4GB/s.

Then, you also have to have redundant extra reads (in a pull model, unless you have a very large cache on the chip) or use up local video memory AND bandwidth (in a push model, to store the temp buffers) in order to get the data into the chip.

In addition, your post-transform data is probably substantially larger than the pre-transform data.

This probably pretty much kills AGP texturing too, because you've saturated the bus with geometry traffic.

So, if your vertex shader hardware costs just a few percent of your silicon area - generally true - why not just put it on....?
 
Dio, the point is that a Vertex Shader as powerful and versatile as a 12+ GFLOPS Pentium 4 will not cost a tiny portion of silicon, but quite a bit more than that... and it will also impact the final clock-speed we can run our simple GPU at...

AGP8x = 2+ GB/s which is respectable...

Going to the PS2 example, you have the GIF-to-GS bus ( connecting the EE to the GS ) which is rated as 1.2 GB/s max ( SDR bus )... AGP8x exceeds that max figure by at least 800 MB/s...
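
For reference, the bandwidth figures being compared here are just width x transfers-per-clock x base clock; a rough sanity check in C, using the commonly quoted bus parameters rather than vendor datasheets:

```c
#include <stdio.h>

/* Rough peak-bandwidth check: bytes per transfer * transfers per clock * clock (MHz).
 * The widths and clocks below are the commonly quoted figures, not measured numbers. */
int main(void)
{
    double fsb_mb_s   = 8.0 * 4 * 200.0;   /* 64-bit quad-pumped FSB at 200 MHz -> 6400 MB/s   */
    double agp8x_mb_s = 4.0 * 8 * 66.67;   /* 32-bit AGP at 66 MHz, 8 transfers -> ~2133 MB/s  */
    double gif_mb_s   = 8.0 * 1 * 150.0;   /* 64-bit SDR EE-to-GS bus at 150 MHz -> 1200 MB/s  */

    printf("FSB    : %.0f MB/s\n", fsb_mb_s);
    printf("AGP 8x : %.0f MB/s\n", agp8x_mb_s);
    printf("GIF-GS : %.0f MB/s\n", gif_mb_s);
    return 0;
}
```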

I did not ask to rely completely on AGP Texturing, but if we had to we could still put a small pool of relatively fast memory attached to the GPU and use Virtual Texturing thanks to S3TC and the AGP8x bus... after all the Gamecube follows this path... 1 MB of local texture cache and the bus to main RAM is 2.4 GB/s ( very small latency, but I was under the impression that in the graphics pipeline latency concerns were a bit more relaxed )...

I'd say that putting a small block of DDR memory with a 64-bit bus ( no way clocked as high as an NV30 ) attached to the GPU would suffice as an external frame-buffer and as a texture buffer ( textures would sit compressed... something like 16-32 MB of memory [should not be THAT expensive... we need space, not too much bandwidth] ) and additional textures could be streamed from main memory...


And you said that post-transform data would be much bigger than pre-transform... well, that depends on the depth complexity/overdraw of the scene too... remember, we would do deferred T&L? Basically in my model you would see lots of polygons, but they would be only the visible ones...
 
Yes you should think exactly that way - they are logic - not data circuits. I believe they make up part of your 22 million transistor budget that sits in the logic section of the P4.
I do not understand... you keep shifting from logic to data to logic... both the Pentium 4 and the NV30 have circuits that do not DO effective processing ( TL ) but that do help the rest of the chip do what it's supposed to do... cache logic, caches, Control Unit, Occlusion Detection ( NV30 ), etc...

We both agree on the 4*32 - gives you 128 bits you are doing precision maths on, but it's throughout the chip - especially in the pixel shaders - not just the vertex shaders.
but again even in the pixel shader and the framebuffer... it is all 128 bits packed... as several 32-bit values packed together ( RGBA )... there is NO 128-bit address processed, there is no scalar 128-bit operand IIRC... I was comparing the respective T&L portions...

And if we really want good precision I'd not go with FP... I'd waste more bandwidth and performance, but I would go with a FIXED point format...

32:32... a 32-bit integer for the fraction and a 32-bit integer for the rest...

that would solve TWO problems: 1) higher precision

2) we would have the precision well distributed over the available range and not suffer from what FP brings us ( back to the non-linear precision range of the Z buffer )...
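
A minimal sketch of that 32:32 idea in C (the type and helper names are mine, purely illustrative; the 128-bit intermediate in the multiply relies on a GCC/Clang extension):

```c
#include <stdint.h>

typedef int64_t fix32_32;                       /* 32 integer bits : 32 fraction bits */

#define FIX_ONE ((fix32_32)1 << 32)

static fix32_32 fix_from_double(double d) { return (fix32_32)(d * (double)FIX_ONE); }
static double   fix_to_double(fix32_32 f) { return (double)f / (double)FIX_ONE; }

static fix32_32 fix_add(fix32_32 a, fix32_32 b) { return a + b; }

/* The raw product carries 64 fraction bits, so shift back down by 32.
 * __int128 keeps the intermediate from overflowing (GCC/Clang extension). */
static fix32_32 fix_mul(fix32_32 a, fix32_32 b)
{
    return (fix32_32)(((__int128)a * (__int128)b) >> 32);
}
```

Unlike FP, the spacing between adjacent representable values is a constant 2^-32 across the whole range, which is exactly the uniform precision being argued for here.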

I remember back in the early 80's a VAX could calculate a polynomial with a single instruction,

And this would limit the speed to what? C'mon, there is a reason why that kind of approach has been abandoned... and besides, that was not implemented in HW and was achieved through micro-code ( a complex instruction decoded into TONS of u-ops, in this case... )... that kind of complex instruction has its advantages, as x86 is showing... great code density... and besides, inside it is basically all RISCy ;)
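
For context, the VAX POLY instruction essentially ran Horner's rule over a coefficient table in microcode; the same recurrence written as an ordinary loop (a sketch of the math, not the actual microcode) looks like this:

```c
/* Horner's rule: c[0]*x^n + c[1]*x^(n-1) + ... + c[n],
 * one multiply-add per coefficient.
 * e.g. poly_eval(2.0, (double[]){1, 0, -3}, 2) evaluates x^2 - 3 at x = 2 -> 1.0 */
double poly_eval(double x, const double *c, int degree)
{
    double result = c[0];
    for (int i = 1; i <= degree; i++)
        result = result * x + c[i];
    return result;
}
```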

so remember its not safe to assume both a P4 and a GPU calculate trig functions with the same throughput per cycle. A GPU would leave a P4 in its dust.

I am not so sure about that ;)
 
<sigh>

There seems little point in continuing this further... if this was the best way to go, people would be going that way.

It isn't, so they aren't.
 
Dio,

I apologize if I got you very annoyed... it was not my intention...


I was not pretending that was the BEST way, but a possibly cheap way of doing things...

I never said that a Pentium 4 would outclass or even match the NV30 or the R300, I just said it would not be LIGHT years away...

My point was a price related point with the observation that the performance would not suck that badly...

Some of the ideas I was presenting about deferred T&L based on tessellating only the visible portion of HOS are things I have heard PS2 programmers thinking about researching, if my memory doesn't fail me... the EE is not a GPU, the EE is a normal CPU and the two VUs are normal VLIW processors ( with a SIMD engine in them )... what is going wrong is that this process is difficult because the EE lacks more serial performance ( sorting benefits from a much faster clock speed ) and it lacks good caching, which kills integer performance...
 
Pan, a clarification - anything that is not cache or data registers I consider instruction circuitry. A P4 has about 22 million transistors dedicated to this; a GPU is nearly all instruction circuitry - my position has never changed - I don't know how to say it any clearer, sorry.

As a theoretical exercise, fine - it's hard to keep an open mind. Perhaps read the Stanford papers on 3D raytracing on modern GPUs (rtongfx.pdf) to get a feel for what is and isn't possible.

http://216.239.33.100/search?q=cach...tongfx/rtongfx.pdf+rtongfx.pdf&hl=en&ie=UTF-8
 
Panajev2001a said:
I apologize if I got you very annoyed... it was not my intention...
I wasn't annoyed - but my reply was in danger of being reasonably close to outright flaming and counterproductive for everyone. And, as noted below, I'd misunderstood your point somewhat :)

I never said that a Pentium 4 would outclass or even match the NV30 or the R300, I just said it would not be LIGHT years away...
I misunderstood this point and must apologise. You are quite right, performance is reasonably close.

Some of the ideas I was presenting about deferred T&L based on tessellating only the visible portion of HOS are things I have heard PS2 programmers thinking about researching, if my memory doesn't fail me... the EE is not a GPU, the EE is a normal CPU and the two VUs are normal VLIW processors ( with a SIMD engine in them )... what is going wrong is that this process is difficult because the EE lacks more serial performance ( sorting benefits from a much faster clock speed ) and it lacks good caching, which kills integer performance...
The PS2's a pretty complex beast. While I've never programmed one I went to a lecture at EGDC about PS2 optimisation, which was very interesting.

From what I saw there it very much has to be a holistic approach to programming, which is difficult for a modern 10-man team working in C... one example from the lecture was that in some large % of games (80%, was it?), one of the VUs is never used because the programmers just couldn't work out how to make it do anything useful.
 
Panajev2001a said:
I was not pretending that was the BEST way, but a possibly cheap way of doing things...

Uh, excuse me for laughing out loud, but CHEAP way of doing things?!? Have you checked recently the price of the P4 3GHz? If it's one thing that chip is NOT, it's cheap, man! :)

And while your approach is sorting polygons to do deferred transforms, it's not transforming. And while it is transforming, it is not running game code. You're clogging the CPU with lots of data shuffling tasks and floating-point calculations which will make a high-end system perform outright badly when all things are considered.

I fail to see the point of it all. From a cost-effectiveness POV, it's going to SUCK. Plain and simple!


*G*
 
Uh, excuse me for laughing out loud, but CHEAP way of doing things?!? Have you checked recently the price of the P4 3GHz? If it's one thing that chip is NOT, it's cheap, man!

Well, you are mistaking manufacturing cost ( the one I am referring to... and the price of the 3 GHz Pentium 4 will drop once Prescott releases, and the same will happen to Prescott when the new AMD chip ships to market with good benchmarks and a lower price... ) with sale price...

Intel is selling its chips at considerably more than it costs them to produce ( they have huge volumes too ); in the industry they are appreciated for their high ASPs ( Average Selling Price ), at least from an economic point of view...

That doesn't mean they couldn't lower the final price of the chip quite a bit...

Plus, you're forgetting something: those Pentium 4s are going to find their way sooner or later into pre-built PCs in stores, at DELL and HPaq online shops... before NV30s and R300s... I have seen in the past a lot of computers shipping to the masses with high-end CPUs and quite low-end graphics cards ( selling mainly on the CPU name )...

I was considering a system with a Pentium 4 already in it and I was thinking about the integrated graphics chip you could pair it with to support DX9+, for example... and as far as T&L is concerned, I saw that an optimized implementation could run decently on the host CPU, saving quite a bit of silicon area on the GPU and allowing the GPU to be clocked higher to increase its fill-rate...



And while your approach is sorting polygons to do deferred transforms, it's not transforming. And while it is transforming, it is not running game code. You're clogging the CPU with lots of data shuffling tasks and floating-point calculations which will make a high-end system perform outright badly when all things are considered.

HT... buzzword or useful feature if we have several tasks that want to run at the same time and we want as little pain from context switching as possible...

ANSWER: BEEEP-> useful feature :)

The SSE/SSE2 units would run T&L, and I think that sorting the vertex stream could be something that the Pentium 4 chip could do quite fast ( 3+ GHz gives you a lot of cycles to spend :) )

The two ALUs and the x87 FPU could be dedicated to working mostly on game code and physics...
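
As a rough illustration of the kind of work that would land on the SSE unit, here is one vertex transformed by a column-major 4x4 matrix with SSE intrinsics (the function and data layout are assumptions for the sketch, not anyone's actual engine code):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* out = M * in, with M stored column-major as 16 floats.
 * Unaligned loads/stores are used so no particular alignment is assumed. */
void transform_vertex(const float m[16], const float in[4], float out[4])
{
    __m128 x = _mm_set1_ps(in[0]);
    __m128 y = _mm_set1_ps(in[1]);
    __m128 z = _mm_set1_ps(in[2]);
    __m128 w = _mm_set1_ps(in[3]);

    __m128 r = _mm_mul_ps(_mm_loadu_ps(&m[0]), x);            /* column 0 * x   */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[4]),  y));   /* + column 1 * y */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[8]),  z));   /* + column 2 * z */
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[12]), w));   /* + column 3 * w */

    _mm_storeu_ps(out, r);
}
```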


trust me it is possible... and if you do not believe me well look at the Dreamcast...


PVR GPU with no T&L engine + SH-4 processor with SSE type Vector Unit ( of course it is not so similar, but it is built on top of the RISC FPU and the two cannot coexist at the same exact time... )... that ended up working ok... the SH-4 was clocked at 200 MHz and had PC100 SDRAM as main RAM basically...

Deferred transforming was meant, in this example, to help reduce the effective T&L load... the control points of the HOS are considerably fewer than the vertices of those surfaces when fully tessellated... if we can eliminate the hidden surfaces, or portions of them, we reduce the number of surfaces to tessellate ( less load on the CPU as far as tessellation is concerned ) and the number of triangles we have to run complex vertex shaders on ;)
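
To put rough numbers on that saving: a single bicubic patch has 16 control points, but once tessellated into an n x n grid of quads it produces (n+1)^2 vertices and 2n^2 triangles, so every patch culled before tessellation avoids a few hundred vertex-shader runs (the tessellation level below is just an assumed example):

```c
#include <stdio.h>

/* Work produced by tessellating one bicubic patch into an n x n quad grid. */
int main(void)
{
    int n = 16;                          /* assumed tessellation level       */
    int control_points = 4 * 4;          /* bicubic patch: 16 control points */
    int vertices  = (n + 1) * (n + 1);   /* 289 vertices to transform        */
    int triangles = 2 * n * n;           /* 512 triangles to set up          */

    printf("%d control points -> %d vertices, %d triangles per patch\n",
           control_points, vertices, triangles);
    return 0;
}
```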

And BTW, if we do deferred T&L we also eliminate the need for a complex occlusion detection mechanism in our GPU, simplifying its design even further ( clock speed going up ;) ) and saving more money on the GPU :)
 
Panajev2001a said:
And BTW, if we do deferred T&L we also eliminate the need for a complex occlusion detection mechanism in our GPU, simplifying its design even further ( clock speed going up ;) ) and saving more money on the GPU :)
What is deferred rendering if not a complex occlusion detection mechanism? :)
 
well Dio ( you know that "Dio" is "God" in Italian right ? ;) ) ,

We are doing that complex thingy on already existing silicon, the Pentium 4... :D

With .13u or .09u ( Intel has both technologies... while the NV30 is struggling to come out in .13u, at approximately the same time [IIRC] Intel will be ready to ship the .09u Prescott chip... Intel is a whole generation and a leg ahead of whatever process ATI and nVIDIA have access to with good yields... sorry, I digressed :( ), Intel could push the integrated chipset fast...

the GPU would not need complex occlusion detection HW ( just a simple Z-buffer for legacy title support ), it would not need T&L and clipping logic ( well, the VUs in the EE do the clipping and part of set-up IIRC ) and by aiming for a fast clock-rate it would not even need many parallel pixel pipelines, as it would only need a Pixel Shading Unit, one or two pixel pipelines and single-pass multi-texturing through loopbacks, as many as it needs ;)

The simplicity of the GPU's logic design would help us increase the clock rate, and with Intel's manufacturing we could use this advantage well...

And with deferred T&L performed by the CPU, the GPU doesn't need full display lists in order to render with max efficiency... we can do texture streaming through Virtual Texturing with a normal-sized SRAM cache ( so as not to limit the clock-rate as much as e-DRAM would ) and normal VRAM ( 500 MHz DDR-II not being a MUST )...
 
Panajev2001a said:
the GPU would not need complex occlusion detection HW ( just a simple Z-buffer for legacy title support ), it would not need T&L and clipping logic ( well, the VUs in the EE do the clipping and part of set-up IIRC ) and by aiming for a fast clock-rate it would not even need many parallel pixel pipelines, as it would only need a Pixel Shading Unit, one or two pixel pipelines and single-pass multi-texturing through loopbacks, as many as it needs ;)

There's a reason modern graphics chips have multiple pipelines. It results in higher performance than increasing the clock speed. Why? Because it's easier to double the number of pipelines than it is to double the clock speed.
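
In throughput terms the trade-off is just a product: theoretical fill rate = pipelines x clock, so eight pipelines at 325 MHz match what a single pipeline would need 2.6 GHz to deliver (the numbers are only there to illustrate the point):

```c
#include <stdio.h>

/* Theoretical fill rate = pipelines * clock (illustrative numbers only). */
int main(void)
{
    unsigned pipes = 8, clock_mhz = 325;
    unsigned fill  = pipes * clock_mhz;           /* 2600 Mpixels/s */

    printf("%u pipes @ %u MHz = %u Mpixels/s\n", pipes, clock_mhz, fill);
    printf("1 pipe needs %u MHz for the same peak fill rate\n", fill);
    return 0;
}
```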

There are only two reasons to go for high clock speeds instead of a highly-parallel architecture (which, by the way, are not mutually exclusive, either...):

1. PR. People think higher clock speed means higher performance.
2. The instruction set of the x86 architecture doesn't lend itself well to parallelism (hopefully HyperThreading and/or multi-core chips will help to change this).
 
There's a reason modern graphics chips have multiple pipelines. It results in higher performance than increasing the clock speed. Why? Because it's easier to double the number of pipelines than it is to double the clock speed.

There are only two reasons to go for high clock speeds instead of a highly-parallel architecture (which, by the way, are not mutually exclusive, either...):

1. PR. People think higher clock speed means higher performance.
2. The instruction set of the x86 architecture doesn't lend itself well to parallelism (hopefully HyperThreading and/or multi-core chips will help to change this).

1 might be true... funny, though, that another x86 vendor thinks that x86 can be improved a lot on the IPC side as well...

2: it would not matter much, as after the decoder which sits before the Trace Cache all the rest is RISC-like u-ops, on which high parallelization ( and deep pipelining, SMT, OOOe, etc... ) is possible... or would you call a chip like the K7 a speed-racer design? 3 full ALUs with 1 AGU each, 1 FP Add/Sub pipe and 1 FP Mul/Div pipe ( both with MMX and 3DNow capabilities ) and 1 L/S unit... it seems parallel to me, and another thing: the K7 packs the u-ops to be executed in what could be called a VLIW packed form...


Another reason modern graphics chips do not have very high clock speeds is that they pack a lot of transistors, and that the transition from VHDL description to actual silicon is not optimised too well ( ATI guys were commenting on how they got 300+ MHz out of a 100+ MTransistor beast on a 0.15u process by saying they improved on the actual circuit design )... going parallel is a bit easier in this case than pushing a much higher clock-speed...

I trust Intel to have more resources and more experience in high speed circuit design than ATI and nVIDIA...

Intel would have the processes and know-how to manufacture a high speed GPU ( the design that I had in mind and described here ), and pushing the clock-speed high enough would help us save silicon space on the chip because we would have to parallelize less...

Plus, the GPU in this case would only have to run shaders on visible pixels, and since we can expect longer and longer shaders in the future we have two paths... put in a decent Pixel Shader and ramp up the clock-speed, or put several Pixel Shader units in parallel... both can be exploited...


and yes, a highly parallel architecture ( a powerful one ) and very high clock-speed often become mutually exclusive, as it becomes hard to build the chip and get it stable... the pipeline must be stretched further just to allow signals to get from one part of the chip to the other, the transistor count is high and so can be the power consumption...

it can be done, but it needs an immense amount of resources and a breakthrough in manufacturing technology...
 
You're forgetting something, Panajev. The CPU already has a lot of work to do: it has to calculate physics, AI, or whatever else you can't offload to the GPU.

I trust Intel to have more resources and more experience in high speed circuit design than ATI and nVIDIA...

Don't be so sure; remember that this is basically the only thing ATI and Nvidia have done, and they are, believe it or not, quite good at it. Designing a CPU is arguably very different compared to designing a highly parallel GPU.
 
Designing a CPU is arguably very different compared to designing a highly parallel GPU.

yes, it is harder IMHO ( on the HW and software side )... on GPUs you know you are working with a dataset which contains a nice degree of parallelism, extracting it is not the principal issue, and you are much more relaxed about latency issues and branching...

and btw, one of the many advantages Intel has as far as high speed circuit design goes is, as I said, manufacturing... something in which Intel is quite far ahead of both nVIDIA and ATI... nVIDIA and their partners are struggling to ship a .13u part while Intel is putting the final touches on a .09u 100+ MTransistor beast...

that's not to say nVIDIA and ATI do not have experience in designing 3D processors... but Intel is no slouch at chip design and manufacturing ( and in the last two years, in addition to all the teams they have had for a long time, they added two ex-HP design teams and the whole Alpha team from Compaq/DEC that was working on the EV8, code name Aranha )
 
Panajev2001a said:
1 might be true... funny, though, that another x86 vendor thinks that x86 can be improved a lot on the IPC side as well...

I was attempting to state that this was the major reason Intel is able to compete at all with a chip that has higher clock speed, but far fewer execution units.

2: it would not matter much, as after the decoder which sits before the Trace Cache all the rest is RISC-like u-ops, on which high parallelization ( and deep pipelining, SMT, OOOe, etc... ) is possible...

Sure it does. Management at the compiler level lends itself to far better optimizations for parallelism than does runtime optimization.

I trust Intel to have more resources and more experience in high speed circuit design than ATI and nVIDIA...

But Intel doesn't have experience in producing processors that are even close to as massively-parallel as ATI's and nVidia's. Their own failure to produce a graphics chip even close to the performance of competitors shows this.

and yes, a highly parallel architecture ( a powerful one ) and very high clock-speed often become mutually exclusive, as it becomes hard to build the chip and get it stable...

Right, there are engineering concerns that keep them from working well together, but there is nothing fundamental keeping a processor from being both. When I say "mutually exclusive" I don't mean it's hard to get them to work well together, I mean impossible.
 
Sure it does. Management at the compiler level lends itself to far better optimizations for parallelism than does runtime optimization.
Well, Intel is trying to do something along these lines with EPIC, but still there is quite a lot of small stuff that the chip has to manage at the run-time level ( I think LOADs/STOREs, for example, are processed in a re-ordered fashion if necessary )...

Anyway, I disagree with your statement; compilers for x86 and the latest incarnations of the IA-32 ISA do a good enough job of parallelizing code for the CPU... plus the CPU is given the power to make decisions at run time which would be difficult for a compiler to achieve, as some things are only known at run-time, and in general this allows the CPU to make better decisions than what a general compiler would do... or to build on top of them and improve performance even further...

run-time optimization is NEEDED... even EPIC needs heavy FDO ( Feedback-Directed Optimization [profiling, etc...] ), whose results are analyzed to optimize the code for the final "final" release...

The x86 ISA has grown, more RISC-like instructions have already shown up, and compilers do take advantage of this, and they do it well, as Intel's own compiler shows...


But Intel doesn't have experience in producing processors that are even close to as massively-parallel as ATI's and nVidia's. Their own failure to produce a graphics chip even close to the performance of competitors shows this.

no, it shows that they do not care enough at this moment, and they do not want to pull teams off Prescott, Madison and the brand new IA-64 cores, the XScale line or the newest breakthroughs in their manufacturing tech... which, again, is quite FAR ahead of what nVIDIA has access to...
 
Panajev:
I haven't read through the whole thread (struth, life's too short), but can I just say the following?

Yes, the new CPUs are fast (well faster than their predecessors) and they are general purpose, however, dedicated hardware is much faster at its role simply because it is specialised.

Analogies are always dangerous but, FWIW, you can use a set of general purpose tools (eg screwdriver, hammer, spanner) to do a job, but it'll be quicker with specialised equipment.
 
Chalnoth said:
But Intel doesn't have experience in producing processors that are even close to as massively-parallel as ATI's and nVidia's. Their own failure to produce a graphics chip even close to the performance of competitors shows this.

Intel produces massively parallel supercomputers. Look at their site for some info, plus of course the usual suspects for high-performance computing.

Saying that "Intel doesn't have experience in producing processors that are even close to as massively-parallel as ATI's and nVidia's" is simply flat out wrong. You have confused the mouse with the lion.

Intel doesn't compete in discrete gfx chips right now; instead they are eating the gfx chip market from the bottom up. And quite successfully too: they are the major supplier of integrated gfx solutions. Never forget they control the platform (with Microsoft). Speculation: they don't want to kill off gfx card vendors innovating and creating extra value for niche users any more than they want to kill off high-end sound cards - the niche these vendors fill will get progressively narrower, that's all, as the integrated functionality fulfills a larger share of user needs.

Besides, the discussion here is too one-sided, it only looks at the gfx-side of the equation. If you want a more entertaining and interesting(?) discussion, consider the benefits for the host system if it could use that fast gfx memory that just sits useless unless you are doing 3D-rendering. That's a different motivation for a re-think of current paradigms. The cost angle is already handled adequately by integration into the chipset.

Entropy
 