Design 2 will be better than Design 1, IF they are bandwidth limited. If they are fillrate limited, then in a case such as Doom 3 and stencil shadows, the extra bandwidth when not texturing will remain unused anyway.
I'm talking about future games with heavy use of the stencil buffer.
pascal said: Tons of untextured polygons? Are you talking about the multiple passes per pixel needed by some old hardware?
Splitting frame buffer and texture bandwidth vs a unified bandwidth scheme with almost equal aggregate total bandwidth is giving you bad performance in almost every possible situation, that's a fact.
edited: AFAIK the major problem Doom3 has is texture bandwidth. JC had to use compressed textures and disable anisotropic filtering with the GF3.
Sorry, but imo u're plain wrong. Those memory architectures do help latency and page hits.
All this crossbar talk will not improve the real memory latency and page hits; memory controllers and caches, both architectures have them.
Umh.. don't talk about design problems, u just doubled the bus width.
Also, many of the problems designers have today come from the high frequencies used by memory, right? And it will get worse.
If we have a GF4 core like that coupled with a working frequency like that, my guess is nvidia engineers are drunk.
edited: suppose we have an improved GF4 core (8-level multitexturing in a single pass) at a lower frequency (200MHz core); my guess is Design1 could be done today, very cheaply, with great performance in almost all situations.
Not if the next generation hw allocates more resources (i.e. <virtual> pixel pipes), devoting texture iterators and such to fill more pixels at the same time. And even if the next designs are not that flexible, a 350 MHz 8-pixel-pipe core can theoretically write more than 10 GB/s to memory (1 pixel = 24-bit zbuffer + 8-bit stencil), so Design1 loses anyway, while a single 128-bit bus coupled with fast (800+ MHz) ram could satisfy even 8 pipes in such a case (quick numbers below).
Colourless said: Design 2 will be better than Design 1, IF they are bandwidth limited. If they are fillrate limited, then in a case such as Doom 3 and stencil shadows, the extra bandwidth when not texturing will remain unused anyway.
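A quick sanity check on that >10 GB/s figure. This is only a back-of-the-envelope sketch: it assumes one 32-bit z+stencil write per pipe per clock and ignores z reads, blending and any compression.

#include <stdio.h>

int main(void)
{
    const double core_hz  = 350e6; /* hypothetical 8-pipe core clock          */
    const int    pipes    = 8;
    const int    bytes_px = 4;     /* 24-bit z + 8-bit stencil = 4 bytes      */

    /* peak write traffic when filling untextured (stencil/z-only) pixels */
    double gbytes_s = core_hz * pipes * bytes_px / 1e9;
    printf("z/stencil write traffic: %.1f GB/s\n", gbytes_s); /* ~11.2 GB/s */
    return 0;
}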
Splitting frame buffer and texture bandwidth vs a unified bandwidth scheme with almost equal aggregate total bandwidth is giving you bad performance in almost every possible situation
Umh, where did I write that I was referring to Design2? In fact I didn't. With current technology one could expect a single 128-bit DDR bus to reach more than 12 GB/s. Why a hw design should implement a 256-bit split asynchronous bus vs a single fast 128-bit bus I don't know, but that wasn't my idea. If one wants to go with a 256-bit DDR bus, just do it with 4 or 8 memory controllers. Increasing costs by doubling the bus and then trying to reduce them by using slow memory doesn't make any sense at all. Splitting buses could make sense if one bus is a LOT larger and faster than the other, like in a gpu with edram + slow external memory, but I wouldn't call that a split bus architecture.
Teasy said: 13.3GB/s vs 9.6GB/s, almost equal? Design 1 has 40% more bandwidth than Design 2.. I wouldn't call that almost equal.
Teasy said:
Splitting frame buffer and texture bandwidth vs a unified bandwidth scheme with almost equal aggregate total bandwidth is giving you bad performance in almost every possible situation
13.3GB/s vs 9.6GB/s, almost equal? Design 1 has 40% more bandwidth than Design 2.. I wouldn't call that almost equal.
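For reference, figures like those fall out of bus width x clock x 2 (DDR). The clocks below are only my guesses, reverse-engineered from the 13.3/9.6 numbers (the original design specs aren't quoted in this thread), so treat them as purely illustrative.

#include <stdio.h>

/* peak bandwidth in GB/s for a DDR bus that is width_bits wide at clock_mhz */
static double ddr_gb_s(int width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz * 1e6 * 2.0 / 1e9;
}

int main(void)
{
    /* Design 1: two split 128-bit DDR buses on slower (~208 MHz) memory */
    double design1 = ddr_gb_s(128, 208.0) + ddr_gb_s(128, 208.0);
    /* Design 2: one unified 128-bit DDR bus on faster (300 MHz) memory  */
    double design2 = ddr_gb_s(128, 300.0);

    printf("Design 1: %.1f GB/s  Design 2: %.1f GB/s\n", design1, design2);
    return 0;
}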
nAo:
IMHO, you're wrong. Design1 will be more expensive than Design2 (128+128 bit buses vs 128 bit bus) and in some circumstances even slower,
Please, could you explain? Thanks.
I'm talking about future games with heavy use of the stencil buffer.
Splitting frame buffer and texture bandwidth vs a unified bandwidth scheme with almost equal aggregate total bandwidth is giving you bad performance in almost every possible situation, that's a fact.
four words: performance with low price
If u don't like my stencil buffer examples, what about heavy texturing with low fillrate (limited by the 'clamped' texture bandwidth)? Your fb bandwidth will be wasted while your texture bandwidth will be saturated.
2 words: load balancing
Of course they help.
Sorry, but imo u're plain wrong. Those memory architectures do help latency and page hits.
But at a lower frequency.
Umh.. don't talk about design problems, u just doubled the bus width.
If we have a GF4 core like that coupled with a working frequency like that, my guess is nvidia engineers are drunk.
Entropy:
The comparison is fundamentally flawed. In order to be meaningful, Design 1 should be compared to a datapath of 256 bits which can be used for either purpose.
pascal said: Can you prove that? Numbers please.
nAo:
IMHO, you're wrong. Design1 will be more expensive than Design2 (128+128 bit buses vs 128 bit bus) and in some circumstances even slower,
Please, could you explain? Thanks.
I'm talking about future games with heavy use of the stencil buffer.
I am comparing performance AND price.
four words: performance with low price
Of course they help.
Sorry, but imo u're plain wrong. Those memory architectures do help latency and page hits.
Yeah, total latency = L1latency x L1hit + L2latency x L1miss, etc, etc, etc.
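For anyone following along, that is just the usual weighted-average latency formula. A toy example (the latencies and hit rate below are made up, not measured from any GPU):

#include <stdio.h>

int main(void)
{
    double l1_hit_rate = 0.90;  /* fraction of accesses served by the cache */
    double l1_latency  = 4.0;   /* clocks when we hit                        */
    double mem_latency = 40.0;  /* clocks for a real DRAM access on a miss   */

    /* average latency = hits * cache latency + misses * memory latency */
    double avg = l1_hit_rate * l1_latency + (1.0 - l1_hit_rate) * mem_latency;
    printf("average latency: %.1f clocks\n", avg); /* 0.9*4 + 0.1*40 = 7.6 */
    return 0;
}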
It can be mostly hidden, on a good memory controller, via bank precharge. A lot of memory accesses on 3d chips can be prescheduled, so u know in advance what u'll need and when u'll need it.
I am talking about the latency and page hits BEFORE that, the real ones in the memory access.
But at a lower frequency.
Umh.. don't talk about design problems, u just doubled the bus width.
Every 3d card is bandwidth limited. And rolling your eyes 10 times doesn't give your words more credibility.
They are drunk now. The simplest GF3 is completely bandwidth limited and we all know that. It is not capable of sustaining the MARVELOUS / FANTASTIC / UNBELIEVABLE 1.4 GTexels/s.
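Rough numbers behind that claim. My assumptions: 2 texels per pixel, so 1.4 GTexels/s means 700 Mpixels/s; a 32-bit z read, z write and colour write per pixel; texture traffic ignored entirely; and roughly 7.4 GB/s of raw bandwidth on a stock GF3, if I remember the memory clock right.

#include <stdio.h>

int main(void)
{
    double texel_rate = 1.4e9;          /* claimed peak texel fill rate       */
    double pixels_s   = texel_rate / 2; /* assume 2 texels per pixel          */
    int    bytes_px   = 4 + 4 + 4;      /* z read + z write + colour write    */
    double gf3_gb_s   = 7.4;            /* approx. raw GF3 memory bandwidth   */

    double needed = pixels_s * bytes_px / 1e9;
    printf("needed (no textures!): %.1f GB/s, available: %.1f GB/s\n",
           needed, gf3_gb_s);           /* 8.4 GB/s needed vs ~7.4 available  */
    return 0;
}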
This is a simple/basic formula to calculate latency in a memory hierarchy.
Dunno how u came up with that formula, but it's nonsense.
Every 3d card is bandwidth limited. And rolling your eyes 10 times doesn't give your words more credibility.
They have already added more than that to go from 200MHz to 300MHz DDR, with a fast unusable core.
Oh my god. Take a current gpu, expand its area to fit the extra bond pads, add 200 pins to the package and traces on the PCB, then come back to me and say it's cheap. If it was cheap u'd see a 256-bit bus now.
Unfortunately, I can't see them.
It doesn't explain how/why Doom3 (is that next-generation enough for you?) performance doubled with compressed textures and disabled anisotropic. Simple Amdahl's Law (rough numbers below).
Next generation games will do a lot of geometry passes with untextured polygons to create correct shadow volumes.
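To spell out the Amdahl's Law point above: if cutting texture traffic (compression plus no aniso) doubles overall performance, texture fetch must have been eating at least half the frame time, because even an infinite speedup of one part only buys you 1/(1-f). A small check; the 2x figure is the one quoted for Doom3 on a GF3, everything else is just the formula.

#include <stdio.h>

int main(void)
{
    double overall_speedup = 2.0; /* reported: Doom3 roughly doubled on GF3 */

    /* Amdahl: max speedup from improving only a fraction f is 1/(1-f),
       so an observed speedup S requires f >= 1 - 1/S                     */
    double f_min = 1.0 - 1.0 / overall_speedup;
    printf("texture-bound fraction was at least %.0f%% of frame time\n",
           f_min * 100.0);        /* >= 50% */
    return 0;
}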
Adding more pins for power, ground, shielding, driving fast memory, some expensive packaging (almost one thousand balls), etc...
Yeah, and I'm saying to you that a 128-bit DDR bus coupled with fast memory is a lot cheaper than a split 256-bit bus, and in most cases it's even faster.
Prove that.
Your solution has no low price.
At what cost?
It can be mostly hidden, on a good memory controller, via bank precharge. A lot of memory accesses on 3d chips can be prescheduled, so u know in advance what u'll need and when u'll need it.
What vapour???
Current designs have shown that this is not a huge problem, meanwhile we have yet to see a 256-bit bus consumer card. Talking about vapour...
pascal said: One more thing about this Wonderful 128-bit LMA and LMA II.
http://www.3dvelocity.com/reviews/3dblaster/ti4400_2.htm
They say it has a 128-bit data bus, right?
How many bits for the address bus?????
If it is really capable of doing four independent memory accesses it needs at least four address buses. Each one with at least 27 bits (128MB), total 108 bits for addressing. Is that right???
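A rough sketch of the address-pin arithmetic, under my own assumptions: 128MB total split over four 32-bit wide channels (my guess at the layout, not something the article states). 27 bits is what byte addressing of 128MB would need, but each channel only addresses 4-byte words of its own 32MB quarter, and SDRAM multiplexes row and column over the same pins, so the real per-channel pin count is far smaller than 27.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double total_bytes    = 128.0 * 1024 * 1024; /* 128MB card                  */
    int    channels       = 4;                   /* assumed: 4 x 32-bit channels */
    int    bytes_per_word = 4;                   /* each channel moves 32 bits   */

    double words     = total_bytes / channels / bytes_per_word;
    double flat_bits = log2(words);              /* ~23 bits per channel         */

    /* SDRAM shares row/column addresses over the same pins (RAS/CAS), so the
       pin count is roughly half of that, plus a couple of bank-select pins   */
    printf("flat address bits per channel: %.0f (vs 27 for byte addressing)\n",
           flat_bits);
    printf("multiplexed row/col pins: about %.0f, plus bank selects\n",
           ceil(flat_bits / 2));
    return 0;
}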
This is a simple/basic formula to calculate latency in a memory hierarchy. Get a computer architecture book. Study before you say nonsense.
How do u know all the added pins are devoted to memory access? I don't know. U're making a lot of assumptions.
They have already added more than that to go from 200MHz to 300MHz DDR, with a fast unusable core.
It doesn't explain how/why Doom3 (is that next-generation enough for you?) performance doubled with compressed textures and disabled anisotropic. Simple Amdahl's Law.
Sometimes I believe u live on another planet.
Adding more pins for power, ground, shielding, driving fast memory, some expensive packaging (almost one thousand balls), etc...
U made an extraordinary statement, u have to prove it, not me.
Prove that.
Your solution has no low price.
At what cost?
It can be mostly hidden, on a good memory controller, via bank precharge. A lot of memory accesses on 3d chips can be prescheduled, so u know in advance what u'll need and when u'll need it.
No, it's not. This is clear evidence that you should cut&paste from books less and start to think with your brain more, cause u haven't understood the basics here. What's the purpose of having N memory controllers + a crossbar if a single mem controller has access to ALL the addressable memory? Obviously this is not the case: each controller has access to 2 memory modules at a time, with the same address lines shared.
They say it has a 128-bit data bus, right?
How many bits for the address bus?????
If it is really capable of doing four independent memory accesses it needs at least four address buses. Each one with at least 27 bits (128MB), total 108 bits for addressing. Is that right???
3dcgi said: So then there is the tradeoff of which part takes longer, the untextured polygons or the visible scene data. In general I believe textures will still be the bottleneck, but it's hard to predict how the next generation engines will actually perform.
nAo:
That's your problem Pascal, having read some formula in a book or on the internet doesn't make u an expert. Maybe u should understand where that math makes sense before using it. In this case it doesn't, cause we're speaking about multiple memory controllers, no L2, and we have no information on how the L1 design works. A GPU is not a CPU; on a GPU u have very different hit/miss ratios depending on whether u are speaking about textures, or zbuffer, or framebuffer, etc..
The same 4k cache could have a 75% hit ratio for textures and a 25% hit ratio for the framebuffer (and u'll end up with very different cache designs, customized to each kind of data).
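Plugging those hypothetical hit ratios into the same weighted-average formula shows why a single number is meaningless for a GPU. Only the 75%/25% split comes from the post above; the latencies are invented.

#include <stdio.h>

int main(void)
{
    double hit_lat = 4.0, mem_lat = 40.0;         /* invented cache / DRAM latencies */

    double tex = 0.75 * hit_lat + 0.25 * mem_lat; /* texture stream, 75% hits        */
    double fb  = 0.25 * hit_lat + 0.75 * mem_lat; /* framebuffer stream, 25% hits    */

    printf("textures: %.0f clocks avg, framebuffer: %.0f clocks avg\n", tex, fb);
    /* 13 vs 31 clocks - same cache, very different effective latency */
    return 0;
}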
Can you tell what they are doing with those pins?
How do u know all the added pins are devoted to memory access? I don't know. U're making a lot of assumptions.
But you were talking about next generation games. Do you have better data?
I don't pretend to use uncertain data about a single unfinished game engine as a benchmark, and u shouldn't either.
Oh nAo, I am really tired of you. You really want it to get personal.
Sometimes I believe u live on another planet.
Pascal, tell me why 256-bit DDR bus GPUs are only in the high end market at the moment, and why all those smart engineers at Nvidia/Ati/etc... are just dumb and started to develop faster mem controllers when a simpler, cheaper and faster (to you) solution exists.
And I am neglecting yours.
U made an extraordinary statement, u have to prove it, not me.
I'm just neglecting that statement.
Very low cost. Prefetching is not martian technology. I believe all the current GPUs already do that. You can do it on a CPU, where u can just try to analyze memory access patterns; u can do it much better on a GPU, where u can predict almost everything but visibility.
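A toy illustration of the prescheduling idea, nothing to do with any real GPU's pipeline: because a rasterizer knows which texels pixel i+N will need long before it shades pixel i, it can issue the requests N pixels ahead. As long as N times the pixel rate covers the DRAM latency, the latency is hidden. The texel_addr / issue_fetch / fetch_result helpers are just stand-ins here, not any real API.

#include <stdio.h>

#define LOOKAHEAD 16  /* how many pixels ahead we issue texture requests */

/* hypothetical stand-ins for hardware: compute a texel address for pixel i,
   enqueue a memory request, and collect the data once it has arrived       */
static unsigned texel_addr(int i)          { return (unsigned)i * 4u; }
static void     issue_fetch(unsigned addr) { (void)addr; /* enqueue request */ }
static unsigned fetch_result(int i)        { (void)i; return 0; /* texel data */ }

int main(void)
{
    int pixels = 1024;

    /* prime the queue: requests for the first LOOKAHEAD pixels go out early */
    for (int i = 0; i < LOOKAHEAD && i < pixels; i++)
        issue_fetch(texel_addr(i));

    for (int i = 0; i < pixels; i++) {
        if (i + LOOKAHEAD < pixels)
            issue_fetch(texel_addr(i + LOOKAHEAD)); /* preschedule pixel i+N */
        unsigned texel = fetch_result(i);           /* already in flight      */
        (void)texel;                                /* ...shade pixel i here  */
    }
    printf("shaded %d pixels with a %d-deep prefetch queue\n", pixels, LOOKAHEAD);
    return 0;
}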
No, it's not. This is clear evidence that you should cut&paste from books less and start to think with your brain more, cause u haven't understood the basics here. What's the purpose of having N memory controllers + a crossbar if a single mem controller has access to ALL the addressable memory? Obviously this is not the case: each controller has access to 2 memory modules at a time, with the same address lines shared.
From the article above:
The idea of NVIDIA's crossbar memory architecture is that rather than have a single "lane" along which data can travel, four "lanes" are used, each of which is able to carry 64 bits of data at a time. This way, if say four 64 bit data chunks need to be transferred, it can be done in a single transfer while the traditional approach would mean four separate transfers.
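A sketch of what "four lanes" means in practice. The 32-byte interleave granularity below is my own assumption, not something from the article: consecutive chunks of memory are spread across the four controllers, so four unrelated small requests can proceed in parallel instead of each occupying one wide monolithic bus transfer.

#include <stdio.h>

#define CONTROLLERS 4
#define CHUNK       32  /* assumed interleave granularity in bytes */

/* which of the four memory controllers owns a given byte address */
static int controller_for(unsigned addr)
{
    return (int)((addr / CHUNK) % CONTROLLERS);
}

int main(void)
{
    /* four small, unrelated requests (e.g. z, colour, texture, vertex) */
    unsigned req[4] = { 0x0000, 0x1020, 0x2040, 0x3060 };

    for (int i = 0; i < 4; i++)
        printf("request 0x%04x -> controller %d\n", req[i], controller_for(req[i]));
    /* if they land on different controllers they can be serviced in the
       same cycle; a single wide bus would spend four transfers instead   */
    return 0;
}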
Just my guesses: everything but geometry is stored in a tiled fashion (rough sketch below).
MfA said: Just curious, does anyone know how exactly data is distributed across the 4 sections? (Maybe someone tried to analyze pathological cases, or the people with the expensive logic analyzers perhaps?) Based on tiling? And how about vertices?
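And a sketch of what "stored in a tiled fashion" could look like; the tile size and layout are pure guesses on my part. Instead of plain scanline order, the framebuffer is laid out in small 2D tiles, so the pixels a triangle touches land in a handful of DRAM pages (and can be spread nicely over the controllers) rather than being scattered across many rows.

#include <stdio.h>

#define TILE_W   8          /* guessed tile size: 8x8 pixels */
#define TILE_H   8
#define BPP      4          /* 32-bit colour                 */
#define SCREEN_W 1024

/* byte offset of pixel (x,y) in a tiled framebuffer layout */
static unsigned tiled_offset(unsigned x, unsigned y)
{
    unsigned tiles_per_row = SCREEN_W / TILE_W;
    unsigned tile_index    = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    unsigned in_tile       = (y % TILE_H) * TILE_W + (x % TILE_W);
    return (tile_index * TILE_W * TILE_H + in_tile) * BPP;
}

int main(void)
{
    /* two vertically adjacent pixels: 4KB apart in a linear layout,
       but only 32 bytes apart inside the same tile here              */
    printf("(100,100): %u\n", tiled_offset(100, 100));
    printf("(100,101): %u\n", tiled_offset(100, 101));
    return 0;
}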
MfA:
Just curious, does anyone know how exactly data is distributed across the 4 sections? (Maybe someone tried to analyze pathological cases, or the people with the expensive logic analyzers perhaps?) Based on tiling? And how about vertices?
Quad Cache architecture uses four independent high speed cache memory locations to store primitive, vertex, texture, and pixel information.