Adding full random memory of a CPU


Reverend
20-Dec-2004, 02:17
I've been talking to a few game developers about writeback from vertex shaders, and this led to talk about the good and the bad of adding the full random-access memory system of a CPU. The following is what a developer (who used to work for an IHV as a hardware engineer) said:

This is a complex question (general) that may require a complex answer.
First of all, being able to output vertices directly out of the vertex
shader without having to go through a bypass in the pixel shader doesn't
mean that you will access memory randomly. Generally, you would work on an
input vertex array, do some math on it, and then output to another array.
That said, with future shader models (4.0 and up), it will be possible to
generate new vertices and possibly output them anywhere in memory (with
relative efficiency).
That said, assuming you want to add random accesses to the memory and make
it relatively efficient (emphasis on the "and"), several measures have to be
taken.
GPUs are not very efficient at rendering 1-pixel triangles randomly on the
screen, but CPUs are not much better at writing double words randomly in
memory either (assuming the memory size is big enough to make the
probability of caching low). If you want to improve your behavior in that
respect, again assuming that there is no pattern, you need to increase your
ability to do many random accesses. Modern GPUs are handling those
scenarios (somewhat) by interleaving memory with different memory
controllers, each one able to access different memory chips. The Xbox chip's
128-bit memory interface is interleaved 4 ways (4 controllers), so
statistically you can multiply your random access throughput by a factor of
4. That said, memory has been increasing in frequency largely at the
expense of the burst length and generally larger cost when you have a miss
or a bank conflict (somewhat compensated by the ability to have 2 or more
banks pre-opened at the same time).
For vertices, this technique may not work as well because each vertex, after
transformation, could have a size quite different from the memory
interleaving size, possibly making it useless. That said, if the interleaving
size is big enough to accommodate at least 1 vertex, this technique could
help a lot.
The cost of this is quite significant in terms of transistors and complexity
in the chip. You basically need to implement the equivalent of a high speed
network switch to redirect memory accesses to the right controller, not to
mention that you need more memory controllers.
Generally speaking, the only drawback to being (more) efficient with random
accesses is that it's costing transistors and these transistors are not
going to be spent on other things like more shader units. One of the main
laws of design is that you should do your normal case well and handle your
exception with ok quality which basically means that it depends if your
applications are taking advantage of it. Applications will determine the
right balance for the # of controllers.
He talked about the significant transistor cost involved, and that this cost may be of less value than simply having more shader units. My question is whether the cost (on a GPU) is worth it wrt the thread subject matter.
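
To make the interleaving point in the quoted reply concrete, here is a minimal sketch in Python. It assumes a hypothetical layout with 4 controllers and a 64-byte interleave granularity; the 64-byte figure does not come from the post, and the function names are made up for illustration. It only shows which controllers a single vertex touches depending on its size and alignment.

```python
def controller_for(address, interleave_bytes=64, num_controllers=4):
    """Return the controller that owns this byte address (assumed simple round-robin interleave)."""
    return (address // interleave_bytes) % num_controllers


def controllers_touched(start, size, interleave_bytes=64, num_controllers=4):
    """Return the sorted list of controllers hit when writing `size` bytes at `start`."""
    first_block = start // interleave_bytes
    last_block = (start + size - 1) // interleave_bytes
    return sorted({block % num_controllers for block in range(first_block, last_block + 1)})


# A 32-byte vertex aligned inside one interleave block stays on one controller.
print(controllers_touched(0, 32))
# A 96-byte vertex is larger than the interleave size and straddles two controllers.
print(controllers_touched(0, 96))
# Even a small vertex straddles two controllers if it crosses a block boundary.
print(controllers_touched(48, 32))
```

If the interleave size is at least one vertex and vertices are aligned to it, each write lands on a single controller, which is the case the developer says "could help a lot".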

arjan de lumens
20-Dec-2004, 02:37
Another issue with adding full random memory writes is that you need to impose, for each memory location, a memory access ordering on the writes with respect to both reads and other writes. This adds quite a bit to GPU complexity and may even limit parallelism, as every attempted memory access (both read and write) must be matched against every queued/buffered/cached write before it.
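
A rough software analogy of that matching cost, as a sketch only: the `WriteQueue` class below is hypothetical and models a single in-order queue of buffered writes, not any real GPU structure. Every read has to be compared against the pending writes, newest first, before it is allowed to fall through to memory.

```python
from collections import deque


class WriteQueue:
    """Toy model: buffered writes that every read must be checked against."""

    def __init__(self, memory):
        self.memory = memory        # dict: address -> value (backing store)
        self.pending = deque()      # queued (address, value) writes, oldest first
        self.comparisons = 0        # how many address matches reads have performed

    def write(self, address, value):
        self.pending.append((address, value))

    def read(self, address):
        # Scan newest-to-oldest so the most recent write to the address wins.
        for pending_address, value in reversed(self.pending):
            self.comparisons += 1
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def drain(self):
        # Commit writes in program order so later writes overwrite earlier ones.
        while self.pending:
            address, value = self.pending.popleft()
            self.memory[address] = value


queue = WriteQueue({})
queue.write(8, 1)
queue.write(8, 2)
queue.write(16, 3)
print(queue.read(8), queue.comparisons)
```

The comparison count grows with the number of buffered writes, and in hardware those comparisons have to happen in parallel for every in-flight access, which is where the complexity comes from.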

Humus
20-Dec-2004, 02:59
In general, it's my opinion that this is not the way we should go in the near future, if ever, and that's mostly for the reasons arjan de lumens mentioned. I'm very sceptical of anything that could possibly break parallelism or must dictate a standard for the order in which pixels of a particular primitive must be rendered. And I'm also not so sure this would even be a particularly useful feature.

Dave Baumann
20-Dec-2004, 02:59
The cost of this is quite significant in terms of transistors and complexity in the chip. You basically need to implement the equivalent of a high speed network switch to redirect memory accesses to the right controller, not to mention that you need more memory controllers.

Do you need more memory controllers or a single more coherent one? Also, need this actually be more expensive?

I'm not entirely sure yet, but this topic may need to be revisited later on...

http://images.techonline.com/images/community/content/old/features/review/2_2/dsp_fig18.gif

arjan de lumens
20-Dec-2004, 03:12
The cost of this is quite significant in terms of transistors and complexity in the chip. You basically need to implement the equivalent of a high speed network switch to redirect memory accesses to the right controller, not to mention that you need more memory controllers.

Do you need more memory controllers or a single more coherent one? Also, need this actually be more expensive?

I'm not entirely sure yet, but this topic may need to be revisited later on...
If you have only 1 memory controller, your random write performance will be limited very quickly by the rate at which a single DRAM chip is able to perform page breaks. Assuming you have coherency solved, your random write performance will be roughly proportional to the number of memory controllers multiplied by the number of banks in your DRAM chips.

Dave Baumann
20-Dec-2004, 03:14
Sorry, I should say (internal) "bus" not controller.

Geo
20-Dec-2004, 03:49
I've been talking to a few games developer . . . .

You've seemed awfully active of late Rev, on these theoretical forward-looking issues. Planning to Commit Article?

Geo
20-Dec-2004, 04:07
Where'd you get that diagram, Dave? For some reason I was reminded of token-ring. I'm sure that says more about me, than the idea. . .particularly as I don't think token ring ever got faster than 16mbps. .

Inane_Dork
20-Dec-2004, 04:45
Sounds like a job for HyperThreading! :P

Seriously, it might help. I have a feeling that GPUs will be made to hide even dynamic memory accesses in the future.

arjan de lumens
20-Dec-2004, 05:03
Sounds like a job for HyperThreading! :P

Seriously, it might help. I have a feeling that GPUs will be made to hide even dynamic memory accesses in the future.
Shader units are in a sense already heavily multithreaded and to a large extent already hiding memory access latencies; the problem here is that we need to set up a coherent memory access ordering across multiple threads in addition to just within the individual thread, which is kinda difficult.

Reverend
20-Dec-2004, 05:19
I've been talking to a few games developer . . . .

You've seemed awfully active of late Rev, on these theoretical forward-looking issues. Planning to Commit Article?
What, from someone who isn't in the business of making games? My grouses about many things 3D (wrt to games) have been shot down (mostly by ATI employees here). I'm a nobody to these IHV employees.

We should be talking more about "realistic forward-looking" stuff here... without IHV employees being on the defensive. This isn't Voodooextreme or AnandTech... this is B3D.

The sooner the IHV employees realize this, the better it is. I prefer they STFU sometimes, as that maintains my respect for them on a personal basis.

Geo
20-Dec-2004, 05:37
I've been talking to a few games developer . . . .

You've seemed awfully active of late Rev, on these theoretical forward-looking issues. Planning to Commit Article?
What, from someone who isn't in the business of making games? My grouses about many things 3D (wrt to games) have been shot down (mostly by ATI employees here). I'm a nobody to these IHV employees.

We should be talking more about "realistic forward-looking" stuff here... without IHV employees being on the defensive. This isn't Voodooextreme or AnandTech... this is B3D.

The sooner the IHV employees realize this, the better it is. I prefer they STFU sometimes, as that maintains my respect for them on a personal basis.

Not that there's anything *wrong* with committing Article. I've been known to do it myself from time to time, tho not on anything that would interest anyone here.

And conflict and struggle are *good*, if often uncomfortable. At least short of bloodshed. :) Very little progress in human affairs has been made without it to one degree or another. We seem to be built that way. I bet if you had visibility into the engineering groups at the major IHVs you'd find they have their share of red-faced arm-waving "discussions".

Dave Baumann
20-Dec-2004, 08:57
What, from someone who isn't in the business of making games? My grouses about many things 3D (wrt to games) have been shot down (mostly by ATI employees here). I'm a nobody to these IHV employees.

If those arguments are "shot down" because they are looking at the bigger picture then they are valid points, both from a current and future design perspective. Also, game devs don't necessarily know much about hardware other than what they've been told to expect in the future.

Reverend
20-Dec-2004, 09:27
Of course, that's obvious. Different IHVs have different priorities.

DeanoC
20-Dec-2004, 11:19
Another issue with adding full random memory writes is that you need to impose, for each memory location, a memory access ordering on the writes with respect to both reads and other writes. This adds quite a bit to GPU complexity and may even limit parallelism, as every attempted memory access (both read and write) must be matched against every queued/buffered/cached write before it.

Why do you need memory access ordering? It's a nice feature but not required for some algorithms, an oft-forgotten thing.

Examples are boolean output: clear a section of memory to zero and let the GPU mark words true. If you get two writes to the same location, the order doesn't matter.

Of course it makes life a lot harder... But the gains may be worth it; the option for coherence on a few memory locations would be nice though.
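
The boolean-output case can be sketched in a few lines of Python. The helper names are hypothetical. The point is only that "mark true" writes give the same result in any order, while writes of different values to the same word do not.

```python
from itertools import permutations


def mark_flags(num_words, hit_indices):
    """Clear a block to zero, then mark each hit word true (order-independent)."""
    flags = [0] * num_words
    for index in hit_indices:
        flags[index] = 1
    return flags


def write_values(num_words, writes):
    """Apply (index, value) writes in the given order (last writer wins)."""
    words = [0] * num_words
    for index, value in writes:
        words[index] = value
    return words


hits = [3, 1, 3, 5]  # word 3 is written twice
flag_results = {tuple(mark_flags(8, order)) for order in permutations(hits)}

writes = [(3, 10), (3, 20)]  # two different values aimed at the same word
value_results = {tuple(write_values(8, order)) for order in permutations(writes)}

print(len(flag_results))   # every ordering of the flag writes agrees
print(len(value_results))  # the value writes depend on which one lands last
```

Only the second kind of write needs the ordering machinery discussed earlier in the thread.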

Dave Baumann
20-Dec-2004, 11:26
Where'd you get that diagram, Dave? For some reason I was reminded of token-ring. I'm sure that says more about me, than the idea. . .particularly as I don't think token ring ever got faster than 16mbps. .

The principle is the same, but a token ring is an external network connecting multiple network adapters together. What if that was internal to a chip, running at chip speeds?

nAo
20-Dec-2004, 13:22
The principle is the same, but a token ring is an external network connecting multiple network adapters together. What if that was internal to a chip, running at chip speeds?
like this (http://aiw1.uspto.gov/.aiw?docid=us20040111546ki&PageNum=2&IDKey=9505479 BB393&HomeUrl=http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2%2526Sect2=HITOFF%2526p=1%2526u=% 25252Fnetahtml%25252FPTO%25252Fsearch-bool.html%2526r=5%2526f=G%2526l=50%2526co1=AND%252 6d=PG01%2526s1=hofstee%2526s2=ring%2526OS=hofstee% 252BAND%252Bring%2526RS=hofstee%252BAND%252Bring) or this (http://aiw1.uspto.gov/.aiw?docid=us20040111546ki&PageNum=3&IDKey=9505479 BB393&HomeUrl=http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2%2526Sect2=HITOFF%2526p=1%2526u=% 25252Fnetahtml%25252FPTO%25252Fsearch-bool.html%2526r=5%2526f=G%2526l=50%2526co1=AND%252 6d=PG01%2526s1=hofstee%2526s2=ring%2526OS=hofstee% 252BAND%252Bring%2526RS=hofstee%252BAND%252Bring)? ;)

For an SPU it would be no problem to make full random memory accesses to the local memory. Too bad local memory is a small pool of memory, 128 KB of SRAM per SPU. Anyway, it would be enough for a lot of stuff.
(OK, it's not a GPU, but an SPU can act as a very fast GPU vertex shader engine.)

ciao,
Marco

Geo
20-Dec-2004, 13:38
Where'd you get that diagram, Dave? For some reason I was reminded of token-ring. I'm sure that says more about me, than the idea. . .particularly as I don't think token ring ever got faster than 16mbps. .

The principle is the same, but a token ring is an external network connecting multiple network adapters together. What if that was internal to a chip, running at chip speeds?

It'd have to be faster than the rest of the "chip speeds" to hide the latency of "waiting for your turn". But I could see it handling the ordering issues without breaking parallelism --if the speed was there; otherwise you have a sort of functional break if all your pipes can't run at something near their theoretical for being memory-starved.

Dave Baumann
20-Dec-2004, 14:18
If your "nodes" that most frequently need access to the memory interface "node(s)" (being the ROPs) are fairly close on the ring, then latency may not be much more of an issue in performance terms than it is on current internal memory interfaces; yeah, if you have to traverse the entirety of the ring it'll be slower, but if the parts that require access to external memory interfaces are the furthest around the ring then that won't be as much of an issue. Of course the advantage is that, if architected in such a manner, your vertex shaders and pixel shaders are just nodes on that bus, and it raises the possibility of passing the data back and forth freely between them without having to access external memory at all.
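
As a back-of-the-envelope illustration of the placement argument, here is a small Python sketch that counts hops between two nodes on a ring. The node numbering, the 8-node ring, and the `ring_hops` function are assumptions for illustration; nothing here describes a real chip.

```python
def ring_hops(src, dst, num_nodes, bidirectional=True):
    """Number of hops from node `src` to node `dst` on a ring of `num_nodes` nodes."""
    forward = (dst - src) % num_nodes
    if not bidirectional:
        return forward                       # data can only travel one way round
    return min(forward, num_nodes - forward)  # take the shorter direction


# Hypothetical 8-node ring with the memory interface at node 0.
print(ring_hops(1, 0, 8))                       # a neighbouring node on a two-way ring
print(ring_hops(1, 0, 8, bidirectional=False))  # the same node on a one-way ring
print(ring_hops(4, 0, 8))                       # the node furthest from the memory interface
```

On a one-way ring a node right next to the memory interface can still be the worst case, so where the ROPs sit and which way the ring runs both matter for the latency Geo is worried about.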

andypski
20-Dec-2004, 19:03
My grouses about many things 3D (wrt to games) have been shot down (mostly by ATI employees here). I'm a nobody to these IHV employees. We should be talking more about "realistic forward-looking" stuff here... without IHV employees being on the defensive. This isn't Voodooextreme or AnandTech... this is B3D.
Taking things a bit personally? If your grouses have been shot down, then perhaps some of them deserved to be. No doubt some of them have been perfectly reasonable and certainly they have seemed so to me, but I would think that you should welcome participation in your debates from IHV employees, even if you feel that perhaps your opinions aren't being given enough weight.

I agree that B3D should be a place for debate of reasonable forward-looking stuff. If you want IHV employees to be involved then you need to understand that not everything that you think is realistic coming from a software development standpoint really seems reasonable to us at all in high-performance 3D hardware, whether now or in a forward-looking sense. We will give our opinions on these subjects just as you give out yours, whether we agree or not.

And just because software developers think that something would be really, really nice to have doesn't magically make it realistic, either today or tomorrow. We receive a continual wish list of desirable features that are either totally infeasible, or would involve massive and inappropriate expense to implement in hardware when compared to any advantages accrued.

The sooner the IHV employees realize this, the better it is. I prefer they STFU sometimes, as that maintains my respect for them on a personal basis.
No comment.

Anyway, moving on let's talk about 'realistic forward looking' stuff like random access memory for VPUs.

VPUs are designed to be high-throughput, latency tolerant devices with respect to memory. Provided some coherency in the accesses can be maintained then this is generally possible. If the accesses become incoherent (like a large negative LOD-bias, or no mipmaps when texturing) then performance on all current 3D hardware will tank straight into the floor. Easily demonstrated.

Why is this so much more of a problem for VPUs than CPUs?

Modern CPUs are designed with massive data caches (relative to VPUs) and can, after some startup overhead, start to tolerate pretty much random access into a reasonably large subset of memory provided it fits within these caches. VPUs have small caches, so any randomly-distributed memory accesses into reasonably sized datasets will thrash, and you then have to tolerate the latency of going to main memory on many transactions. This sort of latency simply can't easily be hidden - modern memories are designed for high performance with longish bursts of data, random accesses are a terrible model - not only do you end up limited entirely by the memory performance, but you end up limited to the performance of the memory operating in its least efficient mode. One question you then have to ask is : "If you become entirely memory constrained then would a CPU necessarily be any slower than the VPU when performing this task?"

CPUs are designed to operate well in random-access scenarios. VPUs are not - why try to fit a square peg into a round hole?

Anyway, my first questions to you here are - "What level of random access do you want from the VPU, and what performance target do you want to achieve for it to be useful?"

In essence, what performance hit is reasonable for random memory accesses, and given that it's probably an A OR B scenario then how much of the possible peak performance of the VPU in the 'normal' rendering case are you willing to sacrifice in order to accelerate random memory accesses?

Dio
20-Dec-2004, 23:48
Actually, I would disagree a little with that. VPU's are no worse for random accesses than CPU's are, but the definition of 'random' is different for the two.

There are fundamental limits here. If accesses are truly utterly random, it doesn't matter how much cache or effort you spend in 'intelligent' banking, you'll end up on average with one page break per Nth memory access where N is your number of banks (that is, assuming that the 'cache' is much smaller than the memory size). Your effective burst lengths will be limited to your 'struct size' (if you see what I mean) and/or you may also incur an overfetch cost, if you do not make use of all the data in a cache line. These are impossible to get around. (Usually cache lines and burst lengths are tuned to be reasonably similar, so overfetch and bank switches are the key costs).
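
A toy model of the page-break behaviour, as a sketch under stated assumptions: 4 banks, 2 KB rows, one open row per bank, and rows striped across banks. These numbers and the function names are invented for illustration. The model only counts row misses and ignores the fact that real controllers can overlap page breaks in different banks.

```python
import random


def count_row_misses(addresses, num_banks=4, row_bytes=2048):
    """Count accesses that hit a bank whose currently open row is a different row."""
    open_rows = {}   # bank -> row currently open in that bank
    misses = 0
    for address in addresses:
        row_index = address // row_bytes
        bank = row_index % num_banks      # consecutive rows are striped across banks
        row = row_index // num_banks
        if open_rows.get(bank) != row:
            misses += 1                   # page break: close the old row, open the new one
            open_rows[bank] = row
    return misses


def sequential_addresses(count, stride=4):
    return [i * stride for i in range(count)]


def random_addresses(count, memory_bytes, seed=0):
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    return [rng.randrange(0, memory_bytes, 4) for _ in range(count)]


accesses = 4096
print(count_row_misses(sequential_addresses(accesses)))
print(count_row_misses(random_addresses(accesses, 256 * 1024 * 1024)))
```

With the sequential stream the row misses are a tiny fraction of the accesses; with random addresses spread over a large memory almost every access opens a new row, so the only relief is having several banks (and controllers) working on different page breaks at once.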

In order to improve things, you choose to optimise looking for coherence in your input data and optimise for certain classes of pseudorandom access pattern.

At the moment, VPU's and CPU's assume different kinds of coherence. CPU's use LRU caches, which one could view as assuming temporal coherence. VPU's make less use of these and more of latency compensation 'caches', prefetching, FIFOs and arranging data in memory to avoid bank switches - basically, we assume spatial coherence. Of course, there is some overlap: we have some LRU caches, and CPU's are now doing things like automatic prefetch in order to get spatial coherence on large arrays.
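
The difference between the two kinds of coherence can be seen with a very small LRU cache model. This is a sketch: the 4-line cache, 64-byte lines, and `lru_hit_rate` helper are assumptions chosen to keep the numbers readable, not a description of any real CPU or VPU cache.

```python
from collections import OrderedDict


def lru_hit_rate(addresses, num_lines=4, line_bytes=64):
    """Hit rate of a fully associative LRU cache over a stream of byte addresses."""
    cache = OrderedDict()   # line number -> True, ordered oldest to most recently used
    hits = 0
    for address in addresses:
        line = address // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


reuse_fits = [0, 64, 128, 192] * 10           # temporal reuse, working set fits the cache
reuse_too_big = [0, 64, 128, 192, 256] * 10   # same pattern, one line too many
streaming = [i * 4 for i in range(64)]        # spatial coherence, each word touched once

print(lru_hit_rate(reuse_fits))
print(lru_hit_rate(reuse_too_big))
print(lru_hit_rate(streaming))
```

The working set that is one line too large for the cache gets no hits at all, which is the thrashing case; the streaming pattern does well purely because neighbouring words share a line, which is the kind of coherence a prefetcher or FIFO can exploit without a large cache.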

Neither has any optimisation for overfetch other than assuming that a particular bit of data will all get used before it's discarded (and, given the cache line / burst length relation, it seems unlikely that there is much in this area, at least while we use SDRAM-type technology).

Coherence optimisations are fundamental cost/performance tradeoffs. It would not be hard to add large LRU caches in a VPU, but it will cost area. That cost would be expended if there was a convincing business model behind the decision, but there already is a highly expensive, high performance, general purpose CPU in the device... I mean, what else is it going to do if the VPU ends up doing everything?

andypski
21-Dec-2004, 01:22
Coherence optimisations are fundamental cost/performance tradeoffs. It would not be hard to add large LRU caches in a VPU, but it will cost area. That cost would be expended if there was a convincing business model behind the decision, but there already is a highly expensive, high performance, general purpose CPU in the device... I mean, what else is it going to do if the VPU ends up doing everything?
Interesting - my question would be, do these large LRU caches support the maximum throughput of the VPU device per-clock, in terms of reads and writes, and if so then does it have an effect on the cost of the silicon against, say, the equivalent sized CPU cache? To support full-speed, fully random access on current hardware you could potentially need to read and write multiple pixels in a single clock to unrelated memory locations in the cache. Of course, if the cache is allowed to throttle the performance of the overall system this would seem to be less of a problem, but it introduces a potential bottleneck.

This also seems a rather different case to the CPU cache situation, where typically you would not expect to generate a large number of writes and reads to unrelated locations in the space of a single clock, or even a small number of clocks. This is why my instinct tells me that it's not as simple as slamming a large cache on the back end of a VPU.

My immediate concern would be - do you end up in one or more of these cases:

- The VPU system goes relatively slowly with random access anyway, so it wasn't worth the extra area for the cache, and a CPU ends up just as quick for the algorithm in question.
- The cache slows down the maximum throughput of the device when rendering simple pixels to consecutive memory locations, and thus impacts the basic performance of the device when it's working on its most standard function.
- The cache can't support the full speed transactions of the VPU per-clock and thus is bypassed in some way for simple rendering, effectively becoming a big chunk of wasted area in this case (ie. when rendering typical non random-access applications).

Maybe I'm wrong here and I'm inventing issues where none exist - I must admit to not having thought the problem through in too much detail.

Reverend
21-Dec-2004, 03:46
andypski, don't worry, I don't take these things too personally (this -- 3D, games -- is still a hobby for me). Eric, Jeff and Chris got Christmas greetings from me (sorry, forgot about you! :oops: ) recently. I do however tend to speak what's on my mind without thinking I should be polite.

What irks me about some of the postings by your fellow ATI'ers is that they feel the need to be defensive (the "I feel we design to a sweetspot" or "We believe we did the right engineering decision" or a sarcastic "Anyway, I’m glad you think R300 is a “Fine” part." when I paid one of your hardware products a compliment deemed not high enough).

I talk about 3D, using facts and available hardware. I provide my complaints (and, less frequently, compliments... complaints get things going more, compliments make you complacent, look at NVIDIA :) ). Your fellow ATI'ers should address my (usually valid) complaints and not defend your company. If my wishlist isn't what you think is feasible, let us know by way of facts and your company's design and business policies. Not by defending your company's existing products. I know it's a natural thing to do, but I don't like it.

Not that you or your fellow ATI'ers would care what I like and don't like, of course.

Humus
21-Dec-2004, 06:26
What ATI employee was being defensive here? Alright so it's pretty obvious you mean me, but I don't get WTF you're talking about.

I added a simple opinion on the topic, without trying to defend or even relate to any current, former or future hardware or ATI business, competitors or market situation. It was a comment entirely on the technical merits of such functionality. In what way does IHV employees shutting up raise the standard of a technical discussion? Do you have a personal problem with me or what?

Reverend
21-Dec-2004, 09:01
Wasn't talking about you or your comment in this thread.

[maven]
21-Dec-2004, 09:01
Rev, I think you're getting a bit out of line...

I do however tend to speak what's on my mind without thinking I should be polite.
But then don't expect anyone else to stay polite when you're telling them (indirectly) to STFU.

What irks me about some of the postings by your fellow ATI'ers is that they feel the need to be defensive (the "I feel we design to a sweetspot" or "We believe we did the right engineering decision" or a sarcastic "Anyway, I’m glad you think R300 is a “Fine” part." when I paid one of your hardware products a compliment deemed not high enough).
I think you're mistaking simple, factual explanations for defensiveness.

Your fellow ATI'ers should address my (usually valid) complaints and not defend your company.
Disagree.

Reverend
21-Dec-2004, 09:13
But then don't expect anyone else to stay polite when you're telling them (indirectly) to STFU.
Hey, that's fine with me!

I think you're mistaking simple, factual explanations for defensiveness.
Oh, I think the ATI'ers provided good explanations for why they disagree with some of the things I wish for. For example, everyone knows my arguments for FP32. The ATI'ers have provided good explanations why this wasn't good for their current (and perhaps near-future) products for various valid reasons. But they spoil it (for me, and me only) with persistent "last comments" (that have nothing to do with the technical or business decisions) by defending their company. Now, they're allowed to do that here of course but I tend to dislike company-defending comments in a technical discussion. For example, I really enjoyed the debate I had with sireric in that DX9-vs-IEEE32-regarding-FP32 thread... until he threw in his marketing and defend-my-product/company comments (although I should blame myself for using the R300 as the basis for my comments). Again, they can continue to defend their company and/or products here but it's not something I care for. I treasure their knowledge (and would like to thank them for providing valuable insights into some of the things that happen at ATI), I just don't like marketing pieces. However, just as they're allowed to do so, I am allowed to voice my (very personal) dislike for such practice.

Disagree.
That's fine and okay (although I believe every single argument I've made for FP32 -- which is usually the topic of arguments between me and the ATI'ers -- is factually correct). That's what makes discussions lively.

And to address your first comment last :

Rev, I think you're getting a bit out of line...
I believe I am although I do not feel I need to treat IHVs participating here any differently nor with greater respect than Tom, Dick and Harry, and that's why I'll STFU myself about this as it's just my opinion. Later on, I'll PM the ATI'ers to clarify my position.

Now, let's get back to being on-topic or this thread will be locked.

Dio
21-Dec-2004, 09:35
My immediate concern would be - do you end up in one or more of these cases:

- The VPU system goes relatively slowly with random access anyway, so it wasn't worth the extra area for the cache, and a CPU ends up just as quick for the algorithm in question.
- The cache slows down the maximum throughput of the device when rendering simple pixels to consecutive memory locations, and thus impacts the basic performance of the device when it's working on its most standard function.
- The cache can't support the full speed transactions of the VPU per-clock and thus is bypassed in some way for simple rendering, effectively becoming a big chunk of wasted area in this case (ie. when rendering typical non random-access applications).
I didn't think it through in that kind of detail either :D

I'd have said ordering and coherence might be a bigger issue than the second of those. The first I already covered. The third is simply a cost/benefit issue, if there are scenarios where it helps significantly then the cost might be worthwhile even if it isn't worth it for 'rendering' cases.

Dave Baumann
21-Dec-2004, 09:36
For example, I really enjoyed the debate I had with sireric in that DX9-vs-IEEE32-regarding-FP32 thread... until he threw in his marketing and defend-my-product/company comments. Again, they can continue to defend their company and/or products here but it's not something I care for. I treasure their knowledge (and would like to thank them for providing valuable insights into some of the things that happen at ATI), I just don't like marketing pieces.

What marketing pieces?

nelg
21-Dec-2004, 16:41
What marketing pieces?
You know how those ATI guys bedazzle us with technical mumbo jumbo and then slip in “ Act now and we will include free shipping. Quantities are limited, operators are standing by." Sleazy bastards. :lol:

Reverend
22-Dec-2004, 06:27
For example, I really enjoyed the debate I had with sireric in that DX9-vs-IEEE32-regarding-FP32 thread... until he threw in his marketing and defend-my-product/company comments. Again, they can continue to defend their company and/or products here but it's not something I care for. I treasure their knowledge (and would like to thank them for providing valuable insights into some of the things that happen at ATI), I just don't like marketing pieces.

What marketing pieces?
I agree they're not immediately obvious.