Adding full random memory access of a CPU

Discussion in 'General 3D Technology' started by Reverend, Dec 20, 2004.

  1. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    I've been talking to a few game developers about writeback from vertex shaders, and this led to talk about the good and the bad of adding the full random memory access of a CPU to a GPU. The following is what a developer (who used to work for an IHV as a hardware engineer) said:

    He talked about the significant transistor cost involved, and suggested that those transistors might be of more value spent on additional shader units instead. My question is whether the cost (on a GPU) is worth it wrt the thread subject matter.
     
  2. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Another issue with adding full random memory writes is that you need to impose, for each memory location, an ordering on the writes with respect to both reads and other writes. This adds quite a bit to GPU complexity and may even limit parallelism, as every attempted memory access (both read and write) must be matched against every queued/buffered/cached write before it.
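
    (A minimal illustration of the hazard described above, sketched in modern CUDA-style code; the kernel and buffer names are hypothetical and not from the thread. Without an enforced ordering, the value left at a location touched by two threads depends on which write the memory system retires last:)

    ```cuda
    // Sketch of a write/write and read/write hazard with fully random writes.
    // Two threads may map to the same output address; with no ordering
    // guarantee, out[] ends up holding whichever write lands last.
    __global__ void scatter(const int* __restrict__ values,
                            const int* __restrict__ target,  // arbitrary, data-dependent indices
                            int* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Random write: the hardware cannot know at issue time whether
        // another thread writes or reads the same address.
        out[target[i]] = values[i];

        // A subsequent read of a possibly-aliased address must be checked
        // against every write still queued/buffered/cached ahead of it.
        int readback = out[target[(i + 1) % n]];
        (void)readback;
    }
    ```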
     
  3. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    In general, it's my opinion that this is not the way we should go in the near future, if ever, and that's mostly for the reasons arjan de lumens mentioned. I'm very sceptical of anything that could possibly break parallelism or dictate a standard order in which the pixels of a particular primitive must be rendered. And I'm also not so sure this would even be a particularly useful feature.
     
  4. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Do you need more memory controllers, or a single more coherent one? Also, need this actually be more expensive?

    I'm not entirely sure yet, but this topic may need to be revisited later on...

    [image: diagram no longer available]
     
  5. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    If you have only 1 memory controller, your random write performance will be limited very quickly by the rate at which a single DRAM chip is able to perform page breaks. Assuming you have coherency solved, your random write performance will be roughly proportional to the number of memory controllers multiplied by the number of banks in your DRAM chips.
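
    (For a rough, purely illustrative sense of scale, with assumed numbers that are not from the thread: a DRAM random-row cycle time of around 60 ns lets a single bank sustain on the order of 16-17 million page-breaking writes per second; four controllers each driving chips with four independent banks would scale that to very roughly 250 million random writes per second, which is still tiny next to the multi-gigapixel fill rates GPUs of this era quote for coherent traffic.)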
     
  6. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Sorry, I should say (internal) "bus" not controller.
     
  7. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    You've seemed awfully active of late Rev, on these theoretical forward-looking issues. Planning to Commit Article?
     
  8. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    Where'd you get that diagram, Dave? For some reason I was reminded of token ring. I'm sure that says more about me than the idea... particularly as I don't think token ring ever got faster than 16 Mbps.
     
  9. Inane_Dork

    Inane_Dork Rebmem Roines
    Veteran

    Joined:
    Sep 14, 2004
    Messages:
    1,987
    Likes Received:
    46
    Sounds like a job for HyperThreading! :p

    Seriously, it might help. I have a feeling that GPUs will be made to hide even dynamic memory accesses in the future.
     
  10. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Shader units are in a sense already heavily multithreaded and to a large extent already hide memory access latencies; the problem here is that we need to set up a coherent memory access ordering across multiple threads, in addition to just within the individual thread, which is kinda difficult.
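
    (In modern terms, this cross-thread ordering is what atomics and memory fences buy you on today's GPUs; a minimal CUDA-style sketch with hypothetical names, showing the extra machinery needed before one thread's write can safely be consumed by another:)

    ```cuda
    // Within one thread, loads and stores to data[] appear in program order.
    // Across threads, nothing orders them unless we pay for it explicitly.
    __global__ void publish(int* data, int* flag)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            data[0] = 42;           // plain store: may become visible to others late
            __threadfence();        // order the store before the flag, device-wide
            atomicExch(flag, 1);    // threads that observe flag == 1 may now read data[0]
        }
    }
    ```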
     
  11. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    What, from someone who isn't in the business of making games? My grouses about many things 3D (wrt games) have been shot down (mostly by ATI employees here). I'm a nobody to these IHV employees.

    We should be talking more about "realistic forward-looking" stuff here... without IHV employees being on the defensive. This isn't Voodooextreme or AnandTech... this is B3D.

    The sooner the IHV employees realize this, the better it is. I prefer they STFU sometimes, as that maintains my respect for them on a personal basis.
     
  12. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    Not that there's anything *wrong* with committing Article. I've been known to do it myself from time to time, tho not on anything that would interest anyone here.

    And conflict and struggle are *good*, if often uncomfortable. At least short of bloodshed. :) Very little progress in human affairs has been made without it to one degree or another. We seem to be built that way. I bet if you had visibility into the engineering groups at the major IHVs you'd find they have their share of red-faced arm-waving "discussions".
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    If those arguments are "shot down" because the counterarguments look at the bigger picture, then they are valid points, both from a current and a future design perspective. Also, game devs don't necessarily know much about hardware other than what they've been told to expect in the future.
     
  14. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    Of course, that's obvious. Different IHVs have different priorities.
     
  15. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Why do you need memory access ordering? It's a nice feature, but it's not required for some algorithms, an oft-forgotten point.

    An example is boolean output: clear a section of memory to zero and let the GPU mark words true. If you get two writes to the same location, the order doesn't matter.

    Of course it makes life a lot harder... but the gains may be worth it. The option of coherence on a few memory locations would be nice, though.
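
    (A minimal sketch of this kind of order-independent write, in modern CUDA-style code with hypothetical names: every colliding write stores the same value, so the result is identical no matter which one retires last:)

    ```cuda
    // Order-independent "boolean output": the buffer is pre-cleared to 0 and
    // threads only ever write 1. Colliding writes are idempotent, so no
    // ordering between them needs to be enforced.
    __global__ void mark_visible(const int* __restrict__ ids,
                                 int* visible, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            visible[ids[i]] = 1;   // any write order yields the same memory image
    }
    ```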
     
  16. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    The principle is the same, but a token ring is an external network connecting multiple network adapters together. What if that were internal to a chip, running at chip speeds?
     
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    like this or this? ;)

    For an SPU it would be no problem to make fully random memory accesses to the local memory. Too bad the local memory is a small pool, 128 KB of SRAM per SPU. Anyway, it would be enough for a lot of stuff...
    (OK, it's not a GPU, but an SPU can act as a very fast GPU vertex shader engine)

    ciao,
    Marco
     
  18. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    It'd have to be faster than the rest of the "chip speeds" to hide the latency of "waiting for your turn". But I could see it handling the ordering issues without breaking parallelism, if the speed was there; otherwise you have a sort of functional break if all your pipes can't run at something near their theoretical peak because they're memory-starved.
     
  19. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    If the "nodes" that most frequently need access to the memory interface "node(s)" (being the ROPs) are fairly close on the ring, then latency may not be much more of an issue for performance than it is on current internal memory interfaces; yeah, if you have to traverse the entirety of the ring it'll be slower, but as long as the parts that most frequently require access to the external memory interfaces aren't the furthest around the ring, that won't be as much of an issue. Of course the advantage is that, if architected in such a manner, your vertex shaders and pixel shaders are just nodes on that bus, and it raises the possibility of passing data back and forth freely between them without having to access external memory at all.
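
    (For a rough, purely illustrative sense of scale, using assumed numbers rather than anything from the thread: on a 16-node ring with a couple of clock cycles per hop, a worst-case traversal is on the order of 30 clocks, which is small next to the hundreds of clocks a round trip to external DRAM costs.)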
     
  20. andypski

    Regular

    Joined:
    May 20, 2002
    Messages:
    584
    Likes Received:
    28
    Location:
    Santa Clara
    Taking things a bit personally? If your grouses have been shot down, then perhaps some of them deserved to be. No doubt some of them have been perfectly reasonable and certainly they have seemed so to me, but I would think that you should welcome participation in your debates from IHV employees, even if you feel that perhaps your opinions aren't being given enough weight.

    I agree that B3D should be a place for debate of reasonable forward-looking stuff. If you want IHV employees to be involved then you need to understand that not everything that you think is realistic coming from a software development standpoint really seems reasonable to us at all in high-performance 3D hardware, whether now or in a forward-looking sense. We will give our opinions on these subjects just as you give out yours, whether we agree or not.

    And just because software developers think that something would be really, really nice to have doesn't magically make it realistic, either today or tomorrow. We receive a continual wish list of desirable features that are either totally infeasible, or would involve massive and inappropriate expense to implement in hardware when compared to any advantages accrued.

    No comment.

    Anyway, moving on let's talk about 'realistic forward looking' stuff like random access memory for VPUs.

    VPUs are designed to be high-throughput, latency-tolerant devices with respect to memory. Provided some coherency in the accesses can be maintained, this is generally possible. If the accesses become incoherent (as with a large negative LOD bias, or texturing without mipmaps) then performance on all current 3D hardware will tank straight into the floor. Easily demonstrated.

    Why is this so much more of a problem for VPUs than CPUs?

    Modern CPUs are designed with massive data caches (relative to VPUs) and can, after some startup overhead, tolerate pretty much random access into a reasonably large subset of memory, provided it fits within those caches. VPUs have small caches, so randomly-distributed memory accesses into reasonably sized datasets will thrash, and you then have to tolerate the latency of going to main memory on many transactions. This sort of latency simply can't easily be hidden: modern memories are designed for high performance with longish bursts of data, and random accesses are a terrible model for them. Not only do you end up limited entirely by the memory performance, you end up limited to the performance of the memory operating in its least efficient mode. One question you then have to ask is: "If you become entirely memory constrained, would a CPU necessarily be any slower than the VPU at performing this task?"

    CPUs are designed to operate well in random-access scenarios. VPUs are not - why try to fit a square peg into a round hole?

    Anyway, my first questions to you here are - "What level of random access do you want from the VPU, and what performance target do you want to achieve for it to be useful?"

    In essence, what performance hit is reasonable for random memory accesses, and given that it's probably an A OR B scenario then how much of the possible peak performance of the VPU in the 'normal' rendering case are you willing to sacrifice in order to accelerate random memory accesses?
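
    (A minimal sketch of the coherent-vs-incoherent access patterns contrasted above, in modern CUDA-style code with hypothetical names; the second kernel's data-dependent indexing defeats both caching and burst transfers in exactly the way described:)

    ```cuda
    // Coherent access: neighbouring threads touch neighbouring addresses,
    // so requests coalesce into the long bursts DRAM handles efficiently.
    __global__ void copy_coherent(const float* __restrict__ src,
                                  float* dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Incoherent access: the index comes from data, so neighbouring threads
    // hit unrelated DRAM pages. With small caches this thrashes, and the
    // memory runs in its least efficient (page-breaking) mode.
    __global__ void copy_scattered(const float* __restrict__ src,
                                   const int* __restrict__ idx,
                                   float* dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[idx[i]];
    }
    ```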
     