AMD Architecture Discussion

Discussion in 'Architecture and Products' started by Anarchist4000, Dec 19, 2016.

  1. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    Oops, typo. It should be: which ensures overlapping pixels get packed separately into a different wavefront with a larger ID (that is, in submission order).

    I am not quite sure whether Intel GenX works this way, but it was something I read in this paper, which is where the idea was derived from.
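
    As a rough illustration of that packing rule (a minimal sketch, not Intel's or AMD's actual implementation): fragments are consumed in submission order, and any fragment whose pixel is already covered by the wavefront being assembled is pushed into a new wavefront with a strictly larger ID. The Fragment/Wavefront types and the 64-fragment capacity are my own assumptions.

    Code:
    // Hypothetical sketch of the packing rule described above: fragments are
    // consumed in primitive submission order, and a fragment whose pixel is
    // already present in the wavefront being assembled is pushed into a new
    // wavefront with a larger ID. Names (Fragment, Wavefront) are made up.
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    struct Fragment { uint32_t x, y; };                 // screen position
    struct Wavefront { uint32_t id; std::vector<Fragment> frags; };

    std::vector<Wavefront> pack(const std::vector<Fragment>& in_submission_order) {
        std::vector<Wavefront> out;
        std::unordered_set<uint64_t> covered;           // pixels in current wavefront
        uint32_t next_id = 0;
        out.push_back({next_id++, {}});

        for (const Fragment& f : in_submission_order) {
            uint64_t key = (uint64_t(f.y) << 32) | f.x;
            bool overlaps = !covered.insert(key).second;
            bool full = out.back().frags.size() == 64;  // one fragment per lane
            if (overlaps || full) {                     // overlap -> strictly larger ID
                out.push_back({next_id++, {}});
                covered.clear();
                covered.insert(key);
            }
            out.back().frags.push_back(f);
        }
        return out;
    }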
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    It might be in theory, but I don't recall it being done this way for GCN. Some of the proposed methods might locally batch or defer, but this provides no global guarantee.

    Did the Realworldtech tests of deferred tiling show a Z-order handling of screen space? The testing showed a more linear scan. The tiles are already configured to match the stride length, and the ROPs are set up to match up with the physical channels. Is there a reference to a space filling curve being used for the ROPs?

    I'm assuming you mean with the proposed new scheme?
    The current scheme is that no such reordering is done and the results can be unstable and inconsistent.
     
    #42 3dilettante, Dec 27, 2016
    Last edited: Dec 27, 2016
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Currently they have LDS and GDS, which are roughly equivalent hardware, but with GDS having additional sync capabilities. I don't see the LDS moving away as a problem. I've been envisioning them more as the memory controllers, part of the export pipeline when joined with other CUs, and synchronization mechanisms, with a significant amount of their workload moved to that proposed scalar. With a full crossbar it would be reasonably efficient at repacking data and some of the interpolation tasks. By my thinking it should work well for de/compression as well without blocking SIMD execution.

    Along the lines of the current scalar unit being more flexible, it could even do some of the cache work if a thread was designed around alternating scalar and vector pathways. SIMDs for heavy lifting and scalar doing control flow and data sharing. An extension of the current capabilities with more robust hardware and SGPRs moved into VGPRs. Along the lines of pushing a workgroup onto a SIMD. Certainly not required, but an option to explore. Avoids register limitations, but some vector space is likely wasted.

    In theory it also has 16 memory channels which might line up well with that 16 CU group organization. Could be a baseline to extrapolate the organization as only the HBM designs had that many channels.

    Not directly, but I'm imagining a significant rework of the export pipeline along with ROP and LDS/GDS behavior with the mesh. At least with Navi, although I guess Vega could have it. OIT would be relevant when all the transparent samples got read back in and processed. It would be up to the programmer whether they wanted unlimited samples or some sort of compression like PixelSync. My theory might mean breaking the submission order in favor of binning and synchronization. ROVs with opaque samples, while not ideal from a performance standpoint, wouldn't necessarily care about the submission order. It should address certain scaling concerns, although you may need to be careful reading back samples. Tiled rasterization, where all geometry is known along with unknown samples, could help here.

    I'm still working this idea out myself. My definition of the second stage was more along the lines of aggregating results at a macro level. One ROP per memory channel arbitrating exports from each CU or cluster. It might be more than just a ROP along the lines of GDS/LDS with the heavy lifting having occurred in the first stage. As I mentioned before, I was expecting some of these ideas to be related to Navi and the "scalability" on the roadmap.

    The QR wouldn't be conditional, but an option for efficiently coalescing results within multiple CUs prior to export. Possibly part of the solution for that first stage. Partitioning cache resources, possibly in different locations, to a task allocated to a group of CUs. It would make more sense for a multi-chip solution, and with the paper presented would seemingly be a better match to that timeframe. It's definitely optimistic speculation as I wasn't expecting some of this to occur with Vega. It doesn't seem out of the realm of possibility though, just need to rethink a lot of conventional wisdom on past designs.

    Not all of that 2048 needs pads though. The vast majority would likely exist within the die, like most FPGAs, as that is where most nodes on the mesh should reside, with selective grouping reducing the number further. The portion through the interposer would be reserved for multi-chip solutions, including the Zen APU. The interconnect I'm envisioning would be a full mesh, say 17 nodes all linked together through variably sized channels. All global channels would total 2048 (probably another channel or two for IO and communication), with local channels likely increasing that number within a fixed organization. For a point-to-point route all but one link within a channel would be disabled; all links would be enabled if broadcasting or requiring a shared bus for whatever reason. Four nodes for example might share a channel for some purpose. The paper alluded to this as "indirect links on the interposer".
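
    A toy model of that link-enable idea, purely to make the subtractive routing concrete (the 17-node count comes from the description above; the bitmask representation is my own assumption):

    Code:
    // Rough sketch of the link-enable idea above: each channel out of a node is a
    // set of links, one per peer, and routing is subtractive -- disable everything
    // except the destinations you want. The 17 nodes come from the post; the
    // enable-mask layout is made up for illustration.
    #include <cstdint>
    #include <cstdio>

    constexpr int kNodes = 17;

    struct Channel {
        uint32_t enabled = (1u << kNodes) - 1;   // all links up after reset

        void point_to_point(int dst)   { enabled = 1u << dst; }          // one link left
        void broadcast()               { enabled = (1u << kNodes) - 1; } // all links
        void shared_bus(uint32_t mask) { enabled = mask; }               // e.g. 4 nodes
        bool reaches(int dst) const    { return enabled & (1u << dst); }
    };

    int main() {
        Channel ch;
        ch.point_to_point(5);
        std::printf("reaches node 5: %d, node 6: %d\n", ch.reaches(5), ch.reaches(6));
        ch.shared_bus(0b1111);        // nodes 0..3 share the channel
        std::printf("shared bus reaches node 2: %d\n", ch.reaches(2));
    }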

    Two stage arbitration. LDS for local export and arbitration within a CU cluster, GDS (one per memory channel) arbitrating over global memory and IO resources from each cluster. PixelSync would only have to stall if the buffer was filled and blending/compression required. It would simply append the result, possibly opaque results, until read back and optimized. The ID buffer may stem from that as well if using it to bin geometry.

    LDS should also be closely tied to a GDS unit and hopefully prioritize traffic to memory in order to spill cache. Allow the LDS unit to reasonably efficiently coalesce results that won't fit into L2 prior to writing out results through GDS that may have to arbitrate through a large number of channels and stalls.
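
    To make the two-stage idea concrete, a hedged sketch of what such an export path might look like in software terms; the ClusterArbiter/GlobalArbiter names, the address-to-channel hash, and the 16-channel count are all assumptions for illustration, not anything AMD has described:

    Code:
    // Toy model of the two-stage arbitration being speculated about: exports from
    // CUs in a cluster are coalesced by a local (LDS-like) arbiter, which then
    // forwards them to one global (GDS-like) arbiter per memory channel. All
    // structure and names here are invented for illustration.
    #include <cstdint>
    #include <queue>
    #include <vector>

    struct Export { uint64_t addr; uint64_t data; };

    struct GlobalArbiter {                       // one per memory channel (GDS-like)
        std::queue<Export> pending;
        void submit(const Export& e) { pending.push(e); }
    };

    struct ClusterArbiter {                      // one per CU cluster (LDS-like)
        std::vector<Export> local;
        void export_from_cu(const Export& e) { local.push_back(e); }

        // Drain coalesced exports to the global arbiters, selected by address.
        void drain(std::vector<GlobalArbiter>& channels) {
            for (const Export& e : local) {
                size_t ch = (e.addr >> 8) % channels.size();  // fake channel hash
                channels[ch].submit(e);
            }
            local.clear();
        }
    };

    int main() {
        std::vector<GlobalArbiter> channels(16);  // 16 channels, per the post
        ClusterArbiter cluster;
        cluster.export_from_cu({0x1000, 42});
        cluster.export_from_cu({0x2100, 43});
        cluster.drain(channels);
    }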

    With the ID buffer added to PS4 and "future AMD architectures" some reordering seems likely. Tiling or binning of geometry makes a lot of sense as demonstrated with Nvidia's solution. With the ROV capabilities described above and a theorized scalar that could efficiently sort the results it makes even more sense. Scaling should be easier if order is less of a concern.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    I have a more fundamental question as to why a higher-level consideration like API-level fragment submission order would be brought down to the level of memory consistency. OIT and related solutions are concerned with indeterminism between primitives due to data races, which individual threads could still run into even if they were operating under the strongest memory model.

    GCN has an option for relaxing ordering requirements, and it already does so when it can. This is specifically for scenarios where it is known that there will be no overlap, or for something like a depth-only pass where order doesn't matter.

    Without the additional modifier of "deferred", the tiling does not change from the status quo. AMD's GPUs already tile their graphics hardware for the purposes of spatial locality.
    AMD has proposed some form of binning or deferment for a limited number of primitives; however, this is not a guarantee, since the limits of the bin can be exceeded, and special conditions or operations can force the GPU to fall back to immediate rendering.
    One such scenario may be when depth cannot cull output in the case of transparency.

    What is the more efficient method? With a tiled setup like the current raster output, the CUs are either writing to different locations, which QR doesn't affect, or they overlap and compromise the assumption QR makes that data is rarely read after writing.

    Pad area is baked in. The large via and pad make the area around them unable to be patterned for other uses, and around that area is the silicon hard-wired to interface with the pad. If those pads are not needed, the matching portion of the interface going off-die and related silicon need to be removed at the design level.

    Do you mean an FPGA like the models from Xilinx that have slices put on top of an interposer?
    If not, then there's no pad space to worry about because it's not trying to go off-die.
    If yes, then the die area used by the interface is there, and the hope would be that it gets used.

    The "one per memory channel" arbiter right now is the export/GDS path already as far as the GCN ISA is concerned.
    Using the LDS in this manner points to further changes, in light of the risk of forcing stalls simply due to being unable to allocate LDS storage and because the LDS isn't supposed to span different workgroups--which the export/GDS path does.


    The ID buffer as described is a write path that works in parallel with depth.
    It's not clear in this case what reordering is needed if it works in concert with depth.
    The ability to track the order and tag which primitive covers a pixel can be associated with some kind of binning, although in the AMD proposals nearest to the ID buffer that mention something like this, pixels go into a wavefront for a given primitive after culling based on opaque geometry coverage. I'm not sure I can safely interpret whether it can handle transparent outputs that don't fall behind the nearest opaque sample.
     
    sebbbi likes this.
  5. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    The one tested by RWT was... Radeon HD 6670, which was pretty old.

    Here is what GCN looks like.
    https://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/


    This is quite surprising if it is true. I read long ago that alpha blending is applied in primitive submission order; that dependence on order is also why OIT is considered order independent by contrast.
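
    For reference, a small worked example of why submission order matters for blending: the standard "over" operator is not commutative, so two half-transparent fragments composited in opposite orders give different colors.

    Code:
    // Quick demonstration that standard "over" alpha blending is order dependent:
    // blending red-over-green does not equal green-over-red, which is why the
    // hardware applies blends in primitive submission order (and why OIT exists).
    #include <cstdio>

    struct RGBA { float r, g, b, a; };

    // dst' = src * src.a + dst * (1 - src.a), the usual non-premultiplied "over".
    RGBA over(RGBA src, RGBA dst) {
        float ia = 1.0f - src.a;
        return { src.r * src.a + dst.r * ia,
                 src.g * src.a + dst.g * ia,
                 src.b * src.a + dst.b * ia,
                 src.a         + dst.a * ia };
    }

    int main() {
        RGBA red   {1, 0, 0, 0.5f};
        RGBA green {0, 1, 0, 0.5f};
        RGBA bg    {0, 0, 0, 1.0f};

        RGBA a = over(red, over(green, bg));   // green first, then red
        RGBA b = over(green, over(red, bg));   // red first, then green
        std::printf("green-then-red: %.2f %.2f %.2f\n", a.r, a.g, a.b);
        std::printf("red-then-green: %.2f %.2f %.2f\n", b.r, b.g, b.b);
    }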
     
  6. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Simplifying the submission model and some scaling considerations. Submitting fragments from two overlapping draws simultaneously. Cases where races are more likely to be encountered or certain restrictions relaxed. The solution could very well be something we haven't seen or is years out. Some async behaviors could also be problematic, so I'm erring on the side of caution here. I'm also partial to using the theorized scalars for geometry processing and tessellation, which changes the requirements a bit. Fiji had 4 geometry processors; my design in theory could have ~256 using the scalars.

    For example collect 16 samples for a single pixel. The ROP just collects the samples so order is irrelevant. Then have a programmable stage invoked to consolidate those results either through culling, compression, or streaming. Let the programmer decide what gets kept. Then start binning again. Once the shading finishes, finalize all the bins with a shader pass that makes sense for the programmer. Culling opaque geometry, highly transparent multiple samples, or whatever method suits the application best prior to fragment shading, postprocessing, or timewarp. That could be deferred, immediate or a hybrid like Doom. Might not be as efficient as the current design, but it's programmable along with the capabilities that come with it. The timewarp especially may benefit from overlapping opaque samples.
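
    A minimal sketch of that "collect now, resolve later" flow, assuming a 16-entry per-pixel bin and a depth-sorted front-to-back composite as one possible programmer-supplied resolve; PixelBin, Sample, and the resolve policy are illustrative assumptions, not anything AMD has described:

    Code:
    // Sketch of the "collect now, resolve later" flow described above: a fixed
    // append stage stores up to 16 samples per pixel in arrival order, and a
    // programmable resolve pass decides how to consolidate them (here: sort by
    // depth and composite front-to-back). Everything here is illustrative only.
    #include <algorithm>
    #include <array>
    #include <cstdint>

    struct Sample { float depth; float rgba[4]; };

    struct PixelBin {
        std::array<Sample, 16> samples{};
        uint32_t count = 0;

        bool append(const Sample& s) {                   // order irrelevant at this stage
            if (count == samples.size()) return false;   // bin full -> caller resolves
            samples[count++] = s;
            return true;
        }
    };

    // One possible programmer-supplied resolve: sort by depth, composite, reset.
    void resolve(PixelBin& bin, float out_rgba[4]) {
        std::sort(bin.samples.begin(), bin.samples.begin() + bin.count,
                  [](const Sample& a, const Sample& b) { return a.depth < b.depth; });
        float acc[4] = {0, 0, 0, 0};
        float transmittance = 1.0f;
        for (uint32_t i = 0; i < bin.count; ++i) {        // front-to-back compositing
            const Sample& s = bin.samples[i];
            for (int c = 0; c < 3; ++c) acc[c] += transmittance * s.rgba[3] * s.rgba[c];
            transmittance *= 1.0f - s.rgba[3];
        }
        acc[3] = 1.0f - transmittance;
        for (int c = 0; c < 4; ++c) out_rgba[c] = acc[c];
        bin.count = 0;                                     // start binning again
    }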

    What if the number of primitives wasn't limited? Close proximity to a CPU, dedicated cache, or some compression (on IDs) could relax those limits considerably. Better approximation of tile size could help as well. Moving away from the fixed function primitive engines with a significantly larger bin could make it reasonable. This would be a significant evolution on the current method, in large part because of reorganizing the ROP design.
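
    For illustration, a minimal bounding-box tile binner with per-tile lists that grow without a fixed primitive cap, which is roughly the "unlimited bin" case above; the 32-pixel tile and 1024x1024 screen are arbitrary assumptions:

    Code:
    // Minimal tile binner along the lines discussed: each primitive's screen-space
    // bounding box is recorded in every tile it touches, and per-tile lists grow
    // without a fixed primitive cap. Tile size and screen size are assumptions.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    constexpr int kTile = 32, kScreenW = 1024, kScreenH = 1024;
    constexpr int kTilesX = kScreenW / kTile, kTilesY = kScreenH / kTile;

    struct Tri { float x[3], y[3]; };

    void bin(const std::vector<Tri>& tris,
             std::vector<std::vector<uint32_t>>& tiles) {   // tile -> primitive IDs
        tiles.assign(kTilesX * kTilesY, {});
        for (uint32_t id = 0; id < tris.size(); ++id) {
            const Tri& t = tris[id];
            float minx = std::min({t.x[0], t.x[1], t.x[2]});
            float maxx = std::max({t.x[0], t.x[1], t.x[2]});
            float miny = std::min({t.y[0], t.y[1], t.y[2]});
            float maxy = std::max({t.y[0], t.y[1], t.y[2]});
            int tx0 = std::clamp(int(minx) / kTile, 0, kTilesX - 1);
            int tx1 = std::clamp(int(maxx) / kTile, 0, kTilesX - 1);
            int ty0 = std::clamp(int(miny) / kTile, 0, kTilesY - 1);
            int ty1 = std::clamp(int(maxy) / kTile, 0, kTilesY - 1);
            for (int ty = ty0; ty <= ty1; ++ty)
                for (int tx = tx0; tx <= tx1; ++tx)
                    tiles[ty * kTilesX + tx].push_back(id);  // submission order kept
        }
    }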

    The CUs "should" be writing to different locations. There could be some corner cases.

    Yes and no to the Xilinx model, as this is a hybrid approach. For an MCM there will be some portion going off die, either to the CPU, another GPU, an FPGA, etc. All FPGAs I've encountered use a subtractive method for their interconnect: start with one-to-all connectivity, then disable links at a specific node to create the desired connections. In effect you gate off parts of the array to get the desired result. For example, take a 32-bit data path, connect it to 10 separate 32-bit channels simultaneously, then gate off 9 of them to form a direct connection between points. It's a bit more complicated than that, as they can generally route around conflicts by bridging at various points.

    This would be a significant change, one that I'm not currently expecting, but may be possible. Given increased data sharing this should reduce the burden on GDS for tasks falling within a single workgroup. Doesn't currently work this way, but with increasingly larger chips it should avoid some of the contention as the cache gets segmented and specialized towards local units. Along the lines of increasing cache levels with GDS becoming L3, LDS to L2, and actual local possibly rolled into a scalar.

    Other than a bin for IDs to index I'd agree the capabilities are a bit unclear. How they choose to handle it will be interesting. There are also motion vectors for the spacewarp techniques in VR. Even overlapping opaque geometry might be worth retaining along with the transparency for those techniques. Probably a handful of papers that could be written on techniques for that subject.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    That's tagging the execution pattern, but it's not necessarily going to show which regions of the 2D screen belong to a given rasterizer, which should be a static assignment. It gives a 512x32 scan band, but indicates that wavefront issue follows Z order.
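
    For reference, the usual Morton (Z-order) index works by interleaving the bits of the tile x/y coordinates, which is the kind of pattern that wavefront issue order appears to follow; this is a generic sketch, not GCN's actual hardware mapping.

    Code:
    // Simple Morton (Z-order) index: interleave the bits of the tile x/y
    // coordinates so that consecutive indices walk screen space in a Z pattern.
    #include <cstdint>
    #include <cstdio>

    static uint32_t spread_bits(uint32_t v) {      // 16-bit input -> bits spread apart
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    uint32_t morton2d(uint32_t x, uint32_t y) {
        return spread_bits(x) | (spread_bits(y) << 1);
    }

    int main() {
        for (uint32_t y = 0; y < 4; ++y) {
            for (uint32_t x = 0; x < 4; ++x) std::printf("%2u ", morton2d(x, y));
            std::printf("\n");
        }   // prints 0 1 4 5 / 2 3 6 7 / 8 9 12 13 / 10 11 14 15
    }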

    Apologies, I misread the sentence and missed the alpha portion. That should be applied in primitive order.

    Which "order" is this? Order of wavefront write-back, shader program order, or source primitive submission order?
    Hardware is nominally aware of the first. QR governs what happens with the second. The last exists at another level of abstraction where it's not really on the cache to know the difference.
    Having the ROP "collect" samples seems to take this out of QR's purview entirely, since this is an invalidation protocol, not a value broadcast one. The ROP would need to be running some kind of polling loop if it's still participating in the coherence protocol.

    At some point the finite storage available for the unit will be exceeded and the debate is how much the performance drops. If it's effectively unbounded, you can consider the whole screen just one big bin--rendering it moot or just transforming the whole thing into a TBDR.

    This would be mixing up nomenclature from the NoC paper, which only addresses links in terms of connections that go into the interposer. The memory controllers and cores are part of the interposer mesh. The internal concentrating networks on the main die fall outside the scope, so what you are describing is the on-die portion of the concentrated network scheme.
     
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Any order in the case of a large bin. QR, or quite possibly another technique, would still be applicable to any compression techniques where the data is still somewhat dependent on the other samples. It would mimic blending in requirements, but you would still be able to extract all the samples. It would be applicable if another tile or cluster had to export data for any reason. Keeping certain counters or metrics in cache should accelerate the process. It's still a question of whether a ROP is a fixed hardware unit or runs atop the CU, as some form of export caching mechanism.

    TBDR is what I'm envisioning. I foresee designs with rather large caches in the future. With the APU strategy AMD seems to be pursuing there is 20MB on the CPU alone that could be linked in. An EDRAM node that only exists on GPUs could also do it. In future designs with a cache die stacked atop a logic die on an interposer, that capacity should increase significantly. I'd agree binning is currently limited, but one big bin in the future doesn't seem unreasonable, especially if we see any development of reconstruction techniques as opposed to brute force rasterization. ASW for example changed the VR requirements rather significantly, and that technique could extend to most games. Occluded geometry was a significant issue for that technique that an ID buffer or ROV could address.

    Sorry, mix-up in the concentration nomenclature there. The interposer routing would be concentrated. The un-concentrated portion would primarily reside on the GPU die where most of the nodes reside. A portion of that would likely extend to the interposer to connect separate processors. That's what I was getting at in regards to concentration.
     
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Random thought on NCU: an integrated PCIE switch as opposed to the bridge chips they've been using for SSGs, dual adapter, or what seems likely to be a fabric/network for the Instinct line. APUs wouldn't require one, as the CPU obviously has ample PCIE lanes. Further, what if it is an actual CPU, albeit a small one, that acts as the command processor, handles external routing, etc.? A larger APU like Naples wouldn't need one, and smaller APUs have the CPU portion already. A standalone Vega might be designed to lean on a CPU for the command processor and in theory has the off-die link designed in.
     
    #49 Anarchist4000, Dec 29, 2016
    Last edited: Dec 29, 2016
  10. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    The L3 cache is a victim cache private to the core complex though. So unless there is an optional memory-side LLC (which is possible), or evictions can bounce between L3 slices, the GPU is not going to thrash any of these CPU caches.
     
    pharma and Razor1 like this.
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    That assumes one or more cores aren't used exclusively as the command processor. A design that AMD has been hinting at. That's 2MB L3 per core which is probably sufficient for binning geometry on a relatively small APU. In the case of a dedicated ARM core they likely have nearly all available cache if they really want it. We already know they have plans to tightly couple the CPU to GPU, just a question of implementation. That assumes they don't have an additional pool.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    Any reference for this?
    The command processor(s) currently have a very heavy specialization in grabbing a command packet and looking it up in a dedicated code store, then coordinating through dedicated queues and signals.
    Taking that function so far away from the rest of the GPU seems problematic.
    A high-performance core doesn't necessarily behave that consistently, and the cache hierarchy is not built to push updates or give consistently low latency all the way out to the L3/interconnect. Even with all the coherence talk from AMD, I'm not certain that the memory model will be even that tight, since HSA and QR promise something far weaker.

    That aside, the command processor is not particularly concerned with binning geometry. That would happen somewhere in the dedicated hardware in the shader engines. There could be a processor or state machine (or many) in there, since it's bespoke processors all the way down.

    edit:
    The following post about the PS4 being jailbroken has links to a presentation that includes discussion of the multiple proprietary F32 cores in the command processor.
    https://forum.beyond3d.com/posts/1959060/
     
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I have to run in 30m, but I'll try to dig up some stuff later. I've seen some suggestions about it in XB1 documentation and some patents, will take some time to dig up though.

    I'm talking in terms of APUs here, so while the command processor would be further away, it wouldn't be that far; nothing like CPU to discrete graphics. This design would be forgoing the specialization in favor of adaptability. There would likely be a split between the HWS/ACEs and possibly a functionally similar unit for graphics, something hinted at with the PS4 Pro article. With dedicated hardware keeping queues close to the hardware, moving the CP away might not be that problematic. It need only fill the queues, which occurs at a far lower frequency, and it's on the same package with a configurable mesh. With the HSA work there could very well be specialized hardware to speed translations, as even the CPU might have to interpret, generate, or even execute GPU commands.

    Didn't mean to imply that the CPU or command processor would be doing the binning, just that it's a rather likely node on the network that would have adequate storage for a bin. Possibly analogous to NCU or a specialized core for the implementation. Given the mesh it should be reasonably linked in. Say an ARM core with extra PCIE lanes to drive an SSG or external network, or just house a large cache that could be the core's memory channel.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    The context was using cores in the separate CPU memory cache hierarchy.
    Discretes would be difficult, although the board distance is not the major contributor to latency.
    Even in an APU, it's a significant change since the comparison was using cores insulated behind 3+ cache layers and an interconnect with X number of links, northbridge, and an IOMMU, versus a command processor block that is physically wired to and speaks the language of the hardware it is interfacing with.

    I'm not following what it's adapting to, or the restriction since they are programmable. The job currently is to process API/device commands and signals for a bunch of esoteric hardware in a timely and reliable fashion. The meat of the graphical/computational load isn't there, but there is an assumption that it is available and consistent for hardware expecting interactions handled in a fixed/consistent number of cycles--no page faults, context changes, evicted microcode tables, mesh congestion, or tens to hundreds of cycles of fun cache/northbridge variability.

    There's apparently some kind of work distributor, although as described it sounds like it's somewhere past the command processor, in the area of the setup pipelines and shader engines.

    The CP is dedicated hardware, and how does one fill the queues inside the GPU from the host side of the cache hierarchy, unless there's some sort of "processor" that monitors and pulls in "commands" on the other?

    This is the microcode engines in the command processors, and why there are discrepancies in which GPUs could take the necessary microcode updates to support it.
    From my interpretation, HSA's command packets for non-host agents are assumed to be handled by a specialized controller, since HSA says they must be processed without taking up kernel execution resources.
     
  15. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    Queues are in shared memory, regardless of which existing programming model is used (incl. HSA). The command processors, which support oversubscription, only bind a selected few at a time, stream the commands from memory, and context-switch them back to memory as appropriate.

    Especially for HSA, this is actually something defined clearly — heterogeneous agents communicate in the user space through shared memory with a common queueing format and doorbell interrupt mechanics.
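
    A schematic of that shared-memory queue plus doorbell handshake, assuming a single producer; this is not the actual HSA runtime API, and the struct/field names are simplified stand-ins for the AQL-style ring buffer.

    Code:
    // Schematic of the user-mode queue + doorbell handshake described above. This
    // is NOT the real HSA runtime API -- the struct and field names are simplified
    // stand-ins for the AQL-style ring buffer in shared memory. Single producer is
    // assumed; the real protocol also marks packet validity in the packet header.
    #include <atomic>
    #include <cstdint>
    #include <cstring>

    struct Packet { uint8_t bytes[64]; };           // AQL packets are 64 bytes

    struct UserQueue {
        Packet* ring;                               // ring buffer in shared memory
        uint32_t size;                              // power-of-two packet count
        std::atomic<uint64_t> write_index{0};
        std::atomic<uint64_t> read_index{0};        // advanced by the packet processor
        std::atomic<uint64_t> doorbell{0};          // "signal" the agent watches
    };

    // Producer side (application process): reserve a slot, copy the packet in,
    // then ring the doorbell so the packet processor knows how far to read.
    void enqueue(UserQueue& q, const Packet& p) {
        uint64_t idx = q.write_index.fetch_add(1, std::memory_order_relaxed);
        while (idx - q.read_index.load(std::memory_order_acquire) >= q.size) {
            // queue full: wait for the consumer to catch up
        }
        std::memcpy(&q.ring[idx & (q.size - 1)], &p, sizeof(Packet));
        q.doorbell.store(idx, std::memory_order_release);   // doorbell notification
    }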
     
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Significant change, but not necessarily unreasonable. It may be possible to strip out the northbridge and IOMMU and interface more directly to an AQL queue. Say a 16-core Naples treating 16x4 CU clusters as an extension of its resources as opposed to one large device, with each core reserving 4 CUs and the mesh interconnect tailored to suit that design. Shaders are capable of moving data, and the CPU could be wired into the control mechanism. According to the HSA requirements a kernel should be capable of filling its own queue. The CP aggregates commands from across a bus, but if there were only a single source and no need for global synchronization it could probably be bypassed. If the control network was part of the mesh design, the CPU could tap into that along with some specialized hardware.

    This would be an ARM core tailor-built to drive the GPU. There could be an FPGA or custom hardware along with it; however, I think the ARM core might be simpler to interface with. The point of adaptability, or keeping it separate, would be to have different versions for the desired task: one with a large cache as addressable memory/ESRAM, perhaps certain IO mechanisms (NVMe/SSG, Infiniband, 10GbE, USB3.1/Thunderbolt, display outputs), etc. that might be desirable to update without completely changing the GPU design. I've even seen some 60Gbps external switched fabrics mentioned. Swapping the ARM core and updating the interposer and PCB would be all that is required to significantly modify the design without wasting die space. Without changing the GPU die they could update to a newer DisplayPort standard, for instance. Some of those features wouldn't make much sense in consumer products, just as display outputs wouldn't in server chips. It's an extension of that ambidextrous strategy they keep mentioning. Virtualization and their security mechanisms might want it as well.

    Currently yes, but there has been mention of streamlining the scheduling process. Dedicating a core could be construed as not taking up kernel execution resources. If the scope is limited to smaller datasets, similar to using a CU to accelerate SSE-style instructions for a single core, being able to interface more directly might be preferable. They've also mentioned "Initial SW GPU scheduler (disabled by default)". Link That is a bit strange considering all their hardware scheduling capabilities, but it would suggest the ability to bypass the hardware. It would make sense from the standpoint of heavily partitioning the device and lowering latency. The dedicated hardware shouldn't be needed if the maximum scope of synchronization is a single CU or possibly a cluster, in addition to the mesh connecting the CPU to each cluster directly.

    I'll see if I can find that ARM core slide, but it was a while ago I recall seeing it. I think it had to do with XB1 bypassing the CP in favor of software for indirect execution or something.
     
  17. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    If you have been following AMDKFD for a long time, you would know that the "SW scheduler" was said to be only for GPU bring-up and debug purposes. What AMD prefers is GPU-side hardware scheduling with over-subscription, which runs independently from the CPU and communicates with the CPU via shared memory.

    https://lkml.org/lkml/2014/11/8/161
     
  18. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    It seems the "initial SW scheduler" is actually something specific to the graphics stack, according to bridgman.

     
  19. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    127
    It doesn't seem like the HSA specification has been interpreted correctly, and the understanding of the programming model doesn't feel right either.

    A kernel does not have a queue.

    An application process can request multiple user-mode queues from any specific HSA agent, and enqueue packets to the queues. Packets (so far) can be either a kernel dispatch packet or a wait packet. A kernel is just a piece of program.

    A kernel dispatch contains any arbitrary number of work-items that can be organised as a one, two or three dimension space. In reality, it is just X size, Y size, Z size, a kernel pointer, a pointer to kernel arguments, the workgroup size and a few other flags and stuff. It is up to the HSA agent to break a kernel dispatch down in whatever ways they prefer.
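
    That list of fields reads roughly like the AQL dispatch packet layout; a simplified struct capturing it (field names approximate the published packet format, but this is not the verbatim HSA definition):

    Code:
    // Rough shape of the kernel dispatch packet as described in the post: grid
    // sizes, workgroup sizes, a kernel pointer, a kernarg pointer, plus flags.
    // Field names approximate the HSA AQL layout but are not the exact definition.
    #include <cstdint>

    struct KernelDispatchPacket {
        uint16_t header;              // packet type, barrier bit, acquire/release scopes
        uint16_t setup;               // number of dimensions (1, 2 or 3)
        uint16_t workgroup_size_x;
        uint16_t workgroup_size_y;
        uint16_t workgroup_size_z;
        uint16_t reserved0;
        uint32_t grid_size_x;         // total work-items in X
        uint32_t grid_size_y;
        uint32_t grid_size_z;
        uint32_t private_segment_size;
        uint32_t group_segment_size;
        uint64_t kernel_object;       // pointer/handle to the kernel code
        uint64_t kernarg_address;     // pointer to the kernel arguments
        uint64_t reserved1;
        uint64_t completion_signal;   // signalled when the dispatch finishes
    };

    static_assert(sizeof(KernelDispatchPacket) == 64, "AQL packets are 64 bytes");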

    In a GCN GPU, all CUs sit behind a common scheduling layer, which is connected to multiple packet processors (ACEs) that can serve multiple processes at a time. Not "CPU cores" or "threads". We are talking about application processes that own a virtual address space.

    The packet (command) processor itself does not aggregate packets. The packets are written to the coherent memory, and the packet processor pulls them in upon being notified by other agents through user-mode signalling (doorbell interrupt).
     
    #59 pTmdfx, Dec 31, 2016
    Last edited: Dec 31, 2016
    pharma likes this.
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,117
    Likes Received:
    2,859
    Location:
    Well within 3d
    And the cache hierarchy, variable execution time, and lack of a domain-specific microcode store. On top of that, the command processor generates additional signals and instructions based on the contents of the packets, which generally should mean data amplification and a loss of abstraction if this is exposed to the host core complex. This is compromising the CPU's effectiveness in general if hooks for GPU packet processing and arbitrary hardware fiddling are added there.

    There are ARM cores for embedded or control with items like local memory storage and an emphasis on fixed cycle costs for all operations. What is not done with them is taking that and putting it into the CPU host complex. There are dozens of similar cores already driving the GPU, for which the ISA is not a particularly relevant distinction. It may be the case that the ISA is changed, but the role doesn't shift because of it.

    All of these change already without the command processor needing to be re-architected, most of those already change pretty freely since they are already behind microcontrollers or an IO stack. The ESRAM or cache would be an architectural change, but as in the case of the Xbox One, not particularly relevant since it's not there for the command processor.
     