Next Generation Hardware Speculation with a Technical Spin [2018]

Ok. From resetera:

https://www.resetera.com/threads/ps...ion-post-e3-2018.49214/page-184#post-13759967

Basically a new "insider" is claiming that the PS5 was delayed because it did not originally include backwards compatibility. There is not, to my knowledge, any confirmation that he actually is an insider. Assuming this is true, two more questions:

If backwards compatibility was going to be broken despite the APU being another AMD x86 and an AMD GPU, what, if anything, can be read into the fact that the basic design breaks backwards compatibility? Does this tell us anything about Navi?

I believe AMD just announced a short time ago that their first 7nm product "taped out". Is this not almost certainly a cell phone component and not a console APU? What I believe are called "beta devkits" cannot go out until the APU is actually being produced in some fashion, however low volume/low yield. Correct?
AMD's 7nm Vega is what you're thinking of. It's the Vega-based Instinct card (Radeon Instinct, I believe it's called). It's not a gaming GPU, it's a professional workstation-class GPU.

If that is true, the only thing it really tells us is that BC wasn't the highest design priority. It doesn't tell us anything inherent about Navi.
It doesn't take a lot to break BC on consoles from a uarch perspective, e.g. removing or changing a couple of low-level functions, unless a lot of thought had gone into preserving it.

Are they saying it was delayed to add BC? That would probably be a lot of software development.
 

BC not being a priority and not being ready in time for a late 2019 launch is being cited. My lack of technical knowledge kills me sometimes. I am used to thinking of the same old software on the PC running just fine on the next x86 CPU or GPU 3xxx. It made me think there might be something considerably different about the APU, something that broke easy BC.
 
If backwards compatibility was going to be broken despite the APU being another AMD x86 and an AMD GPU, what, if anything, can be read into the fact that the basic design breaks backwards compatibility? Does this tell us anything about Navi?

I think it says more about their lack of expertise in software than it does about AMD's hardware, specifically in the area of planning for virtualization and moving forward.
 
A lot goes into making the PC backwards compatible: drivers, APIs, and the game code itself not being tied/optimised to the execution of the hardware the way it is on console. MS is in a different place as they use a hypervisor etc.

Being x86 and AMD based helps a lot; it doesn't make it easy, though.

If you're unconcerned about BC, then you may make customisations that you know would make BC harder.
 
That's more or less the greatest challenge Sony faces when we're looking at what's required to make BC easy.
 
Correct, but only if you assume full 4:4:4 chroma with 12-bit color. 4:4:4 isn't required for consoles, and we're only at 10-bit color with current HDR. It's limited to 48Gbps total.

That's why I mentioned audio. HDMI 2.1 has a maximum bandwidth of 48Gbps; uncompressed audio can consume as much as 18Gbps of that.
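For a rough sense of where that 48Gbps goes, here is a back-of-the-envelope sketch (my arithmetic only; it counts raw pixel data and ignores blanking intervals and link-encoding overhead, which eat further into the usable rate):

```python
# Rough uncompressed video bandwidth; blanking and link-encoding overhead
# are deliberately ignored, so these figures are lower bounds.
def video_gbps(width, height, fps, bits_per_component, chroma="4:4:4"):
    # Chroma subsampling reduces the effective components per pixel:
    # 4:4:4 keeps all three planes, 4:2:2 halves both chroma planes
    # horizontally, 4:2:0 also halves them vertically.
    components = {"4:4:4": 3.0, "4:2:2": 2.0, "4:2:0": 1.5}[chroma]
    return width * height * fps * bits_per_component * components / 1e9

print(video_gbps(3840, 2160, 120, 12, "4:4:4"))  # ~35.8 Gbps
print(video_gbps(3840, 2160, 120, 10, "4:2:2"))  # ~19.9 Gbps
```

So 4K120 at 12-bit 4:4:4 already needs roughly 36Gbps of pixel data alone, which is why the audio share of the 48Gbps budget matters in that scenario.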
 

The stack of detailed patents on making a system backwards compatible makes the idea of a delay to add it as an afterthought quite perplexing. Those are not new patents, and the work behind them is older than their published or granted dates.
 
I'm not sure patent checking is necessarily a good way to determine whether BC is coming. Xbox One has actual hardware customizations built in, and MS has had patents around virtualization forever, yet no one foresaw BC coming to Xbox One, at least not until it was announced.

The idea that PS5 isn't coming without BC is plausible. But I don't think that's the reason. People in the PS ecosystem are, in fact, used to not having their games carry forward. Implementing BC is neither cheap nor easy for them.
 

Weren't those patents specifically for the 4Pro? Where the uarch was identical except for 3 to 4 new instructions added that wouldn't impact old code?
 
No. They talked about a much broader shift in architecture (as patents typically speak in terms as encompassing as they can), but the patents specifically speak to the architectures having different features. The presence of an L3 cache where the prior architecture did not have one is explicitly stated, for example. They talk about changing cache associativity as well, to force mappings to be predictable.

In aspects of the current disclosure, in timing testing mode CPU resources may be configured to be restricted in ways that affect the timing of execution of application code. Queues, e.g., store and load queues, retirement queues, and scheduling queues, may be configured to be reduced in size (e.g., the usable portion of the resource may be restricted). Caches, such as the L1 I-Cache and D-Cache, the ITLB and DTLB cache hierarchies, and higher level caches may be reduced in size (e.g. the number of values that can be stored in a fully associative cache may be reduced, or for a cache with a limited number of ways the available bank count or way count may be reduced). The rate of execution of all instructions or specific instructions running on the ALU, AGU or SIMD pipes may be reduced (e.g. the latency increases and/or the throughput decreases).

In aspects of the current disclosure, in timing testing mode, the application threads may execute on a CPU core different from that designated by the application. By way of example, but not by way of limitation, in a system with two clusters (cluster "A" and cluster "B") each with two cores, all threads designated for execution on core 0 of cluster A could instead be executed on core 0 of cluster B, and all threads designated for execution on core 0 of cluster B could instead be executed on core 0 of cluster A, resulting in different timing of execution of thread processing due to sharing the cluster high level cache with different threads than under normal operation of the device.

In addition, other behaviors of one or more caches, such as the L1 I-Cache and D-Cache, the ITLB and DTLB cache hierarchies, and higher level caches may be modified in ways that disrupt timing in the timing testing mode. One non-limiting example of such a cache behavior modification would be to change whether a particular cache is exclusive or inclusive. A cache that is inclusive in the normal mode may be configured to be exclusive in the timing testing mode or vice versa.

Another non-limiting example of a cache behavior modification involves cache lookup behavior. In the timing testing mode, cache lookups may be done differently than in the normal mode. Memory access for certain newer processor hardware may actually slow down compared to older hardware if the newer hardware translates from virtual to physical address before a cache lookup and the older hardware does not. For cache entries stored by physical address, as is commonly done for multi-core CPU caches 325, a virtual address is always translated to a physical address before performing a cache lookup (e.g., in L1 and L2). Always translating a virtual address to a physical address before performing any cache lookup allows a core that writes to a particular memory location to notify other cores not to write to that location. By contrast, cache lookups for cache entries stored according to virtual address (e.g., for GPU caches 334) can be performed without having to translate the address. This is faster because address translation only needs to be performed in the event of a cache miss, i.e., an entry is not in the cache and must be looked up in memory 340. The difference in cache behavior may introduce a delay of 5 to 1000 cycles in newer hardware, e.g., if older GPU hardware stores cache entries by virtual address and newer GPU hardware stores cache entries by physical address. To test the application 322 for errors resulting from differences in cache lookup behavior, in the timing testing mode, caching and cache lookup behavior for one or more caches (e.g., GPU caches 334) may be changed from being based on virtual address to being based on physical address or vice versa.

Yet another, non-limiting, example of a behavior modification would be to disable an I-cache pre-fetch function in the timing testing mode for one or more I-caches that have such a function enabled in the normal mode.

Configuring the hardware and performing operations as described above (e.g., configuring the CPU cores to run at different frequencies) may expose errors in synchronization logic, but if the real time behavior of the device is important, the timing testing mode itself may cause errors in operation, e.g., in the case of a video game console, errors due to the inability of the lower speed CPU cores to meet real time deadlines imposed by display timing, audio streamout or the like. According to aspects of the present disclosure, in timing testing mode, the device 300 may be run at higher than standard operating speed. By way of non-limiting example, the higher than standard operating speed may be about 5% to about 30% higher than the standard operating speed. By way of example, but not by way of limitation, in timing testing mode, the clock of the CPU, CPU caches, GPU, internal bus, and memory may be set to higher frequencies than the standard operating frequency (or the standard operating frequency range) of the device. As the mass produced version of the device 300 may be constructed in such a way as to preclude setting of clocks at above standard operating frequencies, specially designed hardware may need to be created, for example hardware that uses higher speed memory chips than a corresponding mass produced device, or uses the portion of a manufacturing run of a system on chip (SoC) that allows higher speed operation than average, or uses higher spec motherboards, power supplies, and cooling systems than are used on the mass produced device.
There are a number of ways in which application errors may be manifested in the timing testing mode. According to one implementation, the specially designed hardware may include a circuit or circuits configured to determine the number of instructions per cycle (IPC) executed by the device 300. The OS 321 may monitor changes in IPC to test for errors in the application. The OS may correlate significant variations in IPC to particular modifications to operation of the device in timing testing mode.
In other embodiments, CPU resources may be reduced when the device operates in the timing testing mode. Examples of such CPU resource reduction include, but are not limited to reducing the size of store queues, load queues, or caches (e.g., L1 or higher, I-cache, D-cache, ITLB, or DTLB). Other examples include, but are not limited to reducing the rate of execution of ALU, AGU, SIMD pipes, or specific instructions. In addition, one or more individual cores or application threads may be randomly or systematically preempted. Additional examples include delaying or speeding up or changing timing when using OS functionality, changing use of cores by the OS, altering virtual to physical core assignment (e.g., inter-cluster competition), leveraging other asymmetries, or writing back or invalidating caches and/or TLBs.
Examples of sending commands to hardware components in ways that disrupt timing at 544 include altering the functioning of the memory 340 or memory controller 315. Examples of such alteration of memory or memory controller functioning include, but are not limited to, running the memory clock and internal bus clock at different frequencies, inserting noise into memory operations, adding latency to memory operations, changing priorities of memory operations, and changing row and/or column channel bits to simulate different channel counts or row breaks.
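To make the resource-restriction idea concrete, here's a toy model (entirely my own sketch for illustration; the class, sizes, and access trace are invented, not from the patent): shrinking just the usable way count of a set-associative cache flips a loop from nearly all hits to nearly all misses, which is exactly the kind of timing disruption the patent wants to provoke and then observe, e.g. through the IPC monitoring it describes.

```python
# Toy set-associative cache with LRU replacement. Restricting `ways`, as the
# patent's timing testing mode describes doing on real hardware, is enough
# to turn a workload from almost all hits into almost all misses.
from collections import OrderedDict

class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, line_addr):
        """Access one cache line address; returns True on a hit."""
        s = self.sets[line_addr % self.num_sets]
        tag = line_addr // self.num_sets
        if tag in s:
            s.move_to_end(tag)        # mark as most recently used
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict the least recently used line
        s[tag] = True
        return False

def miss_rate(ways, trace):
    cache = SetAssocCache(num_sets=64, ways=ways)
    return sum(not cache.access(a) for a in trace) / len(trace)

# A loop cycling through 6 lines per set: fits in 8 ways, thrashes in 4.
trace = [s + 64 * i for _ in range(100) for i in range(6) for s in range(64)]
print(miss_rate(8, trace), miss_rate(4, trace))  # ~0.01 vs ~1.0
```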
Jaguar has no L3.
In another example, a more powerful APU may contain a L3 cache for the CPU, compared to a less powerful APU that did not have such a cache.
In alternative implementations, the CPU 120 and GPU 130 may be implemented as separate hardware components on separate chips
For example, in the context of GPUs, parallel processing threads are bunched in what is sometimes called a “warp” (for NVIDIA hardware) or a “wavefront” (for AMD hardware) as the most basic unit of scheduling, the difference primarily being the number of threads that are grouped together.

IF call-out:
To facilitate communication among the cores in a cluster, the clusters 201-1 . . . 202-M may include corresponding local busses 205-1 . . . 205-M coupled to each of the cores and the cluster-level cache for the cluster. Likewise, to facilitate communication among the clusters, the CPU 200 may include one or more higher-level busses 206 coupled to the clusters 201-1 . . . 201-M and to the higher level cache 204.

Jaguar has no HT/SMT:
Depending on the specifics of the CPU 200, a core may be capable of executing only one thread at once, or may be capable of executing multiple threads simultaneously (“hyperthreading”). In the case of a hyperthreaded CPU, an application may also designate which threads may be executed simultaneously with which other threads. Performance of a thread is impacted by the specific processing performed by any other threads being executed by the same core.
 

Hum.. I'd say Jaguar has "no L2", since its L2 is shared between quad-core modules. Current x86 designs have an isolated L2 per core, and the L3 is used for coherency. Jaguar just skips the typical L2-per-core and goes straight to an L3, it seems.

That said, emulating it with a Zen design should be fairly easy. Since AFAIK Zen's L3 is a lot faster than Jaguar's L2 (at least comparing this with this..), maybe it would just be a case of telling Zen not to use the L2. Unless the data must hop through the L2 before going to the L3, which would imply extra latency.
 
I hope next generation consoles will have an equivalent to mesh shaders (primitive shaders?).

http://reedbeta.com/blog/mesh-shader-possibilities/

A Sucker Punch developer is excited by the possibility...

And I want to see a next generation console with this type of rasterizer...

https://www.freshpatents.com/-dt20180301ptan20180061124.php
Vega already has primitive shaders.
I'm sure the next gen of uarch from AMD will have them, if not improve on the Vega implementation.
 
No. They talked about a much broader shift in architecture (as patents typically speak in terms as encompassing as they can), but the patents specifically speak to the architectures having different features. The presence of an L3 cache where the prior architecture did not have one is explicitly stated, for example. They talk about changing cache associativity as well, to force mappings to be predictable.
Patents are written to be as broad as they can be, including known products, future ones, or hypothetical combinations of features that are never realized. The "non-limiting" or similar modifiers indicate an example alteration could be made or not made without leaving the scope of the patent--effectively stating that almost anything could be changed and instrumented to see if it breaks something. Some of the modifications to the PS4 Pro would change the clock and timing behaviors, and some other elements of the GPU's architecture were enhanced or even subject to being switched on the fly in ways that could be served by a subset of claims in this patent.
It would also be possible for the patent to refer to the reverse of the Pro, where a weaker architecture is being evaluated for how it runs standard software.
It does seem reasonable to expect a future architecture to be more powerful and have different features, of course, though the way companies spam patents on the off-chance that they prove useful means that this method may or may not be implemented.

Hum.. I'd say Jaguar has "no L2", since its L2 is shared between quad-core modules. Current x86 designs have an isolated L2 per core, and the L3 is used for coherency. Jaguar just skips the typical L2-per-core and goes straight to an L3, it seems.
The L2 is the second cache out from the core for a designer whose labeling scheme starts at 1. Jaguar wouldn't be the only shared-L2 x86, with Bulldozer and Conroe being other examples. Whether a cache is called a last-level cache depends on whether it is shared, so I'd say Jaguar has a 2nd-level LLC.

That said, emulating it with a Zen design should be fairly easy. Since AFAIK Zen's L3 is a lot faster than Jaguar's L2 (at least comparing this with this..), maybe it would just be a case of telling Zen not to use the L2. Unless the data must hop through the L2 before going to the L3, which would imply extra latency.
That might be difficult since the L2 is inclusive of the L1 caches, and the L2's tags are what the L3 mirrors for the purposes of cache coherence.

I've not seen the micropolygon patent before. It sounds a lot like tessellation?
The patent primarily concerns handling the very small primitives that can be generated by standard tessellation. Very small triangles at a quad/pixel/sub-pixel range require significantly fewer bits to represent their position data, which can translate into wasting a rasterizer unit with data paths and hardware sized for triangles that might span the whole screen. This allows a GPU to process multiple triangles more efficiently by routing them to rasterizers based on comparisons between their bounding boxes and the size of a given rasterizer.
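As a concrete illustration of that dispatch idea (a hedged sketch of my reading of the patent; the names, the 16-pixel threshold, and the two-path split are mine, not the patent's):

```python
# Hypothetical triangle-routing stage: pick a rasterizer based on the size
# of each triangle's screen-space bounding box.
from dataclasses import dataclass

@dataclass
class Triangle:
    xs: tuple  # screen-space x coordinates of the three vertices
    ys: tuple  # screen-space y coordinates of the three vertices

SMALL_EXTENT = 16  # pixels; a small box needs few bits for position data

def route(tri):
    w = max(tri.xs) - min(tri.xs)
    h = max(tri.ys) - min(tri.ys)
    # A tiny bounding box means narrow data paths suffice, so several such
    # triangles can be packed through a small rasterizer in parallel.
    if w <= SMALL_EXTENT and h <= SMALL_EXTENT:
        return "small rasterizer"
    return "full-size rasterizer"

print(route(Triangle((100, 104, 102), (50, 53, 55))))  # small rasterizer
print(route(Triangle((0, 1900, 950), (0, 30, 1060))))  # full-size rasterizer
```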
 
So if my understanding of micropolygons is correct, it can basically display all the tiny wrinkles, pores and the tiniest skin detail fully in 3D without too much of a performance hit, right? True photorealism is finally here in real time?
 

You can use polygons for micro detail instead of textures (normal maps or parallax maps).

With an equivalent of mesh shaders it could be very interesting: we can do adaptive tessellation with compute, but it goes faster with mesh shaders.

https://jadkhoury.github.io/files/MasterThesisFinal.pdf

Adaptive tessellation with compute.


25% faster with mesh shaders, even without optimisation.
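To sketch what "adaptive tessellation" means here (a toy CPU version of the general idea, not the linked thesis's GPU implementation; the threshold and recursion are mine): keep splitting a triangle at its edge midpoints until its longest screen-space edge is short enough, so flat or distant surfaces get few triangles and detailed or close ones get many.

```python
# Recursive midpoint subdivision until every edge is short in screen space.
# A GPU version would do this breadth-first in compute or mesh shader
# passes rather than with CPU recursion.
import math

MAX_EDGE_PX = 8.0  # hypothetical target edge length in pixels

def tessellate(tri, out):
    a, b, c = tri
    longest = max(math.dist(a, b), math.dist(b, c), math.dist(c, a))
    if longest <= MAX_EDGE_PX:
        out.append(tri)
        return
    # Split into four sub-triangles via the three edge midpoints.
    ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    bc = ((b[0] + c[0]) / 2, (b[1] + c[1]) / 2)
    ca = ((c[0] + a[0]) / 2, (c[1] + a[1]) / 2)
    for t in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        tessellate(t, out)

tris = []
tessellate(((0, 0), (64, 0), (0, 64)), tris)
print(len(tris))  # 256: four levels of subdivision for this triangle
```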

There are some advantages to using tessellation and displacement:

https://developer.amd.com/wordpress/media/2012/10/Tatarchuk-Tessellation_GDC08.pdf

It is funny, but the use of a parallel rasterizer reminds me of this:

http://graphics.stanford.edu/papers/hwrast/

An evolution of that, because they wanted a version working with big polygons like in the AMD patent.


EDIT: And funnily enough, things are at last back on the right track with mesh shaders, and maybe primitive shaders if they're the same thing (looks like they are, but not 100% sure).


 
https://www.linkedin.com/jobs/view/sr-product-manager-campaign-management-at-playstation-855298791/

Sony is hiring a senior product manager for a next generation marketing campaign:

At PlayStation we are working at the frontiers of immersive experiences for our users. We are looking for an entrepreneurial Product Lead to join our Intelligence Platform Group. As part of this hands-on position you would work across Partners, Product and Engineering and contribute to development and growth of PlayStation Intelligence Platform. You will own the roadmap for next generation PlayStation campaign.

EDIT:
https://www.glassdoor.de/job-listin...E55,71.htm?jl=2592091888&countryRedirect=true

The team you'll be joining is chartered with developing platforms that enable delivering intelligent experiences to next generation of PlayStation® devices. To enable us keep the momentum of PlayStation® 4's tremendous success worldwide going over to next generation, you'll contribute to multiple opportunities to have an outsized impact on shaping the evolution of our web services platform
 
it's happening, the sword has been remade.

 