Predict: The Next Generation Console Tech

Keep in mind the total die size is just 3.1 mm^2. Also, I suppose the green area named "FP" actually contains:
- scheduler
- vector register file
- 2 x VALU
- 1 x VIMUL
- 1 x St. Conv.
- 1 x FPAdd
- 1 x FPMul
Replacing the FPAdd and the FPMul with two FMA units (similar to the ones in Piledriver) shouldn't add too much to the die area.

The problem with replacing the ADD and MUL units with FMA is that you will likely increase the latency, which may be a major problem for Jaguar with its more modest out-of-order engine.
 
Is it too much to ask for the CPU and GPU to throttle relative to one another as needed?
Would it even be smart to use something like TurboCore in a console? I think you'd make it easier on the devs if you made sure that all frequencies stay constant while their game is running. I could be wrong of course, but it just seems to me that this would make optimizing their game easier.
 
Is it too much to ask for the CPU and GPU to throttle relative to one another as needed?

It looks like Trinity already does that. I've never heard about it for low-power APUs, though.

The problem with replacing the ADD and MUL units with FMA is that you will likely increase the latency, which may be a major problem for Jaguar with its more modest out-of-order engine.

I don't think the latency for a single FMA could be worse than the latency of a MUL followed by an ADD. So as long as your workload does a lot of FMA, latency would actually be better.

If you're referring to the latency of executing MUL/ADD on an FMA unit, that is likely true, even if Intel managed to avoid that in Haswell (5-cycle MUL, 3-cycle ADD, 5-cycle FMA).
 
Depends on your power budget, but I wouldn't expect 1/2. Hell, given that my 8350 runs at 4.6 GHz undervolted to 1.275 V and will run stock clocks at 1.2 V (all IBT stable), there seems to be a fair bit of room to minimize heat just via binning.

An FX-8350 at 4.6 GHz will easily pull 200 W+, undervolted or not.
 
Would it even be smart to use something like TurboCore in a console? I think you'd make it easier on the devs if you made sure that all frequencies stay constant while their game is running. I could be wrong of course, but it just seems to me that this would make optimizing their game easier.

I would imagine not. Not because the frequencies need to remain constant, but because the frequency ramping would need to be predictable.
 
Would it even be smart to use something like TurboCore in a console? I think you'd make it easier on the devs if you made sure that all frequencies stay constant while their game is running. I could be wrong of course, but it just seems to me that this would make optimizing their game easier.

Let's suppose 95% of your code is parallel. By Amdahl's law, your speedup is a modest 5.93 on 8 cores. If you let the 5% sequential fraction of your code run 25% faster (1.6 GHz -> 2.0 GHz), you achieve a speedup of 6.30. Overall, TurboCore gives you a 6% speedup.

If we consider code with a smaller parallel fraction, let's say 80%, the speedups are 3.33 without TurboCore and 3.85 with it. Overall, TurboCore gives you a 15% speedup.

It doesn't sound too bad to me. I also don't see how it would impact code optimization. I can't think of any optimization which depends on the clock frequency.
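
For anyone who wants to check the numbers, here's a minimal Python sketch of the calculation above (the 1.6 GHz base and 2.0 GHz turbo clocks are the ones from my example):

```python
# Amdahl's law with a sequential-only turbo boost: the parallel
# fraction p is spread over n cores, while the serial fraction
# (1 - p) optionally runs "turbo" times faster.
def speedup(p, n, turbo=1.0):
    return 1.0 / ((1.0 - p) / turbo + p / n)

for p in (0.95, 0.80):
    base = speedup(p, 8)                # every core at 1.6 GHz
    boosted = speedup(p, 8, 2.0 / 1.6)  # serial part at 2.0 GHz
    print(f"p={p:.2f}: {base:.2f} -> {boosted:.2f} "
          f"(+{boosted / base - 1:.0%})")
# p=0.95: 5.93 -> 6.30 (+6%)
# p=0.80: 3.33 -> 3.85 (+15%)
```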
 
An FX-8350 at 4.6 GHz will easily pull 200 W+, undervolted or not.

Power consumption is linear with frequency and quadratic with voltage, so a 125 W CPU, undervolted and overclocked by 15%, can't pull more than ~150 W.
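
As a rough sanity check, here's a sketch of that scaling rule. The 4.0 GHz base clock is the FX-8350's stock frequency, but the 1.35 V stock voltage is just my assumption for illustration, and the rule ignores static/leakage power:

```python
# Dynamic power scales as P = C * f * V^2, so from a known
# operating point: P_new = P_old * (f_new/f_old) * (V_new/V_old)^2
def scale_power(p_old, f_old, f_new, v_old, v_new):
    return p_old * (f_new / f_old) * (v_new / v_old) ** 2

# 125 W TDP at 4.0 GHz stock; overclock by 15% to 4.6 GHz
print(scale_power(125, 4.0, 4.6, 1.35, 1.35))   # ~144 W at stock voltage
print(scale_power(125, 4.0, 4.6, 1.35, 1.275))  # ~128 W undervolted
```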

By the way, the Trinity A10-5700 contains 4 Piledriver cores at 3.4 GHz plus a 384-SP GPU within a 65 W TDP. So 4 high-clocked Piledriver/Steamroller cores are likely doable.
 
I don't think the latency for a single FMA could be worse than the latency of a MUL followed by an ADD. So as long as your workload does a lot of FMA, latency would actually be better.

If you're referring to the latency of executing MUL/ADD on an FMA unit, that is likely true, even if Intel managed to avoid that in Haswell (5-cycle MUL, 3-cycle ADD, 5-cycle FMA).

Intel kept a dedicated FADD unit because of latency.
It might be open to question whether the same "free" FADD at the end of an FMA can be readily accomplished with the reduced custom physical design for a Jaguar core. The latencies in terms of cycles may be smaller, which leaves less wiggle room.
 
Intel kept a dedicated FADD unit because of latency.

Are you sure they maintained a separate FADD unit? It seems to me that the same unit (on port 1) can perform FMA within 5 cycles and FADD within 3 cycles*.

EDIT: You're right, from the description it does seem like a separate unit.

* http://www.realworldtech.com/haswell-cpu/4/

It might be open to question whether the same "free" FADD at the end of an FMA can be readily accomplished with the reduced custom physical design for a Jaguar core. The latencies in terms of cycles may be smaller, which leaves less wiggle room.

Even if FMA has higher latency than FMUL in a "vanilla" Jaguar core (i.e. the FADD at the end is not "free"), it will still be quicker than a separate FMUL followed by an FADD.

For code which cannot be expressed in terms of FMA instructions, it might be more difficult to maintain the same latencies of the "vanilla" Jaguar core. I suppose maintaining the same latency for FMUL wouldn't be difficult, while it might be non-trivial for FADD.
 
Power consumption is linear with frequency and quadratic with voltage, so a 125 W CPU, undervolted and overclocked by 15%, can't pull more than ~150 W.

If you wish a more in-depth discussion then PM me.

But as someone who's had CPUs overclocked at -150 °C with LN2, I can tell you your theory is bullshit.

There are many, many factors when it comes to power consumption.
 
If you wish a more in-depth discussion then PM me.

But as someone who's had CPUs overclocked at -150 °C with LN2, I can tell you your theory is bullshit.

There are many, many factors when it comes to power consumption.

It's not a theory. It's a recognized simplification you're taught in school: P = fCV^2. Of course, when you have a billion-transistor part with different clock domains, simple formulas don't necessarily hold, but if you pull a device down to -150 °C it should be intuitively obvious that the classical power models are going to cease to hold.
 
Instruction latencies (single-precision packed):
- Bulldozer ADD/MUL/FMA4: 5-6 cycles
- Bobcat ADD: 3 cycles
- Bobcat MUL: 2 cycles
I didn't expect MUL latency to be lower than ADD latency... Everything else being equal, since Jaguar can do ADD/MUL in one uop, latencies should be 2 cycles for ADD and 1 cycle for MUL. Is this correct?
Since Bulldozer is clocked more than twice as high as Jaguar, 2-3 cycle ADD/MUL/FMA4 might be possible in a Jaguar-derived core.
 
Bkilian did not say anything about the number of TFLOPS. He just said not to expect a monster GPU. IMHO a monster GPU has 3-4 TFLOPS while a normal GPU has 2-2.5 TFLOPS.
After 8 years of the 360 I do not see why MS would not be able to easily create a 2-2.5 TFLOPS console.
It would not make sense to create a console that is already old before entering the market, unless they want to do a big favor to Sony ;)

If you are new here, it would help if you would first look back over the thread and what has already been established or hasn't.

If you did that, you would know that I was one of the foremost proponents of a 2-2.5 TF GPU, based on the leaked dev kit shots.

Bkilian, as well as saying not to expect a monster GPU, also said this in regards to there being a 2 or 2.25 TF GPU in the kits (note that he isn't merely refuting that there's an HD6870/6950 in the kits, since I didn't mention the cards by name):
Do you? Do you know that for a fact? Like I said earlier, your "fact" is based on looking at the output configuration of a card and inferring the identity of the card from that. I can tell you folks right now, nobody got it right, for a number of reasons.
http://forum.beyond3d.com/showpost.php?p=1689116&postcount=16921

lherre also said a 2.5 TF GPU was out of the question:
http://forum.beyond3d.com/showpost.php?p=1692093&postcount=17785

Furthermore, the last solid info we had on the power of the GPU in the kit was from bgassassin whose source said 1+ TF and not close to the 1.8 TF GPU in the PS4 kits.

So that gives us the range of 1 to 1.5 TF.

If we can establish the provenance of the guy posting the 10 CUs @ 1 GHz rumours (basically a 7770), then maybe we have another piece of info.

Didn't bkilian say that no one guessed right about which card was in the devkit? Or maybe it just wasn't an HD6870 or 6950? People also guessed the 7950, though.

So if every guess was wrong then.. *shrugs*

It's possible that the GPU is custom so we can't guess by external appearance - this is most likely. But it could also be that we haven't guessed correctly and there's a lower-power GPU that fits the bill. What does the 7770 look like from the back?

Bgassassin also thought it could be a 6790, which is 1.3 TF:
That was a nice leak. I was also going to mention the MSI logo can be seen in the picture.

Another "outside" contender is the MSI 6790. The angle of the dev kit pic makes it tough to tell if it has a red cover. But I'd probably lean more toward the 6950 as well.
 
Everything else being equal, since Jaguar can do ADD/MUL in one uop, latencies should be 2 cycles for ADD and 1 cycle for MUL. Is this correct?

Not necessarily. The listed latency for Bobcat may be for the 64-bit halves instead of the full 128-bit result. You'd still be able to issue dependent operations that many cycles apart since the second 64-bit half of the second operation starts a cycle later, so long as the scheduling logic is okay with this.

In this case the 128-bit units in Jaguar wouldn't change the latency. Single-cycle FP32 multiplies sound awfully fast, even at these frequencies.
 
It's possible that the GPU is custom so we can't guess by external appearance - this is most likely. But it could also be that we haven't guessed correctly and there's a lower-power GPU that fits the bill. What does the 7770 look like from the back?

Because MS would have wanted to get something with as comparable a feature set as possible into the devkits, it's extremely likely that it was an AMD beta card, quite probably with a customized BIOS.
Back when the original Xbox was in development, we had NV20 cards based on beta silicon months before they shipped to retail, and kits based on final silicon weren't available.
 
With regards to power consumption: rumors are putting the 720 at 170-200 W power draw. Assuming ~50 W for the rest of the system would leave 120-150 W for the GPU/CPU.

AMD's Temash SoC draws 15 W in full-power mode, with 4 Jaguar cores and 2 GCN CUs. Two of them would then draw 30 W, still leaving 90-120 W free in the power envelope. Recent rumors point to the 720 having 10 CUs. Using these numbers, there would be 90 W free and 6 CUs unaccounted for, and I don't see how 6 CUs would use that much power. With 90 W you could throw in 6 more Temash SoCs, which would be 12 more CUs (not that they would, of course).

Unless the 720 will use much less power than 170 W, I feel like I'm missing something, or my calculations are way off. (I know the final silicon won't just be a bunch of Temash chips stapled together; I was just using those numbers as a reference point.)

Additional Links:
http://www.engadget.com/2013/01/10/amd-temash-reference-laptop-hands-on/
There's mention of 4 cores and 2 CUs per SoC in this interview.
 
With regards to power consumption: rumors are putting the 720 at 170-200 W power draw. Assuming ~50 W for the rest of the system would leave 120-150 W for the GPU/CPU.

AMD's Temash SoC draws 15 W in full-power mode, with 4 Jaguar cores and 2 GCN CUs. Two of them would then draw 30 W, still leaving 90-120 W free in the power envelope. Recent rumors point to the 720 having 10 CUs. Using these numbers, there would be 90 W free and 6 CUs unaccounted for, and I don't see how 6 CUs would use that much power. With 90 W you could throw in 6 more Temash SoCs, which would be 12 more CUs (not that they would, of course).

Unless the 720 will use much less power than 170 W, I feel like I'm missing something, or my calculations are way off. (I know the final silicon won't just be a bunch of Temash chips stapled together; I was just using those numbers as a reference point.)

Additional Links:
http://www.engadget.com/2013/01/10/amd-temash-reference-laptop-hands-on/
There's mention of 4 cores and 2 CUs per SoC in this interview.

I don't know that 170-200 W is something everyone is buying. I think that's an Aegis rumor.

Don't forget to figure in around 80% PSU efficiency: 200 W from the wall is 160 W of system power.
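
To put the whole budget in one place, here's a rough sketch of the arithmetic from the quoted post with the 80% PSU efficiency folded in (the 50 W system overhead and the per-SoC figure are the rumored/assumed numbers from above, not confirmed ones):

```python
# Rough power budget for the rumored 720. All inputs are
# rumored or assumed figures from this thread, not confirmed.
wall_draw = 200.0        # W at the wall (high end of the rumor)
psu_efficiency = 0.80    # typical PSU efficiency
system_overhead = 50.0   # W for drives, RAM, fans, etc. (assumed)

dc_power = wall_draw * psu_efficiency    # power actually delivered
apu_budget = dc_power - system_overhead  # what's left for CPU + GPU

# Temash reference point: 4 Jaguar cores + 2 GCN CUs in ~15 W
two_socs = 2 * 15.0                      # 8 cores + 4 CUs

print(f"DC power: {dc_power:.0f} W")      # 160 W
print(f"APU budget: {apu_budget:.0f} W")  # 110 W
print(f"Left after 8 cores + 4 CUs: {apu_budget - two_socs:.0f} W")  # 80 W
```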
 
If you wish a more in-depth discussion then PM me.

But as someone who's had CPUs overclocked at -150 °C with LN2, I can tell you your theory is bullshit.

There are many, many factors when it comes to power consumption.

Just because you've used liquid nitrogen, it doesn't mean that your claims about Piledriver are any more or less correct! The two things are unrelated.

As it happens, you're off the mark about Piledriver. Here is a Piledriver overclocked all the way up to 5 GHz and over-volted to a whopping great toasty 1.5 V.

http://www.techpowerup.com/reviews/AMD/FX-8350_Piledriver_Review/7.html

205 W off the 8-pin and 254 W for the entire system (presumably at the wall, or it wouldn't be the entire system). At 4.6 GHz and undervolted you'd have to be looking at well under 150 W.

(Edit: Although if they're testing using something puny like "noob stable" wprime then actual max could possibly be higher)

Bulldozer was a hot mess. Piledriver shows massive improvements in performance per watt though (on the same node and with a short turnaround), and 2 Piledriver modules in a console at about 3.5 GHz would be well under the current-gen launch power envelopes. I'm looking forward to seeing if Richland squeezes any more improvements out!
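
For what it's worth, plugging techpowerup's data point into the same P = fCV^2 scaling from earlier backs that up (a ballpark only, since leakage is ignored; the 1.275 V undervolt is the figure quoted earlier in the thread):

```python
# Scale techpowerup's measured point (205 W at 5.0 GHz, 1.5 V)
# down to the earlier poster's settings (4.6 GHz, 1.275 V).
p = 205 * (4.6 / 5.0) * (1.275 / 1.5) ** 2
print(f"{p:.0f} W")  # ~136 W, i.e. well under 150 W
```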
 
Not necessarily. The listed latency for Bobcat may be for the 64-bit halves instead of the full 128-bit result. You'd still be able to issue dependent operations that many cycles apart since the second 64-bit half of the second operation starts a cycle later, so long as the scheduling logic is okay with this.

Right, I didn't think about that. It makes sense that the new operation can be issued as soon as the first 64-bit half is ready.

In this case the 128-bit units in Jaguar wouldn't change the latency. Single-cycle FP32 multiplies sound awfully fast, even at these frequencies.

Let's say Jaguar maintains the 3-cycle ADD and the 2-cycle MUL. A Jaguar-based core with two 3-cycle FMA units* would provide, compared to Jaguar:
- 2x the peak throughput
- 40% lower latency for a MUL followed by an ADD (replaced by a single FMA)
- same latency for ADD
- 50% higher latency for MUL
It sounds like a good compromise to me.

* Should be doable considering Bulldozer has 5-6 cycle FMA at 4 GHz.
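
A tiny sketch of how those percentages fall out of the assumed latencies (3-cycle ADD and 2-cycle MUL from the Bobcat numbers, plus the hypothetical 3-cycle FMA):

```python
# Dependent MUL -> ADD chain vs. a single fused FMA, using the
# assumed Jaguar latencies and a hypothetical 3-cycle FMA unit.
ADD, MUL, FMA = 3, 2, 3  # latencies in cycles

mul_then_add = MUL + ADD  # 5 cycles on a "vanilla" Jaguar
print(f"MUL+ADD vs FMA: {1 - FMA / mul_then_add:.0%} lower latency")  # 40%
print(f"ADD: {ADD} vs {FMA} cycles (same)")
print(f"MUL: {(FMA - MUL) / MUL:.0%} higher latency")                 # 50%
```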
 