Old 19-Jun-2011, 17:16   #351
swaaye
Entirely Suboptimal
 
Join Date: Mar 2003
Location: WI, USA
Posts: 7,316

I like to reflect on the temporal aspects of GPU development.

Few enthusiasts consider that this stuff wasn't designed recently. In reality it has likely been in the works since the RV770 days. And yet ATI of course works hard to sell us on Cypress and Cayman being the best things since sliced bread. Less than a year ago VLIW4 was the hottest game in town.

Undoubtedly this recent presentation was also in a way a smokescreen for the generation after next of GPU hardware. If this stuff has taped out, it's definitely old news for them.
Old 19-Jun-2011, 19:35   #352
chiadog
Junior Member
 
Join Date: May 2008
Posts: 21

^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. I wonder if NI was exactly as intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to be launched on the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, giving them more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
I guess this sounded a little more like a conspiracy theory than I intended
Old 19-Jun-2011, 19:38   #353
Kaotik
Drunk Member
 
Join Date: Apr 2003
Posts: 5,429

Quote:
Originally Posted by chiadog View Post
^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. I wonder if NI was exactly as intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to be launched on the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, giving them more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
I guess this sounded a little more like a conspiracy theory than I intended
According to Dave Baumann, Cayman is exactly the same as it would have been on 32nm, but whether it would have been the high-end chip on 32nm is another thing entirely (since it would have been around the size of Barts or so)
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Old 19-Jun-2011, 19:59   #354
chiadog
Junior Member
 
Join Date: May 2008
Posts: 21

Got it. That puts my world back together (with regard to AMD's timeline). I guess I may have read too many silly-season posts and confused myself.
Old 19-Jun-2011, 21:30   #355
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,451

The scalar unit is something of a second-class citizen, since it doesn't seem capable of writing to memory, and there are only certain ways it can gather data from the SIMD units.
The CU itself is a multi-issue unit that could, with some evolution, become a 5-wide FP coprocessor, allowing a CPU/GPU combo where the CPU can issue to a CU as one kernel, much like the shared FPU on Bulldozer, while the GPU shares resources.

The memory subsystem and the interconnect are more significant changes than the right turn taken by the execution hardware.
The caches are probably larger per unit of storage, given that they take traffic from multiple directions.
The L1/L2 crossbar would have been changed significantly, adding write capability and writeback support. The crossbar has more clients than Fermi's, and the coherency model sounds more intensive on AMD's chip, though it seems it can be optional.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 20-Jun-2011, 05:02   #356
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 10,067

Quote:
Originally Posted by Gipsel View Post
Yes, 5 instructions of different types, which have to come from different waves, each cycle. That is definitely a bit more than what was traded for the number of operations per issue (4 or 5 max) every 4 cycles.
I thought branches are already co-issued with VLIW instructions on existing architectures.

Quote:
But you are completely right on the first point: the beefed-up issue capabilities don't change that there is not much dynamic behavior going on. The instructions are plainly issued in order to a given and predetermined unit with no fancy stuff going on.
I'm struggling to understand how this is any different or less dynamic than what Fermi does. GCN seems to be doing the same thing, except that it only has 10 wavefronts to choose from per SIMD instead of 24. The issue logic actually seems to be a lot more complex than Fermi's, which can only dispatch one instruction per clock (or two in the case of GF10x).
__________________
What the deuce!?
Old 20-Jun-2011, 05:50   #357
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,272

GCN seems to guarantee that you won't stall on instruction or RAW latency. Fermi has no such guarantee, hence its more complex scoreboarding mechanism.
Old 20-Jun-2011, 06:02   #358
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 10,067

Quote:
Originally Posted by rpg.314 View Post
GCN seems to guarantee that you won't stall on instruction or RAW latency. Fermi has no such guarantee, hence its more complex scoreboarding mechanism.
I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process, it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; the maximum seems to be ~4 (see the linked paper).

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Old 20-Jun-2011, 06:59   #359
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,272

Quote:
Originally Posted by trinibwoy View Post
I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process, it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; the maximum seems to be ~4 (see the linked paper).

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Ok, GCN can hide all instruction and RAW latencies with 4 wavefronts per CU, the bare minimum. Fermi needs more than the bare minimum to hide it all, hence a more complex scoreboard.
Old 20-Jun-2011, 07:30   #360
trinibwoy
Meh
 
Join Date: Mar 2004
Location: New York
Posts: 10,067

Quote:
Originally Posted by rpg.314 View Post
Ok, GCN can hide all instruction and RAW latencies with 4 wavefronts per CU, the bare minimum.
How are you defining bare minimum? Pipeline depth? I haven't seen anything conclusive about GCN's ALU pipeline indicating that it's only 4 cycles deep. The "vector back-to-back instruction issue" in the slides could be referring to the round-robin issue and not necessarily back-to-back issue from the same wave.

Quote:
Fermi needs more than bare minimum to hide it all, hence a more complex scoreboard.
You don't need a complex scoreboard if you're relying solely on TLP. Fermi's additional complexity comes from a few things:

1. Instruction issue runs at warp execution speed; on GCN it runs at 4x wavefront execution speed, so Fermi needs 4x the number of dispatchers to feed an equivalent number of SIMDs.
2. The ALU pipeline is deeper, so the "bare minimum" number of warps required for latency hiding is higher.
3. Multiple instructions can be in flight from the same warp.

The scoreboarding is only necessary for #3. This actually lets Fermi get away with fewer warps than the "bare minimum" would otherwise suggest.
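Point 3 is the part that actually needs a scoreboard. Here's a toy sketch of what such per-warp tracking might look like; the register names, the 18-cycle latency, and the 4-entry window are illustrative assumptions, not NVIDIA's actual implementation:

```python
# Toy model of per-warp scoreboarding: several independent instructions
# from the same warp may be in flight at once, bounded by a small window.
class WarpScoreboard:
    MAX_IN_FLIGHT = 4  # roughly the limit suggested in Volkov's slides

    def __init__(self):
        self.in_flight = []  # list of (dest_reg, retire_cycle)

    def can_issue(self, srcs, dest, cycle):
        # retire completed results, then stall on RAW/WAW hazards
        # against still-pending ones, or when the window is full
        self.in_flight = [(d, t) for d, t in self.in_flight if t > cycle]
        if len(self.in_flight) >= self.MAX_IN_FLIGHT:
            return False
        busy = {d for d, _ in self.in_flight}
        return dest not in busy and not (set(srcs) & busy)

    def issue(self, dest, cycle, latency=18):
        self.in_flight.append((dest, cycle + latency))

sb = WarpScoreboard()
assert sb.can_issue(srcs=["r1"], dest="r2", cycle=0)
sb.issue("r2", cycle=0)
# a dependent instruction (reads r2) must wait for the result...
assert not sb.can_issue(srcs=["r2"], dest="r3", cycle=1)
# ...but an independent one from the same warp can go right away
assert sb.can_issue(srcs=["r4"], dest="r5", cycle=1)
```

The last two assertions are the whole point: a dependent instruction stalls, an independent one issues immediately, which is the ILP advantage the scoreboard buys.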
Old 20-Jun-2011, 12:03   #361
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,272

Quote:
Originally Posted by trinibwoy View Post
How are you defining bare minimum - pipeline depth? I haven't seen anything conclusive about GCN's ALU pipeline indicating that it's only 4 cycles. The "vector back-to-back instruction issue" in the slides could be referring to the round robin issue and not necessarily back-to-back issue from the same wave.
The minimum needed to fill all ALUs, e.g. 2 warps for Fermi.
Old 20-Jun-2011, 12:31   #362
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

Quote:
Originally Posted by rpg.314 View Post
The minimum needed to fill all ALUs, e.g. 2 warps for Fermi.
That would fill the ALUs for exactly 2 cycles in the worst case (and 8 cycles in the best case with high ILP), leaving them idle for the next 10 to 16 cycles. The bare minimum is actually between 6 warps with high ILP and 18 warps (the pipeline depth) for ILP=1. For GF104-style SMs those numbers are even higher.
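That arithmetic can be sketched in a few lines. A minimal back-of-the-envelope model, assuming an ~18-cycle ALU latency for Fermi and GCN's 4-cycle wavefront cadence with back-to-back dependent issue (both are this thread's estimates, not vendor specs):

```python
# Back-of-the-envelope "bare minimum" warp counts for latency hiding.
from math import ceil

def min_warps_fermi(alu_latency=18, ilp=1):
    """Warps per SM needed so a ready instruction always exists.
    Each warp keeps `ilp` independent instructions in flight; together
    the resident warps must cover `alu_latency` issue slots."""
    return ceil(alu_latency / ilp)

def min_waves_gcn(simds_per_cu=4):
    """If GCN issues a 64-wide wavefront over 4 cycles on a 16-lane SIMD
    and can issue dependent vector ops back to back (as the slides seem
    to say), one wave per SIMD suffices for ALU latency."""
    return simds_per_cu * 1

print(min_warps_fermi(ilp=1))  # 18 warps, the ILP=1 figure above
print(min_warps_fermi(ilp=3))  # 6 warps with high ILP
print(min_waves_gcn())         # 4 waves per CU, rpg.314's figure
```

With ILP=1 it reproduces the 18-warp figure, ~3 independent instructions per warp brings it down to 6, and GCN's back-to-back issue collapses the requirement to one wave per SIMD.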
Old 20-Jun-2011, 12:47   #363
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

Quote:
Originally Posted by trinibwoy View Post
I thought branches are already co-issued with VLIW instructions on existing architectures.
In principle yes, but as control flow opens up a new clause anyway, it's a moot point to some extent: clause switching is expensive if the clauses are shorter than 10 instructions or so.
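A quick back-of-the-envelope for that break-even point (the ~10-cycle switch cost is just the ballpark from above, not a measured number):

```python
# Rough cost model for clause switching: if opening a new clause costs
# roughly `switch_cost` idle cycles, a clause of n instructions keeps
# the VLIW units busy about n/(n+switch_cost) of the time.
def clause_utilization(n_instructions, switch_cost=10):
    return n_instructions / (n_instructions + switch_cost)

for n in (2, 10, 40):
    print(n, round(clause_utilization(n), 2))
```

A 2-instruction clause would keep the units busy only about a sixth of the time, while a 40-instruction clause reaches ~80%, which is why short clauses hurt so badly.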
Quote:
Originally Posted by trinibwoy View Post
I'm struggling to understand how this is any different or less dynamic than what Fermi does. GCN seems to be doing the same thing except that it only has 10 wavefronts to choose from per SIMD instead of 24. The issue logic actually seems to be a lot more complex than Fermi which can only dispatch one instruction per clock (or two in the case of GF10x).
GCN's issue logic is simple in the respect that the type of an instruction and the instruction buffer it comes from determine the exact unit it will be executed on, and probably no dependencies between arithmetic instructions need to be tracked. There are simply not many decisions to make. With Fermi on the other hand, the scheduler has to check many more dependencies between instructions in flight and determine which vector ALU or which SFU block (there are two in a GF104 SM) the instruction has to go to. Fermi can issue instructions to any of the 16-wide vector ALUs in the SM (very evident in GF104-style SMs and also for DP ops). This also complicates the operand collector and the result networks from the register files (which probably contribute significantly to the latency).
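A rough sketch of that "not many decisions" scheme, as I read it from the slides (the unit names and per-cycle rules here are my guesses, not AMD documentation):

```python
# Toy GCN-style co-issue: each cycle the CU looks at one SIMD's wavefront
# buffers and picks at most one ready instruction per *type*, each from a
# different wave. No inter-wave dependency tracking is modelled; the
# instruction type alone routes the op to its unit.
def issue_cycle(wave_buffers,
                units=("vector", "scalar", "vmem", "lds", "export")):
    """wave_buffers: {wave_id: type of that wave's next instruction,
    or None if the wave has nothing ready}.
    Returns {unit: wave_id} for this cycle's co-issue."""
    issued = {}
    for wave, itype in wave_buffers.items():
        if itype in units and itype not in issued:
            issued[itype] = wave  # in-order per wave, one wave per unit
    return issued

waves = {0: "vector", 1: "vector", 2: "scalar", 3: "vmem", 4: None}
picked = issue_cycle(waves)
# wave 0 takes the vector slot; wave 1 waits for the SIMD's next turn
```

The whole "scheduler" is a handful of comparisons, which is the point: the unit is predetermined by instruction type, so nothing like Fermi's scoreboard or operand-collector routing is needed.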
Old 20-Jun-2011, 13:01   #364
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,919

Quote:
Originally Posted by trinibwoy View Post
1. Instruction issue runs at warp […] speed
I knew Fermi was fast, but faster than light?!



Sorry, I'm gone now, you can resume normal discussions.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Old 21-Jun-2011, 15:26   #365
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

If someone wants to look over the instruction set of SI (I hope it's complete, but maybe I missed something while copying), have fun!

Vector instructions:
Code:
v_cmpx_t_u64  	v_cmpx_ge_u64  	v_cmpx_ne_u64 	v_cmpx_gt_u64
v_cmpx_le_u64 	v_cmpx_eq_u64 	v_cmpx_lt_u64 	v_cmpx_f_u64
v_cmp_t_u64 	v_cmp_ge_u64 	v_cmp_ne_u64 	v_cmp_gt_u64
v_cmp_le_u64 	v_cmp_eq_u64 	v_cmp_lt_u64 	v_cmp_f_u64
v_cmpx_t_u32 	v_cmpx_ge_u32 	v_cmpx_ne_u32 	v_cmpx_gt_u32
v_cmpx_le_u32 	v_cmpx_eq_u32 	v_cmpx_lt_u32 	v_cmpx_f_u32
v_cmp_t_u32 	v_cmp_ge_u32 	v_cmp_ne_u32 	v_cmp_gt_u32
v_cmp_le_u32 	v_cmp_eq_u32 	v_cmp_lt_u32 	v_cmp_f_u32
v_cmpx_t_i64 	v_cmpx_ge_i64 	v_cmpx_ne_i64 	v_cmpx_gt_i64
v_cmpx_le_i64 	v_cmpx_eq_i64 	v_cmpx_lt_i64 	v_cmpx_f_i64
v_cmp_t_i64 	v_cmp_ge_i64 	v_cmp_ne_i64 	v_cmp_gt_i64
v_cmp_le_i64 	v_cmp_eq_i64 	v_cmp_lt_i64 	v_cmp_f_i64
v_cmpx_t_i32 	v_cmpx_ge_i32 	v_cmpx_ne_i32 	v_cmpx_gt_i32
v_cmpx_le_i32 	v_cmpx_eq_i32 	v_cmpx_lt_i32 	v_cmpx_f_i32
v_cmp_t_i32 	v_cmp_ge_i32 	v_cmp_ne_i32 	v_cmp_gt_i32
v_cmp_le_i32 	v_cmp_eq_i32 	v_cmp_lt_i32 	v_cmp_f_i32
v_cmpsx_tru_f64 	v_cmpsx_nlt_f64 	v_cmpsx_neq_f64 	v_cmpsx_nle_f64
v_cmpsx_ngt_f64 	v_cmpsx_nlg_f64 	v_cmpsx_nge_f64 	v_cmpsx_u_f64
v_cmpsx_o_f64 	v_cmpsx_ge_f64 	v_cmpsx_lg_f64 	v_cmpsx_gt_f64
v_cmpsx_le_f64 	v_cmpsx_eq_f64 	v_cmpsx_lt_f64 	v_cmpsx_f_f64
v_cmps_tru_f64 	v_cmps_nlt_f64 	v_cmps_neq_f64 	v_cmps_nle_f64
v_cmps_ngt_f64 	v_cmps_nlg_f64 	v_cmps_nge_f64 	v_cmps_u_f64
v_cmps_o_f64 	v_cmps_ge_f64 	v_cmps_lg_f64 	v_cmps_gt_f64
v_cmps_le_f64 	v_cmps_eq_f64 	v_cmps_lt_f64 	v_cmps_f_f64
v_cmpsx_tru_f32 	v_cmpsx_nlt_f32 	v_cmpsx_neq_f32 	v_cmpsx_nle_f32
v_cmpsx_ngt_f32 	v_cmpsx_nlg_f32 	v_cmpsx_nge_f32 	v_cmpsx_u_f32
v_cmpsx_o_f32 	v_cmpsx_ge_f32 	v_cmpsx_lg_f32 	v_cmpsx_gt_f32
v_cmpsx_le_f32 	v_cmpsx_eq_f32 	v_cmpsx_lt_f32 	v_cmpsx_f_f32
v_cmps_tru_f32 	v_cmps_nlt_f32 	v_cmps_neq_f32 	v_cmps_nle_f32
v_cmps_ngt_f32 	v_cmps_nlg_f32 	v_cmps_nge_f32 	v_cmps_u_f32
v_cmps_o_f32 	v_cmps_ge_f32 	v_cmps_lg_f32 	v_cmps_gt_f32
v_cmps_le_f32 	v_cmps_eq_f32 	v_cmps_lt_f32 	v_cmps_f_f32
v_cmpx_tru_f64 	v_cmpx_nlt_f64 	v_cmpx_neq_f64 	v_cmpx_nle_f64
v_cmpx_ngt_f64 	v_cmpx_nlg_f64 	v_cmpx_nge_f64 	v_cmpx_u_f64
v_cmpx_o_f64 	v_cmpx_ge_f64 	v_cmpx_lg_f64 	v_cmpx_gt_f64
v_cmpx_le_f64 	v_cmpx_eq_f64 	v_cmpx_lt_f64 	v_cmpx_f_f64
v_cmp_tru_f64 	v_cmp_nlt_f64 	v_cmp_neq_f64 	v_cmp_nle_f64
v_cmp_ngt_f64 	v_cmp_nlg_f64 	v_cmp_nge_f64 	v_cmp_u_f64
v_cmp_o_f64 	v_cmp_ge_f64 	v_cmp_lg_f64 	v_cmp_gt_f64
v_cmp_le_f64 	v_cmp_eq_f64 	v_cmp_lt_f64 	v_cmp_f_f64
v_cmpx_tru_f32 	v_cmpx_nlt_f32 	v_cmpx_neq_f32 	v_cmpx_nle_f32
v_cmpx_ngt_f32 	v_cmpx_nlg_f32 	v_cmpx_nge_f32 	v_cmpx_u_f32
v_cmpx_o_f32 	v_cmpx_ge_f32 	v_cmpx_lg_f32 	v_cmpx_gt_f32
v_cmpx_le_f32 	v_cmpx_eq_f32 	v_cmpx_lt_f32 	v_cmpx_f_f32
v_cmp_tru_f32 	v_cmp_nlt_f32 	v_cmp_neq_f32 	v_cmp_nle_f32
v_cmp_ngt_f32 	v_cmp_nlg_f32 	v_cmp_nge_f32 	v_cmp_u_f32
v_cmp_o_f32 	v_cmp_ge_f32 	v_cmp_lg_f32 	v_cmp_gt_f32
v_cmp_le_f32 	v_cmp_eq_f32 	v_cmp_lt_f32 	v_cmp_f_f32
v_sad_u16 	v_med3_i32 	v_rcp_f32 	v_sqrt_f64
v_min_f64 	v_cvt_f32_f16 	v_floor_f32 	v_mul_lo_u32
v_ldexp_f32 	v_movrels_b32 	v_ashr_i32 	v_cvt_f64_u32
v_rsq_f64 	v_trunc_f32 	v_max_f32 	v_cvt_pknorm_i16_f32
v_subrev_i32 	v_add_f32 	v_cubema_f32 	v_cvt_f32_ubyte0
v_cvt_f32_ubyte1 	v_cvt_f32_ubyte2 	v_movreld_b32 	v_cvt_flr_i32_f32
v_cmp_class_f64 	v_cmpx_class_f64 	v_max3_i32 	v_cmpx_i64
v_cubeid_f32 	v_sad_u8 	v_cubetc_f32 	v_rcp_f64
v_fma_f32 	v_rndne_f32 	v_cmp_f32 	v_cndmask_b32
v_nop 	v_ldexp_f64 	v_bfi_b32 	v_cmpx_f64
v_cvt_f32_ubyte3 	v_cos_f32 	v_cvt_f16_f32 	v_ceil_f32
v_mad_i32_i24 	v_rcp_clamp_f64 	v_rsq_f32 	v_bcnt_u32_b32
v_subb_u32 	v_fract_f64 	v_min3_f32 	v_mac_f32
v_cmpx_u32 	v_mul_u32_u24 	v_mov_b32 	v_max3_u32
v_bfe_i32 	v_ffbh_u32 	v_addc_u32 	v_cvt_f64_i32
v_div_scale_f64 	v_madmk_f32 	v_mbcnt_hi_u32_b32 	v_cmp_i32
v_sub_i32 	v_sub_f32 	v_sad_hi_u8 	v_max_i32
v_writelane_b32 	v_bfm_b32 	v_ffbl_b32 	v_sqrt_f32
v_min_f32 	v_med3_u32 	v_cvt_u32_f64 	v_cvt_f64_f32
v_mac_legacy_f32 	v_interp_mov_f32 	v_mbcnt_lo_u32_b32 	v_rsq_clamp_f64
v_or_b32 	v_ashr_i64 	v_readlane_b32 	v_min3_u32
v_log_f32 	v_rsq_legacy_f32 	v_div_scale_f32 	v_madak_f32
v_add_f64 	v_mul_f64 	v_lshrrev_b32 	v_cmpx_i32
v_min_legacy_f32 	v_fma_f64 	v_min_i32 	v_cvt_pkrtz_f16_f32
v_lshl_b32 	v_xor_b32 	v_and_b32 	v_cubesc_f32
v_max_legacy_f32 	v_cvt_i32_f32 	v_cvt_i32_f64 	v_rsq_clamp_f32
v_ffbh_i32 	v_cmpx_f32 	v_mul_i32_i24 	v_sad_u32
v_exp_f32 	v_mul_f32 	v_movrelsd_b32 	v_frexp_mant_f64
v_readfirstlane_b32 	v_cvt_off_f32_i4 	v_cvt_f32_u32 	v_bfrev_b32
v_ashrrev_i32 	v_cvt_rpi_i32_f32 	v_mul_hi_i32_i24 	v_mad_legacy_f32
v_lshr_b32 	v_cmpx_u64 	v_sin_f32 	v_add_i32
v_mul_hi_u32 	v_lshl_b64 	v_fract_f32 	v_cmp_class_f32
v_lerp_u8 	v_max_f64 	v_cvt_pk_u8_f32 	v_med3_f32
v_min3_i32 	v_frexp_mant_f32 	v_rcp_clamp_f32 	v_cmp_f64
v_cmp_u32 	v_mullit_f32 	v_mul_hi_u32_u24 	v_frexp_exp_i32_f64
v_cvt_u32_f32 	v_cmp_i64 	v_max_u32 	v_not_b32
v_min_u32 	v_mad_f32 	v_alignbyte_b32 	v_cvt_f32_f64
v_lshr_b64 	v_subrev_f32 	v_mul_lo_i32 	v_log_clamp_f32
v_rcp_legacy_f32 	v_subbrev_u32 	v_mad_u32_u24 	v_max3_f32
v_cvt_f32_i32 	v_lshlrev_b32 	v_mul_hi_i32 	v_cmp_u64
v_alignbit_b32 	v_bfe_u32 	v_interp_mac2_f32 	v_cmpx_class_f32
v_frexp_exp_i32_f32 	v_mul_legacy_f32
Scalar instructions:
Code:
s_cselect_b64 	s_wqm_b64 	s_lshl_b64 	s_bitset0_b64
s_ashr_i32 	s_mul_i32 	s_bitcmp1_b64 	s_ff1_i32_b32
s_flbit_i32_b32 	s_andn2_saveexec_b64 	s_xor_saveexec_b64 	s_nop
s_mov_b32 	s_cbranch_i_fork 	s_nor_b64 	s_cbranch_execnz
s_quadmask_b32 	s_or_saveexec_b64 	s_branch 	s_cmp_i32
s_cmov_b64 	s_sendmsg 	s_getpc_b64 	s_rfe_b64
s_endpgm 	s_cselect_b32 	s_addc_u32 	s_memtime
s_bitcmp0_b64 	s_nor_b32 	s_min_i32 	s_bfe_i32
s_ff1_i32_b64 	s_xor_b64 	s_andn2_b32 	s_nand_saveexec_b64
s_setprio 	s_mov_b64 	s_bcnt0_i32_b64 	s_ashr_i64
s_cmov_b32 	s_bcnt1_i32_b64 	s_nor_saveexec_b64 	s_xnor_b32
s_or_b32 	s_brev_b64 	s_lshr_b64  s_xor_b32
s_not_b32 	s_orn2_b64 	s_sext_i32_i16 	s_nand_b64
s_cmp_u32 	s_lshl_b32 	s_bfe_u64 	s_max_u32
s_min_u32 	s_movrels_b32 	s_brev_b32 	s_lshr_b32
s_sext_i32_i8 	s_orn2_b32 	s_movreld_b64 	s_cbranch_join
s_bitcmp0_b32 	s_cbranch_vccz 	s_flbit_i32_b64 	s_add_i32
s_nand_b32 	s_setreg_b32 	s_xnor_saveexec_b64 	s_bcnt1_i32_b32
s_quadmask_b64 	s_movreld_b32 	s_ff0_i32_b64 	s_and_b64
s_barrier 	s_bitset1_b32 	s_flbit_i32_i64 	s_swappc_b64
s_ff0_i32_b32 	s_setpc_b64 	s_waitcnt 	s_andn2_b64
s_cbranch_vccnz 	s_and_saveexec_b64 	s_bcnt0_i32_b32 	s_bitcmp1_b32
s_movrels_b64 	s_bfe_u32 	s_subb_u32 	s_and_b32
s_max_i32 	s_bitset1_b64 	s_cbranch_scc0 	s_bfm_b64
s_or_b64 	s_orn2_saveexec_b64 	s_wqm_b32 	s_bfm_b32
s_xnor_b64 	s_bfe_i64 	s_getreg_b32 	s_sub_i32
s_not_b64 	s_flbit_i32 	s_cbranch_scc1 	s_cbranch_execz
s_bitset0_b32
s_load_dword 	s_buffer_load_dword
Vector memory instructions:
Code:
Texture ops:
image_get_lod 	image_sample_c_l_o 	image_sample_c_b_o 	image_sample_c_d_o
image_atomic_rsub 	image_sample_d_cl_o 	image_sample_c_b_cl_o 	image_atomic_umin
image_sample 	image_gather4_cl 	image_sample_b_cl 	image_gather4_c_cl
image_sample_lz_o 	image_atomic_sub 	image_gather4 	image_sample_c_lz
image_atomic_smin 	image_atomic_umax 	image_sample_lz 	image_atomic_add
image_sample_o 	image_sample_l 	image_sample_c 	image_sample_b
image_sample_d 	image_sample_cd 	image_sample_cl 	image_gather4_c_cl_o
image_sample_cd_cl_o 	image_atomic_xor 	image_sample_d_o 	image_atomic_dec
image_sample_c_cd_cl_o 	image_gather4_c_o 	image_sample_b_o 	image_atomic_cmpswap
image_atomic_smax 	image_sample_l_o 	image_get_resinfo 	image_sample_c_b_cl
image_sample_c_cd 	image_sample_c_cl 	image_atomic_and 	image_atomic_or
image_atomic_inc 	image_sample_c_d_cl_o 	image_sample_c_d 	image_sample_c_b 	image_sample_c_o
image_sample_c_l 	image_gather4_cl_o 	image_sample_cd_o 	image_load
image_load_mip 	image_sample_cd_cl 	image_sample_b_cl_o 	image_sample_c_lz_o
image_atomic_swap 	image_sample_cl_o 	image_store 	image_sample_d_cl
image_sample_c_cl_o 	image_gather4_o 	image_gather4_c 	image_sample_c_d_cl
image_sample_c_cd_o 	image_sample_c_cd_cl

Memory ops:
buffer_atomic_cmpswap 	buffer_load_dwordx2 	buffer_store_format_xy 	buffer_load_sbyte
buffer_load_format_x 	buffer_store_format_xyz 	buffer_atomic_smax_x2 	buffer_atomic_or_x2
buffer_atomic_smin 	buffer_load_format_xyz 	buffer_load_format_xyzw 	buffer_atomic_add_x2
buffer_store_dwordx2 	buffer_store_dwordx4 	buffer_atomic_xor 	buffer_store_dword
buffer_atomic_cmpswap_x2 	buffer_atomic_umax_x2 	buffer_atomic_fmin 	buffer_atomic_fcmpswap_x2
buffer_atomic_umin_x2 	buffer_atomic_umax 	buffer_atomic_xor_x2 	buffer_atomic_sub
buffer_atomic_rsub 	buffer_load_dword 	buffer_load_ushort 	buffer_atomic_sub_x2
buffer_atomic_fcmpswap 	buffer_load_dwordx4 	buffer_atomic_inc_x2 	buffer_load_format_xy
buffer_atomic_fmax 	buffer_atomic_fmax_x2 	buffer_atomic_umin 	buffer_atomic_inc
buffer_load_ubyte 	buffer_atomic_or 	buffer_store_format_x 	buffer_store_format_xyzw
buffer_atomic_and 	buffer_store_short 	buffer_atomic_smin_x2 	buffer_store_byte
buffer_load_sshort 	buffer_atomic_smax 	buffer_atomic_fmin_x2 	buffer_atomic_dec_x2
buffer_atomic_add 	buffer_atomic_swap_x2 	buffer_atomic_and_x2 	buffer_atomic_dec
buffer_atomic_rsub_x2 	buffer_atomic_swap
tbuffer_store_format_xy 	tbuffer_store_format_x 	tbuffer_load_format_xy 	tbuffer_store_format_xyz
tbuffer_store_format_xyzw 	tbuffer_load_format_x 	tbuffer_load_format_xyz 	tbuffer_load_format_xyzw
Data share Instructions:
Code:
ds_read_i16	ds_sub_rtn_u32	ds_wrxchg_rtn_b64	ds_max_rtn_f64
ds_cmpst_rtn_f64	ds_write_b8	ds_min_rtn_f64	ds_min_rtn_i32
ds_wrxchg2_rtn_b32 	ds_max_rtn_f32	ds_read_u16	ds_inc_rtn_u64
ds_write2st64_b32	ds_dec_rtn_u32	ds_min_f32	ds_dec_u64
ds_consume	ds_min_rtn_f32	ds_gws_sema_br	ds_max_i32
ds_read2st64_b64	ds_write_b64	ds_cmpst_b64	ds_add_rtn_u32
ds_gws_init	ds_min_rtn_i64	ds_wrxchg2st64_rtn_b64	ds_wrxchg2_rtn_b64
ds_min_rtn_u64	ds_min_u32	ds_mskor_b64	ds_sub_u64
ds_dec_rtn_u64	ds_dec_u32	ds_max_f32	ds_read2st64_b32
ds_write_b32	ds_cmpst_rtn_f32	ds_sub_rtn_u64	ds_min_f64
ds_read_i8	ds_swizzle_b32	ds_and_b64	ds_or_rtn_b32
ds_min_i64	ds_write2_b64	ds_max_rtn_i32	ds_xor_b64
ds_and_rtn_b64	ds_write2st64_b64	ds_read_b32	ds_cmpst_rtn_b32
ds_gws_barrier	ds_or_b64 	ds_read2_b32	ds_add_u32
ds_cmpst_b32	ds_and_rtn_b32	ds_append	 	ds_min_i32
ds_xor_rtn_b32	ds_write2_b32	ds_wrxchg2st64_rtn_b32	ds_sub_u32
ds_cmpst_rtn_b64	ds_cmpst_f64	ds_max_f64	ds_or_b32
ds_max_rtn_u32	ds_write_b16	ds_ordered_count	ds_max_u64
ds_gws_sema_p	ds_gws_sema_v	ds_read_u8	ds_rsub_rtn_u32
ds_rsub_u64	ds_max_i64	ds_inc_u64	ds_inc_u64
ds_mskor_rtn_b64	ds_add_rtn_u64	ds_and_b32	ds_xor_rtn_b64
ds_wrxchg_rtn_b32	ds_or_rtn_b64	ds_min_rtn_u32	ds_min_u64
ds_mskor_b32	ds_cmpst_f32	ds_max_rtn_u64	ds_max_u32
ds_max_rtn_i64	ds_rsub_rtn_u64	ds_rsub_u32	ds_read_b64
ds_inc_u32	ds_mskor_rtn_b32	ds_inc_rtn_u32	ds_read2_b64
ds_add_u64	ds_xor_b32
Other instructions (Internal ones? Some branch instructions?):
Code:
sys_input 	init_opnd 	sc_opnd_table 	sc_op_unknown
merge 	mem_merge 	killz 	killnz 	phi
undefined
if_wv_i32 	if_wv_bit1 	if_wv_bit0 	if_wv_f32
if_th_bit0 	if_th_bit1 	if_wv_u32
callrtn	tabjmp
There appear to be quite a few atomic (even for images/textures) and data share instructions, as well as an awful lot of comparison instructions(?!?). Can't offer you documentation though, sorry
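For what it's worth, one plausible reading of the cmp vs cmpx split is that both produce a per-lane bit mask, and the "x" form additionally overwrites the execution mask to predicate later vector ops. A 4-lane toy model (a real wavefront has 64 lanes; these semantics are my guess, not AMD documentation):

```python
# Toy model of per-lane vector compare and select, guessed semantics.
def v_cmp_lt_f32(a, b):
    """Per-lane a < b, packed into a bit mask (like a VCC register)."""
    return sum((1 << i) for i, (x, y) in enumerate(zip(a, b)) if x < y)

def v_cndmask_b32(a, b, mask):
    """Per-lane select: lane i gets b[i] if mask bit i is set, else a[i]."""
    return [b[i] if mask >> i & 1 else a[i] for i in range(len(a))]

a = [1.0, 5.0, 2.0, 8.0]
b = [4.0, 4.0, 4.0, 4.0]
vcc = v_cmp_lt_f32(a, b)       # lanes 0 and 2 pass -> 0b0101
exec_mask = vcc                # the extra step a cmpx would presumably do
result = v_cndmask_b32(a, b, vcc)
```

That would explain why every operator/datatype combination exists in both flavors: cmp feeds explicit per-lane selects like v_cndmask_b32, while cmpx short-circuits straight into predication.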

Edit:
The code tags don't get the tabs entirely correct; sometimes there is no space where there should be one. I've now put in a combination of spaces and tabs.

Should have posted it before the AFDS.

Last edited by Gipsel; 21-Jun-2011 at 17:10.
Old 21-Jun-2011, 15:52   #366
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,272

Score
Old 21-Jun-2011, 16:08   #367
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,451

The transcendentals are in the vector ops section. Implementation details could be interesting. There's no VLIW-exposed linking of 3 FMA units to get a result. It could still link up units on other SIMDs, which would complicate scheduling, since it would force a stall for that category on the other issue cycles.
Old 21-Jun-2011, 16:26   #368
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

Quote:
Originally Posted by 3dilettante View Post
The transcendentals are in the vector ops section. Implementation details could be interesting. There's no VLIW-exposed linking of 3 FMA units to get a result. It could still link up units on other SIMDs, which would complicate scheduling, since it would force a stall for that category on the other issue cycles.
I guess it will loop 3 times within a single ALU.
Old 21-Jun-2011, 17:43   #369
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,719

Hmm, lots of cmps indeed. Anyone know what they do? I mean, there's a full set of them for each operator (ne, lt and so on) on all datatypes (f32, f64, u32, u64, i32, i64), but what are cmp/cmpx/cmps/cmpsx doing (though the "s" versions are only for floats - maybe versions ignoring/not ignoring sign)? Also some of the operators are a little odd (o? tru?).
Old 21-Jun-2011, 17:49   #370
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,272

Compare element wise with scalar?
Old 21-Jun-2011, 19:49   #371
ECH
Member
 
Join Date: May 2007
Posts: 655

nvrm
Old 21-Jun-2011, 23:25   #372
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

Quote:
Originally Posted by rpg.314 View Post
Compare element wise with scalar?
I just tried to figure it out with some test code, but unfortunately the support in the driver isn't complete (i.e. functional) yet. It looks like AMD does not put that stuff into the public versions that early anymore. It keeps kicking me out with a missing-DLL message (it looks like they put the actual shader compiler for SI into a separate DLL for the time being).

Nevertheless, the other stuff for the disassembly appears to work. So I can already see the output mask for the new architecture (like the number of used scalar and vector registers; it is definitely for GCN). By the way, no new VLIW ASICs appeared as target IDs, just three for GCN. And we have only a single vacancy left in the middle of the VLIW IDs, which happens to be for the only VLIW4 ASIC besides Cayman. So place your bets on what that means.
Old 21-Jun-2011, 23:30   #373
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,919

Quote:
Originally Posted by Gipsel View Post
I just tried to figure it out with some test code, but unfortunately the support in the driver isn't complete (i.e. functional) yet. It looks like AMD does not put that stuff into the public versions that early anymore. It keeps kicking me out with a missing-DLL message (it looks like they put the actual shader compiler for SI into a separate DLL for the time being).

Nevertheless, the other stuff for the disassembly appears to work. So I can already see the output mask for the new architecture (like the number of used scalar and vector registers; it is definitely for GCN). By the way, no new VLIW ASICs appeared as target IDs, just three for GCN. And we have only a single vacancy left in the middle of the VLIW IDs, which happens to be for the only VLIW4 ASIC besides Cayman. So place your bets on what that means.
That should be the GPU in Trinity. So VLIW is definitely gone, I guess.
Old 21-Jun-2011, 23:38   #374
Gipsel
Senior Member
 
Join Date: Jan 2010
Location: Hamburg, Germany
Posts: 1,450

Quote:
Originally Posted by Alexko View Post
That should be the GPU in Trinity. So VLIW is definitely gone, I guess.
Exactly my guess.
Or AMD also does a straight shrink to 28nm for some GPUs. There is one example (RV740 in 40nm) which was so similar to its predecessor (the 55nm RV770) that it didn't get its own ID but shared it with RV770.
Old 22-Jun-2011, 01:24   #375
swaaye
Entirely Suboptimal
 
Join Date: Mar 2003
Location: WI, USA
Posts: 7,316

So some years ago DAAMIT sat down and decided VLIW was a lost cause for compute. Of course they had a few VLIW projects to finish up and talk up in the meantime.

It'll be fun to see comprehensive comparisons of Cayman and this new arch.

Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.