AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    The minimum needed to fill all ALUs, e.g. 2 warps for Fermi.
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    That would fill the ALUs for exactly 2 cycles in the worst case (and 8 cycles in the best case with high ILP) and leave them idle for the next 10 to 16 cycles. The bare minimum is actually between 6 warps with high ILP and 18 warps (the pipeline depth) for ILP=1. For GF104-style SMs those numbers are even higher.
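    The occupancy arithmetic above can be sketched with a toy model. This is an illustrative reconstruction of the reasoning in the post, not an official formula: it assumes each warp can keep `ilp` independent instructions in flight, so a warp can re-issue every `pipeline_depth / ilp` cycles. The function name and parameters are my own.

```python
from math import ceil

def min_warps_to_hide_latency(pipeline_depth: int, ilp: int) -> int:
    """Minimum resident warps needed to keep the ALUs busy.

    Toy model: a warp with `ilp` independent instructions can issue
    one every pipeline_depth / ilp cycles, so that many warps are
    needed to cover the ALU pipeline latency.
    """
    return ceil(pipeline_depth / ilp)

# Fermi-like numbers from the discussion: ~18-cycle ALU pipeline
print(min_warps_to_hide_latency(18, 1))  # 18 warps at ILP = 1
print(min_warps_to_hide_latency(18, 3))  # 6 warps with high ILP
```

    With these assumed numbers the model reproduces the 6-to-18-warp range quoted above.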
     
  3. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    In principle yes, but as control flow opens up a new clause anyway, it's a moot point to some extent, since clause switching is expensive if the clauses are shorter than 10 instructions or so.
    GCN's issue logic is simple in the respect that the type of the instruction and the instruction buffer it comes from determine the exact unit it will be executed on, and probably no dependencies between arithmetic instructions need to be tracked. There are simply not many decisions to make. With Fermi, on the other hand, the scheduler has to check many more dependencies between instructions in flight and determine which vector ALU or which SFU block (there are two in a GF104 SM) the instruction has to go to. Fermi can issue instructions to any of the 16-wide vector ALUs in the SM (very evident in GF104-style SMs and also for DP ops). This also complicates the operand collector and the result networks from the register files (which probably contributes significantly to the latency).
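    The "type determines the unit" point can be illustrated with a trivial dispatcher keyed on the mnemonic prefixes visible in the listings below. This is a toy sketch of the idea, not AMD's actual hardware logic; the mapping table and unit names are my own shorthand.

```python
# Toy model of GCN-style issue: the instruction's type alone picks
# the execution unit, so no cross-unit dependency tracking or
# port-selection logic (as in Fermi) is needed.
UNIT_BY_PREFIX = {
    "v_": "vector ALU (SIMD)",
    "s_": "scalar unit",
    "ds_": "local data share",
    "image_": "texture/vector memory",
    "buffer_": "vector memory",
}

def issue_unit(opcode: str) -> str:
    """Pick the execution unit purely from the mnemonic prefix."""
    for prefix, unit in UNIT_BY_PREFIX.items():
        if opcode.startswith(prefix):
            return unit
    return "unknown"

print(issue_unit("v_add_f32"))   # vector ALU (SIMD)
print(issue_unit("s_waitcnt"))   # scalar unit
```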
     
  4. Alexko

    Veteran

    Joined:
    Aug 31, 2009
    Messages:
    3,932
    I knew Fermi was fast, but faster than light?! :shock:



    Sorry, I'm gone now, you can resume normal discussions.
     
  5. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    If someone wants to look over the (I hope it's complete, but maybe I missed something to copy) instruction set of SI, have fun!

    Vector instructions:
    Code:
    v_cmpx_t_u64  	v_cmpx_ge_u64  	v_cmpx_ne_u64 	v_cmpx_gt_u64
    v_cmpx_le_u64 	v_cmpx_eq_u64 	v_cmpx_lt_u64 	v_cmpx_f_u64
    v_cmp_t_u64 	v_cmp_ge_u64 	v_cmp_ne_u64 	v_cmp_gt_u64
    v_cmp_le_u64 	v_cmp_eq_u64 	v_cmp_lt_u64 	v_cmp_f_u64
    v_cmpx_t_u32 	v_cmpx_ge_u32 	v_cmpx_ne_u32 	v_cmpx_gt_u32
    v_cmpx_le_u32 	v_cmpx_eq_u32 	v_cmpx_lt_u32 	v_cmpx_f_u32
    v_cmp_t_u32 	v_cmp_ge_u32 	v_cmp_ne_u32 	v_cmp_gt_u32
    v_cmp_le_u32 	v_cmp_eq_u32 	v_cmp_lt_u32 	v_cmp_f_u32
    v_cmpx_t_i64 	v_cmpx_ge_i64 	v_cmpx_ne_i64 	v_cmpx_gt_i64
    v_cmpx_le_i64 	v_cmpx_eq_i64 	v_cmpx_lt_i64 	v_cmpx_f_i64
    v_cmp_t_i64 	v_cmp_ge_i64 	v_cmp_ne_i64 	v_cmp_gt_i64
    v_cmp_le_i64 	v_cmp_eq_i64 	v_cmp_lt_i64 	v_cmp_f_i64
    v_cmpx_t_i32 	v_cmpx_ge_i32 	v_cmpx_ne_i32 	v_cmpx_gt_i32
    v_cmpx_le_i32 	v_cmpx_eq_i32 	v_cmpx_lt_i32 	v_cmpx_f_i32
    v_cmp_t_i32 	v_cmp_ge_i32 	v_cmp_ne_i32 	v_cmp_gt_i32
    v_cmp_le_i32 	v_cmp_eq_i32 	v_cmp_lt_i32 	v_cmp_f_i32
    v_cmpsx_tru_f64 	v_cmpsx_nlt_f64 	v_cmpsx_neq_f64 	v_cmpsx_nle_f64
    v_cmpsx_ngt_f64 	v_cmpsx_nlg_f64 	v_cmpsx_nge_f64 	v_cmpsx_u_f64
    v_cmpsx_o_f64 	v_cmpsx_ge_f64 	v_cmpsx_lg_f64 	v_cmpsx_gt_f64
    v_cmpsx_le_f64 	v_cmpsx_eq_f64 	v_cmpsx_lt_f64 	v_cmpsx_f_f64
    v_cmps_tru_f64 	v_cmps_nlt_f64 	v_cmps_neq_f64 	v_cmps_nle_f64
    v_cmps_ngt_f64 	v_cmps_nlg_f64 	v_cmps_nge_f64 	v_cmps_u_f64
    v_cmps_o_f64 	v_cmps_ge_f64 	v_cmps_lg_f64 	v_cmps_gt_f64
    v_cmps_le_f64 	v_cmps_eq_f64 	v_cmps_lt_f64 	v_cmps_f_f64
    v_cmpsx_tru_f32 	v_cmpsx_nlt_f32 	v_cmpsx_neq_f32 	v_cmpsx_nle_f32
    v_cmpsx_ngt_f32 	v_cmpsx_nlg_f32 	v_cmpsx_nge_f32 	v_cmpsx_u_f32
    v_cmpsx_o_f32 	v_cmpsx_ge_f32 	v_cmpsx_lg_f32 	v_cmpsx_gt_f32
    v_cmpsx_le_f32 	v_cmpsx_eq_f32 	v_cmpsx_lt_f32 	v_cmpsx_f_f32
    v_cmps_tru_f32 	v_cmps_nlt_f32 	v_cmps_neq_f32 	v_cmps_nle_f32
    v_cmps_ngt_f32 	v_cmps_nlg_f32 	v_cmps_nge_f32 	v_cmps_u_f32
    v_cmps_o_f32 	v_cmps_ge_f32 	v_cmps_lg_f32 	v_cmps_gt_f32
    v_cmps_le_f32 	v_cmps_eq_f32 	v_cmps_lt_f32 	v_cmps_f_f32
    v_cmpx_tru_f64 	v_cmpx_nlt_f64 	v_cmpx_neq_f64 	v_cmpx_nle_f64
    v_cmpx_ngt_f64 	v_cmpx_nlg_f64 	v_cmpx_nge_f64 	v_cmpx_u_f64
    v_cmpx_o_f64 	v_cmpx_ge_f64 	v_cmpx_lg_f64 	v_cmpx_gt_f64
    v_cmpx_le_f64 	v_cmpx_eq_f64 	v_cmpx_lt_f64 	v_cmpx_f_f64
    v_cmp_tru_f64 	v_cmp_nlt_f64 	v_cmp_neq_f64 	v_cmp_nle_f64
    v_cmp_ngt_f64 	v_cmp_nlg_f64 	v_cmp_nge_f64 	v_cmp_u_f64
    v_cmp_o_f64 	v_cmp_ge_f64 	v_cmp_lg_f64 	v_cmp_gt_f64
    v_cmp_le_f64 	v_cmp_eq_f64 	v_cmp_lt_f64 	v_cmp_f_f64
    v_cmpx_tru_f32 	v_cmpx_nlt_f32 	v_cmpx_neq_f32 	v_cmpx_nle_f32
    v_cmpx_ngt_f32 	v_cmpx_nlg_f32 	v_cmpx_nge_f32 	v_cmpx_u_f32
    v_cmpx_o_f32 	v_cmpx_ge_f32 	v_cmpx_lg_f32 	v_cmpx_gt_f32
    v_cmpx_le_f32 	v_cmpx_eq_f32 	v_cmpx_lt_f32 	v_cmpx_f_f32
    v_cmp_tru_f32 	v_cmp_nlt_f32 	v_cmp_neq_f32 	v_cmp_nle_f32
    v_cmp_ngt_f32 	v_cmp_nlg_f32 	v_cmp_nge_f32 	v_cmp_u_f32
    v_cmp_o_f32 	v_cmp_ge_f32 	v_cmp_lg_f32 	v_cmp_gt_f32
    v_cmp_le_f32 	v_cmp_eq_f32 	v_cmp_lt_f32 	v_cmp_f_f32
    v_sad_u16 	v_med3_i32 	v_rcp_f32 	v_sqrt_f64
    v_min_f64 	v_cvt_f32_f16 	v_floor_f32 	v_mul_lo_u32
    v_ldexp_f32 	v_movrels_b32 	v_ashr_i32 	v_cvt_f64_u32
    v_rsq_f64 	v_trunc_f32 	v_max_f32 	v_cvt_pknorm_i16_f32
    v_subrev_i32 	v_add_f32 	v_cubema_f32 	v_cvt_f32_ubyte0
    v_cvt_f32_ubyte1 	v_cvt_f32_ubyte2 	v_movreld_b32 	v_cvt_flr_i32_f32
    v_cmp_class_f64 	v_cmpx_class_f64 	v_max3_i32 	v_cmpx_i64
    v_cubeid_f32 	v_sad_u8 	v_cubetc_f32 	v_rcp_f64
    v_fma_f32 	v_rndne_f32 	v_cmp_f32 	v_cndmask_b32
    v_nop 	v_ldexp_f64 	v_bfi_b32 	v_cmpx_f64
    v_cvt_f32_ubyte3 	v_cos_f32 	v_cvt_f16_f32 	v_ceil_f32
    v_mad_i32_i24 	v_rcp_clamp_f64 	v_rsq_f32 	v_bcnt_u32_b32
    v_subb_u32 	v_fract_f64 	v_min3_f32 	v_mac_f32
    v_cmpx_u32 	v_mul_u32_u24 	v_mov_b32 	v_max3_u32
    v_bfe_i32 	v_ffbh_u32 	v_addc_u32 	v_cvt_f64_i32
    v_div_scale_f64 	v_madmk_f32 	v_mbcnt_hi_u32_b32 	v_cmp_i32
    v_sub_i32 	v_sub_f32 	v_sad_hi_u8 	v_max_i32
    v_writelane_b32 	v_bfm_b32 	v_ffbl_b32 	v_sqrt_f32
    v_min_f32 	v_med3_u32 	v_cvt_u32_f64 	v_cvt_f64_f32
    v_mac_legacy_f32 	v_interp_mov_f32 	v_mbcnt_lo_u32_b32 	v_rsq_clamp_f64
    v_or_b32 	v_ashr_i64 	v_readlane_b32 	v_min3_u32
    v_log_f32 	v_rsq_legacy_f32 	v_div_scale_f32 	v_madak_f32
    v_add_f64 	v_mul_f64 	v_lshrrev_b32 	v_cmpx_i32
    v_min_legacy_f32 	v_fma_f64 	v_min_i32 	v_cvt_pkrtz_f16_f32
    v_lshl_b32 	v_xor_b32 	v_and_b32 	v_cubesc_f32
    v_max_legacy_f32 	v_cvt_i32_f32 	v_cvt_i32_f64 	v_rsq_clamp_f32
    v_ffbh_i32 	v_cmpx_f32 	v_mul_i32_i24 	v_sad_u32
    v_exp_f32 	v_mul_f32 	v_movrelsd_b32 	v_frexp_mant_f64
    v_readfirstlane_b32 	v_cvt_off_f32_i4 	v_cvt_f32_u32 	v_bfrev_b32
    v_ashrrev_i32 	v_cvt_rpi_i32_f32 	v_mul_hi_i32_i24 	v_mad_legacy_f32
    v_lshr_b32 	v_cmpx_u64 	v_sin_f32 	v_add_i32
    v_mul_hi_u32 	v_lshl_b64 	v_fract_f32 	v_cmp_class_f32
    v_lerp_u8 	v_max_f64 	v_cvt_pk_u8_f32 	v_med3_f32
    v_min3_i32 	v_frexp_mant_f32 	v_rcp_clamp_f32 	v_cmp_f64
    v_cmp_u32 	v_mullit_f32 	v_mul_hi_u32_u24 	v_frexp_exp_i32_f64
    v_cvt_u32_f32 	v_cmp_i64 	v_max_u32 	v_not_b32
    v_min_u32 	v_mad_f32 	v_alignbyte_b32 	v_cvt_f32_f64
    v_lshr_b64 	v_subrev_f32 	v_mul_lo_i32 	v_log_clamp_f32
    v_rcp_legacy_f32 	v_subbrev_u32 	v_mad_u32_u24 	v_max3_f32
    v_cvt_f32_i32 	v_lshlrev_b32 	v_mul_hi_i32 	v_cmp_u64
    v_alignbit_b32 	v_bfe_u32 	v_interp_mac2_f32 	v_cmpx_class_f32
    v_frexp_exp_i32_f32 	v_mul_legacy_f32
    Scalar instructions:
    Code:
    s_cselect_b64 	s_wqm_b64 	s_lshl_b64 	s_bitset0_b64
    s_ashr_i32 	s_mul_i32 	s_bitcmp1_b64 	s_ff1_i32_b32
    s_flbit_i32_b32 	s_andn2_saveexec_b64 	s_xor_saveexec_b64 	s_nop
    s_mov_b32 	s_cbranch_i_fork 	s_nor_b64 	s_cbranch_execnz
    s_quadmask_b32 	s_or_saveexec_b64 	s_branch 	s_cmp_i32
    s_cmov_b64 	s_sendmsg 	s_getpc_b64 	s_rfe_b64
    s_endpgm 	s_cselect_b32 	s_addc_u32 	s_memtime
    s_bitcmp0_b64 	s_nor_b32 	s_min_i32 	s_bfe_i32
    s_ff1_i32_b64 	s_xor_b64 	s_andn2_b32 	s_nand_saveexec_b64
    s_setprio 	s_mov_b64 	s_bcnt0_i32_b64 	s_ashr_i64
    s_cmov_b32 	s_bcnt1_i32_b64 	s_nor_saveexec_b64 	s_xnor_b32
    s_or_b32 	s_brev_b64 	s_lshr_b64  s_xor_b32
    s_not_b32 	s_orn2_b64 	s_sext_i32_i16 	s_nand_b64
    s_cmp_u32 	s_lshl_b32 	s_bfe_u64 	s_max_u32
    s_min_u32 	s_movrels_b32 	s_brev_b32 	s_lshr_b32
    s_sext_i32_i8 	s_orn2_b32 	s_movreld_b64 	s_cbranch_join
    s_bitcmp0_b32 	s_cbranch_vccz 	s_flbit_i32_b64 	s_add_i32
    s_nand_b32 	s_setreg_b32 	s_xnor_saveexec_b64 	s_bcnt1_i32_b32
    s_quadmask_b64 	s_movreld_b32 	s_ff0_i32_b64 	s_and_b64
    s_barrier 	s_bitset1_b32 	s_flbit_i32_i64 	s_swappc_b64
    s_ff0_i32_b32 	s_setpc_b64 	s_waitcnt 	s_andn2_b64
    s_cbranch_vccnz 	s_and_saveexec_b64 	s_bcnt0_i32_b32 	s_bitcmp1_b32
    s_movrels_b64 	s_bfe_u32 	s_subb_u32 	s_and_b32
    s_max_i32 	s_bitset1_b64 	s_cbranch_scc0 	s_bfm_b64
    s_or_b64 	s_orn2_saveexec_b64 	s_wqm_b32 	s_bfm_b32
    s_xnor_b64 	s_bfe_i64 	s_getreg_b32 	s_sub_i32
    s_not_b64 	s_flbit_i32 	s_cbranch_scc1 	s_cbranch_execz
    s_bitset0_b32
    s_load_dword 	s_buffer_load_dword
    Vector memory instructions:
    Code:
    [B]Texture ops:[/B]
    image_get_lod 	image_sample_c_l_o 	image_sample_c_b_o 	image_sample_c_d_o
    image_atomic_rsub 	image_sample_d_cl_o 	image_sample_c_b_cl_o 	image_atomic_umin
    image_sample 	image_gather4_cl 	image_sample_b_cl 	image_gather4_c_cl
    image_sample_lz_o 	image_atomic_sub 	image_gather4 	image_sample_c_lz
    image_atomic_smin 	image_atomic_umax 	image_sample_lz 	image_atomic_add
    image_sample_o 	image_sample_l 	image_sample_c 	image_sample_b
    image_sample_d 	image_sample_cd 	image_sample_cl 	image_gather4_c_cl_o
    image_sample_cd_cl_o 	image_atomic_xor 	image_sample_d_o 	image_atomic_dec
    image_sample_c_cd_cl_o 	image_gather4_c_o 	image_sample_b_o 	image_atomic_cmpswap
    image_atomic_smax 	image_sample_l_o 	image_get_resinfo 	image_sample_c_b_cl
    image_sample_c_cd 	image_sample_c_cl 	image_atomic_and 	image_atomic_or
    image_atomic_inc 	image_sample_c_d_cl_o 	image_sample_c_d 	image_sample_c_b 	image_sample_c_o
    image_sample_c_l 	image_gather4_cl_o 	image_sample_cd_o 	image_load
    image_load_mip 	image_sample_cd_cl 	image_sample_b_cl_o 	image_sample_c_lz_o
    image_atomic_swap 	image_sample_cl_o 	image_store 	image_sample_d_cl
    image_sample_c_cl_o 	image_gather4_o 	image_gather4_c 	image_sample_c_d_cl
    image_sample_c_cd_o 	image_sample_c_cd_cl
    
    [B]Memory ops:[/B]
    buffer_atomic_cmpswap 	buffer_load_dwordx2 	buffer_store_format_xy 	buffer_load_sbyte
    buffer_load_format_x 	buffer_store_format_xyz 	buffer_atomic_smax_x2 	buffer_atomic_or_x2
    buffer_atomic_smin 	buffer_load_format_xyz 	buffer_load_format_xyzw 	buffer_atomic_add_x2
    buffer_store_dwordx2 	buffer_store_dwordx4 	buffer_atomic_xor 	buffer_store_dword
    buffer_atomic_cmpswap_x2 	buffer_atomic_umax_x2 	buffer_atomic_fmin 	buffer_atomic_fcmpswap_x2
    buffer_atomic_umin_x2 	buffer_atomic_umax 	buffer_atomic_xor_x2 	buffer_atomic_sub
    buffer_atomic_rsub 	buffer_load_dword 	buffer_load_ushort 	buffer_atomic_sub_x2
    buffer_atomic_fcmpswap 	buffer_load_dwordx4 	buffer_atomic_inc_x2 	buffer_load_format_xy
    buffer_atomic_fmax 	buffer_atomic_fmax_x2 	buffer_atomic_umin 	buffer_atomic_inc
    buffer_load_ubyte 	buffer_atomic_or 	buffer_store_format_x 	buffer_store_format_xyzw
    buffer_atomic_and 	buffer_store_short 	buffer_atomic_smin_x2 	buffer_store_byte
    buffer_load_sshort 	buffer_atomic_smax 	buffer_atomic_fmin_x2 	buffer_atomic_dec_x2
    buffer_atomic_add 	buffer_atomic_swap_x2 	buffer_atomic_and_x2 	buffer_atomic_dec
    buffer_atomic_rsub_x2 	buffer_atomic_swap
    tbuffer_store_format_xy 	tbuffer_store_format_x 	tbuffer_load_format_xy 	tbuffer_store_format_xyz
    tbuffer_store_format_xyzw 	tbuffer_load_format_x 	tbuffer_load_format_xyz 	tbuffer_load_format_xyzw
    Data share Instructions:
    Code:
    ds_read_i16	ds_sub_rtn_u32	ds_wrxchg_rtn_b64	ds_max_rtn_f64
    ds_cmpst_rtn_f64	ds_write_b8	ds_min_rtn_f64	ds_min_rtn_i32
    ds_wrxchg2_rtn_b32 	ds_max_rtn_f32	ds_read_u16	ds_inc_rtn_u64
    ds_write2st64_b32	ds_dec_rtn_u32	ds_min_f32	ds_dec_u64
    ds_consume	ds_min_rtn_f32	ds_gws_sema_br	ds_max_i32
    ds_read2st64_b64	ds_write_b64	ds_cmpst_b64	ds_add_rtn_u32
    ds_gws_init	ds_min_rtn_i64	ds_wrxchg2st64_rtn_b64	ds_wrxchg2_rtn_b64
    ds_min_rtn_u64	ds_min_u32	ds_mskor_b64	ds_sub_u64
    ds_dec_rtn_u64	ds_dec_u32	ds_max_f32	ds_read2st64_b32
    ds_write_b32	ds_cmpst_rtn_f32	ds_sub_rtn_u64	ds_min_f64
    ds_read_i8	ds_swizzle_b32	ds_and_b64	ds_or_rtn_b32
    ds_min_i64	ds_write2_b64	ds_max_rtn_i32	ds_xor_b64
    ds_and_rtn_b64	ds_write2st64_b64	ds_read_b32	ds_cmpst_rtn_b32
    ds_gws_barrier	ds_or_b64 	ds_read2_b32	ds_add_u32
    ds_cmpst_b32	ds_and_rtn_b32	ds_append	 	ds_min_i32
    ds_xor_rtn_b32	ds_write2_b32	ds_wrxchg2st64_rtn_b32	ds_sub_u32
    ds_cmpst_rtn_b64	ds_cmpst_f64	ds_max_f64	ds_or_b32
    ds_max_rtn_u32	ds_write_b16	ds_ordered_count	ds_max_u64
    ds_gws_sema_p	ds_gws_sema_v	ds_read_u8	ds_rsub_rtn_u32
    ds_rsub_u64	ds_max_i64	ds_inc_u64
    ds_mskor_rtn_b64	ds_add_rtn_u64	ds_and_b32	ds_xor_rtn_b64
    ds_wrxchg_rtn_b32	ds_or_rtn_b64	ds_min_rtn_u32	ds_min_u64
    ds_mskor_b32	ds_cmpst_f32	ds_max_rtn_u64	ds_max_u32
    ds_max_rtn_i64	ds_rsub_rtn_u64	ds_rsub_u32	ds_read_b64
    ds_inc_u32	ds_mskor_rtn_b32	ds_inc_rtn_u32	ds_read2_b64
    ds_add_u64	ds_xor_b32
    Other instructions (Internal ones? Some branch instructions?):
    Code:
    sys_input 	init_opnd 	sc_opnd_table 	sc_op_unknown
    merge 	mem_merge 	killz 	killnz 	phi
    undefined
    if_wv_i32 	if_wv_bit1 	if_wv_bit0 	if_wv_f32
    if_th_bit0 	if_th_bit1 	if_wv_u32
    callrtn	tabjmp
    There appear to be quite a few atomic (even for images/textures) and data share instructions, as well as an awful lot of comparison instructions(?!?). Can't offer you documentation though, sorry :oops:

    Edit:
    The code tags don't get the tabs entirely correct; sometimes there is no space where there should be one. I've now put in a combination of spaces and tabs. :roll:

    Should have posted it before the AFDS.
     
    #365 Gipsel, Jun 21, 2011
    Last edited by a moderator: Jun 21, 2011
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
  7. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,756
    Location:
    Well within 3d
    The transcendentals are in the vector ops section. Implementation details could be interesting. There's no VLIW-exposed linking of 3 FMA units to get a result. It could still link up units on other SIMDs, which would complicate scheduling, since it would force a stall for that category on the other issue cycles.
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I guess it will loop 3 times within a single ALU.
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Hmm, lots of cmps indeed. Anyone know what they do? I mean, there's a full set of them for each operator (ne, lt and so on) on all datatypes (f32, f64, u32, u64, i32, i64), but what are cmp/cmpx/cmps/cmpsx doing (the "s" versions are only for floats; maybe versions ignoring/not ignoring sign)? Also some of the operators are a little odd (o? tru?).
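    For what it's worth, a plausible reading of the cmp/cmpx split (as later documented for GCN: v_cmp_* writes a per-lane result mask, while the _x variants additionally overwrite the EXEC mask, disabling failing lanes) can be sketched like this. The function names and the 3-lane example are my own; this is a hedged model, not the hardware spec.

```python
# Toy per-lane compare for a wavefront: v_cmp_* produces a lane
# bitmask (think VCC); v_cmpx_* also replaces the EXEC mask so
# only passing, previously active lanes keep executing.
def v_cmp_lt_u32(a, b):
    """Return a bitmask with bit i set where a[i] < b[i]."""
    vcc = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        if x < y:
            vcc |= 1 << lane
    return vcc

def v_cmpx_lt_u32(a, b, exec_mask):
    """Like v_cmp_lt_u32, but also compute the new EXEC mask."""
    vcc = v_cmp_lt_u32(a, b)
    return vcc, vcc & exec_mask  # new EXEC: passing AND active lanes

vcc, new_exec = v_cmpx_lt_u32([1, 5, 2], [3, 3, 3], exec_mask=0b111)
print(bin(vcc), bin(new_exec))  # 0b101 0b101
```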
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Compare element-wise with a scalar?
     
  11. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    655
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I just tried to figure it out with some test code, but unfortunately the support in the driver isn't complete (i.e. functional) yet. Looks like AMD doesn't put that stuff in the public versions that early anymore. It keeps kicking me out with a missing-DLL message (looks like they put the actual shader compiler for SI in a separate DLL for the time being).

    Nevertheless, the other stuff for the disassembly appears to work, so I can already see the output mask for the new architecture (like the number of used scalar and vector registers; it is definitely for GCN). By the way, no new VLIW ASICs appeared as target IDs, just three for GCN. And we have only a single vacancy left in the middle of the VLIW IDs, which happens to be the only VLIW4 ASIC besides Cayman. So place your bets what that means. :grin:
     
  13. Alexko

    Veteran

    Joined:
    Aug 31, 2009
    Messages:
    3,932
    That should be the GPU in Trinity. So VLIW is definitely gone, I guess.
     
  14. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    Exactly my guess.
    Or AMD also does a straight shrink to 28nm for some GPUs. There is one precedent (the 40nm RV740): it behaved so similarly to its predecessor (the 55nm RV770) that it didn't get its own ID but shared it with RV770.
     
  15. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    7,872
    Location:
    WI, USA
    So some years ago DAAMIT sat down and decided VLIW was a lost cause for compute. Of course they had a few VLIW projects to finish up and talk up in the meantime.

    It'll be fun to see comprehensive comparisons of Cayman and this new arch.
     
  16. itsmydamnation

    Regular

    Joined:
    Apr 29, 2007
    Messages:
    923
    Location:
    Australia
    Depends if it's evolution or revolution; there have been quite a few changes to the ALUs between R600/RV770/Evergreen/Cypress. Until we have a better idea about how things fit together exactly, we can't really tell if it's just the next step along the path or something completely new.

    It could also be that they predicted they could initially get better performance for things like compute shaders with a less flexible but small-sized, high-ALU-count VLIW design, but as process sizes shrink and the complexity of code and of scale increases, more powerful and flexible ALUs make more sense.

    Kind of the opposite of R580/R600, whose strengths appear to have been ahead of their time (excluding all the brokenness of R600).
     
  17. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,756
    Location:
    Well within 3d
    How's about we start begging for a die shot now?
    I'm starting to think the RV770 one was only released because some guy at ATI got drunk and accidentally sexted what he thought was a picture of his junk.

    I'm thinking this design would look interesting in a side-by-side comparison.
     
  18. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    730
  19. Harison

    Newcomer

    Joined:
    Mar 29, 2010
    Messages:
    195
  20. Harison

    Newcomer

    Joined:
    Mar 29, 2010
    Messages:
    195
    While it will definitely be an interesting comparison, I think it will take a few generations to get the new arch up to full speed. Nvidia failed pretty badly with the first incarnation of Fermi, and GCN is even more advanced. At least it's good that AMD is already working hard with Microsoft and other devs to get tools ready; we'll see how mature they are when GCN reaches the market.
     
