AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    The minimum needed to fill all ALUs, e.g. 2 warps for Fermi.
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    That would fill the ALUs for exactly 2 cycles in the worst case (and 8 cycles in the best case with high ILP) and leave them idle for the next 10 to 16 cycles. The bare minimum is actually between 6 warps with high ILP and 18 warps (the pipeline depth) for ILP=1. For GF104-style SMs those numbers are even higher.
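    The occupancy arithmetic above can be sketched with a toy model. This is an illustrative reconstruction of the reasoning in the post, not an official formula: it assumes each warp can keep `ilp` independent instructions in flight, so a warp can re-issue every `pipeline_depth / ilp` cycles. The function name and parameters are my own.

```python
from math import ceil

def min_warps_to_hide_latency(pipeline_depth: int, ilp: int) -> int:
    """Minimum resident warps needed to keep the ALUs busy.

    Toy model: a warp with `ilp` independent instructions can issue
    one every pipeline_depth / ilp cycles, so that many warps are
    needed to cover the ALU pipeline latency.
    """
    return ceil(pipeline_depth / ilp)

# Fermi-like numbers from the discussion: ~18-cycle ALU pipeline
print(min_warps_to_hide_latency(18, 1))  # 18 warps at ILP = 1
print(min_warps_to_hide_latency(18, 3))  # 6 warps with high ILP
```

    With these assumed numbers the model reproduces the 6-to-18-warp range quoted above.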
     
  3. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    In principle yes, but as control flow opens up a new clause anyway, it's a moot point to some extent, since clause switching is expensive if the clauses are shorter than 10 instructions or so.
    GCN's issue logic is simple in the respect that the type of the instruction and the instruction buffer it comes from determine the exact unit it will be executed on, and probably no dependencies between arithmetic instructions need to be tracked. There are simply not many decisions to make. With Fermi, on the other hand, the scheduler has to check many more dependencies between instructions in flight and determine which vector ALU or which SFU block (there are two in a GF104 SM) the instruction has to go to. Fermi can issue instructions to any of the 16-wide vector ALUs in the SM (very evident in GF104-style SMs and also for DP ops). This also complicates the operand collector and the result networks from the register files (which probably contributes significantly to the latency).
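    The "type determines the unit" point can be illustrated with a trivial dispatcher keyed on the mnemonic prefixes visible in the listings below. This is a toy sketch of the idea, not AMD's actual hardware logic; the mapping table and unit names are my own shorthand.

```python
# Toy model of GCN-style issue: the instruction's type alone picks
# the execution unit, so no cross-unit dependency tracking or
# port-selection logic (as in Fermi) is needed.
UNIT_BY_PREFIX = {
    "v_": "vector ALU (SIMD)",
    "s_": "scalar unit",
    "ds_": "local data share",
    "image_": "texture/vector memory",
    "buffer_": "vector memory",
}

def issue_unit(opcode: str) -> str:
    """Pick the execution unit purely from the mnemonic prefix."""
    for prefix, unit in UNIT_BY_PREFIX.items():
        if opcode.startswith(prefix):
            return unit
    return "unknown"

print(issue_unit("v_add_f32"))   # vector ALU (SIMD)
print(issue_unit("s_waitcnt"))   # scalar unit
```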
     
  4. Alexko

    Veteran

    Joined:
    Aug 31, 2009
    Messages:
    3,932
    I knew Fermi was fast, but faster than light?! :shock:



    Sorry, I'm gone now, you can resume normal discussions.
     
  5. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    If someone wants to look over the (I hope it's complete, but maybe I missed something to copy) instruction set of SI, have fun!

    Vector instructions:
    Code:
    v_cmpx_t_u64  	v_cmpx_ge_u64  	v_cmpx_ne_u64 	v_cmpx_gt_u64
    v_cmpx_le_u64 	v_cmpx_eq_u64 	v_cmpx_lt_u64 	v_cmpx_f_u64
    v_cmp_t_u64 	v_cmp_ge_u64 	v_cmp_ne_u64 	v_cmp_gt_u64
    v_cmp_le_u64 	v_cmp_eq_u64 	v_cmp_lt_u64 	v_cmp_f_u64
    v_cmpx_t_u32 	v_cmpx_ge_u32 	v_cmpx_ne_u32 	v_cmpx_gt_u32
    v_cmpx_le_u32 	v_cmpx_eq_u32 	v_cmpx_lt_u32 	v_cmpx_f_u32
    v_cmp_t_u32 	v_cmp_ge_u32 	v_cmp_ne_u32 	v_cmp_gt_u32
    v_cmp_le_u32 	v_cmp_eq_u32 	v_cmp_lt_u32 	v_cmp_f_u32
    v_cmpx_t_i64 	v_cmpx_ge_i64 	v_cmpx_ne_i64 	v_cmpx_gt_i64
    v_cmpx_le_i64 	v_cmpx_eq_i64 	v_cmpx_lt_i64 	v_cmpx_f_i64
    v_cmp_t_i64 	v_cmp_ge_i64 	v_cmp_ne_i64 	v_cmp_gt_i64
    v_cmp_le_i64 	v_cmp_eq_i64 	v_cmp_lt_i64 	v_cmp_f_i64
    v_cmpx_t_i32 	v_cmpx_ge_i32 	v_cmpx_ne_i32 	v_cmpx_gt_i32
    v_cmpx_le_i32 	v_cmpx_eq_i32 	v_cmpx_lt_i32 	v_cmpx_f_i32
    v_cmp_t_i32 	v_cmp_ge_i32 	v_cmp_ne_i32 	v_cmp_gt_i32
    v_cmp_le_i32 	v_cmp_eq_i32 	v_cmp_lt_i32 	v_cmp_f_i32
    v_cmpsx_tru_f64 	v_cmpsx_nlt_f64 	v_cmpsx_neq_f64 	v_cmpsx_nle_f64
    v_cmpsx_ngt_f64 	v_cmpsx_nlg_f64 	v_cmpsx_nge_f64 	v_cmpsx_u_f64
    v_cmpsx_o_f64 	v_cmpsx_ge_f64 	v_cmpsx_lg_f64 	v_cmpsx_gt_f64
    v_cmpsx_le_f64 	v_cmpsx_eq_f64 	v_cmpsx_lt_f64 	v_cmpsx_f_f64
    v_cmps_tru_f64 	v_cmps_nlt_f64 	v_cmps_neq_f64 	v_cmps_nle_f64
    v_cmps_ngt_f64 	v_cmps_nlg_f64 	v_cmps_nge_f64 	v_cmps_u_f64
    v_cmps_o_f64 	v_cmps_ge_f64 	v_cmps_lg_f64 	v_cmps_gt_f64
    v_cmps_le_f64 	v_cmps_eq_f64 	v_cmps_lt_f64 	v_cmps_f_f64
    v_cmpsx_tru_f32 	v_cmpsx_nlt_f32 	v_cmpsx_neq_f32 	v_cmpsx_nle_f32
    v_cmpsx_ngt_f32 	v_cmpsx_nlg_f32 	v_cmpsx_nge_f32 	v_cmpsx_u_f32
    v_cmpsx_o_f32 	v_cmpsx_ge_f32 	v_cmpsx_lg_f32 	v_cmpsx_gt_f32
    v_cmpsx_le_f32 	v_cmpsx_eq_f32 	v_cmpsx_lt_f32 	v_cmpsx_f_f32
    v_cmps_tru_f32 	v_cmps_nlt_f32 	v_cmps_neq_f32 	v_cmps_nle_f32
    v_cmps_ngt_f32 	v_cmps_nlg_f32 	v_cmps_nge_f32 	v_cmps_u_f32
    v_cmps_o_f32 	v_cmps_ge_f32 	v_cmps_lg_f32 	v_cmps_gt_f32
    v_cmps_le_f32 	v_cmps_eq_f32 	v_cmps_lt_f32 	v_cmps_f_f32
    v_cmpx_tru_f64 	v_cmpx_nlt_f64 	v_cmpx_neq_f64 	v_cmpx_nle_f64
    v_cmpx_ngt_f64 	v_cmpx_nlg_f64 	v_cmpx_nge_f64 	v_cmpx_u_f64
    v_cmpx_o_f64 	v_cmpx_ge_f64 	v_cmpx_lg_f64 	v_cmpx_gt_f64
    v_cmpx_le_f64 	v_cmpx_eq_f64 	v_cmpx_lt_f64 	v_cmpx_f_f64
    v_cmp_tru_f64 	v_cmp_nlt_f64 	v_cmp_neq_f64 	v_cmp_nle_f64
    v_cmp_ngt_f64 	v_cmp_nlg_f64 	v_cmp_nge_f64 	v_cmp_u_f64
    v_cmp_o_f64 	v_cmp_ge_f64 	v_cmp_lg_f64 	v_cmp_gt_f64
    v_cmp_le_f64 	v_cmp_eq_f64 	v_cmp_lt_f64 	v_cmp_f_f64
    v_cmpx_tru_f32 	v_cmpx_nlt_f32 	v_cmpx_neq_f32 	v_cmpx_nle_f32
    v_cmpx_ngt_f32 	v_cmpx_nlg_f32 	v_cmpx_nge_f32 	v_cmpx_u_f32
    v_cmpx_o_f32 	v_cmpx_ge_f32 	v_cmpx_lg_f32 	v_cmpx_gt_f32
    v_cmpx_le_f32 	v_cmpx_eq_f32 	v_cmpx_lt_f32 	v_cmpx_f_f32
    v_cmp_tru_f32 	v_cmp_nlt_f32 	v_cmp_neq_f32 	v_cmp_nle_f32
    v_cmp_ngt_f32 	v_cmp_nlg_f32 	v_cmp_nge_f32 	v_cmp_u_f32
    v_cmp_o_f32 	v_cmp_ge_f32 	v_cmp_lg_f32 	v_cmp_gt_f32
    v_cmp_le_f32 	v_cmp_eq_f32 	v_cmp_lt_f32 	v_cmp_f_f32
    v_sad_u16 	v_med3_i32 	v_rcp_f32 	v_sqrt_f64
    v_min_f64 	v_cvt_f32_f16 	v_floor_f32 	v_mul_lo_u32
    v_ldexp_f32 	v_movrels_b32 	v_ashr_i32 	v_cvt_f64_u32
    v_rsq_f64 	v_trunc_f32 	v_max_f32 	v_cvt_pknorm_i16_f32
    v_subrev_i32 	v_add_f32 	v_cubema_f32 	v_cvt_f32_ubyte0
    v_cvt_f32_ubyte1 	v_cvt_f32_ubyte2 	v_movreld_b32 	v_cvt_flr_i32_f32
    v_cmp_class_f64 	v_cmpx_class_f64 	v_max3_i32 	v_cmpx_i64
    v_cubeid_f32 	v_sad_u8 	v_cubetc_f32 	v_rcp_f64
    v_fma_f32 	v_rndne_f32 	v_cmp_f32 	v_cndmask_b32
    v_nop 	v_ldexp_f64 	v_bfi_b32 	v_cmpx_f64
    v_cvt_f32_ubyte3 	v_cos_f32 	v_cvt_f16_f32 	v_ceil_f32
    v_mad_i32_i24 	v_rcp_clamp_f64 	v_rsq_f32 	v_bcnt_u32_b32
    v_subb_u32 	v_fract_f64 	v_min3_f32 	v_mac_f32
    v_cmpx_u32 	v_mul_u32_u24 	v_mov_b32 	v_max3_u32
    v_bfe_i32 	v_ffbh_u32 	v_addc_u32 	v_cvt_f64_i32
    v_div_scale_f64 	v_madmk_f32 	v_mbcnt_hi_u32_b32 	v_cmp_i32
    v_sub_i32 	v_sub_f32 	v_sad_hi_u8 	v_max_i32
    v_writelane_b32 	v_bfm_b32 	v_ffbl_b32 	v_sqrt_f32
    v_min_f32 	v_med3_u32 	v_cvt_u32_f64 	v_cvt_f64_f32
    v_mac_legacy_f32 	v_interp_mov_f32 	v_mbcnt_lo_u32_b32 	v_rsq_clamp_f64
    v_or_b32 	v_ashr_i64 	v_readlane_b32 	v_min3_u32
    v_log_f32 	v_rsq_legacy_f32 	v_div_scale_f32 	v_madak_f32
    v_add_f64 	v_mul_f64 	v_lshrrev_b32 	v_cmpx_i32
    v_min_legacy_f32 	v_fma_f64 	v_min_i32 	v_cvt_pkrtz_f16_f32
    v_lshl_b32 	v_xor_b32 	v_and_b32 	v_cubesc_f32
    v_max_legacy_f32 	v_cvt_i32_f32 	v_cvt_i32_f64 	v_rsq_clamp_f32
    v_ffbh_i32 	v_cmpx_f32 	v_mul_i32_i24 	v_sad_u32
    v_exp_f32 	v_mul_f32 	v_movrelsd_b32 	v_frexp_mant_f64
    v_readfirstlane_b32 	v_cvt_off_f32_i4 	v_cvt_f32_u32 	v_bfrev_b32
    v_ashrrev_i32 	v_cvt_rpi_i32_f32 	v_mul_hi_i32_i24 	v_mad_legacy_f32
    v_lshr_b32 	v_cmpx_u64 	v_sin_f32 	v_add_i32
    v_mul_hi_u32 	v_lshl_b64 	v_fract_f32 	v_cmp_class_f32
    v_lerp_u8 	v_max_f64 	v_cvt_pk_u8_f32 	v_med3_f32
    v_min3_i32 	v_frexp_mant_f32 	v_rcp_clamp_f32 	v_cmp_f64
    v_cmp_u32 	v_mullit_f32 	v_mul_hi_u32_u24 	v_frexp_exp_i32_f64
    v_cvt_u32_f32 	v_cmp_i64 	v_max_u32 	v_not_b32
    v_min_u32 	v_mad_f32 	v_alignbyte_b32 	v_cvt_f32_f64
    v_lshr_b64 	v_subrev_f32 	v_mul_lo_i32 	v_log_clamp_f32
    v_rcp_legacy_f32 	v_subbrev_u32 	v_mad_u32_u24 	v_max3_f32
    v_cvt_f32_i32 	v_lshlrev_b32 	v_mul_hi_i32 	v_cmp_u64
    v_alignbit_b32 	v_bfe_u32 	v_interp_mac2_f32 	v_cmpx_class_f32
    v_frexp_exp_i32_f32 	v_mul_legacy_f32
    Scalar instructions:
    Code:
    s_cselect_b64 	s_wqm_b64 	s_lshl_b64 	s_bitset0_b64
    s_ashr_i32 	s_mul_i32 	s_bitcmp1_b64 	s_ff1_i32_b32
    s_flbit_i32_b32 	s_andn2_saveexec_b64 	s_xor_saveexec_b64 	s_nop
    s_mov_b32 	s_cbranch_i_fork 	s_nor_b64 	s_cbranch_execnz
    s_quadmask_b32 	s_or_saveexec_b64 	s_branch 	s_cmp_i32
    s_cmov_b64 	s_sendmsg 	s_getpc_b64 	s_rfe_b64
    s_endpgm 	s_cselect_b32 	s_addc_u32 	s_memtime
    s_bitcmp0_b64 	s_nor_b32 	s_min_i32 	s_bfe_i32
    s_ff1_i32_b64 	s_xor_b64 	s_andn2_b32 	s_nand_saveexec_b64
    s_setprio 	s_mov_b64 	s_bcnt0_i32_b64 	s_ashr_i64
    s_cmov_b32 	s_bcnt1_i32_b64 	s_nor_saveexec_b64 	s_xnor_b32
    s_or_b32 	s_brev_b64 	s_lshr_b64  s_xor_b32
    s_not_b32 	s_orn2_b64 	s_sext_i32_i16 	s_nand_b64
    s_cmp_u32 	s_lshl_b32 	s_bfe_u64 	s_max_u32
    s_min_u32 	s_movrels_b32 	s_brev_b32 	s_lshr_b32
    s_sext_i32_i8 	s_orn2_b32 	s_movreld_b64 	s_cbranch_join
    s_bitcmp0_b32 	s_cbranch_vccz 	s_flbit_i32_b64 	s_add_i32
    s_nand_b32 	s_setreg_b32 	s_xnor_saveexec_b64 	s_bcnt1_i32_b32
    s_quadmask_b64 	s_movreld_b32 	s_ff0_i32_b64 	s_and_b64
    s_barrier 	s_bitset1_b32 	s_flbit_i32_i64 	s_swappc_b64
    s_ff0_i32_b32 	s_setpc_b64 	s_waitcnt 	s_andn2_b64
    s_cbranch_vccnz 	s_and_saveexec_b64 	s_bcnt0_i32_b32 	s_bitcmp1_b32
    s_movrels_b64 	s_bfe_u32 	s_subb_u32 	s_and_b32
    s_max_i32 	s_bitset1_b64 	s_cbranch_scc0 	s_bfm_b64
    s_or_b64 	s_orn2_saveexec_b64 	s_wqm_b32 	s_bfm_b32
    s_xnor_b64 	s_bfe_i64 	s_getreg_b32 	s_sub_i32
    s_not_b64 	s_flbit_i32 	s_cbranch_scc1 	s_cbranch_execz
    s_bitset0_b32
    s_load_dword 	s_buffer_load_dword
    Vector memory instructions:
    Code:
    [B]Texture ops:[/B]
    image_get_lod 	image_sample_c_l_o 	image_sample_c_b_o 	image_sample_c_d_o
    image_atomic_rsub 	image_sample_d_cl_o 	image_sample_c_b_cl_o 	image_atomic_umin
    image_sample 	image_gather4_cl 	image_sample_b_cl 	image_gather4_c_cl
    image_sample_lz_o 	image_atomic_sub 	image_gather4 	image_sample_c_lz
    image_atomic_smin 	image_atomic_umax 	image_sample_lz 	image_atomic_add
    image_sample_o 	image_sample_l 	image_sample_c 	image_sample_b
    image_sample_d 	image_sample_cd 	image_sample_cl 	image_gather4_c_cl_o
    image_sample_cd_cl_o 	image_atomic_xor 	image_sample_d_o 	image_atomic_dec
    image_sample_c_cd_cl_o 	image_gather4_c_o 	image_sample_b_o 	image_atomic_cmpswap
    image_atomic_smax 	image_sample_l_o 	image_get_resinfo 	image_sample_c_b_cl
    image_sample_c_cd 	image_sample_c_cl 	image_atomic_and 	image_atomic_or
    image_atomic_inc 	image_sample_c_d_cl_o 	image_sample_c_d 	image_sample_c_b 	image_sample_c_o
    image_sample_c_l 	image_gather4_cl_o 	image_sample_cd_o 	image_load
    image_load_mip 	image_sample_cd_cl 	image_sample_b_cl_o 	image_sample_c_lz_o
    image_atomic_swap 	image_sample_cl_o 	image_store 	image_sample_d_cl
    image_sample_c_cl_o 	image_gather4_o 	image_gather4_c 	image_sample_c_d_cl
    image_sample_c_cd_o 	image_sample_c_cd_cl
    
    [B]Memory ops:[/B]
    buffer_atomic_cmpswap 	buffer_load_dwordx2 	buffer_store_format_xy 	buffer_load_sbyte
    buffer_load_format_x 	buffer_store_format_xyz 	buffer_atomic_smax_x2 	buffer_atomic_or_x2
    buffer_atomic_smin 	buffer_load_format_xyz 	buffer_load_format_xyzw 	buffer_atomic_add_x2
    buffer_store_dwordx2 	buffer_store_dwordx4 	buffer_atomic_xor 	buffer_store_dword
    buffer_atomic_cmpswap_x2 	buffer_atomic_umax_x2 	buffer_atomic_fmin 	buffer_atomic_fcmpswap_x2
    buffer_atomic_umin_x2 	buffer_atomic_umax 	buffer_atomic_xor_x2 	buffer_atomic_sub
    buffer_atomic_rsub 	buffer_load_dword 	buffer_load_ushort 	buffer_atomic_sub_x2
    buffer_atomic_fcmpswap 	buffer_load_dwordx4 	buffer_atomic_inc_x2 	buffer_load_format_xy
    buffer_atomic_fmax 	buffer_atomic_fmax_x2 	buffer_atomic_umin 	buffer_atomic_inc
    buffer_load_ubyte 	buffer_atomic_or 	buffer_store_format_x 	buffer_store_format_xyzw
    buffer_atomic_and 	buffer_store_short 	buffer_atomic_smin_x2 	buffer_store_byte
    buffer_load_sshort 	buffer_atomic_smax 	buffer_atomic_fmin_x2 	buffer_atomic_dec_x2
    buffer_atomic_add 	buffer_atomic_swap_x2 	buffer_atomic_and_x2 	buffer_atomic_dec
    buffer_atomic_rsub_x2 	buffer_atomic_swap
    tbuffer_store_format_xy 	tbuffer_store_format_x 	tbuffer_load_format_xy 	tbuffer_store_format_xyz
    tbuffer_store_format_xyzw 	tbuffer_load_format_x 	tbuffer_load_format_xyz 	tbuffer_load_format_xyzw
    Data share Instructions:
    Code:
    ds_read_i16	ds_sub_rtn_u32	ds_wrxchg_rtn_b64	ds_max_rtn_f64
    ds_cmpst_rtn_f64	ds_write_b8	ds_min_rtn_f64	ds_min_rtn_i32
    ds_wrxchg2_rtn_b32 	ds_max_rtn_f32	ds_read_u16	ds_inc_rtn_u64
    ds_write2st64_b32	ds_dec_rtn_u32	ds_min_f32	ds_dec_u64
    ds_consume	ds_min_rtn_f32	ds_gws_sema_br	ds_max_i32
    ds_read2st64_b64	ds_write_b64	ds_cmpst_b64	ds_add_rtn_u32
    ds_gws_init	ds_min_rtn_i64	ds_wrxchg2st64_rtn_b64	ds_wrxchg2_rtn_b64
    ds_min_rtn_u64	ds_min_u32	ds_mskor_b64	ds_sub_u64
    ds_dec_rtn_u64	ds_dec_u32	ds_max_f32	ds_read2st64_b32
    ds_write_b32	ds_cmpst_rtn_f32	ds_sub_rtn_u64	ds_min_f64
    ds_read_i8	ds_swizzle_b32	ds_and_b64	ds_or_rtn_b32
    ds_min_i64	ds_write2_b64	ds_max_rtn_i32	ds_xor_b64
    ds_and_rtn_b64	ds_write2st64_b64	ds_read_b32	ds_cmpst_rtn_b32
    ds_gws_barrier	ds_or_b64 	ds_read2_b32	ds_add_u32
    ds_cmpst_b32	ds_and_rtn_b32	ds_append	 	ds_min_i32
    ds_xor_rtn_b32	ds_write2_b32	ds_wrxchg2st64_rtn_b32	ds_sub_u32
    ds_cmpst_rtn_b64	ds_cmpst_f64	ds_max_f64	ds_or_b32
    ds_max_rtn_u32	ds_write_b16	ds_ordered_count	ds_max_u64
    ds_gws_sema_p	ds_gws_sema_v	ds_read_u8	ds_rsub_rtn_u32
    ds_rsub_u64	ds_max_i64	ds_inc_u64
    ds_mskor_rtn_b64	ds_add_rtn_u64	ds_and_b32	ds_xor_rtn_b64
    ds_wrxchg_rtn_b32	ds_or_rtn_b64	ds_min_rtn_u32	ds_min_u64
    ds_mskor_b32	ds_cmpst_f32	ds_max_rtn_u64	ds_max_u32
    ds_max_rtn_i64	ds_rsub_rtn_u64	ds_rsub_u32	ds_read_b64
    ds_inc_u32	ds_mskor_rtn_b32	ds_inc_rtn_u32	ds_read2_b64
    ds_add_u64	ds_xor_b32
    Other instructions (Internal ones? Some branch instructions?):
    Code:
    sys_input 	init_opnd 	sc_opnd_table 	sc_op_unknown
    merge 	mem_merge 	killz 	killnz 	phi
    undefined
    if_wv_i32 	if_wv_bit1 	if_wv_bit0 	if_wv_f32
    if_th_bit0 	if_th_bit1 	if_wv_u32
    callrtn	tabjmp
    There appear to be quite a few atomic (even for images/textures) and data share instructions, as well as an awful lot of comparison instructions(?!?). Can't offer you documentation though, sorry :oops:

    Edit:
    The code tags don't get the tabs entirely correct; sometimes there is no space where there should be one. I've now put in a combination of spaces and tabs. :roll:

    Should have posted it before the AFDS.
     
    #365 Gipsel, Jun 21, 2011
    Last edited by a moderator: Jun 21, 2011
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
  7. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,756
    Location:
    Well within 3d
    The transcendentals are in the vector ops section. Implementation details could be interesting. There's no VLIW-exposed linking of 3 FMA units to get a result. It could still link up units on other SIMDs, which would complicate scheduling, since it would force a stall for that category on the other issue cycles.
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I guess it will loop 3 times within a single ALU.
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,909
    Hmm, lots of cmps indeed. Anyone know what they do? I mean, there's a full set of them for each operator (ne, lt and so on) on all datatypes (f32, f64, u32, u64, i32, i64), but what are cmp/cmpx/cmps/cmpsx doing (the "s" versions are only for floats; maybe versions ignoring/not ignoring sign)? Also some of the operators are a little odd (o? tru?).
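    For what it's worth, a plausible reading of the cmp/cmpx split (as later documented for GCN: v_cmp_* writes a per-lane result mask, while the _x variants additionally overwrite the EXEC mask, disabling failing lanes) can be sketched like this. The function names and the 3-lane example are my own; this is a hedged model, not the hardware spec.

```python
# Toy per-lane compare for a wavefront: v_cmp_* produces a lane
# bitmask (think VCC); v_cmpx_* also replaces the EXEC mask so
# only passing, previously active lanes keep executing.
def v_cmp_lt_u32(a, b):
    """Return a bitmask with bit i set where a[i] < b[i]."""
    vcc = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        if x < y:
            vcc |= 1 << lane
    return vcc

def v_cmpx_lt_u32(a, b, exec_mask):
    """Like v_cmp_lt_u32, but also compute the new EXEC mask."""
    vcc = v_cmp_lt_u32(a, b)
    return vcc, vcc & exec_mask  # new EXEC: passing AND active lanes

vcc, new_exec = v_cmpx_lt_u32([1, 5, 2], [3, 3, 3], exec_mask=0b111)
print(bin(vcc), bin(new_exec))  # 0b101 0b101
```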
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Compare element-wise with a scalar?
     
  11. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    655
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I just tried to figure it out with some test code, but unfortunately the support in the driver isn't complete (i.e. functional) yet. Looks like AMD doesn't put that stuff in the public versions that early anymore. It keeps kicking me out with a missing-DLL message (looks like they put the actual shader compiler for SI in a separate DLL for the time being).

    Nevertheless, the other stuff for the disassembly appears to work, so I can already see the output mask for the new architecture (like the number of used scalar and vector registers; it is definitely for GCN). By the way, no new VLIW ASICs appeared as target IDs, just three for GCN. And we have only a single vacancy left in the middle of the VLIW IDs, which happens to be the only VLIW4 ASIC besides Cayman. So place your bets what that means. :grin:
     
  13. Alexko

    Veteran

    Joined:
    Aug 31, 2009
    Messages:
    3,932
    That should be the GPU in Trinity. So VLIW is definitely gone, I guess.
     
  14. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    Exactly my guess.
    Or AMD also does a straight shrink to 28nm for some GPUs. There is one precedent (the 40nm RV740): it behaved so similarly to its predecessor (the 55nm RV770) that it didn't get its own ID but shared it with RV770.
     
  15. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    7,872
    Location:
    WI, USA
    So some years ago DAAMIT sat down and decided VLIW was a lost cause for compute. Of course they had a few VLIW projects to finish up and talk up in the meantime.

    It'll be fun to see comprehensive comparisons of Cayman and this new arch.
     
  16. itsmydamnation

    Regular

    Joined:
    Apr 29, 2007
    Messages:
    923
    Location:
    Australia
    Depends if it's evolution or revolution; there have been quite a few changes to the ALUs between R600/RV770/Evergreen/Cypress. Until we have a better idea about how things fit together exactly, we can't really tell if it's just the next step along the path or something completely new.

    It could also be that they predicted they could initially get better performance for things like compute shaders with a less flexible but small-sized, high-ALU-count VLIW design, but as process sizes shrink and the complexity of code and of scale increases, more powerful and flexible ALUs make more sense.

    Kind of the opposite of R580/R600, whose strengths appear to have been ahead of their time (excluding all the brokenness of R600).
     
  17. 3dilettante

    Legend

    Joined:
    Sep 15, 2003
    Messages:
    6,756
    Location:
    Well within 3d
    How's about we start begging for a die shot now?
    I'm starting to think the RV770 one was only released because some guy at ATI got drunk and accidentally sexted what he thought was a picture of his junk.

    I'm thinking this design would look interesting in a side-by-side comparison.
     
  18. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    730
  19. Harison

    Newcomer

    Joined:
    Mar 29, 2010
    Messages:
    195
  20. Harison

    Newcomer

    Joined:
    Mar 29, 2010
    Messages:
    195
    While it will definitely be an interesting comparison, I think it will take a few generations to get the new arch up to full speed. Nvidia failed pretty badly with the first incarnation of Fermi, and GCN is even more advanced. At least it's good that AMD is already working hard with Microsoft and other devs to get tools ready; we'll see how mature they are when GCN reaches the market.
     
