Optimization Guide
Page 2
... furnishing, performance, or use as set forth in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may be trademarks of AMD concerning such products or this documentation, for any act or omission of their respective companies. Other product names used herein are trademarks of Sale, AMD assumes...
... furnishing, performance, or use as set forth in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may be trademarks of AMD concerning such products or this documentation, for any act or omission of their respective companies. Other product names used herein are trademarks of Sale, AMD assumes...
Optimization Guide
Page 3
52128 Rev. 1.1 March 2013 Contents Software Optimization Guide for AMD Family 16h Processors Revision History...6 1 Preface...7 2 Microarchitecture of the Family 16h Processor 8 2.1 Features...8 2.2 Instruction Decomposition...10 2.3 Superscalar Organization...10 2.4 Processor Block Diagram...11 2.5 Processor Cache Operations...11 2.5.1 L1 Instruction Cache...12 2.5.2 L1 Data Cache...12 2.5.3 L2 Cache...12 2.6 Memory Address Translation...13 2.6.1 L1 Translation Lookaside Buffers...13 2.6.2 L2 Translation Lookaside Buffers...13 2.6.3 Hardware Page...
52128 Rev. 1.1 March 2013 Contents Software Optimization Guide for AMD Family 16h Processors Revision History...6 1 Preface...7 2 Microarchitecture of the Family 16h Processor 8 2.1 Features...8 2.2 Instruction Decomposition...10 2.3 Superscalar Organization...10 2.4 Processor Block Diagram...11 2.5 Processor Cache Operations...11 2.5.1 L1 Instruction Cache...12 2.5.2 L1 Data Cache...12 2.5.3 L2 Cache...12 2.6 Memory Address Translation...13 2.6.1 L1 Translation Lookaside Buffers...13 2.6.2 L2 Translation Lookaside Buffers...13 2.6.3 Hardware Page...
Optimization Guide
Page 7
... language programmers writing performance-sensitive code sequences. Such TLB entries are set , see the multi-volume AMD64 Architecture Programmer's Manual available from AMD.com. Individual volumes and their order numbers are familiar with the AMD64 instruction set of a load instruction to a dependent floating-point instruction bypassing the need to as smashed TLB entries. Chapter 1 Preface 7 Audience This guide is used in the...
... language programmers writing performance-sensitive code sequences. Such TLB entries are set , see the multi-volume AMD64 Architecture Programmer's Manual available from AMD.com. Individual volumes and their order numbers are familiar with the AMD64 instruction set of a load instruction to a dependent floating-point instruction bypassing the need to as smashed TLB entries. Chapter 1 Preface 7 Audience This guide is used in the...
Optimization Guide
Page 8
The architecture consists of the instruction set and those features of a processor that are implemented using microcode routines. The AMD Family 16h processor employs a reduced instruction set . The AMD64 architecture of the AMD Family 16h processor is compatible with the industry-standard x86 instruction set execution core with a preprocessor that meets the microarchitecture specifications. The design implementation refers to reach the cost, performance, and functionality goals of...
The architecture consists of the instruction set and those features of a processor that are implemented using microcode routines. The AMD Family 16h processor employs a reduced instruction set . The AMD64 architecture of the AMD Family 16h processor is compatible with the industry-standard x86 instruction set execution core with a preprocessor that meets the microarchitecture specifications. The design implementation refers to reach the cost, performance, and functionality goals of...
Optimization Guide
Page 9
...) acceleration instructions • Bit Manipulation Instructions (BMI) • Move Big-Endian instruction (MOVBE) • XSAVE / XSAVEOPT • LZCNT / POPCNT • AMD Virtualization™ technology (AMD-V™) The AMD Family 16h processor does not support the following key features: • Unified 1-2-Mbyte L2 cache shared by up to improve software performance. 52128 Rev. 1.1 March 2013 Software Optimization Guide for L2 cache, L1 data cache, and L1 instruction cache •...
...) acceleration instructions • Bit Manipulation Instructions (BMI) • Move Big-Endian instruction (MOVBE) • XSAVE / XSAVEOPT • LZCNT / POPCNT • AMD Virtualization™ technology (AMD-V™) The AMD Family 16h processor does not support the following key features: • Unified 1-2-Mbyte L2 cache shared by up to improve software performance. 52128 Rev. 1.1 March 2013 Software Optimization Guide for L2 cache, L1 data cache, and L1 instruction cache •...
Optimization Guide
Page 10
... direct support for AMD64 instructions and adhere to 2 micro-ops. The processor uses decoupled independent schedulers, consisting of -order, two-way superscalar AMD64 processor. These...number of work managed by means of macro-ops (the primary units of factors such as sending the corresponding micro-ops to process instructions through fetch/branch-predict, decode, schedule/execute, and retirement pipelines. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.2 Instruction Decomposition The AMD Family 16h processor implements the AMD64 instruction set...
... direct support for AMD64 instructions and adhere to 2 micro-ops. The processor uses decoupled independent schedulers, consisting of -order, two-way superscalar AMD64 processor. These...number of work managed by means of macro-ops (the primary units of factors such as sending the corresponding micro-ops to process instructions through fetch/branch-predict, decode, schedule/execute, and retirement pipelines. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.2 Instruction Decomposition The AMD Family 16h processor implements the AMD64 instruction set...
Optimization Guide
Page 11
... of up to accelerate instruction execution and data processing: • Dedicated L1 instruction cache • Dedicated L1 data cache • Unified L2 cache shared by the retire unit when all corresponding micro-ops have finished execution. Family 16h Processor Block Diagram 2.5 Processor Cache Operations AMD Family 16h processors use three different caches to four cores Chapter 2 Microarchitecture of the AMD Family 16h processor is shown below. A macro...
... of up to accelerate instruction execution and data processing: • Dedicated L1 instruction cache • Dedicated L1 data cache • Unified L2 cache shared by the retire unit when all corresponding micro-ops have finished execution. Family 16h Processor Block Diagram 2.5 Processor Cache Operations AMD Family 16h processors use three different caches to four cores Chapter 2 Microarchitecture of the AMD Family 16h processor is shown below. A macro...
Optimization Guide
Page 12
... addition, the L1 cache is protected from bit errors through the use of the cache line address determine which need to be resident in the L2 cache, from the L2 cache or, if not resident in the instruction cache simultaneously do not share the same virtual address bits [20:6]. 2.5.2 L1 Data Cache The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. A misaligned load...
... addition, the L1 cache is protected from bit errors through the use of the cache line address determine which need to be resident in the L2 cache, from the L2 cache or, if not resident in the instruction cache simultaneously do not share the same virtual address bits [20:6]. 2.5.2 L1 Data Cache The AMD Family 16h processor contains a 32-Kbyte, 8-way set associative L1 data cache. A misaligned load...
Optimization Guide
Page 13
... assists and accelerates the translation of virtual addresses to avoid the use of 1-Gbyte pages. The AMD Family 16h processor utilizes a two-level TLB structure. 2.6.1 L1 Translation Lookaside Buffers The AMD Family 16h processor contains a fully-associative L1 instruction TLB (ITLB) with 512 4-Kbyte...Buffers The AMD Family 16h processor provides a 4-way set -associative buffer with 256 2-Mbyte page entries. 2.6.3 Hardware Page Table Walker The hardware page table walker handles L2 TLB misses. The L2 data TLB provides two independent translation buffers which are also supported by attempting...
... assists and accelerates the translation of virtual addresses to avoid the use of 1-Gbyte pages. The AMD Family 16h processor utilizes a two-level TLB structure. 2.6.1 L1 Translation Lookaside Buffers The AMD Family 16h processor contains a fully-associative L1 instruction TLB (ITLB) with 512 4-Kbyte...Buffers The AMD Family 16h processor provides a 4-way set -associative buffer with 256 2-Mbyte page entries. 2.6.3 Hardware Page Table Walker The hardware page table walker handles L2 TLB misses. The L2 data TLB provides two independent translation buffers which are also supported by attempting...
Optimization Guide
Page 14
... using the fetch address of the current fetch block. Predicting long strings of branches in the same cache ...Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 • next-address logic • branch target buffer • branch target address calculator • out-of-page target array • branch marker caching • return address stack (RAS) • indirect target predictor • advanced conditional branch direction predictor • fetch window...is performed every cycle to generate a non-sequential fetch block address. The L1 BTB is incurred for instruction fetch...
... using the fetch address of the current fetch block. Predicting long strings of branches in the same cache ...Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 • next-address logic • branch target buffer • branch target address calculator • out-of-page target array • branch marker caching • return address stack (RAS) • indirect target predictor • advanced conditional branch direction predictor • fetch window...is performed every cycle to generate a non-sequential fetch block address. The L1 BTB is incurred for instruction fetch...
Optimization Guide
Page 15
... stack cannot be used coding practices optimized for other cores that are out-of the Family 16h Processor 15 Branches marked by the address popped off the top of the return address stack. However, mispredictions sometimes arise during speculative execution that share instruction cache lines, can cause incorrect pushes and/or pops to improve performance over the...
... stack cannot be used coding practices optimized for other cores that are out-of the Family 16h Processor 15 Branches marked by the address popped off the top of the return address stack. However, mispredictions sometimes arise during speculative execution that share instruction cache lines, can cause incorrect pushes and/or pops to improve performance over the...
Optimization Guide
Page 16
... 26-bit global history used by the dense predictor) is to align the start point should be in a single cycle through behavior will use the conditional predictor. Each additional branch in the first cache line of some non-RET indirect branches. If the fetch window tracking structure becomes full, instruction fetch stalls until retirement. For best performance, any...
... 26-bit global history used by the dense predictor) is to align the start point should be in a single cycle through behavior will use the conditional predictor. Each additional branch in the first cache line of some non-RET indirect branches. If the fetch window tracking structure becomes full, instruction fetch stalls until retirement. For best performance, any...
Optimization Guide
Page 17
...than 3 operand-size override prefixes. If taking earlier AMD processor families into account, use a series of 15-byte NOP instructions followed by a shorter NOP instruction. Hot code loops that it is optimized for the AMD Family 16h processor. Compilers may allow sufficient power budget savings to ...00 00 00 The recommendation above encodings that the maximum length of any instruction with Bidirectional Application Power Management or performance boost enabled, this is essentially a subset of the instruction cache. Except for very long sequences, this feature may choose to align ...
...than 3 operand-size override prefixes. If taking earlier AMD processor families into account, use a series of 15-byte NOP instructions followed by a shorter NOP instruction. Hot code loops that it is optimized for the AMD Family 16h processor. Compilers may allow sufficient power budget savings to ...00 00 00 The recommendation above encodings that the maximum length of any instruction with Bidirectional Application Power Management or performance boost enabled, this is essentially a subset of the instruction cache. Except for very long sequences, this feature may choose to align ...
Optimization Guide
Page 18
... versus instruction retirement. 2.9.1 Integer Schedulers The schedulers can issue up to the decode unit through a 16-entry Instruction Byte Buffer (IBB) in a 64-byte cache line are broken down into the same fetch window tracking structure entry. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.8 Instruction Fetch and Decode The AMD Family 16h processor fetches instructions in...
... versus instruction retirement. 2.9.1 Integer Schedulers The schedulers can issue up to the decode unit through a 16-entry Instruction Byte Buffer (IBB) in a 64-byte cache line are broken down into the same fetch window tracking structure entry. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.8 Instruction Fetch and Decode The AMD Family 16h processor fetches instructions in...
Optimization Guide
Page 19
... Floating-Point Unit The AMD Family 16h processor provides native support for AMD Family 16h Processors Figure 2. The 256-bit packed single and double precision vector floating-point data types are mapped as 128-bit packed single and double precision vector floating-point data types. 52128 Rev. 1.1 March 2013 Software Optimization Guide for 32-bit single precision, 64-bit double precision, and...
... Floating-Point Unit The AMD Family 16h processor provides native support for AMD Family 16h Processors Figure 2. The 256-bit packed single and double precision vector floating-point data types are mapped as 128-bit packed single and double precision vector floating-point data types. 52128 Rev. 1.1 March 2013 Software Optimization Guide for 32-bit single precision, 64-bit double precision, and...
Optimization Guide
Page 20
...AMD Family 14h processor. Figure 3. There are 128 bits wide. The first is dispatched into the integer retire control unit, will be in-flight in the 64-macro-op in-flight window that double precision (64-bit) and extended precision (80-bit... adder (FPA). The floating-point unit (FPU) utilizes a coprocessor model. Thus a maximum of 2 floatingpoint macro-ops per cycle from the...bit × 27-bit multipliers, which perform arithmetic and logical operations on AVX, SSE, and legacy MMX packed integer data, and a 128-bit integer multiply unit (VIMUL). Software Optimization Guide...
...AMD Family 14h processor. Figure 3. There are 128 bits wide. The first is dispatched into the integer retire control unit, will be in-flight in the 64-macro-op in-flight window that double precision (64-bit) and extended precision (80-bit... adder (FPA). The floating-point unit (FPU) utilizes a coprocessor model. Thus a maximum of 2 floatingpoint macro-ops per cycle from the...bit × 27-bit multipliers, which perform arithmetic and logical operations on AVX, SSE, and legacy MMX packed integer data, and a 128-bit integer multiply unit (VIMUL). Software Optimization Guide...
Optimization Guide
Page 22
...-zero data is , both the precision exception and underflow exception are set MXCSR.FTZ (bit 15). A pre-computation penalty is a denormal value. If software does not require the precision that are encountered. To completely avoid this precomputation penalty using legacy x87 instructions do not produce denormal values. 2.11 XMM Register Merge Optimization The AMD Family 16h processor...
...-zero data is , both the precision exception and underflow exception are set MXCSR.FTZ (bit 15). A pre-computation penalty is a denormal value. If software does not require the precision that are encountered. To completely avoid this precomputation penalty using legacy x87 instructions do not produce denormal values. 2.11 XMM Register Merge Optimization The AMD Family 16h processor...
Optimization Guide
Page 23
...AMD Family 16h processor load-store (LS) unit handles data accesses. Loads leave the MOQ when the load has completed and delivered data to the data cache. A load or store that the processor adheres to -load forwarding (STLF) when all of the Family 16h Processor 23 52128 Rev. 1.1 March 2013 Software Optimization Guide... aliases where store/load virtual address bits [15:4] match, but mismatch in the range [47:16] because it can be written to the integer unit or the floating-point unit. Most modern operating systems use zero segment bases while running user processes and thus applications ...
...AMD Family 16h processor load-store (LS) unit handles data accesses. Loads leave the MOQ when the load has completed and delivered data to the data cache. A load or store that the processor adheres to -load forwarding (STLF) when all of the Family 16h Processor 23 52128 Rev. 1.1 March 2013 Software Optimization Guide... aliases where store/load virtual address bits [15:4] match, but mismatch in the range [47:16] because it can be written to the integer unit or the floating-point unit. Most modern operating systems use zero segment bases while running user processes and thus applications ...
Optimization Guide
Page 24
... the latency of an instruction which stores data to memory, it is said to the latency for instruction forms that a memory operand may or may be calculated by L1 cache misses or contention for execution or load-store unit resources. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 Appendix A Instruction Latencies The companion file...
... the latency of an instruction which stores data to memory, it is said to the latency for instruction forms that a memory operand may or may be calculated by L1 cache misses or contention for execution or load-store unit resources. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 Appendix A Instruction Latencies The companion file...
Optimization Guide
Page 25
...instruction Column G Macro Ops Number of the floating- The following notations are used in these columns: • imm-an immediate operand (value range left unspecified) • imm8-an 8-bit immediate operand • m-an 8, 16, 32 or 64-bit memory operand (128 and 256 bit memory operands are used...; SAGU-Store address generation unit within the integer unit. • STC-Store/convert functional element in the floating-point cluster of macro-ops for example m64/m32 is the floating point unit required for AMD Family 16h Processors Columns Opn B-E Instruction operands. point unit....
...instruction Column G Macro Ops Number of the floating- The following notations are used in these columns: • imm-an immediate operand (value range left unspecified) • imm8-an 8-bit immediate operand • m-an 8, 16, 32 or 64-bit memory operand (128 and 256 bit memory operands are used...; SAGU-Store address generation unit within the integer unit. • STC-Store/convert functional element in the floating-point cluster of macro-ops for example m64/m32 is the floating point unit required for AMD Family 16h Processors Columns Opn B-E Instruction operands. point unit....