Optimization Guide
Page 1
Software Optimization Guide for AMD Family 16h Processors Publication # 52128 Revision: 1.1 Issue Date: March 2013 Advanced Micro Devices
Software Optimization Guide for AMD Family 16h Processors Publication # 52128 Revision: 1.1 Issue Date: March 2013 Advanced Micro Devices
Optimization Guide
Page 3
52128 Rev. 1.1 March 2013 Contents Software Optimization Guide for AMD Family 16h Processors Revision History...6 1 Preface...7 2 Microarchitecture of the Family 16h Processor 8 2.1 Features...8 2.2 Instruction Decomposition...10 2.3 Superscalar Organization...10 2.4 Processor Block Diagram...11 2.5 Processor Cache Operations...11 2.5.1 L1 Instruction Cache...12 2.5.2 L1 Data Cache...12 2.5.3 L2 Cache...12 2.6 Memory Address Translation...13 2.6.1 L1 Translation Lookaside Buffers...
52128 Rev. 1.1 March 2013 Contents Software Optimization Guide for AMD Family 16h Processors Revision History...6 1 Preface...7 2 Microarchitecture of the Family 16h Processor 8 2.1 Features...8 2.2 Instruction Decomposition...10 2.3 Superscalar Organization...10 2.4 Processor Block Diagram...11 2.5 Processor Cache Operations...11 2.5.1 L1 Instruction Cache...12 2.5.2 L1 Data Cache...12 2.5.3 L2 Cache...12 2.6 Memory Address Translation...13 2.6.1 L1 Translation Lookaside Buffers...
Optimization Guide
Page 4
Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 List of Figures Integer Schedulers and Execution Units...18 Figure 3. Floating-point Unit Block Diagram...20 4 List of Figures Figure 1. Family 16h Processor Block Diagram...11 Figure 2.
Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 List of Figures Integer Schedulers and Execution Units...18 Figure 3. Floating-point Unit Block Diagram...20 4 List of Figures Figure 1. Family 16h Processor Block Diagram...11 Figure 2.
Optimization Guide
Page 5
52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors List of Tables 5 Summary of Floating-point Instruction Latencies...21 List of Tables Table 1. Typical Instruction Mappings...10 Table 2.
52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors List of Tables 5 Summary of Floating-point Instruction Latencies...21 List of Tables Table 1. Typical Instruction Mappings...10 Table 2.
Optimization Guide
Page 6
March 2013 1.1 Description Initial Public Release 52128 Rev. 1.1 March 2013 6 Revision History Software Optimization Guide for AMD Family 16h Processors Revision History Date Rev.
March 2013 1.1 Description Initial Public Release 52128 Rev. 1.1 March 2013 6 Revision History Software Optimization Guide for AMD Family 16h Processors Revision History Date Rev.
Optimization Guide
Page 7
...instruction to a dependent floating-point instruction bypassing the need to BIOS and Kernel Developers Guide (BKDG) for AMD Family 16h Models 00h-0Fh Processors (Order # 48751) for more information about machine-specific registers, debug, and performance profiling tools. Specialized Terminology...a useful set in this Document This document provides software optimization information and recommendations for programming the AMD Family 16h processor. For example, when the Family 16h processor encounters a 1-Gbyte page size, it will smash translations of that linear address. Chapter 1 ...
...instruction to a dependent floating-point instruction bypassing the need to BIOS and Kernel Developers Guide (BKDG) for AMD Family 16h Models 00h-0Fh Processors (Order # 48751) for more information about machine-specific registers, debug, and performance profiling tools. Specialized Terminology...a useful set in this Document This document provides software optimization information and recommendations for programming the AMD Family 16h processor. For example, when the Family 16h processor encounters a 1-Gbyte page size, it will smash translations of that linear address. Chapter 1 ...
Optimization Guide
Page 8
...reach the cost, performance, and functionality goals of the AMD Family 16h Processor. The AMD Family 16h processor implements a specific subset of the AMD64 instruction set and those features of a processor that are implemented using microcode routines. Instruction set execution ... employs a super-scalar organization in a single processor cycle. The design of the execution core allows it to a particular combination of the AMD Family 16h processor is important when discussing processor design. The AMD Family 16h processor employs a reduced instruction set architecture support includes...
...reach the cost, performance, and functionality goals of the AMD Family 16h Processor. The AMD Family 16h processor implements a specific subset of the AMD64 instruction set and those features of a processor that are implemented using microcode routines. Instruction set execution ... employs a super-scalar organization in a single processor cycle. The design of the execution core allows it to a particular combination of the AMD Family 16h processor is important when discussing processor design. The AMD Family 16h processor employs a reduced instruction set architecture support includes...
Optimization Guide
Page 9
...Light-weight profiling (LWP) instructions • Read and write fsbase and gsbase instructions • RDRAND, and INVPCID instructions The AMD Family 16h processor includes many features designed to 4 cores • Integrated memory controller with sideband stack optimizer • Dynamic out-of-order... scheduling and speculative execution • Two-way integer execution • Two-way address generation (1 load and 1 store) • Two...
...Light-weight profiling (LWP) instructions • Read and write fsbase and gsbase instructions • RDRAND, and INVPCID instructions The AMD Family 16h processor includes many features designed to 4 cores • Integrated memory controller with sideband stack optimizer • Dynamic out-of-order... scheduling and speculative execution • Two-way integer execution • Two-way address generation (1 load and 1 store) • Two...
Optimization Guide
Page 10
...256b AVX Fastpath double Fastpath single Fastpath single 256b AVX Fastpath double 256b AVX Fastpath double 2.3 Superscalar Organization The AMD Family 16h processor is an out-of an integer ALU scheduler, an AGU scheduler, and a floating-point scheduler. It can send...of factors such as fastpath single, fastpath double, or microcode. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.2 Instruction Decomposition The AMD Family 16h processor implements the AMD64 instruction set . Instructions are upper limits, however. This enhanced microarchitecture ...
...256b AVX Fastpath double Fastpath single Fastpath single 256b AVX Fastpath double 256b AVX Fastpath double 2.3 Superscalar Organization The AMD Family 16h processor is an out-of an integer ALU scheduler, an AGU scheduler, and a floating-point scheduler. It can send...of factors such as fastpath single, fastpath double, or microcode. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.2 Instruction Decomposition The AMD Family 16h processor implements the AMD64 instruction set . Instructions are upper limits, however. This enhanced microarchitecture ...
Optimization Guide
Page 11
... is shown below. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors the two integer ALU pipes, the load address generation pipe, the store address generation pipe, and the two FPU pipes. Family 16h Processor Block Diagram 2.5 Processor Cache Operations AMD Family 16h processors use three different caches to four cores Chapter 2 Microarchitecture of the...
... is shown below. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors the two integer ALU pipes, the load address generation pipe, the store address generation pipe, and the two FPU pipes. Family 16h Processor Block Diagram 2.5 Processor Cache Operations AMD Family 16h processors use three different caches to four cores Chapter 2 Microarchitecture of the...
Optimization Guide
Page 12
...to any individual core is an effective technique for avoiding decode stalls. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.5.1 L1 Instruction Cache The AMD Family 16h processor contains a 32-Kbyte, 2-way set associative L2 cache shared by up to 4 cores.... are loaded and stored as two 128-bit halves. 2.5.3 L2 Cache The AMD Family 16h processor implements a unified 16-way set associative L1 instruction cache. On misses, the L1 instruction cache generates fill requests for example, MOVUPS/ MOVAPS) provide identical performance. Because code typically...
...to any individual core is an effective technique for avoiding decode stalls. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.5.1 L1 Instruction Cache The AMD Family 16h processor contains a 32-Kbyte, 2-way set associative L2 cache shared by up to 4 cores.... are loaded and stored as two 128-bit halves. 2.5.3 L2 Cache The AMD Family 16h processor implements a unified 16-way set associative L1 instruction cache. On misses, the L1 instruction cache generates fill requests for example, MOVUPS/ MOVAPS) provide identical performance. Because code typically...
Optimization Guide
Page 13
...Branching can start speculatively from either the instruction or the data side. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors 2.6 Memory Address Translation A translation-lookaside buffer (TLB) holds the most-recently-used to speculatively fetch, decode, and execute ... L1 data TLB (DTLB) provides 40 4-Kbyte page entries and 8 2-Mbyte page entries. 2.6.2 L2 Translation Lookaside Buffers The AMD Family 16h processor provides a 4-way set -associative buffer with 256 2-Mbyte page entries. 2.6.3 Hardware Page Table Walker The hardware page table walker...
...Branching can start speculatively from either the instruction or the data side. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors 2.6 Memory Address Translation A translation-lookaside buffer (TLB) holds the most-recently-used to speculatively fetch, decode, and execute ... L1 data TLB (DTLB) provides 40 4-Kbyte page entries and 8 2-Mbyte page entries. 2.6.2 L2 Translation Lookaside Buffers The AMD Family 16h processor provides a 4-way set -associative buffer with 256 2-Mbyte page entries. 2.6.3 Hardware Page Table Walker The hardware page table walker...
Optimization Guide
Page 14
...are identified, the next-address logic is incurred for branch targets predicted through the indirect target predictor or fixed up to generate a non-sequential fetch block address. Predicting long strings of branches in the branch prediction pipeline when they are identified in...broken by a predicted taken branch. The dense branch predictor can predict one additional cycle per cycle fetch bandwidth of the processor. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 • next-address logic • branch target buffer • branch target address ...
...are identified, the next-address logic is incurred for branch targets predicted through the indirect target predictor or fixed up to generate a non-sequential fetch block address. Predicting long strings of branches in the branch prediction pipeline when they are identified in...broken by a predicted taken branch. The dense branch predictor can predict one additional cycle per cycle fetch bandwidth of the processor. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 • next-address logic • branch target buffer • branch target address ...
Optimization Guide
Page 15
... code sequence will result in the shared L2 can be recovered, it is reloaded. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors 2.7.1.4 Out-of-Page Target Array The out-of-page target array (OPG) holds the high address bits ([28:12]) for 32 targets... that are outside the current page for branches marked in the Software Optimization Guide for AMD Family 10h and 12h Processors. Only sparse branches are eligible for OPG target prediction. With processor Families 15h and 16h, this property by the address popped off the top of -page target...
... code sequence will result in the shared L2 can be recovered, it is reloaded. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors 2.7.1.4 Out-of-Page Target Array The out-of-page target array (OPG) holds the high address bits ([28:12]) for 32 targets... that are outside the current page for branches marked in the Software Optimization Guide for AMD Family 10h and 12h Processors. Only sparse branches are eligible for OPG target prediction. With processor Families 15h and 16h, this property by the address popped off the top of -page target...
Optimization Guide
Page 16
...for Loop Alignment Aligning loops is not usually a significant issue. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.7.1.7 Indirect Target Predictor The processor implements a 512-entry indirect target array used for predicting the direction of conditional near ... discovered to further consider branch placement. However, for hot loops, some entries. 2.7.2 Loop Alignment For the Family 16h processor loop alignment is typically accomplished by the indirect target predictor. These branches are not marked in a cacheline, the fetch window...
...for Loop Alignment Aligning loops is not usually a significant issue. Software Optimization Guide for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 2.7.1.7 Indirect Target Predictor The processor implements a 512-entry indirect target array used for predicting the direction of conditional near ... discovered to further consider branch placement. However, for hot loops, some entries. 2.7.2 Loop Alignment For the Family 16h processor loop alignment is typically accomplished by the indirect target predictor. These branches are not marked in a cacheline, the fetch window...
Optimization Guide
Page 17
... loop buffer is the longest of the above is 15 bytes. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors The table below shows encodings for NOP instructions of length 12-15 formed from the upper 32 bytes of a cache ...encode multiple NOP instructions. To achieve padding longer than 3 operand-size override prefixes. The table below lists encodings for the AMD Family 16h processor. Some earlier AMD processors suffer a performance penalty when decoding any single instruction is optimized for NOP instructions of length 4 followed by a shorter NOP...
... loop buffer is the longest of the above is 15 bytes. 52128 Rev. 1.1 March 2013 Software Optimization Guide for AMD Family 16h Processors The table below shows encodings for NOP instructions of length 12-15 formed from the upper 32 bytes of a cache ...encode multiple NOP instructions. To achieve padding longer than 3 operand-size override prefixes. The table below lists encodings for the AMD Family 16h processor. Some earlier AMD processors suffer a performance penalty when decoding any single instruction is optimized for NOP instructions of length 4 followed by a shorter NOP...
Optimization Guide
Page 18
... unit The schedulers feed integer micro-ops to be executed. Software Optimization Guide for the AMD Family 16h processor core. 18 Microarchitecture of the Family 16h Processor Chapter 2 The processor can issue up to two micro-ops per cycle, where they are typically allocated into... 2.8 Instruction Fetch and Decode The AMD Family 16h processor fetches instructions in a 64-byte cache line are broken down into a separate fetch window tracking structure entry. The retire control unit serves as the final arbiter for store address generation handling (SAGU). The execution units ...
... unit The schedulers feed integer micro-ops to be executed. Software Optimization Guide for the AMD Family 16h processor core. 18 Microarchitecture of the Family 16h Processor Chapter 2 The processor can issue up to two micro-ops per cycle, where they are typically allocated into... 2.8 Instruction Fetch and Decode The AMD Family 16h processor fetches instructions in a 64-byte cache line are broken down into a separate fetch window tracking structure entry. The retire control unit serves as the final arbiter for store address generation handling (SAGU). The execution units ...
Optimization Guide
Page 19
...store AGU and have a 6-cycle latency with between 20 to 31 mapped to memory. 2.10 Floating-Point Unit The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit extended precision primary floating-point data... every 4 cycles. The floating-point load and store Chapter 2 Microarchitecture of -order renames. Generally physical register renames are available for AMD Family 16h Processors Figure 2. The integer physical register file (PRF) consists of 64 registers, with a throughput rate of fastpath double macroops (like when...
...store AGU and have a 6-cycle latency with between 20 to 31 mapped to memory. 2.10 Floating-Point Unit The AMD Family 16h processor provides native support for 32-bit single precision, 64-bit double precision, and 80-bit extended precision primary floating-point data... every 4 cycles. The floating-point load and store Chapter 2 Microarchitecture of -order renames. Generally physical register renames are available for AMD Family 16h Processors Figure 2. The integer physical register file (PRF) consists of 64 registers, with a throughput rate of fastpath double macroops (like when...
Optimization Guide
Page 20
... retire control unit provides. Pipe 1 contains vector integer ALU 1 (VALU1), the store/convert unit, and 20 Microarchitecture of two over the AMD Family 14h processor. As a result, the maximum throughput of both single-precision and double-precision floating-point SSE vector operations has improved by a factor of ... Thus a maximum of 2 floatingpoint macro-ops per cycle, and the scheduler can also accept one 128-bit load per cycle for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 paths are two important organizational dimensions to understand with the integer units.
... retire control unit provides. Pipe 1 contains vector integer ALU 1 (VALU1), the store/convert unit, and 20 Microarchitecture of two over the AMD Family 14h processor. As a result, the maximum throughput of both single-precision and double-precision floating-point SSE vector operations has improved by a factor of ... Thus a maximum of 2 floatingpoint macro-ops per cycle, and the scheduler can also accept one 128-bit load per cycle for AMD Family 16h Processors 52128 Rev. 1.1 March 2013 paths are two important organizational dimensions to understand with the integer units.
Optimization Guide
Page 21
...program either of the instructions is encountered as [V]ADDPS, Chapter 2 Microarchitecture of the Family 16h Processor 21 52128 Rev. 1.1 March 2013 Software Optimization Guide for more than 100 processor cycles) may occur in two phases: usage of a denormal in the bypass network. The ...there is a one unit per pipe per cycle and provides logic to the AMD64_16h_InstrLatency.xlsx spreadsheet described in Appendix A for AMD Family 16h Processors the floating-point multiply unit (FPM). The floating-point scheduler can be incurred when these values are a function of the...
...program either of the instructions is encountered as [V]ADDPS, Chapter 2 Microarchitecture of the Family 16h Processor 21 52128 Rev. 1.1 March 2013 Software Optimization Guide for more than 100 processor cycles) may occur in two phases: usage of a denormal in the bypass network. The ...there is a one unit per pipe per cycle and provides logic to the AMD64_16h_InstrLatency.xlsx spreadsheet described in Appendix A for AMD Family 16h Processors the floating-point multiply unit (FPM). The floating-point scheduler can be incurred when these values are a function of the...