>>Fragment of the book 'Zen of Code Optimization'
>>Find 'Excerpt' to get to the Pentium code optimization part.
>>19950206/Jaap van Ganswijk

Zen of Code Optimization
From the Jeff Duntemann Zen Programming Series
Published by The Coriolis Group
ISBN: 1-883577-03-9
$39.95; with 3 1/2" disk.
(602) 483-0192  800 410-0192

This book is destined to become one of the most important technical computer books published in the '90s. Written by Michael Abrash, best-selling author and code performance guru, Zen of Code Optimization is the complete guide to writing optimized software for PCs--from 8086s to the hot new Pentium. Michael Abrash draws on over a decade of programming experience to show you how to fine-tune your programming skills and develop world-class software.

To place an order:

Ordering Information
If you wish to pay by credit card, include the card number and expiration date with your order. If you prefer, you can fax complete details to 602-483-0193. We accept only Visa or Mastercard at this time. You can also pay by check; send your check to the address listed below. We can also ship your order C.O.D. The cost for C.O.D. handling is $6 per book. (C.O.D. service is for U.S. orders only.) The price for the book is $39.95. Please include $2.50 for shipping and handling in the U.S. The cost for foreign shipping is $5.00 for surface mail. We can ship by other means--please request shipping costs by e-mail or fax.

Comments on this book:

"At last, the definitive work on writing tight code from the guru . . ."
--Bill Gates, Midnight Engineering

"The subtitle says it all--the ultimate guide to writing software that pushes PCs to the limits."
--Computer Publishing Update

"PCs can do just about anything you can imagine. It's not easy coaxing them into doing what you want, but it's rewarding and sure as heck fun. In this book, we're going to work many wonders."
--Author Michael Abrash

Comments on Michael Abrash's previously published Zen of Assembly Language:

"Michael Abrash's monumental Zen of Assembly Language must certainly rank as one of the most important works ever written on the craft of programming."
--Dr. Dobb's Journal

Who Is This Book For?
This book is for intermediate to advanced programmers in C, C++, and assembly language who want to learn the techniques of code optimization; that is, techniques to produce the fastest possible code on their machines.

What Do I Need to Use It?
You need some background in programming in C and assembly. An Intel-based PC is required to run the code. Most of the assembly code is applicable to any Intel-based PC. A small number of the listings demonstrate features of the 386, 486, and Pentium, and will not run on earlier processors. A debugger like CodeView or Turbo Debugger is very helpful in examining the machine code generated by C and C++ compilers.

What Sort of Code Is Present on the Code Disk?
The code disk contains Michael Abrash's well-known Zen Timer utility, as well as numerous programs that demonstrate code optimization. Most of these programs are meaningful only in the context of the book's discussion, but there are several fast implementations of Conway's Game of Life that may be compiled and enjoyed apart from the book's treatment of them. These programs have been tested with the latest Borland and Microsoft compilers and assemblers, and most will work with earlier versions of those compilers and assemblers. The code is quite generic and performs very little I/O, so translation to other C compilers or assemblers like Mix C or A86 should not be difficult.
################################################### Zen of Code Optimization Contents Preface xix Chapter 1 The Best Optimizer Is between Your Ears 1 The Human Element of Code Optimization 1 Understanding High Performance 2 When Fast Isn't Fast 2 Rules for Building High Performance Code 3 Know Where You're Going 4 Make a Big Map 4 Make Lots of Little Maps 4 Listing 1.1 L1-1.C 4 Listing 1.2 L1-2.C 6 Listing 1.3 L1-3.ASM 7 Know the Territory 8 Listing 1.4 L1-4.C 9 Know When It Matters 10 Always Consider the Alternatives 11 Listing 1.5 L1-5.C 12 Know How to Turn On the Juice 14 Listing 1.6 L1-6.C 14 Listing 1.7 L1-7.ASM 15 Where We've Been, What We've Seen 17 Where We're Going 17 Chapter 2 A World Apart 19 The Unique Nature of Assembly Language Optimization 19 Instructions: The Individual versus the Collective 19 Assembly Is Fundamentally Different 21 Transformation Inefficiencies 21 Self-Reliance 22 Knowledge 23 The Flexible Mind 24 Where to Begin? 26 Chapter 3 Assume Nothing 27 Understanding and Using the Zen Timer 27 The Costs of Ignorance 27 The Zen Timer 29 Listing 3.1 PZTIMER.ASM 29 The Zen Timer Is a Means, Not an End 38 Starting the Zen Timer 38 Time and the PC 39 Stopping the Zen Timer 42 Reporting Timing Results 43 Notes on the Zen Timer 44 A Sample Use of the Zen Timer 45 Listing 3.2 PZTEST.ASM 45 Listing 3.3 LST3-3.ASM 46 Listing 3.4 PZTIME.BAT 48 The Long-Period Zen Timer 50 Stopping the Clock 51 Listing 3.5 LZTIMER.ASM 52 Example Use of the Long-Period Zen Timer 65 Listing 3.6 LZTEST.ASM 66 Listing 3.7 LZTIME.BAT 67 Listing 3.8 LST3-8.ASM 69 Using the Zen Timer from C 70 Watch Out for Optimizing Assemblers! 
72 Further Reading 72 Armed with the Zen Timer, Onward and Upward 73 Chapter 4 In the Lair of the Cycle-Eaters 75 How the PC Hardware Devours Code Performance 75 Cycle-Eaters 76 The Nature of Cycle-Eaters 76 The 8088's Ancestral Cycle-Eaters 77 The 8-Bit Bus Cycle-Eater 77 The Impact of the 8-Bit Bus Cycle-Eater 80 What to Do about the 8-Bit Bus Cycle-Eater? 81 Listing 4.1 LST4-1.ASM 81 Listing 4.2 LST4-2.ASM 81 Listing 4.3 LST4-3.ASM 83 Listing 4.4 LST4-4.ASM 83 The Prefetch Queue Cycle-Eater 84 Official Execution Times Are Only Part of the Story 85 There is No Such Beast as a True Instruction Execution Time 86 Listing 4.5 LST4-5.ASM 88 Listing 4.6 LST4-6.ASM 88 Listing 4.7 LST4-7.ASM 90 Listing 4.8 LST4-8.ASM 91 Approximating Overall Execution Times 92 What to Do about the Prefetch Queue Cycle-Eater? 92 Holding Up the 8088 93 Dynamic RAM Refresh: The Invisible Hand 94 How DRAM Refresh Works in the PC 94 The Impact of DRAM Refresh 95 Listing 4.9 LST4-9.ASM 96 Listing 4.10 LST4-10.ASM 97 What to Do about the DRAM Refresh Cycle-Eater? 98 Wait States 99 The Display Adapter Cycle-Eater 100 The Impact of the Display Adapter Cycle-Eater 104 Listing 4.11 LST4-11.ASM 105 Listing 4.12 LST4-12.ASM 106 What to Do about the Display Adapter Cycle-Eater? 107 Cycle-Eaters: A Summary 108 What Does It All Mean? 
109 Chapter 5 Crossing the Border 111 Searching Files with Restartable Blocks 111 Searching for Text 112 Avoiding the String Trap 113 Brute-Force Techniques 113 Using memchr() 114 Making a Search Restartable 116 Listing 5.1 SEARCH.C 116 Interpreting Where the Cycles Go 120 Knowing When Assembly is Pointless 121 Always Look Where Execution Is Going 122 Chapter 6 Looking Past Face Value 123 How Machine Instructions May Do More Than You Think 123 Memory Addressing and Arithmetic 125 Math via Memory Addressing 126 The Wonders of LEA on the 386 128 Multiplication with LEA Using Non-Powers of Two 129 Chapter 7 Local Optimization 131 Optimizing Halfway between Algorithms and Cycle Counting 131 When LOOP Is a Bad Idea 132 The Lessons of LOOP and JCXZ 134 Avoiding LOOPS of Any Stripe 134 Local Optimization 134 Listing 7.1 L7-1.ASM 135 Unrolling Loops 137 Listing 7.2 L7-2.ASM 138 Rotating and Shifting with Tables 141 Listing 7.3 L7-3.ASM 141 Listing 7.4 L7-4.ASM 142 NOT Flips Bits--Not Flags 142 Incrementing with and without Carry 143 Listing 7.5 L7-5.ASM 143 Listing 7.6 L7-6.ASM 143 Chapter 8 Speeding Up C with Assembly Language 145 Jumping Languages When You Know It'll Help 145 Billy, Don't Be a Compiler 146 Don't Call Your Functions on Me, Baby 147 Stack Frames Slow SO MUCH 147 Torn Between Two Segments 148 Why Speeding Up Is Hard to Do 148 Taking It to the Limit 150 A C-To-Assembly Case Study 150 Listing 8.1 L8-1.C 151 Listing 8.2 L8-2.COD 154 Listing 8.3 L8-3.ASM 155 Listing 8.4 L8-4.ASM 157 Listing 8.5 L8-5.C 159 Listing 8.6 L8-6.ASM 161 Chapter 9 Hints My Readers Gave Me 165 Optimization Odds and Ends from the Field 165 Another Look at LEA 166 The Kennedy Portfolio 167 Speeding Up Multiplication 169 Optimizing Optimized Searching 170 Listing 9.1 L9-1.ASM 173 Listing 9.2 L9-2.ASM 175 Listing 9.3 L9-3.C 177 Short Sorts 178 Listing 9.4 L9-4.ASM 178 Full 32-Bit Division 179 Listing 9.5 L9-5.ASM 180 Listing 9.6 L9-6.C 182 Sweet Spot Revisited 182 Hard-Core Cycle Counting 183 Hardwired Far Jumps 184 Listing 9.7 L9-7.ASM 185 Setting Extended Registers: Time versus Space 185 Chapter 10 Patient Coding, Faster Code 187 How Working Quickly Can Bring Execution to a Crawl 187 The Case for Delayed Gratification 188 The Brute-Force Syndrome 189 Listing 10.1 L10-1.C 191 Wasted Breakthroughs 192 Listing 10.2 L10-2.C 192 Listing 10.3 L10-3.C 195 Recursion 196 Listing 10.4 L10-4.C 196 Patient Optimization 197 Listing 10.5 L10-5.ASM 197 Chapter 11 Pushing the 286 and 386 201 New Registers, New Instructions, New Timings, New Complications 201 Family Matters 202 Crossing the Gulf to the 286 and the 386 202 In the Lair of the Cycle-Eaters, Part II 203 System Wait States 204 Listing 11.1 L11-1.ASM 206 Data Alignment 207 Listing 11.2 L11-2.ASM 209 Listing 11.3 L11-3.ASM 209 Code Alignment 210 Alignment and the 386 212 Alignment and the Stack 213 The DRAM Refresh Cycle-Eater: Still an Act of God 213 The Display Adapter Cycle-Eater 214 New Instructions and Features: The 286 216 New Instructions and Features: The 386 217 Optimization Rules: The More Things Change 218 Detailed Optimization 218 Listing 11.4 L11-4.ASM 219 Listing 11.5 L11-5.ASM 220 POPF and the 286 221 Chapter 12 Pushing the 486 227 It's Not Just a Bigger 386 227 Enter the 486 228 Rules to Optimize By 228 The Hazards of Indexed Addressing 229 Calculate Memory Pointers Ahead of Time 230 Caveat Programmor 233 Stack Addressing and Address Pipelining 234 Problems with Byte Registers 235 More Fun with Byte Registers 236 Timing Your Own 486 Code 238 Listing 12.1 LST12-1.ASM 238 Listing 12.2 LST12-2.ASM 239 The Story Continues 239 Chapter 13 Aiming the 486 241 Pipelines and Other Hazards of the High End 241 486 Pipeline Optimization 242 Listing 13.1 L13-1.ASM 242 Listing 13.2 L13-2.ASM 244 Listing 13.3 L13-3.ASM 244 Listing 13.4 L13-4.ASM 244 BSWAP: More Useful Than
You Might Think 245 Listing 13.5 L13-5.ASM 246 Listing 13.6 L13-6.ASM 247 Pushing and Popping Memory 247 Optimal 1-Bit Shifts and Rotates 248 32-Bit Addressing Modes 249 Chapter 14 Boyer-Moore String Searching 253 Optimizing a Pretty Optimum Search Algorithm 253 String Searching Refresher 254 The Boyer-Moore Algorithm 255 Boyer-Moore: The Good and the Bad 258 Listing 14.1 L14-1.C 260 Listing 14.2 L14-2.C 262 Listing 14.3 L14-3.ASM 264 Further Optimization of Boyer-Moore 267 Listing 14.4 L14-4.ASM 268 Know What You Know 271 Chapter 15 Linked Lists and Unintended Challenges 273 Unfamiliar Problems with Familiar Data Structures 273 Linked Lists 274 Listing 15.1 L15-1.C 276 Listing 15.2 LLIST.H 276 Listing 15.3 L15-3.C 276 Dummies and Sentinels 277 Listing 15.4 L15-4.C 279 Listing 15.5 L15-5.C 280 Circular Lists 280 Listing 15.6 L15-6.C 281 Listing 15.7 L15-7.ASM 283 Listing 15.8 L15-8.C 284 Hi/Lo in 24 Bytes 286 Listing 15.9 L15-9.ASM 286 Chapter 16 There Ain't No Such Thing as the Fastest Code 289 Lessons Learned in the Pursuit of the Ultimate Word Counter 289 Counting Words in a Hurry 290 Listing 16.1 L16-1.C 291 Listing 16.2 L16-2.C 292 Listing 16.3 L16-3.ASM 294 Listing 16.4 L16-4.ASM 296 Which Way to Go from Here? 295 Challenges and Hazards 298 Blinding Yourself to a Better Approach 299 Watch Out for Luggable Assumptions! 300 The Astonishment of Right-Brain Optimization 301 Listing 16.5 QSCAN3.ASM 302 Levels of Optimization 306 Optimization Level 1: Good Code 307 Listing 16.6 OPT2.ASM 308 Level 2: A New Perspective 310 Listing 16.7 L16-7.ASM 311 Level 3: Breakthrough 312 Listing 16.8 MAKETAB.C 312 Enough Word Counting Already! 315 Chapter 17 The Game of Life 317 The Triumph of Algorithmic Optimization in a Cellular Automata Game 317 Conway's Game 318 The Rules of the Game 318 Listing 17.1 L17-1.CPP 319 Listing 17.2 L17-2.CPP 323 Where Does the Time Go?
324 The Hazards and Advantages of Abstraction 325 Listing 17.3 L17-3.CPP 327 Heavy-Duty C++ Optimization 332 Listing 17.4 L17-4.CPP 332 Bringing In the Right Brain 335 Re-Examining the Task 335 Acting on What We Know 337 Listing 17.5 L17-5.CPP 337 The Challenge That Ate My Life 344 Chapter 18 It's a Wonderful Life 345 Optimization beyond the Pale 345 Breaking the Rules 346 Table-Driven Magic 347 How to Build Qlife 348 Listing 18.1 BUILD.BAT 349 Listing 18.2 LCOMP.C 349 Listing 18.3 MAIN.C 359 Listing 18.4 VIDEO.C 361 Listing 18.5 LIFE.H 362 Keeping Track of Change with a Change List 362 Listing 18.6 QLIFE Assembly Segment Usage 365 A Layperson's Overview of Qlife 365 Chapter 19 Pentium: Not the Same Old Song 367 Learning a Whole Different Set of Optimization Rules 367 The Return of Optimization as Art 368 The Pentium: An Overview 368 Crossing Cache Lines 369 Cache Organization 371 Faster Addressing and More 372 Branch Prediction 374 Miscellaneous Pentium Topics 375 486 Versus Pentium Optimization 375 Going Superscalar 376 Chapter 20 Pentium Rules 377 How Your Carbon-Based Optimizer Can Put the "Super" in Superscalar 377 An Instruction in Every Pipe 378 V-Pipe-Capable Instructions 380 Lockstep Execution 385 Superscalar Notes 389 Register Starvation 390 Chapter 21 Unleashing the Pentium's V-Pipe 393 Focusing on Keeping Both Pentium Pipes Full 393 Address Generation Interlocks 394 Register Contention 398 Exceptions to Register Contention 398 Who's in First?
399 Pentium Optimization in Action 400 Listing 21.1 L21-1.ASM 400 Listing 21.2 L21-2.ASM 402 Listing 21.3 L21-3.ASM 403 Listing 21.4 L21-4.ASM 404 Listing 21.5 L21-5.ASM 405 A Quick Note on the 386 and 486 407 Chapter 22 Zenning and the Flexible Mind 409 Taking a Spin through What You've Learned 409 Zenning 409 Listing 22.1 L22-1.ASM 410 Listing 22.2 L22-2.ASM 411 Listing 22.3 L22-3.ASM 412 Listing 22.4 L22-4.ASM 412 Listing 22.5 L22-5.ASM 413 Listing 22.6 L22-6.ASM 414 Listing 22.7 L22-7.ASM 415 Appendix A 417 The C-Callable Zen Timer Full Listings 417 Listing A.1 PCZTNEAR.ASM 417 Listing A.2 PCZTFAR.ASM 427 Afterword 437 Index 439

##############################################

Zen of Code Optimization Chapter Excerpt
The entire contents of this chapter are copyright 1994 by The Coriolis Group, Inc.

Chapter 20 Pentium Rules
How Your Carbon-Based Optimizer Can Put the "Super" in Superscalar

At the 1983 West Coast Computer Faire, my friend Dan Illowsky, Andy Greenberg (co-author of Wizardry, at that time the best-selling computer game ever), and I had an animated discussion about starting a company in the then-budding world of microcomputer software. One hot new software category at the time was educational software, and one of the hottest new educational software companies was Spinnaker Software. Andy used Spinnaker as an example of a company that had been aimed at a good market and started up properly, and was succeeding as a result. Dan didn't buy this; his point was that Spinnaker had been given a bundle of money to get off the ground, and was growing only by spending a lot of that money in order to move its products. "Heck," said Dan, "I could get that kind of market share too if I gave away a fifty-dollar bill with each of my games." Remember, this was a time when a program, two diskette drives (for duplicating disks), and a couple of ads were enough to start a company, and, in fact, Dan built a very successful game company out of not much more than that.
(I'll never forget coming to visit one day and finding his apartment stuffed literally to the walls and ceiling with boxes of diskettes and game packages; he had left a narrow path to the computer so his wife and his mother could get in there to duplicate disks.) Back then, the field was wide open, with just about every competent programmer thinking of striking out on his or her own to try to make a fortune, and Dan and Andy and I were no exceptions. In short, we were having a perfectly normal conversation, and Dan's comment was both appropriate and, in retrospect, accurate. Appropriate, save for one thing: We were having this conversation while walking through a low-rent section of Market Street in San Francisco at night. A bum sitting against a nearby building overheard Dan, and rose up, shouting in a quavering voice loud enough to wake the dead, "Fifty-dollar bill! Fifty-dollar bill! He's giving away fifty-dollar bills!" We ignored him; undaunted, he followed us for a good half mile, stopping every few feet to bellow "fifty-dollar bill!" No one else seemed to notice, and no one hassled us, but I was mighty happy to get to the sanctuary of the Fairmont Hotel and slip inside.

The point is, most actions aren't inherently good or bad; it's all a matter of context. If Dan had uttered the words "fifty-dollar bill" on the West Coast Faire's show floor, no one would have batted an eye. If he had said it in a slightly worse part of town than he did, we might have learned just how fast the three of us could run. Similarly, there's no such thing as inherently fast code, only fast code in context. At the moment, the context is the Pentium, and the truth is that a sizable number of the x86 optimization tricks that you and I have learned over the past ten years are obsolete on the Pentium.
True, the Pentium contains what amounts to about one-and-a-half 486s, but, as we'll see shortly, that doesn't mean that optimized Pentium code looks much like optimized 486 code, or that fast 486 code runs particularly well on a Pentium. (Fast Pentium code, on the other hand, does tend to run well on the 486; the only major downsides are that it's larger, and that the FXCH instruction, which is largely free on the Pentium, is expensive on the 486.) So discard your x86 preconceptions as we delve into superscalar optimization for this one-of-a-kind processor.

An Instruction in Every Pipe

In the last chapter, we took a quick tour of the Pentium's architecture, and started to look into the Pentium's optimization rules. Now we're ready to get to the key rules, those having to do with the Pentium's most distinctive and powerful feature, the ability to execute more than one instruction per cycle. This is known as superscalar execution, and has heretofore been the sole province of fast RISC CPUs. The Pentium has two integer execution units, called the U-pipe and the V-pipe, which can execute two separate instructions simultaneously, potentially doubling performance--but only under the proper conditions. (There is also a separate floating-point execution unit that I won't have the space to cover in this book.) Your job, as a performance programmer, is to understand the conditions needed for superscalar performance and make sure they're met, and that's what this chapter and the next are all about. The two pipes are not independent processors housed in a single chip; that is, the Pentium is not like having two 486s in a single computer. Rather, the two pipes are integral, parallel parts of the same processor. They operate on the same instruction stream, with the V-pipe simply executing the next instruction that the U-pipe would have handled, as shown in Figure 20.1.
What the Pentium does, pure and simple, is execute a single instruction stream and, whenever possible, take the next two waiting instructions and execute both at once, rather than one after the other. The U-pipe is the more capable of the two pipes, able to execute any instruction in the Pentium's instruction set. (A number of instructions actually use both pipes at once. Logically, though, you can think of such instructions as U-pipe instructions, and of the Pentium optimization model as one in which the U-pipe is able to execute all instructions and is always active, with the objective being to keep the V-pipe also working as much of the time as possible.) The U-pipe is generally similar to a full 486 in terms of both capabilities and instruction cycle counts. The V-pipe is a 486 subset, able to execute simple instructions such as MOV and ADD, but unable to handle MUL, DIV, string instructions, any sort of rotation or shift, or even ADC or SBB. Getting two instructions executing simultaneously in the two pipes is trickier than it sounds, not only because the V-pipe can handle only a relatively small subset of the Pentium's instruction set, but also because those instructions that the V-pipe can handle are able to pair only with certain U-pipe instructions. For example, MOVSD uses both pipes, so no instruction can be executed in parallel with MOVSD. The use of both pipes does make MOVSD nearly twice as fast on the Pentium as on the 486, but it's nonetheless slower than using equivalent simpler instructions that allow for superscalar execution. Stick to the Pentium's RISC-like instructions--the pairable instructions I'll discuss next--when you're seeking maximum performance, with just a few exceptions such as REP MOVS and REP STOS. Trickier yet, register contention can shut down the V-pipe on any given cycle, and Address Generation Interlocks (AGIs) can stall either pipe at any time, as we'll see in the next chapter.
The key to Pentium optimization is to view execution as a stream of instructions going through the U- and V-pipes, and to eliminate, as much as possible, instruction mixes that take the V-pipe out of action. In practice, this is not too difficult. The only hard part is keeping in mind the long list of rules governing instruction pairing. The place to begin is with the set of instructions that can go through the V-pipe.

V-Pipe-Capable Instructions

Any instruction can go through the U-pipe, and, for practical purposes, the U-pipe is always executing instructions. (The exceptions are when the U-pipe execution unit is waiting for instruction or data bytes after a cache miss, and when a U-pipe instruction finishes before a paired V-pipe instruction, as I'll discuss below.) Only the instructions shown in Table 20.1 can go through the V-pipe. In addition, the V-pipe can execute a separate instruction only when one of the instructions listed in Table 20.2 is executing in the U-pipe; superscalar execution is not possible while any instruction not listed in Table 20.2 is executing in the U-pipe. So, for example, if you use SHR EDX,CL, which takes 4 cycles to execute, no other instructions can execute during those 4 cycles; if, on the other hand, you use SHR EDX,10, it will take 1 cycle to execute in the U-pipe, and another instruction can potentially execute concurrently in the V-pipe. (As you can see, similar instruction sequences can have vastly different performance characteristics on the Pentium.) Basically, after the current instruction or pair of instructions is finished (that is, once neither the U- nor V-pipe is executing anything), the Pentium sends the next instruction through the U-pipe. If the instruction after the one in the U-pipe is an instruction the V-pipe can handle, if the instruction in the U-pipe is pairable, and if register contention doesn't occur, then the V-pipe starts executing that instruction, as shown in Figure 20.2.
Otherwise, the second instruction waits until the first instruction is done, then executes in the U-pipe, possibly pairing with the next instruction in line if all pairing conditions are met. The list of instructions the V-pipe can handle is not very long, and the list of U-pipe pairable instructions is not much longer, but these actually constitute the bulk of the instructions used in PC software. As a result, a fair amount of pairing happens even in normal, non-Pentium-optimized code. This fact, plus the 64-bit 66 MHz bus, branch prediction, dual 8K internal caches, and other Pentium features, together mean that a Pentium is considerably faster than a 486 at the same clock speed, even without Pentium-specific optimization, contrary to some reports. Besides, almost all operations can be performed by combinations of pairable instructions. For example, PUSH [mem] is not on either list, but both MOV reg,[mem] and PUSH reg are, and those two instructions can be used to push a value stored in memory. In fact, given the proper instruction stream, the discrete instructions can perform this operation effectively in just 1 cycle (taking one-half of each of 2 cycles, for 2*0.5 = 1 cycle total execution time), as shown in Figure 20.3--a full cycle faster than PUSH [mem], which takes 2 cycles. A fundamental rule of Pentium optimization is that it pays to break complex instructions into equivalent simple instructions, then shuffle the simple instructions for maximum use of the V-pipe. This is true partly because most of the pairable instructions are simple instructions, and partly because breaking instructions into pieces allows more freedom to rearrange code to avoid the AGIs and register contention I'll discuss in the next chapter. One downside of this "RISCification" (turning complex instructions into simple, RISC-like ones) of Pentium-optimized code is that it makes for substantially larger code.
For example,

    push dword ptr [esi]

is 1 byte smaller than this sequence:

    mov  eax,[esi]
    push eax

A more telling example is the following:

    add [MemVar],eax

versus the equivalent:

    mov edx,[MemVar]
    add edx,eax
    mov [MemVar],edx

The single complex instruction takes 3 cycles and is 6 bytes long; with proper sequencing, interleaving the simple instructions with other instructions that don't use EDX or MemVar, the three-instruction sequence can be reduced to 1.5 cycles, but it is 14 bytes long. It's not unusual for Pentium optimization to approximately double both performance and code size at the same time. In an important loop, go for performance and ignore the size, but on a program-wide basis, the size bears watching.

Lockstep Execution

You may wonder why anyone would bother breaking ADD [MemVar],EAX into three instructions, given that this instruction can go through either pipe with equal ease. The answer is that while the memory-accessing instructions other than MOV, PUSH, and POP listed in Table 20.1 (that is, INC/DEC [mem], ADD/SUB/XOR/AND/OR/CMP/ADC/SBB reg,[mem], and ADD/SUB/XOR/AND/OR/CMP/ADC/SBB [mem],reg/immed) can be paired, they do not provide the 100 percent overlap that we seek. If you look at Tables 20.1 and 20.2, you will see that instructions taking from 1 to 3 cycles can pair. However, any pair of instructions goes through the two pipes in lockstep. This means, for example, that if ADD [EBX],EDX is going through the U-pipe, and INC EAX is going through the V-pipe, the V-pipe will be idle for 2 of the 3 cycles that the U-pipe takes to execute its instruction, as shown in Figure 20.4. Out of the theoretical 6 cycles of work that can be done during this time, we actually get only 4 cycles of work, or 67 percent utilization. Even though these instructions pair, then, this sequence fails to make maximum use of the Pentium's horsepower.
The key here is that when two instructions pair, both execution units are tied up until both instructions have finished (which means at least for the amount of time required for the longer of the two to execute, plus possibly some extra cycles for pairable instructions that can't fully overlap, as described below). The logical conclusion would seem to be that we should strive to pair instructions of the same lengths, but that is often not correct. The actual rule is that we should strive to pair 1-cycle instructions (or, at most, 2-cycle instructions, but not 3-cycle instructions), which in turn leads to the corollary that we should, in general, use mostly 1-cycle instructions when optimizing. Here's why. The Pentium is fully capable of handling instructions that use memory operands in either pipe, or, if necessary, in both pipes at once. Each pipe has its own write FIFO, which buffers the last few writes and takes care of writing the data out while the Pentium continues processing. The Pentium also has a write-back internal data cache, so data that is frequently changed doesn't have to be written to external memory (which is much slower than the cache) very often. This combination means that unless you write large blocks of data at a high speed, the Pentium should be able to keep up with both pipes' memory writes without stalling execution. The Pentium is also designed to satisfy both pipes' needs for reading memory operands with little waiting. The data cache is constructed so that both pipes can read from the cache on the same cycle. This feat is accomplished by organizing the data cache as eight-banked memory, as shown in Figure 20.5, with each 32-byte cache line consisting of eight dwords, one in each bank. The banks are independent of one another, so as long as the desired data is in the cache and the U- and V-pipes don't try to read from the same bank on the same cycle, both pipes can read memory operands on the same cycle. 
(If there is a cache bank collision, the V-pipe instruction stalls for 1 cycle.) Normally, you won't pay close attention to which of the eight dword banks your paired memory accesses fall in--that's just too much work--but you might want to watch out for simultaneously-read addresses that have the same values for address bits 2, 3, and 4 (fall in the same bank) in tight loops, and you should also avoid sequences like

    mov bl,[esi]
    mov bh,[esi+1]

because both operands will generally be in the same bank. An alternative is to place another instruction between the two instructions that access the same bank, as in this sequence:

    mov bl,[esi]
    mov edi,edx
    mov bh,[esi+1]

By the way, the reason a code sequence that takes two instructions to load a single word is attractive in a 32-bit segment is that it takes only 1 cycle when the two instructions can be paired with other instructions; by contrast, the obvious way of loading BX,

    mov bx,[esi]

takes 1.5 to 2 cycles because the size prefix can't pair, as described below. This is yet another example of how different Pentium optimization can be from everything we've learned about its predecessors. The problem with pairing non-single-cycle instructions arises when a pipe executes an instruction other than MOV that has an explicit memory operand. (I'll call these complex memory instructions. They're the only pairable instructions, other than branches, that take more than 1 cycle.) We've already seen that, because instructions go through the pipes in lockstep, if one pipe executes a complex memory instruction such as ADD EAX,[EBX] while the other pipe executes a single-cycle instruction, the pipe with the faster instruction will sit idle for part of the time, wasting cycles. You might think that if both pipes execute complex instructions of the same length, then neither would lie idle, but that turns out not always to be the case.
Two 2-cycle instructions (instructions with register destination operands) can indeed pair and execute in 2 cycles, so it's okay to pair two instructions such as these:

    add esi,[SourceSkip]       ;U-pipe cycles 1 and 2
    add edi,[DestinationSkip]  ;V-pipe cycles 1 and 2

However, this beneficial pairing does not extend to non-MOV instructions with explicit memory destination operands, such as ADD [EBX],EAX. The Pentium executes only one such memory instruction at a time; if two memory-destination complex instructions get paired, first the U-pipe instruction is executed, and then the V-pipe instruction, with only one cycle of overlap, as shown in Figure 20.6. I don't know for sure, but I'd guess that this is to guarantee that the two pipes will never perform out-of-order access to any given memory location. Thus, even though AND [EBX],AL pairs with AND [ECX],DL, the two instructions take 5 cycles in all to execute, and 4 cycles of idle time--two in the U-pipe and two in the V-pipe, out of 10 cycles in all--are incurred in the process. The solution is to break the instructions into simple instructions and interleave them, as shown in Figure 20.7, which accomplishes the same task in 3 cycles, with no idle cycles whatsoever. Figure 20.7 is a good example of what optimized Pentium code generally looks like: mostly 1-cycle instructions, mixed together so that at least two operations are in progress at once. It's not the easiest code to read or write, but it's the only way to get both pipes running at capacity.

Superscalar Notes

You may well ask why it's necessary to interleave operations, as is done in Figure 20.7. It seems simpler just to turn

    and [ebx],al

into

    mov dl,[ebx]
    and dl,al
    mov [ebx],dl

and be done with it. The problem here is one of dependency. Before the Pentium can execute AND DL,AL, it must first know what is in DL, and it can't know that until it loads DL from the address pointed to by EBX.
Therefore, AND DL,AL can't happen until the cycle after MOV DL,[EBX] executes. Likewise, the result can't be stored until the cycle after AND DL,AL has finished. This means that these instructions, as written, can't possibly pair, so the sequence takes the same 3 cycles as AND [EBX],AL. (Now it should be clear why AND [EBX],AL takes 3 cycles.) Consequently, it's necessary to interleave these instructions with instructions that use other registers, so that this set of operations can execute in one pipe while another, unrelated set executes in the other pipe, as is done in Figure 20.7.

What we've just seen is the read-after-write form of the superscalar hazard known as register contention. I'll return to the subject of register contention in the next chapter; in the remainder of this chapter I'd like to cover a few short items about superscalar execution.

Register Starvation

The above examples should make it pretty clear that effective superscalar programming puts a lot of strain on the Pentium's relatively small register set. There are only seven general-purpose registers (I strongly suggest using EBP in critical loops), and it does not help to have to sacrifice one of those registers for temporary storage on each complex memory operation; in pre-superscalar days, we used to employ those handy CISC memory instructions to do all that stuff without using any extra registers. More problematic still is that for maximum pairing, you'll typically have two operations proceeding at once, one in each pipe, and trying to keep two operations in registers at once is difficult indeed. There's not much to be done about this, other than clever and Spartan register usage, but be aware that it's a major element of Pentium performance programming.
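The dependency chain is easiest to see cycle by cycle. The annotations below are mine, not from one of the book's figures; each instruction reads a register that the previous instruction writes, so none of the three can pair with its neighbor:

    mov dl,[ebx]  ;cycle 1: DL isn't valid until this load completes
    and dl,al     ;cycle 2: reads DL, so it can't pair with the load
    mov [ebx],dl  ;cycle 3: reads DL again, so it can't pair with the AND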
Also be aware that prefixes of every sort, with the sole exception of the 0FH prefix on non-short conditional jumps, always execute in the U-pipe, and that Intel's documentation indicates that no pairing can happen while a prefix byte executes. (As I'll discuss in the next chapter, my experiments indicate that this rule doesn't always apply to multiple-cycle instructions, but you still won't go far wrong by assuming that the above rule is correct and trying to eliminate prefix bytes.) A prefix byte takes 1 cycle to execute; after that cycle, the actual prefixed instruction itself will go through the U-pipe, and if it and the following instruction are mutually pairable, then they will pair. Nonetheless, prefix bytes are very expensive, effectively taking at least as long as two normal instructions, and possibly--if a prefixed instruction could otherwise have paired in the V-pipe with the previous instruction--taking as long as three normal instructions, as shown in Figure 20.8.

Finally, bear in mind that if the instructions being executed have not already been executed at least once since they were loaded into the internal cache, they can pair only if the first (U-pipe) instruction is not only pairable but also exactly 1 byte long, a category that includes only INC reg, DEC reg, PUSH reg, and POP reg. Knowing this can help you understand why timing sometimes reveals that your code runs slower than it seems it should, although this will generally occur only when the cache working set for the code you're timing is on the order of 8K or more--an awful lot of code to try to optimize.

It should be excruciatingly clear by this point that you must time your Pentium-optimized code if you're to have any hope of knowing whether your optimizations are working as well as you think they are; there are just too many details involved for you to be sure your optimizations are working properly without checking.
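The 16-bit-register case from earlier in this chapter illustrates the prefix cost. In a 32-bit segment, a 16-bit register operation carries an operand-size prefix byte; the specific instructions below are my example, but the cycle counts follow directly from the rules just stated:

    add bx,dx    ;operand-size prefix: 1 U-pipe cycle for the prefix,
                 ; then 1 cycle for the ADD itself--and the prefix
                 ; byte can't pair with the previous instruction
    add ebx,edx  ;no prefix: 1 cycle, pairable in either pipe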
My most basic optimization rule has always been to grab the Zen timer and measure actual performance--and nowhere is this more true than on the Pentium. Don't believe it until you measure it!

>>End of fragment