On programming forums, every once in a while someone appears who wants to optimize his program by replacing his current 3D vector class by one that uses SSE opcodes, in the hope he'll make his program run 4 times as fast.
Every seasoned programmer should immediately see why it's not going to work out that way - but rather than arguing the same points again and again in various forum threads, I decided to put them in writing once and for all, collecting most of the reasons why SSE vector classes are not as good an idea as they first seem.
Before I start, I want to make clear that this is only about the "classic" vector class layout with members x,y,z (3 components) plus an additional "w" for 4-component vectors, stored in a struct. I am well aware that it is possible to gain quite dramatic improvements in a relatively general framework when using other data layouts, but that's not the point of this article.
And, to be clear, I'm assuming you're using C/C++ and working on an x86-compatible platform (otherwise you shouldn't care about "SSE" in the first place, but hey, who knows).
First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.
In other words, if your vector class turns out to be a major performance hog (assuming you've actually measured it, which, sadly, most people don't), it's more likely caused by lots of temporaries and unnecessary moving around of memory than by the actual computations.
There are of course exceptions here, like inner loops for software-based skinning, where you're basically doing nothing but vector-matrix multiplies and weighted adds; however, in that case, using any kind of vector class is typically not the fastest way to do things anyway - in tight inner loops, expanding the calculations by hand and pulling common terms out is what matters, and you're not going to be able to do that if you hide everything behind your shiny vector class.
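To give a rough idea of what I mean, here's a small, made-up example: the same weighted blend written via operator-overloaded vector operations and expanded by hand. A decent compiler may well generate identical code for both after inlining, but in a real inner loop only the expanded form gives you room to pull common terms out and reorganize the computation:

```cpp
// A hypothetical, plain (non-SSE) vector class with overloaded operators.
struct Vec3 {
    float x, y, z;
    Vec3() {}
    Vec3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
};

inline Vec3 operator*(const Vec3 &v, float s)        { return Vec3(v.x * s, v.y * s, v.z * s); }
inline Vec3 operator+(const Vec3 &a, const Vec3 &b)  { return Vec3(a.x + b.x, a.y + b.y, a.z + b.z); }

// The convenient version: every operator produces a temporary.
Vec3 blend(const Vec3 &a, const Vec3 &b, const Vec3 &c, float w0, float w1, float w2)
{
    return a * w0 + b * w1 + c * w2;
}

// The hand-expanded version: one expression per component, nothing hidden.
Vec3 blend_expanded(const Vec3 &a, const Vec3 &b, const Vec3 &c, float w0, float w1, float w2)
{
    Vec3 r;
    r.x = a.x * w0 + b.x * w1 + c.x * w2;
    r.y = a.y * w0 + b.y * w1 + c.y * w2;
    r.z = a.z * w0 + b.z * w1 + c.z * w2;
    return r;
}
```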
We're now going to look at the somewhat dry but nonetheless very important aspects of memory usage and cache efficiency. It turns out that a typical SSE vector class isn't particularly glorious from that viewpoint.
If you want to use data with SSE instructions, you'd better make sure it's 16-byte aligned, because otherwise SSE arithmetic instructions can't use it directly at all, and you have to go through unaligned loads/stores, which are a lot slower and eat up most of the advantage we hoped to gain from using SSE (assuming there is any) in the first place. Also, SSE operates on chunks of 4 floats, so what you'll (normally) end up with is a 4-component vector that has to be 16-byte aligned. Ignoring the mathematical advantages of working with 4D homogeneous coordinates for now, this means that even our 3D coordinates (usually the vast majority) will be stored in 4 components, with the w component being 0 or 1 most of the time. We'll also have to make sure everything stays 16-byte aligned.
The latter is a rather annoying exercise in using compiler-specific language extensions and custom allocation routines to make sure that alignment inside structures is 16 bytes and that our heap allocations are 16-byte aligned. As SSE is inherently platform-specific anyway, that's not much of a problem, even though it is rather annoying to have such low-level tinkering spread throughout your whole codebase by way of your vector class. The former is more of a practical issue - we're talking about wasting 25% of memory for every instance of what is probably one of the most common data types in your program. Due to alignment and memory allocator implementation details, we always have some amount of memory waste in our data structures, but double-digit percentages are certainly pushing it. And having that much padding spread throughout your data isn't too nice on cache usage either. (If you don't know about cache optimization, you should read up on it right now, because unlike replacing random calculations with their SSE counterparts, cache-conscious design is actually going to give you some notable performance improvements.)
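To make the alignment part concrete, here's a minimal sketch of what such a padded, 16-byte-aligned vector type and its allocation tend to look like (the macro and function names are mine, and the exact extensions differ per compiler and header):

```cpp
#include <stddef.h>
#include <xmmintrin.h>  // _mm_malloc / _mm_free (exact header varies per toolchain)

// Compiler-specific alignment attribute; adjust for your toolchain.
#ifdef _MSC_VER
#define VEC_ALIGN __declspec(align(16))
#else
#define VEC_ALIGN __attribute__((aligned(16)))
#endif

// A "3D" vector padded to 4 floats so it fills one SSE register:
// 16 bytes per vector, even if you only ever use x, y and z.
struct VEC_ALIGN Vec4 {
    float x, y, z, w;
};

// Plain malloc/new typically only guarantee 8-byte alignment, so arrays of
// these need an aligned allocator as well.
Vec4 *alloc_vectors(size_t n) { return (Vec4 *)_mm_malloc(n * sizeof(Vec4), 16); }
void  free_vectors(Vec4 *p)   { _mm_free(p); }
```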
If that isn't enough to make you think twice, there's more. You might be tempted not to write the actual calculations in your vector class yourself, but to link against some library like D3DX that has "optimized implementations" of all of them instead.
For things like 4x4 matrices, where common operations (e.g. matrix multiply) actually involve a notable amount of computation, or quaternions, which are tricky if you're not used to the maths and which you're not going to use all over the place anyway, there's nothing wrong with that. But most calculations on 4D vectors are in the range of 4-9 scalar operations - not much. In fact, it turns out that in a typical vector operation, you spend more time reading values into CPU registers and writing them back to memory than you do on the calculations themselves. That's why you typically want your vector arithmetic to be inlined by the compiler - the cost of these computations is normally a lot less than the explicit and implicit costs of a function call. Which means that calling into external code to implement your vector operations is a very bad idea, no matter how well-optimized that code is.
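To put numbers on that, here is what two of the most common operations look like when written out - just a handful of scalar operations each (reusing the padded Vec4 layout sketched above); any out-of-line call costs more than the arithmetic itself:

```cpp
// A 4D dot product: 4 multiplies and 3 adds - a few cycles of actual work.
inline float dot(const Vec4 &a, const Vec4 &b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// A vector add: 4 scalar additions, plus the loads and the store around them.
inline Vec4 add(const Vec4 &a, const Vec4 &b)
{
    Vec4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}
```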
So you'll have to implement it yourself. Fine, there's nothing wrong with that, at least in principle. You'll probably now dig up an instruction reference and start reading about SSE registers, the instructions, and how to write SSE code yourself. Whoops - you don't have any experience writing x86 assembly? Tough luck, a lot more reading and learning then. And even if you do have experience, don't worry, the next showstopper is lurking just around the corner: assembly language isn't actually going to solve the problem either - you still have to get those values from memory and store them back afterwards. When your code gets inlined, you've saved the function call, but the resulting code will still consist of two-thirds moves from and to memory and only one-third actual computation. You don't need to be an optimization guru to notice that this is not a very efficient way of using your processor.
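To illustrate, here's roughly what an inlined SSE vector add boils down to, written here as GCC-style inline assembly (a sketch in AT&T syntax; all three pointers are assumed to be 16-byte aligned):

```cpp
// Two of the three instructions do nothing but move data;
// only one performs actual computation.
inline void vec4_add_asm(float *dst, const float *a, const float *b)
{
    __asm__ (
        "movaps (%1), %%xmm0\n\t"   // load a              (memory -> register)
        "addps  (%2), %%xmm0\n\t"   // the one real computation
        "movaps %%xmm0, (%0)\n\t"   // store the result    (register -> memory)
        :
        : "r"(dst), "r"(a), "r"(b)
        : "xmm0", "memory"
    );
}
```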
Now you might hope that "the compiler will optimize that away". But again, you're out of luck - no compiler I know of will change inline assembly code in any way. After all, there might be a good reason you're periodically writing everything to memory and reading it back again. Those variables might be used for communication between multiple threads, for example. Or the memory might not be real memory at all, but some hardware register you're accessing whose contents change on every read. In short, the compiler has no way of knowing what might be going on "beyond the surface" in inline assembly you write, so all compilers take the safe path, which is not touching such code at all. That's no problem when your fragment of inline assembly does a substantial amount of work - but it means a lot of overhead when your code is just one real computation sandwiched between two memory accesses.
Well, we're not through yet. With current compilers, there's another way to write SSE code: so-called intrinsics. I'm not going to give you the whole story here, but what it boils down to is that you get to treat SSE registers as if they were normal C++ variables, and the compiler exposes a lot of functions with funny names that compile down to single SSE/MMX/whatever instructions. In short, you still perform the instruction selection by hand, but the compiler handles instruction scheduling and register allocation for you. It turns out that this is a lot more convenient than writing SSE code completely by hand. And, even nicer, this way your compiler actually knows what you're trying to do, so it can both perform proper inlining and optimize unnecessary memory accesses away. So it not only looks quite convenient, it also seems like the first really workable solution we've encountered. And it actually is - with some minor drawbacks, the most important one being that compilers have a tendency to generate rather weird code in places where intrinsics are used - useless memory accesses, that kind of thing. So this is not going to be the solution if you're looking for ultimate performance - but it's a lot better than everything else we've seen so far, so we're gonna give it a shot and see how far we get with it.
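As a taste of what that looks like, here's a minimal sketch of a vector class built on intrinsics (the class name and the operators shown are made up for illustration; _mm_add_ps and friends are the actual SSE intrinsics):

```cpp
#include <xmmintrin.h>

// The compiler picks registers and schedules instructions, and because it
// understands __m128, it can inline this and drop redundant loads/stores.
// A struct containing an __m128 member is automatically 16-byte aligned.
struct VecSSE {
    __m128 v;

    VecSSE() {}
    explicit VecSSE(__m128 m) : v(m) {}
    // _mm_set_ps takes its arguments highest-lane first, hence the order.
    VecSSE(float x, float y, float z, float w) : v(_mm_set_ps(w, z, y, x)) {}
};

inline VecSSE operator+(const VecSSE &a, const VecSSE &b)
{
    return VecSSE(_mm_add_ps(a.v, b.v));
}

inline VecSSE operator*(const VecSSE &a, float s)
{
    return VecSSE(_mm_mul_ps(a.v, _mm_set1_ps(s)));
}
```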