Single Instruction Multiple Data

When math is your bottleneck, maybe when calculating the position of thousands of objects in a game engine or doing physics calculations, you want your math to process as quickly as possible. One of my mantras is that you’re not going to be smarter than your compiler, so I don’t generally advocate for optimizing math operations, loops, and other data structures before they prove themselves an issue, but that doesn’t mean you shouldn’t understand how the compiler is helping you and what tools you have available to help it along. One of those tools is Single Instruction Multiple Data, or SIMD.

What is SIMD?

SIMD allows the processor to perform the same operation on multiple values at the same time. At its highest level, it refers to the architecture and instructions that enable a CPU to perform SIMD operations. While it’s not strictly necessary, discussing CPU structure is helpful to understanding why SIMD is so valuable.

When a CPU does math, it has to load the operands and operation into registers. Registers are tiny amounts of storage directly on the CPU, and are the only place in your computer where work actually gets done. So to multiply 3 and 5, the CPU would load 3 into one register, 5 into another register, and finally load the multiply instruction. The result is written to a register (either one of the original two or a third, depending on the logic), and will either have further operations performed on it or be moved back to RAM.

Registers, despite being very small by storage standards, are still large enough to contain more than one number. You can imagine then, if we want to multiply 3 by 5, but also multiply 7 by 2, we could try moving both 3 and 7 into one register, and both 5 and 2 into the other; we would now have the “multiple data” part of SIMD in place. We can then apply a special multiply instruction that multiplies the two registers lane by lane, and 15 and 14 would be written to the result register.

The last piece is to understand that when your CPU asks for a chunk of data, it can ask for enough bytes to fill a register at once, as long as the data is contiguous in RAM. Only if 3 and 7 are right next to each other in RAM can register one efficiently be populated with both values, and the same has to be true for 5 and 2. Finally, the destination the results are written back to must also be contiguous to get the full benefit of SIMD operations.
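This lane-by-lane picture can be sketched with Swift’s built-in SIMD types. The SIMD2 type and the &* elementwise wrapping multiply are part of the standard library; the values mirror the example above:

```swift
// Pack the first operands (3 and 7) into one vector "register"
// and the second operands (5 and 2) into another.
let lhs = SIMD2<Int32>(3, 7)
let rhs = SIMD2<Int32>(5, 2)

// A single elementwise multiply produces both products at once.
let product = lhs &* rhs  // lanes are (15, 14)
```

One instruction, two multiplications: exactly the “single instruction, multiple data” idea.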

SIMD in Practice

Let’s look at a situation where we might use SIMD. Taking the example of a location in 3D space, it would be reasonable to store the data as x, y, and z properties in a struct.

struct Position {
    let x: Float
    let y: Float
    let z: Float
}

These values are referred to as scalar values. A scalar is a single value used to represent a single concept. When using scalar values, there is no guarantee that x, y, and z are stored in a contiguous area of RAM. As we learned, if they aren’t stored contiguously we cannot take advantage of SIMD instructions. The alternative to scalar values that addresses the contiguous storage requirement is vector storage. Vectors are contiguous areas of RAM that store a single data type, and in Swift, the SIMD family of types provides vector storage of values.

Using SIMD is straightforward, though you must choose a specific concrete type that defines the number of values in the vector storage. The most commonly used are SIMD3 and SIMD4, which can represent 3D or XYZW coordinates in space, or RGB and RGBA colors. Those use cases are so common that you can access the underlying values using .x, .y, and .z on SIMD3, and additionally .w on SIMD4. You can convert the scalar representation to a vector representation and maintain the dot notation for x, y, and z with minimal additional code.

struct Position {
    let position: SIMD3<Float>
    
    var x: Float { position.x }
    var y: Float { position.y }
    var z: Float { position.z }
}
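Once the storage is a SIMD3, arithmetic on the whole position becomes a single vector operation. A minimal sketch (the position and velocity values are illustrative):

```swift
// A position and a velocity stored as vectors.
var position = SIMD3<Float>(1, 2, 3)
let velocity = SIMD3<Float>(0.5, 0, -0.5)

// One vector add updates x, y, and z at once,
// instead of three separate scalar adds.
position += velocity  // lanes are now (1.5, 2.0, 2.5)

// Elementwise scalar multiplication comes along for free.
let scaled = position * 2  // lanes are (3.0, 4.0, 5.0)
```

This is the shape of the code you would write in a hot loop updating thousands of positions per frame.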

When to Use SIMD

We’ve seen that it’s easy to modify your code to use SIMD data structures, but we haven’t really seen what that means. When is it appropriate to use SIMD in your code? My default answer is always to wait until it presents an issue in your code, then consider if a data structure like SIMD can help unblock hot paths. But we can still talk about what kinds of hot paths are right for SIMD.

The Hard Rules

There are three hard requirements before we get to questions requiring engineering judgement.

  1. The data must all have the same type; the SIMD implementations are generic over a single data type.
  2. The data must be passed by value to reap the benefits of contiguous memory access, because data passed by reference requires an additional trip to the heap to access the data it contains. This is true for all properties of the type as well; they can’t be reference types either.
  3. The entirety of the type must be small enough to fit in a register at least twice. Registers have limited sizes, so if you want to perform the same operation on two pieces of data in one register, you need to be able to fit them both into that one register. Your speedup from using SIMD will be related to how many times your data can fit in a register.
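You can check the size requirement directly. MemoryLayout reports how many bytes each SIMD type occupies, which you can compare against your target’s vector register width (128 bits for NEON and SSE, 256 bits for AVX2, for example):

```swift
// SIMD2<Float> packs two 4-byte floats into 8 bytes.
print(MemoryLayout<SIMD2<Float>>.size)  // 8

// SIMD4<Float> is 16 bytes: it exactly fills a 128-bit register.
print(MemoryLayout<SIMD4<Float>>.size)  // 16

// SIMD3<Float> is padded out to 16 bytes, the same as SIMD4<Float>.
print(MemoryLayout<SIMD3<Float>>.size)  // 16
```

Note the SIMD3 padding: a three-element vector costs the same storage as a four-element one.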

The Judgement Call

Once you’ve cleared those hurdles, the first real question to ask is “is the data actually well represented by a contiguous chunk of memory?” In other words, will you generally be using all the data when you’re using any of it? Position, rotation, velocity, color, normals… Many representations of the three dimensional physical world we inhabit are excellent candidates. It’s not often that you only need to know the x coordinate of an object, or just the green channel; you’ll be manipulating all of the components at once. Using SIMD means that instead of fetching or writing three small numbers from memory, all of them can be loaded at once. The ease of copying and accessing contiguous memory will also make explicit copies or buffer creation faster. 3D rendering is probably the most common use case for needing to create and manipulate buffers.

Benchmarking SIMD

My hypothesis going into this article was that SIMD would provide significant increase in speed over scalar operations in all use cases. My thinking was that if I can ask my computer to do three or four times the work in the same amount of time I was asking it to do one operation, I’ll always come out on top. I wanted to test this hypothesis though, so I put together a benchmarking tool to help me investigate.

Initial Results

In each case, the scalar code beat the SIMD code, sometimes by a lot. It turns out that the LLVM compiler is incredibly good at optimizing looped scalar arithmetic operations! Optimized scalar operations were about 10% faster than the same operations using Swift’s SIMD types. Swift’s SIMD16 and SIMD32 types saw no benefit from SIMD operations at all, only extreme slowdown; in those cases the SIMD code was orders of magnitude slower.

After this lesson in how good modern compilers are, I rewrote the tool to remove compiler optimization from the scalar path. This doesn’t represent real-world use cases, but it does give us a tool for comparison.

Float

The first SIMD test run we can look at is Float. Performing two calculations at once with SIMD2 doubles the speed of the calculations. This happens again when we jump to SIMD4. That’s the final speed increase we get though, as we’ve saturated the register width. Widening the vector further will not speed things up.

Width     Scalar    SIMD     Speedup
SIMD2     25.93     12.50    2.07x
SIMD4     25.91      6.27    4.13x
SIMD8     26.27      6.16    4.26x
SIMD16    26.08      6.46    4.03x

Int8

The second test run we’ll look at is Int8. While Float is 32 bits wide, Int8 is only 8 bits wide. This means we have more headroom to improve performance with SIMD using the smaller data type. We see a speedup through SIMD16.

Width     Scalar    SIMD     Speedup
SIMD2     23.10     19.45    1.19x
SIMD4     23.59      7.89    2.99x
SIMD8     23.31      3.06    7.60x
SIMD16    23.32      1.49    15.58x

Compiler Optimization

I touched on it briefly in the introduction, but you should never underestimate the compiler. For the benchmark, without forcing the compiler to leave the code unoptimized, the scalar implementation was dramatically faster than the SIMD implementation. The compiler can be better or worse at optimizing your workload, so actually testing is very important. The benchmark also runs reduction operations, which the compiler is much worse at optimizing away, and there SIMD takes the crown back from optimized scalar operations.

SIMD in Your Project

There is so much to unpack in a discussion about using SIMD data types. We touched on hardware architecture, data structures, memory packing, and benchmarking, all to hopefully shed a little light on a somewhat niche data structure. Modern object oriented programming rarely requires a more data oriented approach, but I hope I shared a little knowledge of what’s available and some of the many variables that should be taken into account before implementing SIMD in your project.


Seeded Random Number Generation, part 2

Once we have the interface for our SeededRandomNumberGenerator and a very naïve implementation in place, the reasonable question to ask is “How do we make this better?” If you recall from part 1 of this discussion, the naïve implementation actually does well on a number of qualities we expect of a random number generator, namely determinism and periodicity (insofar as it’s predictable and uses the whole number space). Where it falls short is in predictability. We will spend our time here discussing how to reduce the predictability of our generator.

Linear Congruential Random

The linear congruential generator represents the first real step up from the trivial incrementing generator. It introduces feedback (each output depends on the previous one), but not enough nonlinearity to be suitable for high-quality randomness.

struct LinearCongruentialRandom: SeededRandomNumberGenerator {
    typealias Seed = UInt64

    private let a: UInt64 = 6364136223846793005
    private let c: UInt64 = 1

    private var state: Seed

    public let seed: Seed

    init(seed: Seed) {
        self.seed = seed
        self.state = seed
    }

    mutating func next() -> UInt64 {
        state = state &* a &+ c
        return state
    }
}

This is a much better generator, while still being very fast. It has the key quality of a seeded random number generator: it will produce the same values given the same seed. And its period is high, as it covers the entire number space of the seed (with the right values for a and c). Recall the problem with the incremental random number generator: you could predict the next number from the current number. While that’s not entirely true with this generator, take a look at this string of ten outputs.

0110111101100000010100011100001111001011110101000001101010100000
0111111001000000010010100000010000001100100111100000111000100001
0000101111010000001110001111010110100101111111111101101011001110
1000100111010001010000011001110110011101011010111010100000110111
1101011001110110010011010001000101010000011000101101101010101100
1100010000110000101001011001110010000001111101111100010000111101
1000100001000111100101110010110100010110011010001100000110111010
1101110101100011011010110011100011100001110001110101001110110011
1101100101000000111110010101010101101101101111101000001101111000
1001100000101110001000000111011111110010100011011010010000011001

If you examine the low bit, you can see that it flips between 1 and 0. This is an example of an unintended pattern created by our implementation. This pattern creates predictability. Given any number output by this generator, you could correctly predict the low bit of the next one. For some use cases this may not be a problem, but patterns that expose themselves in your random numbers can show up as patterns where you put them to use, such as commonly suggesting similar passwords or terrain generation that has repetitive features.
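The alternating low bit is easy to verify. Modulo 2, the update state = state &* a &+ c reduces to state + 1 whenever a and c are both odd, so the low bit must flip on every step. A quick check using the same constants as the generator above:

```swift
// Same constants as the LinearCongruentialRandom generator.
let a: UInt64 = 6364136223846793005
let c: UInt64 = 1
var state: UInt64 = 0x123456789ABCDEF  // any odd starting seed

// Collect the low bit of eight consecutive outputs.
var lowBits: [UInt64] = []
for _ in 0..<8 {
    state = state &* a &+ c
    lowBits.append(state & 1)
}
print(lowBits)  // [0, 1, 0, 1, 0, 1, 0, 1] — strictly alternating
```

Since a and c are both odd, an odd state always produces an even one and vice versa, regardless of the seed you choose.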

Bit Mixing

In a random number generator, patterns are often the result of any particular bit being manipulated a small number of times. In the Linear Congruential generator, each bit is multiplied by a large odd number. This is a small amount of manipulation for each bit, and the manipulation comes from a static source, the constant a.

If the Linear Congruential algorithm has good randomness at higher bits but poorer at lower bits, what if we improved our generator by “borrowing” some of that high quality randomness?

XorShift64*

The aptly named XorShift64* (pronounced “xor shift 64 star”) introduces bit mixing throughout the length of the state. The idea is that all bits have a chance to mix, with a focus on mixing the high and low bits. There are three shift and xor combos. Together, they ensure the high and low bits mix by moving information from the low bits to the high positions and vice versa through a series of shifts. After the shifts, we multiply by a constant just like in the Linear Congruential generator. All this shifting ensures that any pattern in the high or low bits introduced by the multiply is erased, or at least heavily obfuscated.

struct XorShift64Star: SeededRandomNumberGenerator {
    typealias Seed = UInt64

    private let shift1: UInt64 = 12
    private let shift2: UInt64 = 25
    private let shift3: UInt64 = 27

    private let multiplier: UInt64 = 2685821657736338717

    private let alternate: UInt64 = 0x9E3779B97F4A7C15

    public let seed: Seed

    private var state: UInt64

    /// Create a generator. If `seed == 0`, we remap to a fixed odd constant.
    /// Seeding with zero produces only zero for the lifetime of the generator
    init(seed: Seed) {
        self.seed = seed
        self.state = (seed == 0) ? alternate : seed
    }

    mutating func next() -> UInt64 {
        var x = state
        x ^= x >> shift1
        x ^= x << shift2
        x ^= x >> shift3
        state = x
        return x &* multiplier
    }
}

The shifts are chosen to ensure significant overlap between high and low bits, so they’re a relatively large fraction of the state width. The shift sizes differ from one another to avoid repeatedly shifting the same bits. These specific values were identified when the algorithm was created as empirically providing a good balance of full mixing and statistical properties.

What’s next?

From this point, the algorithms we could implement get progressively better at bit mixing, generally by applying more and more complicated mixing strategies. I won’t show implementation details for any more algorithms, but by now you should understand the general strategy a seeded random number generator can employ to do its job. There are a few noteworthy strategies that can be used to improve our approach, and I’ll briefly touch on a few.

Increasing Internal State

PCG64 increases the quality of random number generation by storing more state. It uses a 128 bit state seeded by generating two 64 bit random numbers using a separate bit mixing step, using the first number as the low bits and the second as the high bits. This 128 bit number is then manipulated in a similar way to the generators we’ve examined, and we return 64 bits derived from this state. The increased mixing provided by a 128 bit state dramatically improves the quality of the generator, and actually means the period is larger than the number space of the seed.

Increasing Mixing Steps

Philox4x32_10 uses a similar state size to PCG64, but to increase quality it uses 10 rounds of mixing whenever the state is updated. The mixing used in each round is similar in complexity to our lower quality random number generators. This is one round of mixing in Philox4x32_10.

private mutating func round(
    _ c: (UInt32, UInt32, UInt32, UInt32), key: (UInt32, UInt32)
) -> (UInt32, UInt32, UInt32, UInt32) {
    // 32x32 -> 64 multiplies (use 64-bit intermediates to extract hi/lo).
    // Self.m0 and Self.m1 are the Philox multiplier constants
    // (0xD2511F53 and 0xCD9E8D57), defined elsewhere in the struct.
    let p0 = UInt64(Self.m0) &* UInt64(c.0)
    let p1 = UInt64(Self.m1) &* UInt64(c.2)
    let lo0 = UInt32(truncatingIfNeeded: p0)
    let hi0 = UInt32(truncatingIfNeeded: p0 >> 32)
    let lo1 = UInt32(truncatingIfNeeded: p1)
    let hi1 = UInt32(truncatingIfNeeded: p1 >> 32)

    // Permutation + xors with key
    let n0 = hi1 ^ c.1 ^ key.0
    let n1 = lo1
    let n2 = hi0 ^ c.3 ^ key.1
    let n3 = lo0
    return (n0, n1, n2, n3)
}

Despite the increased number of rounds, this algorithm is still very fast. Taking it one step further, the ChaCha20 algorithm increases the mixing to 20 rounds, while also increasing the complexity of the stored state. ChaCha20 performs 8 quarter rounds each loop for 10 loops, leading to 20 total rounds of high quality mixing. This algorithm is the only cryptographically acceptable one I’ve discussed, and is much slower than the rest.

private mutating func generateBlock() {
    var x = state
    for _ in 0..<10 {
        qround(&x, 0, 4, 8, 12)
        qround(&x, 1, 5, 9, 13)
        qround(&x, 2, 6, 10, 14)
        qround(&x, 3, 7, 11, 15)
        qround(&x, 0, 5, 10, 15)
        qround(&x, 1, 6, 11, 12)
        qround(&x, 2, 7, 8, 13)
        qround(&x, 3, 4, 9, 14)
    }
    for i in 0..<16 { block[i] = x[i] &+ state[i] }
    state[12] &+= 1
    index = 0
}

private func qround(_ x: inout [UInt32], _ a: Int, _ b: Int, _ c: Int, _ d: Int) {
    x[a] &+= x[b]; x[d] ^= x[a]; x[d] = rotl(x[d], 16)
    x[c] &+= x[d]; x[b] ^= x[c]; x[b] = rotl(x[b], 12)
    x[a] &+= x[b]; x[d] ^= x[a]; x[d] = rotl(x[d], 8)
    x[c] &+= x[d]; x[b] ^= x[c]; x[b] = rotl(x[b], 7)
}

Just the Beginning

This discussion is meant as the briefest of introductions to the subject of seeded random number generators. I hope though that it provides enough information to have some basic grasp of the uses of seeded random number generators, what properties belong in a discussion of their quality, and what levers we have available to adjust those properties.


Seeded Random Number Generation, part 1

Randomness is relatively easy to come by in Swift. Most types you might want to randomize support the method random(in:), which takes a closed range and returns a value within the range. But there are use cases where you don’t just want randomness, but predictable randomness. This is a concept called seeded randomness, where you provide a seed (in the form of a string, a number, or some bytes) to the data structure that provides your randomness. That seed ensures the generator will return the same sequence of random values every time. Predictable randomness! This needs unpacking.

The Swift protocol

Swift provides the protocol RandomNumberGenerator that requires implementation of one method:

func next() -> UInt64

This is the protocol we’ll focus our discussion on, with the following expansion:

protocol SeededRandomNumberGenerator: RandomNumberGenerator {
    typealias Seed = UInt64

    var seed: Seed { get }

    init(seed: Seed)
}

We will enforce an initializer that takes a seed in the form of a UInt64, and we expose the seed for examination. The rest of our discussion will focus on the implementation of the next method in this protocol.

The use case

Generally, the reason you’re reaching for a random number is that you don’t want predictability, but there are use cases where you want the impression of randomness without the problems that actual randomness introduces.

Perhaps the most universal reason to reach for seeded randomness is testing. In production, randomness may be essential to the proper function of your app or service, but in testing, determinism is paramount. Instead of reaching for random(in:), you can inject any RandomNumberGenerator into your class. In production code, you can use the SystemRandomNumberGenerator provided by Swift. In test, you can supply a SeededRandomNumberGenerator for determinism.
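As a sketch of that injection pattern, make the production code generic over RandomNumberGenerator so tests can swap in a deterministic generator. The FixedGenerator here is a hypothetical test double, not a standard library type:

```swift
// Production code accepts any RandomNumberGenerator.
func rollDie<G: RandomNumberGenerator>(using generator: inout G) -> Int {
    Int.random(in: 1...6, using: &generator)
}

// Hypothetical test double that replays a fixed sequence of raw values.
struct FixedGenerator: RandomNumberGenerator {
    var values: [UInt64]
    mutating func next() -> UInt64 { values.removeFirst() }
}

// In a test, the same raw values always produce the same roll.
var testGenerator = FixedGenerator(values: [0, 1, 2, 3])
let roll = rollDie(using: &testGenerator)
```

In production you would call rollDie with a SystemRandomNumberGenerator; the function body never changes.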

What got me interested in seeded random number generation is its use in games. It’s common for terrain in a game to be randomly generated as players explore. How do you ensure that two players on the same map generate the same random terrain? Seeded random number generation! How do you allow players to share maps and starting points with each other? You can generate and share the entire map, or you can implement seeding and just share the seed.

Our first attempt

We’ll call this our incremental random.

///  It would be challenging to come up with a *more basic* generator.
struct IncrementalRandom: SeededRandomNumberGenerator {
    typealias Seed = UInt64

    private var state: Seed

    public let seed: Seed

    init(seed: Seed) {
        self.seed = seed
        self.state = seed
    }

    mutating func next() -> UInt64 {
        state &+= 1
        return state
    }
}

It accepts a seed, then when next is called it increments the state by one and returns that value. When it reaches the maximum value of a UInt64 it wraps around and begins at zero. While it may fail our intuitive understanding of a random number generator, it actually passes a surprising number of tests for a random number generator, and we’ll start our discussion with those.

Determinism

The most important quality of a seeded random number generator is that for a given seed, the values returned are the same every time and in the same order. For IncrementalRandom, it’s easy to see that for a seed x, the returned values will be x+1, x+2, and so on. This is the core competency of the seeded random number generator.

Periodicity

Periodicity refers to the number of calls to next required until the random number sequence repeats. Generally, the goal is to maximize the size of the period. For UInt64, our period can be as large as the max value of UInt64. We can also validate this pretty intuitively for our IncrementalRandom seeded generator… It won’t repeat a value until it passes through the entire space of UInt64 and returns to our seed. This is as good as it gets for periodicity, and so far it seems like our simple implementation is doing kind of well, but that won’t last. We’ll start to examine its shortcomings next.
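The wrap-around at the top of the number space is worth seeing directly. This snippet redeclares a stripped-down copy of IncrementalRandom so it stands alone:

```swift
// Stripped-down copy of the article's IncrementalRandom.
struct IncrementalRandom: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 1  // &+ wraps on overflow instead of trapping
        return state
    }
}

// Start one step below the top of the UInt64 space.
var generator = IncrementalRandom(seed: UInt64.max - 1)
let first = generator.next()   // 18446744073709551615 (UInt64.max)
let second = generator.next()  // 0 — wrapped around
```

Only after a full 2^64 calls would the generator return to the seed, which is why its period is as large as the type allows.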

Predictability

Intuitively, I’m sure you can already tell that our random number generator fails at this aspect of its task… A seeded random number generator is never truly random, but the numbers need to feel random. For IncrementalRandom, this failure manifests itself when you can tell for any given random value what the next in the sequence is. When the predictability is this obvious, it’s easy to understand the algorithm’s shortcoming, but as we increase in complexity, what we mean by predictability will get harder to intuit.

As a metric, we want predictability to be low. In a game, this is what will make a river’s flow look natural, and prevent mountain ranges from having peaks that decrease in height from east to west. In a non-game scenario, if you’re using a random number generator to assign work tasks to agents, predictability (as manifested by correlation) could be seen in the system as some agents getting assigned more or fewer tasks than others, or some agents often getting assigned tasks after others. Whether or not this is a big deal relates to how important true randomness is to your app.

To be continued

We can do much better than IncrementalRandom, but it was a useful way to introduce these topics in a way that can be intuited. We’ll look deeper at these concepts in part 2.


The Swift AWS Lambda Runtime

When we talk about Lambda runtimes and running Swift code in Lambda, it helps to start by understanding what a traditional Lambda implementation looks like. Selecting a provided runtime gives your code a scaffolding to hook into that does the heavy lifting required for Lambda to function.

When providing a Java Lambda implementation, for example, you write a class that is typed to the Request and Response that your architecture will provide to and expect from the Lambda function respectively.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

/**
 * Lambda handler written in Java that receives HTTP requests and returns a response.
 */
public class HttpHandler implements RequestHandler<Request, Response> {
  // perform logic on the request and return response
}

The only code you write is the class that implements RequestHandler and any supporting code that object needs to function. The Java Lambda runtime provides all the networking code and object serialization required to handle requests. Under the hood, the runtime is making calls to Lambda APIs to poll for events; behavior that is abstracted away by the runtime code. But if you don’t want to write Java Lambdas…

The Alternative

While reading the list of provided runtimes linked above, two stand out: the OS-only runtimes. To deploy a Swift lambda, you must use the AL2 runtime and perform all the tasks that were previously abstracted away. AWS doesn’t provide the code needed to interface with the Lambda service, but provides documentation for the APIs that allow you to do the heavy lifting yourself.

You would of course need to write an HttpHandler in Swift and any supporting code for that object, but you are also now responsible for the networking code and object serialization required to wire up that class with the Lambda service itself. This is what the Swift AWS Lambda Runtime provides. It is a project that bridges that gap and makes Swift lambdas practical.

In contrast with the Java lambda, we are providing a main method for the runtime, which was notably absent in the Java example. Our code is responsible for the entire workflow and is loaded directly when the OS only runtime boots!

import AWSLambdaRuntime

/// Lambda handler written in Swift that receives HTTP requests and returns a response.
@main
struct HttpHandler: LambdaHandler {

  func handle(_ event: Request, context: LambdaContext) async throws -> Response {
    // perform logic on the request and return response
  }
}

The elegance of the Swift AWS Lambda Runtime is encapsulated in the @main annotation on the LambdaHandler.

A Beautiful Abstraction

Aside from having to learn slightly more about the inner workings of a Lambda runtime, once you import and use the Swift AWS Lambda Runtime in your package there is not a significant practical difference in what you’re coding. The runtime is provided by the implementation of main that the Lambda Runtime package provides. When your type implements SimpleLambdaHandler or LambdaHandler, an extension on those protocols provides a main that calls the Lambda runtime loop that integrates your lambda code.

It’s a very clean abstraction for a huge amount of boilerplate code that you would otherwise have to write. When compared to the AWS provided runtime implementations, the work required for a Swift developer is nearly identical despite the increased complexity behind the scenes.

Diving Deeper

To explore Swift on Lambda, I wrote a small project that uses the Swift AWS Lambda runtime along with CDK to build and deploy multiple Lambda functions that back an API Gateway endpoint. The functions are built as separate products using Swift Package Manager while sharing common code provided by a third SPM package in the same repository. You can view the source code on GitHub.