Fast Recalculation for Throbol 2022-9-7
Throbol needs to recalculate spreadsheets quickly in three contexts:
- it has to run in real time while controlling a robot, extending the time series of every cell 100 or 1000 times per second.
- it has to recalculate the sheet to show the new behavior when a parameter is changed. To give a snappy interactive feel, it should do this in under 15 ms.
- when visualizing a policy terrain surface, it needs to calculate hundreds of versions of the same spreadsheet, all with slightly different parameters.
Recalculating a sheet with 1000 cells over 1000 timesteps involves 1M cell evaluations, each one involving several opcodes. Opcodes range from simple arithmetic ops, to matrix multiplies, to assembling 3D graphics primitives.
The current implementation can execute about 11.5 billion opcodes per second using all the cores on a MacBook Pro (2021). It achieves this speed through a combination of SIMD and multicore execution, as well as some careful coding. The rest of this article describes how it all works.
Overview
First, Throbol parses the spreadsheet into an AST. Parsing each cell doesn't require context from other cells, so it can parse incrementally. (This isn't true of languages like C++, where you can't even parse without knowing all the typedefs).
Then it scans the AST for each cell to determine the cell's type. This may depend on other cells, so it scans recursively. In most cases it can deduce types without any annotations, except when a cell depends on itself. This is common for an integrator cell, like:

```
foo.pos = foo.pos[-dt] + dt * foo.vel
```
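As plain C++, the recurrence this cell expresses is just a running sum. The sketch below uses this example's own names (`integratePos`, `vel`), not Throbol's API:

```cpp
#include <vector>

// Sketch of what the integrator cell computes: each timestep's position
// extends the series from the previous value, which is why the cell
// depends on itself at [-dt]. Names here are illustrative.
std::vector<float> integratePos(std::vector<float> const &vel, float dt) {
  std::vector<float> pos;
  float p = 0.0f;        // the value of foo.pos before the series begins
  for (float v : vel) {
    p = p + dt * v;      // foo.pos = foo.pos[-dt] + dt * foo.vel
    pos.push_back(p);
  }
  return pos;
}
```

The self-dependency is what forces a type annotation (or a special case) in inference: the type of `foo.pos` is needed to deduce the type of `foo.pos`.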
Then it traverses the cells in dependency order, emitting code. For example, in the program

```
bar = foo + 1
foo = 1
```

it has to calculate `foo` before `bar`.
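The ordering step can be sketched as a depth-first search over a name → dependencies map. This is an assumption about the approach, not Throbol's actual code; note that a self-reference like `foo.pos[-dt]` reads the previous timestep, so it's skipped rather than treated as a cycle:

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: emit each cell after its dependencies.
// Assumes the sheet has no (non-self) dependency cycles.
std::vector<std::string> emitOrder(
    std::map<std::string, std::set<std::string>> const &deps) {
  std::vector<std::string> order;
  std::set<std::string> visited;
  std::function<void(std::string const &)> visit =
      [&](std::string const &name) {
        if (!visited.insert(name).second) return;  // already handled
        auto it = deps.find(name);
        if (it != deps.end()) {
          for (auto const &d : it->second) {
            if (d != name) visit(d);  // skip self-dependency
          }
        }
        order.push_back(name);  // all dependencies emitted first
      };
  for (auto const &entry : deps) visit(entry.first);
  return order;
}
```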
Emitting code is done directly from the AST. The code it emits is essentially a sequence of functions to call. In C++ this is implemented as a `vector<function<void(CpuState &cpu, Pad &pad, auto batchdef)>>`, where `cpu` holds some global state and `pad` holds temporary values.
Once the code has been emitted, it can be executed by C++ code like

```cpp
for (auto &op : code) {
  op(cpu, pad, batchdef);
}
```
Although `std::function` can be inefficient, it works pretty well here. Overhead is around 10 machine instructions per opcode, so it's important that opcodes do a lot.
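A self-contained toy version of this scheme shows the whole pipeline of emitting closures into a tape and running them. `CpuState` and `Pad` here are stand-ins far simpler than Throbol's, and the `batchdef` parameter is omitted:

```cpp
#include <functional>
#include <vector>

// Toy stand-ins for Throbol's structures
struct CpuState {};                       // global state (empty here)
struct Pad { std::vector<float> regs; };  // temporary values, one slot per register

// an opcode is just a closure over its operand slots
using Op = std::function<void(CpuState &, Pad &)>;

// "emit" an add opcode reading pad slots a and b, writing slot r
Op emitAdd(int a, int b, int r) {
  return [=](CpuState &, Pad &pad) {
    pad.regs[r] = pad.regs[a] + pad.regs[b];
  };
}

// the inner interpreter loop from the article
void run(std::vector<Op> const &code, CpuState &cpu, Pad &pad) {
  for (auto &op : code) {
    op(cpu, pad);
  }
}
```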
SIMD
Because calculating each spreadsheet update uses straight-line code with no branches, it's easy to execute several spreadsheets in parallel in SIMD fashion. This is useful for interactive parameter tuning, where many combinations of parameter values are evaluated to show in the UI. The current implementation runs 8 lanes in parallel, and most operations can use SIMD instructions on x86 or ARM CPUs. So in a few instructions it can apply an operation to 8 separate spreadsheet calculations.
Throbol has around 600 opcodes, so it's important that they're easy to implement. With some C++ cleverness, most can be a single line of code. For instance, here's a simplified version of add (the name for the `+` operator):

```cpp
defemit(add) // (simplified)
{
  defop2(float, float, float, r = a + b);
  defop2(vec2, vec2, vec2, r = a + b);
  // ... more type combinations ...
  return setInvalidArgs();
}
```
The last argument to `defop2` is compiled in an environment where the arguments are assigned (with appropriate C++ types) to `a`, `b`, `c`, ..., while a reference to the result is in `r`.
`defop2` is a macro that expands to (approximately):
```cpp
// if 2 arguments of type (float, float)
if (evalArgs() == 2 && args[0].ist<float>() && args[1].ist<float>()) {
  // allocate space for a return value of type float in the pad
  auto rv = allocResult<float>();
  // append this lambda to the code
  evalop([rv, av=args[0], bv=args[1]](CpuState &cpu, Pad &pad, auto batchdef) {
    for (int batchi = 0; batchi < BATCH_SIZE; batchi++) {
      // bind a and b as const refs into the pad
      float const &a = pad.rdreg<float>(av, batchi);
      float const &b = pad.rdreg<float>(bv, batchi);
      // bind r as a mutable ref
      float &r = pad.wrreg<float>(rv, batchi);
      // perform the addition. This line is interpolated directly from the 4th macro argument
      r = a + b;
    }
  });
  return rv;
}
```
With some carefulness, the above code can be compiled down to 8 instructions on ARM: 4 loads, 2 adds, 2 stores. This is possible because in the pad, the values for each of the SIMD lanes are allocated sequentially. The compiler can see that inside `pad.rdreg`, `batchi` indexes sequential locations, so it can use SIMD load instructions.
Multicore
Throbol spawns a worker thread for each CPU core to run updates. Each of these sits in a loop, taking work from a common queue. When the queue is non-empty, it will batch together up to 8 compatible recalculation jobs.
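The worker-loop pattern can be sketched with a mutex and condition variable. Names here (`WorkQueue`, `workerLoop`) are illustrative, and the batch-up-to-8-compatible-jobs step is elided:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Sketch of one thread-per-core workers draining a shared queue
struct WorkQueue {
  std::mutex mtx;
  std::condition_variable cv;
  std::queue<std::function<void()>> jobs;
  bool done = false;

  void push(std::function<void()> job) {
    {
      std::lock_guard<std::mutex> lk(mtx);
      jobs.push(std::move(job));
    }
    cv.notify_one();
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> lk(mtx);
      done = true;
    }
    cv.notify_all();
  }

  // each worker thread runs this; returns once shut down and drained
  void workerLoop() {
    while (true) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return done || !jobs.empty(); });
        if (jobs.empty()) return;
        job = std::move(jobs.front());
        jobs.pop();
      }
      job();  // a real worker would batch compatible recalculation jobs here
    }
  }
};
```

Each worker holds the lock only while popping, not while running the job, so jobs execute in parallel across cores.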
Summary
Throbol gets 8x performance from SIMD, and another 8x from multicore, when doing many recalculations in parallel.