In a plugin chain, every input frame has to be processed by every plugin in sequential order. How can a multi-core processor improve performance in this case? The solution is a technique called pipelining, which resembles an assembly line. Suppose we have three plugins, A, B, and C, each running on its own core. To simplify the explanation, we’ll assume that each plugin takes the same amount of time. Let’s say the pipeline is initially empty. While plugin A is processing the first frame, B and C have nothing to do, but after the first frame moves to plugin B, plugin A can start processing the second frame. After the first frame moves to plugin C, the second frame moves to B, A begins processing the third frame, and the pipeline is now full. So long as more input is available, the pipeline stays full, and our throughput is three times what it would be without pipelining, because the three plugins are working in parallel.
Pipelining demo: Symmetric loading
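To make the mechanics concrete, here is a minimal sketch of such a pipeline in C++: three threads standing in for plugins A, B, and C, connected by single-slot blocking queues. Everything in it (the SingleSlot class, the integer stand-in for a frame, the per-stage arithmetic) is illustrative, not taken from any real plugin host.

    // Minimal pipeline sketch: three stages on three threads, connected
    // by single-slot blocking queues. All names are illustrative.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <thread>

    // A blocking queue that holds at most one item, so each stage can
    // work exactly one frame ahead of the next.
    template <typename T>
    class SingleSlot {
    public:
        void put(T item) {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notFull.wait(lock, [this] { return !m_item.has_value(); });
            m_item = std::move(item);
            m_notEmpty.notify_one();
        }
        T take() {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notEmpty.wait(lock, [this] { return m_item.has_value(); });
            T item = std::move(*m_item);
            m_item.reset();
            m_notFull.notify_one();
            return item;
        }
    private:
        std::mutex m_mutex;
        std::condition_variable m_notFull, m_notEmpty;
        std::optional<T> m_item;
    };

    using Frame = int;  // stand-in for real frame data

    int main() {
        const int kFrames = 10;
        SingleSlot<Frame> aToB, bToC, cToOut;
        // Each stage loops: take a frame from upstream, "process" it,
        // pass it downstream. Once the pipeline is full, three frames
        // are in flight at once, one per stage.
        std::thread a([&] {
            for (int i = 0; i < kFrames; i++)
                aToB.put(i + 1);                // plugin A (source stage)
        });
        std::thread b([&] {
            for (int i = 0; i < kFrames; i++)
                bToC.put(aToB.take() + 10);     // plugin B
        });
        std::thread c([&] {
            for (int i = 0; i < kFrames; i++)
                cToOut.put(bToC.take() + 100);  // plugin C
        });
        for (int i = 0; i < kFrames; i++)
            std::printf("frame %d\n", cToOut.take());  // display in order
        a.join(); b.join(); c.join();
        return 0;
    }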
A pipeline’s throughput gain comes not from shortening or eliminating stages, but from keeping all the stages busy by letting them work ahead. Working ahead has an interesting side effect: it increases latency, or how far behind real time the system is. This becomes important when plugin parameters are controlled by user input. The greater the latency, the more the output lags behind the parameter changes. Suppose our example pipeline is full and we change a parameter of plugin A. The next two frames displayed won’t reflect the change, because they’re in plugins B and C, which means they’ve already passed through A and can’t be affected by it anymore. Note however that if we change a parameter of plugin C, the next displayed frame will reflect the change, because C is the final stage. In a pipelined system, latency is proportional to distance from the output: the earlier the stage, the further ahead we are, so the greater the latency.
Pipelines exhibit latency not only during parameter changes, but also during startup. In our example, assuming the processing time per plugin is T, the wait for the first frame is 3T, because it must pass through all three plugins, but once the pipeline is full we get a new frame every T. A pipeline behaves like a garden hose: when the hose is empty, there’s a delay between turning on the tap and water coming out, but once the hose is full, there’s no delay.
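The timing arithmetic is easy to state in general. In the sketch below (the function names are ours, not from any real API), an N-stage pipeline with per-stage time T delivers frame k at time N*T + (k-1)*T, and a parameter change at a given stage lags the output by the number of stages downstream of it.

    // Timing arithmetic for an N-stage pipeline with equal stage time T.
    // Function names are illustrative, not from any real API.
    #include <cstdio>

    // Time at which frame k (1-based) emerges: the first frame must
    // traverse all N stages; each later frame follows T behind.
    double frameDoneTime(int k, int numStages, double T) {
        return numStages * T + (k - 1) * T;
    }

    // Frames of latency for a parameter change at a given stage
    // (1-based): frames already past that stage can't be affected.
    int paramLagFrames(int stage, int numStages) {
        return numStages - stage;
    }

    int main() {
        const int N = 3;
        const double T = 1.0;
        std::printf("frame 1 at %.0fT, frame 2 at %.0fT\n",
            frameDoneTime(1, N, T), frameDoneTime(2, N, T));  // 3T, 4T
        std::printf("change at A lags %d frames, at C lags %d\n",
            paramLagFrames(1, N), paramLagFrames(3, N));      // 2, 0
        return 0;
    }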
For simplicity’s sake we’ve assumed each plugin can only get ahead by one frame, but in practice this is implementation-dependent. In a typical implementation, each plugin has an input queue, which can store at least one frame separate from the one currently being processed. Increased queuing can improve throughput, but worsens latency.
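One plausible shape for such an input queue is a bounded blocking queue whose capacity sets how far upstream stages may work ahead; the sketch below is only an illustration of the idea, not any particular host’s implementation.

    // A bounded blocking queue of configurable depth. Capacity controls
    // how far a producer may work ahead of its consumer. Illustrative.
    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <mutex>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : m_capacity(capacity) {}
        void put(T item) {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notFull.wait(lock, [this] { return m_items.size() < m_capacity; });
            m_items.push_back(std::move(item));
            m_notEmpty.notify_one();
        }
        T take() {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notEmpty.wait(lock, [this] { return !m_items.empty(); });
            T item = std::move(m_items.front());
            m_items.pop_front();
            m_notFull.notify_one();
            return item;
        }
    private:
        const std::size_t m_capacity;
        std::mutex m_mutex;
        std::condition_variable m_notFull, m_notEmpty;
        std::deque<T> m_items;
    };

    int main() {
        BoundedQueue<int> q(2);  // upstream may get two frames ahead
        q.put(1); q.put(2);      // a third put would block until a take
        std::printf("%d %d\n", q.take(), q.take());
        return 0;
    }

With a capacity of 1 this behaves like the single-slot queue in the first sketch; a larger capacity smooths out uneven stage times, but holds more frames in flight, and therefore adds latency.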
We’ve also assumed that each plugin takes the same amount of time, but this ideal case rarely occurs, and the assumption hides a serious weakness of pipelining. Amdahl’s law says that speedup is limited by the portion of the work that remains sequential; in a pipeline, that limit is the slowest stage, since it can’t be subdivided into smaller tasks. In our example, suppose plugin B takes twice as long as A or C. Now plugins A and C remain idle half the time, and instead of a 3X speedup, we only get a 2X speedup.
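The arithmetic is easy to check. Without pipelining, each frame costs the sum of the stage times; with pipelining, frames emerge at the pace of the slowest stage, so the speedup is the ratio of the two. A minimal sketch (the function name is ours):

    // Pipeline speedup: unpipelined cost is the sum of stage times;
    // pipelined cadence is the slowest stage, so speedup is their ratio.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    double pipelineSpeedup(const std::vector<double>& stageTimes) {
        double total = std::accumulate(stageTimes.begin(), stageTimes.end(), 0.0);
        double slowest = *std::max_element(stageTimes.begin(), stageTimes.end());
        return total / slowest;
    }

    int main() {
        std::printf("%.1fX\n", pipelineSpeedup({1, 1, 1}));   // ideal case: 3.0X
        std::printf("%.1fX\n", pipelineSpeedup({1, 2, 1}));   // B twice as slow: 2.0X
        std::printf("%.1fX\n", pipelineSpeedup({1, 10, 1}));  // B ten times as slow: 1.2X
        return 0;
    }

The last line anticipates the ten-to-one case discussed further below.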
Pipelining demo: Asymmetric loading
There is a way around Amdahl’s law in certain cases, which we’ll call ganging. It can be illustrated using a car assembly line. Suppose installing the engine takes twice as long as any of the other stages. Why not have two engine installers? Now when a car gets to the engine installation stage, if one installer is busy, the car goes to the other one, and effective throughput is the same as if engine installation took half as long.
We’ve ignored a crucial detail, however: if we allow a car to go to whichever engine installer happens to be free, the cars are no longer guaranteed to stay in order. This might work for cars, but it wouldn’t work for video frames, or any other case where the output sequence is critical. The solution is to enforce alternation: for example, odd cars always go to engine installer #1, while even cars always go to engine installer #2. There’s no price for this in terms of throughput, but it does add complexity to the system.
Let’s return to our previous example, in which we have three plugins, but B takes twice as long as A or C. Assuming sufficient cores, we can achieve optimal throughput by having two instances of plugin B, each running on its own core. Plugin B now appears to take half as long, and overall throughput is the same as it was when all three plugins took an equal amount of time.
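Here’s a sketch of what ganging might look like in code, with the dispatcher and the merger both following the same round-robin order so frames can’t get out of sequence. The queue class is the same single-slot queue from the first sketch, and all names are again illustrative.

    // Ganging sketch: two instances of slow stage B run in parallel,
    // fed round-robin so output order is preserved. Illustrative only.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <thread>
    #include <vector>

    template <typename T>
    class SingleSlot {  // same single-slot blocking queue as before
    public:
        void put(T item) {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notFull.wait(lock, [this] { return !m_item.has_value(); });
            m_item = std::move(item);
            m_notEmpty.notify_one();
        }
        T take() {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_notEmpty.wait(lock, [this] { return m_item.has_value(); });
            T item = std::move(*m_item);
            m_item.reset();
            m_notFull.notify_one();
            return item;
        }
    private:
        std::mutex m_mutex;
        std::condition_variable m_notFull, m_notEmpty;
        std::optional<T> m_item;
    };

    int processB(int frame) { return frame + 10; }  // stand-in for plugin B

    int main() {
        const int kFrames = 8, kGang = 2;
        std::vector<SingleSlot<int>> in(kGang), out(kGang);
        // One thread per instance of B; each sees only its own subset.
        std::vector<std::thread> gang;
        for (int g = 0; g < kGang; g++)
            gang.emplace_back([&, g] {
                for (int i = g; i < kFrames; i += kGang)
                    out[g].put(processB(in[g].take()));
            });
        // Dispatch: frame i always goes to instance i % kGang...
        std::thread feeder([&] {
            for (int i = 0; i < kFrames; i++)
                in[i % kGang].put(i + 1);
        });
        // ...and the merge reads the instances in the same round-robin
        // order, which guarantees the frames come out in sequence.
        for (int i = 0; i < kFrames; i++)
            std::printf("frame %d\n", out[i % kGang].take());
        feeder.join();
        for (auto& t : gang) t.join();
        return 0;
    }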
Ganging can be scaled up arbitrarily, but it can become impractical if the performance asymmetries between the plugins are too extreme. For example, suppose instead of taking twice as long, plugin B took ten times as long. Plugins A and C would now remain idle 90% of the time, and instead of a 3X speedup, pipelining would only give us a 1.2X speedup, hardly worth the effort. To equalize this asymmetry, we would need twelve cores: ten for B, plus one each for A and C.
Another catch is that ganging only works with plugins that don’t have any history, or more technically, with stateless filters. Suppose a plugin stores the previous frame, and mixes it with the current frame (AKA time blur). Ganging the plugin causes it to have multiple instances, each of which only receives a subset of the input stream. Each instance now has its own frame history, but none of them are correct from the viewer’s perspective, and the merged output alternates rapidly between the different frame histories, spoiling the effect. This same limitation applies to any other type of stored data that the plugin modifies between input frames, such as oscillators.
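A toy example makes the failure visible. The TimeBlur class below is hypothetical: it mixes each frame with the previous frame it saw. Run whole, its history is correct; split round-robin across two ganged instances, each instance’s “previous frame” is really two frames old, so the merged output no longer matches the intended effect.

    // Why ganging breaks a stateful filter: a toy "time blur" that
    // mixes each frame with the previous frame it saw. Illustrative.
    #include <cstdio>

    class TimeBlur {
    public:
        double process(double frame) {
            double result = 0.5 * frame + 0.5 * m_prev;
            m_prev = frame;  // state carried across calls
            return result;
        }
    private:
        double m_prev = 0;  // the previous frame this instance saw
    };

    int main() {
        // One instance sees every frame: the history is correct.
        TimeBlur whole;
        for (int i = 1; i <= 6; i++)
            std::printf("%.1f ", whole.process(i));
        std::printf("\n");  // 0.5 1.5 2.5 3.5 4.5 5.5
        // Ganged: two instances each see every other frame, so each
        // one's "previous frame" is two frames old, and the merged
        // output blurs with the wrong frame.
        TimeBlur odd, even;
        for (int i = 1; i <= 6; i++)
            std::printf("%.1f ", (i % 2 ? odd : even).process(i));
        std::printf("\n");  // 0.5 1.0 2.0 3.0 4.0 5.0
        return 0;
    }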
copyleft 2010 Chris Korda