Minimize directly user-visible freeze time upon chroniton capture
Chroniton capture uses CRIU and runs in two main phases: pre-dump and dump. Pre-dumps do not freeze the container; they run in a back-to-back sequence, each capturing a progressively smaller delta of memory changed since the previous pass. The final dump freezes the container, captures a final "still" memory image, serializes a variety of system-level state, and (just prior to unfreezing) invokes the filesystem state capture.
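For concreteness, here is a minimal sketch of that sequence in Go, assuming CRIU is driven directly via its CLI (Chronostruct actually invokes it through LXC); the PID, directory layout, and iteration count are illustrative:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"strconv"
)

func criu(args ...string) error {
	out, err := exec.Command("criu", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("criu %v: %v: %s", args, err, out)
	}
	return nil
}

// capture runs a series of unfrozen pre-dumps, each recording only the pages
// dirtied since the previous pass, followed by the final freezing dump.
// Assumes iterations >= 1 and a scratch working directory.
func capture(pid, iterations int) error {
	for i := 0; i < iterations; i++ {
		dir := fmt.Sprintf("pre-%d", i)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return err
		}
		args := []string{"pre-dump", "-t", strconv.Itoa(pid),
			"--images-dir", dir, "--track-mem"}
		if i > 0 {
			// CRIU resolves --prev-images-dir relative to --images-dir.
			args = append(args, "--prev-images-dir", fmt.Sprintf("../pre-%d", i-1))
		}
		if err := criu(args...); err != nil {
			return err
		}
	}
	// The final dump freezes the container, writes the remaining memory
	// delta, and serializes all system-level state.
	if err := os.MkdirAll("final", 0o755); err != nil {
		return err
	}
	return criu("dump", "-t", strconv.Itoa(pid),
		"--images-dir", "final", "--track-mem",
		"--prev-images-dir", fmt.Sprintf("../pre-%d", iterations-1))
}

func main() {
	if err := capture(12345, 3); err != nil {
		log.Fatal(err)
	}
}
```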
In early tests of the Chronostruct system, a typical pre-dump took 100-200 milliseconds, but the final dump took about 3500 milliseconds. Once TCP retransmission effects were included, next-event responsiveness within a captured container was typically close to 7500 milliseconds.
After extensive experimentation, I discovered that for typical containers, the process memory capture step represents only 3-6% of the time consumed by the final dump. Containers with large memory changes (into the gigabytes) were the only exception, and with functioning pre-dump deltas, most of those changes are captured in the pre-freeze phase anyway. Much of the remaining time was consumed by serializing system-level data of various kinds.
As of late 2018-06, the following changes have brought the baseline freeze time down to approximately 1300 milliseconds:
- Disable systemd's journaling, as the journal file was being stored on a tmpfs, which requires a full copy upon capture
- Disable as many system background processes as possible, leaving essentially just systemd, chronitond, and the user login/bash processes
- Reduce the chronode-side TCP retransmission timeout toward chroniton containers to 5 ms (see the sketch after this list)
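A minimal sketch of that last change, assuming the timeout is applied via the per-route rto_min attribute on the chronode; the subnet and bridge device names are illustrative:

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Lower the minimum TCP retransmission timeout on the route toward the
	// chroniton containers, so that packets dropped while a container is
	// frozen are retried almost immediately after it thaws.
	out, err := exec.Command("ip", "route", "replace",
		"10.0.3.0/24", "dev", "lxcbr0", "rto_min", "5ms").CombinedOutput()
	if err != nil {
		log.Fatalf("ip route replace: %v: %s", err, out)
	}
}
```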
Three "reasonable" scope modifications to further improve this time remain:
- Recompile systemd for the chroniton so that it uses the underlying rootfs instead of mounting tmpfs for its runtime directory and other locations; this behavior is not configurable at systemd runtime. tmpfs is less efficient for this use case than relying on the underlying rootfs snapshots, and the remaining tmpfs mounts (though cheaper to capture without the journal) still consume a non-trivial amount of capture time. This might save 200 ms.
- Switch from fork-exec'ing LXC/CRIU/action-script commands to liblxc calls and the CRIU RPC interface. This might save 100 ms.
- Add a new (advisory) CRIU notification at start-of-freeze rather than relying solely on the (blocking) end-of-freeze notification. Use the advisory notification to begin the filesystem snapshot in parallel with the CRIU dump, while still holding the end-of-freeze if the snapshot has not completed by the time the dump finishes (see the sketch after this list). This might save 400 ms.
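A minimal sketch of the proposed overlap, assuming a hypothetical start-of-freeze hook; snapshotFilesystem and criuDump are stand-ins for the real Chronostruct operations, not existing CRIU APIs:

```go
package main

import "log"

// snapshotFilesystem stands in for the rootfs snapshot step.
func snapshotFilesystem() error { return nil }

// criuDump stands in for the dump, invoking the two notifications at the
// appropriate points in its lifecycle.
func criuDump(onStartOfFreeze func(), onEndOfFreeze func() error) error {
	onStartOfFreeze() // container is now frozen
	// ... memory and system-level state serialization happens here ...
	return onEndOfFreeze() // must return before CRIU unfreezes
}

func capture() error {
	snapDone := make(chan error, 1)

	// Hypothetical advisory hook: fires as soon as the container freezes,
	// letting the filesystem snapshot run concurrently with the dump.
	startOfFreeze := func() {
		go func() { snapDone <- snapshotFilesystem() }()
	}

	// Existing blocking hook: holds the unfreeze until the snapshot (which
	// should usually finish first) has also completed.
	endOfFreeze := func() error { return <-snapDone }

	return criuDump(startOfFreeze, endOfFreeze)
}

func main() {
	if err := capture(); err != nil {
		log.Fatal(err)
	}
}
```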
All added up, however, the "optimistic" baseline capture time will likely still be in excess of 600 milliseconds, or about 6x the human-perceptible threshold (roughly 100 ms) for an "immediate" response. Every additional process running within the container also adds to the capture time. Further, more invasive system modifications to improve these capture times are likely possible (e.g. deeper kernel-level involvement), but impractical to pursue.
Instead of trying to squeeze every last decisecond out of the capture process, a complementary approach is to "fake" responsiveness of the system. There are two approaches worth considering:
- Network latency hiding approaches
- Predicting user quiescence and scheduling captures between user activity periods
In the first area, the Mosh project (https://mosh.org/) implemented "local" echo and line-editing approximation for latency hiding, as well as connectivity-robustness improvements; the former is most applicable here. The approximations rely on a "smart" terminal emulator on the client side and a state-reconciliation protocol, rather than raw transmission of terminal output over the network. The client heuristically predicts the output effect of a user keystroke and immediately manifests that prediction in the display (underlined, to signal that it is a prediction); the reconciliation process then confirms or corrects the prediction after communicating with the server.
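As a toy illustration of the prediction side (not Mosh's actual protocol or code), a client might echo a printable keystroke immediately, underlined, and then redraw once the server's authoritative state arrives:

```go
package main

import "fmt"

const (
	underline = "\x1b[4m"
	reset     = "\x1b[0m"
)

// predict displays a keystroke immediately as a tentative, underlined guess,
// before any round trip to the server completes.
func predict(ch rune) {
	fmt.Printf("%s%c%s", underline, ch, reset)
}

// reconcile redraws the line with the server's confirmed contents,
// overwriting any predictions that turned out to be wrong.
func reconcile(confirmed string) {
	fmt.Printf("\r\x1b[K%s", confirmed) // carriage return + clear line
}

func main() {
	for _, ch := range "ls" {
		predict(ch) // shown instantly, even if the server is frozen
	}
	// ...later, the server's state update arrives and is reconciled...
	reconcile("ls")
}
```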
There is a similarity between latency hiding over a lousy network connection (as Mosh was designed to do) and responsiveness hiding over a connection to a periodically non-responsive server (as Chronostruct does). It is worth experimenting with Mosh's approach, as an internal implementation detail, to determine whether it is sufficient to satisfy the "immediacy" needs of interactive sessions in Chronostruct.
There are, however, some potential problems:
- Mosh is written in C++, and interfacing to it from Go is a non-trivial amount of work
- Mosh is GPLv3-licensed, meaning any such experimental Chronostruct build could not be distributed without releasing it under the GPL
It is still worth experimenting with the idea, and then adapting some of its techniques (especially those drawn from the Mosh paper rather than the code) into a clean-room reimplementation.