Alex C. Snoeren, David G. Andersen, and
Hari Balakrishnan.
Proc. of the Third Annual USENIX Symposium on Internet Technologies and Systems (USITS), March 2001
An earlier version appeared as MIT Laboratory for Computer Science Technical Report, MIT-LCS-TR-812.
This paper presents a set of techniques for providing fine-grained failover of long-running connections across a distributed collection of replica servers, and is especially useful for fault-tolerant and load-balanced delivery of streaming media and telephony sessions. Our system achieves connection-level failover across both local- and wide-area server replication, without requiring a front-end transport- or application-layer switch. Our approach is enabled by the recently-developed end-to-end ``connection migration'' mechanism for transport protocols such as TCP, combined with a soft-state session synchronization protocol between replica servers.
The end result is a robust, fast, and fine-grained server failover mechanism that is transparent to both the client and server applications. We describe the details of our design and Linux implementation, as well as experiments with our implementation that show that this approach to failover is an attractive way to engineer robust systems for distributing long-running streams; connections suffer relatively low performance degradation even when server redirection occurs every few seconds, and overhead is negligible when compared to standard techniques. In particular, we observe the performance impact of migrating TCP connections depends on the length of time between migration and the most recent loss-recovery event.
[PostScript (200KB)] [PDF (101KB)] [HTML (68KB)]