在Extended Hypercube網路架構上可併行返回毀壞復原之有效演算法
Concurrent Rollback for Crash Recovery in Extended Hypercube Networks

莊東穎            

 

摘要/Abstract

毀壞復原是設計及發展穩定可靠系統中一個重要的問題。在運算過程中,如果有處理機(Processors)損壞(Fail),將造成系統的不一致(Inconsistency)而使得系統無法正常的運作。因此如何使系統迅速而有效的恢復到一致的狀態是毀壞復原所面臨的最重要課題。本論文利用檢查點(Checkpoints)/返回(Rollback)的導觀訊息儲存法(Optimistic Message Logging)針對Extended Hypercube (EH) Network的架構找出Overhead低而使系統快速復原的復原演算法。我們將利用Dependency Chain 長度的技巧來分別各個事件發生的前後關係,以達到系統快速復原的目的。同時我們將利用Extended Hypercube Network 架構的特性,也就是有層級、直徑短及重覆結構的特性,提出一可併行快速復原演算法(Concurrent Recovery Algorithm)。所謂併行就是在復原的過程中,可同時執行Cluster-up Recovery,以節省復原的時間。我們預計復原所需的Message Complexity 為 O( )而Time Complexity 為O( )其中N為系統中處理機的數目。

關鍵詞:毀壞復原、分散式演算法、容錯、樂觀訊息儲存法、檢查點、返回。



Recovering from processor failures is an important problem in the design and development of reliable systems. In this paper, we present a concurrent rollback algorithm in Extended Hypercube networks to recover from crash failures which involves little message and time complexities. The network of an Extended Hypercube is a hierarchical, low diameter, recursive structure with a constant predefined building hypercubes. By appending only O(1) additional information to each message, we use less than O(N * logN) message exchanges and O(log2N) time elapsed for recovery work where N is the number of processors of the extended hypercube network.

Index -Terms : Crash recovery, distributed algorithms, fail-stop failures, optimistic message logging, checkpoint, rollback

關閉視窗