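|Abstract: ||The main purpose of this research project is to study fault-tolerance mechanisms for Massively Multiplayer Online Games (MMOGs) built on a cloud-computing environment. Under the cloud-computing model, an MMOG service provider rents many virtual machines on a cloud-computing platform; each virtual machine runs a number of processes, and these processes execute the players' game scenes. Interactions between players and the game, and among players, are carried out by passing and exchanging messages between processes. This model forms what the field of distributed computing calls a message-passing system. Because hardware and software failures are unavoidable in a cloud environment, MMOG players may be affected by them and see their game service interrupted. To ensure that the overall game state is not affected by failures, adding a fault-tolerance mechanism to the system is necessary. Then, if a virtual machine fails and restarts, we can restore its system state to the state before the failure, so that the game state of players on that virtual machine does not become inconsistent with the game state of players unaffected by the failure. To achieve this goal, we will adopt checkpointing and rollback recovery: first, we will study the multi-process workflow of cloud-based MMOGs; then, in the spirit of the GFD.92 and GFD.93 standards defined by the Open Grid Forum, we will extend them to build a checkpointing and rollback-recovery mechanism suited to MMOG applications. We expect to produce a fault-tolerance library for MMOG service developers that lets an MMOG create checkpoints as needed while running and provides recovery when errors occur, so that the fault tolerance of an MMOG is not limited by the capabilities of the cloud-computing service platform.|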
The main objective of this proposal is to research fault-tolerance technologies for Massively Multiplayer Online Games (MMOGs) deployed in a cloud-computing environment. In this application scenario, an MMOG service provider rents a number of virtual machines from a cloud-computing platform provider, and these virtual machines host user processes that execute game scenes for the players. Interactions between an individual player and the game, and among multiple players, are achieved by passing messages between the processes. This computing and communication paradigm forms what the field of distributed computing calls a message-passing system. Since hardware and software failures are inevitable during game execution, MMOG players may suffer service discontinuity if failures are not handled carefully. For instance, simply restarting a player's process from scratch after a failure would not be acceptable, since the player's progress before the failure would be lost. It is therefore a high priority to augment cloud-based MMOG services with fault-tolerance capabilities in order to provide guaranteed quality of service (QoS) to the players. Although fault tolerance has long been discussed in the field of distributed computing, it is more important today than ever. Checkpointing and rollback recovery is widely used in distributed computing systems as a means of providing fault tolerance, and it is also applicable to the cloud-computing environment. By saving player states as checkpoints, we can roll back the game states of players who suffer from failures, and in this way minimize the loss of a player's progress. However, state dependencies exist among players; rolling back one player's state may lead to an inconsistent global state.
Therefore, in designing checkpointing and rollback-recovery protocols, we must consider the dependency issue as well as minimize the overhead imposed on the user processes at run time. To achieve this goal, we first need to investigate the interaction behavior of multiple processes in a cloud-based MMOG, specifically taking the virtualization environment into account. We will then develop a checkpointing and rollback-recovery scheme suitable for cloud-based MMOGs. It should be noted that instead of developing a system-level scheme, we choose to develop an application-level scheme, so that MMOG service providers will not have to rely on specific cloud-computing platforms. Based on the GridCPR architecture defined in GFD.92 and GFD.93 of the Open Grid Forum, we will extend its scope to deal with multiple concurrent processes, as found in MMOGs. Finally, we will propose a checkpointing and rollback-recovery library, so that MMOG applications can perform checkpointing and rollback recovery on demand.
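
The dependency issue mentioned above can be illustrated with a minimal Python sketch. It is not the proposed library or the GridCPR interface; the names `PlayerProcess` and `recover` are hypothetical, message delivery is assumed to be synchronous and in-memory, and messages lost by a rollback are simply dropped rather than logged and replayed. Each process records how many messages it has sent to and received from each peer, and recovery rolls back any process that still remembers a message its sender no longer remembers sending (an orphan message).

```python
import copy
from collections import defaultdict


class PlayerProcess:
    """Hypothetical game process holding one player's state."""

    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)
        self.sent = defaultdict(int)      # peer name -> messages sent
        self.received = defaultdict(int)  # peer name -> messages received
        self.checkpoints = []
        self.checkpoint()                 # initial checkpoint, always consistent

    def checkpoint(self):
        """Application-level checkpoint: state plus message counters."""
        self.checkpoints.append(
            (copy.deepcopy(self.state), dict(self.sent), dict(self.received))
        )

    def send(self, peer, key, value):
        """Deliver a state update to a peer (assumed synchronous)."""
        self.sent[peer.name] += 1
        peer.received[self.name] += 1
        peer.state[key] = value

    def restore(self, index):
        """Restore checkpoint `index` and discard all later checkpoints."""
        state, sent, received = self.checkpoints[index]
        self.state = copy.deepcopy(state)
        self.sent = defaultdict(int, sent)
        self.received = defaultdict(int, received)
        del self.checkpoints[index + 1:]


def recover(processes, failed):
    """Roll back `failed`, then propagate rollbacks until no orphan remains."""
    by_name = {p.name: p for p in processes}
    failed.restore(len(failed.checkpoints) - 1)
    rolled_back = {failed.name}
    changed = True
    while changed:
        changed = False
        for receiver in processes:
            for sender_name, count in list(receiver.received.items()):
                sent = by_name[sender_name].sent[receiver.name]
                if count > sent:
                    # Orphan message: pick the latest checkpoint taken
                    # before the receiver saw the undone message.
                    index = max(
                        i for i, (_, _, rcv) in enumerate(receiver.checkpoints)
                        if rcv.get(sender_name, 0) <= sent
                    )
                    receiver.restore(index)
                    rolled_back.add(receiver.name)
                    changed = True
                    break
    return sorted(rolled_back)


if __name__ == "__main__":
    alice = PlayerProcess("alice", {"gold": 10})
    bob = PlayerProcess("bob", {"gold": 0})
    alice.state["gold"] = 5
    alice.send(bob, "gold", 5)   # alice gives bob 5 gold
    bob.checkpoint()
    print(recover([alice, bob], alice), alice.state, bob.state)
```

In this example, restoring only `alice` would leave her with 10 gold while `bob` keeps the 5 he received, which is the kind of inconsistent global state the abstract describes; the sketch therefore rolls `bob` back as well. A practical scheme would also need to bound how far such rollbacks can propagate and limit the run-time cost of each checkpoint.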