Cloud computing has grown rapidly in recent years, and vendors such as IBM, Microsoft, and Amazon all offer cloud services. This rapid rise also raises problems: when data is stored in the cloud and large-scale analysis and processing run there, how should errors or network disconnections be handled? Because failures in the cloud are inevitable, making a cloud computing system fault-tolerant is very important. This work studies fault tolerance in cloud computing. Specifically, we focus on how to identify truly slow tasks on the nodes of a Hadoop MapReduce job effectively and correctly, and how to reschedule them efficiently to other compute nodes before failures occur, so that slow tasks do not prolong the overall job completion time. We modify the LATE algorithm so that the MapReduce scheduler adapts to tasks with variable progress rates, and we study three rescheduling methods and compare their performance. Using Hadoop as the development and experimental environment, we compare Hadoop, LATE, and the proposed method through simulation and analyze their strengths and weaknesses.
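For orientation, the slow-task identification that LATE performs can be sketched roughly as follows. This is a minimal illustration of the published LATE heuristic (progress rate and estimated time left), not the modified method proposed in this work. The `Task` class, the function names, and the `slow_fraction` parameter are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task record; field names are assumptions for this sketch.
    task_id: str
    progress: float  # progress score in [0, 1]
    elapsed: float   # seconds since the task started


def progress_rate(task: Task) -> float:
    """Average progress per second since the task started."""
    return task.progress / task.elapsed if task.elapsed > 0 else 0.0


def estimated_time_left(task: Task) -> float:
    """LATE-style estimate: remaining progress divided by progress rate."""
    rate = progress_rate(task)
    return float("inf") if rate == 0 else (1.0 - task.progress) / rate


def pick_slow_tasks(tasks, slow_fraction=0.25):
    """Return the slowest `slow_fraction` of tasks by progress rate,
    ordered so the longest estimated time left comes first."""
    if not tasks:
        return []
    by_rate = sorted(tasks, key=progress_rate)
    cutoff = max(1, int(len(by_rate) * slow_fraction))
    candidates = by_rate[:cutoff]
    return sorted(candidates, key=estimated_time_left, reverse=True)
```

The estimate assumes each task advances at a constant rate. A task whose rate changes over time can therefore be misjudged, which is the case the variable-progress-rate modification described above is meant to address.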