Adaptive Fault Management of Parallel Applications for High-Performance Computing

  Authors: Lan, ZL; Li, YW
  Source: Journal, IEEE Transactions on Computers; Volume/Issue: 2008, vol. 57, no. 12; Pages: 1647-1660
 

[Abstract] As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing by up to 43 percent in terms of reducing application completion times and improving resource utilization.
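
The abstract does not spell out the adaptation manager's decision rule, so the following is only a minimal sketch of how a runtime choice between preventive migration and selective checkpointing might look. Every name here (`Action`, `choose_action`, `spare_nodes`, `max_skipped_intervals`) and the threshold logic are illustrative assumptions, not the authors' algorithm.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical choices available at each decision point."""
    MIGRATE = "migrate"        # preventive migration away from a suspect node
    CHECKPOINT = "checkpoint"  # save application state
    SKIP = "skip"              # do nothing this interval


def choose_action(failure_predicted: bool,
                  spare_nodes: int,
                  intervals_since_checkpoint: int,
                  max_skipped_intervals: int = 3) -> Action:
    """Pick an action from a failure prediction (assumed rule, not FT-Pro's)."""
    # A failure is anticipated and a spare node exists: move away from it.
    if failure_predicted and spare_nodes > 0:
        return Action.MIGRATE
    # A failure is anticipated but there is nowhere to go: save state instead.
    if failure_predicted:
        return Action.CHECKPOINT
    # No failure is anticipated: checkpoint only occasionally, as a guard
    # against failures the predictor did not foresee.
    if intervals_since_checkpoint >= max_skipped_intervals:
        return Action.CHECKPOINT
    return Action.SKIP
```

Under these assumptions, a predicted failure with one spare node yields `Action.MIGRATE`, while a quiet interval shortly after a checkpoint yields `Action.SKIP`, which is where the saving over strictly periodic checkpointing would come from.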

 