Idalmis Milián Sardiña, Cristina Bôeres, Lúcia Drummond.
A crucial problem in distributed systems is the likelihood of resource failures. Recent studies have sought ways to improve application execution time while also incorporating fault-tolerance mechanisms, but in many cases their policies are tested only in simulated environments. This work presents an MPI tool for executing parallel applications on a real architecture, recovering application execution through fault-tolerant scheduling techniques. To this end, it uses the information generated by a static scheduling heuristic and offers mechanisms for automatic failure detection and efficient application recovery.
http://www.lbd.dcc.ufmg.br:8080/colecoes/wcga/2006/st1_3.pdf