An SNMP-Based Service for Failure Detection of Distributed Processes Across Multiple Autonomous Systems

Dionei M. Moraes, Elias P. Duarte Jr.

This work presents a failure detector service for Internet-based distributed systems that span multiple autonomous systems. An SNMP agent, called the monitor, implements a MIB that serves as the interface for obtaining global process state information. Monitors on different LANs communicate across the Internet using Web Services. Processes are monitored with heartbeats: if a working process remains silent for an adaptively computed timeout interval, its state is toggled to suspect. The state is toggled to crashed only after the failure is confirmed by the local operating system. The system was implemented and evaluated with monitored processes running both on a LAN and distributed throughout the world on PlanetLab. Experimental results are presented, showing CPU usage, failure detection latency, and mistake rate.
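The state transitions described above (working, suspect, crashed) can be illustrated with a minimal Python sketch. The abstract does not say how the timeout is adapted, so this sketch assumes a smoothed mean-plus-deviation estimator in the style of TCP retransmission timers. The class `HeartbeatMonitor`, the callback `is_alive_locally`, and the constants `alpha`, `beta`, and `initial_timeout` are hypothetical names and values chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

WORKING, SUSPECT, CRASHED = "working", "suspect", "crashed"


@dataclass
class HeartbeatMonitor:
    # Hypothetical hook: asks the local operating system whether the
    # monitored process still exists (for example, a PID lookup).
    is_alive_locally: Callable[[], bool]
    initial_timeout: float = 1.0  # assumed default, in seconds
    alpha: float = 0.125          # assumed smoothing gain for the mean gap
    beta: float = 0.25            # assumed smoothing gain for the deviation
    state: str = WORKING
    last_arrival: Optional[float] = None
    mean_gap: Optional[float] = None
    deviation: float = 0.0

    def timeout(self) -> float:
        """Current adaptive timeout: smoothed gap plus four deviations."""
        if self.mean_gap is None:
            return self.initial_timeout
        return self.mean_gap + 4.0 * self.deviation

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat and update the inter-arrival estimates."""
        if self.last_arrival is not None:
            gap = now - self.last_arrival
            if self.mean_gap is None:
                self.mean_gap, self.deviation = gap, gap / 2.0
            else:
                # Update the deviation first, using the previous mean.
                self.deviation += self.beta * (abs(gap - self.mean_gap) - self.deviation)
                self.mean_gap += self.alpha * (gap - self.mean_gap)
        self.last_arrival = now
        self.state = WORKING

    def check(self, now: float) -> str:
        """Suspect a silent process; mark it crashed only if the OS confirms."""
        if self.state == CRASHED or self.last_arrival is None:
            return self.state
        if now - self.last_arrival > self.timeout():
            self.state = SUSPECT if self.is_alive_locally() else CRASHED
        return self.state
```

In this sketch a late heartbeat only raises suspicion, and the crashed state is reached only when the operating-system check also reports the process as gone, which mirrors the two-step confirmation described in the abstract.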

