Systems Research

Analysis of message passing environments on large cluster performance

Principal Investigators: D.K. Panda, P. Sadayappan and P. Wyckoff
Funding Source: Sandia National Labs
Duration: 1/8/2001 - 2/28/2002

Description: Analysis of reliability/scalability/performance tradeoffs in Myrinet. Understand and analyze the reliability mechanism in GM by listing the fault scenarios that GM handles, and use that list to build state-transition diagrams showing how GM handles each fault. Derive a cost model from the state-transition diagrams in terms of how much overhead (instruction count, interactions with the host) GM requires to recover from each fault. Analyze the impact of different types of faults and their frequencies on overall performance. A small simulation framework can be developed to evaluate this performance. The framework needs to be parameterized so that the impact of host speed, NIC speed, and network link speed can be evaluated in addition to the failure characteristics.
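
To illustrate what such a parameterized simulation framework might look like, here is a minimal Python sketch. It is not GM's actual reliability protocol: it assumes a two-state model (SEND, RECOVER) in which each send attempt is lost independently with a fixed probability and each loss costs a fixed recovery overhead before the message is resent. All names (SimParams, send_time_us, simulate) and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical knobs; none of these names come from GM itself."""
    host_us_per_msg: float   # host-side overhead per send attempt (microseconds)
    nic_us_per_msg: float    # NIC processing overhead per send attempt (microseconds)
    link_mb_per_s: float     # link bandwidth in MB/s (1 MB = 1e6 bytes)
    msg_bytes: int           # message size in bytes
    fault_prob: float        # assumed probability that one send attempt is lost
    recovery_us: float       # assumed cost of detecting and recovering from one fault


def send_time_us(p: SimParams) -> float:
    """Cost of one fault-free send attempt, in microseconds."""
    wire_us = p.msg_bytes / p.link_mb_per_s  # bytes / (MB/s) gives microseconds
    return p.host_us_per_msg + p.nic_us_per_msg + wire_us


def simulate(p: SimParams, n_messages: int, seed: int = 0):
    """Walk each message through SEND -> (fault) RECOVER -> SEND ... -> delivered.

    Returns (total_time_us, fault_count).
    """
    rng = random.Random(seed)
    attempt_us = send_time_us(p)
    total_us = 0.0
    faults = 0
    for _ in range(n_messages):
        # Every failed attempt pays for the wasted send plus the recovery overhead.
        while rng.random() < p.fault_prob:
            total_us += attempt_us + p.recovery_us
            faults += 1
        total_us += attempt_us  # the attempt that finally succeeds
    return total_us, faults


if __name__ == "__main__":
    base = SimParams(host_us_per_msg=1.0, nic_us_per_msg=2.0,
                     link_mb_per_s=250.0, msg_bytes=1000,
                     fault_prob=0.0, recovery_us=50.0)
    faulty = SimParams(host_us_per_msg=1.0, nic_us_per_msg=2.0,
                       link_mb_per_s=250.0, msg_bytes=1000,
                       fault_prob=0.05, recovery_us=50.0)
    for label, params in (("fault-free", base), ("5% loss", faulty)):
        total, faults = simulate(params, n_messages=10_000)
        print(f"{label}: total={total:.1f} us, faults={faults}")
```

Varying host_us_per_msg, nic_us_per_msg, link_mb_per_s, and fault_prob independently shows how each component contributes to total time. A fuller model would replace the single RECOVER state with one state and one cost per fault scenario taken from the state-transition diagrams.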