A Data Communication Reliability and Trustability Study for Cluster Computing
Abstract
In High Performance Computing (HPC), most problems under study are either embarrassingly parallel or data dependent. Beyond the nature of the problem, scientists are interested in one or both of two additional characteristics. The first, performance, focuses on achieving an accurate solution in a fraction of the time a sequential approach requires. The second is obtaining consecutive, accurate, and steady time readings. In their quest for performance, some scientists forget not only that the chosen tool, in many cases a distributed-memory system, is a multi-user system, but also that its components are interconnected through a high-speed communications network that facilitates interaction among processors. In this talk, we show why characterizing a cluster is relevant, particularly for scientific kernels where multiple accurate and consecutive time readings are necessary to statistically validate a behavior. We present the characterization of two clusters using two variants of the ping-pong test. The first is a multi-user, research-oriented cluster; the second is a single-user cluster with older technology.