

IBM Research GmbH, Zurich, Switzerland

## Trace-driven co-simulation of highperformance computing systems using OMNeT++

Cyriel Minkenberg, Germán Rodríguez Herrera IBM Research GmbH, Zurich, Switzerland

© 2009 IBM Corporation

## Overview

### Context

- ➢ MareNostrum → MareIncognito
- Design of interconnection networks for massively parallel computers

### Models & tools

- Computation Dimemas
- Communication Venus
- Visualization
  Paraver
- Integrated tool chain
  - Venus architecture
  - Co-simulation with Dimemas
  - Paraver tracing
  - Accuracy: topologies, routing, mapping, Myrinet/IBA/Ethernet models
- Sample results
- Conclusions & future work



## Background: MareNostrum

- Blade-based parallel computer at Barcelona Supercomputing Center (BSC)
- 2,560 nodes
- 10,240 IBM PowerPC 970MP processors at 2.3 GHz (2,560 JS21 blades)
- Peak performance of 94.21 Teraflops
- 20 TB of main memory
- 280 + 90 TB of disk storage
- Interconnection networks

Myrinet and Gigabit Ethernet

- Linux: SuSe Distribution
- 44 racks, 120 m<sup>2</sup> floor space





2<sup>nd</sup> International Workshop on OMNeT++, March 6, 2009, Rome, Italy

## MareIncognito

- Joint project between BSC and IBM to developed a follow-on for MareNostrum, codenamed *MareIncognito*
- 10+ PFLOP/s machine for 2011 timeframe comprising in the order of 10
   20K 1+ TFLOP blades
- Our focus: Design a performance- and cost-optimized interconnection network for such a system
  - Gain deeper understanding of HPC traffic patterns and the impact of the interconnect on overall system performance
  - Use this understanding to optimize the design of the MareIncognito interconnect
    - Reduce cost & power without sacrificing much performance
    - Topology, bisectional bandwidth, routing, contention (internal & external), task placement, collectives, adapter & switch implementation



## Interconnect design

- Ever-increasing levels of parallelism and distribution are causing shift from processor-centric to interconnect-centric computer architecture
- Interconnect represents a significant fraction of overall system cost
  - Switches, cables
  - > Maintenance
- We need tools to predict system performance with reasonable accuracy
  - > Absolute performance
  - Parameter sensitivity; trend prediction ("what if?")
  - Accurate model of application behavior (compute nodes)
  - Accurate model of communication behavior (interconnect)
- Many tools do either of these things well; very few manage both simultaneously with sufficient accuracy

## IBW

## Compute node model

- From the network perspective, compute node acts as traffic source and sink
- In communication network design (telco, internet) typical approaches are
  - Synthetic" models based on some kind of stochastic processes
    - may (or may not) be reasonable if a sufficient level of statistical multiplexing is present
    - relation to reality unclear at best
  - Replaying traces recorded on, e.g., some provider's backbone
  - In either case, semantics of traffic content are rarely considered: no causal dependencies between communications
- Traffic in HPC systems has strong causal dependencies
  - > These stem from control and data dependencies inherent in a given parallel program
  - These dependencies can be captured by running a program on a given machine and recording them in a trace file
  - If the program has been written using the Message Passing Interface (MPI) library, this basically amounts to a per-process list of send/recv/wait calls
  - Such a trace can be replayed observing the MPI call semantics to correctly reproduce communication dependencies

## Compute node model: *Dimemas*

- Dimemas (BSC simulator)
  - Rapidly simulates MPI traces collected from real machines
  - Faithfully models MPI semantics & node architecture
  - Sensitivity: helps identify coarse-grain factors and relevant communication phases
  - Leverages CEPBA Tools trace generation, handling and visualization
  - Models interconnection network at a very high abstraction level
- Paraver (BSC visualization)
  - Visualizes MPI communication patterns at the node level
  - Allows "debugging" of inter-process communication, e.g., load imbalance, contention



## IBW

## Interconnect model: Venus

### Simulates interconnect at flit level

- Event-driven simulator using OMNeT++ (originally based on MARS simulator)
- Supports wormhole routing and segmentation for long messages
- Provides various detailed switch and adapter implementations
  - > Myrinet, 10G Ethernet, InfiniBand
  - Generic input-, output-, and combined inputoutput-queued switches
- Provides various topologies and routing methods
  - Extended Generalized Fat Tree (XGFT), Mesh, 2D/3D Torus, Hypercube, arbitrary Myrinet topologies
- Supports various routing methods
  - Source routing, table lookup
  - > Algorithmic (online), Myrinet routes files (offline)
  - Static, dynamic
- Highly configurable
  - Topology, switch/adapter models, buffer sizes, link speed, flit size, segmentation, latencies, etc.

### Supports MareNostrum/MareIncognito

- Server mode to co-simulate with Dimemas via socket interface
- Outputs paraver-compatible trace files enabling detailed observation of network behavior
- Detailed models of Myrinet switch and adapter hardware
- Translation tool to convert Myrinet map file to OMNeT++ ned topology description
- Import facility to load Myrinet route files at simulation runtime; supports multiple routes per flow (adaptivity)
- > Flexible task to node mapping mechanism
- Tool to generate Myrinet map and routes files for arbitrary XGFTs



## **Tool Enhancement & Integration**



1 2<sup>nd</sup> International Workshop on OMNeT++, March 6, 2009, Rome, Italy



## Venus model structure



## Hybrid Dimemas & Venus simulation



## **D&V Co-simulation: SEND example**



## Network topologies



## Insights gained at multiple levels

### **Application & MPI library level** Blocking vs. nonblocking calls

- **MPI** protocol level
  - Eager vs. rendez-vous
- **Topology & routing** 
  - Static source based routing  $\geq$
  - Contention and pattern aware routings  $\geq$
  - Slimmed networks
  - External vs. internal contention  $\geq$

### Hardware

- Head-of-line blocking  $\succ$
- Switch arbitration policy  $\geq$
- Segmentation  $\geq$
- Deadlock  $\triangleright$
- Automatic communication-computation overlap  $\geq$

### Putting it all together: Validation with real traces in a real machine

- Identifying the sources of network performance losses  $\geq$
- Collectives variability  $\geq$

Interconnect layers

Protocol/Middleware

**Software: Application** 

& MPI library

Network topology & routing

Switch & adapter technology

## IBW

# At the protocol level

- Eager/Rendez-vous
  - Rendez-vous protocol needs control messages
    - Impact is relatively small if the control messages never wait after a data segment to get sent
  - If control messages are sent after a long segment, the sender will be delayed
    - Smaller segments
    - > Out of Band protocol messages
  - The delay propagates to the other threads
- Segment size trade-off?
  - 16KB as in Myrinet for long messages is too long to interleave urgent control messages





## Performance of WRF on two-level slimmed tree



WRF, 256 nodes

output-queued switchinput-queued switch (Myrinet)



Number of top-level switches



- Dimemas
  - number of input links = 256
  - number of output links = 256
  - latency = 8 us
  - cpu\_ratio = 1.0
  - eager threshold = 32768 bytes
- Venus
  - link speed = 2 Gb/s, flit size = 8 B (duration = 32 ns)
  - Myrinet switch and adapter models
  - Myrinet segmentation
  - adapter buffer size = 1024 KB
  - switch buffer size per port = 8 KB

# Conclusions & future work

- Conclusions
  - Created OMNeT-based interconnection network simulator that faithfully models real networks in terms of topology, routing, flow control, switch- and adapter architecture
  - Integrated network simulator with trace-based MPI simulator to capture reactive traffic behavior, thus obtaining high accuracy in simulating the interactions between computation and communication
  - Enabled entirely new insights into the effect of the interconnection network on overall system performance under realistic traffic patterns

### Future work

- Allow multiple simulators to connect to Venus simultaneously
- Modify MPI library to directly redirect communications to Venus
- Make Venus itself run in parallel
- Extended Venus to produce power estimates



## D&V co-simulation: The client



# D&V co-simulation: The server





## **Extended Generalized Fat Trees**

- XGFT ( h; m<sub>1</sub>, ..., m<sub>h</sub>; w<sub>1</sub>, ..., w<sub>h</sub> )
- h = height
  - ➢number of levels-1
  - > levels are numbered 0 through h
  - >level 0 : compute nodes
  - > levels 1 ... h : switch nodes
- *m*<sub>i</sub> = number of children per node at level i, 0 < i ≤ h</li>
- w<sub>i</sub> = number of parents per node at level i-1, 0 < i ≤ h</li>
- number of level 0 nodes =  $\prod_i m_i$
- number of level h nodes =  $\prod_i w_i$

### XGFT (3;3,2,2;2,2,3)



# Performance of 256-node WRF on various topologies – cpu ratio 1.0



### Network topology & routing

Switch & adapter technology

## Impact of scheduling policy

- The scheduling policy is the determining factor (infinite CPU, IQ switch)
  - Round-robin pointers are initialized randomly
  - When contention occurs (on output 15), the "wrong" selection (which depends on the position of the pointer) results in HOL blocking; the scheduler should first serve the input on which another message will arrive
  - Unfortunately, it has no crystal ball...





## Backup



## IBM

## Multi-rail system





## MultiHost module



## **D&V Co-simulation: Protocol (hints)**

- Human-readable ASCII text
- Sent through standard Unix Sockets
- Few Commands and Responses
- Extensible

![](_page_26_Figure_7.jpeg)

- Commands
  - SEND timestamp source destination size "extra strings"
    - The "extra strings" enconde the events that Dimemas will update upon reception of the message.
  - STOP timestamp
  - > END
  - > FINISH
  - PROTO (OK\_TO\_SEND, READY\_TO\_RECV)
    - (parameters as for SEND)
- Responses:
  - STOP REACHED timestamp
  - COMPLETED SEND [echo of the original data + extra data]

## Impact of switch architecture

### Infinite CPU speed, output-queued switch

![](_page_27_Picture_4.jpeg)

### Normal CPU speed, output-queued switch

![](_page_27_Figure_6.jpeg)

![](_page_27_Picture_7.jpeg)

![](_page_27_Figure_8.jpeg)

### Normal CPU speed, input-queued switch

2<sup>nd</sup> International Workshop on OMNeT++, March 6, 2009, Rome, Italy