## The OptoHPC simulator: Bringing OptoBoards to HPC-scale environments

<u>Pavlos Maniotis</u>, Nikos Terzenidis, Nikos Pleros

Aristotle University of Thessaloniki (AUTH), Greece

> OMNeT++ Community Summit 2016, 15 September 2016, Brno, Czech Republic



## Outline

- Introduction
- The OptoHPC simulator architecture
- An OptoHPC use case: comparative performance analysis using OptoHPC
- Conclusion





## Data Movement is the Bottleneck to Performance, Not Flops

Source: Al Geist, "Paving the Roadmap to Exascale", SciDAC Review, 2010



#### Ranked as the world's fastest supercomputer (Nov. 2015)

- 33.9 PFLOPS: reached 4% of the exascale target (set for ~2020-2025)
- 17.6 MW: reached 89% of the 20 MW power limit target \*

The *OptoHPC* simulator

\*P. Kogge. The tops in flops. IEEE Spectrum, 48(2):48–54, 2011.



## **Motivation**




## **Optical Interconnects Evolution & RoadMap**



Source: IBM, B. Jan Offrein, "Silicon Photonics Packaging Requirements", Munich 2011






## The PhoxTroT Research Project & its Vision



PhoxTroT deals with optical (1) on-board, (2) board-to-board and (3) rack-to-rack interconnects.





## How will all these technology improvements affect the system-scale performance of an HPC system?

OptoHPC is an OMNeT++-based simulator that aims to simulate complete HPC network systems that make use of PhoxTroT technologies (and optical technologies in general).


#### titanStyleNetwork network module:

- Defines the connections among the HPC racks and declares the use of the (a) statisticsManager, (b) networkAddressesManager and (c) trafficPatternsManager simple modules

- Can be configured as a 3D Torus or Mesh network of any desired size
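The difference between the two configurable topologies comes down to what happens at the edge of each dimension. The following is a minimal sketch of that rule, not code from OptoHPC: the `Coord` type and the `neighbour` function are hypothetical names introduced here for illustration only.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper type (not from the OptoHPC sources):
// the (x, y, z) position of a router in the 3D grid.
using Coord = std::array<int, 3>;

// Computes the neighbour of `c` one hop along dimension `dim` (0=X, 1=Y, 2=Z)
// in direction `step` (+1 or -1) for a network of the given `size`.
// Torus: the coordinate wraps around at the edges.
// Mesh:  stepping off the edge means there is no link; returns false
//        and leaves `out` untouched.
bool neighbour(const Coord& c, int dim, int step,
               const Coord& size, bool torus, Coord& out) {
    assert(dim >= 0 && dim < 3);
    assert(step == 1 || step == -1);
    int v = c[dim] + step;
    if (v < 0 || v >= size[dim]) {
        if (!torus) return false;          // mesh: edge router, no neighbour
        v = (v + size[dim]) % size[dim];   // torus: wrap-around link
    }
    out = c;
    out[dim] = v;
    return true;
}
```

In a 4x3x2 torus, for example, the +X neighbour of router (3, 0, 0) is (0, 0, 0), whereas in a mesh of the same size that router has no +X neighbour.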







[Screenshot: OMNeT++ graphical runtime inspecting titanStyleNetwork, showing cabinet[0] to cabinet[3] and the networkAddressesManager, trafficPatternsManager and statisticsManager modules]

#### networkAddressesManager simple module:

- Responsible for address allocation to the network's nodes and routers (for both decimal and XYZ addresses)
- Responsible for defining the dateline routers that are necessary for resolving deadlocks in Torus networks

#### Node compound module:

- Represents the CPU chips used in the HPC system
- Contains all the key simple modules needed for "CPU operation"

#### **Router compound module:**

- Represents the router chips used in the HPC system
- Contains all the key simple modules needed for "router operation"
- Supports the Dimension-Order Routing (DOR) and minimal Valiant routing algorithms
- Utilizes 3 auxiliary classes:
  - 1) shortestPathsManager
  - 2) routingTableManager
  - 3) routingManager
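As a rough illustration of dimension-order routing on a 3D Torus or Mesh, the sketch below resolves the X offset first, then Y, then Z. It is not the OptoHPC routingManager: `Coord`, `Hop` and `dorNextHop` are hypothetical names, and sending ties in a torus towards the +1 direction is an assumption made only for this example.

```cpp
#include <array>
#include <cassert>

// Hypothetical types for illustration (not from the OptoHPC sources).
using Coord = std::array<int, 3>;   // (x, y, z) router address
struct Hop { int dim; int step; };  // dim == -1 means "already at destination"

// Dimension-order routing: correct the lowest unresolved dimension first.
// In a torus the shorter way around the ring is taken; ties go to +1
// (an assumption of this sketch).
Hop dorNextHop(const Coord& cur, const Coord& dst,
               const Coord& size, bool torus) {
    for (int d = 0; d < 3; ++d) {
        assert(cur[d] >= 0 && cur[d] < size[d]);
        assert(dst[d] >= 0 && dst[d] < size[d]);
        if (cur[d] == dst[d]) continue;          // dimension already resolved
        int step = (dst[d] > cur[d]) ? +1 : -1;  // mesh: only one way to go
        if (torus) {
            // Number of hops needed when travelling in the +1 direction.
            int fwd = (dst[d] - cur[d] + size[d]) % size[d];
            step = (fwd <= size[d] - fwd) ? +1 : -1;
        }
        return Hop{d, step};
    }
    return Hop{-1, 0};                           // cur == dst
}
```

Because every packet finishes one dimension before starting the next, the route between a given pair of routers is fixed, which is what makes DOR a deterministic algorithm.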













## **Stats for Nerds**

| 6 Compound Modules                 | 7 Simple Modules               |
|------------------------------------|--------------------------------|
| 1) titanStyleNetwork.ned           | 1) networkAddressesManager.ned |
| 2) cabinet.ned                     | 2) trafficPatternsManager.ned  |
| 3) chassis.ned                     | 3) statisticsManager.ned       |
| 4) pcb.ned                         | 4) trafficGenerator.ned        |
| 5) node.ned                        | 5) buffer.ned                  |
| 6) router.ned                      | 6) resourcesManager.ned        |
| (5 & 6 also implement C++ classes) | 7) switchFabric.ned            |

#### 5 msg definitions

- 1) bufferTimer.msg
- 2) resourcesManagerTimer.msg
- 3) data.msg
- 4) flit.msg
- 5) credit.msg

#### C++ code

- 1) 23 new C++ class definitions
- 2) a total of ~8000 lines of C++ code
- 3) O(n^2) complexity for the Dijkstra algorithm
- 4) O(1) complexity for all the major functions (routing decisions, traffic generation, etc.)
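An O(n^2) bound for Dijkstra is what the classic array-based variant gives: each of the n iterations scans all nodes for the closest unsettled one. The sketch below shows that variant under stated assumptions. The adjacency-matrix input and the name `shortestDistances` are hypothetical and not taken from the shortestPathsManager class.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Array-based Dijkstra, O(n^2) for n nodes.
// w[u][v] is the non-negative cost of link u -> v, or -1 if there is no link
// (this encoding is an assumption of the sketch).
// Returns the distance from `src` to every node; unreachable nodes keep INF.
std::vector<int> shortestDistances(const std::vector<std::vector<int>>& w,
                                   int src) {
    const int n = static_cast<int>(w.size());
    const int INF = std::numeric_limits<int>::max();
    assert(src >= 0 && src < n);
    std::vector<int> dist(n, INF);
    std::vector<bool> done(n, false);
    dist[src] = 0;
    for (int iter = 0; iter < n; ++iter) {
        // Linear scan for the closest node that is not settled yet.
        int u = -1;
        for (int v = 0; v < n; ++v)
            if (!done[v] && dist[v] != INF && (u == -1 || dist[v] < dist[u]))
                u = v;
        if (u == -1) break;                 // remaining nodes are unreachable
        done[u] = true;
        // Relax every outgoing link of u.
        for (int v = 0; v < n; ++v)
            if (w[u][v] >= 0 && !done[v] && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
    return dist;
}
```

Running the search once ahead of the simulation and storing the results in a table is one way to keep each later routing decision at constant cost.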

## An OptoHPC use case: Titan CRAY XK7 blade vs OPCB












## Performance Analysis Results – CRAY XK7 for both DOR & MOVR





## **Performance Analysis Results**






### Mean node Throughput Results

| Pattern           | Conventional Router (Gbps) | OE-Router-88ch (Gbps) | OE-Router-168ch (Gbps) |
|-------------------|----------------------------|-----------------------|------------------------|
| Uniform Random    | 14.28                      | 48 (3.36x)            | 92 (6.44x)             |
| Bit Rotation      | 20.2                       | 27.2 (1.34x)          | 51.46 (2.54x)          |
| Bit Complement    | 11.7                       | 23.67 (2.02x)         | 48 (4.10x)             |
| Bit Reverse       | 12                         | 17 (1.41x)            | 32.8 (2.73x)           |
| Shuffle           | 17.4                       | 19.25 (1.10x)         | 36.43 (2.09x)          |
| Tornado           | 5.23                       | 11.51 (2.20x)         | 24 (4.58x)             |
| Transpose         | 15.45                      | 21.63 (1.40x)         | 41.76 (2.70x)          |
| Nearest Neighbour | 36                         | 30.7 (0.85x)          | 57.6 (1.60x)           |
| Mean              | ~16.5                      | ~24.9 (1.5x)          | ~48 (2.90x)            |
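The Mean row can be reproduced from the eight per-pattern values. The short sketch below does so and suggests that the quoted 1.5x and 2.90x figures are ratios of the column means rather than averages of the per-pattern ratios. The function and variable names are illustrative only; the numbers are copied from the table.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of throughput values (Gbps).
double meanOf(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Per-pattern mean node throughput, in the table's row order.
const std::vector<double> conventional =
    {14.28, 20.2, 11.7, 12, 17.4, 5.23, 15.45, 36};
const std::vector<double> oe88 =
    {48, 27.2, 23.67, 17, 19.25, 11.51, 21.63, 30.7};
const std::vector<double> oe168 =
    {92, 51.46, 48, 32.8, 36.43, 24, 41.76, 57.6};

// Gain of a router variant over the conventional router,
// computed as a ratio of column means.
double meanGain(const std::vector<double>& variant) {
    return meanOf(variant) / meanOf(conventional);
}
```

With these inputs the column means come out at about 16.53, 24.87 and 48.01 Gbps, so `meanGain(oe88)` is about 1.50 and `meanGain(oe168)` about 2.90.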




## Conclusions

#### Successfully developed a queue-based simulator for complete HPC systems

- Offers support for both electrical and optical components
- Currently supports 3D Torus and Mesh topologies

- Supports 8 synthetic traffic patterns as well as user-defined statistical distributions and trace files
- Features both Store-and-Forward (SF) and Virtual Cut-Through (VCT) operation, like most state-of-the-art routers on the market
- Implements the DOR and Minimal Oblivious Valiant algorithms (with virtual-channel (VC) support), allowing for deadlock-free operation
- A comparison between conventional and O/E router technologies using OptoHPC showed 1.5x higher mean throughput for the 88-channel case and 2.9x higher mean throughput for the 168-channel case
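Several of the synthetic traffic patterns listed above are permutations of the source address bits. The sketch below follows the common textbook definitions for a network of 2^b nodes. Whether the trafficPatternsManager uses exactly these definitions is an assumption, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Mask selecting the low b address bits (network of 2^b nodes).
std::uint32_t lowMask(int b) {
    assert(b > 0 && b < 32);
    return (1u << b) - 1;
}

// Bit complement: every address bit is inverted.
std::uint32_t bitComplementDest(std::uint32_t s, int b) {
    return ~s & lowMask(b);
}

// Bit reverse: bit i of the source becomes bit (b-1-i) of the destination.
std::uint32_t bitReverseDest(std::uint32_t s, int b) {
    std::uint32_t d = 0;
    for (int i = 0; i < b; ++i)
        if (s & (1u << i)) d |= 1u << (b - 1 - i);
    return d & lowMask(b);
}

// Shuffle: rotate the b address bits left by one position.
std::uint32_t shuffleDest(std::uint32_t s, int b) {
    return ((s << 1) | (s >> (b - 1))) & lowMask(b);
}

// Bit rotation: rotate the b address bits right by one position.
std::uint32_t bitRotationDest(std::uint32_t s, int b) {
    return ((s >> 1) | (s << (b - 1))) & lowMask(b);
}

// Transpose: swap the upper and lower halves of the address (b must be even).
std::uint32_t transposeDest(std::uint32_t s, int b) {
    assert(b % 2 == 0);
    int h = b / 2;
    return ((s >> h) | (s << h)) & lowMask(b);
}

// Tornado on a ring of k routers: send almost halfway around the ring.
int tornadoDest(int x, int k) {
    assert(k > 0 && x >= 0 && x < k);
    return (x + (k + 1) / 2 - 1) % k;
}
```

For a 16-node network (b = 4), source 1 maps to 14 under bit complement, 8 under bit reverse, 2 under shuffle, 8 under bit rotation and 4 under transpose.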



## Thank you for your attention!

