# Capturing the Sensitivity of Optical Network Quality Metrics to its Network Interface Parameters

Marta Ortín-Obón<sup>§</sup>, Luca Ramini<sup>†</sup>, Victor Viñals<sup>§</sup>, Davide Bertozzi<sup>†</sup>
<sup>§</sup> gaz-DIIS-i3A, University of Zaragoza, Spain.
<sup>†</sup> ENDIF, University of Ferrara, Italy.
ortin.marta@unizar.es, luca.ramini@unife.it, victor@unizar.es, davide.bertozzi@unife.it

#### **ABSTRACT**

Optical Networks-on-chip (ONoCs) are gaining momentum as a way to improve energy consumption and bandwidth scalability in next-generation multi- and many-core systems. Although many valuable research works have investigated their properties, the vast majority of them lack an accurate exploration of the network interface (NI) architecture required to support optical communications on the silicon chip. The complexity of this architecture is especially critical for a specific kind of ONoCs: wavelength-routed ones. From a logic viewpoint, they can be considered as full non-blocking crossbars, hence the control complexity is implemented at their NIs. To our knowledge, this paper proposes the first complete NI architecture for wavelength-routed optical NoCs, by coping with the intricacy of networking issues such as flow control, buffering strategy, deadlock avoidance, serialization, and above all, their codesign in a complete architecture.

## 1. INTRODUCTION

The current research frontier for on-chip interconnection networks consists of assessing the feasibility of the optical interconnect technology by exploiting the recent remarkable advances of silicon photonics [8]. The literature on this topic is starting to become quite rich, mainly projecting superior bandwidth, latency and energy with respect to electrical wires beyond a critical length [12]. These benefits are extended to on-chip communication architectures, either as standalone optical networks (ONoCs) [13], or as hybrid interconnect fabrics [4]. Nonetheless, projected quality metrics are overly optimistic for a number of reasons extensively discussed in [1], including optimistic technology assumptions, use of logical topology designs instead of physical ones, and overlooking static power. An important source of inaccuracy of many projected results comes from the lack of a complete network interface architecture for driving on-chip optical communication, which may account for a large fraction of the overall network complexity. This is especially true for a particular category of optical networks-on-chip: the Wavelength-Routed ones (WRONoCs). These networks deliver contention-free global connectivity without need for arbitration or routing. They achieve this goal by multiplying the number of wavelengths used, and by associating each wavelength with a different and non-conflicting optical routing path. Despite the limited scalability, these networks are attractive for specific application domains, where performance predictability and ultra-low latency communications are a must.

WRONoCs can be conceptualized as non-blocking full crossbars, therefore all the complexity of the control architecture is located at the boundary of the interconnect fabric. To our knowledge, no complete NI architecture has been reported so far in the open literature, with the exception of NIs for space-routed ONoCs. However, these are conceptually simpler due to the intuitive conversion of electrical bit parallelism into optical wavelength parallelism [10]. In contrast, WRONoCs rely on serialization or on a limited bit parallelism, which calls into question the achievement of performance goals. Even neglecting this difference, NI design for an optical medium is a non-trivial task due to the number of interdependent design issues that come to the forefront, such as end-to-end flow control, buffer sizing, clock re-synchronization, and serialization ratio. This paper takes on the challenge of designing and characterizing the complete NI architecture for emerging WRONoCs, in an attempt to validate whether (and to what extent) the projected benefits of optical NoCs over their electrical counterparts are still preserved with the NI in the picture. The distinctive feature of this work is the completeness of the designed architecture, including both initiator and target sides. In particular, the digital part, concerning the true architecture-level design for mastering optical NoC operation, has been designed out of state-of-the-art basic building blocks (e.g., mesochronous synchronizers and dual-clock FIFOs), thus reflecting realistic quality metrics. The system-level requirements of a target multi-core processor with cache-coherent memory architecture have a large impact on the NI design. Finally, for the optical and optoelectronic components, we used a consistent set of static and dynamic power values from the same literature source [2, 1].
Our evaluation methodology consists of two steps: first, we synthesize and characterize latency and power for all the architecture components on a low power industrial 40 nm technology; second, we set up a complete SystemC-based simulation infrastructure (for both the optical and electronic parts) with RTL-equivalent accuracy, which makes it possible to capture fine-grained performance effects associated with the microarchitecture.

## 2. RELATED WORK

Early ONoC evaluation studies rely on coarse, high-level models and/or unrealistic traffic patterns [19, 24, 11, 18], while more recent ones come up with complete end-to-end evaluations using real application workloads [14] and/or more accurate optical network models [5]. In retrospect, early results have been only partially confirmed, while nonetheless showing the potential of ONoCs for on-chip communication. For instance, with an aggressive electrical baseline technology, it became more difficult to make a strong case for purely on-chip nanophotonic networks [14]. However, even in this case, it was still possible to show significant potential in using seamless intra-chip/inter-chip nanophotonic links. Moreover, other works (such as [1]) related network energy to total system energy, thus making the point for fast interconnect fabrics capable of cutting down the static energy of non-network components, although they are themselves not energy-efficient.

Figure 1: Wavelength-Selective routing

The refinement of comparative analysis frameworks is far from stabilizing. In fact, other missing aspects are progressively coming to the forefront, as the ONoC research concept strives to become an industry-relevant technology. So far, the NI architecture has been overlooked in most evaluation frameworks, or in the best case, only considered in the early stage of design. Some pioneering works account for the NI in their network analysis for wavelength-routed optical networks [17, 1, 3], or space-routed ONoCs [10]. In every case, they suffer from one of the following weaknesses: first, they model NI components only at behavioural level [17], or they target only the more abstract level of formalization of interface specification [3]; second, they consider only the signal driving section of the NI, basically up to the (de)serializers. As a result, higher-level network architecture design issues such as flow control, synchronization, or buffering are overlooked.

The distinctive features of our approach are: architecture completeness, comparison with electrical interface counterparts, physical synthesis of digital components, RTL-equivalent SystemC modeling for microarchitectural performance characterization, and analysis of the impact of NI parameters on global network quality metrics.

## 3. BACKGROUND ON WRONOCS

Wavelength-routed optical NoCs (WRONoCs) rely on the principle of wavelength-selective routing. As conceptually shown in Figure 1, every initiator can communicate with every target at the same time using different wavelengths. For instance, initiator I1 uses wavelengths 1, 2, 3, and 4 to reach targets 1, 2, 3, and 4, respectively. The topology connectivity pattern is chosen to ensure that wavelengths will never interfere with each other on the network optical paths. This way, all initiators can communicate with the same target by using differentiated wavelengths. WRONoCs support contention-free all-to-all communication with a modulation speed of 10 Gbps/wavelength. Our NI can work with any WRONoC topology. Without loss of generality, we model a wavelength-routed ring inspired by [15] and implemented on an optical layer vertically stacked on top of the baseline electronic layer.
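The contention-free property can be illustrated with a minimal sketch. It assumes a cyclic (Latin-square) wavelength assignment, which is only one possible choice and is not claimed to be the assignment used by the ring of [15]; the function names and the 4x4 size are hypothetical and chosen to match the I1 example above.

```python
# Hypothetical cyclic wavelength assignment for an N x N WRONoC.
# Assumption: wavelength = (initiator + target - 2) mod N + 1 (1-based indices).
# It reproduces the example in the text (I1 reaches T1..T4 on wavelengths 1..4),
# but it is not taken from any specific topology.

N = 4  # number of initiators and targets (illustrative)


def wavelength(initiator: int, target: int, n: int = N) -> int:
    """Return the 1-based wavelength used by `initiator` to reach `target`."""
    return (initiator + target - 2) % n + 1


def is_contention_free(n: int = N) -> bool:
    """Check that wavelengths never collide at a target or at an initiator."""
    for t in range(1, n + 1):
        # All initiators must reach target t on distinct wavelengths.
        if len({wavelength(i, t, n) for i in range(1, n + 1)}) != n:
            return False
    for i in range(1, n + 1):
        # Initiator i must use a distinct wavelength for each target.
        if len({wavelength(i, t, n) for t in range(1, n + 1)}) != n:
            return False
    return True


i1_wavelengths = [wavelength(1, t) for t in range(1, N + 1)]
print(i1_wavelengths, is_contention_free())
```

Under this assumed assignment, every target sees each wavelength exactly once, which is what lets all initiators address the same target simultaneously.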



Figure 2: Dependence between a request and response at the NI.

## 4. TARGET ARCHITECTURE

During the design of the NI, we consider a high-impact system requirement: message-dependent deadlock avoidance. This arises from the interactions and dependencies created at network endpoints between different message types [7, 9]. Figure 2 shows the dependence between a request and response at the NI. In a complete system, the combination of these effects may lead to cyclic dependencies. Message-dependent deadlocks, once they occur, block resources at both network endpoints and inside the network indefinitely, even if an algorithm is used to avoid routing-dependent deadlocks in the network-on-chip. This arises from the fact that network routers are unable to differentiate between message-dependent deadlocks and normal network congestion. When we apply these considerations to WRONoCs, the problem gets simplified by the fact that there is no buffering inside the network. Therefore, the ONoC automatically satisfies the consumption assumption, which is a necessary (but not sufficient) condition for deadlock avoidance. To enforce the sufficient condition, we must allocate a different buffer for each kind of message in the NI. This has direct implications on the buffering architecture of our target NI (that is, on the number of virtual channels), depending on the communication protocol the WRONoC needs to support.

As a consequence, we make an assumption on a target system architecture. Without loss of generality, we focus on a homogeneous chip multiprocessor with 16 cores, similar to the Tilera architecture [23]. Each core has a private L1 cache and a bank of the shared distributed L2 cache, both connected to a common NI through a crossbar. The system has directory-based coherence managed with a MESI protocol. By analysing the dependency chains of the protocol and deadlock-free buffer sharing opportunities, we came up with a requirement of 3 VCs for deadlock avoidance. Proof is omitted for lack of space.

## 5. NI ARCHITECTURE

This section presents, to the best of our knowledge, the first complete network interface architecture for wavelength-routed optical networks, as depicted in Figure 3. As a consequence, the objective is not to present the best possible design point, but rather to start considering the basic components, and deriving guidelines about which ones deserve the most intensive optimization effort. Clearly, ONoCs move most of their control logic to the NIs, which should therefore not be oversimplified with abstract models.

To avoid message-dependent deadlock, every NI needs separate buffering resources (virtual channels, VCs) for each one of the three message classes of the MESI protocol. This should be combined with the requirements of wavelength routing: each initiator needs an output for each possible target, and each target needs an input for each possible initiator. As a result, in an initial version of the NI, each initiator came with 3 FIFOs for each potential target, and each target with 3 FIFOs for each potential initiator. In a more energy-efficient version of the NI (see Figure 3), the transmission side reuses the same 3 FIFOs for all destinations, and flits are dispatched to different paths afterwards (all the logic components after the 1x15 demultiplexers are replicated for each destination). All the FIFOs at both the transmission and the reception side must be dual-clock FIFOs (DC FIFOs) to move data between the processor frequency domain (we assume 1.2 GHz) and the one used inside the NI. As explained hereafter, the latter depends on bit parallelism. We used the DC FIFO architecture presented in [22].
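As a quick illustration of the buffering cost of the two versions, the FIFO counts per NI can be worked out for the 16-core target system. This is a sketch under the assumptions stated in the text (3 VCs, 15 possible peers per node); the variable names are ours.

```python
# Number of dual-clock FIFOs per NI for a 16-node system with 3 virtual channels.
NODES = 16
VCS = 3
PEERS = NODES - 1  # every other node is a potential target (or initiator)

tx_fifos_initial = VCS * PEERS  # initial version: 3 FIFOs per potential target
tx_fifos_shared = VCS           # energy-efficient version: 3 FIFOs shared by all targets
rx_fifos = VCS * PEERS          # reception side: 3 FIFOs per potential initiator

print(tx_fifos_initial, tx_fifos_shared, rx_fifos)  # prints: 45 3 45
```

Sharing the transmission FIFOs therefore removes 42 of the 45 transmission-side FIFOs, while the reception side keeps one set of VCs per initiator.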

To size the DC FIFOs, we considered the size of the packets that would use each of the VCs: control packets need 2 flits, while data packets need 21 flits assuming flits are 32 bits long. The FIFO depth will be assessed in the experimental results, as well as the flit width. The minimum size for the DC FIFO to achieve perfect throughput is 5 slots [22], so all the VCs in the transmission side have been sized accordingly. For the reception side, we sized the data VC based on the round-trip latency in order to allow uninterrupted communications, ending up with 15-slot DC FIFOs. However, for the control VCs we decided to keep small 5-slot DC FIFOs because they can already fit two complete packets and we do not expect to send many back-to-back control packets with the target cache-coherence protocol.
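The sizing argument for the control VCs can be checked with a few lines of arithmetic. The sketch below assumes that the usable capacity of a DC FIFO is one flit less than its number of slots, as noted in the caption of Table 4; the helper names are hypothetical.

```python
FLIT_BITS = 32
CONTROL_PACKET_FLITS = 2
DATA_PACKET_FLITS = 21


def usable_capacity(slots: int) -> int:
    """Usable flits in a DC FIFO, assuming one slot cannot be used
    (see the note in the caption of Table 4)."""
    return slots - 1


def packets_that_fit(slots: int, packet_flits: int) -> int:
    """Number of complete packets that can be stored at once."""
    return usable_capacity(slots) // packet_flits


control_packets_in_5_slots = packets_that_fit(5, CONTROL_PACKET_FLITS)
data_packets_in_15_slots = packets_that_fit(15, DATA_PACKET_FLITS)
data_packet_bits = DATA_PACKET_FLITS * FLIT_BITS

print(control_packets_in_5_slots, data_packets_in_15_slots, data_packet_bits)
```

A 5-slot FIFO holds two complete control packets, whereas the 15-slot data FIFO never holds a full 21-flit data packet: it is dimensioned to cover the round-trip latency so that flits keep streaming, not to store whole packets.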

After flits are sent to the appropriate path depending on their destination, they need to be translated into a 10 GHz bit stream in order to be transmitted through the optical NoC. This serialization process is parallelized to some extent to increase bandwidth and reduce latency. 3-bit parallelism means that 3 serializers of 11 bits each work in parallel to serialize the 32 bits of a flit, resulting in a bandwidth of 30 Gbps. The bit parallelism determines the frequency inside the optical NI: 1.1 ns (0.1 ns × number of bits per serializer) are needed to serialize a flit with 3-bit parallelism, but only 0.8 ns are needed with 4-bit parallelism. In turn, this also impacts the size of the reception DC FIFO based on round-trip latency, which increases from 15 to 17 slots when moving from 3 to 4-bit parallelism.
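The relation between bit parallelism, serializer width, flit serialization time and bandwidth can be sketched as follows. The code only restates the arithmetic of the paragraph above (32-bit flits, 10 Gbps per wavelength); times are kept in picoseconds to avoid floating-point rounding, and the function names are ours.

```python
import math

FLIT_BITS = 32
BIT_TIME_PS = 100          # 10 Gbps per wavelength -> 0.1 ns per bit
GBPS_PER_WAVELENGTH = 10


def serializer_bits(parallelism: int) -> int:
    """Bits handled by each of the parallel serializers for one flit."""
    return math.ceil(FLIT_BITS / parallelism)


def flit_time_ps(parallelism: int) -> int:
    """Time needed to serialize one flit, in picoseconds."""
    return serializer_bits(parallelism) * BIT_TIME_PS


def peak_bandwidth_gbps(parallelism: int) -> int:
    """Aggregate line rate of the parallel wavelengths."""
    return parallelism * GBPS_PER_WAVELENGTH


for p in (3, 4):
    print(p, serializer_bits(p), flit_time_ps(p) / 1000, peak_bandwidth_gbps(p))
```

For 3-bit parallelism this gives 11-bit serializers, 1.1 ns per flit and 30 Gbps; for 4-bit parallelism, 8-bit serializers, 0.8 ns per flit and 40 Gbps.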

Another key issue to be considered in the NI is the resynchronization of received optical pulses with the clock signal of the electronic receiver. In this paper we assume source-synchronous communication, which implies that each point-to-point communication requires a strobe signal to be transmitted along with the data on a separate wavelength. With current technology, this seems to be the most realistic solution, even considering the promising research effort that is currently being devoted to transmitting clock signals across an optical medium [16]. The source-synchronous clock is then used at the reception side of the NI to drive the deserializers and, after a clock divider, the front-end of the DC FIFOs. We assume that a form of clock gating is implemented, therefore when no data is transmitted, the optical clock signal is gated.

Figure 3: Optical Network Interface Architecture for 3-bit parallelism

Another typically overlooked issue is the backpressure mechanism. We opt for credit-based flow control because it does not rely on timing assumptions, and credit tokens can reuse the existing communication paths. Besides, the low dynamic power of ONoCs can easily tolerate the signaling overhead of this flow control strategy. Credits are generated at the reception side of the NI when a flit leaves the DC FIFO (at the processor frequency) and forwarded to the transmission side so that they can be sent back to the source (at the NI frequency). In order to change from one frequency domain to the other, we opted for synchronizing the valid bits with a brute-force synchronizer. For the scheme to work, credit flit data must be constant for three NI cycles; during that time, credits are accumulated in credit counters. As soon as the credit flit arrives at the transmission side, it has priority over the flits from the VCs. The mandatory waiting time guarantees VCs will not suffer from starvation. To make better use of the 32 bits of a flit, credits for all VCs of the same destination are sent together in the same credit flit. When credits arrive at the reception side of the source NI, they need to go through a mesochronous synchronizer to adapt the frequency derived from the received clock to the local NI frequency. Dedicated FIFOs for each source are needed at the reception side of the NIs to support this credit-based flow control. This is a clear candidate for future optimizations.
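The credit mechanism can be summarised with a small behavioural sketch. It is a simplified, untimed model and not the RTL of the NI: synchronizers, priorities and the three-cycle hold time are omitted, the class names are hypothetical, and the initial credit values (4, 4, 14) assume two 5-slot control VCs and one 15-slot data VC whose usable capacity is one flit less than the number of slots.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CreditSender:
    """Transmission-side credit state towards one destination (hypothetical model)."""
    credits: List[int]  # free downstream buffer space, one entry per VC

    def can_send(self, vc: int) -> bool:
        return self.credits[vc] > 0

    def send_flit(self, vc: int) -> None:
        if not self.can_send(vc):
            raise RuntimeError(f"VC {vc} has no credits; the flit must wait")
        self.credits[vc] -= 1

    def receive_credit_flit(self, returned: List[int]) -> None:
        # One credit flit carries the accumulated credits of all VCs.
        for vc, count in enumerate(returned):
            self.credits[vc] += count


@dataclass
class CreditReceiver:
    """Reception-side counters: one credit per flit that leaves a VC FIFO."""
    pending: List[int] = field(default_factory=lambda: [0, 0, 0])

    def flit_left_fifo(self, vc: int) -> None:
        self.pending[vc] += 1

    def make_credit_flit(self) -> List[int]:
        # Send all accumulated credits together and reset the counters.
        out, self.pending = self.pending, [0] * len(self.pending)
        return out


sender = CreditSender(credits=[4, 4, 14])
receiver = CreditReceiver()
for _ in range(4):
    sender.send_flit(0)             # fill the remote control VC 0
blocked = not sender.can_send(0)    # no credits left: backpressure
receiver.flit_left_fifo(0)          # two flits are consumed at the destination
receiver.flit_left_fifo(0)
sender.receive_credit_flit(receiver.make_credit_flit())
print(blocked, sender.credits)
```

In this run the sender stalls on VC 0 after four flits and resumes once the credit flit returns two credits, without ever overflowing the remote FIFO.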

## 6. BASELINE ELECTRONIC NOC

The baseline electronic switch architecture is the consolidated ×pipesLite architecture [21], which represents an ultra-low complexity design point for electronic NoCs. Each 32-bit switch includes 3 VCs to avoid message-dependent deadlock, with 5 slots each. It takes one cycle to traverse the switch and one cycle to traverse each link.

The network interface consists of two parts [14]. The first one is a packetizer, which acts as protocol converter from the IP-core protocol to the network one. This block is also required for the ONoC, therefore it is not considered in this comparison framework, and is not shown in Figure 3 either. The second one is the buffering stage. In order to preserve the generality of the design and support cores with different operating frequencies that access an ENoC with fixed common frequency, dual-clock FIFOs have been included at the electronic NIs, similar to the ONoC NI design. However, in this case all DC FIFOs have 5 slots at both initiator and target side, because round-trip latency does not require larger buffers for maximum throughput operation.

## 7. EVALUATION

This section characterises the most important network quality metrics for the electro-optical NI: latency, throughput, static power, and energy-per-bit. Results for an ENoC configured with typical parameters from [21] are also included. This aims to lay the groundwork for a future comprehensive cross-benchmarking study, which is out of the scope of this paper.

Table 1: Photonic components parameters and values with aggressive and conservative technologies

| Parameter                  | Cons. tech.    | Aggr. tech.    |
|----------------------------|----------------|----------------|
| Coupler loss               | 0.46 dB        | 0.46 dB        |
| Modulator insertion loss   | 4.0 dB         | 4.0 dB         |
| Photodetector loss         | 1.0 dB         | 1.0 dB         |
| Filter drop loss           | 1.0 dB         | 1.0 dB         |
| Through ring loss          | 0.0001 dB/ring | 0.0001 dB/ring |
| Propagation loss           | 1.5 dB/cm      | 1.5 dB/cm      |
| Bending loss               | 0.0005 dB      | 0.0005 dB      |
| Crossing loss              | 0.52 dB        | 0.18 dB        |
| Wall-plug laser efficiency | 8%             | 20%            |
| Thermal tuning             | 20 µW/ring     | 20 µW/ring     |
| Transmitter (dyn. energy)  | 50 fJ/bit      | 20 fJ/bit      |
| Transmitter (fixed energy) | 10 fJ/bit      | 2.5 fJ/bit     |
| Receiver (dyn. energy)     | 25 fJ/bit      | 10 fJ/bit      |
| Receiver (fixed energy)    | 15 fJ/bit      | 5 fJ/bit       |

### 7.1 Methodology

To obtain accurate latency results, we implemented detailed RTL models of the optical and electronic network interfaces and NoCs using SystemC. We instantiated a 4x4 2D mesh for the ENoC, and a similar system connected through the optical ring for the ONoC. The network-wide focus, well beyond the NI, aims at relating NI quality metrics to network ones. Delay values for the optical ring have been backannotated from physical-layer analysis results [6], and have been differentiated on a per-path basis.

For power modeling, every electronic component has been synthesized, placed and routed using a low power 40 nm industrial technology library. Power metrics have been calculated by back-annotating the switching activity of block internal nets, and then importing waveforms in the PrimeTime tool. We have applied clock gating to achieve realistic static power values. Energy-per-bit has been computed by assuming 50% switching activity. For the fast developing optical technology, we consider a coherent set of both conservative and aggressive values (obtained from [2, 1]). The photonic components and values are listed in Table 1. Table 2 summarizes the static power and energy-per-bit for all the electronic and optical devices (all DC FIFOs, independently of size and frequency, are reported to have the same static power as a consequence of clock gating). These values are only realistic under the assumption of low network contention, which reflects the typical operating condition of cache-coherent multicore processors.
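As an example of how the entries of Table 2 can be combined, the sketch below aggregates the static power of the DC FIFOs of one NI with 3-bit parallelism. It assumes that the static power column of Table 2 is reported per component instance, so that the per-NI contribution is the instance count times that value; this interpretation and the helper name are ours.

```python
# (component, count per NI, static power per instance in mW), 3-bit parallelism.
# Values copied from the DC FIFO rows of Table 2.
DC_FIFO_ROWS = [
    ("DC_FIFO 5slots (TX)", 3, 0.12),
    ("DC_FIFO 5slots (RX)", 30, 0.12),
    ("DC_FIFO 15-17 slots", 15, 0.12),
]


def total_static_power_mw(rows) -> float:
    """Sum count * per-instance static power over the given rows.
    Assumes the table reports static power per instance."""
    return sum(count * power for _, count, power in rows)


dc_fifo_static_mw = total_static_power_mw(DC_FIFO_ROWS)
print(f"{dc_fifo_static_mw:.2f} mW")
```

The same helper can be applied to any other subset of rows, for instance to compare the 3-bit and 4-bit configurations.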

### 7.2 NI Latency Breakdown

Figure 4 presents the latency breakdown for the NI components and the ONoC, obtained from our accurate RTL-equivalent simulations. We clearly see that the latency of the network is negligible, but it requires support from a time consuming NI. Inside the NI, the DC FIFOs are the components with the largest latency.



Total latency: ctrl flit = 9.04ns; data flit = 9.31ns

Figure 4: Latency breakdown of the optical NI with 3-bit parallelism and the optical ring.



Figure 5: Latency of the most common communication patterns. For the ENoC, we include minimum, maximum, and average paths.

### 7.3 Transaction Latency

We simulate the most common traffic patterns generated by a MESI coherence protocol in our RTL models without any contention. The increased accuracy of our analysis stems from the fact that our packet injectors and ejectors model actual transactions of the protocol, as well as their interdependencies. Table 3 describes the analysed compound transactions and Figure 5 presents the zero-load latency results. The messages included in these patterns amount to an average 99.9% of the total network traffic, as we observed from full-system simulations of realistic parallel benchmarks from PARSEC and SPLASH2 and multiprogrammed workloads built with SPEC applications (we only exclude communication with the memory controllers). Therefore, they are a very good indicator of the network latency improvements we can expect from the optical network, including its (nonnegligible) network interface overhead.

We observe that in all the patterns except the last one, the ONoC either beats or matches the ENoC for all path lengths. As opposed to the ENoC, most of the latency of the ONoC is spent in the NI, which is needed to support the low latency optical communication. The tendency changes in pattern 5 because the replacement packet is using a VC designed for control to transmit data, and the smaller FIFO cannot store enough flits to support the round-trip latency. However, these messages are only 7.4% of the total network traffic.

### 7.4 Throughput

In this section, we test the behaviour of the electronic and optical networks under contention. To do that, we focus only on requests and data replies. We leave the ACKs out because they are not in the critical path of the communications. We pick a node to be the main L1 and another node to be the L2, and count the number of completed transactions per second. Then, we gradually insert congestion into the network by having all the other nodes send requests to the same L2 cache, and keep counting the transactions just between the main L1 and the L2. All the L1s support only one outstanding transaction and inject a new request as soon as they receive the reply.

Table 2: Static Power and Dynamic Energy of Electronic and Optical Devices.

| HARDWARE COMPONENTS        | 3-bit parallelism:<br>count per NI | 3-bit parallelism:<br>STATIC POWER (mW) | 3-bit parallelism:<br>DYNAMIC ENERGY (fJ/bit) | 4-bit parallelism:<br>count per NI | 4-bit parallelism:<br>STATIC POWER (mW) | 4-bit parallelism:<br>DYNAMIC ENERGY (fJ/bit) |
|----------------------------|-------------------|-----------------------------|-------------------------------|-----------------|-----------------------------|-------------------------------|
| DC_FIFO 5slots (TX)        | 3                 | 0.12                        | 10.65                         | 3               | 0.12                        | 12.72                         |
| DC_FIFO 5slots (RX)        | 30                | 0.12                        | 8.54                          | 30              | 0.12                        | 10.2                          |
| DC_FIFO 15-17 slots        | 15                | 0.12                        | 26.50                         | 15              | 0.12                        | 31.65                         |
| DEMUX1x3                   | 1                 | 0.000725                    | 0.92                          | 1               | 0.000725                    | 0.92                          |
| DEMUX1x15                  | 3                 | 0.0021                      | 25.21                         | 3               | 0.0021                      | 25.21                         |
| DEMUX1x4                   | 15                | 0.00056                     | 6.72                          | 15              | 0.00056                     | 6.72                          |
| MUX4x1 + ARB               | 15                | 0.08                        | 0.36                          | 15              | 0.11                        | 0.49                          |
| MUX45x1 + ARB              | 1                 | 0.9                         | 5.09                          | 1               | 0.9                         | 5.09                          |
| SERIALIZER                 | 45                | 0.0475                      | 9.41                          | 60              | 0.0417                      | 2.63                          |
| DESERIALIZER               | 45                | 0.0289                      | 7.74                          | 60              | 0.0281                      | 6.12                          |
| MESO-SYNCHRONIZER          | 45                | 0.041                       | 8.00                          | 45              | 0.0565                      | 11.1                          |
| COUNTER 2bits              | 45                | 0.01482                     | 1.014                         | 45              | 0.01482                     | 1.014                         |
| BRUTE FORCE SYNC           | 15                | 0.004234                    | 1.4                           | 15              | 0.00503                     | 1.66                          |
| CLOCK DIVIDER              | 15                | 0.01172                     | 0.6                           | 15              | 0.0139                      | 0.714                         |
| TSV                        | 120               | /                           | 2.50                          | 150             | /                           | 2.50                          |
| TRANSMITTER aggressive     | 60                | 0.025                       | 20                            | 75              | 0.025                       | 20                            |
| TRANSMITTER conservative   | 60                | 0.100                       | 50                            | 75              | 0.100                       | 50                            |
| RECEIVER aggressive        | 60                | 0.050                       | 10                            | 75              | 0.050                       | 10                            |
| RECEIVER conservative      | 60                | 0.150                       | 25                            | 75              | 0.150                       | 25                            |
| THERMAL TUNING / RING (20K) | 180              | 0.020                       | /                             | 225             | 0.020                       | /                             |
| LASER POWER aggr           | /                 | 0.0421                      | /                             | /               | 0.0525                      | /                             |
| LASER POWER real           | /                 | 0.308                       | /                             | /               | 0.385                       | /                             |
| E-SWITCH (3VCs)            | /                 | 17.9                        | 193                           | /               | 17.9                        | 193                           |

Table 3: Messages generated by the coherence protocol.

| id    | Event                                  | Sequence of messages |
|-------|----------------------------------------|----------------------|
| P1a   | L1 miss                                | 1. Request from L1 to L2<br>2. Data reply from L2 to L1<br>3. ACK from L1 to L2 |
| P1b/c | L1 write miss, 1/2 sharers             | 1. Request from L1 to L2<br>2. L2 sends data reply and invalidates 1/2 sharers<br>3. Sharers send ACK to L1 req.<br>4. ACK from L1 to L2 |
| P2a   | L1 needs upgrade to write              | 1. Request from L1 to L2<br>2. ACK reply from L2 to L1<br>3. ACK from L1 to L2 |
| P2b/c | L1 needs upgrade to write, 1/2 sharers | 1. Request from L1 to L2<br>2. ACK reply from L2 to L1 and invalidates 1/2 sharers<br>3. Sharers send ACK to L1 req.<br>4. ACK from L1 to L2 |
| P3    | L1 write miss, another owner           | 1. Request from L1 to L2<br>2. L2 forwards request to owner<br>3. Owner sends data to L1<br>4. ACK from L1 to L2 |
| P4    | L1 read miss, another owner            | 1. Request from L1 to L2<br>2. L2 forwards request to owner<br>3. Owner sends data to L1 and L2<br>4. ACK from L1 to L2 |
| P5    | L1 replacement                         | 1. Writeback from L1 to L2<br>2. ACK from L2 to L1 |

Figure 6: Number of completed transactions per 1K ns between two nodes as the number of interferers increases.

Figure 6 presents the results for the ENoC and the 3, 4 and 6-bit parallelism ONoCs. Without contention, more transactions get completed in the optical NoC because their latency is lower. Including only one interferer does not affect results because all networks have enough bandwidth to support two concurrent L1 requestors at maximum throughput. As we keep increasing the number of interferers, the throughput for the 3-bit parallelism ONoC drops much faster than for the ENoC. This is because the former can eject a maximum of 30 Gbps, while the latter transmits flits at 38.4 Gbps. For this reason, replies need to wait much longer until they can be transmitted. However, when considering the ONoC with 4-bit parallelism, which has a bandwidth of 40 Gbps, we see results comparable and even superior to those of the

Table 4: Buffer sizes explored for the 3 VCs at each side of the NI. Note that the actual capacity of the DC FIFOs is one flit less than the number of slots.

| id | Transmission side | Reception side |
|----|-------------------|----------------|
| A  | 3, 3, 3           | 3, 3, 3        |
| B  | 3, 3, 5           | 3, 3, 5        |
| C  | 5, 5, 5           | 5, 5, 5        |
| D  | 5, 5, 5           | 5, 5, 15       |
| E  | 5, 5, 22          | 5, 5, 15       |
| F  | 10, 10, 44        | 10, 10, 44     |



Figure 7: Transaction latency with varying buffer sizes.

ENoC. At 6-bit parallelism, the increased bandwidth (60 Gbps) only gives the ONoC a slight advantage, which is not enough to justify the increase in static power (as will be documented later).
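The bandwidth argument above can be checked with a few lines of arithmetic. The sketch below assumes a rate of 10 Gbps per wavelength lane, which is inferred from the quoted figures (30, 40 and 60 Gbps for 3, 4 and 6-bit parallelism) rather than stated explicitly; the function names are illustrative.

```python
# Illustrative only: the per-lane rate is inferred from the figures quoted in
# the text (3 bits -> 30 Gbps, 4 bits -> 40 Gbps, 6 bits -> 60 Gbps).
GBPS_PER_LANE = 10.0  # assumed line rate of one wavelength lane
ENOC_GBPS = 38.4      # ENoC flit bandwidth quoted in the text


def onoc_bandwidth_gbps(bit_parallelism: int) -> float:
    """Aggregate bandwidth of one optical path for a given bit parallelism."""
    return bit_parallelism * GBPS_PER_LANE


def bandwidth_vs_enoc(bit_parallelism: int) -> float:
    """Ratio of ONoC to ENoC bandwidth (above 1 means the ONoC is wider)."""
    return onoc_bandwidth_gbps(bit_parallelism) / ENOC_GBPS


if __name__ == "__main__":
    for bits in (3, 4, 6):
        print(f"{bits}-bit parallelism: {onoc_bandwidth_gbps(bits):.0f} Gbps, "
              f"{bandwidth_vs_enoc(bits):.2f}x the ENoC")
```

Under this assumption, only the 3-bit configuration falls below the ENoC bandwidth, which matches the configuration whose throughput degrades fastest as interferers are added.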

## 7.5 Buffer Size Exploration

In this section we analyse the effect of modifying the buffering of the optical network interface. We fix the bit parallelism at 3 and explore all the buffer size combinations detailed in Table 4. Using the same request-reply pattern as in the previous section with a maximum of 4 outstanding requests per node, we analyse how buffer size in the NI affects transaction latency. Results are depicted in Figure 7.

In case A, the minimum buffering has a very negative impact on performance because data packets are stalled waiting for credits from the reception side FIFOs, which can only store 2 flits (for a correct management of the DC FIFO, one slot is always left empty). This effect is slightly mitigated when we increase the buffer size of the data VC to 5 slots in case B. Even though the DC FIFOs can achieve perfect throughput, backpressure still prevents faster communication. We do not see any difference when increasing the size of the control VCs in case C because the bottleneck is in the data VC. However, in case D, the reception side has been sized based on the round-trip latency and we achieve the maximum possible throughput. The larger buffers in cases E and F do not show any further improvement because the network is already using all the available bandwidth.
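The backpressure effect described above can be illustrated with a toy credit-based flow-control model. This is a minimal sketch, not the simulator used for Figure 7: it assumes the receiver drains flits immediately, and the round trip of 14 flit times is a hypothetical value chosen only so that a 15-slot FIFO (14 usable slots) just covers the credit loop.

```python
from collections import deque


def simulate_throughput(slots: int, round_trip: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the sender can inject a flit.

    slots      -- DC FIFO slots at the reception side (one is always left empty)
    round_trip -- hypothetical credit round-trip latency, in flit times
    """
    credits = slots - 1          # usable capacity is one flit less than the slots
    returning = deque()          # cycles at which spent credits come back
    sent = 0
    for now in range(cycles):
        # Collect every credit whose round trip has completed.
        while returning and returning[0] <= now:
            returning.popleft()
            credits += 1
        # Inject one flit per cycle whenever a credit is available.
        if credits > 0:
            credits -= 1
            sent += 1
            returning.append(now + round_trip)
    return sent / cycles


if __name__ == "__main__":
    ROUND_TRIP = 14  # assumption for illustration, not a value from the paper
    for slots in (3, 5, 15, 22, 44):
        print(f"{slots:2d} slots: throughput {simulate_throughput(slots, ROUND_TRIP):.2f}")
```

In this model the throughput grows with the usable capacity until the buffer covers the round trip, after which extra slots bring no benefit, mirroring the behaviour reported for cases A to F.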

## 7.6 Flit Width Exploration

In this section we analyse the effect of modifying the flit width and the bit parallelism of the optical network interface. Figure 8 presents the transaction latency for several combinations using a request-reply pattern, while increasing the injection rate. Four nodes that act as L1 caches send requests to the same L2 cache, with a maximum of 4 outstanding requests each. While the injection rate is low, there is enough time between requests to service all of them without degrading the latency. When the injection rate



Figure 8: Transaction latency with varying flit width and bit parallelism.

increases, requests start accumulating at the L2 until a saturation point is reached. At that point, the number of outstanding requests is maximum and we obtain the transaction latency under congestion. This saturation point arrives later with higher bit parallelism and the maximum latency is lower because each transaction requires less time.

If we compare results without congestion for 32 and 64-bit flits, we see that, for a given bit parallelism, latency is shorter with 32-bit flits. This is because, even though 64-bit flit packets have fewer flits, it takes longer to serialize all their bits for optical transmission. However, this trend is reversed under congestion because the packet serialization latency that determines the length of the queue at the L2 is shorter. Compared to the latency for 32-bit flits with 3-bit parallelism under no contention, 64-bit flits with 4 and 6-bit parallelism are 6.4% and 27.3% faster, respectively.

## 7.7 Power and Energy-per-Bit

Figure 9 depicts the static power and (dynamic) energy-per-bit for the ENoC vs. the 3 and 4-bit parallelism ONoCs. We do not consider ONoCs with less than 3-bit parallelism because the bandwidth of the optical paths would be too low, nor ONoCs with more than 4-bit parallelism because the static power becomes unacceptable (we can see a clear trend in Figure 9). We present a breakdown of the contributions of the NIs and NoCs. For the NI, we also separate the electronic components from the optical (and analog) ones. The optical NoC consumes only laser power, so it has no impact on dynamic energy. In computing total power figures, we consider two sets of parameters for optical interconnect technology, corresponding to its high maturity (named aggressive parameters) and to its low maturity (conservative parameters).

We observe that the electronic switches dominate the static power of the ENoC, accounting for 95.8% of the total. However, this trend is reversed in the ONoC, where the NoC contributes only 10.6% and 11.8% for the aggressive technology with 3 and 4-bit parallelism, respectively. It is worth highlighting that most of the static power of the electronic components in the NI comes from the DC FIFOs. Also, the savings in execution time of the ONoC vs. the ENoC may compensate for the higher static power and result in overall energy reductions. This is especially true when we consider the power of the system as a whole, as claimed in [14].

For energy-per-bit we included minimum, maximum and average-length paths for the ENoC and specific values for control and data packets for the ONoC (which change due to the different size of the reception DC FIFOs). We clearly see



Figure 9: Static power and Energy-per-Bit of the NIs and the electronic and optical NoCs.

that the ONoC has significantly lower energy-per-bit than the ENoC, which confirms the trend observed in previous literature. Apart from that, we still see how the main contributor for the ENoC energy is the NoC, while the NI carries all the complexity for the ONoC.

Figure 10 shows how static power and energy change when modifying the flit width. We focus on the ONoC with 32-bit flits and 3-bit parallelism, and the ONoC with 64-bit flits and 6-bit parallelism. In both cases, it takes 1.1 ns to serialize the bits of the flit (with 64 bits, the flit width is double but so is the bit parallelism), but the latter needs fewer flits to transmit each packet. Therefore, it has better performance (as we showed in Section 7.6) and it is interesting to explore the tradeoff with power and energy.
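The 1.1 ns figure can be reproduced with a short calculation. The sketch below assumes 10 Gbps per wavelength lane, a value inferred from the bandwidths quoted earlier (30 Gbps at 3-bit parallelism, 60 Gbps at 6-bit parallelism); the helper name is hypothetical.

```python
GBPS_PER_LANE = 10.0  # assumed line rate of one wavelength lane (inferred)


def serialization_ns(flit_bits: int, bit_parallelism: int,
                     gbps_per_lane: float = GBPS_PER_LANE) -> float:
    """Time to push one flit through the serializer, in nanoseconds.

    Bits divided by Gbps gives nanoseconds directly.
    """
    return flit_bits / (bit_parallelism * gbps_per_lane)


if __name__ == "__main__":
    for flit_bits, lanes in ((32, 3), (64, 6), (64, 4), (64, 3)):
        print(f"{flit_bits}-bit flit, {lanes}-bit parallelism: "
              f"{serialization_ns(flit_bits, lanes):.2f} ns")
```

Doubling both the flit width and the bit parallelism leaves the per-flit serialization time unchanged, while a 64-bit flit at 3 or 4-bit parallelism takes longer to serialize than a 32-bit flit at 3-bit parallelism.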

Static power for the 64-bit flit ONoC is 1.64 times larger than for the 32-bit flit ONoC with conservative technology, due to the larger number and size of electronic and optical components needed to support the increased flit width and bit parallelism. With aggressive technology, this factor is reduced to 1.54.

In order to check whether the increased power is compensated by the reduced latency, we calculated the static energy consumed by the ONoC to complete a request-reply transaction under no contention. We consider only the contribution of the static power, which accounts for by far the largest share of the total energy in optically enabled real cache-coherent systems, since these typically experience very low traffic loads [20]. The 64-bit flit ONoC consumes 17% and 11% more energy than the 32-bit flit ONoC with the conservative and aggressive technologies, respectively. Therefore, the improved transaction latency is not able to compensate for or reverse the energy trend. However, when introducing IP-core static power into the picture, the conclusion may change significantly depending on the potential of the enhanced parallelism to cut down on system execution time. This is left for future work.
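As a rough sanity check of this trade-off, static energy per transaction can be approximated as static power multiplied by transaction latency. The sketch below combines the static power ratios of 1.64 and 1.54 between the two configurations with the 27.3% latency figure from Section 7.6, under the assumption that "27.3% faster" means a 27.3% reduction in latency; this is a back-of-the-envelope model and not the procedure used to obtain the reported numbers.

```python
def static_energy_ratio(power_ratio: float, latency_reduction: float) -> float:
    """Ratio of static energy per transaction between two configurations.

    power_ratio       -- static power of the wider ONoC relative to the narrower one
    latency_reduction -- fractional latency saving of the wider ONoC (assumed reading
                         of the "faster" percentages quoted in the text)
    """
    return power_ratio * (1.0 - latency_reduction)


if __name__ == "__main__":
    LATENCY_REDUCTION = 0.273  # 64-bit flits, 6-bit parallelism, no contention
    for label, power_ratio in (("conservative", 1.64), ("aggressive", 1.54)):
        ratio = static_energy_ratio(power_ratio, LATENCY_REDUCTION)
        print(f"{label}: about {100 * (ratio - 1):.0f}% more static energy")
```

This simple product gives an energy increase in the same range as the reported 17% and 11%; the remaining gap may stem from rounding or from details that the approximation ignores.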

#### 8. CONCLUSIONS

This paper presents an accurate design of NIs for WRONoCs, captures the effect on the most important network-quality metrics, and sets the scene for further improvements of comparative ONoC analysis. Regarding latency, the ONoC is always faster than its electronic counterpart even considering the NI, thus preserving the primary goal of a WRONoC.

The behaviour under contention depends mainly on the



Figure 10: Static power and energy for 32-bit flits with 3-bit parallelism and 64-bit flits with 6-bit parallelism.

available bandwidth of the interconnect technologies under test. For the WRONoC, such bandwidth can be modulated by tuning the bit parallelism, and adjusting buffer size to flow control requirements for maximum throughput operation. Similar tuning knobs do exist for ENoCs, namely flit width and buffer sizes. Therefore, the ultimate question is whether such tuning knobs are energy efficient in comparative terms, which depends on the sensitivity of system performance to such knobs for the application at hand. This is left for future work.

When we consider power figures, we note that while switches are the main contributors in ENoCs, the NI has the largest share in ONoCs. For static power with conservative optical technology parameters, this contribution is of the same order of magnitude as that from laser sources. However, by improving the optical technology, the role of the NI becomes dominant, thus making it the main target for future optimizations. Finally, the ONoC preserves its superior dynamic power properties over its ENoC counterpart, even in the presence of the NI.

This paper shows that the NI architecture should not be overlooked for realistic ONoC assessments, and provides new insights not offered by earlier photonic network evaluations. The most important one is that NI optimizations may deserve higher priority than the relentless search for ultra-low-loss optical devices.

#### 9. ACKNOWLEDGMENTS

This work was supported in part by grants TIN2010-21291-C02-01 (Spanish Government, European ERDF), IT FIRB Photonica project (RBFR08LE6V), and HiPEAC-3 NoE (European FP7/ICT 217068).

#### 10. REFERENCES

- [1] C. Batten, A. Joshi, V. Stojanovic, and K. Asanovic. Designing chip-level nanophotonic interconnection networks. *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, pages 137–153, 2012.
- [2] S. Beamer, C. Sun, Y.-J. Kwon, A. Joshi, C. Batten, V. Stojanovic, and K. Asanovic. Re-architecting dram memory systems with monolithically integrated silicon photonics. In *ISCA*, pages 129–140. ACM, 2010.
- [3] M. Biere, L. Gheorghe, G. Nicolescu, I. O'Connor, and G. Wainer. Towards the high-level design of optical networks-on-chip. formalization of opto-electrical interfaces. In *Int. Conf. on Electronics, Circuits and Systems.*, pages 427–430, 2007.
- [4] J. Chan, G. Hendry, A. Biberman, and K. Bergman. Architectural design exploration of chip-scale photonic interconnection networks using physical-layer analysis. In Optical Fiber Communication (OFC), collocated National Fiber Optic Engineers Conference, 2010 Conference on (OFC/NFOEC), pages 1–3, 2010.
- [5] J. Chan, G. Hendry, A. Biberman, and K. Bergman. Architectural exploration of chip-scale photonic interconnection network designs using physical-layer analysis. *Journal of Lightwave Technology*, pages 1305–1315, 2010.
- [6] J. Chan, G. Hendry, A. Biberman, K. Bergman, and L. P. Carloni. PhoenixSim: A simulator for physical-layer analysis of chip-scale photonic interconnection networks. In *Procs. of the Conference on Design, Automation and Test in Europe*, DATE '10, pages 691–696, 3001 Leuven, Belgium, 2010. European Design and Automation Association.
- [7] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
- [8] C. G. et al. A cmos-compatible silicon photonic platform for high-speed integrated opto-electronics. *Proc. Integrated Photonics: Materials, Devices, and Applications*, 2013.
- [9] A. Hansson, K. Goossens, and A. Rădulescu. Avoiding message-dependent deadlock in network-based systems on chip. *VLSI Design*, 2007.
- [10] G. Hendry, J. Chan, S. Kamil, L. Oliker, J. Shalf, L. Carloni, and K. Bergman. Silicon nanophotonic network-on-chip using TDM arbitration. In *Annual Symp. on High Performance Interconnects (HOTI)*, pages 88–95, 2010.
- [11] A. Joshi, C. Batten, Y.-J. Kwon, S. Beamer, I. Shamim, K. Asanovic, and V. Stojanovic. Silicon-photonic clos networks for global on-chip communication. In *Int. Symp on Networks-on-Chip*, pages 124–133, 2009.
- [12] P. Kapur and K. C. Saraswat. Optical interconnects for future high performance integrated circuits. *Physica E: Low-dimensional Systems and Nanostructures*, pages 620–627, 2003.

- [13] S. Koohi, M. Abdollahi, and S. Hessabi. All-optical wavelength-routed NoC based on a novel hierarchical topology. In *Int. Symp. on Networks on Chip (NoCS)*, pages 97–104, 2011.
- [14] G. Kurian, C. Sun, C.-H. Chen, J. Miller, J. Michel, L. Wei, D. Antoniadis, L.-S. Peh, L. Kimerling, V. Stojanovic, and A. Agarwal. Cross-layer energy and performance evaluation of a nanophotonic manycore processor system using real application workloads. In Int. Parallel Distributed Processing Symposium (IPDPS), pages 1117–1130, 2012.
- [15] S. Le Beux, J. Trajkovic, I. O'Connor, G. Nicolescu, G. Bois, and P. Paulin. Optical ring network-on-chip (ORNoC): Architecture and design methodology. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 1–6, 2011.
- [16] J. Leu and V. Stojanovic. Injection-locked clock receiver for monolithic optical link in 45nm soi. In Solid State Circuits Conference (A-SSCC), 2011 IEEE Asian, pages 149–152, 2011.
- [17] I. O'Connor et al. Towards reconfigurable optical networks-on-chip. RECO SoC, pages 121–128, 2005.
- [18] Y. Pan, J. Kim, and G. Memik. Flexishare: Channel sharing for an energy-efficient nanophotonic crossbar. In *Int. Symp. on High Performance Computer Architecture*, pages 1–12, 2010.
- [19] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary. Firefly: Illuminating future network-on-chip with nanophotonics. In *Procs. of the Int. Symp. on Computer Architecture*, ISCA '09, pages 429–440, New York, NY, USA, 2009. ACM.
- [20] L. Ramini, P. Grani, H. T. Fanken, A. Ghiribaldi, S. Bartolini, and D. Bertozzi. Assessing the energy break-even point between an optical noc architecture and an aggressive electronic baseline. In Procs. of the Conference on Design, Automation and Test in Europe. (To be published), 2014.
- [21] S. Stergiou, F. Angiolini, S. Carta, L. Raffo, D. Bertozzi, and G. De Micheli. xpipes lite: a synthesis oriented design library for networks on chips. In *Design, Automation and Test in Europe*, pages 1188–1193 Vol. 2, 2005.
- [22] A. Strano, D. Ludovici, and D. Bertozzi. A library of dual-clock fifos for cost-effective and flexible mpsoc design. In *Int. Conf. on Embedded Computer Systems* (SAMOS), pages 20–27, 2010.
- [23] Tilera Corporation. Tile-Gx8016 specification. http://www.tilera.com/sites/default/files/productbriefs/Tile-Gx-8016-SB011-03.pdf.
- [24] D. Vantrease, N. Binkert, R. Schreiber, and M. Lipasti. Light speed arbitration and flow control for nanophotonic interconnects. In *Microarchitecture*, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 304–315, 2009.