











| Intel Stratix 10                   | Virtex Ultrascale                               | NVIDIA GPU                                        |
|------------------------------------|-------------------------------------------------|---------------------------------------------------|
| 14 nm Intel Tri-Gate               | 16 nm FinFET                                    | 12 nm                                             |
| 1 GHz<br>10 TF single precision    | ~600 MHz<br>6,840 DSPs (3.1 TF single<br>prec.) | 1455 MHz<br>5,120 cores (15.7 TF single<br>prec.) |
| 5.5M Logic Elements                | 2.5M Logic Elements                             | CUDA programming                                  |
| 4-input LUT, register, carry, etc. | 1,182,000 5-input LUTs                          | On-chip memory:                                   |
| Block RAM: 28.6 MiB                | 2,364,000 FFs                                   | Registers: 20.8 MiB                               |
| Hardened DRAM controller<br>DDR4   | Block RAM: 9.1 MiB                              | L1/SM: 7.7 MiB                                    |
| Various options for memory         |                                                 | L2 Cache: 6.1 MiB                                 |
| Hyper Flex Interconnect with Regs. |                                                 |                                                   |
| TDP: 125 W (estimated)             | TDP: 95 W<br>(Amazon F1 power limit)            | TDP: 300 W                                        |





































## Dependences

- Scalar variables
  - True dependence (write, then read)
    - `A =`
    - `= A`
  - Anti dependence (read, then write)
    - `= A`
    - `A =`
  - Output dependence (write, then write)
    - `A =`
    - `A =`
- Loop variables
  - `for i = 2, 5`: `a[i] = a[i] + 3`

| Iteration | Read | Write |
|-----------|------|-------|
| i = 2     | a[2] | a[2]  |
| i = 3     | a[3] | a[3]  |
| i = 4     | a[4] | a[4]  |
| i = 5     | a[5] | a[5]  |
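The three scalar dependence types and the loop example can be sketched in C++. This is only an illustration: the function names and the concrete statements are hypothetical and chosen to make each dependence visible. Since iteration i of the loop reads and writes only `a[i]`, running the loop forward or backward should give the same result.

```cpp
#include <array>
#include <cassert>

// Scalar dependences, annotated on a hypothetical straight-line snippet.
int scalar_dependences(int b, int c) {
    int a = b + 1;  // S1: write a
    int d = a * 2;  // S2: read a  -> true dependence S1 -> S2 (read after write)
    a = c;          // S3: write a -> anti dependence S2 -> S3 (write after read)
                    //                output dependence S1 -> S3 (write after write)
    return d + a;
}

// Loop from the slide: for i = 2, 5: a[i] = a[i] + 3.
// Iteration i touches only a[i], so no dependence crosses iterations.
std::array<int, 6> add3_forward(std::array<int, 6> a) {
    for (int i = 2; i <= 5; ++i) a[i] = a[i] + 3;
    return a;
}

// Same loop body with the iteration order reversed.
std::array<int, 6> add3_reverse(std::array<int, 6> a) {
    for (int i = 5; i >= 2; --i) a[i] = a[i] + 3;
    return a;
}
```

Reordering S1, S2, and S3 in `scalar_dependences` would change the result, whereas the two loop variants agree because their iterations are independent.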



## Loop Carried Dependence

- There exists a dependence from statement S1 to S2 in a common nest of loops iff there exist two iteration vectors i and j such that
  - i < j or i = j and there is a path from S1 to S2 in the body of the loop
  - S1 accesses memory location M on iteration i and S2 accesses M on iteration j
  - one of these accesses is a write
- Loop-carried dependence
  - Statement $S_2$ has a loop-carried dependence on statement $S_1$ if and only if $S_1$ references location M on iteration i, $S_2$ references M on iteration j, and i < j (the dependence crosses iterations)

Example loop:

- `for i = 2, 5`
  - $S_1$: `a[i+1] = f[i] + 3`
  - $S_2$: `f[i+1] = a[i]`
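The definition can be checked by brute force on the example loop. The sketch below is an assumption-laden illustration, not a real dependence analyzer: the `Access` struct and all function names are hypothetical, and the access lists are written out by hand from the two statements. It enumerates iteration pairs i < j and reports the smallest distance at which two statements touch the same location with at least one write.

```cpp
#include <cassert>
#include <vector>

// One memory access: which array, which index, and whether it is a write.
struct Access {
    char array;  // 'a' or 'f'
    int index;
    bool is_write;
};

// Accesses of S1: a[i+1] = f[i] + 3 on iteration i.
std::vector<Access> s1_accesses(int i) {
    return {{'a', i + 1, true}, {'f', i, false}};
}

// Accesses of S2: f[i+1] = a[i] on iteration i.
std::vector<Access> s2_accesses(int i) {
    return {{'f', i + 1, true}, {'a', i, false}};
}

// For contrast: a[i] = a[i] + 3 reads and writes only a[i].
std::vector<Access> add3_accesses(int i) {
    return {{'a', i, true}, {'a', i, false}};
}

using AccessFn = std::vector<Access> (*)(int);

// Smallest distance j - i > 0 such that `src` on iteration i and `dst` on
// iteration j touch the same location and at least one access is a write.
// Returns 0 if no such pair exists for iterations in [lo, hi].
int carried_distance(AccessFn src, AccessFn dst, int lo, int hi) {
    int best = 0;
    for (int i = lo; i <= hi; ++i)
        for (int j = i + 1; j <= hi; ++j)
            for (const Access& x : src(i))
                for (const Access& y : dst(j))
                    if (x.array == y.array && x.index == y.index &&
                        (x.is_write || y.is_write) &&
                        (best == 0 || j - i < best))
                        best = j - i;
    return best;
}
```

For the example loop, the search finds a dependence of distance 1 in both directions: $S_1$ writes `a[i+1]`, which $S_2$ reads one iteration later, and $S_2$ writes `f[i+1]`, which $S_1$ reads one iteration later. The loop `a[i] = a[i] + 3` has no loop-carried dependence.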





## Type Demotion

- Demote data types to less expensive alternatives
  - Must meet precision requirements
  - Reduces resource and energy consumption, bandwidth requirements, and operation latency
- Use fewer resources
  - Compute bound
    - Floating point to fixed point
    - Use native data types (16 bit for Xilinx)
  - Bandwidth bound
    - Performance improves by the same factor by which the size of the data type is reduced
  - Latency bound
    - Floating-point ops $\rightarrow$ multiple cycles; integer ops $\rightarrow$ 1 cycle
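As a rough illustration of demoting floating point to fixed point, the sketch below uses a hypothetical Q8.8 format (16-bit signed, 8 fractional bits) built on plain integers. The format, the helper names, and the lack of saturation handling are all assumptions of this example; real HLS tools typically ship their own fixed-point types, which are not used here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical Q8.8 format: 16-bit signed value with 8 fractional bits.
constexpr int kFracBits = 8;

// Convert float to Q8.8, rounding to nearest (no saturation in this sketch).
std::int16_t to_fixed(float x) {
    return static_cast<std::int16_t>(std::lround(x * (1 << kFracBits)));
}

// Convert Q8.8 back to float.
float to_float(std::int16_t q) {
    return static_cast<float>(q) / (1 << kFracBits);
}

// Fixed-point multiply: widen to 32 bits, multiply, shift back to Q8.8.
// Note: right-shifting a negative value is implementation-defined before
// C++20 (arithmetic shift on common compilers).
std::int16_t fixed_mul(std::int16_t a, std::int16_t b) {
    std::int32_t wide = static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    return static_cast<std::int16_t>(wide >> kFracBits);
}

// Fixed-point add is a plain integer add.
std::int16_t fixed_add(std::int16_t a, std::int16_t b) {
    return static_cast<std::int16_t>(a + b);
}
```

Here `fixed_mul(to_fixed(1.5f), to_fixed(2.25f))` represents 3.375 exactly, and each value occupies 2 bytes instead of 4, which is the factor-of-two reduction that matters in the bandwidth-bound case. Values that need more than 8 integer or 8 fractional bits would lose precision, which is why the precision requirement has to be checked first.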

## Software Transformations in HLS

| CPU transformation                                                  | In HLS                                                                                                                                                                               |
|---------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Loop interchange [2, 36]                                            | Used to resolve loop-carried dependencies throughout Section 2.                                                                                                                      |
| Strip-mining [77]<br>Loop tiling [36, 40]<br>Cycle shrinking [56]   | Central component of many HLS transformations, including accumulation interleaving (Section 2.2), vectorization (Section 3.1), replication (Section 3.2), and tiling (Section 3.4). |
| Loop distribution/fission [35, 36]                                  | Useful for separating differently scheduled computations to allow pipelining (see Section 3.3).                                                                                      |
| Loop fusion [36, 79, 83]                                            | Used for merging pipelines (see Section 2.7).                                                                                                                                        |
| Loop unrolling [18]                                                 | Essential tool for scaling up performance by generating more computational hardware (Sections 3.1 and 3.2).                                                                          |
| Software pipelining [39]                                            | Used by the HLS tool to schedule loop bodies according to the interdependencies of operations.                                                                                       |
| Loop coalescing/flattening [55]<br>Loop collapsing                  | Used to save pipeline drains in nested loops (Section 2.6).                                                                                                                          |
| Reduction recognition                                               | Prevent loop-carried dependencies in accumulation codes (Sections 2.1 and 2.3).                                                                                                      |
| Loop idiom recognition                                              | Relevant for HLS backends, for example to recognize sliding window buffers (Section 2.5) in Intel OpenCL [72].                                                                       |
| Procedure inlining                                                  | Required to pipeline code sections with function calls (Section 2.4).                                                                                                                |
| Procedure cloning                                                   | Every occurrence of a function is always specialized to all variables that can be statically inferred.                                                                               |
| Loop unswitching [17]                                               | Often the <b>opposite</b> is beneficial (see Sections 2.6 and 2.7).                                                                                                                  |
| Loop peeling                                                        | Often the opposite is beneficial to allow coalescing (Section 2.6).                                                                                                                  |
| Graph partitioning                                                  | Streaming is central to hardware algorithms (Section 3.3).                                                                                                                           |
| SIMD transformations                                                | Covered in Section 3.1.                                                                                                                                                              |
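To make one row of the table concrete, the sketch below applies strip-mining to a simple accumulation. It is plain C++ under stated assumptions, not code from any HLS tool: the strip width `W` and the function names are hypothetical, and no tool-specific pragmas are shown. The inner loop has a fixed trip count, which makes it a candidate for unrolling, and each lane keeps its own partial sum so that consecutive additions no longer wait on a single accumulator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: every iteration depends on the previous value of `acc`
// (a loop-carried dependence on the accumulator).
long sum_baseline(const std::vector<int>& x) {
    long acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i) acc += x[i];
    return acc;
}

// Hypothetical strip width; a real design would pick it from the
// latency of the accumulation and the available resources.
constexpr std::size_t W = 4;

// Strip-mined version: the outer loop walks strips of W elements, the inner
// loop has a fixed trip count W and updates one partial sum per lane.
long sum_strip_mined(const std::vector<int>& x) {
    long partial[W] = {};
    const std::size_t full = x.size() / W * W;  // elements in complete strips
    for (std::size_t base = 0; base < full; base += W)
        for (std::size_t lane = 0; lane < W; ++lane)
            partial[lane] += x[base + lane];

    long acc = 0;
    for (std::size_t lane = 0; lane < W; ++lane) acc += partial[lane];
    for (std::size_t i = full; i < x.size(); ++i) acc += x[i];  // remainder
    return acc;
}
```

With integer data both versions return the same value. With floating-point data the reordering of additions can change rounding, so the transformed result may differ slightly from the baseline.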



