# GPU Programming 2019/2020: Why parallelism?

### More efficient programs!



[1] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, sixth edition. Morgan Kaufmann, 2017.


G. E. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, p. 114 ff., 1965.




"The La-Z-Boy programmer era of relying on hardware designers to make their programs go faster without lifting a finger is officially over."

Hennessy & Patterson [2017]






https://upload.wikimedia.org/wikipedia/commons/0/00/Transistor\_Count\_and\_Moore%27s\_Law\_-\_2011.svg












# What makes parallel programming difficult?


- Many algorithms do not lend themselves naturally to a parallel implementation
- Parallel execution leads to nondeterminism

#### Parallel programming?

"[Serial] algorithms have improved faster than clock over the last 15 years. [Parallel] computers are unlikely to be able to take advantage of these advances because they require new programs and new algorithms."

Gordon Bell (1992)

G. Bell, "Massively parallel computers: why not parallel computers for the masses?," in The Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992, pp. 292–297.

## Why parallel: the hardware side

$$P = C \cdot V^2 \cdot f$$

R. Gonzalez, B. M. Gordon, and M. A. Horowitz, "Supply and threshold voltage scaling for low power CMOS," IEEE J. Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, 1997.
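The power formula above can be explored with a small calculation. This is a minimal sketch, assuming the usual reading of the symbols (C as switched capacitance, V as supply voltage, f as clock frequency); the baseline numbers are hypothetical and chosen only for illustration. It also assumes that the voltage can be lowered together with the frequency, which real hardware allows only within limits.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Dynamic power P = C * V^2 * f, in watts."""
    return capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical baseline values (assumptions, not measurements).
BASE_C = 1e-9     # switched capacitance in farads
BASE_V = 1.2      # supply voltage in volts
BASE_F = 3.0e9    # clock frequency in hertz

baseline = dynamic_power(BASE_C, BASE_V, BASE_F)

# Lower voltage and frequency together to 80% of the baseline.
scaled = dynamic_power(BASE_C, 0.8 * BASE_V, 0.8 * BASE_F)

# Power falls with the cube of the scaling factor: 0.8^3 = 0.512.
power_ratio = scaled / baseline

# Two such slower cores draw about as much power as one fast core,
# but offer up to 2 * 0.8 = 1.6 times the throughput if the work parallelizes.
two_core_ratio = 2 * power_ratio

print(f"baseline: {baseline:.2f} W, scaled: {scaled:.2f} W")
print(f"one slow core: {power_ratio:.3f}x power, two slow cores: {two_core_ratio:.3f}x power")
```

Under these assumptions, the cubic dependence on the scaling factor is what makes several slower cores attractive compared with one faster core.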






https://forums.anandtech.com/threads/power-consumption-scaling-with-clockspeed-and-vcc-for-the-i7-2600k.2195927/





After: http://research.ac.upc.edu/HPCseminar/SEM9900/Pollack1.pdf




$$E = P \cdot t$$

with energy E, power P, and time t.

#### Energy is critical:

- Handheld: major factor for customer satisfaction
- Warehouse scale computing: major cost factor


... and to keep our planet alive.

# How to get around the heat limit?

(or be as energy efficient as possible)

Two ways around the processor heat limit:

- specialize
  - field-programmable gate array
  - neuromorphic computing
  - tensor processing unit
- parallelize
  - multi-core
  - instruction level
  - data-parallel



Parallelism to avoid the heat limit (and increase energy efficiency)

https://upload.wikimedia.org/wikipedia/commons/0/00/Transistor\_Count\_and\_Moore%27s\_Law\_-\_2011.svg

|                 | Nvidia Fermi (2010) | Nvidia Kepler (2012) |
|-----------------|---------------------|----------------------|
| Clock frequency | 1.3 GHz             | 1.0 GHz              |
| Power           | 250 W               | 195 W                |
| FP throughput   | 665 GFLOPS          | 1310 GFLOPS          |

https://wiki.rice.edu/confluence/download/attachments/4435861/comp322-s16-lec1-slides.pdf
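The Fermi and Kepler figures above can be turned into an energy-efficiency comparison. The sketch below simply divides floating-point throughput by power; the variable names are illustrative, and the numbers are the ones from the table.

```python
# Values from the Fermi/Kepler comparison table.
GPUS = {
    "Fermi (2010)": {"clock_ghz": 1.3, "power_w": 250, "gflops": 665},
    "Kepler (2012)": {"clock_ghz": 1.0, "power_w": 195, "gflops": 1310},
}

def gflops_per_watt(gpu):
    """Floating-point throughput per watt of power."""
    return gpu["gflops"] / gpu["power_w"]

fermi_eff = gflops_per_watt(GPUS["Fermi (2010)"])
kepler_eff = gflops_per_watt(GPUS["Kepler (2012)"])
improvement = kepler_eff / fermi_eff

for name, gpu in GPUS.items():
    print(f"{name}: {gflops_per_watt(gpu):.2f} GFLOPS/W at {gpu['clock_ghz']} GHz")
print(f"efficiency improvement: {improvement:.2f}x")
```

The newer chip runs at a lower clock frequency and lower power, yet delivers roughly two and a half times the throughput per watt, which is consistent with the idea of spending transistors on parallelism rather than on clock speed.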



https://en.wikipedia.org/wiki/Von\_Neumann\_architecture

Von Neumann architecture






Latency:

- 1990: 6 and 8 cycles
- 2010: up to 180 cycles


| Year | CPU frequency | Memory latency | Memory bandwidth |
|------|---------------|----------------|------------------|
| 2000 | 1 GHz         | 20 ns          | 100 MT/s         |
| 2003 | 2 GHz         | 15 ns          | 333 MT/s         |
| 2007 | 4.5 GHz       | 10 ns          | 800 MT/s         |

Data: https://en.wikipedia.org/wiki/CAS\_latency, http://www.intel.com/pressroom/kits/quickreffam.htm
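To see why falling latency in nanoseconds does not help the processor, the table values can be converted into clock cycles spent waiting. This is a small sketch using only the numbers from the table; it treats the frequency column as the processor clock, which is an assumption based on the cited data sources.

```python
# (year, frequency in GHz, latency in ns) from the table above.
ROWS = [
    (2000, 1.0, 20),
    (2003, 2.0, 15),
    (2007, 4.5, 10),
]

def latency_in_cycles(frequency_ghz, latency_ns):
    """Cycles elapsed during one access: GHz * ns = cycles."""
    return frequency_ghz * latency_ns

cycles_by_year = {year: latency_in_cycles(ghz, ns) for year, ghz, ns in ROWS}

for year, cycles in cycles_by_year.items():
    print(f"{year}: {cycles:.0f} cycles per access")
```

Although the latency halves in absolute terms, the number of cycles lost per access more than doubles, because the clock frequency grows faster than the latency shrinks.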




# How to get around the von Neumann bottleneck?

#### Caching:




#### Pipelining:



[Figure: pipelining diagram with compute and memory phases]

Instruction-level parallelism: exploit independence at the assembly level

- Pipelining (memory access and computation)
- Different arithmetic units (different computations)

| Functional unit                     | Latency | Initiation interval |
|-------------------------------------|---------|---------------------|
| Integer ALU                         | 0       | 1                   |
| Data memory (integer and FP loads)  | 1       | 1                   |
| FP add                              | 3       | 1                   |
| FP multiply (also integer multiply) | 6       | 1                   |
| FP divide (also integer divide)     | 24      | 25                  |

J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, sixth edition. Morgan Kaufmann, 2017, p. C-53
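The latency and initiation interval columns above can be combined into a rough cycle count for a batch of independent operations. This is a simplified sketch under stated assumptions: both columns are taken to be in clock cycles, latency is read as the number of cycles that must pass before a dependent instruction can use the result, and the initiation interval as the number of cycles between issuing two operations of the same type. Hazards and issue limits are ignored, and the function name is illustrative.

```python
# (latency, initiation interval) from the functional unit table.
FUNCTIONAL_UNITS = {
    "Integer ALU": (0, 1),
    "Data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}

def cycles_for_independent_ops(unit, count):
    """Cycles until the last of `count` independent operations has its result.

    A new operation starts every `interval` cycles; the last one needs
    latency + 1 cycles from issue until its result is usable (assumed model).
    """
    latency, interval = FUNCTIONAL_UNITS[unit]
    return (count - 1) * interval + latency + 1

fp_mul_cycles = cycles_for_independent_ops("FP multiply", 4)   # 3 * 1 + 7
fp_div_cycles = cycles_for_independent_ops("FP divide", 4)     # 3 * 25 + 25

print(f"4 FP multiplies: {fp_mul_cycles} cycles (pipelined)")
print(f"4 FP divides:    {fp_div_cycles} cycles (not pipelined)")
```

In this model the pipelined multiplier overlaps the four operations, while the divider, whose initiation interval is as long as a full operation, processes them one after the other.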




Out-of-order execution:

#### Instruction stream A

- `add a, b, c`
- `mul d, b, e`
- `mul f, a, e`
- `add a, d, g`
- `fmul h, a, f`


Schaa D. R. Kaeli, P. I with OpenCL ;


The processor extracts parallelism from the instruction stream
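How much parallelism instruction stream A contains can be estimated with a small dependency analysis. The sketch below is a simplification, not a model of a real out-of-order core: it assumes that the first operand of each instruction is the destination register, that every instruction takes one step, and that only true (read-after-write) dependencies matter, as if register renaming had removed the others. The function name `schedule_levels` is hypothetical.

```python
# Instruction stream A as (opcode, destination, source, source).
# Destination-first operand order is an assumption.
STREAM_A = [
    ("add", "a", "b", "c"),
    ("mul", "d", "b", "e"),
    ("mul", "f", "a", "e"),
    ("add", "a", "d", "g"),
    ("fmul", "h", "a", "f"),
]

def schedule_levels(stream):
    """Return the earliest step in which each instruction can execute."""
    last_writer = {}   # register -> index of the latest instruction writing it
    levels = []
    for index, (_, dest, *sources) in enumerate(stream):
        # An instruction must wait for the producers of its source registers.
        producers = [last_writer[s] for s in sources if s in last_writer]
        level = 1 + max((levels[p] for p in producers), default=-1)
        levels.append(level)
        last_writer[dest] = index
    return levels

levels = schedule_levels(STREAM_A)
steps = max(levels) + 1

for (opcode, dest, *sources), level in zip(STREAM_A, levels):
    print(f"step {level}: {opcode} {dest}, {', '.join(sources)}")
print(f"{len(STREAM_A)} instructions in {steps} steps")
```

Under these assumptions the first two instructions are independent and can run together, as can the third and fourth, so the five instructions fit into three steps instead of five.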



Very long instruction word processors:




The compiler extracts parallelism from the program code



Instruction-level parallelism: exploit independence at the assembly level

- Pipelining
- Different arithmetic units

=> Exploited since the 1980s and now standard, but no longer yields significant improvements

#### Further reading

- J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, fourth edition. Morgan Kaufmann, 2007.
- http://cva.stanford.edu/classes/cs99s/
- http://research.ac.upc.edu/HPCseminar/SEM9900/Pollack1.pdf
- http://groups.csail.mit.edu/cag/raw/documents/Waingold-Computer-1997.pdf
- http://cacm.acm.org/magazines/2009/5/24648-spending-moores-dividend/fulltext