matrix multiplication: fast with GPU
cannot just concatenate linear units → equivalent to one big matrix...
⇒ add non-linear units in between
rectified linear units (ReLU): max(0, x)
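
A minimal sketch of why stacked linear units collapse, assuming tiny hand-picked matrices (X, W1, W2 are illustrative values, not from the notes) and a hypothetical pure-Python matmul helper:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    """Apply max(0, x) element-wise."""
    return [[max(0.0, v) for v in row] for row in M]

# Illustrative values (assumptions, chosen so the arithmetic is exact).
X = [[1.0, -2.0]]                  # one input row vector
W1 = [[1.0, -1.0], [0.5, 2.0]]     # first linear unit
W2 = [[2.0], [1.0]]                # second linear unit

# Two linear units in a row...
two_layers = matmul(matmul(X, W1), W2)
# ...give the same result as one unit with the combined matrix W1 W2.
one_layer = matmul(X, matmul(W1, W2))

# With a ReLU in between, the result differs for this input,
# so the two units no longer reduce to a single matrix.
with_relu = matmul(relu(matmul(X, W1)), W2)

print(two_layers == one_layer)   # the linear stack collapses
print(with_relu == one_layer)    # the non-linear stack does not
```

For real workloads the same products would be done with a GPU-backed array library rather than Python lists; the lists here only keep the sketch dependency-free.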
chain rule: computationally efficient
easy to compute the gradient as long as the function Y(X) is made of simple blocks ...
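
A minimal sketch of the chain rule over simple blocks, assuming a scalar function y = relu(w*x + b). The block names (scale, shift, relu) and the values of w and b are hypothetical, chosen only for illustration. Each block returns its output plus a function that maps the gradient arriving from the output side to the gradient with respect to its input:

```python
def scale(x, w):
    # d(w*x)/dx = w
    return w * x, lambda g: g * w

def shift(x, b):
    # d(x+b)/dx = 1
    return x + b, lambda g: g

def relu(x):
    # d(max(0,x))/dx = 1 if x > 0 else 0
    return max(0.0, x), lambda g: g if x > 0 else 0.0

def forward_backward(x, w=3.0, b=-1.0):
    # Forward pass: apply the blocks in order, keeping each backward function.
    h1, back1 = scale(x, w)
    h2, back2 = shift(h1, b)
    y, back3 = relu(h2)
    # Backward pass (chain rule): start from dy/dy = 1 and apply the
    # local derivatives from the last block back to the first.
    dy_dx = back1(back2(back3(1.0)))
    return y, dy_dx

y, grad = forward_backward(2.0)

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (forward_backward(2.0 + eps)[0]
           - forward_backward(2.0 - eps)[0]) / (2 * eps)
print(y, grad, numeric)
```

The backward pass reuses one small multiplication per block, which is why the gradient costs about as much as the forward pass.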