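💡 Deep learning models are highly flexible and have large capacity, so overfitting can become a problem when the train set is not large enough. An overfitted model works well on the train set but doesn't generalize to new examples it has never seen. This post looks at regularization methods that address this problem.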

Regularization will help you reduce overfitting.
Regularization will drive your weights to lower values.
L2 regularization and Dropout are two very effective regularization techniques.

Configuration, packages

# import packages
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import scipy.io
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
from testCases import *
from public_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

Problem Statement

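Before modeling, let's define the problem.

We want to recommend positions where France's goalkeeper should kick the ball so that the French team's players can head it. When the goalkeeper kicks the ball into the air, players from both teams fight to hit it with their heads.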

Objective

Find the most effective positions on the field for the goalkeeper to kick the ball.

Data

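A 2D dataset from the team's past 10 games.
Each dot is a position on the field where a player headed the ball after the goalkeeper kicked it from the left side of the field.
- Blue dot: a French player headed the ball
- Red dot: a player from the other team headed the ball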

# Load dataset
train_X, train_Y, test_X, test_Y = load_2D_dataset()

For this problem, we look at three models: a non-regularized model, a model with L2 regularization, and a model with dropout.

Non-Regularized Model

First, check the performance of the non-regularized NN model (baseline model), which is already implemented.

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):

    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert (lambd == 0 or keep_prob == 1)   # it is possible to use both L2 regularization and dropout, 
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

# decision boundary
plt.title("Model without regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)


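The train accuracy is 94.8% while the test accuracy is 91.5%, so the model performs worse on the test set, and the decision boundary also shows that it has overfit the training data. Let's implement regularized models that address this.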

L2 Regularization

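L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. It adds a penalty proportional to the squared values of the weights to the cost function, which drives the weights to smaller values and yields a smoother model whose output changes more slowly as the input changes.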

Cost function with L2 regularization

L2 regularization redefines the cost function.

$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$

👇 new definition of cost function

$J_{regularized} = \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)}_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}}_\text{L2 regularization cost}$
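
As a quick sanity check of the L2 term in the regularized cost, the sketch below evaluates it by hand for two small weight matrices. The values of `W1`, `W2`, `m`, and `lambd` here are made up for illustration and are not taken from the assignment.

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
W1 = np.array([[1.0, -2.0],
               [0.5,  0.0]])
W2 = np.array([[3.0, -1.0]])
m = 4          # number of training examples (assumed)
lambd = 0.1    # regularization strength (assumed)

# Sum of squared entries over all weight matrices: (1 + 4 + 0.25 + 0) + (9 + 1)
sum_sq = np.sum(np.square(W1)) + np.sum(np.square(W2))

# L2 regularization cost = (1/m) * (lambda/2) * sum of squared weights
l2_cost = (1 / m) * (lambd / 2) * sum_sq
print(sum_sq, l2_cost)
```

Doubling every weight would multiply this term by four, which is why large weights are penalized much more heavily than small ones.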

# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularization(A3, Y, parameters, lambd):

    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    # A3 : post-activation, output of forward propagation, of shape (output size, number of examples)
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

    # Cost function with L2 regularization
    L2_regularization_cost =  (1/m)*(lambd/2) *(np.sum(np.square(W1))+np.sum(np.square(W2))+np.sum(np.square(W3)))
    cost = cross_entropy_cost + L2_regularization_cost

    return cost

A3, t_Y, parameters = compute_cost_with_regularization_test_case()
cost = compute_cost_with_regularization(A3, t_Y, parameters, lambd=0.1)
print("cost = " + str(cost))

cost = 1.7864859451590758

Backward propagation with L2 regularization

Because the cost function has changed, backpropagation must change as well. The change affects only $dW1$, $dW2$, and $dW3$: add the gradient of the regularization term, $\frac{d}{dW} ( \frac{1}{2}\frac{\lambda}{m} W^2) = \frac{\lambda}{m} W$, to each of them.
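
The extra gradient term $\frac{\lambda}{m} W$ can be checked numerically. The following is a minimal sketch, assuming a single hypothetical weight matrix and made-up values for `lambd` and `m`; the helper names `l2_penalty` and `numerical_grad` are illustrative and are not part of `reg_utils`.

```python
import numpy as np

def l2_penalty(W, lambd, m):
    """L2 penalty for a single weight matrix: (lambda / (2m)) * sum(W^2)."""
    return (lambd / (2 * m)) * np.sum(np.square(W))

def numerical_grad(W, lambd, m, eps=1e-6):
    """Central finite-difference estimate of d(penalty)/dW, entry by entry."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (l2_penalty(W_plus, lambd, m) - l2_penalty(W_minus, lambd, m)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # hypothetical weight matrix
lambd, m = 0.7, 5                 # assumed values for the check

analytic = (lambd / m) * W        # the term added to dW in backpropagation
numeric = numerical_grad(W, lambd, m)
print(np.max(np.abs(analytic - numeric)))
```

The printed difference should be close to zero, since the penalty is quadratic in each weight.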

# GRADED FUNCTION: backward_propagation_with_regularization

def backward_propagation_with_regularization(X, Y, cache, lambd):

    m = X.shape[1]
    # cache output from forward_propagation()
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T) + (lambd*W3)/m
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) + (lambd*W2)/m
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T) + (lambd*W1)/m
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

t_X, t_Y, cache = backward_propagation_with_regularization_test_case()

grads = backward_propagation_with_regularization(t_X, t_Y, cache, lambd = 0.7)
print ("dW1 = \n"+ str(grads["dW1"]))
print ("dW2 = \n"+ str(grads["dW2"]))
print ("dW3 = \n"+ str(grads["dW3"]))

Model with L2 regularization

Train the model with L2 regularization $(\lambda = 0.7)$.

parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

# decision boundary
plt.title("Model with L2-regularization")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)


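The model trained with L2 regularization performs about as well on the test set as on the train set, and the decision boundary shows that the overfitting has been resolved.

$\lambda$ is a hyperparameter that you can tune using a dev set.
L2 regularization makes the decision boundary smoother, but if $\lambda$ is too large the model can be oversmoothed and end up with high bias.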

The cost computation: A regularization term is added to the cost.

The backpropagation function: There are extra terms in the gradients with respect to weight matrices.
Weights end up smaller ("weight decay"): Weights are pushed to smaller values.
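
The "weight decay" name can be made concrete with a small sketch. It assumes the plain gradient descent update used in this post, with the post's `learning_rate = 0.3` and `lambd = 0.7`; the training-set size `m` and the arrays `W` and `dW_data` are hypothetical.

```python
import numpy as np

learning_rate, lambd, m = 0.3, 0.7, 200   # m is an assumed training-set size

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))           # hypothetical weights
dW_data = rng.standard_normal((3, 2))     # hypothetical cross-entropy gradient

# Update using the regularized gradient, as in backward_propagation_with_regularization
W_reg = W - learning_rate * (dW_data + (lambd / m) * W)

# The same update written as "shrink W, then take the usual gradient step"
decay = 1 - learning_rate * lambd / m
W_decay = decay * W - learning_rate * dW_data

print(decay, np.allclose(W_reg, W_decay))
```

Because `decay` is slightly below 1, every update first shrinks the weights a little before applying the data gradient.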

Dropout regularization

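Dropout is a regularization technique specific to deep learning. At every iteration it randomly shuts down some neurons, so that each iteration trains with only a subset of the neurons in a layer. Because any neuron can be shut down at any time, the model becomes less sensitive to the activation of any one specific neuron.

Dropout is used only during training.
It must be applied in both forward and backward propagation.
During training, divide the activations of each dropout layer by keep_prob. This rescales the neurons that were not shut down so that the expected value of the activations stays the same as it would be without dropout.
For example, with keep_prob = 0.5, about half of the nodes are shut down, so the remaining activations are divided by 0.5 (that is, multiplied by 2).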
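
The effect of dividing by keep_prob on the expected value can be illustrated with a rough simulation. This is only a sketch: the activation matrix `A` is random, hypothetical data, and the array sizes are chosen just to make the averages stable.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

# Hypothetical activations: one layer with 1,000 units over 1,000 examples
A = rng.random((1000, 1000)) + 1.0

D = (rng.random(A.shape) < keep_prob).astype(int)   # 1 = keep, 0 = shut down
A_dropped = A * D                                   # roughly half the entries become 0
A_inverted = A_dropped / keep_prob                  # rescale the surviving entries

print(A.mean(), A_dropped.mean(), A_inverted.mean())
```

The mean of `A_dropped` comes out at roughly keep_prob times the mean of `A`, while the mean of `A_inverted` stays close to the original.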

Forward propagation with dropout

We apply dropout to the first and second hidden layers of the 3-layer NN model. (It is not applied to the input or output layer.)

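Step 1: Create a variable $d^{[1]}$ with the same shape as $a^{[1]}$, filled with random numbers between 0 and 1 using np.random.rand(). In the vectorized implementation, create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}]$ with the same dimensions as $A^{[1]}$.
Step 2: Set each entry of $D^{[1]}$ to 1 with probability keep_prob and to 0 otherwise. For example, with keep_prob = 0.8, each neuron in the layer is kept with probability 80% and shut down with probability 20%. This is implemented by thresholding the random matrix from Step 1, so that about 80% of the entries become 1 and about 20% become 0.
keep_prob - probability of keeping a neuron active during drop-out, scalar

For a multi-dimensional array, the comparison returns an array with the same shape as the input, and astype(int) converts the boolean result to integers:
X = (X < keep_prob).astype(int)
For a 1-dimensional array, the equivalent loop is:
for i, v in enumerate(x):
    if v < keep_prob:
        x[i] = 1
    else:  # v >= keep_prob
        x[i] = 0

Step 3: Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$, so that some neurons are shut down (set to 0).
Step 4: Divide $A^{[1]}$ by keep_prob, so that the final cost keeps the same expected value as without dropout. (inverted dropout)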
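
The two ways of thresholding described above (the vectorized comparison and the element-by-element loop) should give the same mask. The sketch below checks this on a small hypothetical 1-D array; the seed and array length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
keep_prob = 0.8
x = rng.random(10)              # hypothetical 1-D array of uniform samples in [0, 1)

# Vectorized version: compare, then cast the boolean result to integers
mask_vectorized = (x < keep_prob).astype(int)

# Loop version: same rule applied element by element
mask_loop = np.zeros(x.shape, dtype=int)
for i, v in enumerate(x):
    mask_loop[i] = 1 if v < keep_prob else 0

print(mask_vectorized)
print(np.array_equal(mask_vectorized, mask_loop))
```

The vectorized form is the one used in the implementation, since it works unchanged on arrays of any shape.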

forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

# forward propagation
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Steps 1-4 below correspond to the Steps 1-4 described above. 
    D1 = np.random.rand(A1.shape[0], A1.shape[1])       # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob).astype(int)                   # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                        # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                                 # Step 4: scale the value of neurons that haven't been shut down

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])       # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob).astype(int)                   # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                        # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                                 # Step 4: scale the value of neurons that haven't been shut down

    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

t_X, parameters = forward_propagation_with_dropout_test_case()
A3, cache = forward_propagation_with_dropout(t_X, parameters, keep_prob=0.7)
print ("A3 = " + str(A3))

A3 = [[0.36974721 0.00305176 0.04565099 0.49683389 0.36974721]]

The forward propagation output has shape (1, 5), one value per example. Next, use this output and the values stored in cache to compute backward propagation.

Backward propagation with dropout

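Backward propagation with dropout is relatively simple.

In forward propagation, some neurons were shut down by multiplying A1 by the mask $D^{[1]}$ (A1 = A1 * D1). In backward propagation, shut down the same neurons by multiplying dA1 by the same mask $D^{[1]}$.
In forward propagation, A1 was divided by keep_prob (A1 = A1 / keep_prob), so dA1 must be divided by keep_prob as well. (Mathematically, if $A^{[1]}$ is scaled by keep_prob, its derivative $dA^{[1]}$ is scaled by the same keep_prob.)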
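
The mask-and-scale rule for the gradient can be verified with finite differences. The following is a sketch on a toy scalar loss, not the network from this post: `A`, `D`, and `upstream` are hypothetical arrays, and `loss` is an illustrative stand-in for everything that follows the dropout step.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

A = rng.standard_normal((4, 3))                      # hypothetical activations
D = (rng.random(A.shape) < keep_prob).astype(int)    # dropout mask, fixed for this check
upstream = rng.standard_normal(A.shape)              # hypothetical gradient from the next layer

def loss(A_in):
    """Toy scalar loss: weighted sum of the dropped-out, rescaled activations."""
    return np.sum(upstream * (A_in * D / keep_prob))

# Analytic gradient: apply the same mask and the same scaling as the forward pass
dA = upstream * D / keep_prob

# Central finite differences, one entry at a time
eps = 1e-6
dA_numeric = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[idx] += eps
    A_minus[idx] -= eps
    dA_numeric[idx] = (loss(A_plus) - loss(A_minus)) / (2 * eps)

print(np.max(np.abs(dA - dA_numeric)))
```

Entries where the mask is 0 receive exactly zero gradient, which is why the same `D1` and `D2` stored in the cache are reused in backward propagation.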

# backward propagation of our baseline model to which we added dropout
def backward_propagation_with_dropout(X, Y, cache, keep_prob):

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dA2 = dA2 * D2               # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob        # Step 2: Scale the value of neurons that haven't been shut down

    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dA1 = dA1 * D1               # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob        # Step 2: Scale the value of neurons that haven't been shut down
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Model with Dropout

parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

# Decision boundary
plt.title("Model with dropout")
axes = plt.gca()
axes.set_xlim([-0.75,0.40])
axes.set_ylim([-0.75,0.65])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)


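The model trained with dropout performs better on the test set, which shows that the overfitting has been resolved.

A common mistake when using dropout is to apply it in both training and testing. Because dropout removes some nodes at random, it should be used only during training.
Deep learning frameworks such as tensorflow, PaddlePaddle, keras, and caffe all come with a dropout layer implementation.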
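
One way to avoid applying dropout at test time is to gate it behind an explicit flag. The sketch below is a hypothetical helper written for illustration (the name `dropout_layer` and its `training` argument are assumptions, not an API from `reg_utils` or from any of the frameworks listed above).

```python
import numpy as np

def dropout_layer(A, keep_prob, training, rng):
    """Inverted dropout on activations A; a no-op when training is False."""
    if not training or keep_prob >= 1:
        return A
    D = (rng.random(A.shape) < keep_prob).astype(int)
    return A * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((3, 4))   # hypothetical activations

train_out = dropout_layer(A, keep_prob=0.5, training=True, rng=rng)
test_out = dropout_layer(A, keep_prob=0.5, training=False, rng=rng)

print(train_out)   # entries are either 0 or 1 / keep_prob
print(test_out)    # identical to A
```

Because inverted dropout already rescales activations during training, no additional scaling is needed when the layer is switched off at test time.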

Conclusion

model	train accuracy	test accuracy
3-layer NN without regularization	95%	91.5%
3-layer NN with L2-regularization	94%	93%
3-layer NN with dropout	93%	95%

Regularization keeps the model from overfitting the train set. Train set accuracy goes down, but test set accuracy goes up, so the model has improved.

