Lec 09-2: Weight Initialization [Includes PyTorch Code]
Deep_Learning study Lec09-2
Study and summary post based on Boostcourse's '파이토치로 시작하는 딥러닝 기초' (Deep Learning Basics with PyTorch)
Goal of Study
- Learn about weight initialization.
Keyword
- Weight initialization
- Xavier / He initialization
1. Lecture Summary
RBM(Restricted Boltzmann Machine)
Restricted : no connections within a layer
Nodes within the same layer are not connected to each other; this restriction on intra-layer connections is why the model's name includes "Restricted".
Between adjacent layers, however, every node is connected to every node in the other layer (fully connected); this structure is what is called an RBM.
RBMs are known to be rarely used nowadays.
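Just to make the connectivity concrete (this is not part of the lecture code), here is a minimal sketch of an RBM's visible-to-hidden step; the layer sizes are hypothetical:

import torch

# Hypothetical sizes for illustration only: 784 visible units, 256 hidden units.
n_visible, n_hidden = 784, 256

# A single weight matrix connects the two layers (fully connected BETWEEN layers);
# there are no weights WITHIN a layer, which is the "Restricted" part.
W = torch.randn(n_visible, n_hidden) * 0.01
b_hidden = torch.zeros(n_hidden)

v = torch.rand(1, n_visible)              # a visible vector, e.g. a flattened image
p_h = torch.sigmoid(v @ W + b_hidden)     # probability that each hidden unit turns on
h = torch.bernoulli(p_h)                  # sampled binary hidden states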
Xavier Initialization
In 2010, Xavier Glorot and Yoshua Bengio analyzed how weight initialization affects a model and proposed a new initialization method.
The method is named after its proposer: Xavier initialization, also called Glorot initialization.
The point of Xavier initialization, and of He initialization covered below, is that weights should not be initialized purely at random; the initialization should instead depend on the characteristics of each layer.
Xavier Normal Initialization
The idea is to initialize the weights from a normal distribution with mean 0 and standard deviation $ Var(W) = \sqrt{\frac{2}{n_{in} + n_{out}}} $.
Xavier Uniform Initialization
Likewise, $W$ is properly initialized by drawing it uniformly as $ W \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ +\sqrt{\frac{6}{n_{in} + n_{out}}}\right) $.
Here, $ n_{in} $ is the number of inputs to the layer and $ n_{out} $ is the number of outputs.
In conclusion, weight initialization can be done with just these simple formulas, with no need for the complicated procedure of training each layer with an RBM and then fine-tuning.
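In PyTorch these formulas are already provided as built-in initializers, so a rough sketch looks like this (the layer sizes here are arbitrary):

import torch

layer = torch.nn.Linear(784, 256)

# Xavier normal: N(0, std^2) with std = sqrt(2 / (n_in + n_out))
torch.nn.init.xavier_normal_(layer.weight)

# Xavier uniform: U(-a, a) with a = sqrt(6 / (n_in + n_out))
torch.nn.init.xavier_uniform_(layer.weight)

# Either way, the resulting std should be close to sqrt(2 / (784 + 256))
print(layer.weight.std().item(), (2 / (784 + 256)) ** 0.5)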
He Initialization
He initialization can be seen as a variant of Xavier initialization, and it likewise comes in Normal and Uniform forms.
The formulas are also quite similar: they are essentially the Xavier formulas with the $ n_{out} $ term removed, i.e. a standard deviation of $ \sqrt{\frac{2}{n_{in}}} $ for the Normal form and bounds of $ \pm\sqrt{\frac{6}{n_{in}}} $ for the Uniform form.
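PyTorch exposes He initialization under the name Kaiming; a minimal sketch, again with arbitrary layer sizes:

import torch

layer = torch.nn.Linear(784, 256)

# He (Kaiming) normal: N(0, std^2) with std = sqrt(2 / n_in) when paired with ReLU
torch.nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# He (Kaiming) uniform: U(-a, a) with a = sqrt(6 / n_in)
torch.nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')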
Result With Xavier Initialization
(The conceptual part of this post ends here. Below is the code I used for training.)
mnist_nn_xavier
import torch
import torchvision.datasets as dsets
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import random
USE_CUDA = torch.cuda.is_available() # returns True if a GPU is available, otherwise False
device = torch.device("cuda" if USE_CUDA else "cpu") # use the GPU if available, otherwise the CPU
print("Training on device:", device)
# for reproducibility
random.seed(777)
torch.manual_seed(777)
if device == 'cuda':
    torch.cuda.manual_seed_all(777)
Training on device: cpu
# parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
# MNIST dataset
mnist_train = dsets.MNIST(root='E:\DeepLearningStudy',
                          train=True,
                          transform=transforms.ToTensor(),
                          download=True)
mnist_test = dsets.MNIST(root='E:\DeepLearningStudy',
                         train=False,
                         transform=transforms.ToTensor(),
                         download=True)
# dataset loader
data_loader = torch.utils.data.DataLoader(dataset=mnist_train,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          drop_last=True)
C:\Users\LeeJaeWon\anaconda3\envs\DeepLearningStudy\lib\site-packages\torchvision\datasets\mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:180.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
# nn layers
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 256, bias=True)
linear3 = torch.nn.Linear(256, 10, bias=True)
relu = torch.nn.ReLU()
# xavier initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)
Parameter containing:
tensor([[-0.0215, -0.0894, 0.0598, ..., 0.0200, 0.0203, 0.1212],
[ 0.0078, 0.1378, 0.0920, ..., 0.0975, 0.1458, -0.0302],
[ 0.1270, -0.1296, 0.1049, ..., 0.0124, 0.1173, -0.0901],
...,
[ 0.0661, -0.1025, 0.1437, ..., 0.0784, 0.0977, -0.0396],
[ 0.0430, -0.1274, -0.0134, ..., -0.0582, 0.1201, 0.1479],
[-0.1433, 0.0200, -0.0568, ..., 0.0787, 0.0428, -0.0036]],
requires_grad=True)
# model
model = torch.nn.Sequential(linear1, relu, linear2, relu, linear3).to(device)
# define cost/loss & optimizer
criterion = torch.nn.CrossEntropyLoss().to(device) # Softmax is internally computed.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for epoch in range(training_epochs):
    avg_cost = 0
    total_batch = len(data_loader)

    for X, Y in data_loader:
        # reshape input image into [batch_size by 784]
        # label is not one-hot encoded
        X = X.view(-1, 28 * 28).to(device)
        Y = Y.to(device)

        optimizer.zero_grad()
        hypothesis = model(X)
        cost = criterion(hypothesis, Y)
        cost.backward()
        optimizer.step()

        avg_cost += cost / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))

print('Learning finished')
Epoch: 0001 cost = 0.246781081
Epoch: 0002 cost = 0.092835627
Epoch: 0003 cost = 0.062099934
Epoch: 0004 cost = 0.043158282
Epoch: 0005 cost = 0.032552719
Epoch: 0006 cost = 0.024990607
Epoch: 0007 cost = 0.021460216
Epoch: 0008 cost = 0.019880820
Epoch: 0009 cost = 0.015151084
Epoch: 0010 cost = 0.015341577
Epoch: 0011 cost = 0.012235122
Epoch: 0012 cost = 0.012357231
Epoch: 0013 cost = 0.010972363
Epoch: 0014 cost = 0.008886398
Epoch: 0015 cost = 0.008259354
Learning finished
# Test the model using test sets
with torch.no_grad():
    X_test = mnist_test.test_data.view(-1, 28 * 28).float().to(device)
    Y_test = mnist_test.test_labels.to(device)

    prediction = model(X_test)
    correct_prediction = torch.argmax(prediction, 1) == Y_test
    accuracy = correct_prediction.float().mean()
    print('Accuracy:', accuracy.item())

    # Get one and predict
    r = random.randint(0, len(mnist_test) - 1)
    X_single_data = mnist_test.test_data[r:r + 1].view(-1, 28 * 28).float().to(device)
    Y_single_data = mnist_test.test_labels[r:r + 1].to(device)

    print('Label: ', Y_single_data.item())
    single_prediction = model(X_single_data)
    print('Prediction: ', torch.argmax(single_prediction, 1).item())

    plt.imshow(mnist_test.test_data[r:r + 1].view(28, 28), cmap='Greys', interpolation='nearest')
    plt.show()
Accuracy: 0.9807999730110168
Label: 9
Prediction: 9
C:\Users\LeeJaeWon\anaconda3\envs\DeepLearningStudy\lib\site-packages\torchvision\datasets\mnist.py:67: UserWarning: test_data has been renamed data
warnings.warn("test_data has been renamed data")
C:\Users\LeeJaeWon\anaconda3\envs\DeepLearningStudy\lib\site-packages\torchvision\datasets\mnist.py:57: UserWarning: test_labels has been renamed targets
warnings.warn("test_labels has been renamed targets")
Compared with the earlier basic MNIST run, where the weights were initialized from a normal distribution, the accuracy seems noticeably higher, both subjectively and in the numbers.
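For reference, assuming that earlier baseline simply used torch.nn.init.normal_ with its default parameters, swapping these lines in for the Xavier calls in the listing above reproduces that setup:

# Hypothetical baseline: plain normal-distribution initialization
# (mean 0, std 1 by default), in place of the xavier_uniform_ calls above.
torch.nn.init.normal_(linear1.weight)
torch.nn.init.normal_(linear2.weight)
torch.nn.init.normal_(linear3.weight)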
mnist_nn_deep
Let's train a somewhat deeper model.
import torch
import torchvision.datasets as dsets
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import random
USE_CUDA = torch.cuda.is_available() # returns True if a GPU is available, otherwise False
device = torch.device("cuda" if USE_CUDA else "cpu") # use the GPU if available, otherwise the CPU
print("Training on device:", device)
# for reproducibility
random.seed(777)
torch.manual_seed(777)
if device == 'cuda':
    torch.cuda.manual_seed_all(777)
Training on device: cpu
# parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
# MNIST dataset
mnist_train = dsets.MNIST(root='E:\DeepLearningStudy',
                          train=True,
                          transform=transforms.ToTensor(),
                          download=True)
mnist_test = dsets.MNIST(root='E:\DeepLearningStudy',
                         train=False,
                         transform=transforms.ToTensor(),
                         download=True)
# dataset loader
data_loader = torch.utils.data.DataLoader(dataset=mnist_train,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          drop_last=True)
# nn layers
linear1 = torch.nn.Linear(784, 512, bias=True)
linear2 = torch.nn.Linear(512, 512, bias=True)
linear3 = torch.nn.Linear(512, 512, bias=True)
linear4 = torch.nn.Linear(512, 512, bias=True)
linear5 = torch.nn.Linear(512, 10, bias=True)
relu = torch.nn.ReLU()
# xavier initialization
torch.nn.init.xavier_uniform_(linear1.weight)
torch.nn.init.xavier_uniform_(linear2.weight)
torch.nn.init.xavier_uniform_(linear3.weight)
torch.nn.init.xavier_uniform_(linear4.weight)
torch.nn.init.xavier_uniform_(linear5.weight)
Parameter containing:
tensor([[-0.0565, 0.0423, -0.0155, ..., 0.1012, 0.0459, -0.0191],
[ 0.0772, 0.0452, -0.0638, ..., 0.0476, -0.0638, 0.0528],
[ 0.0311, -0.1023, -0.0701, ..., 0.0412, -0.1004, 0.0738],
...,
[ 0.0334, 0.0187, -0.1021, ..., 0.0280, -0.0583, -0.1018],
[-0.0506, -0.0939, -0.0467, ..., -0.0554, -0.0325, 0.0640],
[-0.0183, -0.0123, 0.1025, ..., -0.0214, 0.0220, -0.0741]],
requires_grad=True)
# model
model = torch.nn.Sequential(linear1, relu, linear2, relu, linear3, relu, linear4, relu, linear5).to(device)
# define cost/loss & optimizer
criterion = torch.nn.CrossEntropyLoss().to(device) # Softmax is internally computed.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for epoch in range(training_epochs):
    avg_cost = 0
    total_batch = len(data_loader)

    for X, Y in data_loader:
        # reshape input image into [batch_size by 784]
        # label is not one-hot encoded
        X = X.view(-1, 28 * 28).to(device)
        Y = Y.to(device)

        optimizer.zero_grad()
        hypothesis = model(X)
        cost = criterion(hypothesis, Y)
        cost.backward()
        optimizer.step()

        avg_cost += cost / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))

print('Learning finished')
Epoch: 0001 cost = 0.285531998
Epoch: 0002 cost = 0.090530835
Epoch: 0003 cost = 0.058814172
Epoch: 0004 cost = 0.039957844
Epoch: 0005 cost = 0.031853903
Epoch: 0006 cost = 0.025514400
Epoch: 0007 cost = 0.020236425
Epoch: 0008 cost = 0.019569544
Epoch: 0009 cost = 0.014461165
Epoch: 0010 cost = 0.014499786
Epoch: 0011 cost = 0.014732383
Epoch: 0012 cost = 0.011545545
Epoch: 0013 cost = 0.012704160
Epoch: 0014 cost = 0.010244285
Epoch: 0015 cost = 0.008064025
Learning finished
# Test the model using test sets
with torch.no_grad():
    X_test = mnist_test.test_data.view(-1, 28 * 28).float().to(device)
    Y_test = mnist_test.test_labels.to(device)

    prediction = model(X_test)
    correct_prediction = torch.argmax(prediction, 1) == Y_test
    accuracy = correct_prediction.float().mean()
    print('Accuracy:', accuracy.item())

    # Get one and predict
    r = random.randint(0, len(mnist_test) - 1)
    X_single_data = mnist_test.test_data[r:r + 1].view(-1, 28 * 28).float().to(device)
    Y_single_data = mnist_test.test_labels[r:r + 1].to(device)

    print('Label: ', Y_single_data.item())
    single_prediction = model(X_single_data)
    print('Prediction: ', torch.argmax(single_prediction, 1).item())

    plt.imshow(mnist_test.test_data[r:r + 1].view(28, 28), cmap='Greys', interpolation='nearest')
    plt.show()
Accuracy: 0.9767000079154968
Label: 8
Prediction: 8
The code is also posted separately at MyGitHub_MNIST_Xavier,Deep.
📣 Environment used for this post: Anaconda, CPU, PyTorch, Jupyter Notebook
If you spot an error or have a question about this post, please leave a comment; it helps a lot. 💡