PyTorch (7) VGG16

In this post I used a pretrained VGG16 to try ImageNet 1000-class image classification. This is the PyTorch version of what I did earlier with Keras (2017/1/4).

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torchvision
from torchvision import datasets, models, transforms

import json
import numpy as np
from PIL import Image

Loading the model

VGG16 is available through the models.vgg16 function. Passing pretrained=True loads weights trained on ImageNet.

vgg16 = models.vgg16(pretrained=True)

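Calling print(vgg16) shows the model structure. You can see that it consists of two Sequential models, (features) and (classifier). VGG16 has a simple structure: blocks of repeated Conv2d => ReLU pairs, each block ending with a MaxPool2d, followed by three Linear layers.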

VGG(
  (features): Sequential(
    (0): Conv2d (3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (5): Conv2d (64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d (128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (10): Conv2d (128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d (256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (17): Conv2d (256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace)
    (19): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace)
    (21): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace)
    (23): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (24): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace)
    (26): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace)
    (28): Conv2d (512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace)
    (30): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  )
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=1000)
  )
)
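The in_features=25088 of the first Linear layer follows from the layer list above. As a rough sketch in plain Python (no PyTorch needed), the helper names below are my own and the layer plan is simply read off the printout, assuming a 224 x 224 input:

```python
# Layer plan of the (features) part, read off the printout above:
# integers are Conv2d output channels, "M" marks a MaxPool2d layer.
VGG16_PLAN = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output length of one spatial dimension after max pooling."""
    return (size - kernel) // stride + 1


def feature_shape(height, width, plan=VGG16_PLAN):
    """Return (channels, height, width) after the whole feature extractor."""
    channels = 3
    for layer in plan:
        if layer == "M":
            height, width = pool_out(height), pool_out(width)
        else:
            channels = layer
            height, width = conv_out(height), conv_out(width)
    return channels, height, width


c, h, w = feature_shape(224, 224)
print((c, h, w), c * h * w)  # (512, 7, 7) 25088

# 13 convolution layers + 3 Linear layers = the "16" in VGG16.
print(sum(1 for layer in VGG16_PLAN if layer != "M") + 3)  # 16
```

The 3 x 3 convolutions with padding 1 keep the spatial size, so only the five pooling layers shrink the image: 224 -> 112 -> 56 -> 28 -> 14 -> 7.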

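Besides VGG16 there are also VGG11, VGG13, and VGG19, and variants of each with Batch Normalization added are available as well. Very convenient.

When running inference, remember to switch to evaluation mode with eval()!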

Some models use modules which have different training and evaluation behavior, such as batch normalization. To switch between these modes, use model.train() or model.eval() as appropriate. See train() or eval() for details. http://pytorch.org/docs/master/torchvision/models.html

vgg16.eval()

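If you forget this, the Dropout layers in (classifier) stay active, so the output changes on every inference run and you will notice something is off.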
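To see why the mode matters, here is a toy re-implementation of dropout in plain Python. It is only an illustration of the idea (zero each element with probability p during training and rescale the rest, do nothing in evaluation mode), not PyTorch's actual code, and the function name and arguments are my own:

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Toy inverted dropout over a list of floats (illustration only)."""
    if not training:
        # Evaluation mode: dropout is the identity, so results are repeatable.
        return list(values)
    scale = 1.0 / (1.0 - p)
    # Training mode: zero each element with probability p, rescale survivors.
    return [0.0 if rng.random() < p else v * scale for v in values]


activations = [0.5, 1.0, 1.5, 2.0]
rng = random.Random(0)
print(dropout(activations, training=True, rng=rng))   # random mask
print(dropout(activations, training=True, rng=rng))   # usually a different mask
print(dropout(activations, training=False))           # input unchanged
```

In training mode every call draws a new random mask, which is what makes repeated forward passes on the same image disagree.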

Preprocessing

When using weights trained on ImageNet, you need to apply the same normalization to the input image as was used during ImageNet training.

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. You can use the following transform to normalize: http://pytorch.org/docs/master/torchvision/models.html

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])

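The following transforms are applied to the input image.

Resize so that the shorter side is 256 pixels (the aspect ratio is preserved)
Take only the central 224 x 224 region of the image
Convert to a tensor (values scaled to [0, 1])
Subtract the mean of the ImageNet training data and divide by its std (standardization)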
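The arithmetic behind these steps can be sketched in plain Python. The helper names are hypothetical and the 600 x 400 input size is just an example. The rounding here is simple integer truncation, so torchvision's exact pixel offsets may differ by one in edge cases:

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_short_side(width, height, size=256):
    """Size after scaling the shorter side to `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map 8-bit RGB values to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


resized = resize_short_side(600, 400)
print(resized)                    # (384, 256)
print(center_crop_box(*resized))  # (80, 16, 304, 240)
print(normalize_pixel((255, 255, 255)))
```

A pixel equal to the ImageNet mean maps to roughly zero in every channel, which is the point of the standardization.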

Let's try feeding in an image. In PyTorch, images are basically loaded with PIL. Let's pass it through the preprocess we just created.

img = Image.open('./data/20170104210653.jpg')
img_tensor = preprocess(img)
print(img_tensor.shape)

[Image: the input image, ./data/20170104210653.jpg]

torch.Size([3, 224, 224])

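The image is converted to a 3D tensor. Note that

unlike Keras (TensorFlow), the channel dimension comes first
it is not a 4D tensor with a batch-size dimension

If you want to display the result as an image, prepare a transform that stops before ToTensor(). The output is then still a PIL image, so it can be drawn in a Jupyter Notebook as usual.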

preprocess2 = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224)
])
trans_img = preprocess2(img)
print(type(trans_img))  # <class 'PIL.Image.Image'>
trans_img

[Image: the input image after Resize and CenterCrop]

You can see that the central 224 x 224 region has been cropped out.

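Image classification

When feeding an image to the model, it must be a 4D tensor with a batch-size dimension added at the front, not a 3D tensor. unsqueeze_() adds the dimension. Methods whose names end with an underscore modify the tensor in place.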

img_tensor.unsqueeze_(0)
print(img_tensor.size())  # torch.Size([1, 3, 224, 224])
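The two layout points (channels first, plus a leading batch dimension) can be illustrated with plain nested lists. This is only a sketch of the index ordering, with made-up helper names and a tiny 2 x 3 "image", not how PyTorch stores tensors internally:

```python
def shape(nested):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def hwc_to_chw(image):
    """Reorder a height x width x channel nested list to channel-first."""
    height, width, channels = shape(image)
    return [[[image[y][x][c] for x in range(width)]
             for y in range(height)]
            for c in range(channels)]


# A tiny 2 x 3 RGB "image" in channel-last order; each pixel is [R, G, B].
hwc = [[[y * 10 + x, 100, 200] for x in range(3)] for y in range(2)]
chw = hwc_to_chw(hwc)
batch = [chw]  # adding a leading batch dimension, like unsqueeze_(0)

print(shape(hwc))    # (2, 3, 3)
print(shape(chw))    # (3, 2, 3)
print(shape(batch))  # (1, 3, 2, 3)
```

With a real 224 x 224 image the same reordering gives (3, 224, 224), and the batch dimension turns it into (1, 3, 224, 224).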

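In the PyTorch version used here, the 4D tensor must be wrapped in a Variable before it is passed to the model. From PyTorch 0.4 onward, Variable has been merged into Tensor, so the tensor can be passed directly.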

out = vgg16(Variable(img_tensor))
print(out.size())  # torch.Size([1, 1000])

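out holds the values before softmax, so they are not probabilities (they do not sum to 1.0). For classification, however, there is no need to convert them to probabilities: just pick the class with the largest output.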

np.argmax(out.data.numpy())  # 332
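The reason softmax can be skipped is that it preserves the ordering of the scores, so the argmax is the same before and after. A minimal sketch in plain Python, using made-up scores rather than the real model output:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.2, 4.8, -0.3, 2.5]  # made-up raw scores, not from VGG16
probs = softmax(logits)

print(sum(logits))                    # raw scores do not sum to 1
print(sum(probs))                     # probabilities sum to 1 (up to rounding)
print(argmax(logits), argmax(probs))  # 1 1
```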

To get the top K outputs in descending order, there is a method called topk().

out.topk(5)

It returns the output values and their indices, as shown below.

(Variable containing:
  28.5678  18.9699  18.1706  16.8523  16.8499
 [torch.FloatTensor of size 1x5], Variable containing:
  332  338  333  283  331
 [torch.LongTensor of size 1x5])
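What topk() returns, a pair of (values, indices) sorted by descending value, can be mimicked in plain Python. This is a rough stand-in written for illustration with made-up scores, not the library's implementation:

```python
def topk(scores, k):
    """Return (values, indices) of the k largest scores, largest first."""
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


scores = [0.1, 2.5, -1.0, 3.2, 0.7]  # made-up scores for five classes
values, indices = topk(scores, 3)
print(values)   # [3.2, 2.5, 0.7]
print(indices)  # [3, 1, 4]
```

The first index is the same class that argmax would pick, and the remaining ones are the runner-up classes.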

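The image was classified as class index 332 of the 1000 ImageNet classes, but what is that?

ImageNet 1000 class labels

PyTorch does not seem to come with a way to get the labels of the 1000 ImageNet classes. The label information is available in JSON format at the URL below, so download it.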

!wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json

with open('imagenet_class_index.json', 'r') as f:
    class_index = json.load(f)
print(class_index)

{'0': ['n01440764', 'tench'],
 '1': ['n01443537', 'goldfish'],
 '2': ['n01484850', 'great_white_shark'],
 '3': ['n01491361', 'tiger_shark'],
 '4': ['n01494475', 'hammerhead'],
 '5': ['n01496331', 'electric_ray'],
 '6': ['n01498041', 'stingray'],
 '7': ['n01514668', 'cock'],
 '8': ['n01514859', 'hen'],
 '9': ['n01518878', 'ostrich'],
 '10': ['n01530575', 'brambling'],

labels = {int(key):value for (key, value) in class_index.items()}
print(labels[0])   # ['n01440764', 'tench']
print(labels[1])    # ['n01443537', 'goldfish']
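The int(key) conversion is needed because JSON object keys are always strings, while argmax returns an integer. A small sketch using a three-entry excerpt in the same layout as the file (the reverse lookup wnid_to_index is a hypothetical extra, not something the original code uses):

```python
import json

# A three-entry excerpt in the same layout as imagenet_class_index.json.
raw = ('{"0": ["n01440764", "tench"], '
       '"1": ["n01443537", "goldfish"], '
       '"2": ["n01484850", "great_white_shark"]}')
class_index = json.loads(raw)

# Keys come back as strings, so an integer class index would not match.
print("0" in class_index, 0 in class_index)  # True False

labels = {int(key): value for key, value in class_index.items()}

# Hypothetical reverse lookup from WordNet ID to class index.
wnid_to_index = {wnid: idx for idx, (wnid, _name) in labels.items()}

print(labels[1])                   # ['n01443537', 'goldfish']
print(wnid_to_index["n01484850"])  # 2
```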

So class 332 is...

print(labels[np.argmax(out.data.numpy())])

['n02328150', 'Angora'] It was an Angora rabbit!

Test

Let's wrap this in a function and evaluate a few images.

def predict(image_file):
    img = Image.open(image_file)
    img_tensor = preprocess(img)
    img_tensor.unsqueeze_(0)

    out = vgg16(Variable(img_tensor))

    # Convert the outputs to probabilities (not needed if you only classify)
    out = nn.functional.softmax(out, dim=1)
    out = out.data.numpy()

    maxid = np.argmax(out)
    maxprob = np.max(out)
    label = labels[maxid]
    return img, label, maxprob

img, label, prob = predict('./data/20170104210653.jpg')
print(label, prob)   # ['n02328150', 'Angora'] 0.999879
img

[Image: ./data/20170104210653.jpg]

img, label, prob = predict('./data/20170104210658.jpg')
print(label, prob)  # ['n04147183', 'schooner'] 0.942729
img

[Image: ./data/20170104210658.jpg]

img, label, prob = predict('./data/20170104210705.jpg')
print(label, prob)  # ['n02699494', 'altar'] 0.823404
img

[Image: ./data/20170104210705.jpg]

The results differ slightly from the Keras version.

人工知能に関する断創録

This blog summarizes what I have investigated in various areas of artificial intelligence (updates stopped: December 31, 2019).
