The data for this project was scraped from three different sources:
From these sources we chose 12 different styles to train the model on. Those styles are:
Each image was resized to 299x299. We then split this data into three different sets. A training, validation, and testing set. The split was 60/20/20. The number of images in each group is 2387 in the training set, 802 in the validation set, and 800 in the testing set. Due to class imbalance each style was itself split randomly by the above percentages and placed into the three groups. The number of images for each style and set can be found below:
In order to create the model, we chose to fine-tune the Xception model. Given the small amount of data at our disposal, we will use data augmentation techniques to increase the diversity of training images. This will hopefully help us prevent overfitting. The strategy to augment the training data was done by producing images by randomly adjusting the shear range, randomly zooming in, and randomly flipping an image on its horizontal axis.
To begin we first replaced the top of the model, which used to classify the input into a set of discrete classes. The diagram of this new classifier is shown below. It includes regularization to the tune of a .0075 L2 penalty in the dense layer and a 50% dropout layer.
We first pre-trained the model by freezing every layer besides for those we just added. During this procedure the model was trained for 25 epochs with a batch size of 16. The optimizer used is RMSprop with a learning rate of .001. The plot below details the training and validation accuracy.
Now that the newly added top classifier’s weights are initialized we can now begin fine-tuning the model. The obvious question that comes to mind when doing so is how many layers do we fine-tune? Do we allow all of the weights in the network to be modified or only a portion? In order to determine this, a number of separate models are built each fine-tuning a different number of layers in the network. We will then evaluate each model and see which one performs best on the validation set. Each model begins with the same weights (which was just previously trained) and uses the same procedure. Which layers are allowed to be modified or not is determined by the block it is contained in. Of important note is the fact that the Xception module contains 14 blocks. The following models were built:
Each sub-model was trained for 50 epochs with a batch size of 16. The optimizer used was Stochastic Gradient Descent with a learning rate of .0001 and a momentum of .9. After each epoch the current model was evaluated on the validation set. The results of both the accuracy and the loss on the validation set for each model is shown below (Note: Each model is referred to by the number of blocks frozen):
As you can see, the top few models each do fairly similair. Upon closer inspection I’ve determined that the ‘block3’ model performed best, this being the model with only the first three blocks frozen. For this reason this was chosen as my final model.
We first evaluate it on the test set as a whole. Using our model and the test sculptures we look to see how well it does it predicting their correct classes. We find that following results:
A more detailed examination can be found by looking at a normalized confusion matrix as shown below. This shows us that the model has issues distinguishing certain styles from the rest - particularly Rococo, Mannerism, and Romanticism.