Sculpture Style Classifier


NOTE: All the code used to create this model can be found here.


The data for this project was scraped from three different sources:

From these sources we chose 12 different styles to train the model on. Those styles are:

Each image was resized to 299x299. We then split this data into three different sets. A training, validation, and testing set. The split was 60/20/20. The number of images in each group is 2387 in the training set, 802 in the validation set, and 800 in the testing set. Due to class imbalance each style was itself split randomly by the above percentages and placed into the three groups. The number of images for each style and set can be found below:

Model Training

In order to create the model, we chose to fine-tune the Xception model. Given the small amount of data at our disposal, we will use data augmentation techniques to increase the diversity of training images. This will hopefully help us prevent overfitting. The strategy to augment the training data was done by producing images by randomly adjusting the shear range, randomly zooming in, and randomly flipping an image on its horizontal axis.

To begin we first replaced the top of the network, which used to classify the input into a set of discrete classes. The diagram of this new classifier is shown below. It includes regularization to the tune of a .0075 L2 penalty in the dense layer and a 50% dropout layer.

We first pre-trained the network by freezing every layer besides for those we just added. During this procedure the network was trained for 25 epochs with a batch size of 16. The optimizer used is RMSprop with a learning rate of .001. The plot below details the training and validation accuracy.

Now that the newly added top classifier’s weights are initialized we can now begin fine-tuning the network. The obvious question that comes to mind when doing so is how many layers do we fine-tune? Do we allow all of the weights in the network to be modified or only a portion? In order to determine this, a number of separate models are built each fine-tuning a different number of layers in the network. We will then evaluate each model and see which one performs best on the validation set. Each model begins with the same weights (which was just previously trained) and uses the same procedure. Which layers are allowed to be modified or not is determined by the block it is contained in. Of important note is the fact that the Xception module contains 14 blocks. The following models were built:

Each sub-model was trained for 50 epochs with a batch size of 16. The optimizer used was Stochastic Gradient Descent with a learning rate of .0001 and a momentum of .9. After each epoch the current model was evaluated on the validation set. The results of both the accuracy and the loss on the validation set for each model is shown below (Note: Each model is referred to by the number of blocks frozen):

As you can see, the top few models each do fairly similair. Upon closer inspection I’ve determined that the ‘block3’ model performed best, this being the model with only the first three blocks frozen. For this reason this was chosen as my final model.


We first evaluate it on the test set as a whole. Using our model and the test sculptures we look to see how well it does it predicting their correct classes. We find that following results:

A more detailed examination can be found by looking at a normalized confusion matrix as shown below. This shows us that the model has issues distinguishing certain styles from the rest - particularly Rococo, Mannerism, and Romanticism.

Update - 01/21/2020

I recently talked to someone who asked me how I handled the issue of an imbalanced dataset for this task and I told them that besides for ensuring the same percentage of images for each class across each set I didn't do anything. At the time I figured I wouldn't bother and would just maximize the total accuracy/loss. But I figured that due to the very low accuracy among some classes, I may as well see what I could do and how it would effect the classifier.

In order to do so I computed the balanced class weights for the dataset. The method for doing so can be found here. What this does is weight instances of classes with lesser training data as counting more towards the loss function and those with more training data as counting less. This is as oppossed to earlier where each data point was worth the same, causing those classes with more data to factor more into the loss function. One can really choose whatever weights one wants but here I chose to make them balanced. This means that the weighted sum of the instances for each class is the same - so each has the same total effect on the loss function.

I now did the training process again, with the class weights being the only difference from before. Because the whole training process can take some time I choose to use the same hyperparameters as before and to fine-tune the same amount of blocks. This means I first pre-trained the network as before and then fine-tuned the resulting network with the only the first 3 block frozen (which originally performed the best). Here is the comparison of the accuracy and loss by epoch for the original final model and the new weighted model:

It's easy to see that it just doesn't do close to as well. When testing the network on the test set we get an accuracy of 47.64% that is about a ~6.5% drop from the original model. I don't think this should be unexpected. We'll now take a look at the normalized confusion matrix to see how well the weighted classifier does across classes. As a reminder, the weakness with the previous iteration was that some classes like Rococo and Romanticism performed very poorly due to being similar to other styles and having a small sample size. By weighting each class equally they should now perform better.

And this is what we see. Looking at the accuracy for each class we see a big jump for classes that struggled in the previous iteration - Romanticism, Rococo, and Mannerism. We also see big decreases for the two classes with the most data - Baroque and Early Renaissance. Lastly, there is a much lower overall false positive rate. This can be seen clearly with Baroque. The accuracy for both networks and the difference between them can be seen below:

Style Non-Weighted Accuracy% Weighted Accuracy% % Difference
Baroque 61% 30% -31%
Early Renaissance 53% 39% -24%
High Renaissance 20% 29% +9%
Impressionism 64% 54% -10%
Mannerism 12% 29% +17%
Medieval 78% 83% +5%
Minimalism 90% 89% -1%
NeoClassicism 57% 47% -10%
Realism 56% 48% -8%
Rococo 7% 30% +22%
Romanticism 15% 35% +20%
Surrealism 62% 66% +4%

To conclude I this this exercise turned out like we thought it would. Adding class weights resulted in a lower overall accuracy while making the accuracy between classes more balanced. I'd argue this small change makes the network more viable in a real world setting. More work tweaking the hyperparameters and fine-tuning different blocks would likely lessen the gap between the two networks.