Building an AI Bot to Play Rock, Paper, Scissors

Learn How You Can Do the Same

Darren Broderick (DBro)
Towards Data Science


AI Bot Name: Janken
Technology Used: TensorFlow, Keras, NumPy, SqueezeNet & OpenCV

All executed through Python scripts, with game settings controlled through keyboard strokes.

TL;DR

If you want to get up and running, you can go straight to my GitHub and clone the repo.
https://github.com/DarrenBro/rock-paper-scissors-ai
There are detailed instructions you can follow in the Readme.

In the GitHub project, I’ve included model examples, dependencies for the project, SqueezeNet models (more on these later) and test images.

I’ve tried to make it as minimal as possible to get started, even if you’ve never worked with ML or Python before.

The rest of this article will focus on the four steps that made Janken:

Local webcam — Janken doesn’t always get it right

  1. Gathering the data (what seemed appropriate)
  2. SqueezeNet and training a Neural Network
  3. Testing the Model
  4. Play the game!

Gathering the data — what seemed appropriate

We want to collect 4 types of images (Rock, Paper, Scissors & Background/Noise) and map them as our input labels.
We just need them indexed.

In machine learning, when a dataset with the correct answers is available, training can be looked at as a form of supervised learning. This form of training is “image classification”, and as we know the correct labels, Janken falls under supervised image classification.

So we want hundreds of images across each category. This requires a bit of work on our end, but we reduce the heavy lifting with a simple OpenCV script that collects all these images in seconds.

This is explained in the GitHub Readme.

Things to watch out for.

Keeping all input shapes consistent
You can easily get caught up in errors like the one below if you have inconsistent data shapes.

ValueError: Error when checking: expected squeezenet_input to have shape (None, 300, 300, 3) but got array with shape (1, 227, 227, 3).

So with a setup like my diagram above to size, capture and then store your data, you can automate the process and collect hundreds of images in seconds.
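
For reference, below is a minimal sketch of what such a capture loop could look like (the real script lives in the repo; the folder layout and key binding here are my own assumptions):

import os
import cv2

label = "rock"  # which gesture folder to fill (hypothetical layout)
out_dir = os.path.join("images", label)
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)  # local webcam
count = 0
while count < 200:  # hundreds of images in seconds
    ret, frame = cap.read()
    if not ret:
        break
    roi = cv2.resize(frame, (300, 300))  # keep input shapes consistent
    cv2.imwrite(os.path.join(out_dir, "{}_{}.jpg".format(label, count)), roi)
    count += 1
    cv2.imshow("capture", roi)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to stop early
        break

cap.release()
cv2.destroyAllWindows()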

Local webcam — Try to get as many different angles as possible

Remove Bias

Local webcam — My gloved hand to increase data range

A huge area in ML is having a model that has good generalisation.

Is the diversity of the data good enough to train with? Hundreds of images of just my own hand will likely mean the model has trouble identifying anything other than me as a player.

My temporary solution was to use a gloved hand to encourage the CNN to focus on the features of the gesture.

I then extended this by reaching out and gathering images from the Quantum unit to train with, keeping my bias down further.

An improvement idea: keep all training images black and white, and convert game images to black and white as well.

One last thing to note: as I was angling my hand during image capture, some images came out blurred through the motion. It was important to clean those out by removing them, as they would only confuse a model that needs to identify static imagery, as it would during gameplay.
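
One way to automate that clean-up (my own suggestion, not part of the project) is the variance-of-the-Laplacian sharpness check:

import cv2

def is_blurred(image_path, threshold=100.0):
    # Low Laplacian variance means few edges, i.e. a likely motion blur.
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(image, cv2.CV_64F).var() < threshold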

Noise Pollution / Lighting
‘Noise’ means the background, or anything other than a hand playing a gesture. One point on this was the trouble I had keeping Janken from recognising that nothing had been played yet; it would skip between predictions. Keeping a stable background in training was not the best choice. I could have mixed in more contrast, shading, posters, doors, just more natural house and office items into the data.

SqueezeNet and training a Neural Network

There is a lot to be said here. I added lots of comments to the code in my GitHub (linked at top and bottom); the file to focus on here is called “train_model.py”.

Keras
Everything revolves around Keras for this project; it is a Python API for deep learning.

Let’s have our label inputs map to index values, as that’s how the NN will identify them.

INPUT_LABELS = {
    "rock": 0,
    "paper": 1,
    "scissors": 2,
    "noise": 3
}

After that, I decided to go with the ‘Sequential’ model.

A design that allows a simple 1–1 mapping; in ML terms, it allows layers that take in 1 input tensor and create 1 output tensor. For example, Sequential allows the Neural Network (NN) to identify a paper image as the value 1.

model = Sequential([
    SqueezeNet(input_shape=(300, 300, 3), include_top=False),
    Dropout(0.2),
    Convolution2D(LABELS_COUNT, (1, 1), padding='valid'),
    Activation('relu'),
    GlobalAveragePooling2D(),
    Activation('softmax')
])

Above is our model definition, the heart of Janken, and this is all the architecture design we’ll need to do; most of the work is around collecting, cleaning and shaping the data before we compile and fit the model.
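
For context, the compile-and-fit step might look like the sketch below; the optimizer, loss and epoch count are my assumptions rather than the exact values in “train_model.py”:

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',  # multi-class: rock / paper / scissors / noise
    metrics=['accuracy']
)

# `data` is a NumPy array of (300, 300, 3) images and `labels` holds the
# one-hot encoded INPUT_LABELS indices.
model.fit(data, labels, epochs=10, batch_size=32)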

I will talk about ‘SqueezeNet’ shortly. Everything else below is considered a layer (keras.layers); they’re all important, so let’s take a minute to explain each.

Dropout is added to reduce overfitting; 0.2 = 20%, so throughout the training process 1/5 of the layer’s outputs are dropped at random, which keeps the model learning new approaches and stops it becoming stale. It’s not unusual to see this as high as 50%.

Convolution2D controls the size of the layer via its first argument, LABELS_COUNT, which will be 4 in total (3 gesture labels + 1 noise label). It is appended to the already defined neural network, SqueezeNet.

Activation (ReLU): the Rectified Linear Unit turns negative values into 0 and outputs positive values as-is.
Why? An activation function is responsible for transforming a node’s weighted inputs into the output passed onwards; a model can’t coordinate itself well with unbounded negative values, as it becomes un-uniform relative to how we expect the model to perform.

f(x) = max(0, x) => the output of a ReLU unit is non-negative: it returns x if x is positive, otherwise max returns 0.

ReLU produces a straight line on the positive side, with the negative side being a flat zero

ReLU has become the default activation function for many types of neural networks because a model that uses ReLU is easier to train and often achieves better performance and good model convergence.

It’s not the be-all and end-all; it can cause a problem called “dying ReLU”. If you are passing through too many negatives, you may not want them all turned to 0, but instead returned along a small negative slope. Search for “Leaky ReLU” if you’re interested in learning more.
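
A quick NumPy illustration of the difference (my own sketch, not project code):

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

relu = np.maximum(0, x)               # [0.  0.  0.  1.5 3. ] negatives become 0
leaky = np.where(x > 0, x, 0.01 * x)  # [-0.02 -0.005 0. 1.5 3. ] keeps a small slope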

GlobalAveragePooling2D calculates the average output of each feature map in the previous layer, i.e. a data-reduction layer that prepares the model for the final Activation(‘softmax’).

Activation (softmax) gives the probability of each hand sign.
We have a 4-class problem; ‘softmax’ handles multi-class classification, anything more than 2 classes.
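
To make that concrete, here’s what the final softmax output looks like for one prediction (the input logits are made-up values):

import numpy as np

logits = np.array([2.0, 0.5, 0.1, -1.0])  # rock, paper, scissors, noise
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs.round(3))  # [0.703 0.157 0.105 0.035] -> the four probabilities sum to 1
print(probs.argmax())  # 0, the INPUT_LABELS index for "rock"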

SqueezeNet

It is a pre-built neural network for image classification, meaning we can focus on extending it for our purpose of building Janken, which is enough work in itself, rather than making the added effort to create a neural network from scratch. The training time alone for that could be days.

Bonuses you get with SqueezeNet:

  1. Smaller Convolutional Neural Networks (CNNs) require less communication across servers during distributed training.
  2. Less bandwidth to export a new model.
  3. Smaller CNNs are more feasible to deploy and use less memory.

To revisit the line of code from the training script:

SqueezeNet(input_shape=(300, 300, 3), include_top=False)

# input_shape is an image size of 300 x 300 pixels; 3 is for the RGB colour channels.

# include_top lets you select whether you want a final dense layer or not.

# Dense layers are capable of interpreting found patterns in order to classify images,
# e.g. “this image contains rock”.

# Set to False as we have already labeled what rock data looks like.

The convolutional layers work as feature extractors. They identify a series of patterns in the image, and each layer can identify more elaborate patterns by seeing patterns of patterns.

Note on the weights:

  • The weights in a convolutional layer are fixed-size, so a convolutional layer doesn’t care about the size of the input image.
    It just does its training and produces an output sized according to the input image.
  • The weights in a dense layer are totally dependent on the input size. It’s one weight per element of the input, so this demands that your input always be the same size, or else you won’t have properly learned weights.

Because of this, setting the final dense layer to False allows you to define the input size (300 x 300), and the output size will increase or decrease accordingly.

Testing the model

In the script “test_model” you can see the model’s predictions on images you’ve already processed, or on new images the model has never seen before.

Sometimes it’s spot on
Sometimes you get lucky
The other 90% of the time you just sigh

The script handles any new image you want to provide, as it will reshape it to 300x300 using OpenCV.
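
Conceptually, the prediction step boils down to something like this sketch (the model and image file names here are hypothetical):

import cv2
import numpy as np
from keras.models import load_model

REV_LABELS = {0: "rock", 1: "paper", 2: "scissors", 3: "noise"}

model = load_model("janken_model.h5")        # hypothetical model file
img = cv2.imread("test_images/rock_01.jpg")  # hypothetical test image
img = cv2.resize(img, (300, 300))            # match the training input shape

pred = model.predict(np.expand_dims(img, axis=0))  # batch shape (1, 300, 300, 3)
print(REV_LABELS[int(np.argmax(pred))])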

Play the game!

Predictions on Janken’s play result
I imagined the predictions Janken made would “flicker” a lot, as a moving camera image will always provide different inputs to analyse and run the model against.

Lighting will play a big part so I tried to split my dataset and collect images at different times of the day.

Static backgrounds or controls to freeze the image will help make more stable gesture predictions.

How did it play out?

Janken wasn’t made with lockdown in mind, so an ‘elegant’… solution was devised: playing others through the Mac’s webcam, with the other player on a shared screen on the monitor.

I know you’re impressed

However, Janken could only ever beat me consistently; it was able to win 50% of played gestures from other players through the monitor, but I suspect the nature of camera imagery and processing made it difficult for Janken to properly make out all gestures.

To improve the model, I should have gathered images from the user’s side through my webcam to give Janken more generalisation.
