Tic-Tac-Toe and AI: A Winning Board

As a first learning task for applying AI to Tic-Tac-Toe, let us determine whether a board is a winning one or not (i.e. either X or O has won the game). We cannot directly tell the neural network the rules that make a board a winning one. Instead, we need to train it by example. For that, we already prepared some data in a previous post.

The idea is to train a TensorFlow model with several Dense layers. By varying the model’s configuration, we want to determine how complex such a model needs to be (e.g. how many parameters it needs) to fulfill this requirement. The accuracy we can reach will also be an interesting topic.

A note in advance: the computations were done using TensorFlow 2.13.1 on a Windows WSL2 machine with an NVIDIA GeForce RTX 4060 (8 GB) installed. Mixed precision was not enabled.

As usual you may download the entire example using the following link:

tictactoetf.zip (9.6 KiB)

Let’s get started…

Reading Data for Training and Evaluation

First of all, we need to read the data for training and evaluation. In the preparation blog post, we already created the data file tictactoe_valid.txt for that. We will reuse the data loader for further cases as well, which is why we make it a little more generic using pyrecord.

from pyrecord import Record # https://pythonhosted.org/pyrecord/

TTTRecord = Record.create_type("TTTRecord", "vector", "valid", "winning", "winner", "move")

def tttRecordGenerator():
    # Each line starts with the nine digits of the vector; the character at
    # index 9 is skipped, followed by one character each for valid, winning,
    # winner and move.
    with open("tictactoe_valid.txt", "rt") as f:
        for line in f:
            vector = [int(c) for c in line[:9]]
            valid = line[10] == "1"
            winning = int(line[11])
            winner = int(line[12])
            move = int(line[13])
            yield TTTRecord(vector, valid, winning, winner, move)

validtttRecords = filter(lambda o : o.valid, tttRecordGenerator())

# Warning: will take a couple of seconds!
validtttRecordsList = list(validtttRecords)
len(validtttRecordsList)
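
To make the fixed-width parsing easier to follow, here is a minimal sketch that parses a single line without touching any file. It uses `collections.namedtuple` from the standard library as a stand-in for pyrecord, and the names `TTTRow` and `parse_ttt_line` as well as the sample line are made up for illustration. The character at index 9 is assumed to be a separator, since the loader never reads it.

```python
from collections import namedtuple

# Stand-in for the pyrecord type (assumption: same field names as above).
TTTRow = namedtuple("TTTRow", ["vector", "valid", "winning", "winner", "move"])

def parse_ttt_line(line):
    """Parse one line: 9 digits, 1 skipped character, then 4 single-character fields."""
    vector = [int(c) for c in line[:9]]
    return TTTRow(vector, line[10] == "1", int(line[11]), int(line[12]), int(line[13]))

# Hypothetical sample line, not taken from the real data file.
sample = "987654321 1110\n"
row = parse_ttt_line(sample)
print(row)
```

Keeping the parsing in a small function like this would also make it easy to check the column offsets on a handful of lines before loading the whole file.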

Having set up TensorFlow with

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

import pandas as pd
import numpy as np
import datetime

we then transform our pyrecord data into a Pandas DataFrame.

# One row per record: the nine vector entries followed by the winning label.
allDataFrame = pd.DataFrame(
    [record.vector + [record.winning] for record in validtttRecordsList],
    columns=['pos1', 'pos2', 'pos3', 'pos4', 'pos5', 'pos6', 'pos7', 'pos8', 'pos9', 'winning'])

print(allDataFrame.tail())

This gives us a first glimpse of the data:

        pos1  pos2  pos3  pos4  pos5  pos6  pos7  pos8  pos9  winning
362875     9     8     7     6     5     4     1     3     2        1
362876     9     8     7     6     5     4     2     1     3        0
362877     9     8     7     6     5     4     2     3     1        0
362878     9     8     7     6     5     4     3     1     2        1
362879     9     8     7     6     5     4     3     2     1        1

Essentially, we unpack the vector into columns (these will be our features later on) and place them next to the winning information (which will be our labels later).

Training (Basic Model)

For training, we take the usual random 80% sample. The remainder will serve as the test set for evaluation later. Moreover, we compute the usual summary statistics for our training set.

train_dataset = allDataFrame.sample(frac=0.8, random_state=42)
test_dataset = allDataFrame.drop(train_dataset.index)

print(allDataFrame.shape, train_dataset.shape, test_dataset.shape)
train_dataset.describe().transpose()

(362880, 10) (290304, 10) (72576, 10)

	count	mean	std	min	25%	50%	75%	max
pos1	290304.0	5.004037	2.582516	1.0	3.0	5.0	7.0	9.0
pos2	290304.0	4.999249	2.580750	1.0	3.0	5.0	7.0	9.0
pos3	290304.0	4.996063	2.580538	1.0	3.0	5.0	7.0	9.0
pos4	290304.0	5.003014	2.581781	1.0	3.0	5.0	7.0	9.0
pos5	290304.0	4.996579	2.582223	1.0	3.0	5.0	7.0	9.0
pos6	290304.0	5.000279	2.581773	1.0	3.0	5.0	7.0	9.0
pos7	290304.0	4.999029	2.583215	1.0	3.0	5.0	7.0	9.0
pos8	290304.0	4.998505	2.581489	1.0	3.0	5.0	7.0	9.0
pos9	290304.0	5.003245	2.583642	1.0	3.0	5.0	7.0	9.0
winning	290304.0	0.448692	0.497361	0.0	0.0	0.0	1.0	1.0

We can easily see that the values are uniformly distributed across all positions, and that the winning value behaves like a boolean, which makes this a binary classification task.

For easier access, we can now split features and labels:

# split features from labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('winning')
test_labels = test_features.pop('winning')

Obviously, our features are not normalized yet. That is why we prepare a Keras Normalization layer.

normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())

[[5.004032  4.9992476 4.9960785 5.0030107 4.9965568 5.0002723 4.999036
  4.9984956 5.0032406]]

Soon we will make this the first layer of our model.
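
To illustrate what the normalization does to each feature column, here is a minimal sketch in plain Python: subtract the column mean and divide by the standard deviation. It is only meant to show the arithmetic and is not the Keras implementation (which, for instance, works on tensors and guards against division by zero); the function name is made up.

```python
import statistics

def normalize_column(values):
    """Standardize one feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Each position holds the values 1..9 equally often, so this column is representative.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scaled = normalize_column(column)
print([round(v, 3) for v in scaled])
```

For the values 1 to 9 the mean is 5 and the standard deviation is roughly 2.58, which agrees with the statistics of the training set.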

Speaking of the model: as a starting point, let’s take the following one:

model = keras.models.Sequential([
    normalizer,
    layers.Dense(units=64, activation='relu'), #1
    layers.Dense(units=64,activation='relu'), #2 
    layers.Dense(units=128,activation='relu'), #3
    layers.Dense(units=1)
])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization (Normalizati  (None, 9)                 19        
 on)                                                             
                                                                 
 dense (Dense)               (None, 64)                640       
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dense_2 (Dense)             (None, 128)               8320      
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
=================================================================
Total params: 13268 (51.83 KB)
Trainable params: 13249 (51.75 KB)
Non-trainable params: 19 (80.00 Byte)
_________________________________________________________________
None

As you can see, after the normalizer, we have two Dense layers with 64 units each, followed by a Dense layer with 128 units. As we want a single boolean-like result (“winning or not”), the output layer is a Dense layer with a single unit. As we go with the logit approach, there is no activation function on the output layer. Let’s see how far we get with this.
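
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The following sketch redoes that arithmetic in plain Python; the helper names are made up. The breakdown of the 19 non-trainable values into 9 means, 9 variances and one counter is my understanding of the Normalization layer's state and should be read as an assumption.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

def count_trainable(input_dim, hidden_units, output_units=1):
    """Sum the Dense parameters of a stack of layers, ending in the output layer."""
    total, previous = 0, input_dim
    for units in list(hidden_units) + [output_units]:
        total += dense_params(previous, units)
        previous = units
    return total

trainable = count_trainable(9, [64, 64, 128])
# Assumed normalization state: 9 means + 9 variances + 1 counter (non-trainable).
non_trainable = 9 + 9 + 1
print(trainable, non_trainable, trainable + non_trainable)
```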

To avoid training longer than necessary (and thereby risking overfitting), let’s make sure that we stop fitting as soon as we reach an accuracy of 1.0. For that, we define a brief custom callback:

class Accuracy1Stopping(keras.callbacks.Callback):
    def __init__(self):
        super().__init__()

    def on_epoch_end(self, epoch, logs=None):
        # Stop as soon as the training accuracy rounds to 1.0 (four decimal places).
        if round(logs.get('accuracy'), 4) == 1.0:
            self.model.stop_training = True
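
The rounding in the stop condition deserves a second look: an accuracy that is displayed as 0.9999 does not stop the training yet, only a value that rounds to 1.0 at four decimal places does. A small sketch of just that condition (the function name and the sample values are made up):

```python
def reached_full_accuracy(accuracy, digits=4):
    """Same check as in the callback: does the accuracy round to exactly 1.0?"""
    return round(accuracy, digits) == 1.0

print(reached_full_accuracy(0.99993))  # rounds to 0.9999, keep training
print(reached_full_accuracy(0.99997))  # rounds to 1.0, stop
```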

Then let’s compile and fit the model right away:

model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=keras.optimizers.Adam(learning_rate=0.01), 
              metrics = ["accuracy"])

history = model.fit(train_features, train_labels, 
          batch_size=512, 
          epochs=50, 
          shuffle=True,
          callbacks=[
              tf.keras.callbacks.EarlyStopping(monitor='accuracy', mode="max", restore_best_weights=True, patience=5, verbose=1), 
              Accuracy1Stopping(),
              tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=0.002)
          ], 
          verbose=1)

Epoch 1/50
567/567 [==============================] - 6s 7ms/step - loss: 0.5521 - accuracy: 0.6821 - lr: 0.0100
Epoch 2/50
567/567 [==============================] - 3s 5ms/step - loss: 0.3881 - accuracy: 0.7986 - lr: 0.0100
Epoch 3/50
567/567 [==============================] - 3s 5ms/step - loss: 0.2644 - accuracy: 0.8709 - lr: 0.0100
Epoch 4/50
567/567 [==============================] - 3s 5ms/step - loss: 0.1828 - accuracy: 0.9161 - lr: 0.0100
Epoch 5/50
567/567 [==============================] - 3s 5ms/step - loss: 0.1332 - accuracy: 0.9435 - lr: 0.0100
Epoch 6/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0902 - accuracy: 0.9631 - lr: 0.0100
Epoch 7/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0713 - accuracy: 0.9716 - lr: 0.0100
Epoch 8/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0696 - accuracy: 0.9740 - lr: 0.0100
Epoch 9/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0528 - accuracy: 0.9808 - lr: 0.0100
Epoch 10/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0445 - accuracy: 0.9841 - lr: 0.0100
Epoch 11/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0496 - accuracy: 0.9836 - lr: 0.0100
Epoch 12/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0285 - accuracy: 0.9907 - lr: 0.0100
Epoch 13/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0433 - accuracy: 0.9860 - lr: 0.0100
Epoch 14/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0181 - accuracy: 0.9955 - lr: 0.0100
Epoch 15/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0552 - accuracy: 0.9852 - lr: 0.0100
Epoch 16/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0022 - accuracy: 0.9999 - lr: 0.0100
Epoch 17/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0032 - accuracy: 0.9995 - lr: 0.0100
Epoch 18/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0914 - accuracy: 0.9779 - lr: 0.0100
Epoch 19/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0030 - accuracy: 0.9999 - lr: 0.0020
Epoch 20/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0022 - accuracy: 1.0000 - lr: 0.0020

Note that the training finished in less than 63 s, and we achieved a mind-blowing accuracy of 1.0 after just 20 epochs!
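
In the log, the lr column drops from 0.0100 to 0.0020 at epoch 19. A minimal sketch of the plateau rule, fed with the losses logged above, reproduces that step. It is a simplified re-implementation for illustration only, assuming a minimum improvement of 1e-4 and no cooldown; it is not the code Keras runs, and the function name is made up.

```python
def simulate_reduce_on_plateau(losses, lr, factor, patience, min_lr, min_delta=1e-4):
    """Return the learning rate in effect during each epoch (simplified plateau rule)."""
    best = float("inf")
    wait = 0
    used = []
    for loss in losses:
        used.append(lr)
        if loss < best - min_delta:
            # Loss improved: remember it and reset the waiting counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
    return used

# Losses copied from the training log above.
logged_losses = [0.5521, 0.3881, 0.2644, 0.1828, 0.1332, 0.0902, 0.0713, 0.0696,
                 0.0528, 0.0445, 0.0496, 0.0285, 0.0433, 0.0181, 0.0552, 0.0022,
                 0.0032, 0.0914, 0.0030, 0.0022]
rates = simulate_reduce_on_plateau(logged_losses, lr=0.01, factor=0.2,
                                   patience=2, min_lr=0.002)
first_reduced_epoch = next(i + 1 for i, r in enumerate(rates) if r < 0.01)
print(first_reduced_epoch, rates[-1])
```

Epochs 17 and 18 both fail to improve on the loss of epoch 16, so the rate is multiplied by 0.2 and lands exactly on the configured minimum of 0.002.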

Evaluation (Basic Model)

Let’s use the test dataset to check that the model has not tricked us:

evaluationResult = model.evaluate(test_features, test_labels, batch_size=256, verbose=1)
print(evaluationResult)

284/284 [==============================] - 4s 10ms/step - loss: 0.0021 - accuracy: 1.0000
[0.002135399729013443, 1.0]

The accuracy is 1.0 for the test dataset as well! That impressively confirms that the neural network was able to learn what a winning board is.
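
Since the output layer has no activation function, the raw model output is a logit rather than a probability. A minimal sketch of how such a value could be turned into a “winning or not” decision: apply the sigmoid function and compare with 0.5. The logits below are made up for illustration and the function names are assumptions.

```python
import math

def sigmoid(logit):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def is_winning(logit, threshold=0.5):
    """Treat a board as winning if the predicted probability reaches the threshold."""
    return sigmoid(logit) >= threshold

for logit in (-4.2, 0.3, 6.0):  # made-up logits, not real model outputs
    print(logit, round(sigmoid(logit), 4), is_winning(logit))
```

With a threshold of 0.5, this is the same as checking whether the logit is non-negative.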

BTW: This Keras neural network can be downloaded here.

tic-tac-toe-Winning-Model.zip (134.8 KiB)

Sensitivity Analysis

Now, let’s play around a little with the model and see how sensitive the results are to the model configuration. For that, we automate training and evaluation in a function, which reads like this:

import time
def runTrainingAndMeasureTestAccuracy(model, learning_rate): 
    model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=keras.optimizers.Adam(learning_rate=learning_rate), 
              metrics = ["accuracy"])
    start = time.time()
    history = model.fit(train_features, train_labels, 
          batch_size=512, 
          epochs=50, 
          shuffle=True, 
          callbacks=[
              tf.keras.callbacks.EarlyStopping(monitor='accuracy', mode="max", patience=5, verbose=1), 
              Accuracy1Stopping(),
              tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=0.002)
          ],
          verbose=1)
    stop = time.time()
    print(f"Elapsed Training time: {stop-start}")

    print("Evaluating...")
    evaluationResult = model.evaluate(test_features, test_labels, batch_size=256, verbose=1)
    print(evaluationResult)

The training “loop” itself then calls this function like this:

model = keras.models.Sequential([
     normalizer,
     # here goes the model definition.
     layers.Dense(units=1)
])
runTrainingAndMeasureTestAccuracy(model, 0.01)

Running a set of measurements results in the following data points:

Learning Rate	Dense Layer 1	Dense Layer 2	Dense Layer 3	Dense Layer 4	Dense Layer 5	Dense Layer 6	Dense Layer 7	Dense Layer 8	Dense Layer 9	Epochs	Training Accuracy	Test Accuracy	Training Duration [s]	Avg Training Time/Epoch
0,01	64	64	128							27	1	0,99994	41,5	1,54 s
0,01	64	64	128							18	1	1	28,5	1,58 s
0,01	64	64	128							15	1	1	23,3	1,55 s
0,01	32	32	128							34	1	0,9997	50,6	1,49 s
0,01	32	32	128							50	0,9963	0,9957	77,7	1,55 s
0,01	32	32	128							50	0,9934	0,9966	74,6	1,49 s
0,01	128	128	128							12	1	1	20,9	1,74 s
0,01	128	128	128							20	1	0,9999	41,4	2,07 s
0,01	128	128	128							9	1	1	15	1,67 s
0,01	128	128								16	1	0,9999	22,7	1,42 s
0,01	128	128								38	0,9993	0,9993	51,8	1,36 s
0,01	128	128								50	0,9998	0,9998	71,6	1,43 s
0,01	256	128								50	0,9961	0,9969	69,8	1,40 s
0,01	256	128								44	1	0,9999	82,3	1,87 s
0,01	256	128								34	1	1	49	1,44 s
0,01	128	128	64							13	1	0,9999	20,7	1,59 s
0,01	128	128	64							11	1	1	17,7	1,61 s
0,01	128	128	64							16	1	1	26,1	1,63 s
0,01	384									50	0,9503	0,9477	91,7	1,83 s
0,01	384									50	0,9472	0,9458	108,2	2,16 s
0,01	384									50	0,9417	0,9414	93,4	1,87 s
0,01	1024									50	0,9506	0,9429	64,8	1,30 s
0,01	1024									50	0,9556	0,9603	82,2	1,64 s
0,01	1024									50	0,9513	0,9554	65,9	1,32 s
0,01	32	32	32	32	32	32				40	0,9997	0,9998	80,4	2,01 s
0,01	32	32	32	32	32	32				23	1	1	45,6	1,98 s
0,01	32	32	32	32	32	32				30	1	1	59,1	1,97 s
0,01	16	16	16	16	16	16	16	16	16	50	0,9808	0,9591	133,5	2,67 s
0,01	16	16	16	16	16	16	16	16	16	50	0,9478	0,9506	137,3	2,75 s
0,01	16	16	16	16	16	16	16	16	16	47	0,9361	0,9417	118,6	2,52 s
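
With three runs per configuration, it helps to average the test accuracy before comparing. The sketch below does this for a few rows copied from the table; note that the table uses decimal commas, which have to be converted first. The helper name and the choice of rows are mine.

```python
from collections import defaultdict
from statistics import fmean

def to_float(cell):
    """Convert a decimal-comma number such as '0,9997' to a float."""
    return float(cell.replace(",", "."))

# (hidden layer sizes, test accuracy) copied from a few rows of the table above.
rows = [
    ((64, 64, 128), "0,99994"), ((64, 64, 128), "1"), ((64, 64, 128), "1"),
    ((384,), "0,9477"), ((384,), "0,9458"), ((384,), "0,9414"),
    ((32,) * 6, "0,9998"), ((32,) * 6, "1"), ((32,) * 6, "1"),
]

by_config = defaultdict(list)
for layer_sizes, accuracy in rows:
    by_config[layer_sizes].append(to_float(accuracy))

mean_accuracy = {cfg: fmean(values) for cfg, values in by_config.items()}
for cfg, mean in mean_accuracy.items():
    print(cfg, round(mean, 5))
```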

Admittedly, this small series of measurements is far too small to support general conclusions, but the following hypotheses appear worth a closer look:

More nodes in the layers do not automatically mean better results.
Likewise, more layers do not automatically yield better results.
Reducing the number of nodes “in the base layers” may slow down training progress.
There seems to be a lower limit on the number of units per Dense layer that is necessary to achieve an accuracy of 1.
With fewer than two hidden layers, an accuracy of 1 could no longer be reached, even if the number of units was increased drastically.
A higher learning rate (0.01 -> 0.05) may lead to training problems.

Looking at which neural networks were able to achieve an accuracy of 1 and mentally comparing them, two more aspects become apparent:

Models that consist only of layers with 32 nodes were able to achieve an accuracy of 1.
A large number of layers with only a small number of units (16) does not guarantee an accuracy of 1. Moreover, models with many layers take longer to train.

In short, there seems to be an optimum between the number of layers and the number of units per layer: two to three hidden layers seem to be most efficient, and going below 64 units per Dense layer does not look very promising either.
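
Coming back to the initial question of how many parameters such a model needs, it is instructive to compute the number of trainable parameters for some of the measured configurations. The sketch below does this in plain Python for 9 input features and one output unit; the function name is made up, and the numbers follow purely from the layer sizes, not from additional measurements.

```python
def count_trainable(hidden_units, input_dim=9, output_units=1):
    """Trainable parameters of a stack of Dense layers (weights plus biases)."""
    total, previous = 0, input_dim
    for units in list(hidden_units) + [output_units]:
        total += previous * units + units
        previous = units
    return total

# Hidden layer sizes taken from the table above.
configurations = [
    [64, 64, 128],
    [128, 128, 64],
    [32] * 6,
    [1024],
    [16] * 9,
]
for hidden in configurations:
    print(hidden, count_trainable(hidden))
```

In these measurements, the single layer with 1024 units has about twice as many parameters as the six layers with 32 units each, yet only the latter reached an accuracy of 1. This suggests that the raw parameter count alone does not decide the outcome.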

Perhaps a later blog post will analyze these aspects further.

Nico's Blog

Hints that matter
