As a first learning task for using AI with Tic-Tac-Toe, let us take the task of determining whether a board is a winning one or not (i.e. either X or O has won the game). We cannot directly tell the neural network the rules that make a board a winning one. Instead, we need to train it by examples. For that, we already prepared some data in a previous post.
The idea is to train a TensorFlow model with several Dense layers. By varying the model’s configuration, we want to determine how complex such a model needs to be (e.g. how many parameters it needs) to fulfill this requirement. The accuracy we can achieve will be an interesting topic as well.
A note in advance: the computations were done using TensorFlow 2.13.1 on a Windows WSL2 machine with an NVIDIA GeForce RTX 4060 (8 GB) installed. Mixed precision was not enabled.
As usual you may download the entire example using the following link:
tictactoetf.zip (9.6 KiB, 274 hits)
Let’s get started…
Reading Data for Training and Evaluation
First of all, we need to read the data for training and evaluation. In the preparation blog post, we already created the test data file tictactoe_valid.txt for that. We will reuse the data loader for further cases as well, which is why we keep it somewhat generic, using pyrecord.
import gzip
from pyrecord import Record # https://pythonhosted.org/pyrecord/
import itertools
TTTRecord = Record.create_type("TTTRecord", "vector", "valid", "winning", "winner", "move")
def tttRecordGenerator():
    # Each line holds the nine board digits, a separator, and four flag digits
    with open("tictactoe_valid.txt", "rt") as f:
        for line in f:
            # print (line)
            vector = [int(ch) for ch in line[:9]]
            valid = line[10] == "1"
            winning = int(line[11])
            winner = int(line[12])
            move = int(line[13])
            yield TTTRecord(vector, valid, winning, winner, move)
validtttRecords = filter(lambda o : o.valid, tttRecordGenerator())
# Warning: will take a couple of seconds!
validtttRecordsList = list(validtttRecords)
len(validtttRecordsList)
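The fixed-column layout that the generator reads can be made explicit in a small parsing helper. The following is only a minimal sketch using the standard library: the name `parse_ttt_line` and the sample line are made up for illustration, and the column positions are simply those used by the loader above.

```python
from collections import namedtuple

# Same field names as the TTTRecord type used in this post
TTTRecord = namedtuple("TTTRecord", "vector valid winning winner move")

def parse_ttt_line(line):
    """Parse one fixed-width line: 9 board digits, a separator, then 4 flag digits."""
    vector = [int(ch) for ch in line[:9]]   # columns 0..8: the board vector
    valid = line[10] == "1"                 # column 10: validity flag
    winning = int(line[11])                 # column 11: winning flag
    winner = int(line[12])                  # column 12: winner
    move = int(line[13])                    # column 13: move
    return TTTRecord(vector, valid, winning, winner, move)

# Hypothetical sample line, not taken from the real data file
sample = "987654132 1110\n"
record = parse_ttt_line(sample)
print(record)
```

Keeping the parsing in one function makes it easy to try out on a single line before running the loader over the whole file.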
Having set up TensorFlow with
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import pandas as pd
import numpy as np
import datetime
we then transform our pyrecord data into a Pandas DataFrame.
allDataFrame = pd.DataFrame(list(zip([x.vector[0] for x in validtttRecordsList],
[x.vector[1] for x in validtttRecordsList],
[x.vector[2] for x in validtttRecordsList],
[x.vector[3] for x in validtttRecordsList],
[x.vector[4] for x in validtttRecordsList],
[x.vector[5] for x in validtttRecordsList],
[x.vector[6] for x in validtttRecordsList],
[x.vector[7] for x in validtttRecordsList],
[x.vector[8] for x in validtttRecordsList],
[x.winning for x in validtttRecordsList])),
columns =['pos1', 'pos2', 'pos3','pos4','pos5','pos6','pos7','pos8','pos9', 'winning'])
print(allDataFrame.tail())
This gives us a first glimpse of the data:
pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 winning
362875 9 8 7 6 5 4 1 3 2 1
362876 9 8 7 6 5 4 2 1 3 0
362877 9 8 7 6 5 4 2 3 1 0
362878 9 8 7 6 5 4 3 1 2 1
362879 9 8 7 6 5 4 3 2 1 1
Essentially, we unpack the vector into columns (these will be our features later on) and place them next to the winning information (which will be our labels later).
Training (Basic Model)
For training, we take the usual random 80% cut. The remainder will serve as the test set for evaluation later. Moreover, we compute the usual descriptive statistics for our training set.
train_dataset = allDataFrame.sample(frac=0.8, random_state=42)
test_dataset = allDataFrame.drop(train_dataset.index)
print(allDataFrame.shape, train_dataset.shape, test_dataset.shape)
train_dataset.describe().transpose()
(362880, 10) (290304, 10) (72576, 10)
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
pos1 | 290304.0 | 5.004037 | 2.582516 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos2 | 290304.0 | 4.999249 | 2.580750 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos3 | 290304.0 | 4.996063 | 2.580538 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos4 | 290304.0 | 5.003014 | 2.581781 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos5 | 290304.0 | 4.996579 | 2.582223 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos6 | 290304.0 | 5.000279 | 2.581773 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos7 | 290304.0 | 4.999029 | 2.583215 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos8 | 290304.0 | 4.998505 | 2.581489 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
pos9 | 290304.0 | 5.003245 | 2.583642 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
winning | 290304.0 | 0.448692 | 0.497361 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
We can easily see that there is a uniform distribution on all positions, and that the winning value behaves like a boolean for categorization.
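The numbers in the table can be cross-checked against what a uniform distribution over the values 1 to 9 predicts. The following is a small sketch using only the standard library; the `baseline` value is our own addition and describes the accuracy of a trivial classifier that always answers "not winning", derived from the mean of the winning column above.

```python
import statistics

values = list(range(1, 10))          # each position takes the values 1..9 equally often

mean = statistics.mean(values)       # expected mean of every pos column
std = statistics.pstdev(values)      # population standard deviation of 1..9

# Share of non-winning boards in the training set (1 - mean of the winning column)
baseline = 1 - 0.448692

print(mean, round(std, 4), round(baseline, 4))
```

The theoretical mean of 5 and standard deviation of about 2.58 agree with the table. The baseline of roughly 0.55 is a useful yardstick: any model we train should be clearly better than that.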
For easier access, we can now split features and labels:
# split features from labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('winning')
test_labels = test_features.pop('winning')
Obviously our features are not normalized yet. That is why we prepare a Keras Normalization Layer.
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())
[[5.004032 4.9992476 4.9960785 5.0030107 4.9965568 5.0002723 4.999036
4.9984956 5.0032406]]
Soon we will make this the first layer of our model.
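Conceptually, the normalization layer shifts each feature to mean 0 and scales it to unit variance, using the statistics collected by `adapt`. The following plain-Python sketch illustrates that computation on a tiny made-up matrix; it is not the Keras implementation, which, among other things, also guards against a zero variance.

```python
import math

# Tiny made-up feature matrix (3 samples, 2 features), for illustration only
rows = [[1.0, 9.0],
        [5.0, 5.0],
        [9.0, 1.0]]

n = len(rows)
columns = list(zip(*rows))
means = [sum(col) / n for col in columns]
variances = [sum((x - m) ** 2 for x in col) / n for col, m in zip(columns, means)]

def normalize(row):
    """Shift each feature to mean 0 and scale it to unit variance."""
    return [(x - m) / math.sqrt(v) for x, m, v in zip(row, means, variances)]

normalized = [normalize(row) for row in rows]
print(means)
print(normalized)
```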
Speaking of the model: for a start, let’s take the following one:
model = keras.models.Sequential([
    normalizer,
    layers.Dense(units=64, activation='relu'),   # 1
    layers.Dense(units=64, activation='relu'),   # 2
    layers.Dense(units=128, activation='relu'),  # 3
    layers.Dense(units=1)
])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
normalization (Normalizati (None, 9) 19
on)
dense (Dense) (None, 64) 640
dense_1 (Dense) (None, 64) 4160
dense_2 (Dense) (None, 128) 8320
dense_3 (Dense) (None, 1) 129
=================================================================
Total params: 13268 (51.83 KB)
Trainable params: 13249 (51.75 KB)
Non-trainable params: 19 (80.00 Byte)
_________________________________________________________________
None
As you can see, after the normalizer, we have two Dense layers with 64 units each, followed by a Dense layer with 128 units. As we want to have a single boolean-like result (“winning or not”), the result layer is a Dense layer with a single unit. As we go with the logit approach, there is no activation function on the result layer. Let’s see how far we get with this.
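The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input/unit pair plus one bias per unit. The sketch below (the helper name `dense_params` is our own) recomputes the numbers. The 19 non-trainable parameters are assumed to be the mean and variance per feature plus one sample counter kept by the normalization layer.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

layer_units = [64, 64, 128, 1]   # the three hidden layers plus the output layer
inputs = 9                       # nine board positions

trainable = 0
for units in layer_units:
    count = dense_params(inputs, units)
    print(f"Dense({units}): {count}")
    trainable += count
    inputs = units               # the next layer sees this layer's outputs

# Assumption: 9 means + 9 variances + 1 count in the normalization layer
non_trainable = 9 + 9 + 1

print(trainable, non_trainable, trainable + non_trainable)
```

The totals agree with the 13249 trainable and 13268 overall parameters reported by `model.summary()`.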
To avoid training longer than necessary (and thereby reduce the risk of overfitting), we stop fitting at the latest once the training accuracy reaches 1.0. For that we define a brief custom callback:
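class Accuracy1Stopping(keras.callbacks.Callback):
    def __init__(self):
        super().__init__()

    def on_epoch_end(self, epoch, logs=None):
        # Stop as soon as the training accuracy, rounded to four decimals, reaches 1.0
        accuracy = (logs or {}).get('accuracy')
        if accuracy is not None and round(accuracy, 4) == 1.0:
            self.model.stop_training = True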
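Because the callback rounds to four decimals, "accuracy 1.0" really means "at least 0.99995". The following sketch shows what that implies for our training set size; `should_stop` is a stand-alone copy of the condition for illustration, and the accuracy is computed exactly here, whereas Keras reports a running float value.

```python
train_samples = 290304  # size of the training set from above

def should_stop(accuracy):
    """Same test as in Accuracy1Stopping: accuracy rounded to 4 decimals equals 1.0."""
    return round(accuracy, 4) == 1.0

for wrong in (0, 10, 14, 15, 30):
    accuracy = (train_samples - wrong) / train_samples
    print(wrong, f"{accuracy:.6f}", should_stop(accuracy))
```

In other words, training may stop while a handful of training boards (up to 14 under these assumptions) are still misclassified.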
Then let’s compile and fit the model right away:
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=keras.optimizers.Adam(learning_rate=0.01),
              metrics=["accuracy"])
history = model.fit(train_features, train_labels,
                    batch_size=512,
                    epochs=50,  # the log below counts up to 50 epochs
                    shuffle=True,
                    callbacks=[
                        tf.keras.callbacks.EarlyStopping(monitor='accuracy', mode="max", restore_best_weights=True, patience=5, verbose=1),
                        Accuracy1Stopping(),
                        tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=0.002)
                    ],
                    verbose=1)
Epoch 1/50
567/567 [==============================] - 6s 7ms/step - loss: 0.5521 - accuracy: 0.6821 - lr: 0.0100
Epoch 2/50
567/567 [==============================] - 3s 5ms/step - loss: 0.3881 - accuracy: 0.7986 - lr: 0.0100
Epoch 3/50
567/567 [==============================] - 3s 5ms/step - loss: 0.2644 - accuracy: 0.8709 - lr: 0.0100
Epoch 4/50
567/567 [==============================] - 3s 5ms/step - loss: 0.1828 - accuracy: 0.9161 - lr: 0.0100
Epoch 5/50
567/567 [==============================] - 3s 5ms/step - loss: 0.1332 - accuracy: 0.9435 - lr: 0.0100
Epoch 6/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0902 - accuracy: 0.9631 - lr: 0.0100
Epoch 7/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0713 - accuracy: 0.9716 - lr: 0.0100
Epoch 8/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0696 - accuracy: 0.9740 - lr: 0.0100
Epoch 9/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0528 - accuracy: 0.9808 - lr: 0.0100
Epoch 10/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0445 - accuracy: 0.9841 - lr: 0.0100
Epoch 11/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0496 - accuracy: 0.9836 - lr: 0.0100
Epoch 12/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0285 - accuracy: 0.9907 - lr: 0.0100
Epoch 13/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0433 - accuracy: 0.9860 - lr: 0.0100
Epoch 14/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0181 - accuracy: 0.9955 - lr: 0.0100
Epoch 15/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0552 - accuracy: 0.9852 - lr: 0.0100
Epoch 16/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0022 - accuracy: 0.9999 - lr: 0.0100
Epoch 17/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0032 - accuracy: 0.9995 - lr: 0.0100
Epoch 18/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0914 - accuracy: 0.9779 - lr: 0.0100
Epoch 19/50
567/567 [==============================] - 3s 5ms/step - loss: 0.0030 - accuracy: 0.9999 - lr: 0.0020
Epoch 20/50
567/567 [==============================] - 3s 6ms/step - loss: 0.0022 - accuracy: 1.0000 - lr: 0.0020
Note that the training finished in less than 63 s, and we achieved a mind-blowing accuracy of 1.0 after only 20 epochs!
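The drop of the learning rate from 0.01 to 0.002 in epoch 19 can be retraced from the logged losses. The sketch below is a simplified re-implementation of the plateau rule, not the Keras code itself: the function name is our own, and `min_delta=1e-4` is assumed to match the Keras default.

```python
# Training losses of epochs 1..20, copied from the log above
losses = [0.5521, 0.3881, 0.2644, 0.1828, 0.1332, 0.0902, 0.0713, 0.0696,
          0.0528, 0.0445, 0.0496, 0.0285, 0.0433, 0.0181, 0.0552, 0.0022,
          0.0032, 0.0914, 0.0030, 0.0022]

def simulate_reduce_lr(losses, lr=0.01, factor=0.2, patience=2,
                       min_lr=0.002, min_delta=1e-4):
    """Simplified plateau rule: shrink lr after `patience` epochs without improvement."""
    best = float("inf")
    wait = 0
    lr_per_epoch = []
    for loss in losses:
        lr_per_epoch.append(lr)          # learning rate in effect during this epoch
        if loss < best - min_delta:      # loss improved noticeably
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:         # plateau detected: reduce, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return lr_per_epoch

for epoch, lr in enumerate(simulate_reduce_lr(losses), start=1):
    print(epoch, round(lr, 4))
```

Epochs 17 and 18 are the first two consecutive epochs without a new best loss, so the rate is reduced before epoch 19. Since 0.01 times 0.2 already equals `min_lr`, the rate cannot drop any further with these settings.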
Evaluation (Basic Model)
Let’s use the test dataset to check that the model has not tricked us:
evaluationResult = model.evaluate(test_features, test_labels, batch_size=256, verbose=1)
print(evaluationResult)
284/284 [==============================] - 4s 10ms/step - loss: 0.0021 - accuracy: 1.0000
[0.002135399729013443, 1.0]
For the test dataset, the accuracy is 1.0 as well! That impressively confirms that the neural network was able to learn what a winning board is.
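Since the output layer has no activation, the model returns logits rather than probabilities. To interpret a single prediction, the logit can be passed through a sigmoid. The following sketch uses made-up logit values purely for illustration:

```python
import math

def sigmoid(logit):
    """Map a raw model output (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up logits, as they might come out of model.predict(...)
logits = [-4.2, -0.3, 0.0, 0.5, 6.1]

for z in logits:
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0   # p > 0.5 is the same as logit > 0
    print(f"logit={z:5.1f}  probability={p:.3f}  winning={label}")
```

One caveat that is worth verifying for your TensorFlow version: as far as we know, the plain "accuracy" metric thresholds the raw output at 0.5, which for logits corresponds to a probability of about 0.62 rather than 0.5. Using `keras.metrics.BinaryAccuracy(threshold=0.0)` instead should give the accuracy for a 0.5 probability cut.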
BTW: This Keras neural network can be downloaded here.
tic-tac-toe-Winning-Model.zip (134.8 KiB, 161 hits)
Sensitivity Analysis
Now, let’s play around a little with the model and see how sensitive our model configuration is. For that, we automate the evaluation in a function, which reads like this:
import time
def runTrainingAndMeasureTestAccuracy(model, learning_rate):
    model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  metrics=["accuracy"])
    start = time.time()
    history = model.fit(train_features, train_labels,
                        batch_size=512,
                        epochs=50,
                        shuffle=True,
                        callbacks=[
                            tf.keras.callbacks.EarlyStopping(monitor='accuracy', mode="max", patience=5, verbose=1),
                            Accuracy1Stopping(),
                            tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=0.002)
                        ],
                        verbose=1)
    stop = time.time()
    print(f"Elapsed Training time: {stop-start}")
    print("Evaluating...")
    evaluationResult = model.evaluate(test_features, test_labels, batch_size=256, verbose=1)
    print(evaluationResult)
The training “loop” itself then calls this function like this:
model = keras.models.Sequential([
normalizer,
# here goes the model definition.
layers.Dense(units=1)
])
runTrainingAndMeasureTestAccuracy(model, 0.01)
Running a set of measurements, this results in the following data points:
Learning Rate | Dense Layer 1 | Dense Layer 2 | Dense Layer 3 | Dense Layer 4 | Dense Layer 5 | Dense Layer 6 | Dense Layer 7 | Dense Layer 8 | Dense Layer 9 | Epochs | Training Accuracy | Test Accuracy | Training Duration [s] | Avg Training Time/Epoch [s] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 64 | 64 | 128 | | | | | | | 27 | 1 | 0.99994 | 41.5 | 1.54 |
0.01 | 64 | 64 | 128 | | | | | | | 18 | 1 | 1 | 28.5 | 1.58 |
0.01 | 64 | 64 | 128 | | | | | | | 15 | 1 | 1 | 23.3 | 1.55 |
0.01 | 32 | 32 | 128 | | | | | | | 34 | 1 | 0.9997 | 50.6 | 1.49 |
0.01 | 32 | 32 | 128 | | | | | | | 50 | 0.9963 | 0.9957 | 77.7 | 1.55 |
0.01 | 32 | 32 | 128 | | | | | | | 50 | 0.9934 | 0.9966 | 74.6 | 1.49 |
0.01 | 128 | 128 | 128 | | | | | | | 12 | 1 | 1 | 20.9 | 1.74 |
0.01 | 128 | 128 | 128 | | | | | | | 20 | 1 | 0.9999 | 41.4 | 2.07 |
0.01 | 128 | 128 | 128 | | | | | | | 9 | 1 | 1 | 15 | 1.67 |
0.01 | 128 | 128 | | | | | | | | 16 | 1 | 0.9999 | 22.7 | 1.42 |
0.01 | 128 | 128 | | | | | | | | 38 | 0.9993 | 0.9993 | 51.8 | 1.36 |
0.01 | 128 | 128 | | | | | | | | 50 | 0.9998 | 0.9998 | 71.6 | 1.43 |
0.01 | 256 | 128 | | | | | | | | 50 | 0.9961 | 0.9969 | 69.8 | 1.40 |
0.01 | 256 | 128 | | | | | | | | 44 | 1 | 0.9999 | 82.3 | 1.87 |
0.01 | 256 | 128 | | | | | | | | 34 | 1 | 1 | 49 | 1.44 |
0.01 | 128 | 128 | 64 | | | | | | | 13 | 1 | 0.9999 | 20.7 | 1.59 |
0.01 | 128 | 128 | 64 | | | | | | | 11 | 1 | 1 | 17.7 | 1.61 |
0.01 | 128 | 128 | 64 | | | | | | | 16 | 1 | 1 | 26.1 | 1.63 |
0.01 | 384 | | | | | | | | | 50 | 0.9503 | 0.9477 | 91.7 | 1.83 |
0.01 | 384 | | | | | | | | | 50 | 0.9472 | 0.9458 | 108.2 | 2.16 |
0.01 | 384 | | | | | | | | | 50 | 0.9417 | 0.9414 | 93.4 | 1.87 |
0.01 | 1024 | | | | | | | | | 50 | 0.9506 | 0.9429 | 64.8 | 1.30 |
0.01 | 1024 | | | | | | | | | 50 | 0.9556 | 0.9603 | 82.2 | 1.64 |
0.01 | 1024 | | | | | | | | | 50 | 0.9513 | 0.9554 | 65.9 | 1.32 |
0.01 | 32 | 32 | 32 | 32 | 32 | 32 | | | | 40 | 0.9997 | 0.9998 | 80.4 | 2.01 |
0.01 | 32 | 32 | 32 | 32 | 32 | 32 | | | | 23 | 1 | 1 | 45.6 | 1.98 |
0.01 | 32 | 32 | 32 | 32 | 32 | 32 | | | | 30 | 1 | 1 | 59.1 | 1.97 |
0.01 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 50 | 0.9808 | 0.9591 | 133.5 | 2.67 |
0.01 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 50 | 0.9478 | 0.9506 | 137.3 | 2.75 |
0.01 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 47 | 0.9361 | 0.9417 | 118.6 | 2.52 |
Admittedly, this series of measurements is far too small to allow general conclusions, but the following hypotheses seem worth a closer look:
- More nodes in the layers do not automatically mean better results.
- Likewise, a higher number of layers does not automatically yield better results.
- Reducing the number of nodes “in the base layers” may mean slower training progress.
- There seems to be a lower limit on the number of units in the Dense layers that is necessary to achieve an accuracy of 1.
- With fewer than two hidden layers, an accuracy of 1 was no longer reached, even when the number of units was increased drastically.
- A higher learning rate (0.01 -> 0.05) may lead to training problems.
Looking at which neural networks were able to achieve an accuracy of 1 and comparing them, two more aspects become apparent:
- Models that consist only of layers with 32 nodes were able to achieve an accuracy of 1.
- A large number of layers with only a small number of units (16) does not guarantee an accuracy of 1. Moreover, models with many layers take longer to train.
In short, there seems to be an optimum between the number of layers and the units per layer: two to three layers seem to be most efficient, and going below 64 units per Dense layer is not very promising either.
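To relate these observations to model complexity, the measured test accuracies can be set against the number of trainable parameters of each configuration. The following is only a sketch: the accuracies are copied from the table above, the helper `trainable_params` is our own, and three runs per configuration are of course a very thin basis.

```python
# Test accuracies of the three runs per configuration, copied from the table above
runs = {
    (64, 64, 128): [0.99994, 1.0, 1.0],
    (32, 32, 128): [0.9997, 0.9957, 0.9966],
    (128, 128, 128): [1.0, 0.9999, 1.0],
    (128, 128): [0.9999, 0.9993, 0.9998],
    (256, 128): [0.9969, 0.9999, 1.0],
    (128, 128, 64): [0.9999, 1.0, 1.0],
    (384,): [0.9477, 0.9458, 0.9414],
    (1024,): [0.9429, 0.9603, 0.9554],
    (32,) * 6: [0.9998, 1.0, 1.0],
    (16,) * 9: [0.9591, 0.9506, 0.9417],
}

def trainable_params(hidden_units, inputs=9):
    """Dense parameters: weights plus biases per layer, ending in one output unit."""
    total = 0
    for units in list(hidden_units) + [1]:
        total += inputs * units + units
        inputs = units
    return total

# List the configurations from the smallest to the largest model
for hidden_units, accuracies in sorted(runs.items(),
                                       key=lambda item: trainable_params(item[0])):
    mean_accuracy = sum(accuracies) / len(accuracies)
    print(f"{trainable_params(hidden_units):6d} params  "
          f"{len(hidden_units)} layers  "
          f"mean test accuracy {mean_accuracy:.4f}  {hidden_units}")
```

The listing shows, for example, that the single layer with 1024 units has about twice as many parameters as the six layers with 32 units each, yet stays near a test accuracy of 0.95. Parameter count alone therefore does not explain the results.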
Perhaps we will analyze these aspects further in a later blog post.