Tic-Tac-Toe and AI: Wrapping Up the Four Models - Multi-Output

As we have seen in the previous blog posts, neural networks can determine whether a board has a winner, which player the winner is, and with which move the winner won the game. These networks only require Dense layers. However, each case has an inherent complexity, so a certain “minimal” number of layers, and of units in these layers, is necessary:

Use Case	Layers	Dense Layer Configuration	Count of Parameters
Winning	3	64/64/128	13.3k
Winner	1	40	460
Move	2	80/128	11.7k

This suggests a kind of “complexity ranking” for the three challenges: the task with the highest complexity is the “winning problem”, followed by the “move problem”. The “winner problem” is the easiest to solve, because a single small Dense layer is enough to answer it accurately.
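The parameter counts in the table can be reproduced with a little arithmetic. The sketch below assumes that each of the earlier models takes the 9 board positions as input, passes them through a Normalization layer (which stores 2·9 + 1 = 19 non-trainable values) and ends in a Dense output layer with 1 unit (winning, winner) or 4 units (move). The helper name count_parameters is made up for this illustration.

```python
def count_parameters(hidden_units, outputs, inputs=9, normalizer=19):
    """Count weights and biases of a stack of Dense layers.

    Assumption: a Normalization layer in front of the first Dense layer
    contributes `normalizer` non-trainable values (mean, variance, count).
    """
    total = normalizer
    previous = inputs
    for units in list(hidden_units) + [outputs]:
        total += previous * units + units  # weights + biases
        previous = units
    return total


models = {
    "Winning": ([64, 64, 128], 1),
    "Winner": ([40], 1),
    "Move": ([80, 128], 4),
}

for name, (hidden, outputs) in models.items():
    print(name, count_parameters(hidden, outputs))
# Winning 13268
# Winner 460
# Move 11703
```

Under these assumptions the results match the rounded values 13.3k, 460 and 11.7k from the table.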

Using the principle of multi-output, you may ask whether it is possible to retrieve all three pieces of information from a single model. Well, let us try this:

You may download these commands again from this Jupyter Notebook:

tic-tac-toe-Multi-notebook.zip (6.4 KiB)

Having done the usual imports and initialization, let’s define our dataset like this:

allDataFrame = pd.DataFrame(list(zip([x.vector[0] for x in validtttRecordsList], 
                                     [x.vector[1] for x in validtttRecordsList],
                                     [x.vector[2] for x in validtttRecordsList],
                                     [x.vector[3] for x in validtttRecordsList],
                                     [x.vector[4] for x in validtttRecordsList],
                                     [x.vector[5] for x in validtttRecordsList],
                                     [x.vector[6] for x in validtttRecordsList],
                                     [x.vector[7] for x in validtttRecordsList],
                                     [x.vector[8] for x in validtttRecordsList],
                                     [x.winning for x in validtttRecordsList],
                                     [x.winner for x in validtttRecordsList],
                                     [1 if x.move == 3 else 0 for x in validtttRecordsList],
                                     [1 if x.move == 4 else 0 for x in validtttRecordsList],
                                     [1 if x.move == 5 else 0 for x in validtttRecordsList],
                                     [1 if x.move == 9 else 0 for x in validtttRecordsList]
                                    )),
             columns =['pos1', 'pos2', 'pos3','pos4','pos5','pos6','pos7','pos8','pos9', 'winning', 'winner', 'move3', 'move4', 'move5', 'move9'])

print(allDataFrame.tail())
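The four move columns above are a hand-written one-hot encoding of the move attribute. As a minimal sketch of that idea, independent of pandas, a helper (hypothetical name one_hot_move) could look like this; the class list [3, 4, 5, 9] is taken from the column names above.

```python
MOVE_CLASSES = [3, 4, 5, 9]  # values behind the move3/move4/move5/move9 columns


def one_hot_move(move, classes=MOVE_CLASSES):
    """Return a list with a 1 at the position of `move` and 0 elsewhere."""
    return [1 if move == c else 0 for c in classes]


print(one_hot_move(4))  # [0, 1, 0, 0]
print(one_hot_move(9))  # [0, 0, 0, 1]
```

As in the DataFrame code, a move value outside the class list would yield a row of zeros.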

We again do the usual 80/20 split between training and test data:

train_dataset = allDataFrame.sample(frac=0.8, random_state=42)
test_dataset = allDataFrame.drop(train_dataset.index)

print(allDataFrame.shape, train_dataset.shape, test_dataset.shape)
train_dataset.describe().transpose()

Similar to what we did before, we now split off the three different labels from the features:

train_features = train_dataset.copy()
test_features = test_dataset.copy()

winning_train_labels = train_features.pop('winning')
winning_test_labels = test_features.pop('winning')

winner_train_labels = train_features.pop('winner')
winner_test_labels = test_features.pop('winner')

moveColumns = ['move3', 'move4', 'move5', 'move9']

move_train_labels = train_features[moveColumns].copy()
train_features = train_features.drop(moveColumns, axis=1)
move_test_labels = test_features[moveColumns].copy()
test_features = test_features.drop(moveColumns, axis=1)

print(train_features)
print(winning_train_labels)
print(winner_train_labels)
print(move_train_labels)

This will give us data samples like this:

        pos1  pos2  pos3  pos4  pos5  pos6  pos7  pos8  pos9
329603     9     2     4     7     6     3     8     5     1
56013      2     5     1     8     7     9     4     6     3
296034     8     3     7     1     9     6     2     4     5
224081     6     5     4     2     3     8     9     7     1
245695     7     1     8     3     5     4     2     9     6
...      ...   ...   ...   ...   ...   ...   ...   ...   ...
177025     5     4     1     9     3     2     6     8     7
204707     6     1     7     3     9     4     8     5     2
312923     8     7     1     5     6     3     9     4     2
47930      2     3     6     7     5     1     8     4     9
323085     9     1     2     7     4     8     5     6     3

[290304 rows x 9 columns]
329603    0
56013     1
296034    1
224081    1
245695    1
         ..
177025    0
204707    1
312923    1
47930     1
323085    0
Name: winning, Length: 290304, dtype: int64
329603    0
56013     0
296034    0
224081    1
245695    0
         ..
177025    0
204707    0
312923    1
47930     0
323085    0
Name: winner, Length: 290304, dtype: int64
        move3  move4  move5  move9
329603      0      0      0      1
56013       0      1      0      0
296034      1      0      0      0
224081      1      0      0      0
245695      0      1      0      0
...       ...    ...    ...    ...
177025      0      0      0      1
204707      0      1      0      0
312923      1      0      0      0
47930       0      1      0      0
323085      0      0      0      1

[290304 rows x 4 columns]
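The row counts fit together if the dataset contains every possible order of filling the nine cells, i.e. 9! games. That is an assumption here; it is consistent with the samples above, in which each row is a permutation of 1 to 9. A quick check, which also covers the batch sizes of 512 (training) and 256 (evaluation) used further down:

```python
import math

total_games = math.factorial(9)        # 362880 permutations of nine moves
train_rows = round(total_games * 0.8)  # 80 % training share
test_rows = total_games - train_rows   # remaining 20 %

print(total_games, train_rows, test_rows)  # 362880 290304 72576
print(math.ceil(train_rows / 512))         # 567 training steps per epoch
print(math.ceil(test_rows / 256))          # 284 evaluation steps
```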

Next, we set up the usual normalizer:

normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())
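The Normalization layer learns the mean and variance of every input column during adapt() and then maps each value x to (x - mean) / sqrt(variance). The following plain-Python sketch mimics that computation for a single column; it illustrates the formula and is not Keras code. It assumes a column that contains the values 1 to 9 equally often, which holds if all move orders occur in the data.

```python
import math


def adapt(column):
    """Return mean and (population) variance of a list of numbers."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, variance


def normalize(x, mean, variance):
    """Shift to zero mean and scale to unit variance."""
    return (x - mean) / math.sqrt(variance)


column = list(range(1, 10))  # the values 1..9, each exactly once
mean, variance = adapt(column)
print(mean, round(variance, 4))                # 5.0 6.6667
print(round(normalize(9, mean, variance), 4))  # 1.5492
```

After normalization the inputs are centred around zero, so position 5 maps to 0 and positions 1 and 9 map to values of equal size and opposite sign.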

We may define our multi-output model like this:

inputs = keras.Input(shape=(9,))  # one-element tuple: nine board positions per sample
dense1 = keras.layers.Dense(64, activation='relu')
dense2 = keras.layers.Dense(80, activation='relu')
dense3 = keras.layers.Dense(256, activation='relu')

denseWinningOutput = keras.layers.Dense(1, name="winning_output")
denseWinnerOutput = keras.layers.Dense(1, name="winner_output")

denseMoveOutput = keras.layers.Dense(4, name="move_output")

x = dense1(normalizer(inputs))
x = dense2(x)
x = dense3(x)

outputWinning = denseWinningOutput(x)
outputWinner = denseWinnerOutput(x)
outputMove = denseMoveOutput(x)

model = keras.Model(inputs=inputs, outputs=[outputWinning, outputWinner, outputMove], name="tictactoe_model")

print(model.summary())

Note that

a Sequential model is no longer sufficient.
We have three different “final layers” for representing our results.
For accessing the three different outputs, we also have to give them names: winning_output, winner_output and move_output.
Naturally, we would initially size the hidden Dense layers according to the requirements of the “most complex problem”, i.e. the winning use case, and therefore take a 64/64/128 layering. However, tests have shown that this is not sufficient. Even a 64/64/256 layering hardly manages to reach a 1.0/1.0/1.0 accuracy within 100 epochs. That is why we define a 64/80/256 layering.

This model then has around 28k parameters to train:

Model: "tictactoe_model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_5 (InputLayer)           [(None, 9)]          0           []                               
                                                                                                  
 normalization (Normalization)  (None, 9)            19          ['input_5[0][0]']                
                                                                                                  
 dense_12 (Dense)               (None, 64)           640         ['normalization[4][0]']          
                                                                                                  
 dense_13 (Dense)               (None, 80)           5200        ['dense_12[0][0]']               
                                                                                                  
 dense_14 (Dense)               (None, 256)          20736       ['dense_13[0][0]']               
                                                                                                  
 winning_output (Dense)         (None, 1)            257         ['dense_14[0][0]']               
                                                                                                  
 winner_output (Dense)          (None, 1)            257         ['dense_14[0][0]']               
                                                                                                  
 move_output (Dense)            (None, 4)            1028        ['dense_14[0][0]']               
                                                                                                  
==================================================================================================
Total params: 28,137
Trainable params: 28,118
Non-trainable params: 19
__________________________________________________________________________________________________
None
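The figures in the summary can be checked by hand: a Dense layer with n inputs and m units has n·m weights plus m biases. The sketch below recomputes them for the shared 64/80/256 trunk with its three output heads and, for comparison, for a 64/64/256 trunk. The function names are made up for this illustration.

```python
def dense_params(inputs, units):
    """Weights plus biases of one Dense layer."""
    return inputs * units + units


def multi_output_params(hidden, heads=(1, 1, 4), inputs=9):
    """Trainable parameters of a shared Dense trunk with several output heads."""
    total, previous = 0, inputs
    for units in hidden:
        total += dense_params(previous, units)
        previous = units
    # every head is connected to the last hidden layer
    total += sum(dense_params(previous, h) for h in heads)
    return total


NORMALIZER = 19  # non-trainable: 9 means, 9 variances, 1 count

print(multi_output_params([64, 80, 256]))               # 28118 trainable
print(multi_output_params([64, 80, 256]) + NORMALIZER)  # 28137 in total
print(multi_output_params([64, 64, 256]) + NORMALIZER)  # 23001 in total
```

Widening only the second hidden layer from 64 to 80 units therefore costs roughly 5k additional parameters.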

For compiling (and later training) the model, we need loss functions. As each output has its own loss value, and the output types differ (binary vs. categorical), we need a separate loss function for each output.

losses = {
    "winning_output": keras.losses.BinaryCrossentropy(from_logits=True),
    "winner_output": keras.losses.BinaryCrossentropy(from_logits=True),
    "move_output": keras.losses.CategoricalCrossentropy(from_logits=True)
}

model.compile(loss=losses,
              optimizer=keras.optimizers.Adam(learning_rate=0.015), 
              metrics = ["accuracy"])
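Because the three output layers have no activation function, the model returns raw logits, which is why the loss functions are created with from_logits=True. To read a prediction, the logits still have to be converted: a sigmoid for the two binary outputs, a softmax (or simply the arg-max) for the move output. A minimal plain-Python sketch, with made-up logit values and a hypothetical helper name decode:

```python
import math

MOVE_CLASSES = [3, 4, 5, 9]  # order of the move3/move4/move5/move9 columns


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    return [e / sum(exps) for e in exps]


def decode(winning_logit, winner_logit, move_logits):
    """Turn the raw outputs for one board into readable values."""
    probabilities = softmax(move_logits)
    best = probabilities.index(max(probabilities))
    return {
        "winning": sigmoid(winning_logit) > 0.5,
        "winner": int(sigmoid(winner_logit) > 0.5),
        "move": MOVE_CLASSES[best],
    }


# illustrative logits, not taken from a trained model
print(decode(4.2, -3.1, [-2.0, 5.5, 0.3, -1.0]))
# {'winning': True, 'winner': 0, 'move': 4}
```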

Moreover, when training, we want to make sure that we don’t stop as soon as the first output has reached an accuracy of 1.0 (rounded to 5 digits), but only once all of them have reached that limit. That is why we need an adjusted Accuracy Stopping callback:

class Accuracy1Stopping(keras.callbacks.Callback):
    """Stop training once all three outputs reach an accuracy of 1.0."""

    def __init__(self):
        super().__init__()

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        names = ['winning_output_accuracy', 'winner_output_accuracy', 'move_output_accuracy']
        # a missing metric counts as "not yet perfect"
        if all(round(logs.get(name, 0.0), 5) == 1.0 for name in names):
            self.model.stop_training = True
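The stop condition itself is independent of Keras and can be tried out on a plain dictionary. The sketch below uses a hypothetical helper all_outputs_perfect that generalises the check to any list of metric names and treats a missing metric as "not yet perfect":

```python
METRICS = [
    "winning_output_accuracy",
    "winner_output_accuracy",
    "move_output_accuracy",
]


def all_outputs_perfect(logs, metrics=METRICS, digits=5):
    """True only if every listed accuracy rounds to 1.0."""
    logs = logs or {}
    return all(round(logs.get(name, 0.0), digits) == 1.0 for name in metrics)


# values in the style of the Keras epoch logs
epoch_a = {"winning_output_accuracy": 0.9995,
           "winner_output_accuracy": 1.0,
           "move_output_accuracy": 0.9996}
epoch_b = {"winning_output_accuracy": 1.0,
           "winner_output_accuracy": 0.999999,
           "move_output_accuracy": 1.0}

print(all_outputs_perfect(epoch_a))  # False
print(all_outputs_perfect(epoch_b))  # True
```

Note that rounding to 5 digits lets an accuracy of 0.999999 pass as 1.0, so a handful of misclassified training samples can still slip through.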

We then can initiate training:

history = model.fit(train_features, y = {
        "winning_output": winning_train_labels, 
        "winner_output": winner_train_labels,
        "move_output": move_train_labels
    },
    batch_size=512, 
    epochs=100, 
    shuffle=True,
    callbacks=[
      Accuracy1Stopping(),
      tf.keras.callbacks.ReduceLROnPlateau(monitor='winning_output_loss', factor=0.7, patience=6, min_lr=0.0015),
      tensorboard_callback
    ],
    verbose=1)
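ReduceLROnPlateau multiplies the learning rate by factor whenever the monitored loss has not improved for patience epochs, but never goes below min_lr. With the values above, the possible learning rates form a short ladder, which the following sketch enumerates (plain Python, no Keras involved; the function name is made up):

```python
def learning_rate_ladder(start, factor, min_lr):
    """List the learning rates reachable by repeated plateau reductions."""
    rates = [start]
    while rates[-1] > min_lr:
        rates.append(max(rates[-1] * factor, min_lr))
    return rates


ladder = learning_rate_ladder(start=0.015, factor=0.7, min_lr=0.0015)
print(len(ladder) - 1)  # 7 reductions until the floor is reached
print([round(r, 5) for r in ladder])
```

Since only the winning_output_loss is monitored, a plateau in one of the other two outputs alone does not trigger a reduction.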

Admittedly, this training is much more fragile and significantly more sensitive to the initial random values of the model. It may therefore take a couple of attempts until a “good model” is trained properly, which then ends up like this:

Epoch 1/100
  5/567 [..............................] - ETA: 8s - loss: 2.2032 - winning_output_loss: 0.6967 - winner_output_loss: 0.4011 - move_output_loss: 1.1054 - winning_output_accuracy: 0.5570 - winner_output_accuracy: 0.9066 - move_output_accuracy: 0.5039   WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0044s vs `on_train_batch_end` time: 0.0296s). Check your callbacks.
567/567 [==============================] - 10s 14ms/step - loss: 1.5532 - winning_output_loss: 0.5537 - winner_output_loss: 0.2232 - move_output_loss: 0.7764 - winning_output_accuracy: 0.6820 - winner_output_accuracy: 0.9250 - move_output_accuracy: 0.6664 - lr: 0.0150
Epoch 2/100
567/567 [==============================] - 8s 14ms/step - loss: 1.0364 - winning_output_loss: 0.3951 - winner_output_loss: 0.1086 - move_output_loss: 0.5327 - winning_output_accuracy: 0.7946 - winner_output_accuracy: 0.9563 - move_output_accuracy: 0.7795 - lr: 0.0150
Epoch 3/100
567/567 [==============================] - 8s 14ms/step - loss: 0.6892 - winning_output_loss: 0.2779 - winner_output_loss: 0.0443 - move_output_loss: 0.3670 - winning_output_accuracy: 0.8664 - winner_output_accuracy: 0.9833 - move_output_accuracy: 0.8516 - lr: 0.0150
Epoch 4/100
567/567 [==============================] - 8s 14ms/step - loss: 0.4397 - winning_output_loss: 0.1836 - winner_output_loss: 0.0195 - move_output_loss: 0.2366 - winning_output_accuracy: 0.9187 - winner_output_accuracy: 0.9940 - move_output_accuracy: 0.9080 - lr: 0.0150
Epoch 5/100
567/567 [==============================] - 8s 14ms/step - loss: 0.2816 - winning_output_loss: 0.1173 - winner_output_loss: 0.0149 - move_output_loss: 0.1494 - winning_output_accuracy: 0.9516 - winner_output_accuracy: 0.9960 - move_output_accuracy: 0.9455 - lr: 0.0150
Epoch 6/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1702 - winning_output_loss: 0.0711 - winner_output_loss: 0.0134 - move_output_loss: 0.0857 - winning_output_accuracy: 0.9745 - winner_output_accuracy: 0.9967 - move_output_accuracy: 0.9731 - lr: 0.0150
Epoch 7/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1674 - winning_output_loss: 0.0681 - winner_output_loss: 0.0129 - move_output_loss: 0.0864 - winning_output_accuracy: 0.9788 - winner_output_accuracy: 0.9967 - move_output_accuracy: 0.9767 - lr: 0.0150
Epoch 8/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0143 - winning_output_loss: 0.0078 - winner_output_loss: 6.8342e-05 - move_output_loss: 0.0064 - winning_output_accuracy: 0.9984 - winner_output_accuracy: 1.0000 - move_output_accuracy: 0.9990 - lr: 0.0150
Epoch 9/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1652 - winning_output_loss: 0.0615 - winner_output_loss: 0.0167 - move_output_loss: 0.0870 - winning_output_accuracy: 0.9831 - winner_output_accuracy: 0.9965 - move_output_accuracy: 0.9794 - lr: 0.0150
Epoch 10/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0089 - winning_output_loss: 0.0046 - winner_output_loss: 1.6532e-04 - move_output_loss: 0.0041 - winning_output_accuracy: 0.9991 - winner_output_accuracy: 0.9999 - move_output_accuracy: 0.9994 - lr: 0.0150
Epoch 11/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1202 - winning_output_loss: 0.0445 - winner_output_loss: 0.0128 - move_output_loss: 0.0629 - winning_output_accuracy: 0.9881 - winner_output_accuracy: 0.9972 - move_output_accuracy: 0.9850 - lr: 0.0150
Epoch 12/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0054 - winning_output_loss: 0.0027 - winner_output_loss: 1.0714e-04 - move_output_loss: 0.0025 - winning_output_accuracy: 0.9995 - winner_output_accuracy: 1.0000 - move_output_accuracy: 0.9996 - lr: 0.0150
Epoch 13/100
567/567 [==============================] - 8s 14ms/step - loss: 8.3936e-04 - winning_output_loss: 4.9668e-04 - winner_output_loss: 5.7897e-06 - move_output_loss: 3.3689e-04 - winning_output_accuracy: 1.0000 - winner_output_accuracy: 1.0000 - move_output_accuracy: 1.0000 - lr: 0.0150

Let’s evaluate the model again with our test data:

print(test_features)
ytest={
    "winning_output": winning_test_labels,
    "winner_output": winner_test_labels,
    "move_output": move_test_labels
}
print(ytest)
evaluationResult = model.evaluate(test_features, y = ytest, batch_size=256, verbose=1)
print(evaluationResult)

This yields a rocking 1.0/1.0/1.0 accuracy:

284/284 [==============================] - 2s 8ms/step - loss: 6.3467e-04 - winning_output_loss: 3.7916e-04 - winner_output_loss: 4.0316e-06 - move_output_loss: 2.5148e-04 - winning_output_accuracy: 1.0000 - winner_output_accuracy: 1.0000 - move_output_accuracy: 1.0000
[0.0006346721202135086, 0.00037915774737484753, 4.0315990190720186e-06, 0.00025148282293230295, 1.0, 1.0, 1.0]
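The list returned by evaluate() is easier to read when paired with names. As the progress line shows, the values come in the order total loss, the three per-output losses, then the three accuracies. The sketch labels the values printed above and confirms that the total loss is the sum of the three output losses (Keras weights each output with 1.0 unless loss_weights is passed to compile()). The list of names is written out by hand here; on a live model, model.metrics_names should provide it.

```python
import math

names = ["loss",
         "winning_output_loss", "winner_output_loss", "move_output_loss",
         "winning_output_accuracy", "winner_output_accuracy", "move_output_accuracy"]

# values copied from the evaluation output above
evaluation_result = [0.0006346721202135086, 0.00037915774737484753,
                     4.0315990190720186e-06, 0.00025148282293230295,
                     1.0, 1.0, 1.0]

report = dict(zip(names, evaluation_result))
for name, value in report.items():
    print(f"{name:26s} {value:.6f}")

summed = sum(report[n] for n in names[1:4])
# float32 arithmetic inside the model explains the tiny difference
print(math.isclose(summed, report["loss"], rel_tol=1e-6))  # True
```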

As usual, you may download this model again from here.

tic-tac-toe-Multi-Model.zip (243.6 KiB)

There are a couple of things we may conclude from this:

Using a multi-output model for the Tic-Tac-Toe case requires more units in the hidden layers than the most complex of the three single-output models.
Training a multi-output model is much more complex than training three models separately.
The three separate models needed around 25.5k parameters in total. For a rather speedy training in fewer than 100 epochs, the multi-output model needed around 28k parameters. A model with only 23k parameters is also possible¹, but training it is very tricky and requires a lot of luck.

Apparently, the multi-output model gains little from what the use cases have in common, although such synergies are conceivable: Knowing that there is a winner may help in saying which player the winner is, and vice versa. Likewise, knowing which player is the winner may help in saying with which move the winner won (for example, if player X has won, then none of the even-numbered moves can be a valid answer). At least with this trivial multi-layer Dense approach, the model could not make use of this yet.

Footnotes

¹ The model with the 64/64/256 configuration can be downloaded below.

tic-tac-toe-Multi-Model-64-64-256.zip (199.3 KiB)

