As we have seen in the previous blog posts, determining whether a board has a winner, which player the winner is, and with which move the winner won the game can all be done using neural networks. These networks only require Dense layers. However, each case has an inherent complexity, so a certain “minimal” number of layers, and of units in these layers, is necessary:

Use Case | Layers | Dense Layer Configuration | Count of Parameters |
---|---|---|---|
Winning | 3 | 64/64/128 | 13.3k |
Winner | 1 | 40 | 460 |
Move | 2 | 80/128 | 11.7k |

This suggests a kind of “complexity ranking” for the three challenges: The task with the highest complexity is the “winning problem”, followed by the “move problem”. Finally, the “winner problem” is the easiest to solve, because it can be answered accurately with only a single small Dense layer.
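The parameter counts in the table can be reproduced by hand. The following is only a sketch under assumptions: the helper `dense_params` is hypothetical, each model is assumed to take the 9 board positions as input, to end in 1 output unit (winning, winner) or 4 output units (move), and to contain a normalization layer with 19 non-trainable values.

```python
def dense_params(n_inputs, hidden_units, n_outputs):
    """Weights plus biases of a stack of Dense layers (hypothetical helper)."""
    total = 0
    previous = n_inputs
    for units in list(hidden_units) + [n_outputs]:
        total += previous * units + units  # weight matrix + bias vector
        previous = units
    return total


# Assumption: mean and variance for each of the 9 inputs, plus one counter.
NORMALIZATION_PARAMS = 2 * 9 + 1

# Assumption: 1 output unit for winning/winner, 4 output units for the move.
use_cases = {
    "Winning": ([64, 64, 128], 1),
    "Winner": ([40], 1),
    "Move": ([80, 128], 4),
}

counts = {
    name: dense_params(9, hidden, outputs) + NORMALIZATION_PARAMS
    for name, (hidden, outputs) in use_cases.items()
}

for name, count in counts.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions the sketch yields 13,268, 460 and 11,703 parameters, which agrees with the rounded figures in the table.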

Using the principle of multi-output, you may ask whether it is possible to retrieve all three pieces of information from a single model. Well, let us try this:

You may download these commands again from this Jupyter Notebook:

**tic-tac-toe-Multi-notebook.zip** (6.4 KiB, 148 hits)

Having done the usual imports and initialization, let’s define our dataset like this:

```
# One column per board position, followed by the labels of the three use cases
allDataFrame = pd.DataFrame(
    list(zip(
        [x.vector[0] for x in validtttRecordsList],
        [x.vector[1] for x in validtttRecordsList],
        [x.vector[2] for x in validtttRecordsList],
        [x.vector[3] for x in validtttRecordsList],
        [x.vector[4] for x in validtttRecordsList],
        [x.vector[5] for x in validtttRecordsList],
        [x.vector[6] for x in validtttRecordsList],
        [x.vector[7] for x in validtttRecordsList],
        [x.vector[8] for x in validtttRecordsList],
        [x.winning for x in validtttRecordsList],
        [x.winner for x in validtttRecordsList],
        [1 if x.move == 3 else 0 for x in validtttRecordsList],
        [1 if x.move == 4 else 0 for x in validtttRecordsList],
        [1 if x.move == 5 else 0 for x in validtttRecordsList],
        [1 if x.move == 9 else 0 for x in validtttRecordsList]
    )),
    columns=['pos1', 'pos2', 'pos3', 'pos4', 'pos5', 'pos6', 'pos7', 'pos8', 'pos9',
             'winning', 'winner', 'move3', 'move4', 'move5', 'move9'])
print(allDataFrame.tail())
```

Again, we do the usual 80/20 split between training and test data:

```
train_dataset = allDataFrame.sample(frac=0.8, random_state=42)
test_dataset = allDataFrame.drop(train_dataset.index)
print(allDataFrame.shape, train_dataset.shape, test_dataset.shape)
train_dataset.describe().transpose()
```

Similar to what we did before, we now split up the three different labels:

```
train_features = train_dataset.copy()
test_features = test_dataset.copy()
winning_train_labels = train_features.pop('winning')
winning_test_labels = test_features.pop('winning')
winner_train_labels = train_features.pop('winner')
winner_test_labels = test_features.pop('winner')
moveColumns = ['move3', 'move4', 'move5', 'move9']
move_train_labels = train_features[moveColumns].copy()
train_features = train_features.drop(moveColumns, axis=1)
move_test_labels = test_features[moveColumns].copy()
test_features = test_features.drop(moveColumns, axis=1)
print(train_features)
print(winning_train_labels)
print(winner_train_labels)
print(move_train_labels)
```

This will give us data samples like this:

```
pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9
329603 9 2 4 7 6 3 8 5 1
56013 2 5 1 8 7 9 4 6 3
296034 8 3 7 1 9 6 2 4 5
224081 6 5 4 2 3 8 9 7 1
245695 7 1 8 3 5 4 2 9 6
... ... ... ... ... ... ... ... ... ...
177025 5 4 1 9 3 2 6 8 7
204707 6 1 7 3 9 4 8 5 2
312923 8 7 1 5 6 3 9 4 2
47930 2 3 6 7 5 1 8 4 9
323085 9 1 2 7 4 8 5 6 3
[290304 rows x 9 columns]
329603 0
56013 1
296034 1
224081 1
245695 1
..
177025 0
204707 1
312923 1
47930 1
323085 0
Name: winning, Length: 290304, dtype: int64
329603 0
56013 0
296034 0
224081 1
245695 0
..
177025 0
204707 0
312923 1
47930 0
323085 0
Name: winner, Length: 290304, dtype: int64
move3 move4 move5 move9
329603 0 0 0 1
56013 0 1 0 0
296034 1 0 0 0
224081 1 0 0 0
245695 0 1 0 0
... ... ... ... ...
177025 0 0 0 1
204707 0 1 0 0
312923 1 0 0 0
47930 0 1 0 0
323085 0 0 0 1
[290304 rows x 4 columns]
```

Getting the usual normalizer:

```
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
print(normalizer.mean.numpy())
```
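To make clear what the normalizer contributes, here is a plain-Python sketch of the idea: `adapt` learns a mean and a variance per input column, and every value is later shifted and scaled with them. The helper names and the `epsilon` guard are assumptions for illustration, not the layer's actual code; the three rows are taken from the sample output above.

```python
import math


def adapt(rows):
    """Learn the per-column mean and population variance (illustrative)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((value - mean) ** 2 for value in col) / n
        for col, mean in zip(columns, means)
    ]
    return means, variances


def normalize(row, means, variances, epsilon=1e-7):
    """Shift by the mean and scale by the standard deviation."""
    return [
        (value - mean) / max(math.sqrt(var), epsilon)
        for value, mean, var in zip(row, means, variances)
    ]


rows = [
    [9, 2, 4, 7, 6, 3, 8, 5, 1],
    [2, 5, 1, 8, 7, 9, 4, 6, 3],
    [8, 3, 7, 1, 9, 6, 2, 4, 5],
]
means, variances = adapt(rows)
normalized = [normalize(row, means, variances) for row in rows]
print([round(value, 3) for value in normalized[0]])
```

After this step every column has a mean of zero and unit variance on the data it was adapted to, which usually makes training with a fixed learning rate better behaved.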

We may define our multi-output model like this:

```
inputs = keras.Input(shape=(9,))  # explicit one-element tuple; (9) would just be the integer 9
dense1 = keras.layers.Dense(64, activation='relu')
dense2 = keras.layers.Dense(80, activation='relu')
dense3 = keras.layers.Dense(256, activation='relu')
denseWinningOutput = keras.layers.Dense(1, name="winning_output")
denseWinnerOutput = keras.layers.Dense(1, name="winner_output")
denseMoveOutput = keras.layers.Dense(4, name="move_output")
x = dense1(normalizer(inputs))
x = dense2(x)
x = dense3(x)
outputWinning = denseWinningOutput(x)
outputWinner = denseWinnerOutput(x)
outputMove = denseMoveOutput(x)
model = keras.Model(inputs=inputs, outputs=[outputWinning, outputWinner, outputMove], name="tictactoe_model")
print(model.summary())
```

Note that

- a Sequential model is no longer sufficient.
- We have three different “final layers” for representing our results.
- For accessing the three different outputs, we also have to give them names: `winning_output`, `winner_output` and `move_output`.
- Naturally, we would initially define the hidden Dense layers along the requirements of the “most complex problem”, i.e. the winning use case. We therefore would take a 64/64/128 layering. However, tests have shown that this is not sufficient. Even a 64/64/256 layering hardly ever manages to reach a 1.0/1.0/1.0 accuracy within 100 epochs. That is why we define a 64/80/256 layering.
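Since the three final layers are defined without an activation function, they return raw scores (logits) rather than probabilities. The sketch below shows one way such scores could be turned back into labels; the helper `decode` and the example scores are made up for illustration.

```python
MOVE_COLUMNS = ["move3", "move4", "move5", "move9"]  # same order as the label columns


def decode(winning_logit, winner_logit, move_logits):
    """Turn the raw scores of the three outputs into labels (hypothetical helper)."""
    # A logit above 0 corresponds to a sigmoid probability above 0.5.
    winning = 1 if winning_logit > 0 else 0
    winner = 1 if winner_logit > 0 else 0
    # For the move, the column with the highest score wins.
    best = max(range(len(move_logits)), key=lambda i: move_logits[i])
    return winning, winner, MOVE_COLUMNS[best]


# Made-up scores for a single board
print(decode(3.2, -4.1, [-2.0, 5.5, 0.3, -1.2]))
```

In the notebook, the scores would come from `model.predict`, which for this model returns one array per output.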

This model then will have around 28k parameters to train:

```
Model: "tictactoe_model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 9)] 0 []
normalization (Normalization) (None, 9) 19 ['input_5[0][0]']
dense_12 (Dense) (None, 64) 640 ['normalization[4][0]']
dense_13 (Dense) (None, 80) 5200 ['dense_12[0][0]']
dense_14 (Dense) (None, 256) 20736 ['dense_13[0][0]']
winning_output (Dense) (None, 1) 257 ['dense_14[0][0]']
winner_output (Dense) (None, 1) 257 ['dense_14[0][0]']
move_output (Dense) (None, 4) 1028 ['dense_14[0][0]']
==================================================================================================
Total params: 28,137
Trainable params: 28,118
Non-trainable params: 19
__________________________________________________________________________________________________
None
```
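The total in the summary can be checked with a few lines of arithmetic. This is a sketch with a hypothetical helper, `multi_output_params`, which assumes that all three output layers are attached to the last hidden layer and that the normalization layer holds 19 values, as the summary shows.

```python
def multi_output_params(n_inputs, hidden_units, head_units, normalization_params=19):
    """Parameter count of a shared Dense trunk with several output layers."""
    total = normalization_params
    previous = n_inputs
    for units in hidden_units:
        total += previous * units + units  # weights + biases of a hidden layer
        previous = units
    for units in head_units:
        total += previous * units + units  # every output layer sees the last hidden layer
    return total


heads = [1, 1, 4]  # winning_output, winner_output, move_output

print(multi_output_params(9, [64, 80, 256], heads))  # the layering used here
print(multi_output_params(9, [64, 64, 256], heads))  # the smaller 64/64/256 layering
```

The first call reproduces the 28,137 of the summary; the second gives 23,001 for the 64/64/256 layering. Most of the weight sits in the last hidden layer, so widening the middle layer from 64 to 80 units already adds about 5k parameters.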

For compiling (and later training) the model, we need loss functions. Since each output has its own loss value, and the output types differ (binary vs. categorical), we need a separate loss function for each output.

```
losses = {
    "winning_output": keras.losses.BinaryCrossentropy(from_logits=True),
    "winner_output": keras.losses.BinaryCrossentropy(from_logits=True),
    "move_output": keras.losses.CategoricalCrossentropy(from_logits=True)
}
model.compile(loss=losses,
              optimizer=keras.optimizers.Adam(learning_rate=0.015),
              metrics=["accuracy"])
```
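For readers who want to see what the two loss types compute on a single sample, here is a plain-Python sketch of both formulas applied to logits. The function names are made up, and the Keras classes additionally average over the batch, so this is an illustration rather than their implementation.

```python
import math


def binary_crossentropy_from_logit(y_true, logit):
    """-[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit), in a stable form."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))


def categorical_crossentropy_from_logits(y_true, logits):
    """-sum(y_i * log(softmax(logits)_i)) for a one-hot label vector."""
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(z - highest) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(y_true, logits))


# An undecided model (all logits 0) pays log(2) per binary output
# and log(4) for the four-way move output.
print(binary_crossentropy_from_logit(1, 0.0))
print(categorical_crossentropy_from_logits([0, 1, 0, 0], [0.0, 0.0, 0.0, 0.0]))

# A confident and correct prediction costs almost nothing.
print(binary_crossentropy_from_logit(1, 8.0))
print(categorical_crossentropy_from_logits([0, 1, 0, 0], [-4.0, 6.0, -3.0, -5.0]))
```

The option `from_logits=True` matters because the output layers have no activation: the sigmoid or softmax is applied inside the loss instead.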

Moreover, when training, we want to make sure that we don’t stop if the first output has reached an accuracy of 1.0 (rounded to 5 digits), but only if all of them have reached that limit. That is why we need an adjusted Accuracy Stopping callback:

```
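class Accuracy1Stopping(keras.callbacks.Callback):
    """Stops training once all three outputs reach an accuracy of 1.0 (rounded to 5 digits)."""

    def __init__(self):
        super().__init__()

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        accuracy_keys = ('winning_output_accuracy', 'winner_output_accuracy', 'move_output_accuracy')
        # A missing metric counts as 0.0, so training simply continues.
        if all(round(logs.get(key, 0.0), 5) == 1.0 for key in accuracy_keys):
            self.model.stop_training = True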
```
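The stopping condition can be tried out without Keras. The sketch below isolates it in a hypothetical helper, `all_outputs_perfect`, and feeds it the accuracies of epochs 12 and 13 from the training log shown further down. It also illustrates, purely as arithmetic on the 290,304 training rows, how strict rounding to 5 digits is.

```python
ACCURACY_KEYS = (
    "winning_output_accuracy",
    "winner_output_accuracy",
    "move_output_accuracy",
)


def all_outputs_perfect(logs, digits=5):
    """True only if every output reports an accuracy that rounds to 1.0."""
    return all(round(logs.get(key, 0.0), digits) == 1.0 for key in ACCURACY_KEYS)


epoch_12 = {
    "winning_output_accuracy": 0.9995,
    "winner_output_accuracy": 1.0,
    "move_output_accuracy": 0.9996,
}
epoch_13 = {key: 1.0 for key in ACCURACY_KEYS}

print(all_outputs_perfect(epoch_12))  # one perfect output is not enough
print(all_outputs_perfect(epoch_13))

TRAIN_ROWS = 290304
one_miss = round(1 - 1 / TRAIN_ROWS, 5)
two_misses = round(1 - 2 / TRAIN_ROWS, 5)
print(one_miss, two_misses)
```

With this many rows, a single misclassified sample would still round to 1.0, while two would not. The accuracy Keras reports is a running average in lower precision, so this is only a rough guide to how close to perfect the training has to get.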

We then can initiate training:

```
history = model.fit(
    train_features,
    y={
        "winning_output": winning_train_labels,
        "winner_output": winner_train_labels,
        "move_output": move_train_labels
    },
    batch_size=512,
    epochs=100,
    shuffle=True,
    callbacks=[
        Accuracy1Stopping(),
        tf.keras.callbacks.ReduceLROnPlateau(monitor='winning_output_loss', factor=0.7, patience=6, min_lr=0.0015),
        tensorboard_callback
    ],
    verbose=1)
```
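Two numbers of this setup can be worked out in advance: the number of batches per epoch and the learning rates that `ReduceLROnPlateau` can reach. The sketch below assumes the documented rule that each reduction multiplies the rate by `factor` and never goes below `min_lr`; the helper `plateau_schedule` is hypothetical.

```python
import math


def plateau_schedule(initial_lr, factor, min_lr):
    """Learning rates reached if every plateau triggers a reduction."""
    rates = [initial_lr]
    while rates[-1] > min_lr:
        rates.append(max(rates[-1] * factor, min_lr))
    return rates


rates = plateau_schedule(initial_lr=0.015, factor=0.7, min_lr=0.0015)
print([f"{rate:.6f}" for rate in rates])
print(len(rates) - 1, "reductions until the floor is reached")

# 290304 training rows in batches of 512
steps_per_epoch = math.ceil(290304 / 512)
print(steps_per_epoch, "batches per epoch")
```

So the rate can be lowered at most seven times before it is pinned to `min_lr`, and each reduction requires 6 epochs without improvement of `winning_output_loss`. In the run shown below, training stops after 13 epochs and the rate never leaves 0.0150.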

Admittedly, this training is much more fragile and significantly more sensitive to the initial random values of the model. Therefore, it may take a couple of attempts until you get a “good model” trained properly, with a training log that ends up like this:

```
Epoch 1/100
5/567 [..............................] - ETA: 8s - loss: 2.2032 - winning_output_loss: 0.6967 - winner_output_loss: 0.4011 - move_output_loss: 1.1054 - winning_output_accuracy: 0.5570 - winner_output_accuracy: 0.9066 - move_output_accuracy: 0.5039 WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0044s vs `on_train_batch_end` time: 0.0296s). Check your callbacks.
567/567 [==============================] - 10s 14ms/step - loss: 1.5532 - winning_output_loss: 0.5537 - winner_output_loss: 0.2232 - move_output_loss: 0.7764 - winning_output_accuracy: 0.6820 - winner_output_accuracy: 0.9250 - move_output_accuracy: 0.6664 - lr: 0.0150
Epoch 2/100
567/567 [==============================] - 8s 14ms/step - loss: 1.0364 - winning_output_loss: 0.3951 - winner_output_loss: 0.1086 - move_output_loss: 0.5327 - winning_output_accuracy: 0.7946 - winner_output_accuracy: 0.9563 - move_output_accuracy: 0.7795 - lr: 0.0150
Epoch 3/100
567/567 [==============================] - 8s 14ms/step - loss: 0.6892 - winning_output_loss: 0.2779 - winner_output_loss: 0.0443 - move_output_loss: 0.3670 - winning_output_accuracy: 0.8664 - winner_output_accuracy: 0.9833 - move_output_accuracy: 0.8516 - lr: 0.0150
Epoch 4/100
567/567 [==============================] - 8s 14ms/step - loss: 0.4397 - winning_output_loss: 0.1836 - winner_output_loss: 0.0195 - move_output_loss: 0.2366 - winning_output_accuracy: 0.9187 - winner_output_accuracy: 0.9940 - move_output_accuracy: 0.9080 - lr: 0.0150
Epoch 5/100
567/567 [==============================] - 8s 14ms/step - loss: 0.2816 - winning_output_loss: 0.1173 - winner_output_loss: 0.0149 - move_output_loss: 0.1494 - winning_output_accuracy: 0.9516 - winner_output_accuracy: 0.9960 - move_output_accuracy: 0.9455 - lr: 0.0150
Epoch 6/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1702 - winning_output_loss: 0.0711 - winner_output_loss: 0.0134 - move_output_loss: 0.0857 - winning_output_accuracy: 0.9745 - winner_output_accuracy: 0.9967 - move_output_accuracy: 0.9731 - lr: 0.0150
Epoch 7/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1674 - winning_output_loss: 0.0681 - winner_output_loss: 0.0129 - move_output_loss: 0.0864 - winning_output_accuracy: 0.9788 - winner_output_accuracy: 0.9967 - move_output_accuracy: 0.9767 - lr: 0.0150
Epoch 8/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0143 - winning_output_loss: 0.0078 - winner_output_loss: 6.8342e-05 - move_output_loss: 0.0064 - winning_output_accuracy: 0.9984 - winner_output_accuracy: 1.0000 - move_output_accuracy: 0.9990 - lr: 0.0150
Epoch 9/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1652 - winning_output_loss: 0.0615 - winner_output_loss: 0.0167 - move_output_loss: 0.0870 - winning_output_accuracy: 0.9831 - winner_output_accuracy: 0.9965 - move_output_accuracy: 0.9794 - lr: 0.0150
Epoch 10/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0089 - winning_output_loss: 0.0046 - winner_output_loss: 1.6532e-04 - move_output_loss: 0.0041 - winning_output_accuracy: 0.9991 - winner_output_accuracy: 0.9999 - move_output_accuracy: 0.9994 - lr: 0.0150
Epoch 11/100
567/567 [==============================] - 8s 14ms/step - loss: 0.1202 - winning_output_loss: 0.0445 - winner_output_loss: 0.0128 - move_output_loss: 0.0629 - winning_output_accuracy: 0.9881 - winner_output_accuracy: 0.9972 - move_output_accuracy: 0.9850 - lr: 0.0150
Epoch 12/100
567/567 [==============================] - 8s 14ms/step - loss: 0.0054 - winning_output_loss: 0.0027 - winner_output_loss: 1.0714e-04 - move_output_loss: 0.0025 - winning_output_accuracy: 0.9995 - winner_output_accuracy: 1.0000 - move_output_accuracy: 0.9996 - lr: 0.0150
Epoch 13/100
567/567 [==============================] - 8s 14ms/step - loss: 8.3936e-04 - winning_output_loss: 4.9668e-04 - winner_output_loss: 5.7897e-06 - move_output_loss: 3.3689e-04 - winning_output_accuracy: 1.0000 - winner_output_accuracy: 1.0000 - move_output_accuracy: 1.0000 - lr: 0.0150
```

Let’s evaluate the model again with our test data:

```
print(test_features)
ytest = {
    "winning_output": winning_test_labels,
    "winner_output": winner_test_labels,
    "move_output": move_test_labels
}
print(ytest)
evaluationResult = model.evaluate(test_features, y = ytest, batch_size=256, verbose=1)
print(evaluationResult)
```

This yields a rocking 1.0/1.0/1.0 accuracy:

```
284/284 [==============================] - 2s 8ms/step - loss: 6.3467e-04 - winning_output_loss: 3.7916e-04 - winner_output_loss: 4.0316e-06 - move_output_loss: 2.5148e-04 - winning_output_accuracy: 1.0000 - winner_output_accuracy: 1.0000 - move_output_accuracy: 1.0000
[0.0006346721202135086, 0.00037915774737484753, 4.0315990190720186e-06, 0.00025148282293230295, 1.0, 1.0, 1.0]
```
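The list returned by `evaluate` is easier to read with names attached. The sketch below pairs the printed values with names; their order is an assumption, inferred from the progress line above (total loss, then the three losses, then the three accuracies).

```python
import math

evaluation_result = [
    0.0006346721202135086, 0.00037915774737484753, 4.0315990190720186e-06,
    0.00025148282293230295, 1.0, 1.0, 1.0,
]

# Assumed order, taken from the progress line of the evaluation
names = [
    "loss",
    "winning_output_loss", "winner_output_loss", "move_output_loss",
    "winning_output_accuracy", "winner_output_accuracy", "move_output_accuracy",
]

report = dict(zip(names, evaluation_result))
for name, value in report.items():
    print(f"{name:>24}: {value:.6g}")

# The total loss is the (unweighted) sum of the three output losses.
summed = sum(report[name] for name in names[1:4])
print(math.isclose(report["loss"], summed, rel_tol=1e-6))
```

The last line confirms that the overall loss is simply the sum of the three per-output losses, since no loss weights were passed to `compile`.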

As usual, you may download this model here:

**tic-tac-toe-Multi-Model.zip** (243.6 KiB, 125 hits)

There are a couple of things we may conclude from this:

- Using a multi-output model for the Tic-Tac-Toe case requires more units in the hidden layers than the most complex model of the three use cases.
- Training a multi-output model is much more complex than training three models separately.
- The three separate models needed around 25.5k parameters in total. For a rather speedy training within less than 100 epochs, the multi-output model needed around 28k parameters. A model with only 23k parameters is also possible¹, but training is very tricky and requires a lot of luck.

Apparently, the use cases gain little from one another in this setup, although in principle they could: Knowing that there is a winner may help when you need to say which player the winner is, and vice versa. Likewise, knowing which player is the winner may help to determine with which move the winner won (for example, if player X has won, then all moves at even places cannot be a valid response). At least with this trivial Dense multi-layer approach, the model could not make use of this yet.

## Footnotes

1. The model with the 64/64/256 configuration can be downloaded below. ↩︎

**tic-tac-toe-Multi-Model-64-64-256.zip** (199.3 KiB, 129 hits)
