Help with setting up structure

Hi,

I have a game board (a grid of squares, with some value set for each square) as input to the observations. That is, the entire board is the observation.

I am looking for best practices on how to pass that into the learning mechanism.

The NextMove in this case is simply one of the board squares (think tic-tac-toe). I have a few concrete questions:

  1. There is no IntegerObservation, though there is a CountObservation. If I am predicting the x and y coordinates of the next square, does it make sense to use a CountObservation for each coordinate? A float doesn't seem to make sense.

The board then looks like this:

Similarly for something like "My Player Number":
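For readers following along, the general idea in question 1 can be sketched framework-agnostically in plain Python: each square becomes one small integer in a flat observation vector, with "my player number" appended as one more integer. The names here (`encode_board`, the `EMPTY`/`X`/`O` values) are illustrative, not part of any library's API.

```python
# Generic sketch, not the library's API: encode a tic-tac-toe-style
# board as a flat list of small integers, one value per square.
EMPTY, X, O = 0, 1, 2

def encode_board(board):
    """Flatten a 2-D board (list of rows) into a flat observation vector."""
    return [cell for row in board for cell in row]

board = [
    [X, EMPTY, O],
    [EMPTY, X, EMPTY],
    [O, EMPTY, EMPTY],
]
obs = encode_board(board)  # 9 integers, one per square
obs.append(1)              # "my player number" appended as one more integer
print(obs)
```

This matches the count-style encoding idea: every observed value is a small non-negative integer with a known upper bound, which is what a CountObservation per value would presumably express.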

  2. The Agent Action is picking a square to move to. It currently looks like this:

And in PerformAgentAction, I just map the returned index to the x, y coordinates of the square.

Does that make sense? Is there a better way? (I realize I can write custom code for all of this, but I am first trying to make this work with the built-in pieces).
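The index-to-coordinates mapping in question 2 is a standard flattening trick, sketched here in plain Python (function names are illustrative, assuming a row-major board of known width):

```python
# Hedged sketch: map a single discrete action index in
# [0, width*height) to (x, y) board coordinates, and back.
WIDTH, HEIGHT = 3, 3

def index_to_xy(index, width=WIDTH):
    y, x = divmod(index, width)  # row-major: index = y*width + x
    return x, y

def xy_to_index(x, y, width=WIDTH):
    return y * width + x

print(index_to_xy(4))       # centre square of a 3x3 board
print(xy_to_index(2, 2))    # bottom-right corner
```

One discrete action over all squares is usually simpler than predicting x and y separately, since a single exclusive choice cannot produce an inconsistent coordinate pair.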

  3. Should I somehow filter out moves that are not valid (squares already taken)? I am not sure how I would go about doing that with the above setup. GetExclusiveDiscreteAction just gives me a number between 0 and 9, with no way to filter.

Or do I just end the training episode early if the agent picks an invalid square? It seems that since the allowed squares will always be known, there is no point in forcing the agent to learn that bit on its own.
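One common pattern for question 3, independent of any particular framework, is action masking: keep the learner as-is, but restrict the final choice to the set of legal indices. A minimal sketch, where `scores` stands in for whatever per-action values the learner exposes and the function names are hypothetical:

```python
# Hedged sketch: choose the best action among only the legal ones,
# rather than ending the episode when an illegal square is picked.

def legal_indices(board_flat):
    """Indices of empty squares (value 0) on a flattened board."""
    return [i for i, v in enumerate(board_flat) if v == 0]

def pick_masked(scores, board_flat):
    """Highest-scoring action among the legal ones."""
    legal = legal_indices(board_flat)
    return max(legal, key=lambda i: scores[i])

board_flat = [1, 0, 2, 0, 1, 0, 2, 0, 0]          # non-zero squares are taken
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.4]
print(pick_masked(scores, board_flat))             # index 0 scores highest but is taken
```

If the framework only hands back a single chosen index with no scores, a cruder fallback is to re-sample until a legal square comes up, or to give a small negative reward for illegal picks; masking, where available, avoids making the agent learn the rules of legality from scratch.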

Thank you kindly
Dan