Learning to Play Tic Tac Toe via Reinforcement Learning

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Ethan Cao, Jocelyn Ho, Sameeh Maayah, Maria Rufova, Adithya Devraj


Machine learning algorithms are implemented in a wide variety of modern applications to train computers to make efficient predictions and decisions based on their observations of past data. Reinforcement learning is a paradigm of machine learning that is concerned with making the most beneficial decision in a particular environment in order to maximize the reward. I The Q-learning algorithm is a reinforcement learning algorithm that solves decision-making problems by assigning different scores to different decisions at each state of the problem. When applied to learn to play a game such as Tic-Tac-Toe, at each board position, the goal of the algorithm is to assign a numerical score to each move, where the score is proportional to the chance of winning. This trains the machine to perform the most beneficial moves by simultaneously exploiting already learned beneficial information and exploring its possibilities by constantly trying out new game moves.

Using numerically mapping algorithms of Q-Learning to predict a game’s outcome through iterative training, we have trained an artificial Tic-Tac-Toe opponent to refer to Q values for all possible player decisions in a current environment and chose the one with the highest value. With the example of Tic-Tac-Toe, we examine how reinforcement learning allows machines to implement dynamic programming techniques to maximize their performance with minimal human intervention and assistance.


In 2015, AlphaGo became the first computer to beat a human professional player over Go, an ancient chinese game that contains over 10172 possible board positions [1]. In the proceeding years, AlphaGo continued to improve by training against both human and artificial opponents, eventually defeating numerous top-ranked players around the world [1]. With such, AlphaGo presents the possibilities of artificial intelligence (AI) to reach a higher level of expertise in a specific field usually dominated by human players. Through a distinct point of view from AI, it is possible to realize many applications that existed in fictions, such as autonomous robot agents, trading systems, business intelligence, and chat bots [4].

In order to have a similar automated approach to solve real-world problems, we decided to look into board games. In contrast to video games, board games present more adequate environments to train AI opponents because board games have exact rules and possibilities for its outcomes [4]. With AI knowing all potential actions, it is able to calculate exact outcomes based on the states of the board. This allows AI to obtain the goal of interest by figuring out how to maximize the utility. Out of the millions of today’s board games, we looked into the well-known and easily comprehensible game of Tic-Tac-Toe. Although in comparison with Go Tic-Tac-Toe has only less than 360,000 possible board positions, it is still complicated enough to be solved with deep neural networks [4].

In general, there are at least two ways for a machine to reach a certain outcome from a series of actions: decision tree and reinforcement learning [5]. Decision tree contains branches that represent specific conditions, where the outcomes are located at the ends of the branches [5]. On each node of the decision tree, an action is made to determine which condition is reached. However, the decision tree requires storing all patterns of a game, which takes up an enormous amount of memory. Therefore, some use an extended version of the decision tree called Min-Max algorithm to estimate how good the action made is through backtracking. In this algorithm, the opponents will try to obtain the best move (which is the worst move for the other player) and a value will be assigned to the outcome [5]. Depending on the number of moves the player made, the depth of the tree will be set. In order to optimize AI’s performance, an evaluation function and the tree’s depth will be used to estimate the final value of the match. Nevertheless, the Min-Max algorithm requires massive state space and there are no evaluation functions that are efficient enough. Hence, we attempted to use reinforcement learning—which automatically finds the balance between exploration of unknown pathways and exploitation of current knowledge—to train the machine how to play Tic-Tac-Toe [3].


In reinforcement learning, the machine will aim to optimize the rewards through interactions with the environment and updating itself with better policy based on its experiences. Taking Q-learning as an example. For each action the agent of the function has made, it comes with a reward (based on the outcome) and a value that is calculated based on the current state and the optimal action that the agent has previously made [7]. Using these variables, the machine calculates the new value and repeats this process until the game terminates. Over many simulations, the machine will have experienced a range of patterns of actions and states and be able to estimate the probability of winning the game [7]. Through reinforcement learning, the machine will start off playing the game terribly, but through iterations, it improves gradually and ultimately reaches a high winning possibility.

Methods and Materials

The goal of reinforcement learning is to discover how to maximize a numerical reward of a task without any prior information of actions’ potential value. Instead the machine must individually explore its environment and exploit learned material to understand which actions will yield the greatest cumulative reward at the end of the game. In case of Tic-Tac-Toe, the learning agent conducts the trial-and-error learning through repetitive plays of the standard Tic-Tac-Toe game on a 3X3 field, where a win is obtained by consecutively placing either three X’s or three O’s horizontally, vertically, or diagonally. To achieve the delayed reward of such winning combination, the agent must evaluate which actions most beneficially affect not only the immediate reward, but also all subsequent future rewards.


In any reinforcement learning environment, the learning agent’s behavior at a given time is determined by its policy—a learned strategy that dictates the agent’s actions as a function of its current state and environment. Additionally, reward signals—numerical payoff received after each action that agent strives to maximize—acts as the primary basis for altering policy. Actions with high rewards influence the policy to prioritize them in the future, thus altering the agent’s strategy. The ultimate reward of Tic-Tac-Toe is a winning combo of three consecutive characters, which awards the agent +1. The agent’s loss would be punished by -1, thus influencing its policy to avoid the actions that led to the low reward.

The agent-environment interaction considering the state, action, and reward possibilities of a problem is best framed in a finite Markov Decision Process. The Markov Decision Process represents a stochastic sequence of agent’s possible actions in which the outcomes are partially random and partially under the control of the machine. A variety of actions gives the motivation to seek different rewards. Markov Decision Processes thus operates through a value function where at a time step  t , the agent may choose action  a available in that state  s , thus moving the system into a new state  s' and providing the corresponding reward  R . Next state depends only on the current state and the action taken, and the probability of moving into a new state  s' is given by the transition function  p_{a}(s,s') . The value function is given as a solution to the Bellman equation which expresses the relationship between the value of the current state and the values of its successor states. Beginning with initial state  s , the agent can take a variety of actions  a according to its policy  \pi , to which the environment can respond with one of potential next states  s' along with a reward,  r , depending on the dynamics given by probability function  p . The Bellman equation thus averages over all possible next states, weighing each by its probability of occurring. It ultimately tells the value of the immediate reward produced by an action a in some state s, and the maximum expected reward you can get in the next state.

 Q^{\text {new }}\left(s_{t}, a_{t}\right) \leftarrow \underbrace{Q\left(s_{t}, a_{t}\right)}_{\text {old value }}+\underbrace{\alpha}_{\text {learning rate }} \cdot \overbrace{(\underbrace{\underbrace{r_{t}}_{\text {reward }}+\underbrace{\gamma}_{\text {discount factor }} \cdot \underbrace{\max _{a} Q\left(s_{t+1}, a\right)}_{\text {estimate of optimal future ralue }}}_{\text{new value (temporal difference target)}}-\underbrace{Q\left(s_{t}, a_{t}\right)}_{\text {old value }})}^{\text {temporal difference }}

Our Tic-Tac-Toe project uses the Bellman equation as a part of the Q-Learning method, in which the machine iteratively approximates the “Q” values—the expected reward for an action in a given state. Q-learning assigns each state-action pair in a game of Tic-Tac-Toe a particular reward, with higher Q values indicating the most desirable actions. An equation iteratively updates the Q values depending on the current state of the board, potential actions, and future states. In the case of Tic-Tac-Toe, board positions are states and game moves are actions. When the end of a match is reached during training, the result of the game is the move that led to that result. The machine then works back recursively through the history of the game and updates the Q-values for each action taken during the game. In order to make our learning agent familiar with all possible Tic-Tac-Toe moves instead of just reinforcing already high Q values, we use the epsilon-greedy strategy which either selects a random move with probability  \epsilon , or uses a move from Q-table with probability  1- \epsilon . This ensures a balanced exploration – exploitation learning strategy for our agent.


The Q-learning algorithm was created in Python. By converting each board state to a string we created a unique hash for that board state. The hash and its reward were stored as a key and value pair in a dictionary data structure. During each episode, all moves made are recorded and the reward received by the algorithm will be distributed to each board state hash via the Q-learning equation previously described.

Using the epsilon greedy method, our algorithm determines whether to follow the path of greatest reward by looking at the reward associated with future game states or explore through a random move. The computer will pick one of the two moves by generating a random number between 0 and 1. If this number is smaller than a predetermined epsilon value the algorithm will play a random move however if the generated number was larger than epsilon the algorithm will take the q-learning approach.

After a lot of testing we decided to first set epsilon to 1 (this ensures that the algorithm will take a random action) for the first 1000 episodes. Then epsilon would be divided by 2 for each 1000 episodes played, enabling the algorithm to discover a large number of the possible states and then play strategically based on the q-value of the board. After each move taken by the agent the q-table (a table that holds all the state action pairs of the board and its corresponding q-value) would be updated using the bellman equation.


To yield great results and make the game as challenging for the agent as possible, the opponent of the agent works with the same q-learning algorithm but with the q-table of the main agent from 100 episodes earlier. The q-table of the opponent will keep updating every 100 episodes played.

Figure 1 demonstrates the results after the agent has played for 300,000 games. The agent was able to win about 75% of the games played, 15% of the games were a tie and 10% were a loss.


Our research has demonstrated that with sufficient amount of training, a learning agent is capable of utilizing reinforcement learning to master playing simple games such as Tic-Tac-Toe with a reasonably high winning outcome. Our agent has utilized the Bellman equation and the Q-learning algorithm to track and re-trace its Tic-Tac-Toe moves to improve its playing strategy. By the conclusion of our experiment, the agent is capable of winning on average 85% of its games after exactly 300,000 episodes of training. We believe that with further improvement to the Q-learning algorithm and a longer training period we will be able to increase the winning percentage even more and completely eradicate losing scenarios.

Future Directions

Future work on the project would include finding ways to optimize the machine’s learning strategies in order to guarantee mostly a winning outcome for the agent, with perhaps some occasional ties. We are looking forward to finding a way to expedite and maximize learning without relying on prolonged episodes of training, as an increased amount of training matches takes a significantly longer time for machines to complete (which is another separate issue we can improve upon).

We hope to conduct further research on the adjustments of learning rate and decay exploration rates and their effects on machine’s abilities in not just Tic-Tac-Toe, but similar programs that can be improved with reinforcement learning. While the specifications of states, actions, and rewards change from game to game, the essence of Q-learning algorithm remains the same, and thus our research can be applied to a wide variety of programs. Our work with reinforcement learning can thus be extended to not only various games, but also applied in many important fields such as finance, business, medicine, industrial robotics and other areas where credible unsupervised learning is essential to quick operations and exemplary results.


  1. AlphaGo: The story so far. (2021). Deepmind. Retrieved August 4, 2021, from https://deepmind.com/research/case-studies/alphago-the-story-so-far
  2. Babes, M., Littman, M., & Wunder, M. (n.d.). Classes of Multiagent Q-learning Dynamics with ∊-greedy Exploitation. Rutgers University, Department of Computer Science. Retrieved August 4, 2021 from https://icml.cc/Conferences/2010/papers/191.pdf
  3. Friedrich, C. (2018, July 20). TABULAR Q learning, a Tic Tac Toe player that gets better and better. Medium. Retrieved August 4, 2021, from https://medium.com/@carsten.friedrich/part-3-tabular-q-learning-a-tic-tac-toe-player-that- gets-better-and-better-fa4da4b0892a
  4. Ritthaler, M. (2018, January 18). Using Q-Learning and Deep Learning to Solve Tic-Tac-Toe. YouTube. Clearwater Analytics. Retrieved August 4, 2021, from https://www.youtube.com/watch?v=4C133ilFm3Q
  5. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press Ltd. Retrieved August 4, 2021, from http://incompleteideas.net/book/RLbook2020.pdf
  6. Torres, J. (2020, June 11). The Bellman Equation. Medium. Towards Data Science. Retrieved August 4, 2021, from ttps://towardsdatascience.com/the-bellman-equation-59258a0d3fa7
  7. Watkins, J. (1992). Q-Learning. Kluwer Academic Publishers. Retrieved August 2, 2021 from https://link.springer.com/content/pdf/10.1007/BF00992698.pdf



Leave a Reply