Karan Bhasin and Aditi Talati
Abstract: While there is a breadth of research on the human decision-making process in game-theoretic situations, much of that research focuses on situations in which time pressure or lack of information does not limit the ability of humans to process the information they are given. The effect of insufficient time to weigh the available choices must be understood in order to accurately predict human behavior in rapid decision-making situations, such as when an autonomous car must anticipate the behavior of nearby human drivers. There are many proposed mathematical models of these decision-making processes, in which humans are assumed to follow certain reward functions with various probabilities. However, verifications of these models are inadequate, and it remains unclear which model works best.
To test these models, we created multiple logical games in which we collected data on the decisions players made and compared them to mathematical models of such decisions. We found that the prospect theory model was the most accurate for time-constrained games, as previous literature predicted. However, since we could not compute reward values for the information-constrained game, we could not compare it to these mathematical models. In the future, we would like to develop a way to analyze the expected rewards for this game so that we can compare it to such models.
We have created multiple games which ask the player to make a computationally heavy decision under various constraints. Two of the games, blackjack and the counting game, asked the player to make a single decision under varying time constraints, while the grid game asked the player to travel from one corner of a grid to another. This was a long-horizon game, where each step the player took impacted the possible future steps. This was also done under various time constraints. The hover game was an information-constrained game, where knowing more information about the game state would also cost points to the player.
The data we collected from each game was compared to various functional forms of the likelihood, using a Metropolis-Hastings algorithm to find the parameter set that maximizes the likelihood under each function. These parameter values were then re-entered into the function with a validation data set to find the likelihood of the data and the relative accuracy of the model. We found that, in our games, the prospect theory model produced the most accurate results, which was expected, because it is the most flexible model of human decision-making.
We implemented multiple computation-heavy games to gather data about user decisions under time constraints. In these games, the horizon is short and only a few actions are possible, but deciding the optimal action requires a considerable amount of computation. The first game, the counting game (Figure 1), displays 5 square grids, each 20 by 20 in size. Anywhere from 3 to 18 squares on each grid are colored blue. The user must pick the grid with the maximum reward, where any grid with a multiple of 3 blue squares is a gain of the number of squares and every other grid is a loss of the number of squares. For example, if the displays have 5, 6, 10, 13, and 18 blue squares, the rewards would be -5, 6, -10, -13, and 18. The time constraint is how much time the user has to make their decision before the grids change to new displays with a new number of blue squares. We collected 100 data samples for each of 5 time constraints: 3, 5, 10, 15, and 20 seconds.
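The counting game's payoff rule is simple to state in code. A minimal sketch (the function name is ours, not taken from the game's implementation):

```python
def grid_reward(n_blue):
    """Reward for choosing a grid with n_blue blue squares: a gain of
    n_blue if it is a multiple of 3, otherwise a loss of n_blue."""
    return n_blue if n_blue % 3 == 0 else -n_blue

# The example from the text: displays with 5, 6, 10, 13, and 18 blue squares.
print([grid_reward(n) for n in [5, 6, 10, 13, 18]])  # [-5, 6, -10, -13, 18]
```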
The second game is a variation of blackjack (Figure 2), where the user has a deck of cards laid out in front of them and knows the final value of the opponent's cards. The player has already been dealt two random cards and must decide whether to draw a third, knowing that if their final score (the sum of their card values) is greater than 21, they receive a negative reward. If it is greater than the opponent's and less than or equal to 21, they collect a positive reward, and a smaller positive reward if their score is equal to the opponent's. The player received a reward of 0 if they scored lower than their opponent; in that case, they had the option to play further, though any further moves were not considered in our data since the player had already seen the deck. For this game, we collected data for time constraints of 2, 5, 10, and 30 seconds, with over 150 runs of the game for each constraint; the data consisted of the expected reward of each decision the player could make and the reward of the decision the player chose.
We designed another game, the grid game (Figure 3), with a grid of squares, where the user moves from square to square within a confined space. Each square on the grid has a value between -1 and 1, exclusive. The color of each square was decided using an RGB function of the square's value, where positive values intensify a green color and negative values intensify a red color. The user must get as many points as possible by moving along an acyclic path with the highest total reward; the user cannot revisit a square they have already visited. The user begins on the bottom-left square of a 10 by 10 grid, which has a reward of 0, and must reach the top-right square, which has a reward of 100. This was a long-horizon game, and we collected data for 3, 5, 10, 15, and 25 seconds.
To investigate the effects of an information constraint on decision-making, we designed a hover game with 3 horizontal panels, each of which contains a block which is constantly moving a random step in either horizontal direction. The game begins with all of the blocks hidden. The user may hover over any panel and view the current location of the block in that panel, but at the cost of 1 point. The user may click to move a block back to the center of the panel, at a cost of 10 points. If the block hits either the right or left edge of the panel, the user loses 100 points. The game ends after a certain number of steps. This game demonstrates how information constraints can change the way in which a player makes decisions.
2.1 Value Iteration and Reinforcement Learning
For the one-horizon games, blackjack and the counting game, the expected rewards were directly the values we used in Metropolis-Hastings. For the long-horizon grid game, we generated a new grid of values in place of the grid of raw rewards for each experiment. We did this so that the value of each move reflects its effect on all future rewards; in other words, the user is incentivized to think ahead and plan their choices rather than treat each move as a separate, unconnected decision. We then used these values with Metropolis-Hastings. Before using this data, we subtracted the mean of the values in each experiment from each value in the experiment for numerical stability.
The optimal path was calculated using value iteration. Each tile was given a value, initially set to its reward, i.e., the value of the corresponding square in the grid game. Then, for each state $s$, the best state to move to, out of the immediately reachable states $s'$, was calculated using

\[ V(s) \leftarrow R(s) + \gamma \max_{s'} V(s') \]

where $R(s)$ is a function that retrieves the original reward at state $s$, $\gamma$ is the discount factor, which we set to 0.999, and $V(s')$ is a function that retrieves the current value at state $s'$. The process is repeated over the entire grid until it converges, i.e., until the change in the tile values falls below a certain threshold, which we set to 0.001. At this point, the program may draw the optimal path, which is obtained by repeatedly traveling to the neighboring tile with the greatest value.
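The value-iteration procedure described above can be sketched as follows. This is our own illustrative implementation, not the study's code; it assumes four-connected moves on the grid:

```python
import numpy as np

def value_iteration(rewards, gamma=0.999, tol=1e-3):
    """Grid value iteration: each tile's value starts at its reward and
    is repeatedly updated from its best four-connected neighbor until
    the maximum change falls below tol."""
    H, W = rewards.shape
    V = rewards.copy().astype(float)
    while True:
        V_new = np.empty_like(V)
        for i in range(H):
            for j in range(W):
                neighbors = [V[a, b]
                             for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                             if 0 <= a < H and 0 <= b < W]
                V_new[i, j] = rewards[i, j] + gamma * max(neighbors)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The optimal path is then read off by greedily stepping to the highest-valued unvisited neighbor.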
We also collected data for another set of experiments, where we allowed the user to play the game with a stopwatch running, rather than with a time constraint. The objective was not to reach the goal in a certain amount of time, but to reach the goal by taking the optimal path. From this, we aimed to determine the average time needed for the user to discern the optimal path.
The Metropolis-Hastings algorithm samples variables from an input distribution $p$. The algorithm takes in a starting guess $x$ for those variables and computes another value $x'$ for each variable that is a random noise away, using a constant step size modulated by a Gaussian random variable. If the probability of the new value under the given distribution is greater than that of the original, the new value is added to a set of accepted values. If it is less, it is added with probability $p(x')/p(x)$. If the new value is accepted, it becomes the new current value. Otherwise, the current value is added to the set and remains the current value. The algorithm repeats for a certain number of samples and then cleans the set. For the counting game and the grid game, we used an array of 100,000 samples, the first 10,000 of which were discarded as burn-in. We then took the first of every 90 samples to get 1,000 samples with lower autocorrelation. To clean the set for the game of blackjack, we took the last 25,000 of our (between 50,000 and 100,000) samples and used the first of every 50 samples to get 500 clean samples. The algorithm then averages all the values in the clean set to find the expected value of each of our variables.
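A compact sketch of this random-walk procedure (our own illustration, not the study's code; `log_p` is the log of the target density, and the step, sample, burn-in, and thinning parameters are placeholders):

```python
import math
import random

def metropolis_hastings(log_p, x0, step=0.5, n_samples=100_000,
                        burn_in=10_000, thin=90):
    """Random-walk Metropolis-Hastings: propose a Gaussian step away
    from the current value, accept with probability min(1, p(x')/p(x)),
    then discard burn-in, thin the chain, and average the clean set."""
    x, samples = x0, []
    for _ in range(n_samples):
        x_new = x + step * random.gauss(0.0, 1.0)
        # Acceptance test done in log space: log(u) < log p(x') - log p(x).
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)
    clean = samples[burn_in::thin]      # discard burn-in, then thin
    return sum(clean) / len(clean)      # expected-value estimate
```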
In an alternative method, the algorithm traversed the set in the same way as before, but stored only the variable set which produced the maximum log-likelihood output for the function, out of the values that it considered while traversing the possibilities for each variable.
The first model we tested is the noisy rational model. The log-likelihood function (which takes each reward $R$ as an input and outputs the natural log of the probability that the player made a certain decision) for the noisy rational model can be expressed as

\[ \log P(\text{choice } i) = \beta R_i - \log \sum_{j=1}^{n} e^{\beta R_j} \]

where $n$ is the number of choices, $R_j$ is the reward for each choice (the points gained or lost), and $\beta$ is the temperature coefficient.
Intuitively, with beta set to 1, the exponential makes all terms positive, amplifies large positive rewards, and shrinks the absolute value of large negative rewards. The beta value then adds noise to the function: a small beta makes the R terms less relevant and brings the probability of each option closer to uniform, while a large beta makes the R terms more relevant and increases the probability of the best options. For the blackjack game, the reward values for each trial were set relative to a reference point of the worst option; that is, in each trial, the value of the worst option was subtracted from each of the reward values.
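The noisy rational log-likelihood can be computed in a numerically stable way by subtracting the maximum before exponentiating. A sketch (our own code; `rewards` holds the $R$ values for the available options):

```python
import numpy as np

def noisy_rational_loglik(rewards, choice, beta):
    """Log-probability that the player picks option `choice` under the
    noisy rational (softmax) model: beta*R_i - log(sum_j exp(beta*R_j))."""
    z = beta * np.asarray(rewards, dtype=float)
    z -= z.max()                      # stabilize the exponentials
    return z[choice] - np.log(np.exp(z).sum())

# beta -> 0 makes choices uniform; a large beta concentrates on the best option.
print(round(float(np.exp(noisy_rational_loglik([1.0, 2.0, 3.0], 2, 0.0))), 3))  # 0.333
```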
The log-likelihood function for the threshold model can be expressed as

\[ \log P(\text{choice } i) = \beta \min(R_i, \tau) - \log \sum_{j=1}^{n} e^{\beta \min(R_j, \tau)} \]

where $\tau$ is the threshold, so that each option with a reward greater than $\tau$ is considered equivalent.
This takes the same function as before, but adds the condition of a threshold, where options above a certain value are considered “good enough” and are seen as equivalent. A real-life example of this model would be a situation where a driver is faced with the possibility of an accident, and has little time to decide how to act. They may drive off the road rather than change lanes, because with such little time, this action is “good enough.” As the threshold value tau decreases, more and more (worse) options are considered good enough, and as tau increases, the player’s discretion becomes finer. We considered two versions of the threshold model: one where we set beta to a given constant and varied only tau, and one where we varied both tau and beta. For the blackjack and counting games, the reward values for each trial were set to a reference point of the worst option. This means that, in each trial, the value of the worst option was subtracted from each of the reward values.
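Under this reading — rewards above the threshold are clipped to it, so all "good enough" options look identical — the model can be sketched as (our own implementation, assuming clipping at tau):

```python
import numpy as np

def threshold_loglik(rewards, choice, beta, tau):
    """Threshold ('good enough') model: rewards above tau are clipped
    to tau before the softmax, so all such options look equivalent."""
    clipped = np.minimum(np.asarray(rewards, dtype=float), tau)
    z = beta * clipped
    z -= z.max()                      # stabilize the exponentials
    return z[choice] - np.log(np.exp(z).sum())
```

With tau below every reward, all options become equally likely; as tau grows past the best reward, the model reduces to the noisy rational model.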
The log-likelihood function for the prospect theory model can be expressed as

\[ \log P(\text{choice } i) = v(R_i) - \log \sum_{j=1}^{n} e^{v(R_j)} \]

where the value function $v$ is defined as

\[ v(x) = \begin{cases} x^{\alpha}, & x \ge 0 \\ -\lambda(-x)^{\gamma}, & x < 0 \end{cases} \]

where $x$ is the reward; $\alpha$, $\gamma$, and $\lambda$ are all variables being sampled, and $\lambda$ is positive.
These conditions were set by having the function output a log-likelihood of negative infinity whenever alpha, gamma, or lambda fell outside its given range. Since the function takes some root of all positive rewards, positive rewards seem larger than their actual value when the reward is small, and smaller than their actual value when the reward is large. Thus, small rewards seem important, but distinguishing between two larger rewards becomes difficult. We see a similar root function for the negative rewards, but with an added coefficient lambda that increases the absolute value of the negative reward. Thus, between a loss and a gain of equal value, the loss appears larger than the gain. This can be understood with a real-life example, where a person must choose between a 90% chance of gaining $1000 and a 100% chance of gaining $900. Prospect theory holds that humans are risk-averse over gains and would choose the sure gain. However, if the gains were instead losses, humans are risk-seeking and would take the gamble, hoping to avoid the loss. Here, we considered x to be relative to a given reference point. This reference point was the average of the two options for the blackjack game. For the counting game, the model was tested with a reference point of 0, a reference point at the average of all possible values in the set, and a reference point at the average of the values for each individual experiment. We found that the reference point of 0 gave the most accurate results.
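The asymmetry between gains and losses is easy to see numerically. A sketch of the value function (our own code; the parameter values 0.88 and 2.25 are the classic Tversky-Kahneman estimates, used here purely for illustration):

```python
def prospect_value(x, alpha, gamma, lam):
    """Prospect-theory value function: a concave root for gains and a
    steeper, lambda-weighted root for losses, with x taken relative to
    the reference point."""
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** gamma)

# Losses loom larger than equal gains when lambda > 1:
print(prospect_value(100, 0.88, 0.88, 2.25))   # ~57.5
print(prospect_value(-100, 0.88, 0.88, 2.25))  # ~-129.5
```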
2.3 Cross Validation
For each of the models, we used a k-fold cross-validation method on each of the time constraints: we randomized the data, divided it into k groups, and rotated through the process of using each group as the validation set and the others as the training set. We calculated the values of the unknown variables of the function on the training set, then plugged those variable values into the model with the validation data set to calculate the log-likelihood of our data given those values. This gives an approximation of the relative accuracy of the models we used. From this, we took the average log-likelihood for each time constraint, and then used the averages to compute the overall log-likelihood for the model. Throughout this process, we used the log-likelihood in place of the likelihood so that the values would be numerically easier to compute and compare.
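The split-and-rotate procedure can be sketched as follows (an illustrative helper of our own, not the study's code):

```python
import random

def k_fold_splits(data, k):
    """Shuffle the data, divide it into k groups, and rotate each group
    through the role of validation set, yielding (train, validation)
    pairs."""
    data = data[:]                      # leave the caller's list intact
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

Each model would be fit on `train` via Metropolis-Hastings and scored by its log-likelihood on the held-out fold.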
2.4 Information Constraint
The game lasts for 300 steps, where each block moves once during every step. Each block moves by a uniform random integer between -15 and 15, inclusive. The distance between the right edge of the left wall and the left edge of the right wall is 480. Crashes are detected when the center of the piece reaches the inner edge of either wall.
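These dynamics can be simulated in a few lines. This is our own sketch with assumed conventions: positions are measured from the panel center, the policy sees the true positions directly (the real game hides them unless hovered), and we assume a block resets to the center after a crash:

```python
import random

def simulate_hover_game(policy, n_steps=300, width=480, seed=None):
    """Minimal hover-game simulator: 3 blocks random-walk by a uniform
    integer in [-15, 15] each step; hovering costs 1 point, clicking
    recenters a block for 10 points, and a crash (reaching either wall)
    costs 100 points."""
    rng = random.Random(seed)
    half = width / 2
    pos = [0.0, 0.0, 0.0]               # block positions relative to center
    score = 0
    for _ in range(n_steps):
        for i in range(3):
            pos[i] += rng.randint(-15, 15)
        hover, click = policy(pos)      # policy returns (panel hovered, panel clicked)
        if hover is not None:
            score -= 1
        if click is not None:
            pos[click] = 0.0
            score -= 10
        for i in range(3):
            if abs(pos[i]) >= half:     # center reached the inner edge of a wall
                score -= 100
                pos[i] = 0.0            # assumption: block resets after a crash
    return score
```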
3.1 Optimal Parameters for Each Time Constraint
For the single-parameter mathematical models, there are distinct trends when we graph the parameter over the possible time constraints. The noisy rational model multiplies the reward by beta. A beta greater than 1 therefore amplifies the value of the reward, meaning that the player is more likely to choose the better decision, while a beta less than 1 shrinks each reward value, making the player's decision more random. When the player has a shorter amount of time to play the game, beta should be smaller, and when the player has more time to consider the possible decisions, beta should be larger. Similarly, in the threshold model, any reward above some value tau is considered good enough. A smaller tau means the "good enough" bar is lower, while a larger tau means the reward must be substantial for a decision to be considered good enough. Again, when the player must make a decision with little time, the tau value should be lower.
For the counting game and blackjack, we graphed the values of beta for each time constraint for the noisy rational function and tau for each time constraint for the threshold function. We used the clean set from Metropolis-Hastings, and took the average value of the variable as well as the standard deviation of the clean set for the graphs.
3.2 k-Fold Validation Data
There were 5 subsets for the counting game and blackjack and 6 for the grid game. Each subset was used as the validation set exactly once. For each model, the algorithm was used for all the time constraints, and the maximum log-likelihood was obtained.
3.3 Time Needed to Discern Optimal Path
For this data, the user played the grid game until they had achieved the optimal path 30 times. This graph is a histogram of the time it took to reach the optimal path during each trial.
3.4 Hover Game Data
To compare the performance of the AI agent against the performance of a human user, we let the AI agent play 100 games. We then had two users, one who was experienced with the game and had played it many times before, and one who had never played it before, play the game a number of times. For all the tests, we collected the average score at each step and the standard deviation of the scores achieved at each step.
In the blackjack game, the prospect theory model was the best model. This makes sense because the prospect theory model is held, in theory, to be the most accurate model of human behavior. However, there was a concern that, with the data set given, the prospect theory model, with its many variables, might overfit to the training set — that is, adjust to the exact noise of the training data rather than the true function. This would reduce the likelihood of the validation set. Fortunately, this did not occur.
The noisy rational model did produce slightly better results than the combined threshold model in terms of validation likelihood, but because the two values are so similar, they may be considered approximately equal. This conclusion is reinforced by the tau and beta outputs for the combined threshold model. Because the combined threshold model subsumes both the noisy rational model and the threshold model, its likelihood should be equal to or greater than that of both. The combined threshold model attains the same likelihood as the noisy rational model only if it maximizes the threshold and outputs the same beta as the noisy rational model, since this is equivalent to the best possible beta with no threshold. This is exactly what we observe in the data for the combined threshold and noisy rational models.
This data also explains why the threshold model is the least accurate model out of the four considered. The threshold model is not entirely relevant in this situation, possibly because the player is only choosing between two options, so increasing or reducing the positive reward value of the optimal option may not significantly change the player’s decision.
When we consider specifically the parameter-versus-time graph for the noisy rational function, the beta values for 5, 10, and 30 seconds are within one standard deviation of each other. This implies that although the player makes worse decisions when given two seconds, they have sufficient time to consider each option to the maximum of their ability within 5 seconds, and do not improve between 5 and 30 seconds. Note that the player is still not making perfect decisions at this point: even given a significant amount of time, it is nowhere near enough to mentally calculate the exact expected value of each possible decision, and such a mental calculation would be extremely difficult.
The parameter-versus-time graph for the threshold function is surprising, because the threshold for 30-second games is actually less than the threshold for 10 seconds, a difference of more than an order of magnitude. However, the threshold values are close to zero for each of the models, implying that the players make almost random decisions. This may be partly because the beta value was fixed at 1.2 to calculate the tau values in this model, while the true beta was closer to 0.2 for the time constraints given. The smaller beta would account for the element of randomness in the game. When the Metropolis-Hastings algorithm is run to optimize both tau and beta in the threshold function, the beta values fall within a standard deviation of those in the noisy rational function, and the tau values are maximized. Thus, the threshold model is not a relevant model for describing the blackjack game.
4.2 Counting Game
Consistent with the results of blackjack, the prospect theory model was the best, and overfitting did not occur. One discrepancy was that the combined model did produce slightly greater log-likelihoods than the noisy rational model. This is expected, as the combined model trains both beta and tau and is therefore more likely to find a parameter set that yields a greater log-likelihood. The threshold model was the worst in the counting game as well.
The beta and tau vs time constraint graphs demonstrate an increasing value for each variable with time. Moreover, the standard deviations ensure that the differences between these variable values at each time constraint are significant, as none of the points are within one standard deviation of each other. This makes sense for beta, because as the time constraint increases, the user makes more informed decisions that are less random, and therefore a greater beta value gives more weight to these decisions.
The tau graph is also reasonable, as allocating more time allows the user to make favorable moves more often. With these moves come higher scores, which means that the threshold at which the user’s score is considered “good enough” must increase. To interpret this graph, it is important to remember that we subtracted the minimum possible reward in each experiment from every value in each experiment. Rather than the threshold being these tau values, the threshold is actually tau greater than the minimum for each trial. Thus, since the threshold is negative for 3 seconds, this means that any value is good enough, or that play at 3 seconds is basically random. This makes sense, as 3 seconds is generally not enough time to distinguish the best choice. Moreover, since the threshold is about 55 for 20 seconds, no reward is considered “good enough.” This is because the lowest possible reward in the counting game is -17, which means the true threshold is at least 38, an unattainable score. In other words, 20 seconds is generally enough time to distinguish the best choice.
From this graph it can be determined that while the threshold model is the least accurate of the four models we tested, it is still applicable in this game and accurately demonstrates the numerical difference in performance with varying time constraints.
4.3 Grid Game
The results of the grid game further reinforced that the prospect theory model is the most accurate in describing human decision-making in games. The results also demonstrate that the threshold model is less accurate. Unlike the other two games, however, the combined threshold model was not at least as good as the threshold and noisy rational models. This is due to overfitting, which occurred when the thresholds were maximized while training the combined model.
The data from the histogram is not informative, as too many experiments were discarded for the time results to be conclusive. However, with a changed experimental procedure, this could generate useful results. One possible procedure would be to play the game with a certain time constraint, increasing the time constraint until optimal scores are consistently achieved. This could more accurately determine the time at which a human user can play optimally.
4.4 Hover Game
For the hover game, we have not yet created a mathematical method of analyzing the actions that the human players took. However, from the data we have of the AI playing the game, we can conclude that the AI hovers over some panel at almost every step and that, in the hundred trials we ran, it never let any ball go out of bounds. From this, we deduce that a relatively strong strategy is to hover over one panel at every step, rotating between the panels, and to click a ball whenever it could reach the end of its panel within three steps, ensuring it returns to the center. This could be further optimized by not hovering over balls that are known to be close to the center.
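The rotate-and-click rule deduced above can be sketched as a simple policy. This is our own illustration with assumed parameter names; it reads only the position of the panel it currently hovers over, and clicks when the worst-case drift over the lookahead could cause a crash:

```python
def make_rotating_policy(width=480, step_max=15, lookahead=3):
    """Hover over one panel per step in rotation; click the hovered
    block if it could reach an edge within `lookahead` steps, given a
    maximum per-step drift of `step_max`."""
    half = width / 2
    state = {"t": 0}

    def policy(positions):
        i = state["t"] % 3          # rotate through the three panels
        state["t"] += 1
        seen = positions[i]         # hovering reveals this panel's block
        # Click (10 points) if a crash (100 points) is possible within lookahead.
        if abs(seen) + lookahead * step_max >= half:
            return i, i             # (panel hovered, panel clicked)
        return i, None

    return policy
```

Paying 10 points to avoid a possible 100-point crash is the trade-off that makes this heuristic effective.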
By comparing the performances of the inexperienced and experienced humans, it is clear that playing the game many times prior to data collection allowed for significant improvement. The experienced user achieved an average score more than 240 points higher, suggesting that human users learn to play more rationally with time and experience.
The human player consistently scored worse than the AI agent even after gaining experience; this may be for several reasons. One is that the time constraints of the game prevented the human player from hovering over a ball at every step, an opportunity the AI had. The human played under the condition that the balls moved one step every 0.3 seconds and could not click or hover over the panels more than once in that interval. Though a version of the game was modified to move the balls every second, it was not used in data collection because such a game would be too time-consuming to play over multiple trials. Another explanation may simply be that the human had developed a less effective strategy, or that their judgment of when to click a ball was less precise than the AI's.
We found that, for time-constrained games, the prospect theory model best describes human decision-making. Our results were empirically corroborated across hundreds of trials in two one-horizon games and one long-horizon game. We further concluded that the threshold model is the least accurate and least applicable of the four models we tested. Nevertheless, we found significant trends in the beta and tau values from the noisy rational and threshold models, numerically demonstrating the effect of a time constraint. In the future, we would like to corroborate our results with more experiments across other games. We also plan to see how real robots perform with these different models by applying them to games with physical movements.
For the information-constrained game, the expected reward for each move was unknown, so we were not able to fit it to a mathematical model. However, we were able to set up a playable user interface, train an AI agent, and compare the respective scores. Moreover, we gathered data including the current position of the ball in all three panels, the score, and the action at each step. This data could be used to create a mathematical model that best fits these types of games. We would like to create a reward function that calculates the overall value of each possible move the user could make, so that we can compute the rationality of the user's behavior. Then, we could also consider the effects of different time constraints on the hover game. The results of the hover game can also be used to demonstrate the suboptimality of humans in other tasks, such as a human checking their mirrors while driving.
We thank Inder Bhasin and Vijay Talati for playing our games and helping us collect data. We would also like to acknowledge Cindy Nguyen and Professor Tsachy Weissman for organizing the SHTEM program, and Erdem Biyik and Professor Dorsa Sadigh for mentoring and advising us throughout this project.
 Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Halpern, J. Y., Pass, R., & Seeman, L. (2014). Decision Theory with Resource-Bounded Agents. Topics in cognitive science, 6(2), 245-257.
Hartley, T., Mehdi, Q., & Gough, N. (2004). Applying Markov Decision Processes to 2D Real Time Games. University of Wolverhampton, School of Computing and Information Technology. Retrieved from http://hdl.handle.net/2436/31518.
Luce, R. D. (2012). Individual choice behavior: A theoretical analysis. Courier Corporation.
MacKay, D. J. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Metropolis-Hastings algorithm. (n.d.). Retrieved from https://python4mpia.github.io/fitting data/Metropolis-Hastings.html.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-Validation. In: Liu, L., & Özsu, M. T. (eds.) Encyclopedia of Database Systems. Springer, Boston, MA.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Tversky, A., & Kahneman, D. (2000). Advances in Prospect Theory: Cumulative Representation of Uncertainty. Choices, Values, and Frames, 44-66. doi:10.1017/cbo9780511803475.004.