Nathan’s Status Report for 10.21.23

As I mentioned in my previous status report, my goal for this week was to finish the custom loss function implementation, then work toward getting MCTS training data generation working, given that the simulation framework was already in place.

With regard to the former goal, that work is complete. It turned out I had misunderstood my needs: I do not actually need a custom loss function, as the two networks are trained with mean-squared error (MSE) and binary cross-entropy (BCE) loss. Nevertheless, I have built the framework for training the two networks (policy and value), in addition to the data-generation section of the MCTS code (where board states and MCTS visit counts are stored as .npz files).
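As a concrete (and heavily simplified) illustration of that framework, the sketch below pairs MSE against game outcomes for the value target with BCE against normalized visit counts for the policy target, and saves self-play batches as .npz files. The single-layer stand-in networks, the 19x19-plus-pass output size, and names like train_step and save_self_play_data are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in networks for illustration; the real architectures are convolutional.
policy_net = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 19 * 19 + 1))  # one output per point plus pass
value_net = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 1), nn.Tanh())

policy_loss_fn = nn.BCEWithLogitsLoss()  # policy trained against normalized MCTS visit counts
value_loss_fn = nn.MSELoss()             # value trained against final game outcomes

def save_self_play_data(path, boards, visit_counts, outcomes):
    """Store one batch of self-play data as a compressed .npz file."""
    np.savez_compressed(path, boards=boards, visits=visit_counts, outcomes=outcomes)

def train_step(boards, visit_counts, outcomes, policy_opt, value_opt):
    """One optimization step for each network over a batch of self-play data."""
    boards = torch.as_tensor(boards, dtype=torch.float32)
    targets = torch.as_tensor(visit_counts, dtype=torch.float32)
    targets = targets / targets.sum(dim=1, keepdim=True)   # visit counts -> probability targets
    outcomes = torch.as_tensor(outcomes, dtype=torch.float32).unsqueeze(1)

    policy_opt.zero_grad()
    p_loss = policy_loss_fn(policy_net(boards), targets)
    p_loss.backward()
    policy_opt.step()

    value_opt.zero_grad()
    v_loss = value_loss_fn(value_net(boards), outcomes)
    v_loss.backward()
    value_opt.step()
    return p_loss.item(), v_loss.item()

# Example wiring (hypothetical):
# policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
# value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
```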

With regard to the latter, there was more involved than I originally anticipated. While I have built all the Go gameplay functionality needed for training, I have not yet finished the code for the MCTS expansion phase, where the leaf to expand is chosen and its children are generated. That said, all other functionality is in place, and I am working to finish the expand() function by Monday in order to stay ahead of schedule. A brief schematic of the training structure is shown below.
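For reference, here is a rough sketch of what the selection and expansion phases typically look like. The Node class, the board's play() method, and the PUCT-style scoring are assumptions for illustration, not the implementation I am still finishing.

```python
import math

class Node:
    """Hypothetical MCTS node; the real data structures may differ."""
    def __init__(self, state, prior=0.0, parent=None):
        self.state = state          # Go position at this node
        self.prior = prior          # policy-network prior for the move leading here
        self.parent = parent
        self.children = {}          # move -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def is_leaf(self):
        return not self.children

def select_leaf(root, c_puct=1.5):
    """Walk down the tree by a PUCT-style score until an unexpanded node is reached."""
    node = root
    while not node.is_leaf():
        parent = node
        node = max(
            parent.children.values(),
            key=lambda c: (c.value_sum / (c.visit_count + 1e-8))
            + c_puct * c.prior * math.sqrt(parent.visit_count) / (1 + c.visit_count),
        )
    return node

def expand(node, legal_moves, policy_priors):
    """Create one child per legal move, seeded with the policy network's priors."""
    for move in legal_moves:
        child_state = node.state.play(move)   # assumes the board exposes a play() method
        node.children[move] = Node(child_state, prior=policy_priors[move], parent=node)
```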

Once the expand() function is finished, the next step is finding a dataset of expert Go matches to use as pre-training data for the value network before self-play begins. While this is not strictly necessary, and self-play alone could be used, giving the value network a strong foundation from expert-generated games raises the quality of the initial training data much faster. My goal is to have expand() finished and this dataset found by the end of the week. If that goes according to plan, I will be able to begin network training and MCTS data generation immediately afterwards.
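If a suitable dataset is found, the warm-up itself should be a fairly standard supervised loop along these lines. The (board, outcome) tensor encodings, batch size, and function name are placeholders until the dataset and its format are settled.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def pretrain_value_net(value_net, boards, outcomes, epochs=5, lr=1e-3):
    """Supervised warm-up: fit the value net to (board, final result) pairs from expert games.

    `boards` is assumed to be an (N, 1, 19, 19) float tensor of encoded positions and
    `outcomes` an (N, 1) tensor of +1/-1 results; both encodings are placeholders.
    """
    loader = DataLoader(TensorDataset(boards, outcomes), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for b, z in loader:
            opt.zero_grad()
            loss = loss_fn(value_net(b), z)
            loss.backward()
            opt.step()
```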

To accomplish my task, I have had, and will continue, to learn a few new tools. While not really a tool, I had to fully understand Monte Carlo Tree Search in order to program it correctly. More importantly, I had never used PyTorch before or worked with convolutional neural networks. While I have worked with similar technologies (fully connected deep neural networks and TensorFlow), I have had to do a lot of research into how to use them most effectively.
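As a small example of what the PyTorch side of this looks like, here is a toy convolutional trunk over a single-plane 19x19 board encoding. The layer sizes and the one-logit-per-point-plus-pass head are illustrative only, not the final architecture.

```python
import torch
import torch.nn as nn

class BoardCNN(nn.Module):
    """Toy convolutional network over a one-plane 19x19 board encoding (illustrative only)."""
    def __init__(self, channels=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 19 * 19 + 1)  # one logit per point plus pass

    def forward(self, x):
        x = self.trunk(x)
        return self.head(x.flatten(start_dim=1))

# Quick shape check on a batch of four encoded boards.
logits = BoardCNN()(torch.zeros(4, 1, 19, 19))
assert logits.shape == (4, 19 * 19 + 1)
```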

Nathan’s Status Report for 10.7.23

Now that we have fully settled on our transition to Go and how it will be implemented, I have been able to focus entirely on my part of the project. With our team presenting our design on Monday, and my research on AlphaZero and MuZero concluding last week, I had the rest of the week to focus solely on implementation.

Almost all of my time this week was spent on the Go simulation framework, as well as beginning to figure out how to set up the reinforcement learning architecture. With regard to the former, I worked with Hang to make sure we have a clear plan for passing board information from the Arduino to my backend engine. From there, I implemented a backend representation of the board, functions that allow an outside controller (in this case the engine) to make a move for simulation purposes, and functions that update the position based on the information conveyed by the physical board. Along with basic rule checking (such as determining whether the game is over), which I have also implemented, this is effectively all I need to move on to the reinforcement learning architecture.

The real challenge here is the custom loss functions defined in our design proposal (expected-result optimization and policy-vector normalization). I have never worked with custom loss functions in Python before, so I did a large amount of research into different ways to accomplish this. I settled on PyTorch, as it is not only the current industry consensus for deep learning frameworks but also extremely well supported in Python. I have started, but not completed, scripting these loss functions; I am taking my time to make sure they are not only correct but also efficient, since, in conjunction with the MCTS simulation, training times could balloon rapidly if either is implemented inefficiently.
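A compressed sketch of that board interface is below. The class and method names are placeholders for whatever the real backend exposes, and capture and ko handling are omitted entirely.

```python
EMPTY, BLACK, WHITE = 0, 1, 2

class GoBoard:
    """Minimal stand-in for the backend board representation (illustrative only)."""
    def __init__(self, size=19):
        self.size = size
        self.grid = [[EMPTY] * size for _ in range(size)]
        self.to_move = BLACK
        self.consecutive_passes = 0

    def engine_move(self, row, col):
        """Let the engine place a stone during simulation."""
        self._place(row, col)

    def sync_from_hardware(self, new_grid):
        """Overwrite the backend state with the position reported by the physical board."""
        self.grid = [list(r) for r in new_grid]

    def pass_turn(self):
        self.consecutive_passes += 1
        self.to_move = WHITE if self.to_move == BLACK else BLACK

    def game_over(self):
        """Basic end-of-game check: two passes in a row."""
        return self.consecutive_passes >= 2

    def _place(self, row, col):
        if self.grid[row][col] != EMPTY:
            raise ValueError("point occupied")
        self.grid[row][col] = self.to_move
        self.to_move = WHITE if self.to_move == BLACK else BLACK
        self.consecutive_passes = 0
```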

In the next week, I plan to finish these custom loss functions, then work on getting the in-training MCTS simulations running. With the simulation framework already built, this shouldn’t require too much time.

Team Status Report for 9.30.23

The major risk we are currently facing is making sure we successfully transition our project’s goals and requirements without losing a large amount of progress. As we explain below, we are transitioning the project from a website for playing Mancala against an engine to a physical Go board that displays engine recommendations and stores game histories locally for further analysis. This creates a hardware requirement (the physical board) for our project. Fortunately, a large amount of our research is applicable to the new project, and some of the software we had already begun to write can be adapted to meet the new needs. Nevertheless, we need to make sure we are not over-committing, and we have a plan to catch up on the small amount of time now lost. To do so, we have defined a new MVP, adjusted our human-computer interaction requirements, made plans to adapt as much of our pre-existing work as possible, and adjusted our schedule (condensing some earlier steps) so that, while we will have to commit to extra work for a few weeks, we will not be in a crunch at the end.

As mentioned above, we have made monumental changes to our project and its design. After receiving helpful feedback from TAs and students on our proposal, we realized that our use case was not strong enough and our project did not have the requisite breadth. These two flaws led us to change our project’s focus to Go instead. Because of its competitive gaming community, there is much more demand for a Go training product, and the Go equivalent of a chess DGT board (one of the services our project will provide) has not yet been created. The switch also incorporates a hardware component that records players’ games and allows our website component to show analysis of games already played. These changes have forced us to rearrange our schedule a bit (as seen below), but that, combined with the other mitigating actions described above, will allow us to stay on target.

New Schedule:

Nathan’s Status Report for 9.30.23

As mentioned in our team status report for the week, we have transitioned our project away from Mancala and to the game Go instead. Fortunately, the division of labor remains largely the same, and I am still broadly responsible for training a reinforcement-learning engine that will eventually be used to give move suggestions and positional evaluations to our users.

Accordingly, almost all of the research I did last week is still applicable: the same self-play techniques can be used and, in fact, have been proven to work in the cases of AlphaZero and MuZero. After making the transition to Go this week, I had to catch up quickly on the rules and gameplay, but with that done, along with the two above-linked papers, the research phase of my project has come to a close.

Of course, with the design presentations coming up next week, a good amount of my time this week was devoted to preparing for that; the rest was spent building the platform for the reinforcement learning. The current consensus for building a strong Go engine is a combination of deep learning and Monte Carlo Tree Search (MCTS). MCTS works by using self-play to simulate many game continuations from a given position and choosing the move with the best overall simulated outcome. I have started building the framework to perform these simulations as quickly as possible (holding game state, allowing the candidate engine to make moves against itself and return the new board, and so on).
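To make the idea concrete, the sketch below uses flat Monte Carlo (no tree, just repeated random playouts per candidate move), which captures the core “simulate many continuations and keep the best move” principle without the full selection, expansion, and backup machinery of MCTS. All of the board methods it calls are assumptions for illustration.

```python
import random

def choose_move_flat_mc(board, playouts_per_move=50):
    """Flat Monte Carlo sketch of the idea behind MCTS: for each candidate move,
    simulate many random continuations and keep the move with the best average result.

    `board` is assumed to expose copy(), legal_moves(), play(), game_over(), and result().
    """
    best_move, best_score = None, float("-inf")
    for move in board.legal_moves():
        total = 0.0
        for _ in range(playouts_per_move):
            sim = board.copy()
            sim.play(move)
            while not sim.game_over():
                sim.play(random.choice(sim.legal_moves()))
            total += sim.result()            # +1 win, -1 loss from the mover's perspective
        if total / playouts_per_move > best_score:
            best_move, best_score = move, total / playouts_per_move
    return best_move
```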

With regard to classwork helping me prepare for this project, I think the two ECE classes that helped the most were 18213 and 18344. I have not taken any classes in reinforcement learning or machine learning in general, but the research I did in CyLab with ECE Prof. Vyas Sekar helped a huge amount, both in the subject matter of the research (deep learning) and in the experience of reading scholarly papers to fully understand techniques you are considering using. What 18213 and 18344 provided was the “correct” way of thinking about setting up my framework: I need my simulations to be as efficient as possible while maintaining accuracy, and I need my system to be as robust as possible, since I will need to make frequent changes and tune parameters. These, combined with the research papers read last week and the two papers linked above, influenced my portion of the design the most.

Finally, in the next week I plan to finish the Go simulation framework and begin setting up the reinforcement learning architecture, so that training can begin the week after. MCTS simulation is quite efficient, but with the hard limit on the computational resources I have, allocating time properly is vital.

Team Status Report for 9.23.2023

One of our biggest risks currently is the possibility that the planned minimax strategy will not prove effective as an initial opponent for the self-play RL model (be it too strong or too weak). If it is too weak, the platform we are building it on can be extended to look more than two ply into the future; while this will increase training time (as the calculations will take longer to compute), it will provide a stronger opponent. On the other hand, if it proves too strong, we have other, even more basic strategies on standby, such as 1-ply maximization (maximize the number of stones captured in one move, ignoring possible responses) or even a random agent.
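For reference, the 2-ply idea looks roughly like the sketch below: choose the move that maximizes our score after the opponent’s best (for them, worst for us) reply. The game interface (copy(), legal_moves(), apply(), score()) is assumed for illustration and is not our actual Mancala code.

```python
def two_ply_minimax(game, player):
    """Pick the move that maximizes `player`'s score after the opponent's best reply.

    `game` is assumed to expose copy(), legal_moves(p), apply(move, p), and score(p).
    """
    opponent = 1 - player
    best_move, best_value = None, float("-inf")
    for move in game.legal_moves(player):
        after_ours = game.copy()
        after_ours.apply(move, player)
        replies = after_ours.legal_moves(opponent)
        if replies:
            # Opponent replies so as to minimize our resulting score.
            worst = min(_score_after(after_ours, reply, opponent, player) for reply in replies)
        else:
            worst = after_ours.score(player)
        if worst > best_value:
            best_move, best_value = move, worst
    return best_move

def _score_after(game, move, mover, scored_player):
    nxt = game.copy()
    nxt.apply(move, mover)
    return nxt.score(scored_player)
```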

With regard to changes, a possible problem pointed out during the presentation was that some variants of Mancala have been solved. While we had always planned around this, we had not made clear that the version we are building for our website is an unsolved variant (the seven-stone variant of Kalah Mancala). Some players use other ways to get around the solved aspect of the game, such as switching positions after the first move, but those add unnecessary complication, raising the barrier to entry, especially for younger players. This will not cause any increase in cost or changes to the system itself, but it does specify the requirements a bit better. Other than that, there have been no changes to the system or structure of the project.

For right now, everyone is on schedule, so no changes are necessary there.

The effect our project will have on public safety, the economy, or the environment is relatively minimal. Of course, we are using a small amount of computational power to train the RL model and maintain our servers, but in the grand scheme of things it is next to nothing. That said, our project certainly has a non-trivial social effect, and could improve mental health for some users as well. Multiplayer games are inherently social, and an online platform for them provides an outlet for users to connect with like-minded people; the fact that there is no major website dedicated to Mancala makes this all the more important. Beyond meeting new people and possible friends, our project would also allow friends to play each other directly; for friendships where it is difficult for the participants to see each other (long-distance and the like), this can help keep the relationship strong. Finally, while it may only matter in a small number of cases, the social interaction that comes from online gaming can make a significant difference to mental health. It is all too easy to shut yourself away and not interact with anyone, and as this effect compounds it becomes harder and harder to break out of. Online social interaction can be a small step in the right direction, and our platform could provide that.

Nathan’s Status Report for 9.23.2023

This week I did a combination of research on reinforcement learning and opponent/platform setup to enable training the RL model.

With regard to research, I want to understand as much as possible about reinforcement learning before I start the process of actually building a Mancala RL model. Through preliminary research, our group decided that a self-play method of training would be best, so I read a number of papers and tutorials on both the theory of self-play RL and the logistics of putting it into practice in Python. A few of the resources I used are listed below:

OpenAI Self-Play RL

HuggingFace DeepRL

Provable Self-Play (PMLR)

Towards Data Science

Python Q-Learning

In order to train the self-play RL model, I must have a competent opponent for the model to start playing against before it can train against previous iterations of itself. If I choose a starting opponent that is too strong, the model will not get enough positive reinforcement (it will almost never win); if I choose one that is too weak, the reverse is true. As such, we will start with a relatively simple minimax strategy that looks two “ply” (single player turns) into the future. To build this strategy, however, I need a platform for the game to be played on (so the RL model can play against the minimax opponent). This week I started building this platform, programming all game rules and actions along with a framework where two separate players can interact on the same board; a simplified sketch is shown below. I then implemented unit tests to make sure all game actions function as they should. With this in place, I have begun programming the minimax strategy itself. This means I am on schedule and will hopefully have the minimax opponent available to start training within the week.
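This sketch covers only sowing and the end-of-game check, leaving out captures and extra turns; it is illustrative only, not the project’s actual code.

```python
class MancalaBoard:
    """Minimal sketch of a Kalah-style board (seven-stone variant); not the project's real code."""
    def __init__(self, stones=7):
        # pits[0..5] plus the store at index 6 belong to player 0;
        # pits[7..12] plus the store at index 13 belong to player 1.
        self.pits = [stones] * 6 + [0] + [stones] * 6 + [0]

    def store(self, player):
        return 6 if player == 0 else 13

    def legal_moves(self, player):
        offset = 0 if player == 0 else 7
        return [offset + i for i in range(6) if self.pits[offset + i] > 0]

    def sow(self, player, pit):
        """Pick up all stones in `pit` and drop one per pit counter-clockwise,
        skipping the opponent's store."""
        stones, self.pits[pit] = self.pits[pit], 0
        i = pit
        while stones:
            i = (i + 1) % 14
            if i == self.store(1 - player):   # never sow into the opponent's store
                continue
            self.pits[i] += 1
            stones -= 1
        return i                              # last pit reached (used for extra-turn/capture rules)

    def is_game_over(self):
        return sum(self.pits[0:6]) == 0 or sum(self.pits[7:13]) == 0
```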