Nathan’s Status Report For 12.9.23

I am happy to say that not only is the engine fully ready for demo multiple days ahead of time, but also that the improvements from the MCTS training run in the past week were much larger than I originally anticipated.

Before switching to the 9×9 board, the original (initially trained) version of the policy network had just above 1% accuracy. While that sounds extremely low, it in fact means the policy network identified the “best” move in the position about 4 times more often than random guessing would (with 362 options), and there is a much higher chance that the “best” move appears in its top 5 or 10 suggestions.

The second version, trained exclusively on 9×9 data generated from my MCTS runs, had an accuracy of around 9%, which is roughly 8 times better than random guessing, since on 9×9 there are only 82 possible selections rather than 362. With each increase in accuracy, the expansion factor of the tree search (i.e. how many candidate moves are considered from each node) can be reduced, as the probability that the best moves are among the policy network’s stronger recommendations is higher.
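
To make the expansion-factor idea concrete, here is a minimal sketch of how candidate moves might be drawn from the policy network’s output at each node; the function and argument names are illustrative, not the engine’s actual code.

```python
import numpy as np

def candidate_moves(policy_probs, legal_mask, k):
    """Return the indices of the k highest-probability legal moves.

    policy_probs: length-82 (or 362) vector from the policy network's softmax.
    legal_mask:   boolean vector marking which of those moves are legal.
    A more accurate policy network lets k shrink (say, from 50 toward 10)
    while still keeping the true best move among the candidates.
    """
    masked = np.where(legal_mask, policy_probs, -1.0)  # rule out illegal moves
    return np.argsort(masked)[::-1][:k]                # top-k move indices
```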

Finally, with the third version, I was expecting something around 20% accuracy; however, much to my surprise, the measured accuracy (on separate validation data) was just over 84%. This is probably not overfitting (the network architecture has multiple mechanisms to prevent it, and the accuracy was measured on “unseen” validation data), but I suspect some of this marked increase is due to the policy network essentially predicting the value network’s response, with an extra layer of abstraction. That is, if the policy network identifies what traits in a position the value network “likes”, its suggestions will line up more closely with the value network’s evaluations. However, this is exactly the point of MCTS: assuming a degree of accuracy in the value network (which I think is safe given the abundance of training data and its past performance), the policy network’s job is to surface candidate moves that are likely to lead to the best positions. Thus fewer candidate moves need to be explored, and deeper, more accurate simulations can be run in the same amount of time.

This 84% model is the policy network our engine will use during the demo; however, I am starting another simulation run tonight, and its results (if better) will be used for our final report. The engine is fully integrated into the web app backend and works exactly as intended.

Added note: the code for the engine itself (i.e. excluding the simulation code, which can be found on the GitHub I linked earlier in the semester) can be found here.

Nathan’s Status Report For 12.2.23

Some unforeseen issues caused me to deviate a bit from my expectations for the week (i.e. what I said I would do in my previous status report). After extensive consultation, we decided to change the board from its original size of 19×19 to a smaller 9×9 to cut down on construction time. As such, there were a few extra tasks for the week, which I go into in more detail below.

Task 1: Multi-machine MCTS. Over the past week(s) I have been running MCTS near-continuously across many of the numbered ECE machines and lab machines. This allowed me to collect around 150,000 data points for tuning, which brings me to Task 2.

Task 2: Policy Network Initialization. With the data generated through MCTS I was able to train the initial version of the policy network. This is more useful than a simple accuracy increase, as a stronger policy network means the expansion factor of MCTS can be reduced while maintaining the same level of strength at the same simulation depth. That is, because the suggested moves are better on average, fewer possibilities need to be explored, and execution time decreases.

Task 3: Converting the engine to 9×9 usage. The change to a 9×9 board required changes in the engine, as all the networks were set up to take in a 19×19 input rather than a 9×9 one. This took a small amount of refactoring and debugging to make sure everything still worked as intended.

Task 4: Converting MCTS to run as 9×9. As previously mentioned, the engine has been converted from 19×19 to 9×9 to conform to the physical board change. Unfortunately, this reduces the engine’s relative strength, as the value network and the first iteration of the policy network were trained on 19×19 data. Accordingly, I refactored the MCTS code to simulate 9×9 games, which will generate more specialized data to tune both networks.
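
As a small illustration of the refactor described in Tasks 3 and 4, the board size can be read from a single configuration constant so that every size-dependent shape follows from it; the snippet below is a hedged sketch with placeholder layer sizes, not the engine’s actual code.

```python
import torch
import torch.nn as nn

BOARD_SIZE = 9                            # was 19 before the hardware change
NUM_MOVES = BOARD_SIZE * BOARD_SIZE + 1   # 82 on 9x9, 362 on 19x19 (board points + pass)

# Any layer or tensor whose shape depends on the board reads these constants,
# so switching sizes is a one-line configuration change rather than a hunt
# for hard-coded 19s and 362s.
policy_output = nn.Linear(128, NUM_MOVES)              # 128 is a placeholder hidden size
example_input = torch.zeros(1, 1, BOARD_SIZE, BOARD_SIZE)
```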

Accordingly, aside from prepping for final presentations and demo, I will just be running MCTS in parallel across machines to generate extra tuning data for both networks. The engine works as is, so this improvement is all I have to work on until demo.

Nathan’s Status Report For 11.18.23

I am once again happy to report that I was able to accomplish all of my stated goals from last week, and am again running on schedule.

I started by debugging MCTS locally, as I still had some errors in the code, including an aliased scoreboard shared among tree nodes that caused rampantly inflated scores, and a missed case when checking which stones on the board would be captured by a particular move. Once I fixed these issues, among others, I was able to simulate multiple full matches of the MCTS engine playing against itself locally, stepping through each move manually to cover any edge cases my test cases didn’t catch. With that finished, I ported everything over to run on the ECE machines.
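
The report doesn’t show the exact aliasing bug, but a common way this happens in Python is a mutable container shared across node objects; the sketch below (with made-up names) illustrates the pattern and the fix.

```python
# Buggy pattern: the default list is created once, so every node shares the
# same scoreboard and backpropagated scores pile up across the whole tree.
class BuggyNode:
    def __init__(self, scores=[]):        # mutable default argument is shared
        self.scores = scores

# Fixed pattern: each node gets its own statistics container.
class FixedNode:
    def __init__(self):
        self.scores = []                  # fresh list per node
        self.visits = 0

a, b = BuggyNode(), BuggyNode()
a.scores.append(1.0)
assert b.scores == [1.0]                  # aliased: b "sees" a's score

c, d = FixedNode(), FixedNode()
c.scores.append(1.0)
assert d.scores == []                     # independent, as intended
```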

Once moved over to the ECE machines, I set up the initial run of MCTS, which is running as I write this. As I complete more runs, the policy network’s strength will increase, and thus the expansion factor for each node in the tree can be lowered, reducing the computation required for each simulation (e.g. I might only need to consider the top 20 suggested moves at any given position from a stronger policy network, rather than, say, 50 from a weaker or recently initialized network).

That being said, I am not fully satisfied with the time each MCTS iteration is currently taking, and am thus working on optimizing my implementation while simulations are running. I was expecting about a 13x speedup from my local machine to the ECE machines, which is what I saw when training the value network, but for some reason this speedup is almost non-existent with MCTS, limiting the rate at which I can generate new data. As such, I am doing some research into what might be causing this (GPU utilization, etc.). Secondarily, I am also optimizing my MCTS implementation directly. One example of the type of change I’m making is only expanding a node (i.e. generating children for its top n candidate moves) once it has been selected again after its own creation; that is, the search not only reached it as a temporary leaf, but also selected it again for expansion. This cuts down on the number of calls to the value network to evaluate positions, which seems to be the largest factor slowing the program down.
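
Here is a minimal sketch of that deferred-expansion idea; the helper callables (select_child, expand, evaluate, backpropagate) are placeholders standing in for the real implementation, so this shows the control flow only.

```python
def run_simulation(root, select_child, expand, evaluate, backpropagate):
    """One MCTS iteration where a leaf is only expanded on a repeat visit."""
    node = root
    # Selection: walk down the tree until we hit a node with no children.
    while node.children:
        node = select_child(node)
    # Deferred expansion: a brand-new leaf (visits == 0) is just evaluated and
    # backed up; children are generated only if the search comes back to it,
    # which avoids value-network calls for positions that are never revisited.
    if node.visits > 0:
        expand(node)
        if node.children:
            node = select_child(node)
    value = evaluate(node)                # one value-network call per iteration
    backpropagate(node, value)
```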

Finally, I have settled on a definite policy network architecture: it is the same as the value network, except that the final dense layer is a length-362 softmax vector instead of a single sigmoid scalar.
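
In PyTorch terms, the difference between the two output heads looks roughly like the following; the trunk feature size is a placeholder, since the report does not list the layer widths.

```python
import torch.nn as nn

trunk_features = 1024   # placeholder for the size of the shared trunk's output

# Value network head: a single win-probability estimate for the position.
value_head = nn.Sequential(
    nn.Linear(trunk_features, 1),
    nn.Sigmoid(),
)

# Policy network head: a probability distribution over all 362 moves
# (361 board points plus pass on a 19x19 board).
policy_head = nn.Sequential(
    nn.Linear(trunk_features, 362),
    nn.Softmax(dim=-1),
)
```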

Over the next week (Thanksgiving week) I mean to continue running MCTS simulations, training the policy network, and optimizing the system to increase the speed at which I generate training data.

Final note: as I have MCTS working fully, the engine can essentially be run to play against a real opponent (not itself) at any time, as everything is synced up together. The engine will improve with each iteration of MCTS, but that just updates the weights of the constituent networks.

Nathan’s Status Report For 11.11.23

I’m happy to say that I accomplished everything I set as my goals last week and in fact am ahead of schedule at the moment. I was able to port my data over to the ECE machines and finish training (with about a 13x speedup) on there. I solidified both the weights and the architecture for the value network. I then utilized the design space exploration from the value network to solidify the architecture for the policy network. Finally, I began testing the MCTS process locally, and once I am sure it fully works, I will port it over to continue on the ECE machines as well.

Starting off with the relocation to the ECE machines, I was able to move 13 GB of training data (more than 3.5 million data points) over to my ECE AFS so I could train the value network remotely. This had the added advantage of speeding up training by a factor of about 13, meaning I had more freedom with the network architecture. The architecture I ended up settling on took about 13 minutes per epoch on the ECE machine, meaning it would have taken ~170 minutes per epoch locally, which would have been impractical, as even a lower bound of 50 epochs would have taken about a week.

Secondly, my finalized architecture is shown below in listed form:

As you can see, there are three parallel convolution towers, with kernel sizes of 3, 5, and 7, which help the network derive trends in different-sized subsections of the board. Each tower then has a flattening layer and a fully connected dense layer. The towers’ outputs are concatenated together, giving a single data stream that passes through successive dense and dropout layers (to prevent overfitting), culminating in a single sigmoid output node that provides the positional evaluation. This network was trained on 3.5 million data points, pulled evenly from over 60,000 expert-level Go matches. After training, the network was able to identify the winner of a game from a position 94.98% of the time, with a binary cross-entropy loss of 0.0886. This exceeded my expectations, especially considering many of the data points come from the openings of matches, where it is considerably harder to predict the winner because few stones have been placed.
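
For readers who prefer code to the listing above, here is a hedged PyTorch sketch of that structure; the channel counts, dense-layer widths, and dropout rate are assumptions, since only the overall shape is described here.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Three parallel convolution towers (kernels 3, 5, 7) feeding a shared
    dense/dropout head that ends in a single sigmoid evaluation."""
    def __init__(self, board_size=19):
        super().__init__()
        def tower(kernel):
            return nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * board_size * board_size, 128),
                nn.ReLU(),
            )
        self.towers = nn.ModuleList([tower(k) for k in (3, 5, 7)])
        self.head = nn.Sequential(
            nn.Linear(3 * 128, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 1), nn.Sigmoid(),    # win-probability for the position
        )

    def forward(self, x):                      # x: (batch, 1, board, board)
        features = torch.cat([t(x) for t in self.towers], dim=1)
        return self.head(features)
```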

Using my design space exploration for the value network, I was able to solidify an initial architecture for the policy network, which will have the same convolutional towers, differing only in the number of nodes in the post-concatenation dense layers and in the output, which is a length-362 vector.

I have started testing MCTS locally, with success; once I am convinced everything works as expected, I will port it over to the ECE machines to continue generating training data for the policy network, in addition to tuning the value network. Fortunately, since in the first iteration of MCTS the policy network essentially evaluates all moves equally, the training data will remain valid for further training even if the policy network’s architecture needs to change.

In the next week, I plan to move MCTS over to the ECE machines and complete at least one iteration of the (generate training data via MCTS, tune value network and train policy network, repeat) cycle.

ABET: For the overall strength of the Go engine, we plan to test it by simply having it play against different Go models of known strength found on the internet. This will allow us to quantitatively evaluate its performance. However, the Go engine is made of two parts, the value and policy networks. Training performance gives me an inkling of how these networks are working, but even with “good” results, I still test manually to make sure the models are performing as expected. Examples of this include walking through expert games to see how the evaluations change over time, and measuring against custom-designed positions (some of which were shown in the interim demo).

Nathan’s Status Report For 11.4.23

In my previous week’s report, I mentioned that my goal this week was to finish the design space exploration for the Value Neural Network, and begin running simulations. Unfortunately, I am running about one day behind schedule, as processing the expert-level games dataset into consumable board states took longer than expected. However, I have a baseline version of the value network set aside for the interim demo, and am finishing up the design exploration as we speak, meaning if a better model is trained between now and Monday I can replace the already competent baseline.

That being said, I have not fallen very far behind at all, and it is easily covered by the slack built into my schedule. However, there are a few things of note before I start simulation proper, the first being ECE machine setup. For the preliminary value network, I trained locally, as the training data I generated takes up roughly 40 GB of space, well above my AFS limit. However, locally I am also limited to 8 GB of RAM, meaning I can only use about 7.5 GB of that training data at a time anyway. So even if I cannot port all 40 GB onto the ECE machines, anything over 8 GB would be an improvement, and worth trying in case it helps train a substantially different model. Accordingly, I am planning on asking Prof. Tamal on Monday whom I should contact about getting my storage limit increased, and I will work on it from there.
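
One way around the RAM ceiling (independent of the AFS quota) is to keep only file paths in memory and load samples on demand; the sketch below assumes each position is stored as its own .npz file with "board" and "label" arrays, which is an illustrative layout rather than the actual one.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader

class LazyGoDataset(Dataset):
    """Loads one stored position per __getitem__ call, so the full 40 GB
    never needs to fit in RAM at once."""
    def __init__(self, paths):
        self.paths = list(paths)            # only the paths live in memory

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with np.load(self.paths[idx]) as f:
            board = f["board"].astype(np.float32)   # (19, 19) position
            label = f["label"].astype(np.float32)   # 1.0 if Black won, else 0.0
        return board[None, ...], label              # add a channel dimension

# loader = DataLoader(LazyGoDataset(all_paths), batch_size=256, shuffle=True)
```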

The design space exploration has also yielded useful results in terms of what an allowable limit on network size would be. Locally, I’m currently operating with 2 convolutional layers, 1 pooling layer, and 1 fully connected dense layer, and this takes about 6.5 minutes per epoch with my reduced 8 GB training set. The ECE machines will compute faster, and this 6.5-minutes-per-epoch rate is far below my limit once we’re past the interim demo. This means that, if necessary, both the value and policy network architectures can grow without the training time becoming prohibitive.

Therefore, beyond our interim demo, I plan to begin simulations next week to generate my first batch of policy-network-training and value-network-tuning data. Ideally I get the AFS space increase quickly, meaning I can do this remotely, but I can also run it locally if necessary and port over the weights later. I also plan on setting up the architecture and framework for the policy network, so that I can begin training it as soon as the simulation data starts being generated.

Nathan’s Status Report For 10.28.23

In my previous weekly report, I mentioned that my goal for this week was to finish my expand() function and to find a dataset to train the initial values for my value network. I am happy to say that I have accomplished both of these, as well as finishing the rest of the code required for MCTS. The dataset I am going to use is located here and contains over 60,000 games from professional Go matches played in Japan. For curious viewers, all of the code required for MCTS can be found on GitHub.

With the aforementioned dataset, I plan to begin training the value network as soon as possible (hopefully by Monday). While I have the dataset, it is stored in the Smart Game Format (SGF), which records the sequence of moves, not the sequence of board states those moves generate. As I need the board states themselves for training, I am currently working on a script to automatically process all 60,000 of these files, generating each board state and tagging it with the game result. These tagged states are the training data I require. Once this is finished, I can begin training, which involves not just the actual training over the dataset but also some design space exploration with regard to network architecture (number of nodes, types and number of layers, etc.). This will allow me to find a closer-to-ideal combination of accuracy and processing time (as efficient simulations are helpful in training, but vital for usage).
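
As an illustration of the parsing half of that script, the sketch below pulls the winner and the move list out of a single SGF record with a simple regular expression; replaying the moves into tagged board states would then reuse the existing Go board code. This is a simplified stand-in, not the actual script (it ignores handicap placements, draws, and other SGF edge cases).

```python
import re

MOVE_RE = re.compile(r";([BW])\[([a-s]{2})?\]")   # ;B[pd], ;W[dp], ;B[] = pass
RESULT_RE = re.compile(r"RE\[([BW])\+")           # e.g. RE[B+2.5]

def parse_sgf(sgf_text):
    """Return (winner, moves) for one SGF game record.

    winner: 1.0 if Black won, 0.0 if White won, None if no decisive result.
    moves:  list of (colour, (row, col)) tuples, with None coordinates for a pass.
    """
    result = RESULT_RE.search(sgf_text)
    winner = None if result is None else (1.0 if result.group(1) == "B" else 0.0)

    moves = []
    for colour, coords in MOVE_RE.findall(sgf_text):
        if coords:                                  # 'pd' -> column 'p', row 'd'
            col = ord(coords[0]) - ord("a")
            row = ord(coords[1]) - ord("a")
            moves.append((colour, (row, col)))
        else:
            moves.append((colour, None))            # pass move
    return winner, moves
```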

This design space exploration will actually prove helpful for the policy network as well, as it will provide a baseline for the allowable complexity. (A higher number of nodes and layers will generally perform better, barring overfitting, but will take more time; the value network exploration will give me an estimate of the number of layers I can use in the policy network, since the two networks can be used to evaluate positions in parallel.)

Then, once I have the parameters (mainly architectural) for the networks set, and the initial weights for the value network trained, I can immediately begin running simulations, as all of that infrastructure is complete. I should be running these simulations by the time of my report next week, and will aim to do my first training run on both networks with the data the simulations generate.

Nathan’s Status Report for 10.21.23

As I mentioned in my previous status report, my goal for this week was to finish custom loss function implementation, then work towards getting the MCTS training data generation working, given that the framework was already there.

With regard to the former of the two goals, that work is all complete. I had actually misunderstood my needs: I don’t need a custom loss function, as the two networks are trained with mean-squared error (MSE) and binary cross-entropy loss (BCEL). Nevertheless, I have built the framework for training the two networks (policy & value), in addition to the data generation section of the MCTS code (where the board states and MCTS visit counts are stored as npz files).
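
For concreteness, the store-to-npz step might look something like this; the array names and shapes are assumptions for illustration, not the exact format my code uses.

```python
import numpy as np

# Suppose one self-play game produced these (dummy data for the example):
boards = np.random.randint(0, 3, size=(120, 19, 19), dtype=np.int8)  # positions
visits = np.random.rand(120, 362)                                    # MCTS visit counts per move
result = np.float32(1.0)                                             # 1.0 = Black win

# Save everything from the game in a single compressed archive...
np.savez_compressed("game_0001.npz", boards=boards, visits=visits, result=result)

# ...and load it back later for training.
with np.load("game_0001.npz") as data:
    boards, visits, result = data["boards"], data["visits"], data["result"]
```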

With regard to the latter, there was a bit more involved than I originally anticipated. While I have all the Go gameplay functionality for training built, I am specifically not finished with the code for the MCTS expansion phase, where the leaf to expand is determined and its children are generated. That being said, I have built out all the other functionality, and am working to finish the expand() function by Monday in order to stay ahead of schedule. A brief schematic of the training structure is shown below.
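
Separately from that schematic, here is a rough sketch of what the expansion step needs to do; the Node fields and the legal_moves/apply_move callables are placeholders standing in for the existing board code, not the actual expand() implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Node:
    state: Any                                  # board position
    move: Any = None                            # move that led to this position
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0                      # per-node statistics

def expand(node: Node, legal_moves, apply_move) -> List[Node]:
    """Generate one child node per legal move from this leaf."""
    for move in legal_moves(node.state):
        child_state = apply_move(node.state, move)
        node.children.append(Node(state=child_state, move=move, parent=node))
    return node.children
```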

Once the expand() function is finished, the next step is finding a dataset of expert Go matches to use as training data for the value network pre-simulation. While this is not strictly necessary, and self-play alone could be used, giving the value network a strong foundation with expert-generated data improves the quality of the initial training data much faster. My goal is to have expand() finished and this dataset found by the end of the week. If that goes according to plan, I will be able to commence network training and MCTS data generation immediately afterwards.

To accomplish my task, I had and will have to learn a few new tools. While not really a tool, I had to fully understand Monte Carlo Tree Search in order to program it correctly. More pertinently, I have never used PyTorch before, or worked with convolutional neural networks. While I have worked with similar technologies (fully connected deep neural networks and TensorFlow), I have had to do a lot of research into how to use them most effectively.

Nathan’s Status Report For 10.7.23

Now that we have fully settled on our transition to Go, and how it will be implemented, I have been able to fully focus on my part of the project. Since our team presented our design on Monday, and I concluded my research last week with AlphaZero and MuZero, I had the rest of the week to focus solely on implementation.

Almost all of my time this week was spent on the Go simulation framework, as well as beginning to figure out how to set up the reinforcement learning architecture. With regard to the former, I worked together with Hang to make sure we have a clear plan for how to pass board information from the Arduino to my backend engine. From there, I was able to implement a backend representation of the board, along with functions that allow an outside controller (in this case the engine) to make a move for simulation purposes, as well as functions to update the position based on the information conveyed by the physical board. This is effectively all I need (along with basic rule checking, like making sure the game is or isn’t over, which I have also implemented) to move on to the reinforcement learning architecture. The real challenge here is the custom loss functions defined in our design proposal (expected result optimization and policy vector normalization). I have never worked with custom loss functions in Python before, so I’ve done a huge amount of research into different ways to accomplish this. I decided to settle on PyTorch, as it is not only the current industry consensus for the best deep learning framework, but also extremely well supported in Python. I have started, but not completed, actually scripting these loss functions; I am taking my time to make sure they are not only correct but also efficient, as in conjunction with the MCTS simulation, training times could balloon rapidly with an inefficient implementation of either.
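
A minimal sketch of what that backend board representation looks like conceptually is below; the class and method names are illustrative assumptions, and the real implementation includes the capture/ko rule checking this stub omits.

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, 1, 2

class GoBoard:
    """Backend board state usable both by the engine (simulated moves) and by
    updates coming from the physical board."""
    def __init__(self, size=19):
        self.size = size
        self.grid = np.zeros((size, size), dtype=np.int8)

    def make_move(self, row, col, colour):
        """Called by the engine during simulation (captures/ko omitted here)."""
        if self.grid[row, col] != EMPTY:
            raise ValueError("point already occupied")
        self.grid[row, col] = colour

    def sync_from_physical(self, flat_state):
        """Overwrite the state with the grid reported by the Arduino/board."""
        self.grid = np.asarray(flat_state, dtype=np.int8).reshape(self.size, self.size)
```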

In the next week, I plan to finish these custom loss functions, then work on getting the in-training MCTS simulations to work. With the simulation framework already built, this shouldn’t require too much time.

Nathan’s Status Report For 9.30.23

As is mentioned in our team status report for the week, we have transitioned our project away from Mancala and into the game Go instead. Fortunately, the division of labor remains relatively similar and I am still broadly responsible for the training of a reinforcement-learning engine that will eventually be used to give move suggestions and positional evaluations to our users.

Accordingly, almost all of the research I did last week is still applicable, as the same self-play techniques can be used, and, in fact, have been proven to work in the cases of AlphaZero and MuZero. After making the transition to Go this week, I had to do a quick catch-up on the rules and gameplay, but with that done, and the two above-linked papers read, the research phase of my project has come to a close.

Of course, with the design presentations coming up next week, a good amount of my time this week was devoted to preparing for that as well, and the rest was spent building the platform for the reinforcement learning. The current consensus for optimal Go engine creation is a combination of deep learning and Monte Carlo Tree Search (MCTS). MCTS works by using self-play to simulate many game paths from a given position and choosing the move providing the best overall outcome. I have started work on creating the framework to perform these simulations as quickly as possible (holding game state, allowing the candidate engine to make moves against itself and returning the new board, etc.).

With regard to classwork helping me prepare for this project, I think the two ECE classes that helped the most were 18213 and 18344. I have not taken any classes in reinforcement learning or machine learning in general, but the research I did in CyLab with ECE Prof. Vyas Sekar certainly helped me a huge amount, both in the subject matter of the research (deep learning) and in the experience of reading scholarly papers to fully understand techniques you are considering using. What 18213 and 18344 provided was the “correct” way of thinking about setting up my framework. I need my simulations to be as efficient as possible while also maintaining accuracy, and I need my system to be as robust as possible, as I will need to make frequent changes, tune parameters, etc. These, combined with the research papers read last week and the two papers linked above, are what influenced my portion of the design the most.

Finally, in the next week I plan to finish the Go simulation framework, and begin work on setting up the reinforcement learning architecture, to begin training in the week after. MCTS simulation is quite efficient, but with the distinct limit on computational resources I have, allocating proper time is vital.

Nathan’s Status Report for 9.23.2023

This week I did a combination of research on reinforcement learning, and opponent/platform setup to enable the RL model training.

With regard to research, I want to understand as much as possible about reinforcement learning before I start the process of actually building a Mancala RL model. Through preliminary research our group decided that a self-play method of training would be best, so I read a number of papers and tutorials on both the theory of self-play RL and the logistics of putting it into practice in Python. A few of the resources I used are shown below:

OpenAI Self-Play RL

HuggingFace DeepRL

Provable Self-Play (PMLR)

Towards Data Science

Python Q-Learning

In order to train the self-play RL model, I must have a competent opponent for the model to start off playing against, before it can train against previous iterations of itself. If I choose too strong a starting opponent, the model will not get enough positive reinforcement (as it will almost never win), and if I choose one that is too weak, the reverse is true. As such, we will start with a relatively simple minimax strategy that looks two “ply” (single-player turns) into the future. However, to build this strategy, I need a platform for the game to be played on (so the RL model can play the minimax opponent). This week I started building this platform, programming all game rules and actions, and a framework where two separate players can interact on the same board. I then implemented unit tests to make sure all game actions were functioning as they should. With this now in place, I have begun programming the minimax strategy itself. This means I am on schedule, and hopefully I will have the minimax opponent available to start training within the week.
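
For reference, a two-ply minimax opponent boils down to something like the sketch below; the legal_moves, apply_move, and evaluate callables are placeholders for the Mancala game code, so this shows the shape of the strategy rather than the actual implementation.

```python
def minimax_move(state, legal_moves, apply_move, evaluate, depth=2):
    """Pick the move that maximizes the evaluation after looking `depth` ply ahead."""
    def search(s, d, maximizing):
        moves = legal_moves(s)
        if d == 0 or not moves:                     # depth limit or game over
            return evaluate(s), None
        best_score = float("-inf") if maximizing else float("inf")
        best_move = None
        for m in moves:
            score, _ = search(apply_move(s, m), d - 1, not maximizing)
            if (maximizing and score > best_score) or (not maximizing and score < best_score):
                best_score, best_move = score, m
        return best_score, best_move

    _, move = search(state, depth, True)
    return move
```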