This week I leveraged the response time header implementation I completed last week as a metric for load balancing decisions. I completed our first custom load balancing algorithm, which is based on a simple solution to the multi-armed bandit problem. Multi-armed bandit is a classic reinforcement learning scenario in which a gambler (the decision maker) must repeatedly play one of many slot machines with initially unknown reward probabilities, trying to maximize total reward. The problem maps closely onto load balancing: each server is an arm, and the performance of each routing decision can be thought of as that slot machine's reward. The key dilemma is between exploitation (choosing the server currently estimated to be best) and exploration (improving the estimates for all servers).
The algorithm is based on a pattern called epsilon-greedy. It starts with an initialization period in which response time data is collected for each server in round-robin fashion. After that, a random number between 0 and 1 is drawn for each decision. If it is greater than epsilon, the server with the lowest average response time is chosen (exploitation); otherwise, a server is chosen at random or in round-robin fashion (exploration). A key difference between load balancing and the multi-armed bandit problem is that server performance is not fixed over time, so only the most recent 10 response times are retained per server to keep the data relevant and limit memory needs. A sketch of the idea appears below.
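To make the flow concrete, here is a minimal Python sketch of the approach. The class name, the `record`/`choose` hooks, and the uniform-random exploration choice are illustrative assumptions, not our actual implementation:

```python
import random
from collections import deque

class EpsilonGreedyBalancer:
    """Epsilon-greedy server selection keyed on recent response times.

    A sketch: with probability 1 - epsilon, pick the server with the
    lowest average of its recent response times (exploitation);
    otherwise pick a server uniformly at random (exploration).
    """

    def __init__(self, servers, epsilon=0.1, window=10):
        self.servers = list(servers)
        self.epsilon = epsilon
        # Keep only the most recent `window` response times per server
        # so stale measurements age out and memory stays bounded.
        self.history = {s: deque(maxlen=window) for s in self.servers}

    def record(self, server, response_time):
        # Called when a response arrives, e.g. using the value from
        # the response time header.
        self.history[server].append(response_time)

    def choose(self):
        # Initialization: serve any server with no data yet, which
        # yields round-robin behavior until every server has samples.
        for server in self.servers:
            if not self.history[server]:
                return server
        if random.random() < self.epsilon:
            return random.choice(self.servers)  # exploration
        # Exploitation: lowest average recent response time wins.
        return min(self.servers,
                   key=lambda s: sum(self.history[s]) / len(self.history[s]))
```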
I also wrote an outline for our final presentation slides and am almost finished with another load balancing algorithm based on UCB1, another multi-armed bandit strategy (its selection rule is sketched below). Over the next week, I plan to complete this new algorithm and then build two variants of the epsilon-greedy and UCB1-based algorithms that leverage a new metric our team is implementing for the video servers: network input/output volume.
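For context, UCB1 replaces random exploration with a confidence bound: it picks the arm maximizing the average reward plus an exploration bonus that shrinks as an arm is played more. A hedged sketch of the selection rule, assuming rewards would be derived from response times (e.g., inverted and normalized), which is an assumption rather than the final design:

```python
import math

def ucb1_choose(counts, avg_rewards):
    """UCB1 selection over arms (servers).

    counts[i] is how many times arm i has been played; avg_rewards[i]
    is its mean observed reward. Returns the index maximizing
    avg_reward + sqrt(2 * ln(total plays) / count), after first playing
    every arm once. Mapping response times to rewards is hypothetical
    here, not the finished algorithm.
    """
    for i, c in enumerate(counts):
        if c == 0:
            return i  # play each arm once before applying the bound
    total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: avg_rewards[i]
               + math.sqrt(2 * math.log(total) / counts[i]))
```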