Fault-tolerant Architecture

Replication Manager

- The Replication Manager (RM) manages servers, detects server faults, and initiates server recovery.
- The RM reads parameters from a configuration file or the command line, such as:
  - How many servers to start
  - Which hosts the servers should be started on
  - The period of the heartbeat messages
  - The startup timeout period
- The RM starts up and contacts the JNDI server, which is currently running on go.lab.ece.cmu.local.
- If the servers in the configuration file are not running, the RM starts them up.
- The RM periodically asks the servers whether they are alive (e.g. every 2 seconds).
- If a server fails to respond within the timeout period, it is considered dead.
- The RM kills whatever may still be running on the server classified as dead, and restarts the server remotely.
- The RM also manages/updates JNDI variables.
- The clients look up the JNDI variables when they try to connect to a server. The JNDI variables consist of:
  - the number of servers (server-n)
  - the primary server (server-p)
  - a hardcoded list of servers (server-01, server-02, server-03 ... server-15)
- The clients always connect to server-p, i.e. if "server-p" is 2, they connect to "server-02", whose value is a server hostname.
- Whenever the current primary server dies, the RM updates server-p by incrementing it. The clients then switch to the new server-p while the RM remotely restarts the dead server.
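
The server-p naming scheme above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (`PrimarySelector`, `keyFor`, `primaryHost`, `advancePrimary`) are hypothetical, a plain `Map` stands in for the Global JNDI context, and wrapping back to server-01 after the last server is our assumption (the text only says server-p is incremented).

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; a Map stands in for the Global JNDI context.
final class PrimarySelector {

    // Turns an index into a JNDI variable name, e.g. 2 -> "server-02".
    static String keyFor(int index) {
        return String.format(Locale.ROOT, "server-%02d", index);
    }

    // Hostname the clients should currently connect to.
    static String primaryHost(Map<String, Object> jndi) {
        int p = (Integer) jndi.get("server-p");
        return (String) jndi.get(keyFor(p));
    }

    // Called by the RM when the primary dies: increments server-p.
    // Wrapping around after server-n is an assumption, not from the design.
    static int advancePrimary(Map<String, Object> jndi) {
        int n = (Integer) jndi.get("server-n");
        int p = (Integer) jndi.get("server-p");
        int next = (p % n) + 1;
        jndi.put("server-p", next);
        return next;
    }
}
```

With a real naming service, the `Map` reads and writes would presumably become `lookup` and `rebind` calls on a `javax.naming.Context`.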
Database
- Since the database is assumed not to fail, there is no replication or fault tolerance in the database tier for now; we may add it in a later phase.
- Non-idempotent database interactions carry transaction IDs, unique to each client and each session, to protect the database from executing duplicate operations during failover. (We still have to figure out how to do this; we got the idea from last year.)
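
One possible shape for the transaction-ID idea is sketched below. It is not a decided design: the class `TransactionGuard`, the ID format, and the per-client sequence number are all hypothetical, and an in-memory set stands in for whatever the database would use to remember applied IDs (for example a table with a unique key).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remembers which transaction IDs were already applied.
final class TransactionGuard {
    // Stand-in for a database table of applied transaction IDs.
    private final Set<String> applied = ConcurrentHashMap.newKeySet();

    // Builds an ID that is unique per client, session, and request (assumed format).
    static String transactionId(String clientId, String sessionId, long sequence) {
        return clientId + ":" + sessionId + ":" + sequence;
    }

    // Runs the operation only the first time this ID is seen.
    // Returns false when the ID is a duplicate (e.g. a retry after failover).
    boolean applyOnce(String txId, Runnable operation) {
        if (!applied.add(txId)) {
            return false; // duplicate: skip the non-idempotent operation
        }
        try {
            operation.run();
            return true;
        } catch (RuntimeException e) {
            applied.remove(txId); // operation failed, so allow a later retry
            throw e;
        }
    }
}
```

In a database-backed version, recording the ID and applying the change would need to happen in the same transaction so that they commit or roll back together.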
Servers
- The servers can be started by the replication manager or manually launched with the launchserver script.
- The servers take as command line parameters:
  - Server ID (so that they don't overwrite each other in the Global JNDI)
  - Heartbeat period
- Once launched, the servers await invocations from the client.
Clients
- The client obtains server references from the Replication Manager.
- The client cycles through the server connections until it finds a working server, or until it reaches an internal limit (currently 15 is the maximum since there are only 15 machines in the game cluster). It will then ask the user to retry after a certain time (e.g. 60 seconds).
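
The cycling behaviour can be sketched as a small loop. The names `FailoverClient`, `RemoteCall`, and `invoke` are hypothetical, and any exception from a call is treated as "this server is down", which is a simplification of catching RemoteException specifically.

```java
import java.util.List;

// Hypothetical sketch of the client-side failover loop.
final class FailoverClient {
    static final int MAX_SERVERS = 15; // internal limit from the design

    // One remote invocation against a given host; may fail with an exception.
    interface RemoteCall<T> {
        T on(String host) throws Exception;
    }

    // Tries each host in order and returns the first successful result.
    // Throws IllegalStateException when every host failed, at which point
    // the application would ask the user to retry later (e.g. after 60 seconds).
    static <T> T invoke(List<String> hosts, RemoteCall<T> call) {
        int limit = Math.min(hosts.size(), MAX_SERVERS);
        Exception last = null;
        for (int i = 0; i < limit; i++) {
            try {
                return call.on(hosts.get(i));
            } catch (Exception e) {
                last = e; // assume this server is dead and move to the next one
            }
        }
        throw new IllegalStateException("All servers are down; retry later", last);
    }
}
```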
Fault-Tolerant Design Features
Replication
- We will use passive replication for now :-)
- Three servers will run on different machines ("Server1" on chess, "Server2" on go, "Server3" on boggle).
- Each server will be registered on the naming service (JNDI) with different names.
- We will have a Global JNDI service.
- Only one instance of the database will run on "mahjongg.lab.ece.cmu.local:3306/ece749_team3" and it is assumed to be fault-free (connection through JDBC)
Fault Detection
- The client will receive an exception (e.g. RemoteException) indicating that communication with the primary server has been lost.
- The servers will send periodic heartbeats to the Global Fault Detector (e.g. every second).
- In case the Global Fault Detector does not receive any heartbeats from one of the servers, it will assume that server has died. It will then try to restart the server by using a local recovery manager.
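
A heartbeat-based detector along these lines could look like the sketch below. The class `HeartbeatMonitor` and its methods are hypothetical, and timestamps are passed in explicitly (rather than read from the system clock) purely to keep the example easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the Global Fault Detector's bookkeeping.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Last heartbeat time per server ID, in milliseconds.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat arrives from a server.
    void heartbeat(String serverId, long nowMillis) {
        lastSeen.put(serverId, nowMillis);
    }

    // Servers whose last heartbeat is older than the timeout.
    // Servers that never sent a heartbeat are not tracked here; the
    // startup timeout would have to cover that case separately.
    List<String> deadServers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

The detector would call `deadServers` periodically and hand each returned ID to the recovery step.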
Fail Over
- The client at startup will get information about all of the servers from the Global JNDI running on our Replication Manager.
- When an exception occurs, the client will automatically move to the next server.
- Failover is transparent: users will not be notified about the server failure or the failover.
- If all of the servers are down, our client application asks the user to retry after 60 seconds.
Recovery
- When the Replication Manager has determined that a server is dead or not responding, the RM will kill the server.
- It will then restart the server under its existing name in the Global JNDI.
Checkpointing
- Our servers are stateless: all data is stored persistently in the database. Therefore we do not need any checkpointing.
Scenarios

From the diagram, we see 4 points of failure:
- Client sends a request; Server does not receive it.
  - In this case, Client assumes the server has died and moves on to the next one.
- Client sends a request; Server processes the request and forwards it to the Database; Database does not receive it.
  - Using an Exception, Server tells Client that the operation has failed. Client has the option to try again.
- Client sends a request; Server processes the request and forwards it to the Database; Server does not receive an acknowledgement after the Database has processed it.
  - Server times out and, using an Exception, tells Client that the operation has failed. Due to two-phase commit, the changes made to the Database are not committed.
- Everything works fine except that Client does not get the server's response.
  - Client assumes the server has died and moves on to another server.
Questions to be Answered
- Since our servers are stateless, we don't have checkpointing. If we are doing passive replication, do we require all the clients to only talk to the primary server?
- Non-idempotent database interactions are equipped with transaction IDs in the MySQL Database. We need to figure out how exactly we can do it in the MySQL Database.