Meeting Minutes: Team Meeting w/ Priya
Time: 02/25/04 @ 6:00 - 6:30 PM
Where: MSE Cave
Attendees: Ackley, Boyer, Fry, Wilson
Purpose: Weekly Status Update & Demo
------------------------------------

--- Demo Fault Tolerance to Priya ---

In this initial implementation, the client contacts a 'Load Balancer' on the
"golden machine" to obtain a replica. The client then contacts that replica for
message processing. This permits a simple load balancing mechanism.

Priya Comments (our TODO list)
*) We have successfully demonstrated that we have completed the Fault Tolerance
   Baseline.
*) Remaining tasks are:
   - Automatic Replica recovery - currently a failed replica must be manually
     re-started.
   - Obtain initial Performance Measurements
     - Must keep track of Faults generated & recovered from.
     - Baseline Performance (a fault-free report)
     - Recovery Time (performance with faults)
   - Revise Failover strategy for performance.
*) In our current design the load balancer, as a single connection point, will
   become a scalability bottleneck.
   - May wish to consider some other strategies to mitigate this, yet still
     permit load balancing.
   - Current design is a "pull" strategy, where clients must constantly query
     the load balancer for information.
   - Perhaps we could employ a "push" strategy, where clients retain a cache of
     system load, and this cache is periodically updated by the replication
     manager.

Q) How would you classify our replication strategy - passive or active?
A) Neither - really it is a clustering strategy.
   - All the replicas are actively doing work - just not the same work.
   - We are the only team to adopt such a strategy for our project.
------------------------------------

Team Meeting (Before Demo)
Time: 02/25/04 @ 5:00 - 6:00 PM / 6:30-7:00 PM
Where: NSH Atrium / Wean Hall - 4th floor corridor niche

Design Update:
--------------
* System handling of duplicate MsgIDs
  - Currently generating a FATAL exception.
  - The system will be modified to check for a duplicate MsgID first. If one is
    found, we will re-use that entity bean.
  - Any duplicate messages (as a result of failover) will be passed to
    SpamAssassin for processing. We will NOT store the result of the processing
    in the database, as the activity is idempotent.
* Generation of MsgIDs
  - Currently the client generates a MsgID/TransactionID as a concatenation of
    the clientID + hash(msg)
  - Issue is that hashes are NOT unique, and take some time to compute.
  - System will be modified to replace the hash(msg) part of the MsgID with the
    system timer (milliseconds). This should ensure a unique MsgID, and
    involves less processing.
* Server/Replica State
  - The ServerStat table has been modified to contain a ServerState.
  - This state is either "Active" or "Inactive", and indicates the state of the
    replica on that server.
  - A replica is marked "Active" at system startup.
  - When a client detects a fault, it marks the replica as "Inactive".
  - The load balancer will only give requesting clients the names of replicas
    marked as "Active". Thus, a failed replica will never be contacted again.

New Assignments:
----------------
PW    - Merge Fault-Tolerance & Java Log into CVS.
GA    - Continue working on Global Replication Manager
GA/CF - Add Automatic Replica Recovery/Restart
GA/CF - Add Automatic Fault Detection (Replica Polling)
CF    - Merge eat-spam.pl into MPClient
CF    - Track down current exception in system
        - (an unhandled exception from SpamC?)
CF/PW - Complete Java Log Configuration & invocation
AB    - Test Cleanup
        - Make test results obvious (Pass/Fail)
        - Make test activity obvious (display test activity)
** Obtain Performance Measurements
        - records number of failovers
        - determine fault recovery time
        - Postponed...
        - May depend on eat-spam/MPClient merge
AB    - Implement Design Change : Re-use duplicate MsgIDs
AB    - Implement Design Change : Replace hash(msg) --> timestamp

Notes:
Team goals are to:
* Complete component implementation by Monday
* Begin component integration Monday Evening.
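
Illustration (Design Update items):
The MsgID and Server/Replica State changes recorded under Design Update can be sketched roughly as follows. This is only a minimal sketch, not the team's code: the class and method names (DesignSketch, ReplicaRegistry, makeMsgId), the "-" separator in the MsgID, and the in-memory map standing in for the ServerStat table are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the project itself.
final class DesignSketch {

    /** Builds a MsgID from the clientID and a millisecond timestamp (the "-" separator is an assumption). */
    static String makeMsgId(String clientId, long timestampMillis) {
        return clientId + "-" + timestampMillis;
    }

    /** In-memory stand-in for the ServerStat table: replica name -> true (Active) / false (Inactive). */
    static final class ReplicaRegistry {
        private final Map<String, Boolean> active = new LinkedHashMap<>();

        /** A replica is marked Active at system startup. */
        void registerAtStartup(String replica) {
            active.put(replica, true);
        }

        /** Called by a client that has detected a fault on the given replica. */
        void markInactive(String replica) {
            if (active.containsKey(replica)) {
                active.put(replica, false);
            }
        }

        /** The load balancer hands out only the replicas currently marked Active. */
        List<String> activeReplicas() {
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, Boolean> entry : active.entrySet()) {
                if (entry.getValue()) {
                    result.add(entry.getKey());
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ReplicaRegistry registry = new ReplicaRegistry();
        registry.registerAtStartup("replica-a");
        registry.registerAtStartup("replica-b");

        // A client detects a fault on replica-a and marks it Inactive.
        registry.markInactive("replica-a");

        System.out.println("Active replicas: " + registry.activeReplicas());
        System.out.println("MsgID: " + makeMsgId("client7", System.currentTimeMillis()));
    }
}
```

In this sketch, two messages sent by the same client within the same millisecond would receive the same MsgID, so the duplicate-MsgID check described above would still be needed.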