18-642 Project 11

Updated 11/21/2021. Changelog

Learning objectives: implement more monitors in a style similar to safety monitors used in life critical systems; confront race conditions and asynchrony in distributed systems.

Lab Files:

Project 11 maze files (These are the same mazes used for project 12; no new mazes will be introduced after this point.)

Hints/Helpful Links:

$ECE642RTLE_DIR refers to ~/catkin_ws/src/ece642rtle (or corresponding project workspace).
Please follow the handin naming convention: every filename (before the file extension) must end in [AndrewID]_[FamilyName]_[GivenName].
See the Recitation video and slides on Canvas

Procedure:

Write monitors for the rest of the invariants. Refer to the previous project for reminders on how to write and compile these monitors. It might be helpful to factor out some functionality in a utility file, such as ANDREWID_monitor_utils.cpp. You should also use the structured types defined in $ECE642RTLE_DIR/monitors/monitor_inteface.h. You are required to write the following monitors:
- ANDREWID_turn_monitor.cpp: update or use as-is from previous project.
- ANDREWID_tick_monitor.cpp: between calls to tickInterrupt, there shall be at most one call to each of poseInterrupt, visitsInterrupt, and bumpInterrupt. tickInterrupt occurs after every call to moveTurtle, so this effectively checks that your moveTurtle only makes one of each type of call.
  - Consider whether it is possible for those three interrupt calls to happen in varied orderings, and if that is an issue what you might do to ensure that a changed ordering does not break your monitor design.
- ANDREWID_forward_monitor.cpp: the turtle shall face the direction it is moving. In other words, if the turtle moves from square A to square B, and square B is direction DIR (example: east) of square A, the turtle must have been facing DIR before moving. ADDITIONALLY, the turtle shall not move in round in during which it has also turned.
- ANDREWID_wall_monitor.cpp: the turtle shall not go through walls
  - Recall that the monitor connects to the system and eavesdrops on ROS messages, so it does not actually have a ground-truth of where the walls are in the maze. (In real safety critical systems it is difficult for a monitor to know more about the environment than the rest of the system does since it has to rely upon the same sensor technology as the main system.) Because of this, a more precise definition of our invariant is: given a turtle that has gone between squares A and B that share an edge E, we view an invariant violation as: turtle has not checked the given edge E bump() before crossing it, OR turtle has checked the given wall segment E for bumped() , which returned true, and turtle still crossed it. In other words, the turtle must have seen bumped() return false for E some time in the past before going from A to B. This suggests that, every time your monitor sees a bumped() call, it will have to store it for lookup later. It also effectively outlaws designs that remember where the walls are to skip calling bumped() -- but it is common for safety monitors to close off some avenues of behavioral optimization, so this is a realistic tradeoff decision.
  - You can simplify this approach by putting a bound of how many past bumped() calls your monitor keeps track of. For example, if your turtle implementation always checks bumped() before moving, you only need to keep the last bumped() call stored. Similarly, if your turtle spins and checks four directions for bumps before moving, you only need to keep the last four bumped() calls stored. There are more sophisticated approaches possible, but do be sure that when the turtle moves to a new square your monitor doesn't retain stale bumped() data from the previous square.
  - Note that, geometrically, the line segment between (0,1) and (0,2) and the line segment between (0,2) and (0,1) are the same. Also, because all walls in the maze run parallel to the x- and y- axes, either both x- or both y- coordinates of each wall segment will be the same. We therefore suggest you store your line-segments in a canonical format to make it easy to compare two line segments: dictionary order, or all line segments fulfilling ((y1==y2 && x1 < x2) || (x1==x2 && y1 < y2)) is a good one.
  - Hint: Our implementation used a fixed-size circular queue (statically declared array) and two main helper functions: Endpoints wallBetween(Pose p1, Pose p2) and bool bumpedInMemory(Endpoint e). However, you do not have to adopt this approach.
- ANDREWID_face_monitor.cpp: for any call to bumped(x1,y1,x2,y2), the turtle shall be facing the wall segment with endpoints (x1,y1) and (x2,y2).
  - As a hint, note that this means there is only one valid wall segment that the turtle can call bumped(...) for any given position and orientation. You might consider computing this valid wall segment and comparing it to the arguments of bumped(...).
- ANDREWID_solved_monitor.cpp: If the turtle has solved the maze (atEnd(x,y)==true), it shall not move or turn
- ANDREWID_atend_monitor.cpp: Turtle shall only call atEnd(x,y) if it is at a position x,y. (No remotely fishing around the maze to figure out where the goal is for path planning.)
  - We have seen occasional problems with the ROS Warning stream being corrupted when terminating the simulation, probably due to the output stream not being flushed properly when a Control-C is encountered. A suggested workaround is that this monitor outputs a "Successful atEnd" using ROS_WARN when the turtle has reached the end of its journey (atEnd(x,y) is true) to flush out any warnings still buffered in the output queue. This way you know that when you see a few "Successful atEnd" messages in a row that any pending warnings issued before that have also been output successfully. It might take several such messages to flush the output queue, so this monitor can output the success message once every tick.
- Other Requirements and notes:
  - "step_monitor" shall be included in your system. If you need to update it for some reason you may, but the intent of the monitor shall be unchanged.
  - Your code shall use the ROS_WARN function to indicate an invariant violation.
  - You should use ROS_INFO to print out any relevant information for your monitor to display.
  - You can use ./run_642_monitors with any number of arguments to run a subset of all the monitors at once and observe the violations in VIOLATIONS.txt.
  - You can also use the command rqt_console to view and sort ROS logging messages.
  - Take note of any invariant violations. Are they due to the turtle code or due to bugs in your monitor(s)? If the latter, fix the bugs. In rare cases, ROS timing issues might give you false positives if messages arrive out of order. If you see these, you'll need to reconsider your design to tolerate this type of issue.
  - You should fix any invariant violations. (For the next project that becomes "shall fix", but for now you don't need to be perfect.)
  - The number of times the turtle visits a cell has been intentionally left off the required monitors, but fewer is better.
  - Monitors should send an "Monitor [NAME] is runnning at [time]" entry to the violations file to help differentiate between a violations file being empty because the monitor did not launch vs. a no-violations run.
  - Monitors should send an "Monitor [NAME] is runnning at [time]" entry to stderr / cerr to help catch issues with monitors failing to launch.
Try out the Project 11 maze files (see link in lab files above). Notes:
- Mazes m1-m6 in this folder are the same m1-m6 that you have seen in previous projects. m7-m10 are new.
- Put the mazes in the $ECE642RTLE_DIR directory.
- All of these mazes are your acceptance tests -- your turtle shall solve all mazes m1-m10 to get the corresponding points for the final project, including running without invariant violations. For this project you don't have to solve all of them.
Updated the build package similar to the one for previous projects, meeting the following requirements:
- ANDREWID_build.sh enables all previously required compiler flags and runs all unit tests. Any compiler warnings and any unit test failures are annunciated in a way that a grading TA can readily see.
- ANDREWID_run.sh k runs maze k with all run-time monitors enabled. Any run-time monitor failure is annunciated in a way that a grading TA can readily see.
Writeup: Include the following in a writeup:
- Your name. (If the writeups are printed out, the file name info is lost, so put your name and Andrew ID in the document text as well)
- Q1: Did you struggle with anything creating these invariants? If so, briefly describe. (We strongly recommend you see a TA at office hours for any difficult spots before handing in, or at least before working on the next project.)
- Q2: Did you get any invariant violations due to turtle behavior? If so, briefly describe.
- Q3: Did you get any invariant violations due to timing issues (that includes timing issues you fixed before hand-in)? If so, briefly describe.
- Q4: Include a screen shot of solving maze m1 and the window showing invariant violations generated with all invariants enabled. (If no violations, show that this is the case. If there are more than 5 or 10 invariant violations then just showing the last handful or two is fine according what is reasonable to show in a screen shot.)
- Q5: Summarize the status of your invariants as follows: For each invariant monitor, state which mazes violate the monitor. (A table might be the easiest way to do this with monitors on one axis and maze numbers on the other.)
- Q6: Summarize the status of solving each maze m1 through m10, minimally by stating for each one if it is solved yes/no. Remember you'll want to solve all the mazes for the final project.
- Q7: Any feedback about this project?

Handin checklist:

Hand in the following:

A release candidate build file: ANDREWID_P11.tgz
Your writeup, which should be in a single acrobat file called p11_writeup_[FamilyName]_[GivenName]_[AndrewID].pdf

Other requirements:

Use the usual project naming conventions with a prefix of "p11" for items other than source code.
The writeup shall be in a single acrobat file for ease of navigation.
Fonts for all diagrams and text shall be rendered at no smaller than 10 point font. Smaller fonts will require a resubmission

Zip the files and submit them as P11_[Family name]_[First name]_[Andrew ID].zip.

The rubric for the project is found here.

Hints:

The invariants defined in projects 10 & 11 completely replace the less formal constraints you've seen in previous projects. The final project will require that you satisfy the written in project 10 & 11 to implement the constraints previously described.
There is a sleep statement between launching monitors in the scripts we give you. In some cases the sleep is too short to finish launching a monitor (it depends on a lot of conditions such as how fast your virtual machine can run on your physical machine and might vary from run to run). If you have problems with monitors not running or situations that work with one monitor but fail with multiple monitors try lengthening the sleep time in that launching loop.
Don't forget that there is a way in the code to speed up the turtle by reducing a time-wasting loop if your system is running slower than you'd like while you are doing testing.
Mazes 9 and 10 might look the same, but maze 10 starts with a random orientation.
ROS is an inherently asynchronous system. This means that messages will arrive in different orders, and it is possible for the same type of message to arrive twice in a row without some other message arriving at all due to changing message ordering. (The project calls these 'interrupts' from the system point of view, but as implemented in ROS they are basically messages.) Note that ROS is designed to operate in real time on physical systems, so even though we are running simulations, the wall time (i.e., time perceived by you while you watch the turtle run the maze) that passes when running the simulation is relevant to ROS behavior.
- If you try to run your system faster than the jitter of message arrival time in ROS you might see two, three, or more of one type of message before another type of message arrives. That means that messages that arrive might be from different turtle positions and/or different turtle orientations, causing problems for your monitors.
- Slowing down how fast the maze runs (basically slowing the speed of the turtle on the screen in real time) is usually the easiest way to resolve race conditions caused by excessive message timing jitter. In general what you want to do is run the turtle slow enough that all the messages from one move get sent before the next move is attempted. ("move" is either a turn or forward move.) In real life, if you need to run faster than this, you need to deal with distributed time issues by, for example, using synchronization or group membership protocols, in a way that gets pretty complicated and pretty painful in a hurry.
- Hint: if you see monitor violations that change from run to run on the same maze, that is very likely a race condition (asynchronous system defect) rather than straightforward logic error.
- Grading criteria note: In case of monitor violations that might be due to timing errors, most students are able to resolve these. However, for the purposes of the course grade that passes acceptance testing, the following criteria will be used:
  - If a maze does not produce monitor violations (and does not have false negative violations involving unreported turtle misbehavior), the run counts as violation-free.
  - If a maze does produce monitor violations, it will be run a second time on the same maze. If the monitor violations disappear on the second try, that also counts as violation-free.
  - If both first and second runs have violations, it will be run a third time on the same maze.
    - If the third run has no violations, that counts as violation-free.
    - If any violation happens at the same cell in the maze, that will be presumed to be a logic error unless the student can show otherwise.
    - If all violations happen in different places in the maze (i.e., no three runs have the same violation at the same square in the maze), the TA has authority to determine it is a timing violation. The acceptance test may be counted as violation-free at TA discretion. Points are likely to be taken off for the project, but it may count as completed for the course letter grade requirements.
You might see out of order message arrival leading to issues with multiple monitors. This is a result of the non-deterministic, unsynchronized nature of ROS. We expect that your code works despite these circumstances. There are many ways to mitigate this issue: sleeping a thread, sequence numbers, etc.
Recall that from Project 10 we recommended you exit after a few cycles in the goal square. If you use Ctrl-C you might see an error such as "boost: mutex lock failed in pthread_mutex_lock: Invalid argument" As long as the monitors execute properly we will not deduct points for this error.
Repeated from Project 10:
- Bug fix you might need to make: We had a bug report that the monitor script was reporting violations that didn’t exist and seemed to be triggering on ROS_INFO lines. After some investigation, it appears that the run_642_monitors.sh script is matching on regular expressions instead of literal text which is causing it to match on the info command that affects some, but not all, students. If this affects you, the corrected code is
  - 22: grep -C 5 "\[ WARN\]" $m.output.tmp >> VIOLATIONS.txt
  - 23: m_viol=`grep "\[ WARN\]" $m.output.tmp | wc -l`
Some of you might see an exception that looks like this when you Ctrl-C the terminal: [master] killing on exit terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >' what(): boost: mutex lock failed in pthread_mutex_lock: Invalid argument
- In most cases this issue can be avoided by having the system do a clean self-shutdown instead of a Ctrl-C at the end of the maze run.
- Note that while this isn’t perfect, we will NOT be taking any points off if this happens during a demo. So, if you are properly executing the monitors, you will earn full points. Ctrl-C is a pretty violent way to shut down a piece of software, and apparently ROS buffer management isn't very robust to that. If you want to dig deeper come to office hours and we can suggest some ways to do better, but this is NOT required for the project.

Changelog:

6/28/2021: issued
10/17/2021: updated hints
11/21/2021: deleted reference to unused documentation hand-in; changed hand-in zip file name for consistency.
10/20/2023: updated with font size requirement