Author: Niko Gupta

Niko’s Status Report for 4-19

This past week, I put a lot of work into wrapping up the functionality of the interaction layer and integrating all three parts together.

The way we broke the project up, node infrastructure (AWS hosting) fell under the interaction layer umbrella, since the ‘node’ process is the top-level process running on each node and managing the other processes. As such, it fell to me to take care of running the frontend webapp and making sure that the data read / write pipeline worked from the frontend all the way through the interaction layer and down to the hardware emulator on each node.

Below, I’ve broken up my work this past week into different categories, and I elaborate on what I worked on in each category.

  •  Infrastructure
    • setup script: I changed the setup scripts for bringing up a new node to create a single shared Python virtual environment for the project, rather than a separate one for each of the three parts. This made working across different parts of the system much easier, since I no longer had to worry about which virtual environment was active.
    • AWS
      • Set up all the nodes: I created 5 unique nodes, each with a different config and hardware description. Since initial device commissioning is out of the scope of the project, the system has to start in an already-initialized steady state for any demo. This meant hardcoding each device’s configuration so that it comes up in a working state.
      • Elastic IPs: I figured out how to create and assign a static IP address (called an elastic IP by AWS) to each node, so that I could easily hard code the IP address within the code instead of dynamically gathering it each time Amazon decides to change it.
      • Static IP for webapp: I started looking into defining an extra elastic IP address that is reserved for the webapp instead of for a specific node. The way this would work is that all nodes have a unique static IP address, but the master node has an additional public IP address pointing to it. If and when the master dies and another node promotes to master, the new master will dynamically claim that second static IP from AWS. The result of this would be that the same IP address would always point to the webapp, even if the node hosting the webapp changes. I hit a few issues, such as restarting the network stack on an Ubuntu 18.04 virtual machine, and couldn’t get this working by the final demo.
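        For reference, the AWS side of the reassociation step I was aiming for comes down to a few boto3 calls. This is only a sketch; the region, allocation ID, and instance ID are placeholders:

        import boto3

        ec2 = boto3.client('ec2', region_name='us-east-1')   # placeholder region

        ec2.associate_address(
            AllocationId='eipalloc-XXXXXXXX',    # the shared webapp elastic IP
            InstanceId='i-XXXXXXXXXXXXXXXXX',    # the new master's instance
            AllowReassociation=True)             # take it from the dead master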
  • Interaction layer
    • CONFIG and hardware_description parser: I made the definition of the CONFIG file clearer and more rigid, and added functionality to easily parse it into a dictionary. I also created a helper function to parse a “hardware_description.json” file, which describes the hardware located on the node. This was a requirement of Rip’s hardware emulator that I hadn’t anticipated, so it had to be handled during integration.
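      The parser itself is thin; roughly the following (a sketch only, since the actual fields in the file are defined by Rip’s emulator):

      import json

      def parse_hardware_description(path='hardware_description.json'):
          # Return the node's hardware description as a plain dictionary.
          with open(path) as f:
              return json.load(f)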
    • Database schema updates: As discussed in last week’s update, Richard and I discussed certain updates that had to be made to the database schema. I updated the database (and database management helper functions) as per our discussion.
    • help_lib for Richard’s frontend webapp: I added the following functions to a file called help_lib.py, with the intent that they facilitate communication between the webapp’s backend and the interaction layer’s distributed data storage (a sketch of getNodeStatus follows the list).
      • addInteraction: takes a new interaction and sends it to all nodes in the system to write to their database
      • deleteInteraction: same as addInteraction; deletes an interaction on all nodes in the system
      • updateInteraction: same as above, but updates an interaction in place instead of deleting it. The way the frontend is currently set up, there is no way for a user to modify an existing interaction, so this function is unused.
      • updateNode: update the display name and description of a node. This again syncs across all nodes.
      • getNodeStatus: takes a list of node serial numbers and, for each node, gets its current value from the node itself through the MQTT broker. If a node is dead, it short-circuits and reports the node as dead instead of trying to ping it for a status update.
      • setStatus: set the value for a particular node, e.g. turn a light on, set off an alarm, etc. This communicates with the node through the broker.
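      As promised, here is a rough sketch of what getNodeStatus looks like. The db_lib helper (is_alive), the module names, and the topic naming scheme are illustrative guesses, not the real interface:

      import db_lib                         # database helpers (hypothetical name)
      from mqtt_socket import MqttSocket    # class described below (hypothetical module)

      def getNodeStatus(serials):
          statuses = {}
          for serial in serials:
              if not db_lib.is_alive(serial):   # short circuit: node known to be dead
                  statuses[serial] = 'dead'
                  continue
              sock = MqttSocket()
              sock.setListen('node/%s/response' % serial)
              statuses[serial] = sock.getResponse('node/%s/request' % serial, 'status')
              sock.cleanup()
          return statuses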
    • db_lib for sql interactions (read / write etc): added a lot of functionality to the db_lib file for managing the database, so that code outside db_lib has to do minimal work to read from or write to the database and, in particular, never has to touch SQL directly.
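      To give the flavor of it, here is a sketch of the kind of wrappers db_lib provides; the path, table, and column names are placeholders, not the real schema:

      import sqlite3

      DB_PATH = 'node.db'   # placeholder path

      def get_interactions():
          # Callers get plain dictionaries back; no SQL leaks out.
          with sqlite3.connect(DB_PATH) as conn:
              conn.row_factory = sqlite3.Row
              return [dict(r) for r in conn.execute('SELECT * FROM interactions')]

      def add_interaction(trigger, action):
          with sqlite3.connect(DB_PATH) as conn:
              conn.execute('INSERT INTO interactions (trigger, action) VALUES (?, ?)',
                           (trigger, action))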
    • Proper logging setup: updated the logging in the system to use the Python logging module and produce logs useful for debugging and system monitoring.
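      The setup is standard; a sketch (the file name and format string are illustrative):

      import logging

      logging.basicConfig(
          filename='node.log',
          level=logging.DEBUG,
          format='%(asctime)s %(levelname)s %(name)s: %(message)s')

      log = logging.getLogger('interaction_layer')
      log.info('node process started')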
    • MqttSocket: created a new class called MqttSocket (i.e. socket-style functionality over MQTT). Currently only used by getNodeStatus, this class is meant to describe and facilitate a handshake-type interaction between two nodes. We decided to route all node communication through the broker instead of directly from node to node in order to facilitate interactions. However, sometimes one node has to specifically request a piece of information from another node, which follows a request / response pattern. The async and inherently separated publish / subscribe nature of MQTT makes it fairly convoluted to follow this request / response pattern, so I packaged the convoluted logic into a neat helper class that makes the request / response cycle very easy. Here is an example of how it’s used:

      sock = MqttSocket()

      # topic to listen for a response
      sock.setListen('node/response')

      # blocking function, sends 'data' to 'node/request' and returns the
      # response sent on 'node/response'
      response = sock.getResponse('node/request', data)

      sock.cleanup()
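      Under the hood, the class is roughly the following shape. This is a simplified sketch using paho-mqtt and a threading.Event, not the real implementation (which has more error handling); the broker host is a placeholder:

      import threading
      import paho.mqtt.client as mqtt

      class MqttSocket:
          def __init__(self, host='localhost'):
              self._event = threading.Event()
              self._response = None
              self._client = mqtt.Client()
              self._client.on_message = self._on_message
              self._client.connect(host)
              self._client.loop_start()   # network loop runs in a background thread

          def _on_message(self, client, userdata, msg):
              # Any message on a subscribed topic is treated as the response.
              self._response = msg.payload
              self._event.set()

          def setListen(self, topic):
              self._client.subscribe(topic)

          def getResponse(self, topic, data):
              # Publish the request, then block until the response arrives.
              self._event.clear()
              self._client.publish(topic, data)
              self._event.wait()
              return self._response

          def cleanup(self):
              self._client.loop_stop()
              self._client.disconnect()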

    • Master failover: master failover hinges on nodes knowing whether other nodes in the system (in particular the current master) are alive. Initially, I planned to do this using regular heartbeats. However, I realized that the MQTT protocol has built-in support for exactly this behavior, namely topic wills and on_disconnect callbacks.
      • topic wills: any client that connects to a broker can define a ‘will’ message to be sent on any number of topics. If the client disconnects from the broker without properly sending a DISCONNECT packet, the broker assumes it died and sends its will message to all clients subscribed to those topics. I used this to implement heartbeats: all nodes listen on the “heartbeats” topic, so if a node dies, they are all notified of its death and update their local databases accordingly. If the master itself dies, the notification instead comes through the on_disconnect callback.
      • on_disconnect: if the underlying socket connection between paho-mqtt (the library I’m using for client-side MQTT behavior between the nodes and the broker) and the broker breaks, the library invokes an optional on_disconnect callback. This callback fires any time the client disconnects from the broker; since the nodes never intentionally disconnect, it only fires when the broker has died. Because the broker runs on the master, this is how nodes are notified of a master’s death and can begin the failover process.
    • By using topic wills and on_disconnect, I avoid sending frequent heartbeat publishes from each node, which would cost unnecessary bandwidth. If a node receives notice that the master has gone down, it selects a new master: the node with the lowest serial number among the nodes currently alive. If that happens to be the current node, it starts the master process; otherwise, it tries to connect to the broker on the new master node. Currently, all of the above works except for the final reconnect step. For some reason, the client MQTT library is having trouble throwing away the previous connection and connecting to the new broker. As such, the failover happens, but the nodes can’t communicate afterwards :(. I will fix this before the public demo.
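      Concretely, the pieces fit together roughly as in the sketch below. The serial number, topic names, broker IP, and helper functions (get_alive_serials, start_master_process, ip_of) are all illustrative placeholders:

      import paho.mqtt.client as mqtt

      MY_SERIAL = 3        # placeholder: this node's serial number

      client = mqtt.Client()

      # Will: if this node vanishes without a clean DISCONNECT, the broker
      # publishes this message on 'heartbeats' for every other node to see.
      client.will_set('heartbeats', payload='dead:%d' % MY_SERIAL)

      def on_disconnect(client, userdata, rc):
          # The socket to the broker broke; nodes never disconnect on
          # purpose, so assume the master (which hosts the broker) died.
          alive = get_alive_serials()            # hypothetical db query
          new_master = min(alive)
          if new_master == MY_SERIAL:
              start_master_process()             # hypothetical helper
          else:
              # this reconnect is the step that's currently failing
              client.connect(ip_of(new_master))  # hypothetical IP lookup

      client.on_disconnect = on_disconnect
      client.connect('10.0.0.1')                 # placeholder: current master's IP
      client.loop_forever()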
    • populate db: while hardcoding is generally frowned upon, our system has a defined steady state for demos, and I needed an easy way to get to it. I made a script that populates the database with the hardcoded node values as they exist in that steady state, so it’s very easy to reset a database’s state.
    • Path expansion for integration: this was a small bug that I found interesting enough to include here. On the node, the top-level directory contains three repositories: hardware-emulator, interaction-layer, and ecp-webapp. The helper functions I wrote for Richard’s frontend live in interaction-layer/help_lib.py, and they are used in ecp-webapp/flask-backend/app.py. More importantly, those helper functions used relative paths to access the config and database files, which broke when the webapp called them from its own directory. I changed the helper lib to expand relative paths to absolute paths, so the functions can be called from anywhere in the system.
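      The fix itself is small; the idea, as a sketch:

      import os

      # Resolve paths against help_lib.py's own location, not the caller's
      # working directory.
      _BASE = os.path.dirname(os.path.abspath(__file__))

      def expand(rel_path):
          return os.path.join(_BASE, rel_path)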
    • interactions: interaction definitions were ironed out to match both the way the hardware emulator expects values to be and the way the frontend expects them to be. Since my layer sits in the middle, I had to be careful about parsing, storing, and acting upon interactions.
  • Problems to be fixed this upcoming week:
    • broker failover: as discussed above in the master failover section, after a master failover, nodes fail to reconnect to the new broker. This needs to be fixed.
    • conflicting interactions: the frontend allows you to define conflicting interactions, which can arbitrarily thrash the system. For example, I could define the following two interactions:

      if motion sensor > 5, turn light on
      if motion sensor > 5, turn light off

      Now, if the motion sensor is triggered, the light will begin rapidly flickering, which is annoying and probably unintended. I think it would be cool if the frontend could identify such conflicting interactions, but doing so may end up being far more complicated than I suspect, and identifying longer loops may be too hard.
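      A first pass at detection might only catch direct conflicts, along the lines of the sketch below; the (trigger, action) representation is illustrative, and loops through multiple devices would need something like a graph traversal:

      from itertools import combinations

      def find_conflicts(interactions):
          # interactions: list of (trigger, action) pairs. Flag any two
          # with the same trigger but different actions.
          return [(a, b) for a, b in combinations(interactions, 2)
                  if a[0] == b[0] and a[1] != b[1]]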

    • setting sensors on the frontend: the frontend currently allows you to set a value for any device, such as setting “light = 1” (turn on the light). However, if you try to set a value for a sensor node, the backend throws an exception, crashing the interaction layer. This behavior needs to be prohibited.
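      The guard itself should be simple, assuming the hardware description records a device type for each serial; the field name and helpers here are illustrative:

      def setStatus(serial, value):
          if hardware_description[serial].get('type') == 'sensor':
              # Sensors are read-only; report an error instead of letting an
              # exception take down the interaction layer.
              raise ValueError('cannot set a value on sensor %s' % serial)
          publish_set(serial, value)    # hypothetical broker publish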
    • measure latency: in our design docs, we defined latency requirements. While the logging facilities are in place to measure this latency, we need to actually do the measurements for the final report.

Niko’s Status Report for 4-12

This past week I worked more on integration between the webapp and the interaction layer. Richard and I had a discussion and realized we had different ideas for how the database would store the information we’d agreed upon. After a long discussion, we settled on a modified subset of the database schema, and I worked on changing the schema to match.

I also worked on some helper functions to facilitate interaction with both the database and the MQTT broker. Besides making my own code cleaner, this begins to provide an interface for the Flask backend of Richard’s webapp to obtain node data and present it to the frontend for the user.

For this upcoming week, I want to finish the API that I worked on this past week and finish integrating it with Richard’s layer. By the end of the week, the frontend should be presenting no hardcoded data; it should pull all of its information directly from the nodes.

I also would like to reopen the API discussion with Rip, and outline exactly how his layer and mine will interact.

Team Status Report for 4-5

Our team updates for the past week are largely in two parts. You can see more details on each person’s updates in the individual status reports; here we will largely discuss integration efforts and issues.

  • Webapp and interaction layer integration:
    • This week we began integration between these two layers. Currently, the interaction layer can start and run the webapp, and make sure it comes back up even when it dies.
    • Setup scripts have been written for both layers (individually and together), facilitating future development.
  • The virtual environment problem:
    • When Richard developed the webapp, he used virtualenv to create the Python virtual environment, while Niko used venv to create his. When creating the setup scripts, there was an issue with virtualenv that caused the environment not to activate properly and the webapp to fail during initialization. When testing manually, however, Niko would use venv, and the webapp would work. After a lot of collaborative debugging, Niko and Richard figured out that venv works better on the Ubuntu VMs, and they updated the setup scripts accordingly.

Niko’s Status Report for 4-5

This week I did a lot of work on the interaction layer in preparation for the demo on Monday. Here are the areas I worked on:

  • Setup / install scripts:
    • While user-friendly device commissioning is not in the scope of our project, the fact remains that we still have to “commission” devices during development. This involves creating an AWS VM, cloning the appropriate repos, installing dependencies, setting up Python virtual environments, initializing device configs, initializing the database, etc. Since this is not something anybody wants to do more than once or twice, I created setup scripts for both the frontend webapp and the interaction layer. I also made a top-level script that clones all the repos and runs their setup scripts. That way, after starting a fresh VM, all you need to do is scp the top-level script over, run it, and wait for it to finish.
  • Integration:
    • I spent a lot of time this week working to integrate the interaction and frontend webapp layers. Currently, the interaction layer is able to start and run the frontend, and I have written the setup scripts for both layers. For next week, I still need to tie the webapp’s backend into the interaction layer so that it no longer has hardcoded dummy data.
  • Master process:
    • I initially wrote the master process in Python, since that is what the rest of the interaction layer is written in. However, I quickly realized that all I was doing was running shell commands from Python, such as checking whether a process was up and starting it if not. It doesn’t really make sense to run nothing but bash commands from within Python, and Python wasn’t making my life easier, so I decided it would be better to implement the master as a bash script. This greatly simplified its logic and made it a more elegant program. The master is in charge of starting and keeping alive the frontend webapp, the node process, and the MQTT broker. Once the interaction layer is integrated with the hardware layer, it will also be in charge of starting and keeping alive the hardware simulation webapp.
  • Node interactions:
    • I got the nodes to be able to subscribe and publish to each other and react to data from other nodes. While the actual definition of an “interaction” still needs to be ironed out between me and Richard (frontend webapp), the infrastructure is now in place.

Niko’s Status Report for 3-29

This past week I thought a lot about node setup and the relationships between all of the moving parts in the interaction layer.

I will begin by discussing node setup. While we don’t plan to incorporate new-device commissioning into our project, we do need some way of bootstrapping a node, at least for our own development. To that end, I wrote some bootstrapping scripts to obtain the appropriate code, set up config files, install necessary libraries, and set up the database.

I also spent some time designing the database and its schema, which can be found in more detail here (note that while unlikely, this link could change in the future; if so, see the readme here). Most importantly, I defined what the tables will look like and what datatypes will exist in each table.

With regards to the master process, I wrote a preliminary Python executable that checks in a loop whether the broker and webapp are running and, if not, starts them. While I think the master may end up having to do a few more things, I think that for the most part this will be its sole purpose.

As for the node process, I spent some time debating the merits of implementing it in C / C++ vs. in Python. This was a difficult decision because the node process is where the bulk of the actual interaction logic will live. The main problem with Python is that it is not particularly good for parallel programming. While constructs for concurrent execution exist (i.e. threads), each thread must acquire Python’s global interpreter lock to run, which serializes execution. Processes could be used instead, but they are a much more heavyweight alternative for a problem that only needs small snippets of concurrent code.

Since most of the node’s communication will go through the MQTT broker over the network (an inherently async operation), bottlenecking the system by serializing execution seems at first glance to be a mistake, and points toward C as the better solution. That being said, I believe we can get around this problem and still use Python. As long as we keep each thread very lightweight and limit how long each thread can run (limiting blocking operations), it should be no problem if execution is effectively serialized. I think this lets us take advantage of Python’s very powerful facilities while eliminating the complexity of C.

One thing that was brought up in our SOW review was that we should redefine our latency requirements. As we discussed, latency requirements between our nodes won’t really mean anything, since the nodes are no longer on the same network and are therefore subject to potentially unpredictable delays outside our control. However, we do have some level of control over the on-device latency. While it’s true that virtualized hardware such as AWS EC2 instances doesn’t guarantee continuous execution (the hypervisor can interleave execution of different machines), we believe this effect will be less noticeable than network latency.

After thinking about it a bit, I decided the simplest way to measure this on-device latency is as follows: when a relevant piece of data is received from the broker (i.e. something that should trigger an interaction), the node writes a timestamp to a log file. When that piece of data is finally acted upon, the node writes a second timestamp. The difference between the two timestamps is the on-device latency.
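In code, the measurement reduces to two log lines per interaction. A sketch (the logger name and message fields are illustrative):

  import logging
  import time

  log = logging.getLogger('latency')

  def handle_trigger(interaction_id, act):
      # First timestamp: relevant data received from the broker.
      start = time.monotonic()
      log.info('recv %s at %f', interaction_id, start)
      act()    # perform the interaction
      # Second timestamp: the data has been acted upon.
      end = time.monotonic()
      log.info('done %s at %f (on-device latency %.3fs)',
               interaction_id, end, end - start)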

Another thing that I worked on was defining more formally how all the moving parts in the interaction layer interact. See the below diagram for more information:

Moving forward, I have a few goals for the upcoming week.

  • Latency: I would like to define an on-device latency requirement that is reasonable and in line with the initial (and modified) goals of this project.
  • APIs:
    • Work with Rip to define how my layer will interact with his hardware. Currently we are planning on having Rip implement a “hardware” library with a simple API that I can call to interact with the “hardware”. This would include functions such as “read_sensor_data” and “turn_on”, etc. I would like to iron out this interface with him by next weekend.
    • Work with Richard to interface with the webapp. As I currently understand it, the webapp will need to publish config changes, read existing config data, and request sensor data from other nodes. While I plan for all this functionality to be facilitated by the broker, I would like to implement a simple library of helper functions to mask this fact from the webapp. Ideally, Richard will be able to call functions such as get_data(node_serial, time_start, time_end) or publish_config_change(config). I would also like to iron out this API with him by this weekend, even if the library itself isn’t finished.
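      For concreteness, the signatures I have in mind look something like the stubs below (bodies still to be written; the names come from the examples above):

      def get_data(node_serial, time_start, time_end):
          # Ask node `node_serial`, via the broker, for the sensor data it
          # recorded between time_start and time_end.
          ...

      def publish_config_change(config):
          # Broadcast a config change to every node via the broker.
          ...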
  • Node process: I would like a simple version of the node process done by this weekend (a skeleton sketch follows this list). I think this subset of the total functionality is sufficient for the midsemester demo. It should function with hardcoded values in the database and mock versions of the hardware library / webapp to work against. This process should be able to:
    • Read config / interaction data from the database
    • Publish data to the broker
    • Receive data from the broker and act upon it (do interactions)
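    A minimal skeleton of that process might look like the following; the broker host, topic filter, and the interaction representation (load_interactions, matches, act) are all illustrative placeholders:

    import paho.mqtt.client as mqtt

    interactions = load_interactions()    # read config / interaction data from the db

    def on_message(client, userdata, msg):
        # React to data from other nodes: run every interaction whose
        # trigger matches this message.
        for inter in interactions:
            if inter.matches(msg.topic, msg.payload):
                inter.act(client)         # may publish commands back out

    client = mqtt.Client()
    client.on_message = on_message
    client.connect('localhost')           # placeholder broker host
    client.subscribe('nodes/#')           # placeholder topic filter
    client.loop_forever()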

Team Status Report for 3-21

This past week our team worked on adapting to the changing situation and the shift to virtual classes and projects. From a team perspective, we worked to establish regular channels of communication, and scheduled recurring Zoom meetings for Monday and Wednesday 11:00-12:30 and Thursday 1:30-2:30. In addition, we began discussing how we can communicate problems, progress, and questions outside of those meetings.

Outside of the team dynamic, we also worked on our statement of work. This document details how our project needs to change so that it can be done entirely virtually. Due to the rapidly shifting situation and the sporadic shipping of parts, we have decided to (almost) entirely remove hardware from the project. The only exception is that Rip will obtain a few small IoT devices that support developer APIs and use them to generate sensor data sets. From that point forward, we will simulate the entire system (sensors, hardware, network, etc.) in the cloud and serve the sensor data that Rip recorded.

Besides the hardware changes, the webapp and device interactions remain largely the same. The main difference with those is that they will now be hosted on a cloud server, as opposed to on a local IoT device.

Team Status Report for 3-14

This past week, all of our team was together for spring break. We built this week into our Gantt chart knowing that we would not work on the project, and as planned, we did not do any work on it.

Niko’s Status Report for 3-21

This past week, I tried to adjust to the shift to online classes. I got back to Pittsburgh on Sunday the 15th, but spent a majority of the week packing up all my belongings. I plan to drive back home tomorrow (Sunday the 22nd) so that I can be with my family during the pandemic.

That being said, I worked with my team to create a modified statement of work (see our content page). This document shows how we have changed our scope and requirements to better fit the necessary shift to an almost entirely software project. Based on our discussions, it seems that both my interaction layer and Richard’s webapp will remain largely unchanged, while Rip’s hardware will need some level of change. Prior to this change, my interaction layer was to interact with the hardware through an API that let me read sensor data and write commands to control the devices. Since we no longer have hardware, Rip will recreate a similar API so that my code can continue to function as before, but he will implement a software library behind the API to emulate the hardware.

Niko’s Status Report for 3-14

From 3-4 through now, I have been traveling, and my laptop has been in the Apple store. Unfortunately, all Apple stores nationwide have been shut down, and my laptop is currently still in the shop. I am looking into how I can get a replacement device.

Update: As of Sunday 3-15, the specific store that has my laptop will reopen for 24 hours for people to get their repaired devices. I now have my laptop and can continue working.

Niko’s Status Report for 3-7

From 2-29 to 3-2, I continued developing the prototype as discussed in my previous post, and did research based on our design review feedback. I primarily looked into the Nano board as an alternative to the Raspberry Pi. While it seems like it could be a viable alternative, we already have Raspberry Pis on hand, and with the unpredictable shipping out of China, I believe the best option is to move forward with the Pis.

On 3-2 I had to send my laptop to the apple store for repairs, and so did not do any coding from 3-2 to 3-7.