Category: Niko’s Status Report

Niko’s Status Report for 4-19

This past week, I put a lot of work into wrapping up the functionality of the interaction layer and integrating all three parts together.

The way we broke the project up, node infrastructure (AWS hosting) fell under the interaction layer umbrella, since the ‘node’ process is the top level process running on the node and managing the other processes. As such, it fell to me to take care of running the frontend webapp and making sure that the data read / write pipeline worked from the frontend all the way through the interaction layer and down to the hardware emulator for each node.

Below, I’ve broken up my work this past week into different categories, and I elaborate on what I worked on in each category.

  •  Infrastructure
    • setup script: I changed the setup scripts for setting up a new node to create a shared Python virtual environment for the project, rather than a different one for each of the three parts. This made it much easier when manipulating different parts of the system, since I no longer had to worry about which virtual environment was being used.
    • AWS
      • Set up all the nodes: I created 5 unique nodes, each with a different config and hardware description. Since initial device commissioning is out of the scope of the project, when the system “starts” for any demo, it has to already be initialized to a steady state. This means I had to hardcode each device to work properly.
      • Elastic IPs: I figured out how to create and assign a static IP address (called an elastic IP by AWS) to each node, so that I could easily hard code the IP address within the code instead of dynamically gathering it each time Amazon decides to change it.
      • Static IP for webapp: I started looking into defining an extra elastic IP address that is reserved for the webapp instead of for a specific node. The way this would work is that all nodes have a unique static IP address, but the master node has an additional public IP address pointing to it. If and when the master dies and another node promotes to master, the new master will dynamically claim that second static IP from AWS. The result of this would be that the same IP address would always point to the webapp, even if the node hosting the webapp changes. I hit a few issues, such as restarting the network stack on an Ubuntu 18.04 virtual machine, and couldn’t get this working by the final demo.
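        For reference, here is a minimal sketch of what the reassociation step might look like using boto3 (the region, allocation ID, and instance ID are placeholders of mine, not the final implementation):

        import boto3

        # On the newly promoted master: claim the elastic IP reserved
        # for the webapp. All IDs below are placeholders.
        ec2 = boto3.client('ec2', region_name='us-east-1')
        ec2.associate_address(
            AllocationId='eipalloc-XXXXXXXX',  # the webapp's reserved elastic IP
            InstanceId='i-XXXXXXXXXXXXXXXXX',  # this node's EC2 instance ID
            AllowReassociation=True,           # take it over from the dead master
        )
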
  • Interaction layer
    • CONFIG and hardware_description parser: I made the definition for the CONFIG file clearer and more rigid, and added functionality to easily parse it into a dictionary. I also created a helper function to parse a “hardware_description.json” file, which describes the hardware located on the node. This was something required by Rip’s hardware emulator that I hadn’t expected, and it had to be done during integration.
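      As a rough sketch (the key=value CONFIG format and the default paths are my assumptions here, not the final schema), the parsing helpers look something like this:

      import json

      def parse_config(path='CONFIG'):
          # Assumed format: one 'key=value' pair per line, '#' for comments.
          config = {}
          with open(path) as f:
              for line in f:
                  line = line.strip()
                  if line and not line.startswith('#'):
                      key, value = line.split('=', 1)
                      config[key.strip()] = value.strip()
          return config

      def parse_hardware_description(path='hardware_description.json'):
          # Returns a dict describing the hardware attached to this node.
          with open(path) as f:
              return json.load(f)
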
    • Database schema updates: As mentioned in last week’s update, Richard and I agreed on certain changes that had to be made to the database schema. I updated the database (and the database management helper functions) as per our discussion.
    • help_lib for Richard’s frontend webapp: I added the following functions to a file called help_lib.py, with the intent that these would facilitate communication between the webapp’s backend and the interaction layer’s distributed data storage (a sketch of the interface follows this list).
      • addInteraction: takes a new interaction and sends it to all nodes in the system to write to their database
      • deleteInteraction: same as addInteraction; deletes an interaction on all nodes in the system
      • updateInteraction: same as above, but updates an interaction in place instead of deleting it. The way the frontend is currently set up, there is no way for a user to modify an existing interaction, so this function is unused.
      • updateNode: update the display name and description of a node. This again makes sure to sync across all nodes
      • getNodeStatus: takes a list of node serial numbers and, for each node, fetches that node’s current value from the node itself through the MQTT broker. If a node is marked dead, it short-circuits and returns that the node is dead instead of trying to ping it for a status update
      • setStatus: sets the value for a particular node, e.g. turn a light on, set off an alarm, etc. This communicates with that node through the broker.
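      As promised above, here is a sketch of the interface. The bodies are illustrative only; get_all_nodes, send_to_node, is_alive, and request_status are hypothetical helpers, not the real implementation:

      def addInteraction(interaction):
          # Broadcast a new interaction so every node writes it to its database.
          for node in get_all_nodes():
              send_to_node(node, 'interaction/add', interaction)

      def getNodeStatus(serials):
          # Map each serial number to its current value, short-circuiting
          # nodes that the local database already marks as dead.
          statuses = {}
          for serial in serials:
              if not is_alive(serial):
                  statuses[serial] = 'dead'
              else:
                  statuses[serial] = request_status(serial)  # via the MQTT broker
          return statuses
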
    • db_lib for SQL interactions (read / write, etc.): added a lot of functionality to the db_lib file for managing the database, so that code outside the db_lib file has to do minimal work to read from / write to the database, and in particular, never has to touch SQL directly.
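      A representative helper might look like this (the table and file names are placeholders):

      import sqlite3

      DB_PATH = 'node.db'  # placeholder path

      def get_interactions():
          # Callers get plain rows back and never write SQL themselves.
          conn = sqlite3.connect(DB_PATH)
          try:
              return conn.execute('SELECT * FROM interactions').fetchall()
          finally:
              conn.close()
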
    • Proper logging setup: updated the logging in the system to use the Python logging module and create logs useful for debugging and system monitoring.
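      The setup is the standard logging-module pattern, roughly as follows (the file name and format string are illustrative):

      import logging

      logging.basicConfig(
          filename='node.log',  # assumed log location
          level=logging.DEBUG,
          format='%(asctime)s %(levelname)s %(name)s: %(message)s',
      )
      log = logging.getLogger('interaction_layer')
      log.info('node process started')
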
    • MqttSocket: created a new class called MqttSocket (i.e. socket-type functionality over MQTT). Currently only used by getNodeStatus, this class is meant to describe and facilitate a handshake-type interaction between two nodes. We decided to do all node communication through the broker instead of directly from node to node in order to facilitate interactions. However, sometimes one node has to specifically request a piece of information from another node, which follows a request / response pattern. The async and inherently separated publish / subscribe nature of MQTT makes it fairly convoluted to follow this pattern, so I packaged that logic into a neat helper class that makes the request / response cycle very easy. Here is an example of how it’s used:

      sock = MqttSocket()

      # topic to listen for a response
      sock.setListen('node/response')

      # blocking function, sends 'data' to 'node/request' and returns the
      # response sent on 'node/response'
      response = sock.getResponse('node/request', data)

      sock.cleanup()
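
      Under the hood, a class like this can be built from paho-mqtt plus a threading.Event that turns the async on_message callback into a blocking wait. A minimal sketch (my reconstruction, not the actual class; the default broker host is an assumption):

      import threading
      import paho.mqtt.client as mqtt

      class MqttSocket:
          def __init__(self, host='localhost'):  # assumed default broker host
              self._response = None
              self._event = threading.Event()
              self._client = mqtt.Client()
              self._client.on_message = self._on_message
              self._client.connect(host)
              self._client.loop_start()

          def setListen(self, topic):
              self._client.subscribe(topic)

          def _on_message(self, client, userdata, msg):
              self._response = msg.payload
              self._event.set()

          def getResponse(self, topic, data, timeout=5):
              # Publish the request, then block until the response arrives.
              self._event.clear()
              self._client.publish(topic, data)
              self._event.wait(timeout)
              return self._response

          def cleanup(self):
              self._client.loop_stop()
              self._client.disconnect()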

    • Master failover: master failover hinges on the fact that nodes know whether other nodes in the system (in particular the current master) are alive. Initially, I planned to do this using regular heartbeats. However, I realized that the MQTT protocol has built-in support for exactly this behavior, namely topic wills and on_disconnect callbacks.
      • topic wills: any client that connects to a broker can define a ‘will’ message to be sent on a topic of its choosing. If the node disconnects from the broker without properly sending a DISCONNECT packet, the broker will assume it died and send its will message to all clients subscribed to that topic. I used this to implement heartbeats: all nodes listen to the “heartbeats” topic, and if a node dies, they will all be notified of its death and update their local database accordingly. If the master itself dies, notification happens through the on_disconnect callback instead.
      • on_disconnect: if the underlying socket connection between the paho mqtt library (the library I’m using for client-side MQTT behavior between the nodes and the broker) and the broker is broken, it will invoke an optional on_disconnect callback. This callback fires any time the node client disconnects from the broker; however, since the nodes should never intentionally disconnect, this will only happen if the broker has died. This way, nodes are notified of a master’s death and can begin the failover process.
    • By using topic wills and on_disconnect, I avoid sending frequent heartbeat publishes from each node, which would cost unnecessary bandwidth. If a node receives notice that the master has gone down, it will select a new master. The next master is the node with the lowest serial number among the nodes that are currently alive. If that happens to be the current node, it will start the master process; otherwise, it will try to connect to the broker on the new master node. Currently, all of the above works except for the final reconnect step. For some reason, the client mqtt library is having trouble throwing away the previous connection and connecting to the new broker. As such, the failover happens, but nodes can’t communicate afterwards :(. I will fix this before the “public demo”.
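      To make this concrete, here is roughly how the two mechanisms are wired up with paho-mqtt (the client id, topic name, host, and the start_master_failover helper are placeholders of mine):

      import paho.mqtt.client as mqtt

      client = mqtt.Client(client_id='node-1234')  # placeholder serial

      # Will message: the broker delivers this to 'heartbeats' subscribers
      # if we vanish without sending a clean DISCONNECT packet.
      client.will_set('heartbeats', payload='node-1234 dead', qos=1)

      def on_disconnect(client, userdata, rc):
          # rc != 0 means the connection dropped unexpectedly, i.e. the
          # broker (and thus the master hosting it) likely died.
          if rc != 0:
              start_master_failover()  # hypothetical helper

      client.on_disconnect = on_disconnect
      client.connect('master-node-ip')  # placeholder broker host
      client.loop_forever()
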
    • populate db: while hardcoding is generally frowned upon, our system has a defined “steady state” for demos, and I needed an easy way to get to that state. I made a script that populates the database with the hardcoded node values as they exist in the defined steady state, so that it’s very easy to reset a database’s state.
    • Path expansion for integration: this was a small bug that I found interesting enough to include here. On the node, the top level directory contains three repositories: hardware-emulator, interaction-layer, and ecp-webapp. The helper functions I wrote for Richard’s frontend live in interaction-layer/help_lib.py, and they are called from ecp-webapp/flask-backend/app.py. More importantly, those helper functions used relative paths to access the config and database files, which broke when the webapp called them from a different working directory. I changed this behavior so that the helper lib expands relative paths to absolute paths, allowing the functions to be called from anywhere in the system.
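      The fix boils down to anchoring paths to the library file itself (the file name below is a placeholder):

      import os

      # Resolve paths relative to help_lib.py, not the caller's working
      # directory, so the functions work from anywhere in the system.
      LIB_DIR = os.path.dirname(os.path.abspath(__file__))

      def expand(path):
          return path if os.path.isabs(path) else os.path.join(LIB_DIR, path)

      DB_PATH = expand('node.db')  # placeholder filename
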
    • interactions: interaction definitions were ironed out to match both the format the hardware emulator expects values in and the format the frontend produces. Since my layer is in the middle, I had to be careful about parsing, storing, and acting upon interactions.
  • Problems to be fixed this upcoming week:
    • broker failover: as discussed above in the master failover section, after a master failover, nodes fail to reconnect to the new broker. This needs to be fixed.
    • conflicting interactions: the frontend allows you to define conflicting interactions, which would arbitrarily thrash the system. For example, I could define the following two interactions:

      if motion sensor > 5, turn light on
      if motion sensor > 5, turn light off

      Now, if the motion sensor is triggered, the light will begin rapidly flickering, which is annoying and probably unintended. I think it would be cool if the frontend could identify such conflicting interactions, but doing so may end up being far more complicated than I suspect; in particular, conflicts that arise through loops of interactions may be too hard to identify.
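
      A first pass at detection might just flag pairs with identical conditions but different actions on the same device. The field names below are assumptions of mine, and this deliberately ignores overlapping (not identical) conditions and multi-interaction loops:

      def find_conflicts(interactions):
          seen = {}
          conflicts = []
          for i in interactions:
              key = (i['trigger_device'], i['condition'], i['target_device'])
              if key in seen and seen[key]['action'] != i['action']:
                  conflicts.append((seen[key], i))
              seen[key] = i
          return conflicts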

    • setting sensors on frontend: the frontend currently allows you to set a value for any device, such as setting “light = 1” (turn on the light). However, if you try to set a value for a sensor node, the backend throws an exception, crashing the interaction layer. This behavior needs to be prohibited.
    • measure latency: in our design docs, we defined latency requirements. While the logging facilities are in place to measure this latency, we need to actually do the measurements for the final report.

Niko’s Status Report for 4-12

This past week I worked more on integration between the webapp and interaction layer. Richard and I realized we had different ideas about how the database would store the information we’d agreed upon. After a long discussion, we settled on a modified subset of the database schema, and I worked on modifying the schema to fit our discussion.

I also worked on some helper functions to facilitate interaction with both the database and the mqtt broker. Besides making my own code cleaner, it begins to provide an interface for Richard’s webapp’s flask backend to obtain node data and present it to the frontend for the user.

For this upcoming week, I want to finish the API that I worked on this past week and finish integrating it with Richard’s layer. At the end of the week, the frontend should be presenting no hardcoded data; it should be pulling information directly from the nodes.

I also would like to reopen the API discussion with Rip, and outline exactly how his layer and mine will interact.

Niko’s Status Report for 4-5

This week I did a lot of work on the interaction layer in preparation for the demo on Monday. Here are the areas I worked on:

  • Setup / install scripts:
    • While user-friendly device commissioning is not in the scope of our project, the fact remains that we still have to “commission” devices during development. This involves creating an AWS VM, cloning the appropriate repos, installing dependencies, setting up Python virtual environments, initializing device configs, initializing the database, etc. Since this is not something anybody wants to do more than once or twice, I created setup scripts for both the frontend webapp and the interaction layer. I also made a top level script that will clone all the repos and run all their setup scripts. That way, after starting a fresh VM, all you need to do is scp the top level script, run it, and wait for it to finish.
  • Integration:
    • I spent a lot of time this week working to integrate the interaction and frontend webapp layers. Currently, the interaction layer is able to start and run the frontend, and I have written the setup scripts for both layers. For next week, I still need to tie the webapp’s backend into the interaction layer so that it no longer has hardcoded dummy data.
  • Master process:
    • I initially wrote the master process in Python, since that is what the rest of the interaction layer is written in. However, I quickly realized that all I was doing was running shell commands from Python, such as checking if a process was up and starting it if not. It doesn’t really make sense to run only bash commands from within Python, and Python wasn’t making my life easier, so I decided it would be better to implement the master as a bash script. This greatly simplified its logic and made it a more elegant program. The master is in charge of starting and keeping alive the frontend webapp, node process, and MQTT broker. Once the interaction layer is integrated with the hardware layer, it will also be in charge of starting and keeping alive the hardware simulation webapp.
  • Node interactions:
    • I got the nodes to be able to subscribe and publish to each other, and react to data from other nodes. While the actual definition of an “interaction” needs to be ironed out a bit between me and Richard (front end webapp), the infrastructure is now in place.

Niko’s Status Report for 3-29

This past week I thought a lot about node setup and the relationships between all of the moving parts in the interaction layer.

I will begin by discussing node setup. While we don’t plan to incorporate new device commissioning into our project, we do need some way of bootstrapping a node, at least for our own development. With that in mind, I wrote some bootstrapping scripts to obtain the appropriate code, set up config files, install necessary libraries, and set up the database.

I also spent some time designing the database and its schema, which can be found in more detail here (note that, while unlikely, this link could change in the future; if so, see the readme here). Most importantly, I defined what the tables will look like and what datatypes will exist in each table.

With regards to the master process, I wrote a preliminary Python executable that checks in a loop whether the broker / webapp are running, and if not, starts them. While I think the master may end up having to do a few more things, I think that for the most part this will be its sole purpose.

As for the node process, I spent some time debating the merits of implementing it in C / C++ vs. in Python. This was a difficult decision because the node process is where the bulk of the actual interaction logic will live. The main problem with Python is that it is not particularly good for parallel programming. While constructs for concurrent execution exist (i.e. threads), each thread must acquire Python’s global interpreter lock, which serializes execution. Processes could be used instead, but they are a much more heavyweight alternative than this problem calls for.

Since most of the node’s communications are going to be done through the MQTT broker over the network (an inherently async operation), bottlenecking the system by serializing execution seems at first glance to be a mistake, and points towards C as the better solution. That being said, I believe that we can get around this problem and still use Python. As long as we keep each thread very lightweight and limit the amount of time each thread can run (i.e. limit blocking operations), it should be no problem if execution is effectively serialized. I think this allows us to take advantage of Python’s very powerful facilities and eliminate the complexity of C.

One thing that was brought up in our SOW review was that we should redefine our latency requirements. As we discussed, it won’t really mean anything to define latency requirements between our nodes, as they are no longer on the same network and as such have potentially unpredictable delays that are out of our control. However, we do have some level of control over the on-device latency. While it’s true that virtualized hardware such as AWS EC2 instances doesn’t guarantee continuous execution (the hypervisor can interleave execution of different machines), we believe this will be less noticeable than network latency.

After thinking about it a bit, I decided the simplest way to measure this on-device latency is as follows: when a relevant piece of data is received from the broker (i.e. something that should trigger an interaction), the node will write a timestamp to a log file. When that piece of data is finally acted upon, the node will write a second timestamp to the log file. By looking at the difference between the timestamps, we can measure the on-device latency.
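
In code, the bookkeeping is just two log lines. Here is a sketch; the logger name and the way the timestamp is carried alongside the data are my assumptions:

      import logging, time

      log = logging.getLogger('latency')

      def on_trigger_received(data):
          # First timestamp: the moment the triggering data arrives.
          data['t_received'] = time.time()
          log.info('trigger received at %f', data['t_received'])

      def on_action_performed(data):
          # Second timestamp: on-device latency is the difference.
          log.info('latency: %f s', time.time() - data['t_received'])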

Another thing that I worked on was defining more formally how all the moving parts in the interaction layer interact. See the below diagram for more information:

Moving forward, I have a few goals for the upcoming week.

  • Latency: I would like to define an on-device latency requirement that is reasonable and in line with the initial (and modified) goals of this project.
  • APIs:
    • Work with Rip to define how my layer will interact with his hardware. Currently we are planning on having Rip implement a “hardware” library with a simple API that I can call to interact with the “hardware”. This would include functions such as “read_sensor_data” and “turn_on”, etc. I would like to iron out this interface with him by next weekend.
    • Work with Richard to interface with the webapp. As I currently understand it, the webapp will need to publish config changes, read existing config data, and request sensor data from other nodes. While I plan for all this functionality to be facilitated by the broker, I would like to implement a simple library of helper functions to mask this fact from the webapp. Ideally, Richard will be able to call functions such as get_data(node_serial, time_start, time_end) or publish_config_change(config). I would also like to iron out this API with him by this weekend, even if the library itself isn’t finished.
  • Node process: I would like a simple version of the node process done by this weekend. I think this subset of the total functionality is sufficient for the midsemester demo. It should function with hardcoded values in the database and mock versions of the hardware library / webapp to work against. This process should be able to:
    • Read config / interaction data from the database
    • Publish data to the broker
    • Receive data from the broker and act upon it (do interactions)

Niko’s Status Report for 3-21

This past week, I tried to adjust to the shift to online classes. I got back to Pittsburgh on Sunday the 15th, but spent a majority of the week packing up all my belongings. I plan to drive back home tomorrow (Sunday the 22nd) so that I can be with my family during the pandemic.

That being said, I worked with my team to create a modified statement of work (see our content page). This document serves to show how we have changed our scope and requirements to better fit the necessary shift to an almost entirely software project. Based on our discussions, it seems that both my interaction layer and Richard’s webapp will remain largely unchanged, while Rip’s hardware will need some level of change. Prior to this change, my interaction layer was to interact with the hardware through an API that allows me to read sensor data and write commands to control the devices. Since we no longer have hardware, Rip will recreate a similar API so that my code can continue to function as before, but he will implement a software library behind the API to emulate the hardware.

Niko’s Status Report for 3-14

From 3-4 through now, I have been traveling, and my laptop has been in the Apple Store. Unfortunately, all Apple Stores nationwide have been shut down, and my laptop is currently still in the shop. I am looking into how I can get a replacement device.

Update: As of Sunday 3-15, the specific store that has my laptop will reopen for 24 hours for people to get their repaired devices. I now have my laptop and can continue working.

Niko’s Status Report for 3-7

From 2-29 to 3-2, I continued developing the prototype as discussed in my previous post, and did research based on our design review feedback. I primarily looked into the nano board as an alternative to the Raspberry Pi. While it seems like it could be a viable alternative, we already have Raspberry Pis on hand, and with the unpredictable shipping out of China, I believe the best option is to just move forward with the Pis.

On 3-2 I had to send my laptop to the Apple Store for repairs, and so did not do any coding from 3-2 to 3-7.

Niko’s Status Report for 2-29

This past week, I decided to use SQLite as the database for storing the local data on each device. We are aiming to make our code footprint as light as possible while still keeping the system fast, and SQLite helps accomplish both of those goals.

I also set up two Raspberry Pi 4s that I had lying around at home. I had never set one up before, so it took a lot longer than I had initially expected. After getting all the parts I needed, I loaded up the Raspbian OS. I then connected the Pis to the local network, set up SQLite on each of them, and spent some time figuring out how to read from and write to the database. After that, I figured out how to designate one of the Pis as an MQTT broker, and then set up processes on both Pis and on my laptop to act as clients. I was able to get all of the processes to subscribe to topics and receive published messages.
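
For the curious, the client side of that prototype is only a few lines with paho-mqtt (the broker IP and topic names here are placeholders):

      import paho.mqtt.client as mqtt

      def on_message(client, userdata, msg):
          print(msg.topic, msg.payload)

      client = mqtt.Client()
      client.on_message = on_message
      client.connect('192.168.1.10')  # placeholder: the Pi running the broker
      client.subscribe('sensors/#')
      client.publish('sensors/temp', '22.5')
      client.loop_forever()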

In terms of our schedule, I am a bit behind where I had intended to be. Most of this week was spent on design, and so I’m not as far into prototyping as I would like. I also had a couple projects and an exam this week for my other classes, which took time away from capstone.

In order to get caught up, I plan to spend most of Tuesday and Wednesday working on prototyping, since all my other work is due by Tuesday morning, and I leave for spring break Wednesday night.

The main thing I want to accomplish in the next few days is to finish thinking through our design. Our design review brought up a couple potential issues with our proposed solution that as a group we need to address. We still have time to do so over the weekend before our final design is due Monday night. I also would like to work on prototyping the system, and to have some barebones system functional.

Here are the problems in our design that we have found over the past week / were brought up in our design review:

  • Look into nano board – often used in IoT applications, potentially powerful enough for our use case, and far cheaper than a raspberry pi
  • Think more about security of the data; maybe encrypt data on the wire and in the db
  • We would like to achieve maximum 1 second delay between device interactions. However, our master failover rate is 3 seconds – so in the case of a master node going down, that 1 second requirement is violated.
  • Consider other master selection techniques than “lowest serial number”
  • Maybe host broker on different device than webapp to take load off master
  • Need more granular failure testing than just “chaos testing”. Things can go wrong in interactions between devices too, not just on device uptime.
  • Hardest part about distributed systems is resiliency – should really think more about it
  • After a device fails and a new master is selected, we need to think more about how devices will be forwarded to the new webapp host, since it will be at a different IP address.
  • How will devices outside of the network access the node hosting the webapp, since it’s in a private network? Look into static IP addresses
  • Low energy pub / sub: ESP8266 + ESP32

Niko’s Status Report for 2-22

As a note to the reader, this past week we decided as a team to change our proposed devices from a combined sensing / smart device into two separate devices. One device will do only sensing, and the other will do only “actions”, such as turning a device on or off.

Another thing to note is that we decided to host the webapp locally. This means that one node in the network will be a “master” and run the webapp. If that device goes down, the network should elect a new master, and that device should then spin up the webapp. As such, all devices should have the capability to run the webapp.

This past week I did research into how the interaction layer will be designed. In particular, I put a lot of thought into what data our system will contain, and how we might go about storing it in the system. I came to the following conclusions.

The data the system will need:

  • Per-node sensor data (for sensing nodes)
  • Identifying device information (such as a device id)
  • Identifying smart home network information (for commissioning new devices and for reconnecting when a device goes down)
  • Device list, and how to access each device (e.g. device IP addresses)
  • Code for the webapp
  • User’s settings
  • Defined interactions, and when each interaction was triggered
    • This list of past interactions can be viewed by the user
  • Last known status of a device. If a sensing device, last piece of data. If an actuation device, then whether device is on or off.

In thinking about this data, I came to the conclusion that we can minimize the complexity of our system if we minimize the data that needs to be shared. Here is the breakdown I arrived at:

On each node, modifiable data:

  • Sensor data

On each node, hardcoded:

  • Device ID
  • Code for webapp

Shared across all nodes:

  • Network id
  • Registered devices list
  • User settings
  • Defined interactions / past transactions
  • Last known device status

The biggest source of complexity here is the shared data. However, it can be further simplified, because the majority of the “shared” data is information that is only needed if a device becomes the host of the webapp. The only data that needs to be written by all nodes is the last known device status and past interactions.

My progress is on schedule.

For next week, I plan to research a few different database solutions using the above information as criteria for their effectiveness. For each, I will use our listed use cases and our metrics from the project proposal to evaluate their effectiveness. I hope to select a database technology, and then prototype how it fits together with the rest of the system.