Team Status Report for 4/26

This past week we finished up testing.

This week we will be working on the poster, report, and demo.

We performed unit tests on the four models we had selected: llama3-8b, falcon-7b, qwen2.5-7b, and vicuna-7b. The tests and results can be found here: https://github.com/jankrom/Voice-Vault/tree/main/server/model-testing. This involved setting up the tests for each model, writing a Python script to run them, and saving each model's results as a PNG. I found that llama3 had the best accuracy at 100%, while qwen and vicuna both scored around 90%. Falcon scored 0% accuracy, which was very surprising. I looked into it further, and one possible explanation is that the model is optimized for code: many of its responses were written in JavaScript and similar languages, even though the model is described as optimized for conversation. These results led me to remove falcon from our options, so we will now offer only the other three models to pick from. As a result, I had to modify our website to list only those three.
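As a rough illustration of how per-model accuracy like this can be computed, here is a minimal sketch. It is not the script in the repository: the test cases, the substring-based pass check, and the names `TEST_CASES`, `accuracy`, and `evaluate_models` are all assumptions made for the example, and each model is represented by a plain function from prompt to response.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical test cases: (prompt, substring expected in a correct answer).
TEST_CASES: List[Tuple[str, str]] = [
    ("What is the capital of France?", "paris"),
    ("What is 2 + 2?", "4"),
]


def accuracy(generate: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Return the fraction of cases whose response contains the expected text."""
    if not cases:
        return 0.0
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return passed / len(cases)


def evaluate_models(
    models: Dict[str, Callable[[str], str]],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Score every model on the same cases so the results are comparable."""
    return {name: accuracy(generate, cases) for name, generate in models.items()}
```

A model that answers every prompt with code instead of a conversational reply would fail every check in a setup like this, which is one way a score of 0% can come about.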

We also performed unit tests on several candidate system prompts to find the best one. The best prompt we found gave us 100% accuracy in classifying whether a request is an alarm request, a music request, or an LLM request.
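The comparison between system prompts can be sketched roughly as follows. This is an assumed setup rather than our actual test code: `classify` stands in for whatever call sends a system prompt and a request to the model and returns a label, and the labeled requests are made up for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled requests covering the three request types.
LABELED_REQUESTS: List[Tuple[str, str]] = [
    ("Set an alarm for 7 am", "alarm"),
    ("Play some jazz", "music"),
    ("Why is the sky blue?", "llm"),
]

# classify(system_prompt, request) -> predicted label (assumed interface).
Classifier = Callable[[str, str], str]


def routing_accuracy(
    classify: Classifier, system_prompt: str, cases: List[Tuple[str, str]]
) -> float:
    """Return the fraction of requests routed to the correct label."""
    if not cases:
        return 0.0
    correct = sum(
        1 for request, label in cases
        if classify(system_prompt, request) == label
    )
    return correct / len(cases)


def best_prompt(
    classify: Classifier, prompts: List[str], cases: List[Tuple[str, str]]
) -> Tuple[str, float]:
    """Return the highest-scoring prompt and its accuracy (first one wins ties)."""
    scores = {p: routing_accuracy(classify, p, cases) for p in prompts}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

Scoring every candidate prompt against the same labeled requests keeps the comparison fair, since the only thing that changes between runs is the prompt itself.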

We also ran many end-to-end (e2e) tests by interacting with the full system directly, and we did not find any errors while doing so.
