This week, my goal was to implement the model inference system on the FPGA.
I ended up running into a large number of issues and was forced to switch models to work around them. The switch is temporary: the dynamic memory usage exceeded the board's resources, which should not happen in theory, so it is almost certainly caused by a bug in the inference system.
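As a rough sanity check on the memory budget, q2_k stores weights at roughly 2.5-2.6 bits each, so a 160M-parameter model's weights should only occupy on the order of 60 MB. The sketch below is a back-of-the-envelope estimate; the parameter count, bits-per-weight figure, and overhead factor are all assumptions for illustration, not measured values.

```python
# Back-of-the-envelope estimate of the quantized weight footprint.
# All constants here are assumptions for illustration, not measurements.
n_params = 160e6        # approximate llama-160m parameter count (assumed)
bits_per_weight = 2.6   # q2_k averages roughly 2.5-2.6 bits/weight (assumed)
overhead = 1.10         # ~10% extra for scales and metadata (assumed)

weight_mb = n_params * bits_per_weight / 8 * overhead / 1e6
print(f"Estimated weight footprint: {weight_mb:.0f} MB")  # ~57 MB
```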
We switched to the llama-160m.q2_k.gguf file from afrideva/llama-160m-GGUF, which uses the q2_k quantization scheme.
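For reference, here is a minimal host-side sketch for smoke-testing the GGUF file before it goes anywhere near the FPGA pipeline, assuming the llama-cpp-python bindings are installed. This is not our FPGA inference path, just a quick way to verify that the model file itself produces sensible output; the prompt and context size are arbitrary.

```python
from llama_cpp import Llama

# Load the q2_k-quantized GGUF file (path is illustrative).
llm = Llama(
    model_path="llama-160m.q2_k.gguf",
    n_ctx=512,        # a small context is enough for a smoke test
    verbose=False,
)

# Generate a short completion to eyeball output quality.
out = llm("The FPGA board boots and", max_tokens=32)
print(out["choices"][0]["text"])
```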
We are currently well ahead of schedule, and this model produces decent output quality (a 37% hallucination rate), so switching back to the original model is a much lower priority for now.
My goal for next week is to increase performance up to full reading speed. The system currently generates 8-10 tokens/sec, which is slightly slower than reading speed; I notice a slight lag when following the output.
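To quantify that lag instead of eyeballing it, a simple timing harness like the sketch below could report tokens/sec directly. The generate_token argument is a hypothetical stand-in for whatever function advances our inference loop by one token.

```python
import time

def measure_tokens_per_sec(generate_token, n_tokens=64):
    """Time n_tokens calls to a per-token generation step.

    generate_token is a hypothetical stand-in for the function that
    advances the inference pipeline by one token.
    """
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Usage (hypothetical): print(measure_tokens_per_sec(pipeline.step))
```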