I started to work on the text-based emotional analysis. For the first iteration, I’m using Term Frequency-Inverse Document Frequency (TF-IDF). This measure captures the relative importance of a term in the data: a term scores highly if it appears frequently in a document but rarely across the rest of the corpus. It’s how we turn words into numbers. I used sklearn’s TfidfVectorizer to do this. I tried 3 models: a Naive Bayes classifier, a linear SVM, and logistic regression. All 3 produced accuracies around 50%, which is well below our goal. For the datasets, there were two main options: scraping online forums and social media sites, or using a precompiled dataset. I stuck with the latter. The first step was to preprocess the data, removing punctuation, normalizing case, and stripping other insignificant formatting. Some research papers also suggested removing the rarest words from the datasets, as they are essentially noise in the data.
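
Here is a minimal sketch of that pipeline. The toy corpus, labels, and the "seen only once" threshold for rare words are placeholders I've assumed for illustration; the real run would load the precompiled dataset instead.

```python
import re
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def preprocess(text, rare_words=frozenset()):
    """Lowercase, strip punctuation, and drop any words flagged as rare."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in rare_words)


# Placeholder corpus; in the real run this comes from the precompiled dataset.
texts = [
    "I am so happy today!",
    "Today I feel happy and excited.",
    "This makes me so angry!",
    "I am angry and upset today.",
    "I feel sad and alone.",
    "So sad today... I feel alone.",
]
labels = ["joy", "joy", "anger", "anger", "sadness", "sadness"]

# Treat the rarest words (here: seen only once in the corpus) as noise.
counts = Counter(w for t in texts for w in preprocess(t).split())
rare = {w for w, c in counts.items() if c == 1}
cleaned = [preprocess(t, rare) for t in texts]

X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.5, stratify=labels, random_state=0)

# TF-IDF turns each document into a weighted bag-of-words vector.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Linear SVM", LinearSVC()),
                    ("Logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_train_vec, y_train)
    print(name, "accuracy:", model.score(X_test_vec, y_test))
```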

For the next step, I will explore using count vectors and Stanford NLP’s Global Vectors for Word Representation (GloVe) library. These methods have been shown to achieve 60-70% accuracy. I started off with simpler models because I wanted to experiment with how datasets affect accuracy. I will most probably combine different datasets going forward, as this produces the best results.
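
A rough sketch of both follow-up approaches is below, reusing the cleaned texts and labels from the earlier snippet. The GloVe file name and 100-dimension size are assumptions about which pretrained vectors I'll download from the Stanford NLP site; averaging word vectors into a document vector is just one simple way to use GloVe here, not necessarily the final design.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Option 1: raw count vectors instead of TF-IDF weights.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(cleaned)


# Option 2: average pretrained GloVe word vectors into one document vector.
def load_glove(path):
    """Parse a GloVe text file into a word -> vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def document_vector(text, glove, dim=100):
    """Mean of the GloVe vectors for the words we have embeddings for."""
    word_vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(word_vecs, axis=0) if word_vecs else np.zeros(dim)


# Assumed local copy of the 100-dimensional pretrained vectors.
glove = load_glove("glove.6B.100d.txt")
X_glove = np.vstack([document_vector(t, glove) for t in cleaned])

# Either representation can be fed to the same classifiers as before, e.g.:
clf = LogisticRegression(max_iter=1000).fit(X_glove, labels)
```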

As a side note, I’ve put in orders for the Raspberry Pi and Camera.

