I am trying to create a sentiment analysis model and I have a question.

After preprocessing my tweets and building my vocabulary, I noticed that I have words that appear fewer than 5 times in my dataset (many of them appear only once). Many of them are real words, not gibberish. My thinking is that if I keep those words, they will get wrong "sentiment" weights and make my model worse.
Is my thinking right or am I missing something?

My vocabulary has around 40,000 words, and about 10,000 of them are "rare". Should I "sacrifice" them?
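
For concreteness, here is a minimal sketch of what dropping the rare words could look like. It assumes the tweets are already tokenized into lists of strings. The names `build_vocab`, `replace_rare`, `MIN_COUNT`, and the `<unk>` placeholder token are illustrative assumptions rather than part of any particular library, and the toy tweets are made up. Instead of deleting rare words outright, this variant maps them all to one shared token so their positions in the tweet are kept.

```python
from collections import Counter

MIN_COUNT = 5  # assumed threshold, matching the "fewer than 5 times" cutoff
UNK = "<unk>"  # hypothetical placeholder token for pruned words


def build_vocab(tokenized_tweets, min_count=MIN_COUNT):
    """Return the set of words seen at least `min_count` times."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {word for word, n in counts.items() if n >= min_count}


def replace_rare(tokenized_tweets, vocab, unk=UNK):
    """Map every token that is not in `vocab` to the shared `unk` token."""
    return [
        [tok if tok in vocab else unk for tok in tweet]
        for tweet in tokenized_tweets
    ]


# Toy data, invented for illustration only.
tweets = [
    ["great", "movie"],
    ["great", "day"],
    ["awful", "movie"],
]

# A lower threshold is used here only because the toy corpus is tiny.
vocab = build_vocab(tweets, min_count=2)
pruned = replace_rare(tweets, vocab)
```

Whether a threshold of 5 helps or hurts would still need to be checked on a held-out validation set, for example by comparing accuracy with and without the pruning.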

Thanks in advance.
