highbrow | Highbrow Annotation Browser | Data Labeling library
kandi X-RAY | highbrow Summary
Highbrow Annotation Browser
Community Discussions
Trending Discussions on highbrow
QUESTION
I'm trying to build a Naive Bayes classifier for 1000 positive and 1000 negative labeled IMDB reviews (txt_sentoken), using the Weka API for Java.
I wasn't aware of StringToWordVector, which essentially provides a bag-of-words model and reaches 80% accuracy, so I did the vocabulary building and vector creation myself and reached an accuracy of only 75% :(
Now I'm wondering why my solution performs so much worse.
1) From my 2000 reviews, I build the BagOfWords:
...
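(The asker's code is elided above. Purely as an illustration of the manual approach being described, a vocabulary build plus per-document count vectors in plain Java might look like the sketch below; the class name and toy data are hypothetical, not from the original post.)

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagOfWordsSketch {
    public static void main(String[] args) {
        // Toy stand-ins for the 2000 reviews (hypothetical data).
        List<String> reviews = List.of(
                "a gripping and well acted film",
                "dull plot and flat acting");

        // 1) Build the vocabulary: each distinct token gets a column index.
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String review : reviews) {
            for (String token : review.toLowerCase().split("\\s+")) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }

        // 2) Map each review onto a term-count vector over that vocabulary.
        for (String review : reviews) {
            int[] counts = new int[vocab.size()];
            for (String token : review.toLowerCase().split("\\s+")) {
                Integer idx = vocab.get(token);
                if (idx != null) {
                    counts[idx]++;
                }
            }
            System.out.println(Arrays.toString(counts));
        }
    }
}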
ANSWER

Answered 2017-Dec-28 at 07:18

Reading through Weka's StringToWordVector documentation, there seem to be a couple of implementation details that differ from yours. Here are the top two, based on how likely they are, in my opinion, to explain the performance difference you see:
- It seems that by default, the resulting vector is boolean (i.e. it records the existence of a word rather than the number of occurrences).
- If the class attribute is set before vectorizing the text, a separate dictionary is built for each class and then all dictionaries are merged.
While either of these (or other, more minor differences) could be the culprit, my bet is on the second point.
The built-in class allows setting and unsetting each of these options; you could try re-running the 80% version using StringToWordVector with the -C option, to use the number of occurrences rather than a boolean value, and with -O, to use a single dictionary across both classes.
That should let you verify whether either of these is indeed the culprit.
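As a concrete sketch of that suggestion, the same two options can also be set programmatically on StringToWordVector; the ARFF file name and the evaluation setup below are assumptions for illustration, not details taken from the question:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CountsSingleDictionary {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: one string attribute plus the class attribute.
        Instances raw = new DataSource("reviews.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);            // -C: counts instead of boolean presence
        filter.setDoNotOperateOnPerClassBasis(true); // -O: don't process per class
        filter.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, filter);

        // 10-fold cross-validated Naive Bayes on the vectorized data.
        Evaluation eval = new Evaluation(vectorized);
        eval.crossValidateModel(new NaiveBayes(), vectorized, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}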
EDIT: Regarding the first point, i.e. counting occurrences vs. noting word existence (the multinomial and Bernoulli models, respectively), several academic papers in the '90s looked into the differences, e.g. here and here. While the multinomial model usually works better, there are also opposite cases, depending on the corpus and the classification problem.
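For concreteness (this is the standard textbook formulation, not taken from the answer itself): writing V for the vocabulary, n_{w,d} for the count of word w in document d, and b_{w,d} in {0,1} for its presence, the two models score a document d under class c as

P_{\text{multinomial}}(d \mid c) \propto \prod_{w \in V} P(w \mid c)^{n_{w,d}}

P_{\text{Bernoulli}}(d \mid c) = \prod_{w \in V} P(w \mid c)^{b_{w,d}} \bigl(1 - P(w \mid c)\bigr)^{1 - b_{w,d}}

The multinomial model rewards repeated occurrences of class-indicative words, while the Bernoulli model also penalizes the absence of words, which is one reason the better choice depends on the corpus.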
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported