ideas

Trying fastText classifier models with different corpora

After running some experiments on Stack Overflow data, I wondered how these models behave on a different corpus. Is it a good idea to predict tags from the body with a model that was trained on titles? I used the models from this post and followed a simple methodology for this experiment:

1- Take an already trained model.
2- Apply the model to the test data of the other content types. For example, take the title model file and apply it to body, title + body, and title + body + comments.
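The two steps above can be sketched with the fastText Python bindings. This is only a rough outline under assumptions: the file names in `MODEL_FILES` and `TEST_FILES` are hypothetical (the post does not list the real paths), and the test files are assumed to be in fastText's usual `__label__` format.

```python
from itertools import product

# Hypothetical file names: the post does not list the real paths.
MODEL_FILES = {
    "I": "model_title.bin",
    "II": "model_body.bin",
    "III": "model_title_body.bin",
    "IV": "model_title_body_comments.bin",
}
TEST_FILES = {
    "I": "test_title.txt",
    "II": "test_body.txt",
    "III": "test_title_body.txt",
    "IV": "test_title_body_comments.txt",
}


def fasttext_scores(model_path, test_path):
    """Return (P@1, R@1) of a saved fastText model on a labeled test file."""
    import fasttext  # imported here so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)
    _, precision, recall = model.test(test_path, k=1)
    return precision, recall


def cross_evaluate(models, tests, evaluate=fasttext_scores, skip_own=False):
    """Score every model on every test set; keys are (test_id, model_id)."""
    scores = {}
    for (model_id, model_path), (test_id, test_path) in product(
        models.items(), tests.items()
    ):
        if skip_own and model_id == test_id:
            continue  # optionally leave out a model's own test set
        scores[(test_id, model_id)] = evaluate(model_path, test_path)
    return scores
```

Calling `cross_evaluate(MODEL_FILES, TEST_FILES)` would return one (P@1, R@1) pair per test set and model. The sketch reloads a model for each pair, which is simple but not the fastest option.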

Results

  • I: Title only
  • II: Body only
  • III: Title + Body
  • IV: Title + Body + Comments

Tests

Precision:

P@1             Model I   Model II   Model III   Model IV
Test Data I     0.491     0.405      0.425       0.415
Test Data II    0.478     0.399      0.484       0.487
Test Data III   0.507     0.436      0.518       0.522
Test Data IV    0.506     0.421      0.512       0.515

Recall:

R@1             Model I   Model II   Model III   Model IV
Test Data I     0.212     0.175      0.184       0.179
Test Data II    0.207     0.173      0.209       0.21
Test Data III   0.219     0.189      0.224       0.226
Test Data IV    0.218     0.182      0.222       0.222
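A quick sanity check on the two tables, under the assumption that the scores come from fastText's built-in test routine: there, P@1 is correct predictions divided by the number of predictions and R@1 is correct predictions divided by the number of true labels. With one prediction per question, P@1 / R@1 should then be close to the average number of tags per question. The numbers below are copied from the tables above.

```python
# Rows: Test Data I-IV, columns: Model I-IV (values copied from the tables).
PRECISION = [
    [0.491, 0.405, 0.425, 0.415],
    [0.478, 0.399, 0.484, 0.487],
    [0.507, 0.436, 0.518, 0.522],
    [0.506, 0.421, 0.512, 0.515],
]
RECALL = [
    [0.212, 0.175, 0.184, 0.179],
    [0.207, 0.173, 0.209, 0.21],
    [0.219, 0.189, 0.224, 0.226],
    [0.218, 0.182, 0.222, 0.222],
]

# P@1 / R@1 for every cell; under the assumption above this estimates
# the average number of tags per question.
ratios = [
    [p / r for p, r in zip(p_row, r_row)]
    for p_row, r_row in zip(PRECISION, RECALL)
]

for test_id, row in zip(["I", "II", "III", "IV"], ratios):
    print("Test Data", test_id, [round(x, 2) for x in row])
```

Every ratio comes out near 2.3, which suggests the two tables are consistent with each other and that questions carry a bit more than two tags on average.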

Conclusion

1- The title-only model (I) is trained on less content than the others, so it doesn't work well on test data that has more content: it gives 7% lower precision and 3% lower recall. Model I is trained on 407,596 words, which is roughly 10 times fewer than even the smallest of the other models (3,963,237 words for Model II).
2- Likewise, Models II, III and IV give worse precision and recall on Test Data I, for the same reason.
3- Points 1 and 2 show that the amount of training content matters a lot for classifier accuracy. It's not a good idea to train a model on larger content and predict on data with less content, or vice versa.
4- Model II is trained on 3,963,237 words, Model III on 4,105,156 words and Model IV on 5,431,988 words. Model II gives better results on Test Data III and IV, which is kind of surprising to me.
5- Model III gives better results on Test Data II and IV.
6- Model IV gives better results on Test Data II and III.
7- Models III and IV give better results on Test Data II, and I think that's because the body dominates the data and comments also carry good information about tags.
8- I wonder about the effect of ws (context window size) on this data, for example rebuilding Model IV with ws = 1 and testing it on Test Data I.
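The experiment from point 8 could look roughly like the sketch below, using `fasttext.train_supervised` with its `ws` parameter. The file names are hypothetical and all other hyperparameters are left at their defaults, which may differ from the original training setup. Since `ws` is documented as the size of the context window, it may have little or no effect on supervised training, so the result would need to be checked empirically.

```python
def supervised_params(train_file, ws=1):
    """Keyword arguments for fasttext.train_supervised with a custom ws."""
    return {"input": train_file, "ws": ws}


def retrain_and_score(train_file, test_file, ws=1):
    """Train a supervised model with the given ws and return (P@1, R@1)."""
    import fasttext  # imported here so the helper above runs without it

    model = fasttext.train_supervised(**supervised_params(train_file, ws))
    _, precision, recall = model.test(test_file, k=1)
    return precision, recall


# Hypothetical usage: retrain Model IV with ws = 1, score it on Test Data I.
# precision, recall = retrain_and_score(
#     "train_title_body_comments.txt", "test_title.txt", ws=1
# )
```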


Soner ALTIN