I think I'm a little obsessed with open data and open data processing. Every website driven by user content (like Wikipedia, Twitter, Tumblr, or Facebook) should provide an API or a data dump so the community can use that information. It's a fair trade, right? People provide the information, and you give it back to the community.
I just realized that Stack Overflow also provides data dumps; you can reach them here. It's just awesome: you can use this data for a lot of different training tasks.
I used this dump to train a model for tag classification with fastText, and I also tried fastText's compression feature and compared the performance. fastText is my favorite framework for this kind of task; in short, it is a library for efficient learning of word representations and sentence classification.
OK, here is the scenario for the training:
- Download the dump; I used math.stackexchange.com only.
- Parse the file.
- Create training and test data for different configurations. I created four different training and test sets: titles only, body only, title + body, and title + body + comments (a parsing sketch follows below).
Please refer here for the data structure.
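Here is a minimal sketch of the parsing step, assuming the standard Posts.xml layout of Stack Exchange dumps; the file names, the crude HTML stripping, and the 90/10 train/test split are my own choices for illustration, not taken from the original run.

```python
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<(.+?)>")    # the Tags attribute looks like "<calculus><limits>"
HTML_RE = re.compile(r"<[^>]+>")   # crude stripper for HTML markup in the Body field

def clean(text):
    # fastText expects one example per line: strip markup, squeeze whitespace, lowercase.
    return re.sub(r"\s+", " ", HTML_RE.sub(" ", text or "")).strip().lower()

with open("train.txt", "w") as train, open("test.txt", "w") as test:
    # iterparse streams the XML, so even large dumps fit in memory.
    for i, (_, row) in enumerate(ET.iterparse("Posts.xml")):
        if row.tag != "row" or row.get("PostTypeId") != "1":
            row.clear()
            continue  # only questions (PostTypeId 1) carry tags
        labels = " ".join("__label__" + t for t in TAG_RE.findall(row.get("Tags", "")))
        text = clean(row.get("Title", "") + " " + row.get("Body", ""))  # configuration III
        out = test if i % 10 == 0 else train  # hold out roughly 10% for testing
        out.write(labels + " " + text + "\n")
        row.clear()
```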
Here are the results for these different configurations:
- I: Title only
- II: Body only
- III: Title + Body
- IV: Title + Body + Comments
All configurations were trained with the following fastText parameters:
- dim 10
- lr 0.1
- wordNgrams 2
- minCount 1
- bucket 10000000
- epoch 5
- thread 4
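These parameters map one-to-one onto the fasttext Python bindings; here is a sketch of the training call (file names are placeholders):

```python
import fasttext

# Supervised training with the parameters listed above.
model = fasttext.train_supervised(
    input="train.txt",
    dim=10,
    lr=0.1,
    wordNgrams=2,
    minCount=1,
    bucket=10000000,
    epoch=5,
    thread=4,
)
model.save_model("model.bin")
```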
|            | I       | II        | III       | IV        |
|------------|---------|-----------|-----------|-----------|
| # of words | 407,596 | 3,963,237 | 4,105,156 | 5,431,988 |

The remaining counts were reported as: # of inputs: 707,387; # of examples: 85,856; # of labels: 1,501.
Playing with parameters
I just changed epoch to 100 for data set III, and here are the results (the "comp." columns are the compressed, i.e. quantized, models):
| Type/Training | P@1 | R@1 | P@1 comp. | R@1 comp. |
|---------------|-----|-----|-----------|-----------|

| Type/Training | P@2 | R@2 | P@2 comp. | R@2 comp. |
|---------------|-----|-----|-----------|-----------|
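The P@k and R@k values are exactly what fastText's test function reports; here is a sketch of the evaluation, assuming the model and test file from the sketches above:

```python
import fasttext

model = fasttext.load_model("model.bin")
# test returns (number of examples, precision@k, recall@k); k defaults to 1.
n, p1, r1 = model.test("test.txt")
_, p2, r2 = model.test("test.txt", k=2)
print(f"P@1={p1:.3f} R@1={r1:.3f}  P@2={p2:.3f} R@2={r2:.3f}")
```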
1- It is surprising to me that III gives better precision and recall than IV, since IV has more words.
2- P@1 is better than P@2, whereas R@2 is better than R@1.
3- Setting epoch to 100 has a great impact on learning; it gives almost a 4% improvement in precision and 2% in recall.
4- The quantize function works amazingly well: the model size decreased from 406M to 1.8M in the worst case (see the sketch after this list).
5- Surprisingly, quantization increased precision and recall for the model trained with epoch 5 and decreased them for the model trained with epoch 100.
6- Precision and recall decreased for the compressed models but increased for the non-compressed models.
7- Recall performs better at k=2; in hindsight this is expected, since R@k counts how many of the true labels appear among the top k predictions, so it can only increase as k grows.
8- We had better run a grid search to tune the hyperparameters.
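For reference, the compression step mentioned in point 4 looks like this in the Python bindings; the options shown are illustrative, not necessarily the exact settings behind the 406M to 1.8M number above.

```python
import fasttext

model = fasttext.load_model("model.bin")
# Product quantization shrinks the model; retrain=True fine-tunes the
# quantized vectors on the training data (hence quantize needs the input file).
model.quantize(input="train.txt", retrain=True)
model.save_model("model.ftz")  # quantized models conventionally use the .ftz extension
```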
I'm planning to repeat this work using the stackexchange.com data.