
A fastText classifier for Stack Overflow data

I think I'm somewhat obsessed with open data and open data processing. Every website driven by user-generated content (like Wikipedia, Twitter, Tumblr, Facebook) should provide an API or a data dump so the community can use that information. It's a fair trade, right? People provide the information, and you give it back to the community.

I just realized that Stack Overflow also provides data dumps; you can reach them here. It's just awesome: you can use this data for a lot of different training tasks.

I used this dump to train a tag-classification model with fastText, and I also tried fastText's compression (quantization) feature and compared the performance. fastText is my favorite framework; in short, it is a library for efficient learning of word representations and sentence classification.

Ok, here is the scenario for the training:

  • Download the dump; I used math.stackexchange.com only
  • Parse the file (a parsing sketch follows this list)
  • Create training and test data for four different configurations: title only, body only, title + body, and title + body + comments
  • Train: ./fasttext supervised
  • Quantize: ./fasttext quantize
  • Test: ./fasttext test
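
Here is a minimal sketch of the parsing and data-creation steps, assuming the standard Stack Exchange dump layout (a Posts.xml file whose rows carry Title, Body, and Tags attributes). The file names and the 90/10 split are illustrative, not necessarily what I used:

```python
import random
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML stripper
LABEL_RE = re.compile(r"<([^>]+)>")  # tag names inside the Tags attribute

def clean(text):
    """Lowercase, strip HTML, and collapse whitespace for fastText input."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip().lower()

def examples(path="Posts.xml"):
    for _, elem in ET.iterparse(path):
        if elem.tag == "row" and elem.get("Tags"):  # only questions carry tags
            labels = ["__label__" + t for t in LABEL_RE.findall(elem.get("Tags"))]
            text = clean(elem.get("Title", "") + " " + elem.get("Body", ""))
            yield " ".join(labels) + " " + text
        elem.clear()  # keep memory bounded on large dumps

random.seed(0)
with open("so.train", "w") as train, open("so.test", "w") as test:
    for line in examples():
        (train if random.random() < 0.9 else test).write(line + "\n")
```

This particular sketch builds set III (title + body); the other sets just change which attributes go into the text.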

Data Structure

Please refer here for the data structure of the dump.
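
On the training side, fastText expects one example per line, with each label prefixed by __label__. A training line for this task looks roughly like this (made-up example):

```
__label__calculus __label__integration how do i evaluate this improper integral ...
```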

Results

Here are the results for these different configurations:

  • I: Title only
  • II: Body only
  • III: Title + Body
  • IV: Title + Body + Comments

Training characteristics

Training parameters:

  • dim 10
  • lr 0.1
  • wordNgrams 2
  • minCount 1
  • bucket 10000000
  • epoch 5
  • thread 4
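
With fastText's Python bindings, the same configuration looks like this (a sketch; so.train is the hypothetical file from the parsing example above):

```python
import fasttext

# CLI equivalent: ./fasttext supervised -input so.train -output model \
#   -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4
model = fasttext.train_supervised(
    input="so.train",
    dim=10, lr=0.1, wordNgrams=2,
    minCount=1, bucket=10000000,
    epoch=5, thread=4,
)
model.save_model("model.bin")
```
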
| Type/Training | I | II | III | IV |
|---|---|---|---|---|
| Total words | 8M | 73M | 79M | 121M |
| # of words | 407,596 | 3,963,237 | 4,105,156 | 5,431,988 |

Common to all four sets:

  • # of input: 707,387
  • # of examples: 85,856
  • # of labels: 1,501

Quantization Performance

| Model size | I | II | III | IV |
|---|---|---|---|---|
| Without compression | 406M | 632M | 641M | 730M |
| With compression | 1.8M | 1.8M | 1.8M | 1.7M |
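
The compression step is a single call on a trained model (the retrain fine-tuning pass is optional and shown only for illustration):

```python
# Python equivalent of ./fasttext quantize; shrinks model.bin into a .ftz file.
model.quantize(input="so.train", retrain=True)
model.save_model("model.ftz")
```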

Tests

Precision:

| P@1 | I | II | III | IV |
|---|---|---|---|---|
| Without compression | 0.491 | 0.478 | 0.507 | 0.506 |
| With compression | 0.494 | 0.489 | 0.523 | 0.519 |

| P@2 | I | II | III | IV |
|---|---|---|---|---|
| Without compression | 0.384 | 0.369 | 0.393 | 0.390 |
| With compression | 0.387 | 0.379 | 0.404 | 0.399 |

Recall:

| R@1 | I | II | III | IV |
|---|---|---|---|---|
| Without compression | 0.212 | 0.207 | 0.219 | 0.218 |
| With compression | 0.214 | 0.211 | 0.226 | 0.224 |

| R@2 | I | II | III | IV |
|---|---|---|---|---|
| Without compression | 0.332 | 0.320 | 0.340 | 0.337 |
| With compression | 0.335 | 0.328 | 0.350 | 0.345 |
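
These numbers come from ./fasttext test; the equivalent check with the Python bindings is (hypothetical file name):

```python
# model.test returns (number of examples, precision@k, recall@k).
n, p1, r1 = model.test("so.test", k=1)
n, p2, r2 = model.test("so.test", k=2)
print(f"P@1={p1:.3f} R@1={r1:.3f}  P@2={p2:.3f} R@2={r2:.3f}")
```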

Playing with parameters

I just changed epoch to 100 for data set III, and here are the results:

| Data set III | P@1 | R@1 | P@1 comp. | R@1 comp. |
|---|---|---|---|---|
| Epoch 5 | 0.507 | 0.219 | 0.523 | 0.226 |
| Epoch 100 | 0.541 | 0.234 | 0.470 | 0.203 |

| Data set III | P@2 | R@2 | P@2 comp. | R@2 comp. |
|---|---|---|---|---|
| Epoch 5 | 0.393 | 0.340 | 0.404 | 0.350 |
| Epoch 100 | 0.424 | 0.367 | 0.369 | 0.319 |

Conclusion

1- It is surprising to me that III gets better precision and recall than IV, since IV has more words.
2- P@1 is better than P@2, whereas R@2 is better than R@1.
3- Setting epoch to 100 has a great impact on learning; it gives almost a 4% improvement in precision and 2% in recall for the uncompressed model.
4- The quantize function works amazingly well: even in the worst case, the model size decreased from 406M to 1.8M.
5- Surprisingly, quantization increased precision and recall for the epoch-5 models and decreased them for the epoch-100 model.
6- Put differently, moving from epoch 5 to epoch 100 decreased precision and recall for the compressed model but increased them for the non-compressed one.
7- Recall has better performance at 2; in hindsight this makes sense, since predicting two labels instead of one can only retrieve more of each question's true tags, so R@k can only grow with k.
8- We had better run a grid search to tune the training parameters; a minimal sketch follows.
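
For example, a naive grid search over learning rate and epochs, scored by P@1 on the held-out set, could look like this (a sketch using the hypothetical files from above):

```python
import fasttext

best = None
for lr in (0.05, 0.1, 0.5):
    for epoch in (5, 25, 100):
        m = fasttext.train_supervised(input="so.train", dim=10, lr=lr,
                                      wordNgrams=2, epoch=epoch, thread=4)
        _, p1, r1 = m.test("so.test", k=1)  # (N, P@1, R@1)
        if best is None or p1 > best[0]:
            best = (p1, r1, lr, epoch)
print("best P@1=%.3f R@1=%.3f (lr=%s, epoch=%s)" % best)
```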

Future

I'm planning to repeat this work using stackexchange.com data.

Model files

Here you can download the compressed versions of the model files I, II, III, and IV.


Soner ALTIN