If you check the fastText documentation, you will see that fastText has many different input parameters for training and for the dictionary. If you have ever tried to tune your model's accuracy, you will have noticed that changing these parameters changes the model's precision and recall dramatically. So I decided to run a grid search to understand the effect of some of these parameters on quality.
What's grid search?
According to Wikipedia: "The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set."
And I also recommend that you watch this video.
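In code, a grid search is nothing more than an exhaustive loop over a manually specified grid, keeping the best-scoring combination. Here is a minimal sketch; the `evaluate` function is a made-up stand-in for training a model and scoring it on a validation set:

```python
from itertools import product

# Stand-in objective: in a real experiment this would train a model
# with the given parameters and return a validation score.
def evaluate(lr, epoch):
    return -(lr - 0.2) ** 2 - 0.0001 * (epoch - 40) ** 2

# Manually specified subset of the hyperparameter space.
grid = {"lr": [0.05, 0.1, 0.25], "epoch": [5, 50, 100]}

# Exhaustively try every combination and keep the best one.
best = max(product(grid["lr"], grid["epoch"]),
           key=lambda params: evaluate(*params))
print(best)  # (0.25, 50)
```

The same loop scales to any number of parameters: the grid simply grows multiplicatively with each new axis.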
I checked all the parameters and decided to try the following values for the experiment:
- wordNgrams: 1 2 3
- lr: 0.05 0.1 0.25
- dim: 100 300
- ws: 5 10 25
- epoch: 5 50 100
- loss: ns hs softmax
That means 486 different training and testing runs (3 × 3 × 2 × 3 × 3 × 3 = 486). It took 2 days in total.
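Enumerating the full grid is straightforward with `itertools.product`. The sketch below also builds one fastText command line per combination; the `train.txt` and `model_N` file names are placeholders, not the paths I actually used:

```python
from itertools import product

# The parameter grid from the experiment above.
grid = {
    "wordNgrams": [1, 2, 3],
    "lr": [0.05, 0.1, 0.25],
    "dim": [100, 300],
    "ws": [5, 10, 25],
    "epoch": [5, 50, 100],
    "loss": ["ns", "hs", "softmax"],
}

names = list(grid)
combos = list(product(*grid.values()))
print(len(combos))  # 486

# One fastText supervised command per combination (paths are placeholders).
commands = [
    "fasttext supervised -input train.txt -output model_%d " % i
    + " ".join("-%s %s" % (k, v) for k, v in zip(names, combo))
    for i, combo in enumerate(combos)
]
print(commands[0])
```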
Yeah, it's not easy to analyze 486 results :) Let's go parameter by parameter.
By changing wordNgrams you set the max length of word n-grams; its default value is 1. In this experiment we tried 1, 2 and 3. With wordNgrams 1 there is a better chance of getting good results, but on the other hand the median for 2 is higher than for 1, and we got the single best score with wordNgrams 3. For the next trainings, I will use wordNgrams 1.
- 1: Median 0.567 Max 0.595 Min 0.454
- 2: Median 0.577 Max 0.597 Min 0.377
- 3: Median 0.560 Max 0.603 Min 0.316
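Per-value summaries like the ones above are easy to compute once the scores are grouped by parameter value. In the sketch below the score lists are illustrative placeholders; in the real experiment each list would hold the 162 scores of the runs that used that wordNgrams value:

```python
from statistics import median

# Placeholder scores grouped by wordNgrams value (not the real 162-run lists).
scores_by_ngram = {
    1: [0.454, 0.567, 0.595],
    2: [0.377, 0.577, 0.597],
    3: [0.316, 0.560, 0.603],
}

# Median / max / min per parameter value, as reported above.
summary = {n: (median(s), max(s), min(s)) for n, s in scores_by_ngram.items()}
print(summary[1])  # (0.567, 0.595, 0.454)
```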
By changing lr you set the learning rate of the fastText training; the default value is 0.05. We tried 0.05, 0.1 and 0.25, and 0.25 gave the best results.
By changing dim you set the size of the word vectors; the default value is 100. We tried 100 and 300, and both gave the same quality.
By changing ws you set the size of the context window; the default value is 5. We tried 5, 10 and 25, and all three gave the same quality.
By changing epoch you set the number of training epochs; the default value is 5. We tried 5, 50 and 100, and 50 gave the best results.
By changing loss you set the loss function of the training; the default value is ns. We tried ns, hs and softmax, and softmax gave the best results.
To understand how the parameters affect each other, let's check the best results:
And we can clearly see that the top items have wordNgrams 3, lr 0.25 and loss softmax. We already knew that softmax and lr 0.25 give better results, and wordNgrams 3 turns out to combine very well with these values, even though on its own we would normally pick wordNgrams 1 or 2.
Let's analyze the worst results:
And we clearly see that the loss function ns combined with lr 0.05 is a bad choice for training, and adding wordNgrams 2 or 3 on top of it makes things even worse.
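Ranking the 486 runs to inspect the best and worst configurations is a simple sort on the score. A sketch with made-up entries (the parameter/score pairs below are illustrative, not the real results):

```python
# Hypothetical (params, score) pairs as a grid search would produce them.
results = [
    ({"wordNgrams": 3, "lr": 0.25, "loss": "softmax"}, 0.603),
    ({"wordNgrams": 1, "lr": 0.05, "loss": "ns"}, 0.454),
    ({"wordNgrams": 3, "lr": 0.05, "loss": "ns"}, 0.316),
]

# Sort descending by score: head = best runs, tail = worst runs.
ranked = sorted(results, key=lambda r: r[1], reverse=True)
best, worst = ranked[0], ranked[-1]
print(best[0]["loss"], worst[0]["loss"])  # softmax ns
```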
You can find all the outputs of the experiment here at Plotly.