
Finding the best fasttext hyperparameters

If you check the fasttext documentation, you will see that fasttext has a lot of different input parameters for training and for the dictionary. If you have ever tried to tune your model's accuracy, you will have seen that changing these parameters changes the model's precision and recall dramatically. So I decided to run a grid search to understand the effect of some of these parameters on quality.
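As a concrete illustration, here is a minimal sketch (using the official fasttext Python bindings, assuming a supervised classification task and placeholder train.txt / test.txt files in fasttext's __label__ format) of how a couple of parameter changes immediately move precision and recall:

```python
import fasttext

# Two supervised classifiers that differ only in a few hyperparameters.
# train.txt / test.txt are placeholders in fasttext's "__label__X text" format.
baseline = fasttext.train_supervised(input="train.txt")  # all defaults
tuned = fasttext.train_supervised(
    input="train.txt", lr=0.25, epoch=50, wordNgrams=2, loss="softmax"
)

# model.test() returns (number of samples, precision@1, recall@1)
for name, model in [("baseline", baseline), ("tuned", tuned)]:
    n, p, r = model.test("test.txt")
    print(f"{name}: P@1={p:.3f} R@1={r:.3f} (n={n})")
```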

What's grid search?

According to Wikipedia: "The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set."

I also recommend watching this video.
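In code, the "exhaustive search through a manually specified subset of the hyperparameter space" is nothing more than nested loops over every combination of values; a tiny sketch:

```python
from itertools import product

# A manually specified subset of the hyperparameter space.
grid = {
    "lr":    [0.05, 0.1, 0.25],
    "epoch": [5, 50, 100],
}

# Grid search: exhaustively enumerate every combination of values.
for lr, epoch in product(grid["lr"], grid["epoch"]):
    print(f"would train and evaluate a model with lr={lr}, epoch={epoch}")
```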

Parameter analysis

I checked all the parameters and decided to try the following values for the experiment:

  • wordNgrams: 1 2 3
  • lr: 0.05 0.1 0.25
  • dim: 100 300
  • ws: 5 10 25
  • epoch: 5 50 100
  • loss: ns hs softmax

That means 486 different training and evaluation runs (3 × 3 × 2 × 3 × 3 × 3), which took 2 days in total.
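This is not the exact script I used, but a rough sketch of how such a sweep can be run with the fasttext Python bindings; train.txt, test.txt and results.csv are placeholder names:

```python
import csv
from itertools import product

import fasttext

# The parameter grid from the list above: 3 * 3 * 2 * 3 * 3 * 3 = 486 combinations.
grid = {
    "wordNgrams": [1, 2, 3],
    "lr":         [0.05, 0.1, 0.25],
    "dim":        [100, 300],
    "ws":         [5, 10, 25],
    "epoch":      [5, 50, 100],
    "loss":       ["ns", "hs", "softmax"],
}

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(list(grid) + ["p_at_1", "r_at_1"])
    for values in product(*grid.values()):
        params = dict(zip(grid, values))
        # One supervised training + one evaluation per combination.
        model = fasttext.train_supervised(input="train.txt", **params)
        _, precision, recall = model.test("test.txt")
        writer.writerow(list(values) + [precision, recall])
```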

Results

Yeah, it's not easy to analyze 486 results :) Let's go parameter by parameter.
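If the sweep is logged to a results.csv like the one sketched above, the per-parameter medians and extremes reported below can be pulled out with a simple pandas group-by:

```python
import matplotlib.pyplot as plt
import pandas as pd

# results.csv is the output of the sweep sketched above.
results = pd.read_csv("results.csv")

# For each hyperparameter, summarize P@1 over all runs that share a value.
for param in ["wordNgrams", "lr", "dim", "ws", "epoch", "loss"]:
    summary = results.groupby(param)["p_at_1"].agg(["median", "max", "min"])
    print(f"\n{param}\n{summary}")

# The box plots shown below are one line each, e.g. for wordNgrams:
results.boxplot(column="p_at_1", by="wordNgrams")
plt.show()
```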

wordNgrams

By changing wordNgrams you set the maximum length of word n-grams; its default value is 1. In this experiment we tried 1, 2 and 3. wordNgrams 1 gives the most consistent results (its worst score is the highest), but 2 has a higher median than 1, and the single best score comes from wordNgrams 3. For the next trainings, I will use wordNgrams 1.

  • 1: Median 0.567 Max 0.595 Min 0.454
  • 2: Median 0.577 Max 0.597 Min 0.377
  • 3: Median 0.560 Max 0.603 Min 0.316

wordNgrams box plot

lr

By changing lr you set the learning rate of the fasttext training; the default value is 0.05. We tried 0.05, 0.1 and 0.25, and 0.25 gives the best results.

learning rate box plot

dim

By changing dim you set the size of the word vectors; the default value is 100. We tried 100 and 300, and both give the same quality.

dimension box plot

ws

By changing ws you set the size of the context window; the default value is 5. We tried 5, 10 and 25, and all three give the same quality.

context window box plot

epoch

By changing epoch you set the number of training epochs; the default value is 5. We tried 5, 50 and 100, and 50 gives the best results.

epoch box plot

loss

By changing loss you set the loss function of the training; the default value is ns. We tried ns, hs and softmax, and softmax gives the best results.

loss box plot

Parameters

To understand how the parameters affect each other, let's look at the best results:

wordNgrams  lr    dim  ws  epoch  loss     P@1
3           0.25  300  10  100    softmax  0.603
3           0.25  300  25  100    softmax  0.603
3           0.25  100  10  100    softmax  0.602
3           0.25  300  5   100    softmax  0.602
3           0.25  100  5   100    softmax  0.601
3           0.25  100  25  100    softmax  0.601
3           0.25  100  25  50     softmax  0.599
3           0.25  300  5   50     softmax  0.599
3           0.25  300  10  50     softmax  0.599
3           0.25  100  5   50     softmax  0.598

We can clearly see that every one of the top results uses wordNgrams 3, lr 0.25 and loss softmax. We already know that softmax and lr 0.25 give better results on their own, and wordNgrams 3 combines very well with them, even though in isolation wordNgrams 1 or 2 would normally be the safer choice.
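If you just want to reuse the winning configuration from the table, a single training call with those values looks roughly like this (same placeholder train.txt / test.txt as before):

```python
import fasttext

# Top configuration from the sweep: P@1 = 0.603 on this dataset.
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=3, lr=0.25, dim=300, ws=10, epoch=100, loss="softmax",
)

_, precision, recall = model.test("test.txt")
print(f"P@1={precision:.3f} R@1={recall:.3f}")

model.save_model("best_model.bin")  # keep the winning model around
```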

Let's analyze the worst results:

wordNgrams  lr    dim  ws  epoch  loss  P@1
2           0.05  100  10  5      ns    0.387
2           0.05  300  10  5      ns    0.378
2           0.05  300  5   5      ns    0.377
2           0.05  300  25  5      ns    0.377
3           0.05  100  10  5      ns    0.326
3           0.05  100  25  5      ns    0.325
3           0.05  100  5   5      ns    0.324
3           0.05  300  25  5      ns    0.320
3           0.05  300  5   5      ns    0.316
3           0.05  300  10  5      ns    0.316

And we can clearly see that the combination of loss ns and lr 0.05 produces the worst results. We already know that ns and lr 0.05 are not a good choice for training, and wordNgrams 2 or 3 make things even worse.

You can find all the outputs of the experiment here at Plotly.


Soner ALTIN