Detailed Comparison of Deep Hybrid Neural Models and Optimizers for Multiclass Classification in NLP
Multi-class text classification is a supervised machine learning task in which we predict one of several classes for an input given in raw textual format. Thanks to TensorFlow, we have a variety of algorithms available to perform multi-class classification via natural language processing.
Generally, we try to find the most suitable models, loss functions, and optimizers, which help us obtain higher accuracy with minimum loss. Since we are dealing with textual data, we can also treat it as a sequence problem, with or without exploring contextual relations, which may add complexity and computational load while training our model.
The main question that remains unanswered is: how do we select the best model or algorithm for multi-class classification in NLP? Let's try to explore and answer that.
First, what are we going to do?
We are going to analyze and compare various neural network models for multi-class classification on the same textual dataset. We will look at different models paired with different optimizers to see which combination performs better, along with the computational time taken to train each model.
The Data
We are going to use a text file that contains news articles and their associated category labels as targets, obtained from BBC News. You can get the dataset from the link given below:
wget https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv -O /tmp/bbc-text.csv
Data Handling
Let’s start importing our main libraries first.
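The import cell from the original notebook is not reproduced here; below is a minimal sketch of what it is assumed to contain, using the standard TensorFlow/Keras and NLTK APIs:

```python
# Minimal sketch of the assumed imports
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, LSTM, Bidirectional,
                                     Flatten, Dropout, Dense)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')        # word tokenizer models
nltk.download('stopwords')    # English stopword list
```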
As you can see, I have imported a few layer types such as LSTM, Conv1D, and Bidirectional. These are the building blocks that we will combine into hybrid models, so that we can study the effect they have when used together.
Furthermore, I have also imported the Embedding, Flatten, Dropout, and Dense layers, which are usually required for structuring a neural network for the NLP task of multi-class classification.
As far as the NLTK package is concerned, I have imported the Porter stemmer along with the 'punkt' and 'stopwords' resources for basic text preprocessing.
Next, we will try to extract text data and convert it into a data frame for further analysis.
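A minimal sketch of this step, assuming the CSV has the 'category' and 'text' columns that the BBC dataset ships with:

```python
# Read the BBC news CSV into a DataFrame with 'category' and 'text' columns
df = pd.read_csv('/tmp/bbc-text.csv')

print("Total number of Unique Target Classes :", df['category'].nunique())
print(df['category'].unique())
```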
Total number of Unique Target Classes : 5
array(['tech', 'business', 'sport', 'entertainment', 'politics'], dtype=object)
So we have a total of 5 unique target classes, namely Tech, Business, Sport, Entertainment, and Politics.
Next, let's do some text preprocessing. This step usually involves cleaning the text: removing unnecessary punctuation, removing stopwords, and, in my case, stemming words down to their roots. How far you take preprocessing is entirely up to you; you can go into much more depth here. A sketch of the cleaning step is given below.
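A minimal sketch of such a cleaning step, assuming a helper function (here called clean_text) that lowercases, strips punctuation, removes NLTK stopwords, and applies the Porter stemmer:

```python
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Lowercase, keep only letters and spaces, tokenize,
    # drop stopwords, and stem each remaining word
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    tokens = word_tokenize(text)
    return ' '.join(stemmer.stem(t) for t in tokens if t not in stop_words)

df['text'] = df['text'].apply(clean_text)
```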
Once we are done with text cleaning, it's time to split the data into training and testing sets for our neural network. I am also going to reshape the target labels so that they can be fed into our model.
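A sketch of the split and the label reshape, assuming an 80/20 split (which matches the 1780/445 shapes printed later):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=0.2, random_state=42)

# Reshape the labels into column vectors so they can be one-hot encoded later
y_train = np.array(y_train).reshape(-1, 1)
y_test = np.array(y_test).reshape(-1, 1)
```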
Once we are done with splitting our data, we need to fix some common parameters for our NLP-based neural network to train on.
Remember, these parameters are always flexible; we can change them and play around with them while trying out the neural network models.
I have commented each parameter's meaning for your reference below.
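The exact values used in the original notebook are not shown here, so the numbers below are assumptions, apart from the sequence length of 200 (which matches the padded shapes printed later), the 150 epochs, and the batch size of 32:

```python
vocab_size = 20000       # maximum number of words kept by the tokenizer (assumed)
embedding_dim = 64       # size of the learned word-embedding vectors (assumed)
max_length = 200         # every sequence is padded/truncated to this length
trunc_type = 'post'      # truncate overly long articles at the end
padding_type = 'post'    # pad short articles with zeros at the end
oov_tok = '<OOV>'        # token used for out-of-vocabulary words
num_epochs = 150
batch_size = 32
```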
As the next step, we have to convert our raw text into sequences of integers so that we can feed them into our neural network. Thanks to TensorFlow, we can use some pre-built functions to make this easy.
Our main work here is as follows:
- Taking up the Text
- Tokenising It (Splitting a single sentence into words)
- Converting them into sequences (Assigning those split words into unique integers and then arranging those integers into sequences based on word index)
- Padding them (Pad the sequences with ‘0’ to ensure uniformity of sequence input).
We need to perform this on both train and test data.
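A sketch of this tokenise, convert and pad step with the Keras Tokenizer and pad_sequences; the variable names follow the shapes printed further down:

```python
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)

# Words -> integer sequences -> fixed-length padded sequences
train_seq = tokenizer.texts_to_sequences(X_train)
train_padded = pad_sequences(train_seq, maxlen=max_length,
                             padding=padding_type, truncating=trunc_type)

test_seq = tokenizer.texts_to_sequences(X_test)
validation_padded = pad_sequences(test_seq, maxlen=max_length,
                                  padding=padding_type, truncating=trunc_type)
```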
# Just an example to see the raw sentence, the sentence as a sequence, and the padded sequence:
X_train[3], train_seq[3], train_padded[3]

('school tribut tv host carson peopl turn sunday pay tribut late us tv present johnni carson nebraska town grew carson host tonight show year die januari respiratori diseas emphysema live norfolk nebraska age eight join navi return regularli donat local caus old school friend among crowd school johnni carson theater carson one bestlov tv person us ask public memori lo angel live later life began showbusi career norfolk perform magic name great carsoni age donat includ norfolk high school build new perform art centr carson die presid bush led public tribut say present profound influenc american life entertain',
[406, 1949, 118, 533, 4422, 8, 256, 400, 263, 1949, 671, 9, 118, 414, 2424, 4422, 7885, 2002, 1396, 4422, 533, 5529, 32, 4, 788, 314, ...],
array([406, 1949, 118, 533, 4422, 8, 256, 400, 263, 1949, 671, 9, 118, 414, 2424, 4422, ..., 1252, 409, 280, 682, 0, 0, 0, ..., 0], dtype=int32))
Once we are done with the input data, we need to take care of our target labels as well. To do that, I am going to use one-hot encoding. Furthermore, we need to convert the encoded labels into dense arrays before feeding them into our neural network.
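The variable types printed further down suggest scikit-learn's OneHotEncoder was used (it returns sparse matrices), so here is a sketch along those lines:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
training_labels = encoder.fit_transform(y_train)    # sparse csr_matrix
validation_labels = encoder.transform(y_test)

# Keras expects dense arrays, so convert the sparse one-hot matrices
training_labels = training_labels.toarray()
validation_labels = validation_labels.toarray()
```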
Let's have a look at the shapes and types of all the variables we are dealing with:
print(train_padded.shape)
print(validation_labels.shape)
print(validation_padded.shape)
print(training_labels.shape)
print(type(train_padded))
print(type(validation_padded))
print(type(training_labels))
print(type(validation_labels))
print(type(training_labels))
print(type(validation_labels))

(1780, 200)
(445, 5)
(445, 200)
(1780, 5)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
With that, we are done with all the data handling and are ready to train our models.
Before we move forward to code our core neural network models, let us first define some plotting and evaluation functions that will save us time in assessing model performance during training!
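The helper functions themselves are not shown in the original, so here is one possible sketch: a small plotting helper plus an evaluation helper that prints metrics in the same spirit as the results reported below.

```python
import time
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def plot_history(history):
    # Training vs. validation curves for accuracy and loss
    for metric in ['accuracy', 'loss']:
        plt.plot(history.history[metric], label='train ' + metric)
        plt.plot(history.history['val_' + metric], label='val ' + metric)
        plt.xlabel('epoch')
        plt.legend()
        plt.show()

def evaluate_model(model):
    # Precision/recall/F1 on the validation set, plus raw loss/accuracy
    y_pred = np.argmax(model.predict(validation_padded), axis=1)
    y_true = np.argmax(validation_labels, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    print(f'Acc: {accuracy_score(y_true, y_pred):.2%}')
    print(f'Precision: {p:.2f}\nRecall: {r:.2f}\nF1 score: {f1:.2f}')
    train_loss, train_acc = model.evaluate(train_padded, training_labels, verbose=0)
    val_loss, val_acc = model.evaluate(validation_padded, validation_labels, verbose=0)
    print('Model Loss on training data:', train_loss)
    print('Model Accuracy on training data:', train_acc)
    print('Model Loss on validation data:', val_loss)
    print('Model Accuracy on validation data:', val_acc)
```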
Let’s get down to the main Business — Train Those Models!
Model Training
We are going to compare the following models' performance on the given textual data.
- CNN-1D
- LSTM
- Bi-directional LSTM
- CNN-1D+LSTM
- CNN-1D+Bidirectional LSTM
We will also consider the following optimizers' impact on those models:
- Adam (Learning Rate = 0.01)
- AdaGrad (Learning Rate = 0.01)
- SGD (momentum=0.5, nesterov=True)
- RMSprop (Learning Rate = 0.01)
- AdaDelta (Learning Rate = 0.01)
As you can see above, I have changed the default parameters of the optimizers, i.e., their learning rates, and the momentum and Nesterov settings for SGD. These values are flexible and can be experimented with; a sketch of the configurations is given below.
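A sketch of these optimizer configurations using the Keras optimizer classes:

```python
from tensorflow.keras.optimizers import Adam, Adagrad, SGD, RMSprop, Adadelta

optimizers = {
    'Adam':     Adam(learning_rate=0.01),
    'AdaGrad':  Adagrad(learning_rate=0.01),
    'SGD':      SGD(momentum=0.5, nesterov=True),
    'RMSprop':  RMSprop(learning_rate=0.01),
    'AdaDelta': Adadelta(learning_rate=0.01),
}
```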
So overall, we are going to train each of these models with each of these optimizers, and analyze the overall performance and the time taken to train.
NOTE: I am using a Google Colab GPU for running these experiments. Actual computational time may vary depending on the processor being used!
CNN-1D:
CNN-1D stands for a one-dimensional Convolutional Neural Network. In Conv1D, the kernel slides along a single dimension. Text and time series can both be represented as one-dimensional sequences, and can therefore be processed with Conv1D.
In the model given below, we implement a very basic, standard NLP model. We use an Embedding layer to learn word representations, followed by a Conv1D layer with 48 filters of size 5 each, and a GlobalMaxPooling layer that keeps the maximum activation per filter. As a regularizer, we add Dropout layers to prevent overfitting.
I have also added a few callbacks so that training is constantly monitored: ModelCheckpoint, which saves the model weights for the best accuracy achieved during the entire training run, and EarlyStopping with patience=10, which stops training once there has been no improvement for 10 epochs.
I have set the number of epochs to 150 and the batch size to 32, which is quite sufficient for a dataset of this size.
We are going to train the same model with each of the optimizers in turn: Adam, Adagrad, SGD, RMSprop, and Adadelta.
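A sketch of the whole CNN-1D setup: the architecture, the two callbacks, and a loop that trains one fresh model per optimizer. Everything beyond the 48 filters of size 5, the 150 epochs, the batch size of 32, and the callbacks described above is an assumption:

```python
def build_cnn1d():
    # Embedding -> Conv1D (48 filters of size 5) -> global max pooling -> softmax
    return Sequential([
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(0.5),
        Conv1D(48, 5, activation='relu'),
        GlobalMaxPooling1D(),
        Dropout(0.5),
        Dense(5, activation='softmax'),   # one output per target class
    ])

def train_and_time(model, optimizer):
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer, metrics=['accuracy'])
    callbacks = [
        # Save the weights that achieve the best validation accuracy
        ModelCheckpoint('best_weights.h5', monitor='val_accuracy',
                        save_best_only=True, save_weights_only=True),
        # Stop once validation accuracy has not improved for 10 epochs
        EarlyStopping(monitor='val_accuracy', patience=10),
    ]
    start = time.time()
    history = model.fit(train_padded, training_labels,
                        epochs=num_epochs, batch_size=batch_size,
                        validation_data=(validation_padded, validation_labels),
                        callbacks=callbacks, verbose=0)
    print(f'Total time took for training {time.time() - start:.3f} seconds.')
    return history

# One fresh model per optimizer, mirroring the runs reported below
for name, opt in optimizers.items():
    print(f'Initial Analysis with {name} Optimizer:')
    model = build_cnn1d()
    history = train_and_time(model, opt)
    evaluate_model(model)
    plot_history(history)
```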
Initial Analysis with Adam Optimizer:
Acc: 93.71%
Precision: 0.94
Recall: 0.94
F1 score: 0.94
Total time took for training 132.444 seconds.
Model Loss on training data: 0.000858851766679436
Model Accuracy on training data: 1.0
Model Loss on validation data: 0.16400769352912903
Model Accuracy on validation data: 0.9528089761734009
Initial Analysis with Adagrad Optimizer:
Acc: 86.52%
Precision: 0.87
Recall: 0.87
F1 score: 0.87
Total time took for training 42.859 seconds.
Model Loss on training data: 0.18429572880268097
Model Accuracy on training data: 0.9735954999923706
Model Loss on validation data 0.3753611147403717
Model Accuracy on validation data: 0.8876404762268066
Initial Analysis with SGD Optimizer:
Acc: 85.17%
Precision: 0.85
Recall: 0.85
F1 score: 0.85
Total time took for training 49.209 seconds.
Model Loss on training data: 0.4065827429294586
Model Accuracy on training data: 0.9264044761657715
Model Loss on validation data 0.5402304530143738
Model Accuracy on validation data: 0.8471910357475281
Initial Analysis with RMSProp Optimizer:
Acc: 92.58%
Precision: 0.93
Recall: 0.93
F1 score: 0.93
Total time took for training 27.585 seconds.
Model Loss on training data: 2.9393253498710692e-05
Model Accuracy on training data: 1.0
Model Loss on validation data 0.4350118935108185
Model Accuracy on validation data: 0.934831440448761
Initial Analysis with Adadelta Optimizer:
Acc: 28.31%
Precision: 0.28
Recall: 0.28
F1 score: 0.28
Total time took for training 23.972 seconds.
Model Loss on training data: 1.5702073574066162
Model Accuracy on training data: 0.3101123571395874
Model Loss on validation data 1.5865534543991089
Model Accuracy on validation data: 0.26966291666030884
LSTM:
In the model given below, we again implement a very basic, standard NLP model: an Embedding layer to learn word representations, followed by an LSTM layer with exactly as many units as the embedding dimension. Dropout layers are again used as regularizers to prevent overfitting.
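A sketch of this architecture; setting the number of LSTM units equal to the embedding dimension follows the description above, the rest is assumed. It can be compiled and trained with the same train_and_time helper sketched earlier.

```python
def build_lstm():
    return Sequential([
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(0.5),
        LSTM(embedding_dim),             # as many units as the embedding dimension
        Dropout(0.5),
        Dense(5, activation='softmax'),
    ])
```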
As before, I use the same callbacks (ModelCheckpoint to save the weights with the best accuracy and EarlyStopping with patience=10), 150 epochs, a batch size of 32, and the same set of optimizers: Adam, Adagrad, SGD, RMSprop, and Adadelta.
Initial Analysis with Adam Optimizer:
Acc: 38.43%
Precision: 0.38
Recall: 0.38
F1 score: 0.38
Total time took for training 119.479 seconds.
Model Loss on training data: 0.22766165435314178
Model Accuracy on training data: 0.8713483214378357
Model Loss on validation data 0.5319708585739136
Model Accuracy on validation data: 0.8157303333282471
Initial Analysis with Adagrad Optimizer:
Acc: 25.17%
Precision: 0.25
Recall: 0.25
F1 score: 0.25
Total time took for training 32.487 seconds.
Model Loss on training data: 1.4589718580245972
Model Accuracy on training data: 0.4061797857284546
Model Loss on validation data 1.4928467273712158
Model Accuracy on validation data: 0.3685393333435058
Initial Analysis with SGD Optimizer:
Acc: 33.26%
Precision: 0.33
Recall: 0.33
F1 score: 0.33
Total time took for training 36.672 seconds.
Model Loss on training data: 1.4594911336898804
Model Accuracy on training data: 0.3292134702205658
Model Loss on validation data 1.4703514575958252
Model Accuracy on validation data: 0.33932584524154663
Initial Analysis with RMSProp Optimizer:
Acc: 92.81%
Precision: 0.93
Recall: 0.93
F1 score: 0.93
Total time took for training 77.248 seconds.
Model Loss on training data: 6.697150473078395e-11
Model Accuracy on training data: 1.0
Model Loss on validation data 0.7437273263931274
Model Accuracy on validation data: 0.9393258690834045
Initial Analysis with AdaDelta Optimizer:
Acc: 24.27%
Precision: 0.24
Recall: 0.24
F1 score: 0.24
Total time took for training 11.980 seconds.
Model Loss on training data: 1.5746873617172241
Model Accuracy on training data: 0.2550561726093292
Model Loss on validation data 1.575269103050232
Model Accuracy on validation data: 0.23146067559719086
Bi-directional LSTM:
Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems. When all time steps of the input sequence are available, a Bidirectional LSTM trains two LSTMs on the input sequence instead of one: one over the sequence in the forward direction and one in the backward direction.
In the model given below, an Embedding layer is followed by a Bidirectional LSTM layer with as many units as the embedding dimension, with Dropout layers again acting as regularizers to prevent overfitting.
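A sketch of this architecture under the same assumptions as before; only the recurrent layer changes, and it can again be trained with the train_and_time helper:

```python
def build_bilstm():
    return Sequential([
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(0.5),
        Bidirectional(LSTM(embedding_dim)),   # one forward and one backward LSTM
        Dropout(0.5),
        Dense(5, activation='softmax'),
    ])
```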
As before, I use the same callbacks (ModelCheckpoint to save the weights with the best accuracy and EarlyStopping with patience=10), 150 epochs, a batch size of 32, and the same set of optimizers: Adam, Adagrad, SGD, RMSprop, and Adadelta.
Initial Analysis with Adam Optimizer:
Acc: 94.61%
Precision: 0.95
Recall: 0.95
F1 score: 0.95
Total time took for training 91.483 seconds.
Model Loss on training data: 0.0007189955795183778
Model Accuracy on training data: 1.0
Model Loss on validation data 0.23885682225227356
Model Accuracy on validation data: 0.9460673928260803
Initial Analysis with Adagrad Optimizer:
Acc: 83.37%
Precision: 0.83
Recall: 0.83
F1 score: 0.83
Total time took for training 99.641 seconds.
Model Loss on training data: 0.37893444299697876
Model Accuracy on training data: 0.9168539047241211
Model Loss on validation data 0.5318742990493774
Model Accuracy on validation data: 0.833707869052887
Initial Analysis with SGD Optimizer:
Acc: 33.93%
Precision: 0.34
Recall: 0.34
F1 score: 0.34
Total time took for training 66.537 seconds.
Model Loss on training data: 1.4997886419296265
Model Accuracy on training data: 0.31067416071891785
Model Loss on validation data 1.504658579826355
Model Accuracy on validation data: 0.3280898928642273
Initial Analysis with RMSProp Optimizer:
Acc: 92.81%
Precision: 0.93
Recall: 0.93
F1 score: 0.93
Total time took for training 75.920 seconds.
Model Loss on training data: 4.0584662741594e-08
Model Accuracy on training data: 1.0
Model Loss on validation data 0.4116312861442566
Model Accuracy on validation data: 0.934831440448761
Initial Analysis with Adadelta Optimizer:
Acc: 24.49%
Precision: 0.24
Recall: 0.24
F1 score: 0.24
Total time took for training 21.339 seconds.
Model Loss on training data: 1.5830999612808228
Model Accuracy on training data: 0.2595505714416504
Model Loss on validation data 1.585620403289795
Model Accuracy on validation data: 0.2404494434595108
CNN-1D + LSTM:
Hybrid deep neural network models are among the most interesting architectures explored by researchers: fusing more than one type of deep neural network can help extract features from the input far more effectively.
The first hybrid model we are going to look at combines CNN-1D and LSTM layers.
The hybrid CNN-LSTM model uses CNN layers for feature extraction from the input data, combined with LSTM layers for sequence learning.
In the model given below, an Embedding layer is followed by a Conv1D layer with appropriate filters and filter size and a MaxPooling layer that downsamples the resulting feature maps. An LSTM layer with as many units as the embedding dimension is then added on top of this.
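A sketch of this hybrid architecture under the same assumptions; the pool size of the MaxPooling1D layer is an assumption:

```python
def build_cnn_lstm():
    return Sequential([
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(0.5),
        Conv1D(48, 5, activation='relu'),   # convolutional feature extraction
        MaxPooling1D(pool_size=2),          # downsample the feature maps
        LSTM(embedding_dim),                # sequence learning on the CNN features
        Dropout(0.5),
        Dense(5, activation='softmax'),
    ])
```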
As before, I use the same callbacks (ModelCheckpoint to save the weights with the best accuracy and EarlyStopping with patience=10), 150 epochs, a batch size of 32, and the same set of optimizers: Adam, Adagrad, SGD, RMSprop, and Adadelta.
Initial Analysis with Adam:
Acc: 81.57%
Precision: 0.82
Recall: 0.82
F1 score: 0.82
Total time took for training 181.622 seconds.
Model Loss on training data: 0.0008503638091497123
Model Accuracy on training data: 1.0
Model Loss on validation data 0.670300304889679
Model Accuracy on validation data: 0.8719100952148438
Initial Analysis with Adagrad:
Acc: 33.03%
Precision: 0.33
Recall: 0.33
F1 score: 0.33
Total time took for training 14.808 seconds.
Model Loss on training data: 1.5146769285202026
Model Accuracy on training data: 0.2657303512096405
Model Loss on validation data 1.5247085094451904
Model Accuracy on validation data: 0.2719101011753082
Initial Analysis with SGD:
Acc: 34.61%
Precision: 0.35
Recall: 0.35
F1 score: 0.35
Total time took for training 35.070 seconds.
Model Loss on training data: 0.9556211233139038
Model Accuracy on training data: 0.5629213452339172
Model Loss on validation data 1.1251300573349
Model Accuracy on validation data: 0.5078651905059814
Initial Analysis with RMSProp:
Acc: 91.24%
Precision: 0.91
Recall: 0.91
F1 score: 0.91
Total time took for training 38.411 seconds.
Model Loss on training data: 5.357720378462716e-10
Model Accuracy on training data: 1.0
Model Loss on validation data 0.5674594640731812
Model Accuracy on validation data: 0.9438202381134033
Initial Analysis with Adadelta:
Acc: 24.04%
Precision: 0.24
Recall: 0.24
F1 score: 0.24
Total time took for training 136.470 seconds.
Model Loss on training data: 1.5710480213165283
Model Accuracy on training data: 0.24438202381134033
Model Loss on validation data 1.5702738761901855
Model Accuracy on validation data: 0.2404494434595108
CNN-1D + Bidirectional LSTM:
Bidirectional LSTMs process the input in both the forward and backward directions, which makes sequence learning with LSTM architectures richer and more effective.
Therefore, let's have a look at the hybrid model of Bidirectional LSTMs together with CNN-1D.
In the model given below, an Embedding layer is followed by a Conv1D layer with appropriate filters and filter size and a MaxPooling layer that downsamples the resulting feature maps. A Bidirectional LSTM layer with as many units as the embedding dimension is then added on top of this.
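A sketch of this final hybrid architecture under the same assumptions as before:

```python
def build_cnn_bilstm():
    return Sequential([
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(0.5),
        Conv1D(48, 5, activation='relu'),
        MaxPooling1D(pool_size=2),
        Bidirectional(LSTM(embedding_dim)),   # bidirectional sequence learning
        Dropout(0.5),
        Dense(5, activation='softmax'),
    ])
```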
As before, I use the same callbacks (ModelCheckpoint to save the weights with the best accuracy and EarlyStopping with patience=10), 150 epochs, a batch size of 32, and the same set of optimizers: Adam, Adagrad, SGD, RMSprop, and Adadelta.
Initial analysis with Adam Optimizer:
Acc: 94.38%
Precision: 0.94
Recall: 0.94
F1 score: 0.94
Total time took for training 417.384 seconds.
Model Loss on training data: 0.00014469937013927847
Model Accuracy on training data: 1.0
Model Loss on validation data 0.2845289707183838
Model Accuracy on validation data: 0.9438202381134033
Initial analysis with Adagrad Optimizer:
Acc: 87.64%
Precision: 0.88
Recall: 0.88
F1 score: 0.88
Total time took for training 871.265 seconds.
Model Loss on training data: 0.009952237829566002
Model Accuracy on training data: 0.9994382262229919
Model Loss on validation data 0.5138413906097412
Model Accuracy on validation data: 0.8764045238494873
Initial analysis with SGD Optimizer:
Acc: 69.89%
Precision: 0.70
Recall: 0.70
F1 score: 0.70
Total time took for training 1121.185 seconds.
Model Loss on training data: 0.8237250447273254
Model Accuracy on training data: 0.748314619064331
Model Loss on validation data 1.093525767326355
Model Accuracy on validation data: 0.6988763809204102
Initial analysis with RMSProp Optimizer:
Acc: 94.83%
Precision: 0.95
Recall: 0.95
F1 score: 0.95
Total time took for training 222.419 seconds.
Model Loss on training data: 1.339430094615679e-10
Model Accuracy on training data: 1.0
Model Loss on validation data 0.4486222565174103
Model Accuracy on validation data: 0.9483146071434021
Initial analysis with Adadelta Optimizer:
Acc: 24.49%
Precision: 0.24
Recall: 0.24
F1 score: 0.24
Total time took for training 123.594 seconds.
Model Loss on training data: 1.5946524143218994
Model Accuracy on training data: 0.2606741487979889
Model Loss on validation data 1.595898985862732
Model Accuracy on validation data: 0.24494382739067078
Test on Some Text:
Let's test one of our models on some text to get a feel for how well it performs. Make sure to follow the standard NLP pipeline, i.e., the new text must be handled in exactly the same way we handled the training data.
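A sketch of that inference pipeline, reusing the clean_text helper, the tokenizer, and the one-hot encoder from the earlier sketches; the sample string is a placeholder for the raw article text:

```python
sample = "..."   # raw text of a news article (the Japanese-bank example shown below)

cleaned = clean_text(sample)                      # same preprocessing as for training
seq = tokenizer.texts_to_sequences([cleaned])     # same tokenizer as for training
padded = pad_sequences(seq, maxlen=max_length,
                       padding=padding_type, truncating=trunc_type)

probs = model.predict(padded)[0]
category_id = int(np.argmax(probs))
print('Product category id:', category_id)
print('Predicted label is:', [encoder.categories_[0][category_id]])
print('Accuracy score:', 100 * probs[category_id])   # softmax confidence, in percent
```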
Input text (stemmed): 'japanes bank battl end japan sumitomo mitsui financi withdrawn takeov offer rival bank ufj hold enabl latter merg mitsubishi tokyo sumitomo boss told counterpart ufj decis friday clear way conclud trillion yen bn deal mitsubishi deal would creat world biggest bank asset trillion yen trillion sumitomo eit end high profil fight japanes bank histori ufj hold japan fourthlargest bank centr fierc bid battl last year sumitomo japan thirdlargest bank tabl higher offer ufj rival valu compani bn howev ufj manag known prefer offer mitsubishi tokyo financi group mtfg japan secondlargest bank concern also rais sumitomo abil absorb ufj former admit defeat believ market investor accept ufjmtfg merger sumitomo said statement given ongo integr ufj mtfg oper persist propos may best interest sharehold ufj mitsubishi takeov ufj japan largestev takeov deal still approv sharehold two firm howev epect formal sumitomo may turn attent deepen tie daiwa secur anoth japanes financi firm two set merg ventur capit oper specul could lead fullblown merger japanes bank increasingli seek allianc boost profit'

Product category id: 0
Predicted label is: ['business']
Accuracy score: 99.92726445198059
Summarised Final Evaluation:
Let's have a summarised look at the performance of all the runs using the PrettyTable package:
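The table itself can be built with code along these lines; the two rows shown are only examples taken from the CNN-1D runs above, and the remaining combinations would be added the same way:

```python
from prettytable import PrettyTable

results = PrettyTable()
results.field_names = ['Model', 'Optimizer', 'Accuracy (%)', 'Training time (s)']
results.add_row(['CNN-1D', 'Adam', 93.71, 132.444])
results.add_row(['CNN-1D', 'RMSprop', 92.58, 27.585])
# ... one add_row per model/optimizer combination from the runs above ...
print(results)

# Sorted view: highest accuracy first
print(results.get_string(sortby='Accuracy (%)', reversesort=True))
```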
Let's sort this table by the highest accuracy achieved and by the time each individual model took to train.
RMSprop is an extension of Adagrad that deals with Adagrad's rapidly diminishing learning rates. It is very similar to Adadelta, except that Adadelta additionally uses the RMS of recent parameter updates in its update rule.
Adam, finally, adds bias-correction and momentum to RMSprop. In that respect, RMSprop, Adadelta, and Adam are very similar algorithms.
SGD usually finds a minimum at a slower pace; it can take significantly longer than the adaptive optimizers, is more reliant on a robust initialization, and may get stuck in saddle points rather than reaching a good minimum. That said, SGD can sometimes outperform Adam when given a well-chosen momentum value or Nesterov momentum.
A change in learning rate can have a significant impact on an optimizer, since a well-chosen learning rate lets the model converge at a good pace and hence find a minimum earlier.
Consequently, if you care about fast convergence and are training a deep or complex neural network, you should choose one of the adaptive learning-rate methods, such as AdaGrad, RMSProp (an extension of AdaGrad), or Adam.
What More Could Have Been Done?
After going through this detailed analysis, as an NLP practitioner you can try out the following pointers to get even better results than what we just saw.
Word embeddings — Word embeddings are arguably the most widely known best practice in the recent history of NLP. It is well known that using pre-trained embeddings such as Google News or Wiki300k helps a lot in obtaining good featured representations of words. The choice of embedding size depends on what you want to achieve: a larger number of latent dimensions can be good for heavier tasks like sentiment analysis, whereas a smaller number can be sufficient for entity recognition or part-of-speech tagging.
Depth Of Network — Neural networks in NLP have become progressively deeper. State-of-the-art approaches now regularly use deep Bi-LSTMs, typically consisting of 3–4 layers, e.g. for POS tagging (Plank et al., 2016).
Dropouts — Dropout is still the go-to regularizer for deep neural networks in NLP. A dropout rate of 0.5 is effective in most general and moderately heavy NLP tasks. Recurrent dropout applies the same dropout mask across timesteps; this avoids amplifying the dropout noise along the sequence and leads to effective regularization for sequence models.
Conclusion:
So far, we have tested a bunch of models, be it individual architectures or hybrid combinations of them. We analyzed their performance in terms of computational time, the effect of the optimizer used, and the highest accuracy achieved.
You can find the code for the above-mentioned example Here.