This is my code for an RNN on the IMDB dataset. I have included the entire code, showing every step I have taken. Can you help me improve this code and its accuracy? What steps should I take to improve the accuracy, and is there any other way to approach this?
To enhance the accuracy of your RNN (Recurrent Neural Network) model for sentiment analysis on the IMDB dataset, consider the following steps and potential improvements (a hedged code sketch combining several of them follows this list).
1. **Data Preprocessing:** Ensure that your data is appropriately preprocessed. This includes cleaning, tokenization, and potentially handling imbalanced classes if present.
2. **Embedding Layer:** Optimize the embedding layer. Experiment with different embedding dimensions and with pre-trained embeddings such as Word2Vec or GloVe.
3. **Model Architecture:** Modify the RNN architecture. Consider more capable recurrent layers such as LSTM or GRU, or stack multiple layers to capture richer representations of the text.
4. **Hyperparameter Tuning:** Experiment with different learning rates, batch sizes, and sequence lengths to find settings that improve convergence and generalization.
5. **Regularization and Dropout:** Apply regularization techniques such as L2 regularization and dropout to prevent overfitting.
6. **Learning Rate Scheduling:** Use learning rate schedules to adjust the learning rate during training, improving convergence and accuracy.
7. **Early Stopping:** Implement early stopping to halt training once the validation loss plateaus, preventing overfitting.
8. **Gradient Clipping:** Apply gradient clipping to prevent exploding gradients, especially in deeper recurrent architectures.
9. **Class Imbalance Handling:** Address class imbalance, if present, through class weights or oversampling/undersampling.
10. **Ensemble Learning:** Consider ensemble methods, where multiple models are trained and their predictions combined for improved accuracy.
11. **Optimizers:** Experiment with different optimizers such as Adam, RMSProp, or SGD to find the most suitable one for your model.
12. **Fine-Tuning:** If time and resources allow, fine-tune a pre-trained language model such as BERT and adapt it to your sentiment analysis task.
13. **Feature Engineering:** Explore additional features or feature-engineering techniques that could enhance the model's understanding of the text.
14. **Cross-Validation:** Use k-fold cross-validation to check that the model's performance is consistent and robust.
15. **Debugging and Error Analysis:** Analyze misclassifications and error patterns to iteratively refine the model.
By systematically implementing and experimenting with these steps, you can significantly improve the accuracy and effectiveness of your RNN model on the IMDB dataset.
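As one concrete illustration (a minimal sketch, not a drop-in replacement for your code), here is how items 3, 5, 7 and 11 might be combined: an LSTM with dropout, the Adam optimizer, and an EarlyStopping callback. It assumes the same constants and padded X_train/y_train that your code builds below; the layer sizes and patience values are illustrative choices.
# Hedged sketch: LSTM + dropout + Adam + early stopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

max_vocabulary = 10000
max_len_review = 500

improved = Sequential()
improved.add(Embedding(max_vocabulary, 64))                  # slightly larger embedding dim
improved.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))   # LSTM instead of SimpleRNN
improved.add(Dropout(0.3))
improved.add(Dense(1, activation='sigmoid'))

improved.compile(loss='binary_crossentropy',
                 optimizer='adam',                           # try Adam as well as RMSProp
                 metrics=['acc'])

es = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
# improved.fit(X_train, y_train, batch_size=64, epochs=15,
#              validation_split=0.2, callbacks=[es])
A bidirectional wrapper (tf.keras.layers.Bidirectional around the LSTM) and a learning-rate schedule are natural next experiments on top of this sketch.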
# 1.0 Call libraries
#%reset -f
import numpy as np
# 1.1 Import module imdb & other keras modules
import tensorflow as tf
from tensorflow.keras.datasets import imdb
# 1.2 API to manipulate sequences of words
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
# 1.3 We will have three types of layers.
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
# 1.4 Misc
import matplotlib.pyplot as plt
import time
import io
# 1.5 Display multiple commands output from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.1 Define some constants:
max_vocabulary = 10000 # words
max_len_review = 500 # words
(X_train,y_train),(X_test,y_test) = imdb.load_data(
num_words=max_vocabulary
)
# 2.5 About our data
type(X_train) # numpy.ndarray
print("\n")
f"Shape of X_train {X_train.shape}" # (25000,) Total 25000 reviews
print("\n")
f"Shape of X_test {X_test.shape}" # (25000,) Total 25000 reviews
print("\n")
y_train.shape # (25000,) Total 25000 pos/neg labels
print("\n")
y_test.shape # (25000,) Total 25000 pos/neg labels
# 2.6 Check max and min length of reviews
maxLen = 0                     # Start low; any review length will exceed this
minLen = float('inf')          # Start high; any review length will be below this
for i in range(X_train.shape[0]):        # Go over all 25000 reviews
    if len(X_train[i]) > maxLen:
        maxLen = len(X_train[i])
    if len(X_train[i]) < minLen:
        minLen = len(X_train[i])
# 2.6.1
maxLen # 2494
print()
minLen # 11
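A quick look at the full length distribution (not in your original code) helps confirm that max_len_review = 500 is a reasonable cut-off; the 90th percentile used below is just an illustrative choice.
# Hedged check: distribution of review lengths to justify max_len_review
lengths = np.array([len(r) for r in X_train])
lengths.mean()                 # average review length
np.percentile(lengths, 90)     # 90th percentile; compare against max_len_review = 500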
# 3.2 Pad X_train sequences
# so that each inner list becomes one fixed-length row:
X_train = sequence.pad_sequences(
                 X_train,                  # An array of lists where each inner
                                           # list is a sequence, or a list of
                                           # lists with each list being a sequence
                 maxlen = max_len_review,  # Default is None (length of longest sequence)
                 padding = 'pre'           # 'pre' is the default; option: 'post'
)
# 3.4 Look at the first twenty rows
# and first ten columns:
X_train[:20, :10]
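If it helps to see exactly what pad_sequences does, here is a tiny standalone illustration on made-up toy sequences:
# Hedged illustration of 'pre' vs 'post' padding on made-up sequences
toy = [[1, 2, 3], [4, 5]]
sequence.pad_sequences(toy, maxlen=4, padding='pre')    # [[0 1 2 3], [0 0 4 5]]
sequence.pad_sequences(toy, maxlen=4, padding='post')   # [[1 2 3 0], [4 5 0 0]]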
# 4.0 Build model now
# 4.0.1 Delete any earlier model
if 'model' in locals():
    del model
# 4.0.2 Our model:
# 4.0.3 Start with a blank template:
model = Sequential()
# 4.1 Add an embedding layer:
model.add(Embedding(
max_vocabulary, # Decides number of input neurons
32, # Decides number of neurons in hidden layer
input_length= max_len_review) # (optional) Decides how many groups of OHEs
# are input at a time (or in sequence).
# It also decides how many times
# RNN should loop around
# If omitted, it is decided automatically
# during 'model.fit()' by considering
# X_train.shape[1]
)
# 4.2
# It is instructive to look at the number of parameters
# in the summary. It shows that the Embedding layer behaves
# like a two-layer network with max_vocabulary input neurons
# and a hidden (output) layer of 32 neurons.
# Note: the hidden layer has no activation function
# and no bias parameter:
model.summary()
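A quick arithmetic check of that interpretation: the Embedding layer's parameter count should equal vocabulary size times embedding dimension, with no bias term.
# Check: Embedding parameters = max_vocabulary * embedding_dim = 10000 * 32 = 320000
max_vocabulary * 32                             # 320000
model.get_layer('embedding').count_params()     # expected to match the summary above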
model.add(
          SimpleRNN(
                    32,                       # Neurons at the output
                    return_sequences = False  # Set to True only if you stack another
                                              # recurrent layer on top (see the stacked
                                              # sketch after #4.4 below)
                   )
         )                                    # Output
# 4.4 Model summary
model.summary()
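If you want to try the stacked variant hinted at in the return_sequences comment, a hedged sketch (built as a separate model so it does not disturb the one being trained) could look like this:
# Hedged sketch of a stacked SimpleRNN variant (separate from 'model')
stacked = Sequential()
stacked.add(Embedding(max_vocabulary, 32))
stacked.add(SimpleRNN(32, return_sequences=True))   # pass the full sequence onward
stacked.add(SimpleRNN(32))                          # final recurrent layer returns one vector
stacked.add(Dense(1, activation='sigmoid'))
stacked.build(input_shape=(None, max_len_review))   # build so summary() can be displayed
stacked.summary()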
model.add(Dense(1, activation = 'sigmoid'))
model.summary()
# 4.7 Plot model
tf.keras.utils.plot_model(
model,
show_shapes=True
)
# 4.8 Compile model
model.compile(
loss = 'binary_crossentropy',
optimizer = 'rmsprop',
metrics = ['acc']
)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
epochs = 15
start = time.time()
history = model.fit(X_train,
                    y_train,
                    batch_size = 32,          # Number of samples per gradient update
                    validation_split = 0.2,   # Fraction of training data used as validation data
                    epochs = 4,               # or pass the 'epochs' variable defined above (15)
                    shuffle = True,           # Shuffle training data before each epoch
                    callbacks = [tensorboard_callback],
                    verbose = 1
                    )
end = time.time()
(end-start)/60
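Since the history object is already captured, plotting the accuracy curves (not in your original code) makes over/underfitting easy to spot; with metrics=['acc'] the keys are 'acc' and 'val_acc'.
# Hedged addition: training vs validation accuracy for the RNN run
plt.plot(history.history['acc'], label='train_acc')
plt.plot(history.history['val_acc'], label='val_acc')
plt.legend()
plt.show()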
# 6.0 Start tensorboard server
# and display logs
%load_ext tensorboard
%tensorboard --logdir logs
weights = model.get_layer('embedding').get_weights()
# 7.1.1
weights
# 7.2
# Extract array of weights
# and print its shape
weights = weights[0]
weights
print("--")
weights.shape # (10000, 32)
# 7.3 Get vocabulary:
type(imdb.get_word_index()) # dictionary
vocab = imdb.get_word_index()
vocab # dict
# 7.4 Create empty files to store
# weight-vectors and metadata
# (labels):
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
# 7.6 Store data in the two files.
# Caution: imdb.load_data() shifts word indices by 3 (0 = padding, 1 = start,
# 2 = out-of-vocabulary), so the rows written below are offset by 3 relative
# to the raw word_index; good enough for a quick projector visualisation.
for word, index in vocab.items():
    if index < max_vocabulary:
        vec = weights[index]
        _ = out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        _ = out_m.write(word + "\n")
# 7.7
out_v.close()
out_m.close()
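Related to the offset note above, if you ever want to read an encoded review back as text, a common decoding sketch (assuming imdb.load_data()'s default offset of 3) is:
# Hedged sketch: decode an encoded review back into words
# (assumes the default offset of 3 used by imdb.load_data: 0=PAD, 1=START, 2=OOV)
reverse_index = {index + 3: word for word, index in vocab.items()}
reverse_index[0], reverse_index[1], reverse_index[2] = '<PAD>', '<START>', '<OOV>'
decoded = ' '.join(reverse_index.get(i, '<OOV>') for i in X_train[0])
decoded[-300:]      # tail of the first (pre-padded) training review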
# 6.1 Pad X_test the same way as X_train
X_test = sequence.pad_sequences(
                 X_test,                  # A list/array of lists where each
                                          # inner list is a sequence
                 maxlen = max_len_review,
                 padding = 'pre'
)
# 6.2 Predict now
out = model.predict(X_test)
out[out > 0.5] = 1
out[out <= 0.5] = 0
out
# 6.3
model.evaluate(X_test, y_test)
# 7.3.1
model.metrics_names # ['loss', 'acc']
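For the error analysis suggested in the list above, scikit-learn's confusion matrix and classification report (an extra dependency, not used elsewhere in your code) give a quick per-class view of the predictions:
# Hedged addition: per-class error analysis with scikit-learn
from sklearn.metrics import confusion_matrix, classification_report
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))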
# ANN: second exercise, churn prediction on the Churn_Modelling dataset
# 1.0
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
# 1.1
import tensorflow as tf
# 1.2 Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# 2.2 Read the data ('path' must already be set to the folder containing the CSV)
data = pd.read_csv(path + "Churn_Modelling.csv")
# 2.3 Pop out target
y = data.pop('Exited')
data['Gender'] = data['Gender'].map({'Female' : 1, 'Male' : 0})
data['Geography'].unique()
data['Geography'] = data['Geography'].map({'France' : 0, 'Spain' : 2, 'Germany' : 3})
data['CustomerId'].duplicated().sum()
data = data.drop(columns = ['RowNumber','CustomerId', 'Surname'])
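One feature-engineering side note (item 13 in the list above): mapping Geography to arbitrary integers imposes an ordering that does not exist. A hedged alternative is one-hot encoding; the sketch below is illustrative only and would change the Input layer shape from (10,) to (12,).
# Hedged alternative (illustration only): one-hot encode Geography instead of
# ordinal integers. Since 'Geography' was already mapped to 0/2/3 above, the
# dummy columns here come out as Geo_0/Geo_2/Geo_3; running this before the
# .map() would give country-named columns instead.
geo_dummies = pd.get_dummies(data['Geography'], prefix='Geo')
data_ohe = pd.concat([data.drop(columns=['Geography']), geo_dummies], axis=1)
data_ohe.shape       # two more columns than 'data', so Input shape would be (12,)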
X_train,X_test, y_train,y_test = train_test_split(data, y,
test_size = 0.25)
X_train.shape
X_test.shape
mm = MinMaxScaler()
mm.fit(X_train)                    # fit on training data only
X_train = mm.transform(X_train)    # apply the same scaling to both splits
X_test = mm.transform(X_test)
model = tf.keras.Sequential()
model.add( tf.keras.layers.Input(shape = (10,) ))
# 6.2 Start
model.add(tf.keras.layers.Dense(40, activation = 'relu'))   # Make it 5 and then 20 (not more or less)
# 6.2.1 Experiment with adding a dropout layer,
#       but then increase the number of units in the Dense layer from 20 to 40
model.add(tf.keras.layers.Dropout(rate = 0.5))
model.add(tf.keras.layers.Dense(20, activation = 'relu'))
model.add(tf.keras.layers.Dropout(rate = 0.5))
model.add(tf.keras.layers.Dense(10, activation = 'relu'))
model.add(tf.keras.layers.Dropout(rate = 0.5))
# 6.3 Experiment first with a sigmoid activation
#     and then with no activation function
model.add(tf.keras.layers.Dense(1, activation = 'sigmoid'))  # Keep sigmoid; then remove sigmoid
# 6.4 Model summary:
model.summary()
# 6.5 Compile model
#     Expt with adam
model.compile(
              loss = 'binary_crossentropy',
              optimizer = 'adam',    # Try first with the default optimizer and then
                                     # with 'adam'; it may not make much difference
              metrics = ['acc']
             )
# 7.0 Train the model
history1 = model.fit(X_train, y_train,
                     epochs = 70,
                     validation_data = (X_test, y_test)
                     )
# 7.1
model.evaluate(X_test,y_test)
type(history.history)
history.history.keys()
# Plot without dropouts
# ('history' is assumed to hold an earlier run of this network without the Dropout layers)
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(loss)
plt.plot(val_loss)
# Plot with dropouts
loss = history1.history['loss']
val_loss = history1.history['val_loss']
plt.plot(loss, label = "train_loss")
plt.plot(val_loss, label = "val_loss")
plt.legend()
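To complement the loss curves, the accuracy curves from the same history object (keys 'acc' and 'val_acc', given metrics=['acc']) can be plotted the same way:
# Hedged addition: accuracy curves for the run with dropouts
plt.figure()
plt.plot(history1.history['acc'], label='train_acc')
plt.plot(history1.history['val_acc'], label='val_acc')
plt.legend()
plt.show()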
# InfluxDB: start the CLI and explore the database
influx
show databases
use tp
show measurements
select * from tp limit 3
# SQ1.1 select max(complaints), city from busdata group by city
# SQ1.2. select mean(ratingGiven) into rateDriver from busdata group by driver
# SQ1.3. select mean(ratingGiven) from busdata where expHours = 'vhigh'
# SQ1.4. select mean(ratingGiven) from busdata where expHours != 'vhigh'
# SQ1.5: select ticketPerPassenger * passengers as tp into earnings from busdata group by *
select sum(tp) from earnings
# SQ1.6: select ticketPerPassenger * passengers as tp into earnings from busdata group by *
select sum(tp) from earnings where "city" = 'delhi'
# SQ1.7: select ticketPerPassenger * passengers as tp into earnings from busdata group by *
select sum(tp) from earnings where city = 'delhi' or "city" = 'jaipur'
# SQ1.8: select ticketPerPassenger * passengers as tp into earnings from busdata group by *
select median(tp) from earnings where driver = 'ahmed'
# SQ1.9: select stddev(complaints) from busdata
# SQ1.10: select count(*) from busdata where complaints > 3 * 3.141
# SQ1.11: select ticketPerPassenger * passengers/complaints, driver from busdata
# SQ1.12: select mean("complaints") ,stddev("complaints") from busdata
Use these values (mean ≈ 9.95, stddev ≈ 3.20) to calculate the z-score:
select ("complaints" - 9.95) / 3.20 from busdata
# SQ1.13: select (mean(complaints) - median(complaints))/stddev(complaints) from busdata
# SQ1.14: select ratingGiven/complaints, driver from busdata
# YOUR DASHBOARD SHOULD REFRESH EVERY 20 SECONDS to plot incoming data
# Grafana UI:        http://localhost:3000
# InfluxDB HTTP API: http://localhost:8086
# Q2.1: select passengers, complaints,ratingGiven,ticketPerPassenger from busdata where time >= '2023-11-15'
# Q2.2: select "complaints" from busdata where time > now() - 1h and "driver" = 'naik'
# Q2.3: select stddev("ratingGiven") from busdata group by time(30s)
# Q2.4: Step1: Use into...group by *
select "ticketPerPassenger" * "passengers" as totalearnings into tp from "busdata" group by *
Step2: Write the following query to grafana to get results
select sum("totalearnings") from "tp" group by time(30s)
# Q2.5: select mean("ticketPerPassenger") from busdata group by time(2m), "city"
# Q2.6:Step1: Use into...group by *
select "ticketPerPassenger" * "passengers" as totalearnings into tp from "busdata" group by *
Step 2: Write the following in grafana and display using 'stat'
select sum(totalearnings) from tp
# Q2.7: Step1: Use into...group by *
select "ticketPerPassenger" * "passengers" as totalearnings into tp from "busdata" group by *
Step 2: Write the following in grafana and display using 'gauge'
select sum(totalearnings) from tp group by "city"
# Q2.8 select max(passengers) from busdata group by time(40s)