Similarity Assessment
Next, I wanted to look at the similarities between each batch of generated reviews and the original reviews. To do this, we can use cosine similarity to measure how similar the sentence vectors from each source are. First, we can write a helper function that transforms our two sentences into vectors using TfidfVectorizer() and then calculates the cosine similarity between the two new sentence vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(sentence1, sentence2):
    """
    A function that accepts two sentences as input and outputs their cosine
    similarity.
    Inputs:
        sentence1 (str): A string of words
        sentence2 (str): A string of words
    Returns:
        cosine_sim: Cosine similarity score for the two input sentences
    """
    # Initialize the TfidfVectorizer
    vectorizer = TfidfVectorizer()
    # Create the TF-IDF matrix for the two sentences
    tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
    # Calculate the cosine similarity between the two rows of the matrix
    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    return cosine_sim[0][0]
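As a quick sanity check, we can call the function on two short sentences (these two are made up for illustration, not taken from the dataset):

# Overlapping terms like "battery", "life", and "great" push the score above 0
print(calculate_cosine_similarity(
    "The battery life on this phone is great.",
    "Great battery life, and the phone feels fast."))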
One problem I had was that the datasets were now so big that the calculations were taking too long (and sometimes I did not have enough RAM on Google Colab to continue). To combat this issue, I randomly sampled 200 reviews from each of the datasets for the similarity calculation.
from random import sample

# Randomly sample 200 reviews from each source
o_review = sample(reviews_dict['original review'], 200)
p_review = sample(reviews_dict['fake positive review'], 200)
n_review = sample(reviews_dict['fake negative review'], 200)

r_dict = {'original review': o_review,
          'fake positive review': p_review,
          'fake negative review': n_review}
Now that we have the randomly selected samples, we can look at cosine similarities between different combinations of the datasets.
import numpy as np
import pandas as pd

# Cosine similarity calculation between each pair of sources
source = ['original review', 'fake negative review', 'fake positive review']
source_to_compare = ['original review', 'fake negative review', 'fake positive review']
avg_cos_sim_per_word = {}
for s in source:
    for s2 in source_to_compare:
        if s != s2:
            count = []
            # Compare every sampled review from s with every sampled review from s2
            for sent in r_dict[s]:
                for sent2 in r_dict[s2]:
                    similarity = calculate_cosine_similarity(sent, sent2)
                    count.append(similarity)
            avg_cos_sim_per_word['{0} to {1}'.format(s, s2)] = np.mean(count)

results = pd.DataFrame(avg_cos_sim_per_word, index=[0]).T
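To read the results at a glance, the DataFrame can be sorted from most to least similar (a small follow-on to the snippet above):

# Sort the pairwise averages; column 0 holds the mean cosine similarity
print(results.sort_values(by=0, ascending=False))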
Compared with the original dataset, the fake negative reviews were the more similar of the two. My hypothesis is that this is due to my using more prompts to create the negative reviews than the positive ones. No surprise, the ChatGPT-generated reviews showed the highest similarity to one another.
Great, we have the cosine similarities, but is there another step we can take to assess the similarity of the reviews? There is! Let's visualize the sentences as vectors. To do this, we must embed the sentences (turn them into vectors of numbers) so that we can visualize them in 2D space. I used spaCy to embed the sentences and matplotlib to visualize them.
import spacy

# Load spaCy's large English model, which ships with pre-trained word vectors
nlp = spacy.load('en_core_web_lg')

source_embeddings = {}
for source, source_sentences in reviews_dict.items():
    source_embeddings[source] = []
    for sentence in source_sentences:
        # Tokenize the sentence using spaCy
        doc = nlp(sentence)
        # Retrieve the word embedding for each token
        word_embeddings = np.array([token.vector for token in doc])
        # Save the review's word embeddings under its source
        source_embeddings[source].append(word_embeddings)
import matplotlib.pyplot as plt

def legend_without_duplicate_labels(figure):
    """Add a legend that lists each source label only once."""
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    figure.legend(by_label.values(), by_label.keys(), loc='lower right')

# Plot the first two embedding dimensions, colored by source
fig, ax = plt.subplots()
colors = ['g', 'b', 'r']  # One color per source
for i, (source, embeddings) in enumerate(source_embeddings.items()):
    for embedding in embeddings:
        ax.scatter(embedding[:, 0], embedding[:, 1], c=colors[i], label=source)
legend_without_duplicate_labels(plt)
plt.show()
The good news is that we can clearly see the embeddings and distributions of the sentence vectors closely align. Visual inspection shows more variability in the distribution of the original reviews, supporting the assertion that they are more diverse. Since ChatGPT generated both the positive and negative reviews, we would expect their distributions to be similar. Notice, however, that the fake negative reviews actually have a wider distribution and more variance than the positive ones. Why might this be? It is probably due in part to the fact that I had to trick ChatGPT into creating the fake negative reviews (ChatGPT is designed to say positive statements) and had to provide more prompts to get enough negative reviews vs. positive ones. This helps the dataset because, with the additional diversity, we can train higher-performing machine learning models.
Next, we can inspect the differences in the three different distributions of reviews and see if there are any distinguishing patterns.
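One way to build such a view (a sketch of the idea, not necessarily the exact code behind the original figure) is to collapse each review into a single sentence vector by averaging its token vectors, then project those vectors into 2D with PCA, a dimensionality-reduction technique swapped in here for illustration:

from sklearn.decomposition import PCA

# Collapse each review into one sentence vector by averaging its token vectors
sentence_vectors, labels = [], []
for source, embeddings in source_embeddings.items():
    for embedding in embeddings:
        if len(embedding) > 0:  # skip reviews with no tokens
            sentence_vectors.append(embedding.mean(axis=0))
            labels.append(source)

# Project the sentence vectors onto their first two principal components
coords = PCA(n_components=2).fit_transform(np.array(sentence_vectors))
labels = np.array(labels)

# Plot each source's reviews in the shared 2D space
fig, ax = plt.subplots()
for source, color in zip(source_embeddings.keys(), ['g', 'b', 'r']):
    mask = labels == source
    ax.scatter(coords[mask, 0], coords[mask, 1], c=color, label=source, alpha=0.5)
ax.legend(loc='lower right')
plt.show()

Averaging token vectors is a crude but common way to get one point per review, which makes per-source comparisons much easier to see than plotting every token.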
What do we see? Visually, the bulk of the reviews in each dataset are centered around the origin and span from -10 to 10. This is a positive sign and supports the use of fake reviews for training prediction models. The variances are roughly the same; however, the original reviews had a wider spread in their distribution, both laterally and longitudinally, a proxy for greater lexical diversity within those reviews. The reviews from ChatGPT definitely had similar distributions, but the positive reviews had more outliers. As stated, these distinctions could be a result of the way I prompted the system to generate reviews.