
Data analysis on the Data Science book collection of Amazon

  • Writer: Patrick Dankerlui
  • Apr 16, 2023
  • 7 min read

This is my first Data Analytics/Science project. I am working hard to improve my skills in this area, so I decided to use data analysis to get better at data analysis. I remember that, as a student, I preferred books over classes. Therefore, I decided to find out which books I should get to become better at Data Analysis and Data Science.


I found a beautiful data set covering a large collection of books about data that are available on Amazon. The data set can be found here. If you'd like to take a closer look at what I've done, you can find my script here on Deepnote.


In my analysis I will address the following topics:

  1. Do more expensive books have better reviews?

  2. Do longer books have higher prices?

  3. What are the best Python books, what are the best ML books?

  4. Cluster analysis of book names (TF-IDF K-means and GloVe K-means)

The data set

The data set contains a lot of information about each book. However, for this analysis I have made use of the following:

  • Title

  • Price

  • Average review score

  • Number of reviews

In total there are 830 books in the dataset.

Do more expensive books have better reviews?

If the answer to this question is yes, then there exists some correlation between the price of a book and its review score. Such a correlation can be discovered with a scatter plot. Therefore, I have decided to plot the book price against the average review score. I have also decided to incorporate the actual number of reviews in the scatter plot as the size of each data point, because a book could have a perfect score but only one rating.
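To illustrate, here is a minimal sketch of such a plot in Python. The file name and the column names (price, avg_reviews, n_reviews) are hypothetical; the actual names in the data set may differ.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical file and column names; adjust to the actual data set.
    df = pd.read_csv("books.csv")

    # Point size encodes the number of reviews, so a 5-star book with a
    # single rating stands out less than one with thousands of ratings.
    plt.scatter(df["price"], df["avg_reviews"], s=df["n_reviews"] / 50, alpha=0.5)
    plt.xlabel("Book price ($)")
    plt.ylabel("Average review score")
    plt.title("Book price vs. average review score")
    plt.show()

    # R-squared from a simple linear regression of rating on price.
    result = stats.linregress(df["price"], df["avg_reviews"])
    print(f"R-squared: {result.rvalue**2:.4f}")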

[Figure: Scatter plot of book price vs. average review score]

If there were a correlation between book price and average review score, the data points would follow a pattern, for example an increasing line to indicate a positive correlation. No such pattern can be found in the scatter plot, indicating there is no relationship between the price of a book and the review rating it has received. Furthermore, the R-squared value was found to be 0.0035, meaning that only a tiny fraction of the variation in review rating can be explained by the book price.


What are the best Python books?

To find the best Python books, I have filtered on the book titles containing the word Python (in any capitalization) and sorted these books by average review score and number of reviews. The results are presented below.
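The filter and sort step is short in pandas; a sketch, again assuming the hypothetical column names from the earlier snippet:

    # Case-insensitive filter on titles containing "python",
    # sorted by rating first and review count second.
    python_books = df[df["title"].str.contains("python", case=False, na=False)]
    top10 = python_books.sort_values(
        ["avg_reviews", "n_reviews"], ascending=False
    ).head(10)
    print(top10[["title", "price", "avg_reviews", "n_reviews"]])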


Upon further inspection, it turns out that the most popular book isn't even about the programming language Python, but is a religious book. Considering my interests, I find the book "Python Programming for Beginners: ...." and the book "Intro to Python for Computer Science and ......" the most interesting.

[Figure: Top 10 of the best Python books]




What are the best Machine Learning books?

Analogously, I found the following books to be the best Machine Learning books.

[Figure: Top 10 of the best Machine Learning books]

Cluster analysis of book names (TF-IDF K-means and GloVe K-means)


To get a better understanding of the data set and the subject in general, I tried to find out whether there are any other major subjects, like "Machine Learning" or "Python", in the data set.


K-means

K-means is one of the most popular unsupervised machine learning algorithms for clustering data points. The algorithm aims to partition the data set into k clusters by grouping similar data points together. It starts by randomly selecting k initial cluster centers and assigning each data point to its closest center. The centroids of the clusters are then updated based on the mean of the data points in each cluster. The algorithm repeats this process until convergence, when the cluster assignments no longer change. The output is k clusters, each represented by its centroid, with each data point assigned to the closest cluster.
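As a minimal illustration of the algorithm, here is K-means applied to a handful of 2-D points with scikit-learn:

    import numpy as np
    from sklearn.cluster import KMeans

    # Six points forming two obvious groups.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
    print(kmeans.labels_)           # cluster assignment per point
    print(kmeans.cluster_centers_)  # centroid of each cluster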


TF-IDF

However, before we can use K-means we have to express the information contained in the book titles numerically. A common method for doing this is called TF-IDF, which stands for term frequency-inverse document frequency. It is used to determine the importance of a term within a document or a collection of documents. The method works by multiplying the term frequency (TF), which measures how often a term occurs within a document, by the inverse document frequency (IDF), which measures how rare the term is across the collection of documents. The result is a score that reflects the significance of the term: the higher the TF-IDF score, the more important the term is in the context of the document or collection.
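The computation itself is small enough to write out by hand; a toy example with three made-up titles:

    import math

    # Toy corpus of three "book titles" (hypothetical examples).
    docs = [
        ["python", "machine", "learning"],
        ["python", "for", "beginners"],
        ["statistics", "for", "machine", "learning"],
    ]

    def tf(term, doc):
        # Term frequency: how often the term occurs in this document.
        return doc.count(term) / len(doc)

    def idf(term, docs):
        # Inverse document frequency: rarity of the term in the corpus.
        n_containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / n_containing)

    def tfidf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    # "python" appears in 2 of 3 titles, so its IDF (and TF-IDF) is low;
    # "statistics" appears in only 1 title, so it scores higher there.
    print(tfidf("python", docs[0], docs))
    print(tfidf("statistics", docs[2], docs))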


GloVe

A downside of TF-IDF is that it is unable to capture semantic relationships. While using TF-IDF, I realized that this might be important. Therefore, I have decided to also use a method that is able to "understand" or capture semantic relationships. GloVe is an unsupervised word embedding technique that stands for Global Vectors for Word Representation. It generates dense vector representations of words based on their co-occurrence in large text corpora. The algorithm constructs a global word-word co-occurrence matrix and learns embeddings by optimizing an objective function that makes the dot product of two word vectors approximate the logarithm of their co-occurrence probability. As a result, words with similar meanings have similar vector representations in the embedding space. Pre-trained GloVe models are available, with glove.6B being a popular choice, trained on Wikipedia 2014 and Gigaword 5. glove.6B can be downloaded here.
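A sketch of how book titles can be turned into GloVe vectors, assuming glove.6B.50d.txt has been downloaded and unzipped into the working directory. Averaging the word vectors of a title is one simple pooling choice; others exist.

    import numpy as np

    # Each line of the GloVe file is: word v1 v2 ... v50
    embeddings = {}
    with open("glove.6B.50d.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    def title_vector(title, dim=50):
        # Represent a title as the mean of its word vectors;
        # words missing from GloVe are simply skipped.
        words = [w for w in title.lower().split() if w in embeddings]
        if not words:
            return np.zeros(dim)
        return np.mean([embeddings[w] for w in words], axis=0)

    vec = title_vector("Python Machine Learning")
    print(vec.shape)  # (50,)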


Setting up TF-IDF

There are multiple parameters that can be specified for TF-IDF. The ones specified for this study are listed below; a minimal set-up sketch follows the list.

  • ngram_range: This parameter specifies the range of n-grams to be considered during feature extraction. It takes a tuple of two integers, representing the lower and upper bounds of the range. For example, setting ngram_range=(1, 3) will consider unigrams, bigrams, and trigrams. For this study, ngram_range was specified as 1 to 2, meaning that single words (e.g. Python) and pairs of words (e.g. Data Science) can be considered as a single term. It is important to note that the performance of unigrams alone was better than that of unigrams & bigrams, as can be seen in the figure below. However, allowing only unigrams means that a compound term like Data Science would be split up, placing Data in one cluster and Science in another.

[Figure: performance of unigrams vs. unigrams & bigrams]

  • max_df: This parameter specifies the maximum frequency of a word in the corpus that is allowed to be included in the vocabulary. It can be set to an integer, or to a float between 0 and 1 representing the proportion of documents in which a word may appear and still be included. For this study, a value of 0.1 was found to be optimal.


  • stop_words: This parameter specifies the set of stop words to be removed during feature extraction. It can be set to 'english' to remove a standard set of English stop words, or to a custom list of stop words. Because the book titles are primarily in English, stop_words was specified as 'english'.
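Putting these parameters together, a minimal set-up with scikit-learn's TfidfVectorizer, reusing the DataFrame from the earlier sketch:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams and bigrams, so "data science" can stay one term
        max_df=0.1,            # drop terms appearing in more than 10% of the titles
        stop_words="english",  # built-in English stop-word list
    )
    # fillna guards against missing titles; "title" is a hypothetical column name.
    X = vectorizer.fit_transform(df["title"].fillna(""))
    print(X.shape)  # (number of titles, vocabulary size)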


Finding the optimal number of clusters for K-means with TF-IDF

The optimal number of clusters for a dataset can be determined using the elbow method in combination with TF-IDF and K-means clustering. In the elbow method, the total within-cluster sum of squares (WSS) is calculated for different numbers of clusters, and the number of clusters where the rate of decrease in WSS starts to level off is considered the optimal number of clusters.
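A sketch of the elbow computation, sweeping k over the TF-IDF matrix X from the previous snippet; scikit-learn's inertia_ attribute is exactly the WSS:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(2, 15)
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in ks
    ]

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within-cluster sum of squares (inertia)")
    plt.title("Elbow method")
    plt.show()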


Finding the optimal number of clusters is important in clustering analysis because it determines the appropriate level of granularity in grouping similar data points together. If the number of clusters is too small, the clusters may be too broad and may not capture the subtle differences between the data points. On the other hand, if the number of clusters is too large, the clusters may be too specific and may not capture the broader patterns and relationships in the data.

Determining the optimal number of clusters can help improve the interpretability and effectiveness of the clustering analysis, as it can help identify the natural groupings and patterns in the data without overfitting or underfitting. Additionally, it can help with downstream tasks such as classification, anomaly detection, and data visualization, as it can provide a more meaningful representation of the underlying structure of the data.


Together with finding the optimal number of clusters, I have also considered the optimal value for max_df. The results are summarized below. As can be seen, only at max_df = 0.1 can a very slight elbow be found, at 5 clusters. At these values the silhouette score also shows a local maximum, despite max_df = 0.1 having the highest inertia.


[Figure: elbow and silhouette curves for different values of max_df]

Setting up GloVe for K-means

The GloVe release contains 4 models with different dimensionalities. The higher the dimensionality, the greater the potential for capturing nuances, often resulting in higher accuracy. However, this comes at the expense of lower computational efficiency. Therefore, I have decided to compare the performance of the 4 different files. Since we have already used the inertia/elbow method and the silhouette score to compare the number of clusters and other parameters for the TF-IDF model, I will do the same here.


[Figure: elbow and silhouette curves for the four GloVe models]

As can be seen in the figure above, the simplest model, with only 50 dimensions, shows the best performance when considering inertia alone. However, no clear elbow can be detected for the 50-dimensional model. When we consider the silhouette score, we see that at three clusters the 50-dimensional model also has the best performance. It seems that the 50-dimensional model with three clusters has the best overall performance, although at two and seven clusters the 100-dimensional model also performs favorably.


To better understand the data, we can make a word cloud for each cluster. This also shows whether we have found any major subjects in the data set.
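A minimal sketch of how such per-cluster word clouds can be generated with the wordcloud package. Here titles and labels are hypothetical variable names: the list of book-title strings and the cluster label that the fitted K-means model assigned to each title (e.g. kmeans.labels_).

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # titles: list of book-title strings; labels: cluster label per title.
    for cluster in sorted(set(labels)):
        text = " ".join(t for t, l in zip(titles, labels) if l == cluster)
        wc = WordCloud(width=800, height=400,
                       background_color="white").generate(text)
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Cluster {cluster}")
        plt.show()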


Wordclouds

The first set of word clouds I have generated is for the TF-IDF K-means clustering with 5 clusters. These word clouds do not seem to give any insight, as there is a huge amount of overlap between the clusters. Considering the very low silhouette scores found for all settings of the TF-IDF K-means clusterings, this was to be expected.

[Figure: word clouds for the 5 TF-IDF K-means clusters]

Next, I have made the word clouds for the 50-dimensional GloVe model with three clusters and K-means. The results are presented below.




[Figure: word clouds for the three GloVe K-means clusters]


This seems to show even poorer performance than the TF-IDF model, which is surprising, as the GloVe models in general had much higher silhouette scores. However, it appears that the silhouette scores, also for the GloVe models, were so low that no useful information can be found in the clusters.







 
 
 
