Exploratory Data Analysis of Coffee Reviews Dataset

In this blog post, we’ll walk through an exploratory data analysis (EDA) of a coffee reviews dataset. Using Python, we’ll analyze coffee reviews from around the world, examining different attributes to uncover insights and trends. By the end, you’ll see how data analysis can reveal patterns in coffee ratings, preferences, and pricing.

Overview of the Coffee Dataset

The dataset is from Kaggle and can be downloaded here.

This dataset contains a collection of coffee reviews, each detailing various attributes related to specific coffee brands, their roasters, and ratings. Key columns in this dataset include:

name: Brand name of the coffee
roaster: Name of the company or person roasting the coffee
roast: Type of roast, such as Light or Medium
loc_country: Country of the roaster
origin: Origin country of the coffee beans
100g_USD: Price per 100 grams in USD
rating: Rating of the coffee, on a numerical scale
review_date: Date of the coffee review
review: Detailed text review of the coffee

Each of these columns gives us something to explore, and we’ll use Python to analyze how these attributes relate to coffee ratings and what we can learn about the coffee industry.

Setup and Imports

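We start by importing the necessary libraries and loading the dataset into a DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

# Load the dataset (same file name as in the full code at the end of the post)
df = pd.read_csv('simplified_coffee.csv')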

These libraries help us load, analyze, and visualize our data. We’ll use pandas for data handling, matplotlib and seaborn for visualizations, and CountVectorizer from sklearn to help analyze text data in the reviews.
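Before running the analysis, it can help to confirm that the file actually contains the columns the code relies on. The sketch below is one possible way to do that. The helper name `load_reviews` and the `EXPECTED_COLUMNS` set are illustrative assumptions, not part of the original notebook, and the inline CSV is made-up data that stands in for the real file so the example runs on its own.

```python
import io
import pandas as pd

# Columns that the analysis in this post reads (hypothetical checklist).
EXPECTED_COLUMNS = {'roaster', 'roast', 'origin', '100g_USD', 'rating', 'review'}

def load_reviews(source):
    """Read a coffee reviews CSV and fail early if a required column is missing.

    `source` can be a file path or any file-like object accepted by pd.read_csv.
    """
    df = pd.read_csv(source)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df

# Tiny made-up sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "roaster,roast,origin,100g_USD,rating,review\n"
    "Roaster A,Light,Ethiopia,5.5,94,Bright and floral\n"
    "Roaster B,Medium,Colombia,4.0,92,Sweet and balanced\n"
)
sample_df = load_reviews(sample_csv)
print(sample_df.dtypes)
```

With the real data, the same helper could be called as `load_reviews('simplified_coffee.csv')`.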

1. How Do Coffee Ratings Vary by Roast Type?

Our first exploration focuses on understanding how the type of roast influences the ratings. Do certain roasts consistently receive higher ratings than others?

To answer this, we’ll calculate the median rating for each roast type:

df_grouped = df.groupby('roast')['rating'].median().reset_index().sort_values('rating', ascending=False)

sns.barplot(x=df_grouped.roast, y=df_grouped.rating, palette='flare', hue=df_grouped.rating)
plt.title('Median Coffee Ratings by Roast Type')
plt.ylabel('Median Rating')
plt.ylim((80, 100))
plt.show()

This bar plot shows which roast types have the highest median rating, which indicates whether reviewers tend to favor lighter or darker roasts. Note that the y-axis starts at 80 rather than 0, so the differences between bars look larger than they are.
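A median on its own does not say how many reviews stand behind it, and a roast type with only a handful of reviews can easily end up at the top or bottom of the chart. As a rough sketch, the grouping can be extended to report the review count and the mean next to the median. The `toy` DataFrame below is made-up data for illustration, not values from the real dataset.

```python
import pandas as pd

# Made-up ratings for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'roast': ['Light', 'Light', 'Light', 'Medium', 'Medium', 'Dark'],
    'rating': [95, 93, 94, 92, 90, 88],
})

# Median, mean and number of reviews per roast type, highest median first.
roast_summary = (
    toy.groupby('roast')['rating']
       .agg(['median', 'mean', 'count'])
       .sort_values('median', ascending=False)
)
print(roast_summary)
```

On the real data, replacing `toy` with `df` gives the same table for every roast type, which makes it easier to spot medians that rest on very few reviews.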

2. Is There a Correlation Between Price and Rating?

Next, we’ll explore the relationship between price per 100 grams and coffee rating to see if higher-priced coffees generally receive better ratings.

First, let’s calculate the correlation coefficient:

df[['100g_USD', 'rating']].corr()

Then, we’ll visualize this relationship with a scatter plot and add a regression line for better clarity:

plt.figure(figsize=(8,8))
sns.scatterplot(x=df['100g_USD'], y=df['rating'], hue=df['roast'], palette='viridis')
sns.regplot(x=df['100g_USD'], y=df['rating'], scatter=False, color='red', line_kws={"linewidth": 2})
plt.title('Relationship of Rating and Price')
plt.show()

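The correlation coefficient measures how strongly price and rating move together: values near 0 indicate little linear relationship, and values near 1 a strong positive one. The regression line shows the overall trend. Together they tell us whether more expensive coffees tend to score higher, although a positive correlation does not by itself show that a higher price causes a higher rating.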
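The default in `corr()` is the Pearson coefficient, which measures linear association and can be pulled around by a few very expensive coffees. A rank-based (Spearman) coefficient is a common complement: it is the Pearson correlation computed on the ranks of the values. The sketch below uses made-up numbers to show the difference; it is an illustration of the idea, not a result from the dataset.

```python
import pandas as pd

# Made-up prices and ratings; the last coffee is a deliberate price outlier.
toy = pd.DataFrame({
    '100g_USD': [3.0, 4.0, 5.0, 8.0, 40.0],
    'rating':   [90, 91, 92, 93, 94],
})

# Pearson: linear association on the raw values.
pearson = toy['100g_USD'].corr(toy['rating'])

# Spearman: Pearson correlation of the ranks, so only the ordering matters.
spearman = toy['100g_USD'].rank().corr(toy['rating'].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the rating rises with every price step, so the rank-based coefficient is at its maximum, while the outlier keeps the Pearson coefficient noticeably lower. On the real data, comparing the two gives a sense of how much a few extreme prices drive the result.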

3. Distribution of Ratings Across Coffee Brands

Our third question looks at the distribution of ratings across coffee brands, using the roaster as the brand. Which roasters consistently receive high marks?

To focus on the most frequently reviewed roasters, we’ll plot the ratings for roasters with more than 80 reviews:

# Count reviews per roaster and keep those with more than 80
roaster_counts = df['roaster'].value_counts()
main_roasters = roaster_counts[roaster_counts > 80].index
sns.kdeplot(data=df[df.roaster.isin(main_roasters)], x='rating', hue='roaster', fill=True, palette='autumn')
plt.title('Distribution of Ratings Across Main Coffee Brands')
plt.show()

This distribution plot allows us to compare the reputation of popular roasters and see if some brands tend to score higher than others.
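Overlapping density curves can be hard to read precisely, so a small summary table is a useful companion to the plot. The sketch below filters roasters by review count and then summarizes their ratings. The `toy` data and the name `REVIEW_THRESHOLD` are assumptions for illustration; the threshold is lowered from the post's 80 so that the made-up data has matches.

```python
import pandas as pd

# Made-up reviews for illustration only.
toy = pd.DataFrame({
    'roaster': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'rating':  [94, 95, 96, 90, 92, 94, 97],
})

REVIEW_THRESHOLD = 2  # the post uses 80 on the real dataset

# Roasters with more reviews than the threshold.
counts = toy['roaster'].value_counts()
main_roasters = counts[counts > REVIEW_THRESHOLD].index

# Count, median and range of ratings for each of those roasters.
roaster_summary = (
    toy[toy['roaster'].isin(main_roasters)]
       .groupby('roaster')['rating']
       .agg(['count', 'median', 'min', 'max'])
)
print(roaster_summary)
```

Roaster C has a single review and is dropped, which mirrors how the plot excludes rarely reviewed roasters.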

4. Which Countries Produce the Best Coffee?

Another interesting question to explore is which countries produce the best-rated coffees. By grouping the data by the origin country of the beans, we can see where the top-rated coffees are grown.

We’ll calculate the average rating for each country with at least 10 reviews:

df_origin = df.groupby('origin')['rating'].agg(['mean', 'count']).reset_index().query('count >= 10').sort_values('mean', ascending=False)

And visualize these ratings by country:

plt.figure(figsize=(18, 7))
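# Assigning hue with legend=False keeps the palette without seaborn's deprecation warning (seaborn >= 0.13)
sns.barplot(data=df_origin, x='origin', y='mean', hue='origin', palette='flare', legend=False)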
plt.ylim((80, 100))
plt.ylabel('Mean Rating')
plt.title('Average Rating per Country')
plt.xticks(rotation=45)
plt.show()

This bar plot shows which origin countries have the highest mean ratings among those with at least 10 reviews, giving a first view of how ratings differ across coffee-growing regions.
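Mean ratings for countries with few reviews are less stable than those for countries with many. One hedged way to make that visible is to report the standard error of the mean alongside each average. The sketch below uses made-up data, and `MIN_REVIEWS` is a hypothetical name; it is set to 2 here, where the post uses 10.

```python
import pandas as pd

# Made-up reviews for illustration only.
toy = pd.DataFrame({
    'origin': ['Ethiopia', 'Ethiopia', 'Kenya', 'Kenya', 'Kenya', 'Peru'],
    'rating': [94, 96, 93, 93, 93, 91],
})

MIN_REVIEWS = 2  # the post uses 10 on the real dataset

# Mean rating, review count and standard error of the mean per origin.
origin_summary = toy.groupby('origin')['rating'].agg(['mean', 'count', 'sem'])

# Keep origins with enough reviews and rank them by mean rating.
origin_summary = (
    origin_summary[origin_summary['count'] >= MIN_REVIEWS]
    .sort_values('mean', ascending=False)
)
print(origin_summary)
```

When two countries' means differ by less than their standard errors, the ordering between them in the bar chart should not be read as meaningful.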

5. What Are the Frequent Keywords in the Reviews of the Highest-Rated Coffees?

Lastly, we’ll explore the most common keywords in reviews of the highest-rated coffees. This will help us identify which flavors or qualities reviewers frequently mention in top-rated coffees.

First, we filter for reviews with a rating above 95:

high_rated = df[df['rating'] > 95]

Then, we use CountVectorizer to extract and count keywords:

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(high_rated['review'])

keywords = X.sum(axis=0)
words_freq = [(word, keywords[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

Finally, we display the top 10 most frequent keywords:

top_keywords = pd.DataFrame(words_freq[:10], columns=['Word', 'Count'])
plt.figure(figsize=(18, 7))
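# Assigning hue with legend=False keeps the palette without seaborn's deprecation warning (seaborn >= 0.13)
sns.barplot(data=top_keywords, x='Word', y='Count', hue='Word', palette='flare', legend=False)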
plt.title('Word Count in Top Rated Coffee Reviews')
plt.xticks(rotation=45)
plt.show()

This keyword analysis provides insights into common attributes in highly rated coffees, helping us understand what makes a coffee stand out.
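To make the counting step less of a black box, here is a minimal standard-library sketch of roughly what it does: lowercase each review, split it into tokens of two or more word characters (the same pattern CountVectorizer uses by default), drop stop words, and tally what remains. The function name `count_keywords`, the short `STOP_WORDS` set, and the two reviews are all made up for illustration; scikit-learn's English stop word list is much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the real one in scikit-learn is far larger.
STOP_WORDS = {'a', 'and', 'the', 'of', 'with', 'in', 'is'}

# Made-up review texts for illustration only.
reviews = [
    "Bright and floral, with notes of jasmine and honey.",
    "Sweet honey and cocoa in the cup; floral finish.",
]

def count_keywords(texts, stop_words=STOP_WORDS):
    """Count how often each non-stop-word token appears across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep tokens made of two or more word characters.
        tokens = re.findall(r'\b\w\w+\b', text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

keyword_counts = count_keywords(reviews)
print(keyword_counts.most_common(3))
```

Because the counts are raw frequencies, generic review vocabulary tends to rank high, so the top keywords are best read as a starting point rather than a definitive flavor profile.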

Conclusion

In this exploration of the coffee reviews dataset, we worked through five questions, from how roast type relates to ratings to which origin countries score highest. We looked at how price correlates with rating and at the keywords that appear most often in high-rated reviews. Analyses like these surface patterns that can be useful for coffee enthusiasts, producers, and marketers alike.

Thank you for reading, and stay tuned for more data explorations!

Full Code

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
import warnings

# Suppress warnings to keep the output clean
warnings.filterwarnings('ignore')

# Load the coffee dataset into a pandas DataFrame
df = pd.read_csv('simplified_coffee.csv')

# Display dataset information, including column types and missing values
df.info()
# Preview the first few rows of the dataset
df.head()

# QUESTION 1: How Do Coffee Ratings Vary by Roast Type?
# Group the data by 'roast' type and calculate the median rating for each roast
df_grouped = df.groupby('roast')['rating'].median().reset_index().sort_values('rating', ascending=False)

# Plot a bar chart to visualize the median ratings by roast type
sns.barplot(x=df_grouped.roast, y=df_grouped.rating, palette='flare', hue=df_grouped.rating)
plt.title('Median Coffee Ratings by Roast Type')  # Set the plot title
plt.ylabel('Median Rating')  # Label the y-axis
plt.ylim((80, 100))  # Set y-axis limits for better visualization
plt.show()

# QUESTION 2: Is There a Correlation Between Price and Rating?
# Calculate the correlation between the '100g_USD' (price) and 'rating' columns
# (printed explicitly so the result is visible when run as a script)
print(df[['100g_USD', 'rating']].corr())

# Create a scatter plot to visualize the relationship between price and rating
plt.figure(figsize=(8,8))  # Set figure size for better readability
sns.scatterplot(x=df['100g_USD'], y=df['rating'], hue=df['roast'], palette='viridis')  # Scatter plot with roast type coloring
sns.regplot(x=df['100g_USD'], y=df['rating'], scatter=False, color='red', line_kws={"linewidth": 2})  # Add regression line in red
plt.title('Relationship of Rating and Price')  # Set the plot title
plt.show()

# QUESTION 3: What is the Distribution of Ratings Across Coffee Brands?
# Filter for the most popular roasters with more than 80 reviews
roaster_counts = df['roaster'].value_counts()  # Number of reviews per roaster
main_roasters = roaster_counts[roaster_counts > 80].index  # Roasters with more than 80 reviews

# Plot the distribution of ratings for these main coffee brands
sns.kdeplot(data=df[df.roaster.isin(main_roasters)], x='rating', hue='roaster', fill=True, palette='autumn')
plt.title('Distribution of Ratings Across Main Coffee Brands')  # Set the plot title
plt.show()

# QUESTION 4: Which Countries Produce the Best Coffee?
# Group the data by 'origin' country, calculating mean rating and review count for each country
df_origin = df.groupby('origin')['rating'].agg(['mean', 'count']).reset_index().query('count >= 10').sort_values('mean', ascending=False)

# Plot a bar chart to show the average rating per country
plt.figure(figsize=(18, 7))  # Set figure size for readability
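# Bar plot with origin on x-axis and mean rating on y-axis; hue + legend=False avoids the palette deprecation warning (seaborn >= 0.13)
sns.barplot(data=df_origin, x='origin', y='mean', hue='origin', palette='flare', legend=False)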
plt.ylim((80, 100))  # Set y-axis limits
plt.ylabel('Mean Rating')  # Label the y-axis
plt.title('Average Rating per Country')  # Set the plot title
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()

# QUESTION 5: What Are the Frequent Keywords in the Reviews of the Highest-Rated Coffees?
# Filter the dataset for coffees with a rating above 95
high_rated = df[df['rating'] > 95]

# Initialize CountVectorizer to extract keywords from the 'review' column
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(high_rated['review'])  # Transform the reviews to keyword counts

# Calculate the frequency of each keyword in high-rated reviews
keywords = X.sum(axis=0)
words_freq = [(word, keywords[0, idx]) for word, idx in vectorizer.vocabulary_.items()]  # Create a list of (word, count) pairs
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)  # Sort the words by frequency in descending order

# Create a DataFrame for the top 10 most frequent keywords
top_keywords = pd.DataFrame(words_freq[:10], columns=['Word', 'Count'])

# Plot a bar chart of the top 10 keywords by frequency
plt.figure(figsize=(18, 7))  # Set figure size for readability
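# Bar plot with words on x-axis and counts on y-axis; hue + legend=False avoids the palette deprecation warning (seaborn >= 0.13)
sns.barplot(data=top_keywords, x='Word', y='Count', hue='Word', palette='flare', legend=False)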
plt.title('Word Count in Top Rated Coffee Reviews')  # Set the plot title
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()