Exploratory Data Analysis of World Cities Dataset

In this article I’ll demonstrate an exploratory data analysis of the World Cities dataset using Python. We’ll use Pandas, Matplotlib, and Seaborn to explore, clean, and visualize global city population data. You’ll see how to identify the largest cities, examine population distributions, and present the results with several kinds of plots.

About the Dataset

The dataset we are analyzing is the Simple Maps World Cities Dataset. It includes each city’s geographic location (latitude and longitude), population, country, and capital status. This lets us explore patterns in urban populations, compare large and small cities, and look at regional population distributions.

Loading and Inspecting the Data

To start the analysis, we load the data into a pandas DataFrame and inspect the structure of the dataset:

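# Imports used by the snippets throughout this article
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.read_csv('worldcities.csv')
df.shape
df.head()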

After loading the data, we check its shape and take a glimpse at the first few rows. The dataset contains multiple columns, including city names, populations, and geographical coordinates. This initial inspection gives us a sense of the data we are working with.
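Beyond shape and head(), a few more calls give a quick structural overview. The following is a minimal sketch on a made-up three-row DataFrame (the values are invented; only the column names mirror the ones used in this article), showing how column types and summary statistics can be checked:

```python
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns (values are invented).
toy = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma"],
    "country": ["X", "X", "Y"],
    "lat": [10.0, -5.5, 48.2],
    "lng": [20.0, 30.1, 16.4],
    "population": [1_200_000, 85_000, 430_000],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape

# Data type of each column (text columns vs. numeric columns).
column_types = toy.dtypes

# Count, mean, min, max, and quartiles of the population column.
population_summary = toy["population"].describe()

print(f"{n_rows} rows x {n_cols} columns")
print(column_types)
print(population_summary)
```

Running the same three calls on df should work the same way, though the output naturally depends on the version of the CSV you downloaded.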

Handling Missing Data

In most datasets, missing values are inevitable, and it’s crucial to understand where those gaps exist. Using a heatmap, we visualize missing data to determine the extent of incomplete information in the dataset.

plt.figure(figsize=(15,5))
sns.heatmap(df.isnull(), yticklabels=False, cmap='plasma')
plt.xticks(rotation=45)
plt.show()

This heatmap allows us to visualize missing values and decide how to handle them. In this case, we will fill missing population values with the median population to maintain data consistency.
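The median fill itself can be sketched as follows. This is an illustration on invented toy data rather than the real file; it counts missing values per column (the numbers behind the heatmap) and then replaces missing populations with the median of the observed ones:

```python
import numpy as np
import pandas as pd

# Toy stand-in for worldcities.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D", "E"],
    "population": [500_000, np.nan, 120_000, 3_000_000, np.nan],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()

# The median ignores missing values by default.
median_population = toy["population"].median()

# Assign the filled column back instead of calling fillna(inplace=True) on the column.
toy["population"] = toy["population"].fillna(median_population)

print(missing_counts)
print(median_population)
```

Assigning the result back to the column avoids relying on in-place modification of a column selection, which recent pandas versions warn about.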

Visualizing the Largest Cities

A key focus of this analysis is identifying the world’s largest cities by population. We use the nlargest() function to find the 12 cities with the highest populations and visualize them with a bar plot.

top_cities = df.nlargest(12, 'population')
plt.figure(figsize=(15,5))
sns.barplot(data=top_cities, x='city', y='population', palette='winter')
plt.xticks(rotation=45)
plt.title('Top 12 Cities with the Largest Population')
plt.show()

This bar plot ranks the largest cities side by side, making it easy to compare how their populations differ.

Distribution of Smaller Cities

While large cities are interesting, the majority of cities have smaller populations. To better understand this, we filter the dataset to include only cities with populations of 1 million or fewer. We then plot the distribution of these smaller cities.

filtered_data = df[df['population'] <= 1000000]
sns.displot(filtered_data.population, bins=100, height=5, aspect=3)
plt.xticks(np.arange(0,1000000,20000), rotation=90)
plt.title('Population Distribution of Cities Under 1 Million')
plt.show()

This distribution plot shows how population sizes are spread among smaller cities, revealing that the majority of cities fall within the lower population ranges.
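To put numbers on what the histogram shows, the populations can be grouped into bands and the share of cities in each band computed. The sketch below uses invented populations, and the band edges are an arbitrary choice for illustration, not something taken from the dataset:

```python
import pandas as pd

# Invented populations standing in for filtered_data.population.
populations = pd.Series(
    [12_000, 45_000, 80_000, 150_000, 400_000, 950_000], name="population"
)

# Band edges and labels are arbitrary choices for this sketch.
edges = [0, 50_000, 100_000, 500_000, 1_000_000]
labels = ["<=50k", "50k-100k", "100k-500k", "500k-1M"]

# pd.cut assigns each value to a band; intervals include their right edge by default.
bands = pd.cut(populations, bins=edges, labels=labels)

# Share of cities per band, ordered from the smallest band to the largest.
share_per_band = bands.value_counts(normalize=True).sort_index()

print(share_per_band)
```

On the real data, the same approach could be applied to filtered_data.population with whatever band edges suit the question being asked.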

Countries with the Most Large Cities

Next, we explore which countries have the highest number of cities with populations over 1 million. This can help identify regions with a high level of urbanization and concentration of large cities.

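# Count distinct city names per country among cities over 1 million, then keep the top 10
major_cities_grouped = (
    df[df.population > 1000000]
    .groupby('country')['city']
    .nunique()
    .reset_index()
    .nlargest(10, 'city')  # nlargest already returns rows in descending order
)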
plt.figure(figsize=(15,5))
sns.barplot(data=major_cities_grouped, x='country', y='city', palette='flare')
plt.xticks(rotation=45)
plt.title('Countries with the Most Cities Over 1 Million Population')
plt.show()

This bar plot shows which countries have the most cities with populations exceeding 1 million. The results highlight where large urban centers are concentrated.

Exploring US Cities

Focusing specifically on the United States, we filter the dataset to include only cities with populations above 100,000. We then plot the geographical distribution of these cities.

us_cities_over_100k = df.query('country == "United States" & population > 100000')
plt.figure(figsize=(10,10))
sns.scatterplot(
    data=us_cities_over_100k, x='lng', y='lat',
    hue='population', size='population', legend=False,
    palette='coolwarm', hue_norm=(100000, 2000000), sizes=(10, 500),
)
plt.title('Geographical Distribution of US Cities with Over 100k Population')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

This scatter plot shows the distribution of major US cities across the country, with dot size and color scaled by population.

Third Largest City in Each Country

Finally, for each country, we identify the third-largest city by population. This allows us to explore medium-sized cities and their role in national demographics.

df_sorted = df.sort_values(['country','population'], ascending=[True, False])
df_sorted_third = df_sorted.groupby('country').nth(2).reset_index()
df_sorted_third.query('country in ["Spain", "Brazil", "Japan"]')

This query provides a look at the third-largest cities in Spain, Brazil, and Japan, offering insights into important but less prominent cities.
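How the "third row per group" selection works is easier to see on a small example. The sketch below uses invented data and a different but equivalent technique, groupby().cumcount(), which numbers the rows within each country starting at 0. It is shown here because the shape of the result returned by nth() differs between pandas versions, whereas a boolean mask built from cumcount() behaves the same way across versions:

```python
import pandas as pd

# Invented data: country Y has only two cities, so it has no third-largest city.
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Y", "Z", "Z", "Z"],
    "city": ["x1", "x2", "x3", "x4", "y1", "y2", "z1", "z2", "z3"],
    "population": [900, 700, 500, 300, 800, 100, 50, 60, 40],
})

# Sort by country, then by population from largest to smallest.
toy_sorted = toy.sort_values(["country", "population"], ascending=[True, False])

# cumcount numbers rows within each country from 0, so position 2 is the third largest.
position = toy_sorted.groupby("country").cumcount()
third_largest = toy_sorted[position == 2].reset_index(drop=True)

print(third_largest)
```

Countries with fewer than three cities simply drop out of the result, which is also what happens with nth(2).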

Conclusion

Through this exploratory data analysis, we have uncovered key insights into global city populations. From identifying the largest cities in the world to comparing capital and non-capital populations, the steps shown here illustrate a general workflow for exploring a new dataset.

Full Code

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Disable scientific notation globally
mpl.rcParams['axes.formatter.useoffset'] = False
mpl.rcParams['axes.formatter.use_locale'] = False
mpl.rcParams['axes.formatter.limits'] = (-10, 10)  
df = pd.read_csv('worldcities.csv')
df.shape
df.head()
plt.figure(figsize=(15,5))
sns.heatmap(df.isnull(),yticklabels=False,cmap='plasma')
plt.xticks(rotation=45)
plt.show()
top_cities = df.nlargest(12, 'population')

plt.figure(figsize=(15,5))
sns.barplot(data=top_cities, x='city', y='population',palette='winter')
plt.xticks(rotation=45)
plt.title('Top 12 Cities with the Largest Population')
plt.show()
df.population.median()
filtered_data = df[df['population'] <= 1000000]
sns.displot(filtered_data.population, bins=100, height=5, aspect=3)
plt.xticks(np.arange(0,1000000,20000), rotation=90)
plt.title('Distribution of Population')
plt.show()
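# Fill missing populations with the median; assign back rather than using inplace=True on a column selection
df['population'] = df['population'].fillna(df['population'].median())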


major_cities_grouped = df[df.population > 1000000].groupby('country')['city'].nunique().reset_index().sort_values('city',ascending=False) \
.nlargest(10,'city')

plt.figure(figsize=(15,5))
sns.barplot(data=major_cities_grouped, x='country', y='city',palette='flare')
plt.xticks(rotation=45)
plt.title('Cities with over 1Mio.')
plt.show()
median_population_country = df.groupby('country')['population'].median().reset_index() \
.sort_values('population',ascending = False).head(10)
median_population_country
us_cities_over_100k = df.query('country == "United States" & population > 100000')
us_cities_over_100k

plt.figure(figsize=(10,10))
sns.scatterplot(data=us_cities_over_100k, x='lng', y='lat', hue='population', legend=False, size='population', palette='coolwarm', hue_norm=(100000, 2000000), sizes=(10, 500))
plt.title('Geographical Distribution of Cities')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
df.sample(5)
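# groupby drops rows whose key is missing, so label cities without a capital status first
# (assumes non-capital cities have an empty 'capital' field in the CSV)
capital_comparison = df.assign(capital=df['capital'].fillna('none')).groupby('capital')['population'].mean().reset_index()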
sns.barplot(data=capital_comparison, x='capital', y='population',palette='mako')
plt.title('Average Population: Capital vs Non-Capital Cities')
plt.show()
cities_sorted =  df.sort_values('population', ascending =False).reset_index()
cities_sorted['Rank'] = np.arange(1,len(cities_sorted)+1)
cities_sorted.query('city == "Stockholm"')
df_sorted = df.sort_values(['country','population'], ascending = [True,False])

df_sorted_third = df_sorted.groupby('country').nth(2).reset_index()
df_sorted_third.query('country in ["Spain","Brazil","Japan"]')