EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is the initial step in data analysis where we explore datasets to understand their structure, patterns, and key characteristics. It helps uncover relationships, anomalies, and trends before applying machine learning or statistical models.
Key Objectives of EDA:
Understand the dataset: shape, size, data types, missing values.
Summarize data distributions: mean, median, variance, outliers.
Identify patterns and correlations between features.
Detect anomalies or unusual observations.
Visualize data for deeper insights using plots and graphs.
Tools & Techniques:
Descriptive Statistics: mean, median, mode, standard deviation.
Data Visualization: histograms, scatter plots, boxplots, heatmaps.
Data Cleaning: handling null values, duplicates, and outliers.
Feature Understanding: identifying categorical vs numerical variables.
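As a rough illustration of the descriptive statistics and outlier handling listed above, here is a minimal sketch using only Python's standard `statistics` module. The sample numbers and the helper names `summarize` and `iqr_outliers` are made up for this example, and the 1.5 × IQR rule is just one common rule of thumb for flagging outliers.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sample: quote lengths in words, with one unusually long entry.
lengths = [2, 4, 4, 5, 7, 9, 40]
stats = summarize(lengths)
outliers = iqr_outliers(lengths)
print(stats)
print(outliers)
```

On real data the same figures usually come from pandas, but the standard-library version keeps the arithmetic visible.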
PART 1: Web Scraping & Data Collection
Libraries Used
requests: Fetches the website's HTML content.
BeautifulSoup: Parses the HTML to extract structured data.
csv: Saves the extracted data into a CSV file.
Script Structure
Uses a try...except block for error handling.
Stores data in a list of dictionaries (author, quote, tags).
A while loop iterates through pages to fetch quotes.
Scraping Process
Starts from page 1 and continues until all pages are processed.
Constructs the page URL dynamically.
Uses requests.get() to retrieve the HTML.
BeautifulSoup extracts:
Author
Quote text
Tags
Each entry is stored as a dictionary and appended to the list.
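The extraction step above might look roughly like the sketch below. The HTML structure and class names (`quote`, `text`, `author`, `tag`) are assumptions made for illustration, since this README does not show the page markup; the function name `parse_quotes` is also hypothetical. In the real script the HTML string would come from `requests.get(url).text` rather than an inline sample.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's tags and class names may differ.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
  <div class="tags"><a class="tag">life</a><a class="tag">humor</a></div>
</div>
"""

def parse_quotes(html):
    """Extract author, quote text, and tags from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="quote"):
        rows.append({
            "author": block.find("small", class_="author").get_text(strip=True),
            "quote": block.find("span", class_="text").get_text(strip=True),
            "tags": [a.get_text(strip=True)
                     for a in block.find_all("a", class_="tag")],
        })
    return rows

quotes = parse_quotes(SAMPLE_HTML)
print(quotes)
```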
Handling Multiple Pages
The script continues as long as there's a "next" button.
Current setup: scrapes up to 10 pages.
This collects all available quotes, provided the site has no more than 10 pages.
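One way to express this pagination logic is sketched below. To keep it runnable without network access, the page fetcher is passed in as a callable; `fetch_page`, `fake_fetch`, and `scrape_all` are hypothetical names, not taken from the original script. A real fetcher would download the page, parse it, and report whether a "next" button is present.

```python
def scrape_all(fetch_page, max_pages=10):
    """Collect quotes page by page until there is no next page or the cap is hit.

    fetch_page(page_number) is a hypothetical callable returning
    (quotes_on_page, has_next_button).
    """
    all_quotes = []
    page = 1
    while page <= max_pages:
        try:
            quotes, has_next = fetch_page(page)
        except Exception as exc:  # e.g. a network or parsing failure
            print(f"Stopping at page {page}: {exc}")
            break
        all_quotes.extend(quotes)
        if not has_next:
            break
        page += 1
    return all_quotes

# Stand-in for a real fetcher: three fake pages with two quotes each.
def fake_fetch(page):
    quotes = [{"author": f"Author {page}", "quote": f"Quote {page}-{i}", "tags": []}
              for i in range(2)]
    return quotes, page < 3

collected = scrape_all(fake_fetch)
print(len(collected))
```

With the cap lowered, for example `scrape_all(fake_fetch, max_pages=2)`, the loop stops early even though a next page exists, which is the trade-off of a fixed page limit.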
Saving to CSV
Exports results to quotes.csv.
Defines columns: author, quote, tag_name.
Uses csv.DictWriter to store rows.
Result: a structured dataset ready for analysis.
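A minimal sketch of the export step, assuming each row is already a dictionary with the three column names listed above. The sample rows and the helper name `write_quotes` are invented for illustration. The example writes to an in-memory buffer so it can run anywhere; the commented lines show the equivalent write to quotes.csv.

```python
import csv
import io

FIELDNAMES = ["author", "quote", "tag_name"]

def write_quotes(rows, handle):
    """Write a header line plus one CSV row per dictionary."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical rows standing in for scraped data.
rows = [
    {"author": "Example Author", "quote": "An example quote.", "tag_name": "life"},
    {"author": "Another Author", "quote": "Commas, inside quotes, are escaped.",
     "tag_name": "humor"},
]

buffer = io.StringIO()
write_quotes(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Writing to disk instead:
# with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
#     write_quotes(rows, f)
```

Passing `newline=""` when opening the file is what the `csv` module documentation recommends, so that line endings are not doubled on Windows.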
PART 2: SQL Queries on Quotes Data
Query 1: Count Quotes per Author
    SELECT author, COUNT(*) AS quote_count
    FROM quotes
    GROUP BY author
    ORDER BY quote_count DESC;
Shows authors ranked by number of quotes.
Query 2: Top 5 Most Common Tags
    SELECT tag_name, COUNT(tag_name) AS tag_count
    FROM quotes
    GROUP BY tag_name
    ORDER BY tag_count DESC
    LIMIT 5;
Retrieves the 5 most frequent tags.
Query 3: Authors with More Than 5 Quotes
    SELECT author, COUNT(author) AS quote_count
    FROM quotes
    GROUP BY author
    HAVING COUNT(author) > 5;
Keeps only authors with more than 5 quotes.
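Query 4: Find the Longest Quote
    SELECT author, quote_text
    FROM quotes
    ORDER BY LENGTH(quote_text) DESC
    LIMIT 1;
Returns the longest quote and its author. Note that this query refers to the column as quote_text, while the CSV written in Part 1 names it quote; use whichever name the table actually has.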
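To try the queries above without setting up a database server, they can be run against an in-memory SQLite table from Python. This is only a sketch: the README does not say which database engine was used, the table schema is inferred from the column names in the queries, and the three rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema inferred from the queries; the real table may differ.
conn.execute("CREATE TABLE quotes (author TEXT, quote_text TEXT, tag_name TEXT)")
conn.executemany("INSERT INTO quotes VALUES (?, ?, ?)", [
    ("Author A", "Short one.", "life"),
    ("Author A", "A somewhat longer example quote.", "humor"),
    ("Author B", "Mid-length quote.", "life"),
])

# Query 1: quotes per author, most prolific first.
per_author = conn.execute(
    "SELECT author, COUNT(*) AS quote_count FROM quotes "
    "GROUP BY author ORDER BY quote_count DESC"
).fetchall()

# Query 2: most common tags.
top_tags = conn.execute(
    "SELECT tag_name, COUNT(tag_name) AS tag_count FROM quotes "
    "GROUP BY tag_name ORDER BY tag_count DESC LIMIT 5"
).fetchall()

# Query 4: longest quote and its author.
longest = conn.execute(
    "SELECT author, quote_text FROM quotes "
    "ORDER BY LENGTH(quote_text) DESC LIMIT 1"
).fetchone()

conn.close()
print(per_author, top_tags, longest)
```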
PART 3: Exploratory Data Analysis (EDA)
Steps
import pandas as pd - Import pandas.
pd.read_csv("quotes.csv") - Load the dataset.
df.info() - Summary (rows, columns, data types, non-null counts).
df.head() - Preview the first 5 rows.
df['author'].nunique() - Count of unique authors.
df.describe(include='all') - Descriptive statistics for all columns.
Insights
Identifies missing values and data types.
Shows sample data rows.
Finds the number of distinct authors.
Provides a statistical overview of the dataset.
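Put together, the steps above might look like the sketch below. To keep it self-contained, it reads a small inline CSV with made-up rows instead of quotes.csv; the `df.isna().sum()` line is an addition not listed in the steps, included as one direct way to count missing values per column.

```python
import io
import pandas as pd

# Hypothetical sample with the same columns as quotes.csv.
SAMPLE_CSV = """author,quote,tag_name
Author A,First example quote.,life
Author A,Second example quote.,humor
Author B,Third example quote.,
"""

# In the real notebook this would be pd.read_csv("quotes.csv").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

df.info()                               # rows, columns, dtypes, non-null counts
print(df.head())                        # first 5 rows (here, all 3)

n_authors = df["author"].nunique()      # number of distinct authors
missing = df.isna().sum()               # missing values per column
summary = df.describe(include="all")    # count, unique, top, freq for text columns

print(n_authors)
print(missing)
print(summary)
```

Because every column here is text, `describe(include="all")` reports counts and most frequent values rather than means and quartiles.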
RELEVANT TAGS
- Exploratory-data-analysis
- eda
- data-science
- data-visualization
- data-analysis
- pandas
- python
- statistics
- jupyter-notebook