Translate Numbers into Insights
Here's a mock dataset representing a fictional e-commerce company's sales records:
Copy code
order_id,customer_id,product_category,order_date,order_value
1,C1001,Electronics,2023-05-01,1200
2,C1002,Clothing,2023-05-02,500
3,C1003,Electronics,2023-05-02,800
4,C1004,Beauty,2023-05-03,150
5,C1001,Books,2023-05-04,250
6,C1005,Electronics,2023-05-05,900
7,C1002,Clothing,2023-05-06,700
8,C1006,Electronics,2023-05-06,1000
9,C1007,Beauty,2023-05-07,300
10,C1008,Books,2023-05-07,200
Now, let's dive into some exploratory data analysis questions that you can practice with:
How many records are there in the dataset?
df.count()
#10 in each column, 10 records
df.size()
#total size 50, five columns, 10 records
len(df.columns) #for column amount
What are the different product categories available?
df2 = df.groupby("product_category").size().reset_index(name = "count")
print(df2)
output = product_category count
0 Beauty 2
1 Books 2
2 Clothing 2
3 Electronics 4
#four different categories available
How many unique customers made purchases?
distinct_values = df['customer_id'].unique()
print(distinct_values)
#gives unique customer ids
len(distinct_values)
#gives the number of ids, 8
What is the range of dates covered in the dataset?
df['order_date'] = pd.to_datetime(df['order_date'])
date_range = []
date_range.append(df['order_date'].min())
date_range.append(df['order_date'].max())
date_range
#[Timestamp('2023-05-01 00:00:00'), Timestamp('2023-05-07 00:00:00')]
What is the average order value?
orderVal_average =df['order_value'].mean()
orderVal_average
#600
What is the minimum and maximum order value?
minmax_orderVal = []
minmax_orderVal.append(df['order_value'].min())
minmax_orderVal.append(df['order_value'].max())
minmax_orderVal
Which product category has the highest total sales value?
df2 = df.groupby('product_category')['order_value'].sum()
df2
largest_category = df2.idxmax()
print("Product category with the largest order value sum:", largest_category)
How many orders were placed per product category?
df3 = df.groupby('product_category')['order_id'].count()
df3
Can you calculate the average order value per product category?
df4 = df.groupby('product_category')['order_value'].mean()
df4
What is the distribution of order values? Can you plot a histogram?
plt.hist(df['order_value'], bins=5, color='skyblue', edgecolor='black')
plt.xlabel('Order Value')
plt.ylabel('Number of Orders')
plt.title('Order Value Distribution')
plt.grid(True)
plt.show()