Advanced Data Analytics Course (Google)

Here's a mock dataset representing a fictional e-commerce company's sales records:

Copy code
order_id,customer_id,product_category,order_date,order_value
1,C1001,Electronics,2023-05-01,1200
2,C1002,Clothing,2023-05-02,500
3,C1003,Electronics,2023-05-02,800
4,C1004,Beauty,2023-05-03,150
5,C1001,Books,2023-05-04,250
6,C1005,Electronics,2023-05-05,900
7,C1002,Clothing,2023-05-06,700
8,C1006,Electronics,2023-05-06,1000
9,C1007,Beauty,2023-05-07,300
10,C1008,Books,2023-05-07,200

Now, let's dive into some exploratory data analysis questions that you can practice with:

How many records are there in the dataset?

df.count()
	#10 in each column, 10 records
df.size()
	#total size 50, five columns, 10 records
len(df.columns) #for column amount

What are the different product categories available?

df2 = df.groupby("product_category").size().reset_index(name = "count")
print(df2)

output = product_category  count
0           Beauty      2
1            Books      2
2         Clothing      2
3      Electronics      4
#four different categories available

How many unique customers made purchases?

distinct_values = df['customer_id'].unique()
print(distinct_values)
#gives unique customer ids
len(distinct_values)
#gives the number of ids, 8

What is the range of dates covered in the dataset?

df['order_date'] = pd.to_datetime(df['order_date'])
date_range = []
date_range.append(df['order_date'].min()) 
date_range.append(df['order_date'].max())
date_range

#[Timestamp('2023-05-01 00:00:00'), Timestamp('2023-05-07 00:00:00')]

What is the average order value?

orderVal_average =df['order_value'].mean()
orderVal_average
#600

What is the minimum and maximum order value?

minmax_orderVal = []
minmax_orderVal.append(df['order_value'].min())
minmax_orderVal.append(df['order_value'].max())
minmax_orderVal

Which product category has the highest total sales value?

df2 = df.groupby('product_category')['order_value'].sum()
df2
largest_category = df2.idxmax()

print("Product category with the largest order value sum:", largest_category)

How many orders were placed per product category?

df3 = df.groupby('product_category')['order_id'].count()
df3

Can you calculate the average order value per product category?

df4 = df.groupby('product_category')['order_value'].mean()
df4

What is the distribution of order values? Can you plot a histogram?

plt.hist(df['order_value'], bins=5, color='skyblue', edgecolor='black')
plt.xlabel('Order Value')
plt.ylabel('Number of Orders')
plt.title('Order Value Distribution')
plt.grid(True)
plt.show()

Untitled