Foundations of Data Science

Get Started With Python

Translate Numbers into Insights

The Power of Statistics


Here's a mock dataset representing a fictional e-commerce company's sales records:

order_id,customer_id,product_category,order_date,order_value
1,C1001,Electronics,2023-05-01,1200
2,C1002,Clothing,2023-05-02,500
3,C1003,Electronics,2023-05-02,800
4,C1004,Beauty,2023-05-03,150
5,C1001,Books,2023-05-04,250
6,C1005,Electronics,2023-05-05,900
7,C1002,Clothing,2023-05-06,700
8,C1006,Electronics,2023-05-06,1000
9,C1007,Beauty,2023-05-07,300
10,C1008,Books,2023-05-07,200
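
The answers in this document operate on a pandas DataFrame called `df`, but the source does not show how it was created. The following is a minimal sketch of one way to load the records above. Reading from an in-memory string with `io.StringIO` is an assumption made to keep the example self-contained; a saved file (for example a hypothetical `sales.csv`) could be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The mock sales records, embedded as text so the example runs on its own.
csv_text = """order_id,customer_id,product_category,order_date,order_value
1,C1001,Electronics,2023-05-01,1200
2,C1002,Clothing,2023-05-02,500
3,C1003,Electronics,2023-05-02,800
4,C1004,Beauty,2023-05-03,150
5,C1001,Books,2023-05-04,250
6,C1005,Electronics,2023-05-05,900
7,C1002,Clothing,2023-05-06,700
8,C1006,Electronics,2023-05-06,1000
9,C1007,Beauty,2023-05-07,300
10,C1008,Books,2023-05-07,200
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (number of rows, number of columns)
print(df.head())
```

Note that `order_date` is read as plain text here; it is converted to datetimes only when a question needs it.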

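Now, let's dive into some exploratory data analysis questions that you can practice with. The answers assume the records are loaded into a pandas DataFrame named `df`, with `pandas` imported as `pd` and `matplotlib.pyplot` imported as `plt`: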

  1. How many records are there in the dataset?

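    len(df)          # number of records (rows): 10
    df.count()       # non-null values per column: 10 in each, so nothing is missing
    df.size          # total number of cells: 50 (10 rows x 5 columns); size is an attribute, not a method
    len(df.columns)  # number of columns: 5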
    
  2. What are the different product categories available?

    df2 = df.groupby("product_category").size().reset_index(name="count")
    print(df2)

    # Output:
    #   product_category  count
    # 0           Beauty      2
    # 1            Books      2
    # 2         Clothing      2
    # 3      Electronics      4
    # Four different categories are available.
    
  3. How many unique customers made purchases?

    distinct_values = df['customer_id'].unique()
    print(distinct_values)
    # array of the unique customer IDs
    len(distinct_values)
    # number of unique customers: 8
    # df['customer_id'].nunique() returns the same count directly
    
  4. What is the range of dates covered in the dataset?

    # Convert the column to datetimes so min() and max() compare dates, not strings
    df['order_date'] = pd.to_datetime(df['order_date'])
    date_range = [df['order_date'].min(), df['order_date'].max()]
    date_range

    # [Timestamp('2023-05-01 00:00:00'), Timestamp('2023-05-07 00:00:00')]
    
  5. What is the average order value?

    orderVal_average = df['order_value'].mean()
    orderVal_average
    # 600.0
    
  6. What are the minimum and maximum order values?

    minmax_orderVal = [df['order_value'].min(), df['order_value'].max()]
    minmax_orderVal
    # minimum 150, maximum 1200
    
  7. Which product category has the highest total sales value?

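    # Total order value per category
    df2 = df.groupby('product_category')['order_value'].sum()
    df2
    # idxmax() returns the index label (the category) with the largest total
    largest_category = df2.idxmax()

    print("Product category with the highest total sales value:", largest_category)
    # Electronics (total of 3900)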
    
  8. How many orders were placed per product category?

    # Count the order IDs in each category
    df3 = df.groupby('product_category')['order_id'].count()
    df3
    # Beauty 2, Books 2, Clothing 2, Electronics 4
    
  9. Can you calculate the average order value per product category?

    # Mean order value in each category
    df4 = df.groupby('product_category')['order_value'].mean()
    df4
    # Beauty 225.0, Books 225.0, Clothing 600.0, Electronics 975.0
    
  10. What is the distribution of order values? Can you plot a histogram?

    plt.hist(df['order_value'], bins=5, color='skyblue', edgecolor='black')
    plt.xlabel('Order Value')
    plt.ylabel('Number of Orders')
    plt.title('Order Value Distribution')
    plt.grid(True)
    plt.show()
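
Questions 7 to 9 each group by `product_category`, so the three results can also be computed in a single pass. The following is a sketch under stated assumptions rather than the method used above: it rebuilds the dataset inline so that it runs on its own, and the names `summary`, `total_sales`, `order_count`, and `average_order_value` are illustrative choices, not part of the original answers.

```python
import io

import pandas as pd

# Same mock records as above, embedded so the example is self-contained.
csv_text = """order_id,customer_id,product_category,order_date,order_value
1,C1001,Electronics,2023-05-01,1200
2,C1002,Clothing,2023-05-02,500
3,C1003,Electronics,2023-05-02,800
4,C1004,Beauty,2023-05-03,150
5,C1001,Books,2023-05-04,250
6,C1005,Electronics,2023-05-05,900
7,C1002,Clothing,2023-05-06,700
8,C1006,Electronics,2023-05-06,1000
9,C1007,Beauty,2023-05-07,300
10,C1008,Books,2023-05-07,200
"""
df = pd.read_csv(io.StringIO(csv_text))

# Named aggregation: each keyword defines one output column
# as a (source column, aggregation function) pair.
summary = df.groupby("product_category").agg(
    total_sales=("order_value", "sum"),           # question 7
    order_count=("order_id", "count"),            # question 8
    average_order_value=("order_value", "mean"),  # question 9
)

# Sort so the category with the highest total sales comes first.
summary = summary.sort_values("total_sales", ascending=False)
print(summary)
```

On this dataset the first row is Electronics, with total sales of 3900 across 4 orders and an average order value of 975.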
    
