Interview Question
You’re tasked with analyzing a Spotify-like dataset that captures user listening habits. For each user, calculate the total listening time and the count of unique songs they’ve listened to. In the database duration values are displayed in seconds. Round the total listening duration to the nearest whole minute. The output should contain three columns: ‘user_id’, ‘total_listen_duration’, and ‘unique_song_count’.
Link to the question on Stratascratch
What you’ll learn
- DataFrame groupby
- .agg function
- .reset_index()
- nunique
Solution: Aggregate Listening Data
Disclaimer: This code is provided for educational purposes only. Please test the code thoroughly before using it in production. You are responsible for any issues that arise from using this code in your environment.
import pandas as pd
# create a new column to store the seconds in minutes
listening_habits['listen_duration_mints']=round(listening_habits['listen_duration']/60)
#create a new dataframe to store the output values
result2=(listening_habits.groupby('user_id').agg(
total_listen_duration=pd.NamedAgg(column="listen_duration_mints",aggfunc='sum'),
unique_song_count=pd.NamedAgg(column='song_id',aggfunc='nunique'),
).reset_index()
)
result2
Code Explanation
Explanation of code elements.
- In above code, we are grouping the output by user_id and then aggregating it. Also, we are creating two new columns *’total_listen_duration’* and *’unique_song_count’*. For each for of these columns we are applying **NamedAgg** function which takes the column name and the function to apply (aggfunc=’sum’).
- Pandas aggregation operations like groupby() and agg() will make the grouping column (user_id) into the index of the resulting DataFrame and won’t display it. To display the user_id column, we have to use reset_index().
- listening_habits.groupby(‘user_id’): This groups the DataFrame by the user_id column.
- .agg(): This applies the aggregation functions to the grouped data.
- The NamedAgg function in pandas is used to specify a custom name for an aggregated column when using the agg function. It takes two arguments: column: The name of the column to be aggregated.aggfunc: The aggregation function to be applied.
Leave a Reply