Identifying a minimum collection of baskets that covers the universe of products using PySpark
Problem Statement
From a collection of hundreds or even thousands of items (the universe of items), your data analytics team may want to recommend a personalized basket of items to each customer. Customer A may be recommended the basket {A, B, C} and customer B may be recommended {B, C, D}, etc. What is the minimum collection of baskets that together cover the universe of items? These types of problems are known as set cover problems.
Solution
We are going to use a greedy algorithm to solve the above set cover problem. It does not provide an exact solution, but it provides a good enough solution with much less effort and time. To use the greedy algorithm, we need to know:
- Universe of items
- Collection of sets: personalized basket recommendations (each basket is internally sorted alphabetically)
Let’s first gather the collection of sets with the function distinct_basket_of_items
. Given a DataFrame
and a list of column-names whose products make up a basket, ['productA', 'productB', 'productC']
, the function will return a DataFrame
of sorted-baskets for each customer_id
and a List
of all unique sorted-baskets.
import pyspark.sql.functions as F
from pyspark.sql.functions import coldef distinct_baskets_of_items(
df,
item_cols
):
'''
Args…