Identifying a minimum collection of baskets that covers the universe of products using PySpark

Cpak
2 min readSep 1, 2020

Problem Statement

From a collection of hundreds or even thousands of items (the universe of items), your data analytics team may want to recommend a personalized basket of items to each customer. Customer A may be recommended the basket {A, B, C} and customer B may be recommended {B, C, D}, etc. What is the minimum collection of baskets that together cover the universe of items? These types of problems are known as set cover problems.

Solution

We are going to use a greedy algorithm to solve the above set cover problem. It does not provide an exact solution, but it provides a good enough solution with much less effort and time. To use the greedy algorithm, we need to know:

  • Universe of items
  • Collection of sets: personalized basket recommendations (each basket is internally sorted alphabetically)

Let’s first gather the collection of sets with the function distinct_basket_of_items. Given a DataFrame and a list of column-names whose products make up a basket, ['productA', 'productB', 'productC'] , the function will return a DataFrame of sorted-baskets for each customer_id and a List of all unique sorted-baskets.

import pyspark.sql.functions as F
from pyspark.sql.functions import col
def distinct_baskets_of_items(
df,
item_cols
):
'''
Args…

--

--