Is it okay to run the Apriori algorithm on data that contains duplicate items?


Hi, I'm trying to generate a small synthetic dataset for association rule analysis with the Apriori algorithm, and I'm wondering whether it matters if the same item appears more than once within a transaction. For example: [milk, milk, banana, yogurt], [apple, meat, soap, apple]. Thanks for reading!

Here's what I've coded so far:

  1. Generate a weighted random sample

    import numpy as np

    products = [
        '홈', '어웨이', '마킹', '입장용트랙탑(블랙)', '레인자켓(블랙)', '레인자켓(레드)',
        '패딩수트상의(블랙)', '패딩수트상의(레드)', '선수단롱다운(블랙)', '패딩베스트(블랙)',
        '이동복상의(블랙)', '이동복상의(블랙)', '트레이닝상의(블랙)', '트레이닝상의(레드)',
        '바람막이피스테(블랙)', '바람막이피스테(레드)', '연습복긴팔(블랙)', '연습복긴팔(레드)',
        '연습복반팔(블랙)', '연습복반팔(레드)', '폴로티긴팔(블랙)', '폴로티긴팔(레드)',
        '폴로티반팔(블랙)', '폴로티반팔(레드)', '트레이닝하의(블랙)', '3/4팬츠(블랙)',
        '연습복반바지(블랙)', '응원용품', 'FC서울로고니트머플러', 'FC서울SoulofSeoul니트머플러',
        'FC서울브랜딩니트머플러', 'FC서울WHITE니트머플러', '서울오리지널머플러', '기성용캡틴머플러',
        '전사골드머플러', '전사블랙머플러', '아동유니폼', '선수단볼캡블랙', '선수단볼캡레드',
        '선수단동계비니', '40주년백구', '선수단신발주머니', 'FC서울MINI레인보우',
        'FC서울포토레인보우', '유니폼뱃지', '엠블럼뱃지', '레터링뱃지',
    ]  # 47 items in total

    # Sampling weight for each product, in the same order as `products`
    prob = [
        0.2454, 0.0966, 0.2316, 0.0026, 0.0031, 0.0016, 0.0027, 0.0001, 0.0033, 0.0003,
        0.0021, 0.0004, 0.0010, 0.0017, 0.0013, 0.0034, 0.0011, 0.0029, 0.0030, 0.0024,
        0.0007, 0.0009, 0.0023, 0.0006, 0.0014, 0.0019, 0.0020, 0.0368, 0.0174, 0.0464,
        0.0116, 0.0058, 0.0232, 0.0406, 0.0348, 0.0291, 0.0107, 0.0231, 0.0069, 0.0208,
        0.0162, 0.0046, 0.0093, 0.0116, 0.0185, 0.0023, 0.0139,
    ]

    # Draw 50,000 products with replacement, following the weights above
    random_data = np.random.choice(products, size=50000, p=prob)


  2. Generate a list of 1000 customers. I tried to remove the duplicates with set(list_random), but then the sample no longer follows the weights above.

    import random
    import pandas as pd

    store = []
    for i in range(1, 1000):
        for j in range(random.randint(1, 4)):
            randsample = random.sample(set(list_random), j)
            store.append(randsample)
    # print(store)

    df = pd.DataFrame(store)
    df.head(10)
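
One possible way to keep each basket free of repeated items while still respecting the weights is to draw each customer's items without replacement directly from the weighted product list, instead of deduplicating afterwards with `set`. The sketch below is only an illustration under assumptions: the shortened `products` and `prob` lists are hypothetical stand-ins for the real ones, and `make_basket` and its parameters are names invented for this example. Note that with `replace=False` the item frequencies will be close to, but not exactly proportional to, the weights, because an item already drawn is removed before the next draw within the same basket.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

# Hypothetical, shortened stand-ins for the real `products` and `prob` lists
products = ["home", "away", "marking", "muffler", "badge"]
prob = [0.40, 0.20, 0.25, 0.10, 0.05]  # weights must sum to 1


def make_basket(rng, products, prob, min_items=1, max_items=4):
    """Draw one customer's basket: distinct items, chosen according to prob."""
    # Basket size is uniform in [min_items, max_items] (upper bound is exclusive)
    k = int(rng.integers(min_items, max_items + 1))
    # replace=False guarantees that no item appears twice within one basket
    return rng.choice(products, size=k, replace=False, p=prob).tolist()


# One basket per customer
baskets = [make_basket(rng, products, prob) for _ in range(1000)]
```

Each element of `baskets` is then a list of one to four distinct product names, which can be passed to whatever transaction encoding step precedes the Apriori run.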


(I thought the Apriori algorithm required a dataset without duplicate items, but when I ran it on a dataset that contains duplicates, it still worked.)
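
A likely reason it still runs: in the usual formulation of Apriori, the support of an itemset is the fraction of transactions that contain it, so only the presence or absence of an item in a basket matters, not how many times it is listed. The sketch below illustrates this with a hand-written `support` helper (a name made up for this example, not taken from any library) and assumes the implementation in use treats each transaction as a set of items.

```python
transactions = [
    ["milk", "milk", "banana", "yogurt"],
    ["apple", "meat", "soap", "apple"],
    ["milk", "banana"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # set(t) collapses repeated items, so duplicates cannot change the count
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# The same baskets with repeated items removed
deduped = [sorted(set(t)) for t in transactions]

# Support is identical with and without the duplicates
assert support({"milk"}, transactions) == support({"milk"}, deduped)

# 2 of the 3 transactions contain both milk and banana
print(support({"milk", "banana"}, transactions))
```

Under that assumption, duplicates inside a basket are harmless but carry no extra information; if purchase quantities should influence the rules, a different representation would be needed.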
