🚨 In the previous post, I used correlation analysis to look for relationships among the attributes. Unfortunately, I did not know at the time that correlation analysis is meant only for relationships between two numeric variables, so the method I used was somewhat off 😥.

🚨 Starting with this post and continuing over the next few, I will use pattern analysis to examine the relationships among the attributes.

🫀 1. support / confidence / lift: theoretical background

  • ๐Ÿšจ Support
    • ์ง€์ง€๋„
    • x์™€ y๋ฅผ ๋™์‹œ์— ํฌํ•จํ•˜๋Š” ๋น„์œจ
    • ์‹ ๋ขฐ๋„(Confidence)๋ฅผ ์ง€์ง€ํ•˜๋Š” ์ฒ™๋„
    • confidence์— ์˜ํ•œ ๊ทœ์น™์ด ์ง€์ง€๋ฐ›๊ธฐ ์œ„ํ•ด์„œ๋Š” support ๊ฐ’์ด ๋†’์•„์•ผ ํ•จ.
  • ๐Ÿšจ Confidence
    • ์‹ ๋ขฐ๋„
    • x๋ฅผ ํฌํ•จํ•˜๋Š” ๊ฑฐ๋ž˜ ๋‚ด์—ญ ์ค‘, y๊ฐ€ ํฌํ•จ๋œ ๋น„์œจ
    • ๊ทœ์น™์˜ ์‹ ๋ขฐ๋„์— ๋Œ€ํ•œ ์ฒ™๋„
    • P(Y X)
  • ๐Ÿšจ Lift
    • ์‹ ๋ขฐ๋„์˜ ๊ฒฐ๊ณผ๊ฐ€ 0.9๋ผ๊ณ  ๊ฐ€์ •ํ•˜์˜€์„ ๋•Œ p(Y)๊ฐ€ 0.9๋ฉด x,y๊ฐ€ ์„œ๋กœ ๋…๋ฆฝ์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์— x๋Š” y๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๋ฐ์— ์•„๋ฌด๋Ÿฐ ๋„์›€์„ ์ค„์ˆ˜ ์—†์Œ
    • ๊ทœ์น™์ด ์ง„์งœ ์˜๋ฏธ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•œ ์ฒ™๋„
    • P(Y X) / P(Y)
      • Lift = 1 : x์™€ y๋Š” ์•„๋ฌด ๊ด€๊ณ„๊ฐ€ ์—†์Œ. ๋…๋ฆฝ.
      • Lift > 1 : x๊ฐ€ y์˜ ๋ฐœ์ƒ ์ฆ๊ฐ€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋จ. (์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„).
      • Lift < 1 : x๊ฐ€ y์˜ ๋ฐœ์ƒ ๊ฐ์†Œ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋จ. (์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„).

🫀 2. Preprocessing

🚨 Our data is a mix of categorical and numeric attributes. To find rules through pattern analysis, the data has to be transaction data, that is, data laid out in Boolean form. So we will first categorize every attribute according to fixed criteria, and then convert the categorized data into a Boolean representation to obtain the transaction data.
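
📌 The two steps can be sketched on a tiny example in plain Python. The three records are made up, and `categorize` is a hypothetical helper; only the ap_hi threshold of 140 and the label names are taken from the code in this post.

```python
# Hypothetical raw records: one numeric attribute and one 0/1 coded attribute.
records = [
    {"ap_hi": 150, "smoke": 1},
    {"ap_hi": 120, "smoke": 0},
    {"ap_hi": 140, "smoke": 0},
]


def categorize(record):
    """Step 1: map every raw value to a category label."""
    return [
        "HBP_SYS" if record["ap_hi"] >= 140 else "NBP_SYS",
        "Smoke" if record["smoke"] == 1 else "No_Smoke",
    ]


categorized = [categorize(r) for r in records]

# Step 2: one Boolean column per distinct label, True where the row has it.
columns = sorted({label for row in categorized for label in row})
boolean_rows = [[label in row for label in columns] for row in categorized]

print(columns)
for row in boolean_rows:
    print(row)
```

Each row ends up with exactly one True per original attribute, which is the shape the pattern-mining step expects.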

  • ๐Ÿšจ Support, Confidence ๊ณ„์‚ฐ์„ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ transaction table ํ˜•ํƒœ๋กœ ๋ณ€๊ฒฝ
    • pre_tran : ๊ฐ attribute์˜ binary ๊ฐ’์„ category ํ˜•ํƒœ๋กœ ๋ฐ”๊พผ dataframe ์ƒ์„ฑ
    • transaction : mlxtend ๋ฉ”์†Œ๋“œ์˜ transform ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ boolean dataframe ์ƒ์„ฑ
  • ๐Ÿšจ Support ,Confidence ๊ณ„์‚ฐ
    • mlxtend.frequent_patterns ๋ชจ๋“ˆ์˜ apriori, association_rules ํ•จ์ˆ˜
    • apriori() : itemsets ๊ฐ„์˜ Support๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ dataframe์œผ๋กœ ๋ฐ˜ํ™˜ - ์„ค์ •ํ•œ min_support๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ฒฝ์šฐ๋งŒ ๋ฐ˜ํ™˜
    • association_rules() ํ•จ์ˆ˜์˜ metric, min_threshold ์˜ต์…˜ : ์„ค์ •ํ•œ metric์ด min_threshold ์ด์ƒ์ธ ๊ฒฝ์šฐ๋งŒ ๋ฐ˜ํ™˜
  • ๐Ÿšจ ์šฐ๋ฆฌ๊ฐ€ ์ฐพ๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ cardio์™€ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” attribute ๊ฐ„์˜ ์ธ๊ณผ๊ด€๊ณ„์ด๊ธฐ ๋•Œ๋ฌธ์— cardio๋ฅผ consequents๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ์ฃผ๋กœ ์‚ดํŽด๋ณผ ์˜ˆ์ •์ž„
    • confidence, Lift, support ์ˆœ์„œ๋กœ ์šฐ์„ ์ˆœ์œ„๋ฅผ ์„ค์ •
    • min_confidence = 0.6 / Lift > 1 / min_support = 0.01
    • support๋ฅผ ๋‚ฎ๊ฒŒ ์„ค์ •ํ•œ ์ด์œ ๋Š” confidence์™€ Lift๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ฒฝ์šฐ์— antecedents์˜ support๊ฐ€ ๋„ˆ๋ฌด ์ž‘์•„ ์ „์ฒด์ ์ธ support๊ฐ€ ๋‚ฎ๊ฒŒ ๋‚˜์˜ค๋Š” ๊ฒฝ์šฐ๋ฅผ ๊ณ ๋ คํ•œ ๊ฒƒ์ด๋‹ค.

🫀 3. Code

import numpy as np
import pandas as pd

# Categorize function for the BMI attribute
# BMI < 18.5          : underweight
# 18.5 <= BMI < 25    : normal
# 25 <= BMI < 30      : overweight
# 30 <= BMI < 39.9    : obese
# 39.9 <= BMI         : severely obese
def bmi(x):
    """Map a numeric BMI value to its category label."""
    if x < 18.5:
        return 'LOW'
    elif x < 25:
        return 'NORMAL'
    elif x < 30:
        return 'OVER'
    elif x < 39.9:
        return 'OBESITY'
    else:
        return 'HIGH_OBESITY'
# Categorize the cardio data
pre_tran = cardio.copy()

# gender : 1 2
pre_tran = pre_tran.replace({'gender':1},'Women')
pre_tran = pre_tran.replace({'gender':2},'Men')

# cholesterol : 1 2 3
pre_tran = pre_tran.replace({'cholesterol':1},'Normal_cho')
pre_tran = pre_tran.replace({'cholesterol':2},'Above_Normal_cho')
pre_tran = pre_tran.replace({'cholesterol':3},'Well_Above_Normal_cho')

# gluc : 1 2 3
pre_tran = pre_tran.replace({'gluc':1},'Normal_gluc')
pre_tran = pre_tran.replace({'gluc':2},'Above_Normal_gluc')
pre_tran = pre_tran.replace({'gluc':3},'Well_Above_Normal_gluc')

# smoke : 0 1
pre_tran = pre_tran.replace({'smoke':0},'No_Smoke')
pre_tran = pre_tran.replace({'smoke':1},'Smoke')

# alco : 0 1
pre_tran = pre_tran.replace({'alco':0},'No_Alcohol')
pre_tran = pre_tran.replace({'alco':1},'Alcohol')

# active : 0 1
pre_tran = pre_tran.replace({'active':0},'No_Active')
pre_tran = pre_tran.replace({'active':1},'Active')

# cardio : 0 1, target
pre_tran = pre_tran.replace({'cardio':0},'No_cardio')
pre_tran = pre_tran.replace({'cardio':1},'Cardio')

# ap_hi >= 140 -> HBP_SYS (high blood pressure), otherwise NBP_SYS (normal)
# ap_lo >= 90  -> HBP_DIAS (high blood pressure), otherwise NBP_DIAS (normal)
pre_tran["ap_hi"] = np.where(pre_tran["ap_hi"] >=140, 'HBP_SYS', 'NBP_SYS')
pre_tran["ap_lo"] = np.where(pre_tran["ap_lo"] >=90, 'HBP_DIAS', 'NBP_DIAS')

# age : bucket into decades (30s, 40s, 50s, 60s)
pre_tran.loc[pre_tran['age'] // 10 == 3, 'age'] = 30
pre_tran.loc[pre_tran['age'] // 10 == 4, 'age'] = 40
pre_tran.loc[pre_tran['age'] // 10 == 5, 'age'] = 50
pre_tran.loc[pre_tran['age'] // 10 == 6, 'age'] = 60

# BMI : apply the bmi function defined above
pre_tran['BMI'] = pre_tran['BMI'].apply(bmi)

print('row : ', len(pre_tran))
print('columns : ', len(pre_tran.columns))
pre_tran.head()
>>
row :  64500
columns :  11


📌 After these steps, every column in our data is categorical. All that is left is to turn this categorical data into transaction data.

# Build the transaction table
# Import TransactionEncoder from the mlxtend.preprocessing module
from mlxtend.preprocessing import TransactionEncoder

# Create the transaction data
# TransactionEncoder takes a list of item lists, so turn each row into a list.
# Casting to str first keeps the numeric age buckets comparable with the text labels: trans_data
trans_data = pre_tran.astype(str).values.tolist()

# transform() returns trans_data as a one-hot encoded boolean array: te_ary
# Wrap te_ary in a dataframe to get the transaction data
# transaction : one column per category value of each attribute
te = TransactionEncoder()
te_ary = te.fit(trans_data).transform(trans_data)
transaction = pd.DataFrame(te_ary, columns=te.columns_)
transaction


😥 Oops, the screenshot is hard to read. I captured it this way because I wanted to show as many attributes as possible at once, which is a shame. If you want a closer look, clicking the image once more should help!


🚨 We finally have transaction data in the shape we wanted. From here on, we analyze the patterns in this dataframe to see which items tend to appear together and what their measures look like.

🚨 Extracting relationships from the data and putting them to use matters, of course, but working through the preprocessing needed to get there seems to have given me a broader view of data analysis. I learned these methods through a great deal of trial and error, starting from nothing but the theory covered in class, so I suspect they will stay with me for a long time 😀😀.

🚨 The next post will cover computing support, confidence, and lift from the transaction data and visualizing the results.

