๐Ÿ ์•ž์„  ํฌ์ŠคํŒ…์—์„œ ๋งŒ๋“  transaction data ๋กœ๋ถ€ํ„ฐ ๊ฐ column ๋“ค์˜ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํŒจํ„ด๋“ค์„ ๊ตฌํ•ด๋ณด์ž.

๐Ÿ ์ฝ”๋“œ ์ง„ํ–‰์˜ ์ดํ•ด๋ฅผ ์œ„ํ•ด ์ด๋ฒˆ ๊ธ€์—์„œ ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋จผ์ € ๋ณด๋„๋ก ํ•˜์ž.


📌 1. pre_tran : the original data, which mixed numerical and categorical attributes, converted into categorical data


📌 2. transaction : the final transaction data built from pre_tran, in Boolean representation


2. Computing support / confidence / lift

The goal of the pattern analysis is to find out which attributes frequently appear together with the cardio label.

To do that we need criteria for selecting attributes; the criteria for each measure are listed below.

What we are looking for is the relationship between cardio and the attributes that influence it, so we will mainly look at rules that have cardio as the consequent.

  • Priority order: confidence, lift, support
  • min_confidence = 0.6 / Lift > 1 / min_support = 0.01
  • The support threshold is set low because a rule can satisfy the confidence and lift criteria while its antecedent has a very small support, which makes the support of the whole rule small as well.
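
As a reminder of what the three measures compute, here is a minimal sketch on a made-up list of transactions. The item names `age_60` and `cardio` and the eight toy rows are assumptions for illustration only, not values from the real data; the thresholds are the ones listed above.

```python
# Toy transactions (hypothetical items, not the real data set).
transactions = [
    {"age_60", "cardio"},
    {"age_60", "cardio"},
    {"age_60"},
    set(),
    {"cardio"},
    set(),
    {"age_60", "cardio"},
    set(),
]

def support(itemset, rows):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= row for row in rows) / len(rows)

def rule_metrics(antecedent, consequent, rows):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_rule = support(antecedent | consequent, rows)
    confidence = sup_rule / support(antecedent, rows)
    lift = confidence / support(consequent, rows)
    return sup_rule, confidence, lift

sup, conf, lift = rule_metrics({"age_60"}, {"cardio"}, transactions)

# Apply the thresholds used in this post.
passes = conf >= 0.6 and lift > 1 and sup >= 0.01
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f} passes={passes}")
# prints: support=0.375 confidence=0.750 lift=1.500 passes=True
```

In this toy case the antecedent appears in half of the rows and three of those four rows also contain `cardio`, which gives a confidence of 0.75 and a lift of 1.5.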

    ๐Ÿ ์ด์ œ ๋ณธ๊ฒฉ์ ์œผ๋กœ ํŒจํ„ด์„ ๋ถ„์„ํ•ด๋ณด๋„๋ก ํ•˜์ž.

📌 2.1. Age / Cardio

# Build frequent itemsets with the apriori algorithm and generate association rules
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Import apriori and association_rules from the mlxtend.frequent_patterns module
from mlxtend.frequent_patterns import apriori, association_rules
# Extract only the attributes of interest from the pre_tran dataframe
# and build a transaction dataframe over the categories of those attributes
Age_C = pre_tran[['age','cardio']]
train_data_Age_C = np.array(Age_C)
train_data_Age_C = np.array(train_data_Age_C.tolist())
te = TransactionEncoder()
te_Age_C = te.fit(train_data_Age_C).transform(train_data_Age_C)
transaction_Age_C = pd.DataFrame(te_Age_C, columns=te.columns_)

# apriori and association_rules functions of the mlxtend.frequent_patterns module
# apriori() : computes the support of each itemset and returns a dataframe; only itemsets that meet min_support are returned
# association_rules() : builds rules that distinguish antecedents from consequents and returns a dataframe with support, confidence, lift, leverage, conviction
# metric / min_threshold options of association_rules() : only rules whose chosen metric is at least min_threshold are returned
frequent_itemsets_Age_C = apriori(transaction_Age_C, min_support=0.000001,use_colnames=True)
frequent_itemsets_Age_C = frequent_itemsets_Age_C.sort_values('support',ascending = False)
rule_Age_C = association_rules(frequent_itemsets_Age_C, metric="lift", min_threshold=0)

# Keep only the rules that have cardio as the consequent,
# since we want to see how each attribute relates to cardio
# Sort by confidence in descending order
is_cardio = rule_Age_C['consequents'].apply(lambda items: "Cardio" in str(items))
rule_Age_C_cardio = rule_Age_C[is_cardio].sort_values('confidence', ascending = False)
rule_Age_C_cardio

🚩 The resulting rule dataframe is shown below. Judging from this table, the Age attribute has rules that satisfy the criteria we set (min_confidence = 0.6 / Lift > 1 / min_support = 0.01), so it can be considered associated with cardio.
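
The filter in the code above matches "Cardio" against the string form of each consequent. Since mlxtend stores antecedents and consequents as frozensets, the same selection can also be written as a set-membership test followed by the three threshold checks. The sketch below uses plain dictionaries in place of the rules dataframe; the item labels and all numbers are made up for illustration.

```python
# Hypothetical rules in the shape mlxtend returns: antecedents and consequents
# are frozensets. All labels and numbers are made up for illustration.
rules = [
    {"antecedents": frozenset({"Age_60"}), "consequents": frozenset({"Cardio"}),
     "support": 0.10, "confidence": 0.68, "lift": 1.36},
    {"antecedents": frozenset({"Age_30"}), "consequents": frozenset({"Cardio"}),
     "support": 0.005, "confidence": 0.25, "lift": 0.50},
    {"antecedents": frozenset({"Cardio"}), "consequents": frozenset({"Age_60"}),
     "support": 0.10, "confidence": 0.20, "lift": 1.36},
]

MIN_CONFIDENCE = 0.6
MIN_LIFT = 1.0
MIN_SUPPORT = 0.01

def meets_criteria(rule, target="Cardio"):
    """True if `target` is in the consequent and all three thresholds hold."""
    return (
        target in rule["consequents"]
        and rule["confidence"] >= MIN_CONFIDENCE
        and rule["lift"] > MIN_LIFT
        and rule["support"] >= MIN_SUPPORT
    )

# Keep the qualifying rules, highest confidence first.
selected = sorted(
    (r for r in rules if meets_criteria(r)),
    key=lambda r: r["confidence"],
    reverse=True,
)
for r in selected:
    print(set(r["antecedents"]), "->", set(r["consequents"]), r["confidence"])
```

Exact set membership needs the exact item label, which depends on how pre_tran was encoded, so the `target` value here is only a placeholder.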


🚩 We will go through this process for more than 10 attributes, and reading each table one by one is tedious. To make this a bit easier, I thought it would be good to visualize the values and compare them at a glance.

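import plotly.graph_objects as go

fig = go.Figure()

# Scatter trace for support
# Note: the x labels are hard-coded, so they must match the row order of rule_Age_C_cardio (sorted by confidence)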
fig.add_trace(
    go.Scatter(
    x = ['60','50','40','30'], y = rule_Age_C_cardio['support'], name = 'Support', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Age_C_cardio['support'].round(3), textposition = 'middle right', textfont_size = 15)) 
        
# Scatter trace for confidence
fig.add_trace(
    go.Scatter(
    x = ['60','50','40','30'], y = rule_Age_C_cardio['confidence'], name = 'Confidence', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Age_C_cardio['confidence'].round(3), textposition = 'middle right', textfont_size = 15)) 
        
# Scatter trace for lift
fig.add_trace(
    go.Scatter(
    x = ['60','50','40','30'], y = rule_Age_C_cardio['lift'], name = 'Lift', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Age_C_cardio['lift'].round(3), textposition = 'middle right', textfont_size = 15)) 

# Horizontal line marking the lift criterion (Lift > 1)
fig.add_hline(y=1, line_dash="dot",
              line_color = "#1dd1ad",
              annotation_text="        Lift > 1", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
# Horizontal line marking the support criterion (support > 0.01)
fig.add_hline(y=0.01, line_dash="dot",
              line_color = "#8c8cf5",
              annotation_text="        support > 0.01", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
# Horizontal line marking the confidence criterion (confidence > 0.6)
fig.add_hline(y=0.6, line_dash="dot",
              line_color = "#ff0000",
              annotation_text="        confidence > 0.6", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
fig.update_layout(
    {
        'title' : {'text':'Support, Confidence, and Lift of the Age-Cardio rules', 'font':{'size' : 25}},
        'xaxis' : {'title':{'text' : 'Age', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
        'yaxis' : {'title':{'text' : 'Support / Confidence / Lift', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
        
        'template' : 'plotly_dark'
    })
    
fig.show()

๐Ÿ ์œ„์˜ ๋„ํ‘œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ๊ฐ ํŒจํ„ด ๋ฉ”์†Œ๋“œ์˜ ๊ธฐ์ค€๊ฐ’์— ์ˆ˜ํ‰์„ ์„ ๊ทธ์–ด ํ•œ๋ˆˆ์— ๋“ค์–ด์˜ค๊ธฐ ํŽธํ•˜๊ฒŒ ์‹œ๊ฐํ™”ํ–ˆ๋‹ค. ์ด์ œ ๋‹ค๋ฅธ attribute๋ฅผ ๋ถ„์„ํ•ด๋ณด๋„๋ก ํ•˜์ž.


📌 2.2. Gender / Cardio

# Same steps as for Age
Gender_C = pre_tran[['gender','cardio']]
train_data_Gender_C = np.array(Gender_C)
train_data_Gender_C = np.array(train_data_Gender_C.tolist())
te = TransactionEncoder()
te_Gender_C = te.fit(train_data_Gender_C).transform(train_data_Gender_C)
transaction_Gender_C = pd.DataFrame(te_Gender_C, columns=te.columns_)
frequent_itemsets_Gender_C = apriori(transaction_Gender_C, min_support=0.0000001,use_colnames=True)
frequent_itemsets_Gender_C = frequent_itemsets_Gender_C.sort_values('support',ascending = False)
rule_Gender_C = association_rules(frequent_itemsets_Gender_C, metric="lift", min_threshold=0)
y = rule_Gender_C['consequents'].apply(lambda rule_Gender_C: "Cardio" in str(rule_Gender_C))
y = y[y==True].index
rule_Gender_C_cardio = rule_Gender_C.loc[y].sort_values('confidence', ascending = False)

# Visualization
fig = go.Figure()
fig.add_trace(
    go.Scatter(
    x = ['Women','Men'], y = rule_Gender_C_cardio['support'], name = 'Support', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Gender_C_cardio['support'].round(3), textposition = 'middle right', textfont_size = 15))
        
fig.add_trace(
    go.Scatter(
    x = ['Women','Men'], y = rule_Gender_C_cardio['confidence'], name = 'Confidence', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Gender_C_cardio['confidence'].round(3), textposition = 'middle right', textfont_size = 15))
        
fig.add_trace(
    go.Scatter(
    x = ['Women','Men'], y = rule_Gender_C_cardio['lift'], name = 'Lift', 
        mode = 'markers+text', marker_size = 20, 
        text = rule_Gender_C_cardio['lift'].round(3), textposition = 'middle right', textfont_size = 15))
        
fig.add_hline(y=1, line_dash="dot",
              line_color = "#1dd1ad",
              annotation_text="    Lift > 1", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
fig.add_hline(y=0.01, line_dash="dot",
              line_color = "#8c8cf5",
              annotation_text="    support > 0.01", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
fig.add_hline(y=0.6, line_dash="dot",
              line_color = "#ff0000",
              annotation_text="    confidence > 0.6", 
              annotation_position="bottom left",
              annotation_font_size=17,
              annotation_font_color="gray"
             )
             
fig.update_layout(
    {
        'title' : {'text':'Support, Confidence, and Lift of the Gender-Cardio rules', 'font':{'size' : 25}},
        'xaxis' : {'title':{'text' : 'Gender', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
        'yaxis' : {'title':{'text' : 'Support / Confidence / Lift', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
        
        'template' : 'plotly_dark'
    })
    
fig.show()

🚩 No rule satisfies the criteria we set (min_confidence = 0.6 / Lift > 1 / min_support = 0.01), so Gender is hard to regard as associated with cardio.


📌 2.3. BMI / Cardio

🚩 Some rules satisfy the criteria we set (min_confidence = 0.6 / Lift > 1 / min_support = 0.01), so BMI can be considered associated with cardio.

🚩 Listing this process for every attribute would not add much, so the results of computing support, confidence, and lift for all the remaining attributes are given below.


📌 2.4. Support / Confidence / Lift results

Attributes that satisfy min_confidence = 0.6 / Lift > 1 / min_support = 0.01:

  • [age, ap_hi, ap_lo, cholesterol, gluc, BMI]
  • Compared with the result of the earlier correlation analysis, the gluc attribute is newly included.

๐Ÿ ์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ํŒจํ„ด์„ ๋ถ„์„ํ•˜๊ธฐ ์œ„ํ•ด ์œ„์˜ ์„ธ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํ•˜์ง€๋งŒ ํ•œ ๊ฐ€์ง€ ์ƒ๊ฐํ•  ์ ์ด ์žˆ๋Š”๋ฐ, ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•œ ํŠธ๋žœ์žญ์…˜ ๋ฐ์ดํ„ฐ๊ฐ€ Null value์— ์˜ํ–ฅ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‹ค. ์ด ๊ฒฝ์šฐ ์ด๋ก ์ ์œผ๋กœ Null-invariant measure๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์•Œ๊ณ  ์žˆ์—ˆ๊ธฐ์—, ์šฐ๋ฆฌ๋Š” Kulczynski ๊ฐ’๊ณผ Imbalanced Ratio ๊ฐ’์„ ์‚ฌ์šฉํ•ด์„œ ํŒจํ„ด์„ ๋ถ„์„ํ–ˆ๋‹ค. ์ด ๊ณผ์ •์€ ๋‹ค์Œ๊ธ€์—์„œ ๋‹ค๋ฃจ๋„๋ก ํ•˜๊ฒ ๋‹ค.๐Ÿ˜Š๐Ÿ˜Š.

๐Ÿ ๊ฑฐ๋“ญ ๋งํ•˜์ง€๋งŒ, ์ด ํ”„๋กœ์ ํŠธ๋Š” ์ˆ˜์—…์—์„œ ๋ฐฐ์šด ๋‚ด์šฉ์„ ์ง์ ‘ ์‚ฌ์šฉํ•ด์„œ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ์ธ์‚ฌ์ดํŠธ๋ฅผ ๋ฝ‘์•„๋‚ด๊ธฐ ์œ„ํ•œ ํ”„๋กœ์ ํŠธ์ผ ๋ฟ์ด๋‹ค. ๋ฏธ์ˆ™ํ•˜์ง€๋งŒ, ํฅ๋ฏธ์žˆ๊ฒŒ ์ฝ์–ด์ฃผ์…จ์œผ๋ฉด ํ•œ๋‹ค.

