🧩 Data Mining (27) Pattern Analysis_7 : Support/Confidence/Lift (2)
Previous post
From the transaction data built earlier, let's derive several patterns for each column.
To make the code easier to follow, let's first look at the dataframes used in this post.
1. pre_tran : the original data, which mixes numeric and categorical attributes, converted into all-categorical data
2. transaction : the final transaction data built from pre_tran, in Boolean form
2. Computing support / confidence / lift
The goal of pattern analysis is to find out which attributes frequently appear together with the cardio label.
This requires criteria for selecting attributes; the criteria for each measure are as follows.
Because we are looking for the relationship between cardio and the attributes that affect it, we will mainly look at rules that have cardio as the consequent.
- Priority order: confidence, lift, support
- min_confidence = 0.6 / Lift > 1 / min_support = 0.01
- The support threshold is set low to allow for rules that satisfy the confidence and lift criteria but whose antecedent support is so small that the overall support comes out low.
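As a quick illustration of how the three measures and the thresholds above fit together, here is a minimal sketch with made-up counts. The numbers and variable names are hypothetical and are not taken from the cardio data.

```python
# Hypothetical counts for one rule A -> C (not from the cardio dataset)
n_total = 1000  # all transactions
n_a = 200       # transactions containing the antecedent A
n_c = 500       # transactions containing the consequent C
n_ac = 150      # transactions containing both A and C

support = n_ac / n_total             # P(A and C)
confidence = n_ac / n_a              # P(C | A)
lift = confidence / (n_c / n_total)  # P(C | A) / P(C)

# Thresholds stated in this post
MIN_CONFIDENCE, MIN_LIFT, MIN_SUPPORT = 0.6, 1.0, 0.01
passes = (
    confidence >= MIN_CONFIDENCE
    and lift > MIN_LIFT
    and support >= MIN_SUPPORT
)
print(support, confidence, lift, passes)
```

With these counts the rule clears all three thresholds: C appears in 75% of the transactions that contain A, which is 1.5 times its base rate of 50%.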
Now let's analyze the patterns in earnest.
2.1. Age / Cardio
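# Build frequent itemsets with the Apriori algorithm, then generate association rules
# NumPy, pandas, Plotly, and TransactionEncoder are used further down in this post
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from mlxtend.preprocessing import TransactionEncoder
# Import apriori and association_rules from the mlxtend.frequent_patterns module
from mlxtend.frequent_patterns import apriori, association_rules
# Extract only the attributes of interest from the pre_tran dataframe
# Build a transaction dataframe over the categories of those attributes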
Age_C = pre_tran[['age','cardio']]
train_data_Age_C = np.array(Age_C)
train_data_Age_C = np.array(train_data_Age_C.tolist())
te = TransactionEncoder()
te_Age_C = te.fit(train_data_Age_C).transform(train_data_Age_C)
transaction_Age_C = pd.DataFrame(te_Age_C, columns=te.columns_)
# apriori and association_rules from the mlxtend.frequent_patterns module
# apriori() : computes the support of itemsets and returns them as a dataframe; only itemsets that meet min_support are returned
# association_rules() : taking the direction of antecedents and consequents into account, returns a dataframe with support, confidence, lift, leverage, and conviction
# metric and min_threshold options of association_rules()
#   : only rules whose chosen metric is at least min_threshold are returned
frequent_itemsets_Age_C = apriori(transaction_Age_C, min_support=0.000001,use_colnames=True)
frequent_itemsets_Age_C = frequent_itemsets_Age_C.sort_values('support',ascending = False)
rule_Age_C = association_rules(frequent_itemsets_Age_C, metric="lift", min_threshold=0)
# To study the relationship between cardio and the attributes that affect it, keep only the rules whose consequents contain cardio
# Sort the output by confidence in descending order
y = rule_Age_C['consequents'].apply(lambda consequent: "Cardio" in str(consequent))
y = y[y==True].index
rule_Age_C_cardio = rule_Age_C.loc[y].sort_values('confidence', ascending = False)
rule_Age_C_cardio
The resulting pattern dataframe is shown below. Judging from this table, the Age attribute has cases that satisfy the criteria of min_confidence = 0.6 / Lift > 1 / min_support = 0.01, so it can be considered associated with cardio.
In fact, this process will be repeated for more than ten attributes, and searching through a table each time is tedious. To make this a little easier, I thought it would be a good idea to compare everything at once through a visualization.
fig = go.Figure()
# Scatter trace for support
fig.add_trace(
go.Scatter(
x = ['60','50','40','30'], y = rule_Age_C_cardio['support'], name = 'Support',
mode = 'markers+text', marker_size = 20,
text = rule_Age_C_cardio['support'].round(3), textposition = 'middle right', textfont_size = 15))
# Scatter trace for confidence
fig.add_trace(
go.Scatter(
x = ['60','50','40','30'], y = rule_Age_C_cardio['confidence'], name = 'Confidence',
mode = 'markers+text', marker_size = 20,
text = rule_Age_C_cardio['confidence'].round(3), textposition = 'middle right', textfont_size = 15))
# Scatter trace for lift
fig.add_trace(
go.Scatter(
x = ['60','50','40','30'], y = rule_Age_C_cardio['lift'], name = 'Lift',
mode = 'markers+text', marker_size = 20,
text = rule_Age_C_cardio['lift'].round(3), textposition = 'middle right', textfont_size = 15))
# Horizontal line marking the lift criterion (Lift > 1)
fig.add_hline(y=1, line_dash="dot",
line_color = "#1dd1ad",
annotation_text=" Lift > 1",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
# Horizontal line marking the support criterion (support > 0.01)
fig.add_hline(y=0.01, line_dash="dot",
line_color = "#8c8cf5",
annotation_text=" support > 0.01",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
# Horizontal line marking the confidence criterion (confidence > 0.6)
fig.add_hline(y=0.6, line_dash="dot",
line_color = "#ff0000",
annotation_text=" confidence > 0.6",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
fig.update_layout(
{
'title' : {'text':'Support, Confidence, and Lift by Age-Cardio', 'font':{'size' : 25}},
'xaxis' : {'title':{'text' : 'Age', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
'yaxis' : {'title':{'text' : 'Support / Confidence / Lift', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
'template' : 'plotly_dark'
})
fig.show()
As the chart above shows, a horizontal line is drawn at the threshold of each measure so that the result can be read at a glance. Now let's analyze the other attributes.
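The plotting code hard-codes the x-axis labels, which only stays correct while the rows of the rule table happen to come out in that order. A minimal sketch of one alternative is to build each label from the rule's own antecedent set. The helper name `antecedent_label` and the sample rows are hypothetical; the only assumption about mlxtend is that each antecedent is a frozenset of item names.

```python
def antecedent_label(itemset):
    """Join the items of one antecedent set into a stable, readable axis label."""
    return ", ".join(sorted(str(item) for item in itemset))

# Hypothetical rule rows, already sorted by confidence as in the post
rules = [
    {"antecedents": frozenset({"60"}), "confidence": 0.68},
    {"antecedents": frozenset({"50"}), "confidence": 0.55},
    {"antecedents": frozenset({"40"}), "confidence": 0.40},
]

# Labels now follow the row order, whatever that order turns out to be
x_labels = [antecedent_label(rule["antecedents"]) for rule in rules]
print(x_labels)
```

With the dataframe from this post, the equivalent would presumably be `rule_Age_C_cardio['antecedents'].apply(antecedent_label)`, passed as `x` to each `go.Scatter` trace.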
2.2 Gender / Cardio
# Same as above
Gender_C = pre_tran[['gender','cardio']]
train_data_Gender_C = np.array(Gender_C)
train_data_Gender_C = np.array(train_data_Gender_C.tolist())
te = TransactionEncoder()
te_Gender_C = te.fit(train_data_Gender_C).transform(train_data_Gender_C)
transaction_Gender_C = pd.DataFrame(te_Gender_C, columns=te.columns_)
frequent_itemsets_Gender_C = apriori(transaction_Gender_C, min_support=0.0000001,use_colnames=True)
frequent_itemsets_Gender_C = frequent_itemsets_Gender_C.sort_values('support',ascending = False)
rule_Gender_C = association_rules(frequent_itemsets_Gender_C, metric="lift", min_threshold=0)
y = rule_Gender_C['consequents'].apply(lambda rule_Gender_C: "Cardio" in str(rule_Gender_C))
y = y[y==True].index
rule_Gender_C_cardio = rule_Gender_C.loc[y].sort_values('confidence', ascending = False)
# Visualization
fig = go.Figure()
fig.add_trace(
go.Scatter(
x = ['Women','Men'], y = rule_Gender_C_cardio['support'], name = 'Support',
mode = 'markers+text', marker_size = 20,
text = rule_Gender_C_cardio['support'].round(3), textposition = 'middle right', textfont_size = 15))
fig.add_trace(
go.Scatter(
x = ['Women','Men'], y = rule_Gender_C_cardio['confidence'], name = 'Confidence',
mode = 'markers+text', marker_size = 20,
text = rule_Gender_C_cardio['confidence'].round(3), textposition = 'middle right', textfont_size = 15))
fig.add_trace(
go.Scatter(
x = ['Women','Men'], y = rule_Gender_C_cardio['lift'], name = 'Lift',
mode = 'markers+text', marker_size = 20,
text = rule_Gender_C_cardio['lift'].round(3), textposition = 'middle right', textfont_size = 15))
fig.add_hline(y=1, line_dash="dot",
line_color = "#1dd1ad",
annotation_text=" Lift > 1",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
fig.add_hline(y=0.01, line_dash="dot",
line_color = "#8c8cf5",
annotation_text=" support > 0.01",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
fig.add_hline(y=0.6, line_dash="dot",
line_color = "#ff0000",
annotation_text=" confidence > 0.6",
annotation_position="bottom left",
annotation_font_size=17,
annotation_font_color="gray"
)
fig.update_layout(
{
'title' : {'text':'Support, Confidence, and Lift by Gender-Cardio', 'font':{'size' : 25}},
'xaxis' : {'title':{'text' : 'Gender', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
'yaxis' : {'title':{'text' : 'Support / Confidence / Lift', 'font':{'size' : 20}}, 'showticklabels':True, 'tickfont' : {'size' : 15}},
'template' : 'plotly_dark'
})
fig.show()
No case satisfies the criteria of min_confidence = 0.6 / Lift > 1 / min_support = 0.01, so it is hard to say that gender is associated with cardio.
2.3. BMI / Cardio
Some cases satisfy the criteria of min_confidence = 0.6 / Lift > 1 / min_support = 0.01, so BMI can be considered associated with cardio.
Listing this process attribute by attribute would add little. The results of computing support, confidence, and lift for all the remaining attributes in the same way are given below.
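Since the same steps repeat for every attribute, one way to avoid copying the code is a small helper. The sketch below is not the mlxtend pipeline used in this post: for rules with a single attribute value as antecedent and cardio as consequent, it simply counts rows, which should give the same support, confidence, and lift. The function names, the record layout, and the label strings are all hypothetical.

```python
from collections import Counter

def rules_to_target(records, attribute, target="cardio", target_value="Cardio"):
    """Measures for each rule {attribute value} -> {target value}, by direct counting."""
    n = len(records)
    n_target = sum(1 for r in records if r[target] == target_value)
    value_counts = Counter(r[attribute] for r in records)
    both_counts = Counter(r[attribute] for r in records if r[target] == target_value)
    rows = []
    for value, n_value in value_counts.items():
        confidence = both_counts[value] / n_value
        rows.append({
            "antecedent": value,
            "support": both_counts[value] / n,
            "confidence": confidence,
            "lift": confidence / (n_target / n),
        })
    # Highest confidence first, as in the post
    return sorted(rows, key=lambda row: row["confidence"], reverse=True)

def passes(row, min_conf=0.6, min_lift=1.0, min_support=0.01):
    """Apply the three thresholds used in this post to one rule."""
    return (row["confidence"] >= min_conf
            and row["lift"] > min_lift
            and row["support"] >= min_support)

# Hypothetical records, not the cardio dataset
records = [
    {"age": "60", "gender": "Men", "cardio": "Cardio"},
    {"age": "60", "gender": "Men", "cardio": "Cardio"},
    {"age": "60", "gender": "Women", "cardio": "Cardio"},
    {"age": "60", "gender": "Women", "cardio": "No cardio"},
    {"age": "40", "gender": "Men", "cardio": "No cardio"},
    {"age": "40", "gender": "Men", "cardio": "No cardio"},
    {"age": "40", "gender": "Women", "cardio": "Cardio"},
    {"age": "40", "gender": "Women", "cardio": "No cardio"},
]

# Keep every attribute that has at least one rule passing all thresholds
selected = [attr for attr in ("age", "gender")
            if any(passes(row) for row in rules_to_target(records, attr))]
print(selected)
```

The target is matched by equality rather than by substring, so a label such as "No cardio" cannot be mistaken for the positive class.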
2.4. Support / Confidence / Lift results
Attributes that satisfy min_confidence = 0.6 / Lift > 1 / min_support = 0.01:
- [age, ap_hi, ap_lo, cholesterol, gluc, BMI]
- Compared with the correlation result obtained earlier, the gluc attribute is additionally included.
This post used the three measures above to analyze patterns. There is one point to consider, however: the transaction data we used can be affected by null transactions, that is, transactions containing none of the items in question. As far as I know, the theory recommends null-invariant measures in this case, so we also analyzed the patterns with the Kulczynski measure and the Imbalance Ratio. That process will be covered in the next post.
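Pending that write-up, here is a rough sketch of the textbook definitions, intended only to show why null transactions matter. The counts are hypothetical. Adding transactions that contain neither A nor B changes lift, while the Kulczynski measure and the Imbalance Ratio depend only on the counts of A, B, and both, so they stay the same.

```python
def lift(n_a, n_b, n_ab, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); depends on the total count."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

def kulczynski(n_a, n_b, n_ab):
    """Kulc(A, B) = 0.5 * (P(A | B) + P(B | A)); the total count cancels out."""
    return 0.5 * (n_ab / n_b + n_ab / n_a)

def imbalance_ratio(n_a, n_b, n_ab):
    """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# Hypothetical counts: A in 200 transactions, B in 500, both in 150
n_a, n_b, n_ab = 200, 500, 150

# The second total adds 9000 null transactions (neither A nor B)
for n_total in (1000, 10000):
    print(n_total,
          lift(n_a, n_b, n_ab, n_total),
          kulczynski(n_a, n_b, n_ab),
          imbalance_ratio(n_a, n_b, n_ab))
```

In this toy case lift rises tenfold once the null transactions are added, even though the relationship between A and B has not changed.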
To repeat, this project is simply an exercise in applying what we learned in class to draw insights from data. It is rough in places, but I hope you found it interesting.