With the semester well under way, I seem to have less and less time for the blog. Between my major, my minor, and general-education courses, plus studying SQL for data analysis on the side, I have had no room to write up data mining material… 😥😥. Still, my major course also covers Linux, so I will try to organize the class material step by step and make the blog a bit more substantial.

In the previous post, I used support, confidence, and lift to explore the patterns between attributes. In this post, I will analyze the patterns with Null-Invariant Measures, which address some of the problems those figures can have.

To make the code easier to follow, let's first look at the dataframes used in this post.


📌 1. pre_tran : the original data, which mixed numeric and categorical attributes, converted entirely to categorical data


📌 2. transaction : the final transaction data built from pre_tran, in Boolean representation


1. The Concept of Null-Invariant Measures

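A null transaction is a transaction that contains none of the itemsets being examined. When the transaction dataframe holds many of them, measures such as lift can be distorted, so we use null-invariant metrics, whose values do not depend on the number of null transactions.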

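The measures used here are Jaccard, Kulczynski, and IR (imbalance ratio), together with chi-square and its p-value. Strictly speaking, Jaccard and Kulczynski are the null-invariant measures; IR is likewise unaffected by null transactions and complements Kulczynski, while chi-square and the p-value do depend on null transactions and serve as supporting statistics.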

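  • Jaccard : intersection / union. It is 1 when the two sets are identical and 0 when they share no element.
  • Kulczynski : 0.5 * ((intersection / X) + (intersection / Y)). It is 1 when the two sets are identical and 0 when they share no element.
    • 0.5 is the reference point: values above it suggest a positive relationship between the two sets, values below it a negative one.
  • IR : |X - Y| / union. It shows how imbalanced the two sets are: 0 when they are the same size, closer to 1 as the distribution becomes more skewed.
  • chi-square : tests the association between the data. The larger the value, the stronger the association between the sets, although the value alone does not say whether it is positive or negative.
  • p-value : the significance probability, used to judge whether the alternative hypothesis can be adopted, that is, whether the null hypothesis of independence can be rejected.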
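The definitions above can be sketched in a few lines of pandas. This is a minimal illustration rather than the code used for the cardio data: the function names `null_invariant_measures` and `lift`, and the toy columns `X` and `Y`, are made up for this example, and it assumes both items occur at least once so no division by zero happens. It also appends null transactions to show that the three measures stay put while lift does not.

```python
import pandas as pd

def null_invariant_measures(x: pd.Series, y: pd.Series) -> dict:
    """Jaccard, Kulczynski and IR for two Boolean item columns."""
    x = x.astype(bool)
    y = y.astype(bool)
    sup_x = int(x.sum())            # transactions containing X
    sup_y = int(y.sum())            # transactions containing Y
    sup_xy = int((x & y).sum())     # transactions containing both
    union = sup_x + sup_y - sup_xy  # transactions containing X or Y
    return {
        "jaccard": sup_xy / union,
        "kulczynski": 0.5 * (sup_xy / sup_x + sup_xy / sup_y),
        "ir": abs(sup_x - sup_y) / union,
    }

def lift(x: pd.Series, y: pd.Series) -> float:
    """Lift = P(X and Y) / (P(X) * P(Y)); depends on the total row count."""
    x = x.astype(bool)
    y = y.astype(bool)
    n = len(x)
    return ((x & y).sum() / n) / ((x.sum() / n) * (y.sum() / n))

# Hypothetical toy transactions, not the cardio dataset.
toy = pd.DataFrame({
    "X": [True, True, True, False, False],
    "Y": [True, True, False, True, False],
})

# Append 1000 null transactions (rows containing neither X nor Y).
nulls = pd.DataFrame({"X": [False] * 1000, "Y": [False] * 1000})
padded = pd.concat([toy, nulls], ignore_index=True)

before = null_invariant_measures(toy["X"], toy["Y"])
after = null_invariant_measures(padded["X"], padded["Y"])

print(before)
print(after)
print(lift(toy["X"], toy["Y"]), lift(padded["X"], padded["Y"]))
```

The two dictionaries come out identical because none of the three formulas uses the total number of transactions, whereas lift grows as null rows are added.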

In general, when chi-square is large and Kulczynski is small, the relationship can be called negative.

For example, if Kulczynski comes out at 0.1, IR at 0.98, and chi-square at around 111, the relationship between the two sets is negative and the data distribution is imbalanced.
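To make that reading concrete, here is a small sketch with hypothetical counts, chosen only so that Kulczynski lands near 0.1 and IR near 0.98. They are not taken from the cardio data, and the chi-square they produce will differ from the 111 mentioned above because chi-square depends on the actual counts. The sign of the association is read by comparing the observed co-occurrence count with the count expected under independence.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for illustration only (not from the cardio data).
n_xy = 20         # transactions with both X and Y
n_x_only = 9980   # X without Y
n_y_only = 80     # Y without X
n_neither = 9920  # null transactions

sup_x = n_xy + n_x_only
sup_y = n_xy + n_y_only
union = n_xy + n_x_only + n_y_only

kulc = 0.5 * (n_xy / sup_x + n_xy / sup_y)
ir = abs(sup_x - sup_y) / union

# 2x2 contingency table: rows = X present/absent, columns = Y present/absent.
table = [[n_xy, n_x_only], [n_y_only, n_neither]]
chi2, p, dof, expected = chi2_contingency(table)

# Observed co-occurrence below the count expected under independence
# points to a negative association.
negative = bool(n_xy < expected[0][0])

print(f"kulczynski={kulc:.3f}, IR={ir:.3f}, chi2={chi2:.1f}, p={p:.2g}, negative={negative}")
```

Here X is far more frequent than Y, which is what pushes IR toward 1, and the two co-occur less often than independence would predict, which is why a sizeable chi-square sits alongside a Kulczynski value well below 0.5.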


2. Computing chi-square

# Chi-square of each attribute in the original (categorized) cardio dataframe against cardio.
# pd.crosstab builds the contingency table for each attribute;
# scipy.stats.chi2_contingency computes chi-square and the p-value from it.
import pandas as pd
from scipy.stats import chi2_contingency

pre_tran_2 = pre_tran.drop('cardio', axis=1)
rows = {}

for f in pre_tran_2.columns:
    contingency = pd.crosstab(pre_tran_2[f], pre_tran['cardio'])
    chi, p, dof, expected = chi2_contingency(contingency)
    rows[f + " & cardio"] = {'chi': chi, 'p-value': p}

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
chi_list_origin = pd.DataFrame.from_dict(rows, orient='index')
chi_list_origin = chi_list_origin.sort_values('chi', ascending=False)

chi_list_origin


📌 As the dataframe above shows, this gives the chi-square value and the p-value between each attribute of the original dataframe and cardio. Comparing raw numbers is tedious, so let's visualize them.

# Visualize the chi-square of each attribute (categorical data) against cardio
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Scatter(
    x = chi_list_origin.index, y = chi_list_origin['chi'], mode = 'markers+lines+text',
    text = chi_list_origin['chi'].round(3), textposition = 'top right', textfont_size = 15))

fig.update_layout(
    {
        'title' : {'text':'Chi-Square Value per Attribute-Cardio Pair', 'font':{'size' : 25}},
        'xaxis' : {'showticklabels':True, 'tickfont' : {'size' : 15}},
        'template' : 'plotly_white'
    })

fig.show()

So with fairly short code we obtained the chi-square values and p-values easily. As we saw when going over the concepts in Section 1, however, the relationships cannot be settled from the chi-square value alone. I will therefore combine these values with the other Null-invariant measures for a fuller analysis.
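As a rough idea of what such a combined view could look like, the sketch below puts chi-square, the p-value, Kulczynski, and IR side by side for every Boolean column against a target column. It is only an assumption about how the pieces might fit together: the helper `measures_vs_target`, the toy dataframe, and its column names are hypothetical, and each item is assumed to occur at least once.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def measures_vs_target(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Chi-square, p-value, Kulczynski and IR of every Boolean column vs `target`."""
    y = df[target].astype(bool)
    rows = {}
    for col in df.columns.drop(target):
        x = df[col].astype(bool)
        sup_x, sup_y, sup_xy = int(x.sum()), int(y.sum()), int((x & y).sum())
        chi2, p, _, _ = chi2_contingency(pd.crosstab(x, y))
        rows[f"{col} & {target}"] = {
            "chi": chi2,
            "p-value": p,
            "kulczynski": 0.5 * (sup_xy / sup_x + sup_xy / sup_y),
            "IR": abs(sup_x - sup_y) / (sup_x + sup_y - sup_xy),
        }
    result = pd.DataFrame.from_dict(rows, orient="index")
    return result.sort_values("chi", ascending=False)

# Hypothetical Boolean transactions, not the real transaction dataframe.
toy = pd.DataFrame({
    "cardio":  [True, True, True, True, False, False, False, False],
    "high_bp": [True, True, True, False, False, False, False, True],
    "active":  [False, True, False, True, True, True, False, True],
})

summary = measures_vs_target(toy, "cardio")
print(summary)
```

Reading a row then follows Section 1: chi-square and the p-value say whether an association exists, Kulczynski relative to 0.5 hints at its direction, and IR shows how lopsided the pair is.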

In the next post, I will build an actual dataframe from the values these measures produce, visualize it, and carry the analysis through 😊😊.

