๐Ÿ ์ €๋ฒˆ ๊ธ€์— ์ด์–ด ์—ฌ๋Ÿฌ ๊ฐ€์ง€ Null invariant measure๋“ค์„ ํ†ตํ•ด attribute ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์•Œ์•„๋ณด๋„๋ก ํ•˜์ž.


1. Building the Null-Invariant Measure dataframe

📌 Computing chi-square and p-value

# Chi-square test per attribute category (each column of the transaction dataframe).
# pd.crosstab builds the contingency table of the column against 'Cardio';
# scipy.stats.chi2_contingency then computes the chi-square statistic and p-value from it.
import pandas as pd
from scipy.stats import chi2_contingency

chi_rows = {}
for col in transaction.columns:
    contingency = pd.crosstab(transaction[col], transaction['Cardio'])
    chi, p, dof, expected = chi2_contingency(contingency)
    chi_rows[col + " & cardio"] = {'chi': chi, 'p-value': p}
# DataFrame.append was removed in pandas 2.0, so the frame is built in one step.
chi_list = pd.DataFrame.from_dict(chi_rows, orient='index')
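
To make the chi-square step concrete, here is a minimal sketch on a tiny made-up table. The `toy` dataframe and its values are hypothetical, not the project data; only `pd.crosstab` and `scipy.stats.chi2_contingency` are taken from the code above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 1 means the row has the category, 0 means it does not.
toy = pd.DataFrame({
    'HBP_SYS': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    'Cardio':  [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# 2x2 contingency table of observed counts.
table = pd.crosstab(toy['HBP_SYS'], toy['Cardio'])

# chi2_contingency returns the statistic, p-value, degrees of freedom and expected counts.
chi, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi-square={chi:.3f}, p-value={p:.3f}, dof={dof}")
```

With only ten rows the p-value means little; the point is just to show which object goes in and which four values come out.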

📌 Computing support

# Support function.
# Takes the transaction data and an item (one column name) or a collection of items.
# A single name is wrapped in a list; a list or tuple is converted to a list (items_list).
# Summing the selected 0/1 columns across each row (axis=1) counts how many of the items that row contains.
# a: number of rows whose sum equals len(items_list), i.e. rows that contain every item in items_list.
# b: total number of transactions; a / b is the fraction of transactions in which the whole itemset occurs.
import numpy as np

def support(transaction, item):
    items_list = [item] if isinstance(item, str) else list(item)
    a = np.sum(transaction.loc[:, items_list].sum(axis=1) == len(items_list))
    b = transaction.loc[:, items_list].shape[0]
    return a / b
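
As a quick check of how the support function behaves, the sketch below runs it on a five-row one-hot table. The `toy` dataframe is hypothetical, and this version of `support` wraps a bare string in a list, which is an assumption about how single items should be handled.

```python
import numpy as np
import pandas as pd

def support(transaction, item):
    # Wrap a single column name; otherwise turn the collection into a list.
    items_list = [item] if isinstance(item, str) else list(item)
    # Rows whose selected 0/1 columns sum to len(items_list) contain every item.
    a = np.sum(transaction.loc[:, items_list].sum(axis=1) == len(items_list))
    b = transaction.shape[0]
    return a / b

# Hypothetical one-hot transactions (column names are placeholders).
toy = pd.DataFrame({
    'HBP_SYS': [1, 1, 0, 0, 1],
    'Cardio':  [1, 0, 0, 1, 1],
})

sup_x = support(toy, 'HBP_SYS')               # 3 of 5 rows contain HBP_SYS
sup_y = support(toy, ['Cardio'])              # 3 of 5 rows contain Cardio
sup_xy = support(toy, ('HBP_SYS', 'Cardio'))  # rows 0 and 4 contain both
print(sup_x, sup_y, sup_xy)
```

Note that this only works when the columns are 0/1 indicators, since the row sum is compared against the number of items.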

📌 Computing Jaccard, Kulczynski, and IR

# Use support() to get the support of set X, set Y, and their intersection (X and Y together),
# then plug those values into each measure's formula.
# The results are collected into a dataframe.
null_rows = {}
for col in transaction.columns:
    rslt01 = support(transaction, [col])
    rslt02 = support(transaction, ['Cardio'])
    rslt03 = support(transaction, [col, 'Cardio'])

    jacc = rslt03 / (rslt01 + rslt02 - rslt03)
    kulc = 0.5 * ((rslt03 / rslt01) + (rslt03 / rslt02))
    ir = abs(rslt01 - rslt02) / (rslt01 + rslt02 - rslt03)
    null_rows[col + " & cardio"] = {'Jaccard': jacc, 'kulczynski': kulc, 'IR': ir}
# DataFrame.append was removed in pandas 2.0, so the frame is built in one step.
Null_inv = pd.DataFrame.from_dict(null_rows, orient='index')
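
The three formulas can also be pulled out into one small function, which makes it easy to see why they are called null-invariant. The sketch below is illustrative only: the function name `null_invariant_measures` and the counts are made up, and the formulas are the same ones used in the loop above.

```python
def null_invariant_measures(sup_x, sup_y, sup_xy):
    """Return (Jaccard, Kulczynski, IR) from the supports of X, Y and X-and-Y."""
    union = sup_x + sup_y - sup_xy
    jaccard = sup_xy / union
    kulczynski = 0.5 * (sup_xy / sup_x + sup_xy / sup_y)
    ir = abs(sup_x - sup_y) / union
    return jaccard, kulczynski, ir

# Hypothetical counts: 100 transactions, X in 40, Y in 50, both in 30.
n = 100
base = null_invariant_measures(40 / n, 50 / n, 30 / n)

# Add 900 null transactions (containing neither X nor Y): only n changes.
n_padded = 1000
padded = null_invariant_measures(40 / n_padded, 50 / n_padded, 30 / n_padded)

print(base)
print(padded)
```

Every term is a ratio of supports, so the total transaction count cancels out and both calls agree up to floating-point rounding. Chi-square does not have this property, which is one reason to read it alongside these measures rather than on its own.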

๐Ÿ“Œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ

# Merge chi-square, p-value, Jaccard, Kulczynski, and IR into a single dataframe: Null_inv
Null_inv = pd.merge(Null_inv, chi_list, how = 'outer', left_index=True, right_index=True)
Null_inv = Null_inv.drop(['Cardio & cardio', 'No_cardio & cardio'], axis = 0)

# Sort by kulczynski in descending order
Null_inv = Null_inv.sort_values('kulczynski', ascending = False)
Null_inv


๐Ÿ ์ด๋ ‡๊ฒŒ ํ•ด์„œ ์ตœ์ข… ๊ฐ’๋“ค์„ ๋ชจ๋‘ ๊ตฌํ–ˆ๋Š”๋ฐ, ์ˆซ์ž๋„ ๋„ˆ๋ฌด ๋ณต์žกํ•˜๊ณ  ๊ทธ ํฌ๊ธฐ๋ฅผ ๋น„๊ตํ•˜๊ธฐ๋„ ํž˜๋“ค๋‹ค. ๋”ฐ๋ผ์„œ ์ด ์ค‘์—์„œ ๊ฐ€์žฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ž˜ ํ‘œํ˜„ํ•˜๊ธฐ๋กœ ์•Œ๋ ค์ง„ kulczynski์™€ IR์„ ์‹œ๊ฐํ™”ํ•ด์„œ attribute ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ถ„์„ํ•  ๊ฒƒ์ด๋‹ค.


2. Visualization

# Import the plotly library for visualization
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode()

# Plot the Kulczynski / IR values for each attribute-Cardio pair in the transaction data
fig = go.Figure()

fig.add_trace(
    go.Scatter(
    x = Null_inv.index, y = Null_inv['IR'], mode = 'markers+lines+text', name = 'IR'))
    
fig.add_trace(
    go.Scatter(
    x = Null_inv.index, y = Null_inv['kulczynski'], mode = 'markers+lines+text', name = 'Kulczynski'))
    
fig.update_layout(
    {
        'title' : {'text':'Kulczynski / IR values per Attribute-Cardio pair', 'font':{'size' : 25}},
        'template' : 'plotly_white'
    })
    
fig.add_hline(y=0.5, line_dash="dot",
              line_color = "#ff0000",
              annotation_text="Kulczynski = 0.5", 
              annotation_position="bottom right",
              annotation_font_size=17,
              annotation_font_color="black"
             )
             
fig.show()

3. Null-Invariant Measures results

  • A few things can be inferred from the Kulczynski, IR, and chi-square values.
    • (1) 30 & cardio: kulczynski = 0.109444, IR = 0.980692, chi-square = 111.024261, so the relationship between the two is negative and the distributions of the two sets are imbalanced.
    • (2) HBP_SYS & cardio: kulczynski = 0.640039, IR = 0.431196, chi-square = 10586.494635, so the relationship between the two is positive and the distributions of the two sets are also fairly balanced.
  • The Kulczynski values split around 0.5, and reading IR alongside them yields a few observations.
    • IR varies widely across the values within a single category, which shows that the sets are often unevenly distributed.
    • So the effect of the distribution relative to the overall data size (e.g. BMI -> HIGH_OBESITY) cannot be ruled out, which makes it hard to select attributes from Kulczynski and IR alone.
    • Even so, the attributes selected for now from Kulczynski and IR are:
    • [ap_hi, ap_lo, gluc, gender, active, age, cholesterol]
  • Chi-square is large overall; excluding the attributes whose chi-square is comparatively small leaves the following:
    • [age, ap_hi, ap_lo, cholesterol, gluc, active, BMI]
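
The reading rule used in examples (1) and (2) can be written down as a small helper. This is a rough sketch, not a standard procedure: the function name `describe_pair` is made up, the 0.5 split for Kulczynski comes from the discussion above, and the IR cutoff of 0.5 is purely an assumption chosen so that the two example pairs come out as described.

```python
def describe_pair(kulczynski, ir, ir_cutoff=0.5):
    """Label a pair using Kulczynski around 0.5 and IR against an assumed cutoff."""
    if kulczynski > 0.5:
        relation = "positive"
    elif kulczynski < 0.5:
        relation = "negative"
    else:
        relation = "neutral"
    # ir_cutoff is an illustrative assumption; there is no absolute threshold for IR.
    balance = "imbalanced" if ir > ir_cutoff else "fairly balanced"
    return relation, balance

# Values reported above for the two example pairs.
print(describe_pair(0.109444, 0.980692))  # 30 & cardio
print(describe_pair(0.640039, 0.431196))  # HBP_SYS & cardio
```

Since the cutoffs are relative, the labels are only useful for comparing pairs within the same dataset.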

๐Ÿ ์ด๋ ‡๊ฒŒ ํ•ด์„œ Null-invariant measure๋ฅผ ํ†ตํ•ด ๋ฐ์ดํ„ฐ์˜ ํŒจํ„ด์„ ํŒŒ์•…ํ•ด ๋ณด์•˜๋‹ค. ์‚ฌ์‹ค ์ด ๊ฐ’๋“ค์˜ ํฌ๊ธฐ๋ฅผ ํ†ตํ•ด ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ์—๋Š” ์ ˆ๋Œ€์ ์ธ ๊ธฐ์ค€์น˜๊ฐ€ ์กด์žฌํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์˜ attribute๋“ค ์‚ฌ์ด์˜ ๊ฐ’๋“ค๋งŒ ์ƒ๋Œ€ํ‰๊ฐ€ํ•ด์„œ ํŒจํ„ด์„ ๋ถ„์„ํ•˜์˜€๋‹ค.

๐Ÿ ์•„๋ฌด๋ž˜๋„ ์ƒ๋Œ€์ ์ธ ๊ฐ’๋“ค์„ ํ†ตํ•ด ๊ทธ ๊ด€๊ฒŒ๋ฅผ ๋น„๊ตํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ์•„์ฃผ ์ •ํ™•ํ•œ ํŒจํ„ด์ด๋ผ๊ณ  ๋งํ•˜๊ธฐ๋Š” ์–ด๋ ต๊ฒ ์ง€๋งŒ, ์ผ๋‹จ์€ ์ด๋ ‡๊ฒŒ ๋‚˜์˜จ ๊ฐ’๋“ค๊ณผ ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•  ๊ฒƒ์ด๋‹ค.


๐Ÿงฉ ์ด๋ฒˆ ํฌ์ŠคํŒ…๊ณผ ์ง€๋‚œ ์„ธ ๊ฐœ์˜ ํฌ์ŠคํŒ…์„ ํ†ตํ•ด์„œ ํ”„๋กœ์ ํŠธ ์ง„ํ–‰์„ ์œ„ํ•ด ์‚ฌ์šฉํ•œ ํŒจํ„ด ๋ถ„์„์— ๋Œ€ํ•ด์„œ ์•Œ์•„๋ณด์•˜๋‹ค. ํ˜น์‹œ๋‚˜ ๋” ๊ถ๊ธˆํ•˜์‹  ๋ถ„๋“ค์„ ์œ„ํ•ด ์•„๋ž˜์— ์ง€๋‚œ ๋งํฌ๋“ค์„ ๋‚จ๊ฒจ๋†“์„ ํ…Œ๋‹ˆ, ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค๐Ÿ˜Š๐Ÿ˜Š.

๐Ÿ“Project 1
๐Ÿ“Project 2
๐Ÿ“Project 3


🧩 It's exam season, so the blog will probably be quieter for a while 😥. I still plan to post the material I've been studying whenever I find the time. It's a modest blog, but thank you to everyone who reads it 🙇‍♂️.
