🚨 In the previous post, we loaded the data and mapped out the direction the project will take. In this post, I'll cover preprocessing the data to turn it into something usable.

🚨 First, let's take one more look at what the data looks like!


🫀 Data Info

🚨 The data looks as shown above. Let's find out what information it holds.

cardio.info()
>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64   
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB

🚨 As the output above shows, the data covers 70,000 people. Every attribute except weight has the int64 dtype (weight is float64), and fortunately there are no missing values at all.
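🚨 The same two observations (no missing values, one non-integer column) can also be checked programmatically instead of by reading the `info()` output. The sketch below is only an illustration: `sample` is a made-up three-row frame standing in for `cardio`, and its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real `cardio` frame (made-up values)
sample = pd.DataFrame({
    "age": [18250, 21900, 20075],
    "weight": [68.0, 80.0, 72.5],
    "ap_hi": [120, 140, 130],
}).astype({"age": "int64", "ap_hi": "int64"})

# Missing values per column; all zeros means there is nothing to impute
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Columns whose dtype is not int64
non_int_columns = sample.select_dtypes(exclude="int64").columns.tolist()
print(non_int_columns)
```

On the real data, running the same two checks on `cardio` should agree with what `cardio.info()` reports.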

🫀 Data adjustment

🚨 I assumed the weight and height attributes are in the data to gauge how overweight each subject is, so I decided to replace them with a more widely used measure: BMI.

🚨 I also noticed that the age values are very large. Age seems to be expressed as the number of days lived, so I divided each value by 365 and rounded it to the nearest whole year.

🚨 After that, I dropped the attributes that are no longer needed: id, height, and weight. The code is below.

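# age is stored in days: convert to years and round to the nearest whole year
cardio['age'] = (cardio['age'] / 365).round().astype('int64')
# height is divided by 100 (centimetres to metres) for the BMI formula
cardio['height'] = cardio['height'] / 100
# BMI = weight / height^2, rounded to two decimal places
cardio['BMI'] = (cardio['weight'] / cardio['height']**2).round(2)
# drop the columns that are no longer needed
cardio = cardio.drop(['id', 'height', 'weight'], axis=1)
cardio.head()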
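
🚨 To make the two conversions concrete, here is a small worked example on made-up rows. `toy` and its values are hypothetical and only for illustration, and the units (days, centimetres, kilograms) are my assumption about how the dataset is encoded.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "age": [18250, 21900],    # days (assumed unit)
    "height": [170, 160],     # centimetres (assumed unit)
    "weight": [68.0, 80.0],   # kilograms (assumed unit)
})

# 18250 / 365 = 50.0 and 21900 / 365 = 60.0
toy["age"] = (toy["age"] / 365).round().astype("int64")

# BMI = weight in kg divided by the square of height in metres
height_m = toy["height"] / 100
toy["BMI"] = (toy["weight"] / height_m**2).round(2)

print(toy[["age", "BMI"]])
```

The first row works out to 68 / 1.7² ≈ 23.53 and the second to 80 / 1.6² = 31.25. Dividing by 365 ignores leap days, which is harmless once the result is rounded to whole years.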


🫀 Blood pressure preprocessing

🚨 Blood pressure is what most of us would expect to have the biggest influence on cardiovascular disease, so I wanted to take a closer look at the blood pressure columns. I therefore checked the summary statistics of the ap_hi (systolic blood pressure) and ap_lo (diastolic blood pressure) attributes.

pd.DataFrame({'ap_hi' : cardio['ap_hi'].describe(), 'ap_lo' : cardio['ap_lo'].describe()})  
>>
              ap_hi         ap_lo
count  70000.000000  70000.000000
mean     128.817286     96.630414
std      154.011419    188.472530
min     -150.000000    -70.000000
25%      120.000000     80.000000
50%      120.000000     80.000000
75%      140.000000     90.000000
max    16020.000000  11000.000000

🚨 Even though these attributes are blood pressure readings, some values are negative and the maximum values exceed 10,000. To see how widely such values are spread, I drew box plots of ap_hi and ap_lo.

# Create the box plots with the plotly library
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Box(y=cardio['ap_hi'], name = 'ap_hi'))
fig.add_trace(go.Box(y=cardio['ap_lo'], name = 'ap_lo'))

fig.show()


🚨 The box plot in this post is only a screenshot of the rendered figure, so the hover details don't show here. When the figure is drawn with iplot, though, it responds to the mouse cursor and reveals the upper fence / lower fence of each box. They are summarized below.

  • ap_hi : upper_fence = 170 / lower_fence = 90
  • ap_lo : upper_fence = 105 / lower_fence = 65
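
🚨 These numbers are consistent with the conventional 1.5 × IQR rule applied to the quartiles in the describe() table above. The sketch below assumes that rule; `iqr_fences` is a hypothetical helper, not a plotly function. As far as I understand, the fences plotly shows on hover are the outermost data points inside that range, so on other data they may not match the formula exactly.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences lying k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles (25% and 75%) taken from the describe() table above
ap_hi_fences = iqr_fences(120, 140)
ap_lo_fences = iqr_fences(80, 90)

print(ap_hi_fences)  # (90.0, 170.0)
print(ap_lo_fences)  # (65.0, 105.0)
```

For ap_hi the IQR is 140 - 120 = 20, so the fences sit 30 below Q1 and 30 above Q3. For ap_lo the IQR is 10, giving a margin of 15 on each side.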

🚨 I'll now preprocess the data by treating values above the upper fence or below the lower fence as outliers and removing them. In addition, systolic pressure should normally be higher than diastolic pressure, so rows where ap_hi is lower than ap_lo will be dropped as well.

# Drop rows where ap_hi (systolic pressure) is lower than ap_lo (diastolic pressure)
# Collect the index of the rows that match the condition (a pandas Index object)
# Pass that index to .drop() to remove the rows

low_drop_index = cardio[(cardio['ap_hi'] < cardio['ap_lo'])].index
cardio = cardio.drop(low_drop_index).copy()

# ap_hi preprocessing
# Same approach as above: drop rows outside the fences [90, 170]

drop_index_sys = cardio[(cardio['ap_hi'] < 90) | (cardio['ap_hi'] > 170)].index
cardio = cardio.drop(drop_index_sys).copy()

# ap_lo preprocessing
# Same approach as above: drop rows outside the fences [65, 105]

drop_index_dias = cardio[(cardio['ap_lo'] < 65) | (cardio['ap_lo'] > 105)].index
cardio = cardio.drop(drop_index_dias).copy()
len(cardio)
64500
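
🚨 As a side note, the three drop steps above can also be written as one boolean mask that keeps the valid rows. This is just an alternative sketch on made-up readings (`toy` is hypothetical), not the code used in this project. It relies on `Series.between`, which includes both endpoints by default, so it keeps the same rows that the `<` / `>` conditions above leave untouched.

```python
import pandas as pd

# Made-up readings for illustration only
toy = pd.DataFrame({
    "ap_hi": [120, 80, 200, 100, 130],
    "ap_lo": [80, 120, 90, 60, 85],
})

keep = (
    (toy["ap_hi"] >= toy["ap_lo"])      # systolic not below diastolic
    & toy["ap_hi"].between(90, 170)     # inside the ap_hi fences (inclusive)
    & toy["ap_lo"].between(65, 105)     # inside the ap_lo fences (inclusive)
)
filtered = toy[keep].copy()
print(filtered)
```

Only the first and last rows survive: the second has ap_hi below ap_lo, the third has ap_hi above 170, and the fourth has ap_lo below 65.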

🚨 After this blood pressure preprocessing, 64,500 of the original 70,000 rows remain. Let's draw box plots of the preprocessed data.

# Create the box plots with the plotly library
fig = go.Figure()
fig.add_trace(go.Box(y=cardio['ap_hi'], name = 'ap_hi'))
fig.add_trace(go.Box(y=cardio['ap_lo'], name = 'ap_lo'))
fig.show()



🚨 That completes the preprocessing of the main attributes. In the next post, let's check whether the other attributes also contain noisy data or values that don't fit their definition.

