🚨 The previous post examined the range of the blood pressure attribute, the only one that is not already classified into classes, and preprocessed it. In this post, let's check whether any of the remaining, already classified attributes contain noise.

🚨 원본 데이터 70000개의 object μ€‘μ—μ„œ Systolic / Diastolic Blood Presure Preprocessing 의 κ²°κ³Ό 남은 objectκ°€ 64500κ°œμ˜€λ‹€. λ”°λΌμ„œ 각 attribute의 classλ“€μ˜ 개수λ₯Ό ν•©μΉ˜λ©΄ 각각 64500 이 λ‚˜μ˜€λŠ”μ§€ ν™•μΈν•˜λŠ” λ°©λ²•μœΌλ‘œ noisy dataλ₯Ό νŒλ‹¨ν•  것이닀.

🚨 사싀, μƒλ‹Ήνžˆ λ‹¨μˆœλ¬΄μ‹ν•œ 방법이라고 μƒκ°ν•œλ‹€. 각 attribute κ°€ κ°€μ§€λŠ” 값을 ν™•μΈν•˜λŠ” .unique() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•΄μ„œ 이λ₯Ό μ‰½κ²Œ μ•Œ μˆ˜λ„ μžˆμ§€λ§Œ, ν”„λ‘œμ νŠΈμ˜ μ½”λ“œλ§ŒμœΌλ‘œλ„ 진행방ν–₯을 μ„€λͺ…ν•˜κΈ° μœ„ν•΄ μ΄λŸ¬ν•œ 방식을 μ‚¬μš©ν•˜μ˜€λ‹€. λ™μ‹œμ— 각 attribute의 class의 뢄포λ₯Ό μ•Œ 수 μžˆλ‹€λŠ” μž₯점 λ˜ν•œ μžˆλ‹€κ³  μƒκ°ν•˜μ˜€λ‹€πŸ˜€πŸ˜€.


🫀 3.1. Label (cardio) - Binary

print('cardiovascular : ', len(cardio[cardio['cardio']==1]))
print('Non-cardiovascular : ', len(cardio[cardio['cardio']==0]))
print('Cardio summation : ', len(cardio[cardio['cardio']==1]) + len(cardio[cardio['cardio']==0]))  
>>  
cardiovascular :  32146
Non-cardiovascular :  32354
Cardio summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.

🚨 We can also see that the distribution of the target we want to predict is close to 1:1.
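🚨 To put a number on "close to 1:1", here is a small sketch that turns the two counts printed above into shares. The `label_counts` Series is built by hand from that output. On the real DataFrame, `cardio['cardio'].value_counts(normalize=True)` would presumably give the same shares directly.

```python
import pandas as pd

# Counts copied by hand from the printed output of section 3.1.
label_counts = pd.Series({"cardiovascular": 32146, "non-cardiovascular": 32354})

# Share of each class in the 64500 remaining objects.
label_share = label_counts / label_counts.sum()
print(label_share.round(4))
```

Both shares land within about 0.2 percentage points of 50%, so the target is nearly balanced.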

🫀 3.2. Cholesterol - Ordinal

print('normal-cholesterol : ', len(cardio[cardio['cholesterol']==1]))
print('above normal-cholesterol : ', len(cardio[cardio['cholesterol']==2]))
print('well above normal-cholesterol : ', len(cardio[cardio['cholesterol']==3]))
print('Cholesterol summation : ', len(cardio[cardio['cholesterol']==1]) + len(cardio[cardio['cholesterol']==2]) + len(cardio[cardio['cholesterol']==3]))  
>>  
normal-cholesterol :  48461
above normal-cholesterol :  8583
well above normal-cholesterol :  7456
Cholesterol summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.

🫀 3.3. Glucose - Ordinal

print('normal-Glucose : ', len(cardio[cardio['gluc']==1]))
print('above normal-Glucose : ', len(cardio[cardio['gluc']==2]))
print('well above normal-Glucose : ', len(cardio[cardio['gluc']==3]))
print('Glucose summation : ', len(cardio[cardio['gluc']==1]) + len(cardio[cardio['gluc']==2]) + len(cardio[cardio['gluc']==3]))  
>>  
normal-Glucose :  54886
above normal-Glucose :  4673
well above normal-Glucose :  4941
Glucose summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.

🫀 3.4. Smoke / Non-smoke - Binary

print('Smoke : ', len(cardio[cardio['smoke']==1]))
print('Non-smoke : ', len(cardio[cardio['smoke']==0]))
print('Smoke summation : ', len(cardio[cardio['smoke']==1]) + len(cardio[cardio['smoke']==0]))  
>>  
Smoke :  5651
Non-smoke :  58849
Smoke summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.

🫀 3.5. Alcohol / Non-Alcohol - Binary

print('alcohol : ', len(cardio[cardio['alco']==1]))
print('Non-alcohol : ', len(cardio[cardio['alco']==0]))
print('Alcohol summation : ', len(cardio[cardio['alco']==1]) + len(cardio[cardio['alco']==0]))  
>>  
alcohol :  3422
Non-alcohol :  61078
Alcohol summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.

🫀 3.6. Active - Binary

print('Active : ', len(cardio[cardio['active']==1]))
print('Non-Active : ', len(cardio[cardio['active']==0]))
print('Active summation : ', len(cardio[cardio['active']==1]) + len(cardio[cardio['active']==0]))  
>>  
Active :  51825
Non-Active :  12675
Active summation :  64500  

🚨 각 case의 경우의 λ°μ΄ν„°μˆ˜μ˜ 합이 총 λ°μ΄ν„°μˆ˜μ™€ λ™μΌν•˜λ―€λ‘œ noisy dataκ°€ μ—†λ‹€.


🚨 각 attributeκ°€ noise 없이 κΉ”λ”ν•˜κ²Œ μ •λ¦¬λœ κ±Έ 확인할 수 μžˆμ—ˆλ‹€. μ΅œμ’…μ μœΌλ‘œ μ‚¬μš©ν•  데이터와 infoλ₯Ό ν™•μΈν•˜λŠ” κ²ƒμœΌλ‘œ 데이터 μ „μ²˜λ¦¬ 과정을 μ •λ¦¬ν•˜λ„λ‘ ν•˜κ² λ‹€πŸ˜ŠπŸ˜Š.

cardio  


cardio.info()  
>>  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 64500 entries, 0 to 69999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          64500 non-null  int64  
 1   gender       64500 non-null  int64  
 2   ap_hi        64500 non-null  int64  
 3   ap_lo        64500 non-null  int64  
 4   cholesterol  64500 non-null  int64  
 5   gluc         64500 non-null  int64  
 6   smoke        64500 non-null  int64  
 7   alco         64500 non-null  int64  
 8   active       64500 non-null  int64  
 9   cardio       64500 non-null  int64  
 10  BMI          64500 non-null  float64
dtypes: float64(1), int64(10)
memory usage: 5.9 MB  
