๐Ÿงฉ ์ด๋•Œ๊นŒ์ง€ ์ •๋ง ๋งŽ์€ Data Preprocessing ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์•˜๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ๊ทธ ๊ฐœ๋…๋“ค์„ ๊ฐ„๋‹จํ•˜๊ฒŒ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์œผ๋กœ Preprocessing์„ ๋งˆ๋ฌด๋ฆฌ์ง“๋„๋ก ํ•˜์ž.


1. Data Cleaning

  • Clean up the data by handling missing data, noisy data, outliers, inconsistencies, and so on.
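
As a rough illustration of the cleaning step, here is a minimal sketch that fills missing values with the median and flags outliers with the 1.5 × IQR rule. Both choices, the `clean` helper, and the sample numbers are assumptions made for this example, not the only way to clean data.

```python
import statistics

def clean(values):
    """Fill missing entries (None) with the median, then flag outliers by the 1.5*IQR rule."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]

    # Quartiles of the observed values; anything far outside [q1, q3] is flagged.
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in filled if v < low or v > high]
    return filled, outliers

# Hypothetical measurements with one missing value and one suspicious reading.
raw = [10, 12, None, 11, 13, 95, 12]
filled, outliers = clean(raw)
print(filled)
print(outliers)
```

Whether a flagged value should be dropped, capped, or kept depends on the data, so the sketch only reports it.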

2. Data Integration

  • Combine multiple datasets or databases.
  • Redundancy must be controlled : correlation analysis / covariance analysis

  • Categorical Data - correlation analysis : chi-square test

  • Numerical Data - variance analysis
    • covariance ($\sigma_{12}$) : range ($-\infty, +\infty$)
    • correlation ($\rho_{12}$) : range [$-1, 1$]
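
The redundancy checks above can be sketched in a few lines of plain Python. The function names and the sample numbers below are assumptions made for illustration. The covariance here is the population version (divided by n).

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two categorical attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def covariance(x, y):
    """Population covariance (divides by n)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Pearson correlation: covariance rescaled into [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical 2x2 contingency table for two categorical attributes.
stat = chi_square([[250, 200], [50, 1000]])

# Hypothetical numerical attributes; y_scaled is y in different units.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
y_scaled = [200, 400, 600, 800]
cov_small, cov_big = covariance(x, y), covariance(x, y_scaled)
corr_small, corr_big = correlation(x, y), correlation(x, y_scaled)
print(stat, cov_small, cov_big, corr_small, corr_big)
```

Rescaling `y` changes the covariance by the same factor but leaves the correlation unchanged, which is why the covariance has no fixed range while the correlation stays within [-1, 1].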

3. Data Reduction

  • Analyzing overly complex data costs a lot of time and money, so unnecessary attributes should be removed.
  • However, the analysis results before and after reduction should stay roughly the same.

  • Attribute Reduction (Dimensionality Reduction) - Subset selection, PCA

  • Observation Reduction (Numerosity Reduction)
    • Parametric model : Regression
    • Non-parametric model : Histogram / Clustering / Sampling
  • Data Compression - String Compression / Audio-Video Compression
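
To make numerosity reduction concrete, here is a minimal sketch using synthetic data: a fitted line replaces 100 observations with two parameters (parametric), while an equal-width histogram and a random sample keep a smaller non-parametric summary. The helper names, the bin count, and the data are assumptions made for this example. The histogram helper also assumes the values are not all identical.

```python
import random

def fit_line(x, y):
    """Least-squares slope and intercept: a parametric summary of the (x, y) pairs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def equal_width_histogram(values, bins):
    """Counts per equal-width bin: a non-parametric summary of the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Synthetic data: 100 observations lying exactly on y = 3x + 2.
x = list(range(100))
y = [3 * a + 2 for a in x]

slope, intercept = fit_line(x, y)        # 100 points reduced to 2 numbers
counts = equal_width_histogram(y, 5)     # 100 points reduced to 5 counts
sample = random.Random(0).sample(y, 10)  # simple random sample without replacement
print(slope, intercept, counts, sample)
```

Real data will not sit exactly on a line, so how much information survives depends on how well the chosen model or bin width fits.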

4. Dimensionality Reduction

  • Reduce the number of random variables to obtain a smaller set of principal variables
  • Remove irrelevant attributes

  • Feature Selection
    • Best subset selection
    • Forward stepwise selection
    • Backward stepwise selection
  • Feature Extraction - PCA
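
The two approaches can be contrasted with a small sketch: forward stepwise selection keeps a subset of the original attributes, while PCA builds a new axis from them. The code below is an illustration under assumptions, not a reference implementation. The helper names and toy data are made up, and the selection step scores each candidate by how much it lowers the residual sum of squares of a least-squares fit, computed by orthogonalization. The PCA part finds only the first component, by power iteration on the covariance matrix.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(col):
    mean = sum(col) / len(col)
    return [a - mean for a in col]

def forward_stepwise(features, target, k):
    """Greedily pick k feature names (k <= number of features), each time adding
    the one that most reduces the residual sum of squares."""
    resid_y = center(target)
    resid_x = {name: center(col) for name, col in features.items()}
    selected = []
    for _ in range(k):
        def rss_drop(name):
            x = resid_x[name]
            norm = dot(x, x)
            return dot(x, resid_y) ** 2 / norm if norm > 1e-12 else 0.0
        best = max(resid_x, key=rss_drop)
        selected.append(best)
        b = resid_x.pop(best)
        bb = dot(b, b)
        if bb <= 1e-12:
            continue
        def remove_b(v):
            # Subtract the part of v explained by the chosen feature.
            coef = dot(b, v) / bb
            return [vi - coef * bi for vi, bi in zip(v, b)]
        resid_y = remove_b(resid_y)
        resid_x = {name: remove_b(col) for name, col in resid_x.items()}
    return selected

def first_principal_component(columns, iterations=200):
    """Unit vector of the first principal component via power iteration.
    Assumes at least one column is not constant."""
    centered = [center(c) for c in columns]
    n = len(centered[0])
    cov = [[dot(a, b) / (n - 1) for b in centered] for a in centered]
    vec = [1.0] * len(columns)
    for _ in range(iterations):
        vec = [dot(row, vec) for row in cov]
        norm = math.sqrt(dot(vec, vec))
        vec = [v / norm for v in vec]
    return vec

# Toy data: the target depends on x1 and x2 only; "noise" is irrelevant.
features = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, 0, 1, 0, 1, 0],
    "noise": [0, 1, 0, 0, 1, 0],
}
target = [3 * a + 2 * b for a, b in zip(features["x1"], features["x2"])]
selected = forward_stepwise(features, target, k=2)

# Two perfectly correlated columns collapse onto a single direction.
pc = first_principal_component([[1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]])
print(selected, pc)
```

Backward stepwise selection would run the same idea in reverse, starting from all attributes and dropping the least useful one at each step.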

5. Data Transformation


๐Ÿงฉ ์ด๋ ‡๊ฒŒ ์ •๋ฆฌ๊ฐ€ ๋๋‚ฌ๋‹ค. ์ง€๋‚œ ํ•™๊ธฐ์— ๋ฐฐ์šฐ๋Š” ๋™์•ˆ์—๋Š” ๊ฐœ๋…๋„ ๋งŽ์ด ํ—ท๊ฐˆ๋ฆฌ๊ณ , ๋ญ๊ฐ€ ๋ญ”์ง€ ์ •ํ™•ํžˆ ์•Œ๊ธฐ๋„ ์–ด๋ ค์› ๋˜ ๊ฒƒ ๊ฐ™์€๋ฐ ๋ธ”๋กœ๊ทธ์— ์ •๋ฆฌ๋ฅผ ํ•˜๋ฉด์„œ ๋‚˜๋„ ๋ชฐ๋ž๋˜ ๊ฐœ๋…๋“ค์„ ๋‹ค์‹œ ์žก์•„๋‚˜๊ฐ€๋Š” ๊ธฐํšŒ๊ฐ€ ๋๋˜ ๊ฒƒ ๊ฐ™๋‹ค. ๋ธ”๋กœ๊ทธ ํฌ์ŠคํŒ… ๋‚ด์šฉ๋„ ๊ฝค๋‚˜ ๋งŽ์•˜๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ ๊ฐœ๋…๋“ค์— ๋Œ€ํ•œ ๋ธ”๋กœ๊ทธ ๋งํฌ๋ฅผ ์ฒจ๋ถ€ํ•  ํ…Œ๋‹ˆ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค๐Ÿ˜ƒ๐Ÿ˜ƒ.

๐Ÿ“ Preprocessing_1 : Data Cleaning
๐Ÿ“ Preprocessing_2 : Data Integration - chi-square test
๐Ÿ“ Preprocessing_3 : Data Integration - Numerical Data
๐Ÿ“ Preprocessing_4 : Data Reduction - Introduce
๐Ÿ“ Preprocessing_5 : Data Reduction - Linear Regression
๐Ÿ“ Preprocessing_6 : Data Reduction - Nonlinear Regression
๐Ÿ“ Preprocessing_7 : Data Reduction - Nonparametric
๐Ÿ“ Preprocessing_8 : Data Reduction - Dimensionality
๐Ÿ“ Preprocessing_9 : Data Reduction - Subset Selection
๐Ÿ“ Preprocessing_10 : Data Reduction - PCA
๐Ÿ“ Preprocessing_11 : Data Reduction - PCA_2
๐Ÿ“ Preprocessing_12 : Data Transformation

🧩 Starting with the next post, let's learn about pattern analysis 🏃‍♂️🏃‍♂️.


💡 This post is based on the lecture [Data Mining for Bioinformatics] (생명정보학을 위한 데이터마이닝) given by Professor 고윤희 of the Division of Biomedical Engineering, Hankuk University of Foreign Studies.
