🧩 Data Mining (20) Data Preprocessing_13 : Review
🧩 So far we have covered a great many Data Preprocessing methods. In this post, let's wrap up Preprocessing with a brief review of those concepts.
1. Data Cleaning
- Clean up the data by handling missing data, noisy data, outliers, and inconsistencies.
2. Data Integration
- Combine multiple datasets or databases into one.
- Redundancy must be controlled : correlation analysis / covariance analysis
  - Categorical Data - correlation analysis : chi-square test
  - Numerical Data - covariance / correlation analysis
    - covariance ($\sigma_{12}$) : range $(-\infty, +\infty)$
    - correlation ($\rho_{12}$) : range $[-1, 1]$
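The redundancy checks listed above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method from the linked posts: the function names, the contingency table, and the sample attributes `x` and `y` are all made up for the example, and the covariance uses the population (divide by n) form.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def covariance(x, y):
    """Population covariance; unbounded, so its size depends on the units of x and y."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Pearson correlation: covariance rescaled so that it always lies in [-1, 1]."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical 2x2 contingency table for two categorical attributes.
table = [[250, 200], [50, 1000]]
chi2 = chi_square(table)

# Hypothetical numerical attributes; y is an exact linear function of x.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
print(chi2, cov_xy, corr_xy)
```

A chi-square value that is large compared with the critical value for (rows - 1) x (columns - 1) degrees of freedom suggests the two categorical attributes are related. A correlation close to +1 or -1 suggests one of the two numerical attributes is redundant.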
3. Data Reduction
- Analyzing overly complex data takes a lot of time and cost, so unnecessary attributes should be removed.
- However, the analysis results should stay nearly the same before and after reduction.
- Attribute Reduction (Dimensionality Reduction) - Subset selection, PCA
- Observation Reduction (Numerosity Reduction)
  - Parametric model : Regression
  - Non-parametric model : Histogram / Clustering / Sampling
- Data Compression - String Compression / Audio-Video Compression
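As a rough illustration of numerosity reduction, the sketch below contrasts a parametric approach (keep only the two coefficients of a fitted line instead of the observations) with a non-parametric one (keep a simple random sample). The data, the helper name `fit_line`, and the sample size are assumptions chosen for the example.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that happen to lie exactly on y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

# Parametric: the five (x, y) pairs are summarized by just two numbers.
slope, intercept = fit_line(x, y)

# Non-parametric: keep a simple random sample without replacement.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = list(zip(x, y))
sample = rng.sample(pairs, 3)

print(slope, intercept, sample)
```

The parametric version only works well when the assumed model actually fits the data. Sampling makes no such assumption, but it keeps a subset of the raw observations rather than a compact summary.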
4. Dimensionality Reduction
- Reduce the number of random variables to obtain a smaller set of principal variables.
- Remove irrelevant attributes
- Feature Selection
  - Best subset selection
  - Forward stepwise selection
  - Backward stepwise selection
- Feature Extraction - PCA
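To make feature extraction with PCA a little more concrete, here is a minimal sketch that finds only the first principal component by power iteration on the sample covariance matrix. It is an illustration under assumptions, not a full PCA: the data and function names are hypothetical, and it assumes the covariance matrix is non-zero and the starting vector is not orthogonal to the leading eigenvector.

```python
from math import sqrt


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical data with two strongly correlated attributes.
rows = [[2.0, 1.9], [0.0, 0.1], [-1.0, -1.1], [-1.0, -0.9]]
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)

# Project each observation onto pc1: two attributes become one extracted feature.
scores = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
print(pc1, scores)
```

Because the two attributes move together, most of the variance lies along a single direction, so the one-dimensional `scores` keep most of the information in the original two columns.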
5. Data Transformation
- A function that maps the entire set of attribute values to a new set of values
- Smoothing
- Attribute / Feature Construction
- Aggregation
- Normalization
  - Min-Max Normalization
  - Z-score Normalization
  - Decimal Scaling
- Discretization
  - Binning
  - Histogram / Clustering
  - Classification : Decision Tree
  - Correlation
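The three normalization methods and the simplest form of binning from the list above can be sketched as follows. The function names and sample values are assumptions made for the example. The z-score uses the population standard deviation, and the code assumes the values are not all identical.

```python
from math import sqrt


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v|) / 10^j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


def equal_width_bins(values, k):
    """Discretize into k equal-width intervals; returns a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


scaled = min_max([10, 20, 30])
standardized = z_score([2, 4, 6])
shifted = decimal_scaling([-986, 917])
bins = equal_width_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
print(scaled, standardized, shifted, bins)
```

Min-max normalization is sensitive to outliers because a single extreme value sets the range. Z-score normalization is often the safer choice when the true minimum and maximum are unknown.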
🧩 That wraps up the review. While I was taking the course last semester, many of these concepts were confusing and it was hard to know exactly what was what, but writing them up on the blog has been a chance to pin down the concepts I had not really understood. Since there were quite a few posts, I am attaching the link to the post for each concept below for reference.
- Preprocessing_1 : Data Cleaning
- Preprocessing_2 : Data Integration - chi-square test
- Preprocessing_3 : Data Integration - Numerical Data
- Preprocessing_4 : Data Reduction - Introduce
- Preprocessing_5 : Data Reduction - Linear Regression
- Preprocessing_6 : Data Reduction - Nonlinear Regression
- Preprocessing_7 : Data Reduction - Nonparametric
- Preprocessing_8 : Data Reduction - Dimensionality
- Preprocessing_9 : Data Reduction - Subset Selection
- Preprocessing_10 : Data Reduction - PCA
- Preprocessing_11 : Data Reduction - PCA_2
- Preprocessing_12 : Data Transformation
🧩 Starting with the next post, let's learn about pattern analysis.