🧩 This is the last post in the Distance Measure series 🙄. Let's look at Cosine Similarity, the measure used for document (term) frequency data.

1. Cosine Similarity of two vectors

🧩 You may be wondering what Document Frequency means. Newspaper or online news articles are the most typical example. Entertainment articles have their own set of frequently used terms, and sports articles have theirs. The purpose of Cosine Similarity is to compute the similarity between two different articles and see how alike they are. It is not limited to text similarity, either: the same measure can be applied to genomic analysis, so it is increasingly used in domains such as gene features and biological taxonomy.

🧩 Let's start with an example.

👉 In this example, each Document is a newspaper article, and each column holds the frequency of one term appearing in that article. To analyze these frequencies, we represent each document's term counts as a vector. The vectors are written as follows.

$\overrightarrow{d_{1}}=[5,0,3,0,2,0]$


$\overrightarrow{d_{2}}=[3,0,2,0,1,1]$


$\overrightarrow{d_{3}}=[0,7,0,2,1,0]$


$\overrightarrow{d_{4}}=[0,1,0,0,1,2]$


From here on, we will call these vectors term-frequency vectors. Now let's look at the measure used to compute the similarity between them.

๐Ÿ“ Cosine Measure

For two vectors $\overrightarrow{d_{1}}$, $\overrightarrow{d_{2}}$ (where both are term-frequency vectors):

$cos(\overrightarrow{d_{1}}, \overrightarrow{d_{2}}) = \frac{\overrightarrow{d_{1}}\cdot\overrightarrow{d_{2}}}{|\overrightarrow{d_{1}}|\times|\overrightarrow{d_{2}}|}$


👉 With vectors and a dot product in it, the formula may look complicated at first glance, but it is simply a measure derived from the definition of the dot product. The dot product equals the product of the two vectors' magnitudes multiplied by the cosine of the angle θ between them, so the formula above is just that identity with the magnitudes moved to the other side.
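Written out step by step, the rearrangement looks like this (a short sketch, assuming neither vector is the zero vector, so the division is defined; θ is the angle between the two vectors):

$\overrightarrow{d_{1}}\cdot\overrightarrow{d_{2}} = |\overrightarrow{d_{1}}|\times|\overrightarrow{d_{2}}|\times\cos\theta$

$\cos\theta = \frac{\overrightarrow{d_{1}}\cdot\overrightarrow{d_{2}}}{|\overrightarrow{d_{1}}|\times|\overrightarrow{d_{2}}|} = cos(\overrightarrow{d_{1}}, \overrightarrow{d_{2}})$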

👉 If you picture the cosine graph, the cosine value grows as θ gets smaller. So a larger cosine similarity $cos(\overrightarrow{d_{1}}, \overrightarrow{d_{2}})$ means a smaller angle θ between the two vectors, which means the two vectors are close to each other. In short: since term frequencies are never negative, the value lies between 0 and 1, and the closer it is to 1, the more similar the two documents are.
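The formula above can be sketched in a few lines of plain Python. This is only a minimal illustration, not code from the lecture: the function name `cosine_similarity` and the small test vectors are made up for this example, and the zero-vector check is an added assumption because the formula is undefined when a magnitude is 0.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))        # d1 . d2
    norm_a = math.sqrt(sum(x * x for x in a))     # |d1|
    norm_b = math.sqrt(sum(y * y for y in b))     # |d2|
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat a zero vector as an error, since the ratio is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction (theta = 0): the value is 1 (up to rounding), whatever the lengths.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No shared terms (theta = 90 degrees): the value is 0.
no_overlap = cosine_similarity([1, 0, 0], [0, 3, 0])

print(same_direction, no_overlap)
```

Note that doubling every count in a document does not change the result, which is why the measure compares the mix of terms rather than the length of the article.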

🧩 Let's wrap up by computing the cosine similarity by hand for the example 😉.

$\overrightarrow{d_{1}}=[5,0,3,0,2,0]$


$\overrightarrow{d_{2}}=[3,0,2,0,1,1]$


$\overrightarrow{d_{3}}=[0,7,0,2,1,0]$


$\overrightarrow{d_{4}}=[0,1,0,0,1,2]$


For $cos(\overrightarrow{d_{1}}, \overrightarrow{d_{2}})$:

$\overrightarrow{d_{1}}\cdot\overrightarrow{d_{2}} = 15+0+6+0+2+0=23$

$|\overrightarrow{d_{1}}| = \sqrt{25+0+9+0+4+0} = \sqrt{38}$

$|\overrightarrow{d_{2}}| = \sqrt{9+0+4+0+1+1} = \sqrt{15}$

$cos(\overrightarrow{d_{1}}, \overrightarrow{d_{2}})=\frac{23}{\sqrt{38}\times\sqrt{15}} \approx 0.963$

  • Because the cosine similarity is a large value close to 1, the two vectors can be considered close to each other.
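To check the hand calculation and see how the other document pairs compare, here is a small sketch in plain Python. The dictionary `docs`, the helper `cosine_similarity`, and the `pairwise` table are names chosen for this illustration only; the vectors are the four term-frequency vectors from the example.

```python
import math
from itertools import combinations

# Term-frequency vectors from the example.
docs = {
    "d1": [5, 0, 3, 0, 2, 0],
    "d2": [3, 0, 2, 0, 1, 1],
    "d3": [0, 7, 0, 2, 1, 0],
    "d4": [0, 1, 0, 0, 1, 2],
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Cosine similarity for every unordered pair of documents.
pairwise = {
    (p, q): cosine_similarity(docs[p], docs[q])
    for p, q in combinations(docs, 2)
}

# Print the pairs from most similar to least similar.
for (p, q), sim in sorted(pairwise.items(), key=lambda kv: kv[1], reverse=True):
    print(f"cos({p}, {q}) = {sim:.3f}")
```

The pair (d1, d2) comes out on top at about 0.963, matching the result above.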

2. Distance Measure Summary

🧩 I have listed the related Distance Measure posts below; if you need them, they should help you review the series 😉.


๐Ÿ“ 1. QQ plot / Scatter plot ๊ด€๋ จ ํฌ์ŠคํŒ…
๐Ÿ“ 2. Distance Matrix ๊ด€๋ จ ํฌ์ŠคํŒ…
๐Ÿ“ 3. Categorical / Binary Attributes ๊ด€๋ จ ํฌ์ŠคํŒ…
๐Ÿ“ 4. Numeric Data - Minkowski Distance ๊ด€๋ จ ํฌ์ŠคํŒ…


🧩 That covers all of the distance measures. Some of the formulas look complicated and some of the concepts are easy to mix up, but as long as you know which measure goes with which type of data, computing the distance itself should not be hard at all. After all, most of these measures are already implemented in Python or R 😀😀 haha.

🧩 Starting with the next post, let's look at methods for Data Preprocessing!!


💡 This post is based on the lecture [Data Mining for Bioinformatics] by Professor 고윤희 of the Division of Biomedical Engineering, Hankuk University of Foreign Studies.
