🧩 In the previous post we looked at how to build a matrix that represents the distances between objects. Now we turn to the distance measures themselves, which are applied differently depending on the data type of each feature. We start with the measures for categorical and binary features.


1. Categorical Attributes - Nominal

  • Simple Matching
    • The first method is simple matching. With this method, the distance between two objects is expressed as follows.
    $d(i,j)=\frac{(p-m)}{p}$


    • Here $m$ is the number of features on which the two objects have the same value, and $p$ is the total number of features.
    • The formula alone is not easy to grasp, so let's look at an example.

In the example above, students 2 and 3 have the same Blood Type but different Hair Color, so their distance is as follows.

$d(s2,s3)=\frac{(2-1)}{2}=\frac{1}{2}$


๋ฐ˜๋ฉด student 2์™€ student 4๋Š” ๋‘ feature๊ฐ€ ๋ชจ๋‘ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— distacne๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$d(s2,s4)=\frac{(2-0)}{2}=1$


This is how simple matching gives a distance in a straightforward way.
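The simple matching calculation above can be sketched in Python. The student records below are hypothetical values chosen only to be consistent with the text (students 2 and 3 share a blood type but differ in hair color, students 2 and 4 differ on both), and the function name `simple_matching_distance` is an illustrative assumption.

```python
def simple_matching_distance(a, b):
    """Return (p - m) / p for two equal-length sequences of nominal values."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of features")
    p = len(a)                             # total number of features
    m = sum(x == y for x, y in zip(a, b))  # number of features with equal values
    return (p - m) / p


# Hypothetical records in the form (Blood Type, Hair Color)
students = {
    2: ("B", "Brown"),
    3: ("B", "Black"),
    4: ("A", "Black"),
}

d_23 = simple_matching_distance(students[2], students[3])  # one mismatch out of two
d_24 = simple_matching_distance(students[2], students[4])  # both features differ
print(d_23, d_24)
```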

  • Use a large number of binary attributes
    • This method creates a new binary attribute for each nominal state. In other words, each feature given in categorical form is converted into binary form. Applying this to student 1 and student 2 from the example above gives the following. Blood Type A is encoded as 0 and B as 1, and Hair Color Black is encoded as 1 and Brown as 0.

👉 After that, the distance is computed in the same way as simple matching.

$d(i,j)=\frac{(p-m)}{p}$
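A minimal sketch of the encode-then-match idea, assuming the 0/1 coding stated in the text (A → 0, B → 1, Black → 1, Brown → 0). The two student records and the names `ENCODING` and `encode` are hypothetical and used only for illustration.

```python
# Coding from the text: Blood Type A -> 0, B -> 1; Hair Color Black -> 1, Brown -> 0
ENCODING = {
    "blood_type": {"A": 0, "B": 1},
    "hair_color": {"Black": 1, "Brown": 0},
}


def encode(record):
    """Map each categorical value of a record to its 0/1 code."""
    return [ENCODING[feature][value] for feature, value in record.items()]


def simple_matching_distance(a, b):
    """Return (p - m) / p for two equal-length sequences."""
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))
    return (p - m) / p


# Hypothetical records (the original table is not reproduced here)
student_1 = {"blood_type": "A", "hair_color": "Black"}
student_2 = {"blood_type": "B", "hair_color": "Brown"}

encoded_1 = encode(student_1)
encoded_2 = encode(student_2)
d_12 = simple_matching_distance(encoded_1, encoded_2)
print(encoded_1, encoded_2, d_12)
```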



2. Categorical Attributes - Ordinal

Unlike the nominal data covered above, ordinal data has an inherent order.

🧩 To compute distances for this kind of data, each ordinal variable is replaced by its rank, which is determined as follows.

$\text{feature}\;f,\;\text{index}\;i,\;\;r_{if}\in\{1,2,\ldots,M_{if}\},\;\;r_{if}=\text{rank of the value},\;\;M_{if}=\text{number of ordered states}$


$Z_{if}=\frac{r_{if}-1}{M_{if}-1}$


👉 The formula looks complicated, but it is easiest to think of it as simply assigning ranks. Let's look at an example.

For freshman 1 / sophomore 2 / junior 3 / senior 4, the respective $Z$ values are
$Z_{if}=0\;\;/\;\;\frac{1}{3}\;\;/\;\;\frac{2}{3}\;\;/\;\;1$.

The distance is then computed from these $Z$ values by simple subtraction.

$d(freshman,senior) = 1-0=1$


$d(junior,senior) = 1-\frac{2}{3}=\frac{1}{3}$
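The rank normalization and subtraction above can be sketched as follows. The names `LEVELS`, `normalized_rank`, and `ordinal_distance` are illustrative assumptions, and the absolute value is added here so that the argument order does not matter (the text only says subtraction). `Fraction` is used to keep the thirds exact.

```python
from fractions import Fraction

# Ordered states from the example: rank 1 = freshman, ..., rank 4 = senior
LEVELS = ["freshman", "sophomore", "junior", "senior"]


def normalized_rank(value, levels=LEVELS):
    """Return Z = (r - 1) / (M - 1), where r is the 1-based rank of value."""
    r = levels.index(value) + 1
    m_states = len(levels)
    return Fraction(r - 1, m_states - 1)


def ordinal_distance(a, b, levels=LEVELS):
    """Distance between two ordinal values as the difference of their Z values."""
    return abs(normalized_rank(a, levels) - normalized_rank(b, levels))


z_values = [normalized_rank(level) for level in LEVELS]
d_fs = ordinal_distance("freshman", "senior")
d_js = ordinal_distance("junior", "senior")
print(z_values, d_fs, d_js)
```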

3. Binary Attributes - 0/1

  • Because binary attributes take the value 0 or 1, they can be summarized into a single table. This table is called a contingency table, and it looks like the following.

📝 Contingency Table

👉 Features on which both objects are 1 are counted in q, both 0 in t, (i,j) = (1,0) in s, and (0,1) in r. As you might expect, distance is computed mainly from s and r, while similarity is computed from q and t.

⭐⭐ There is one point that must be considered with the contingency table. Data represented as binary has to split cleanly into two cases. Such cases are not that common, and the domain where they mainly appear is deciding whether a disease test is positive or negative, for example whether a COVID-19 test result is positive (1) or negative (0). In disease-related studies, the case we care about is the positive one, not the negative one. Yet two subjects are naturally far more likely to both be negative (t) than both be positive (q), so t will be much larger than q in the contingency table above. This asymmetry has to be taken into account when the table is asymmetric in this way.⭐⭐

👉 Now let's compute the distance for each case.

🧩 Distance measure for symmetric binary variables

$d(i,j)=\frac{r+s}{q+r+s+t}$


🧩⭐Distance measure for asymmetric binary variables⭐

$d(i,j)=\frac{r+s}{q+r+s}$


🧩⭐Similarity measure for asymmetric binary variables⭐

$Jaccard\;\,coefficient=Sim_{jaccard}(i,j)=\frac{q}{q+r+s}$
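The three measures above can be sketched in Python. The counting follows the convention stated in the text (q: both 1, t: both 0, s: (i,j) = (1,0), r: (i,j) = (0,1)). The function names and the two 0/1 vectors `x` and `y` are hypothetical, chosen only to exercise every cell of the table.

```python
def contingency_counts(i, j):
    """Count (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # both 1
    r = sum(a == 0 and b == 1 for a, b in zip(i, j))  # (i, j) = (0, 1)
    s = sum(a == 1 and b == 0 for a, b in zip(i, j))  # (i, j) = (1, 0)
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # both 0
    return q, r, s, t


def symmetric_distance(q, r, s, t):
    """Mismatches over all features."""
    return (r + s) / (q + r + s + t)


def asymmetric_distance(q, r, s, t):
    """Mismatches over features where at least one object is 1 (t is ignored)."""
    return (r + s) / (q + r + s)


def jaccard_similarity(q, r, s, t):
    """Joint positives over features where at least one object is 1."""
    return q / (q + r + s)


# Hypothetical 0/1 vectors
x = [1, 0, 1, 0, 0, 0]
y = [1, 1, 0, 0, 0, 1]
counts = contingency_counts(x, y)
print(counts, symmetric_distance(*counts),
      asymmetric_distance(*counts), jaccard_similarity(*counts))
```

Since both use the denominator $q+r+s$, the asymmetric distance and the Jaccard coefficient always sum to 1.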


🧩 Let's look at an example!

👉 The data consists of 3 objects with 7 features related to some disease. Because gender is a symmetric attribute, it is excluded from the distance calculation. In the test results, P is treated as 1 and N as 0. The contingency table built on this basis is shown below.

🧩 Let's compute the distances using the formula above.

$d(i,j)=\frac{r+s}{q+r+s}$


$d(jack,jim)=\frac{1+1}{1+1+1}=0.67$


$d(jack,mary)=\frac{0+1}{2+0+1}=0.33$


$d(jim,mary)=\frac{1+2}{1+1+2}=0.75$
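As a quick check, the three results above can be reproduced by plugging the (q, r, s) counts from the worked formulas into the asymmetric distance. The counts are copied from the formulas in the text, while the dictionary layout and the function name are illustrative assumptions.

```python
def asymmetric_distance(q, r, s):
    """Asymmetric binary distance (r + s) / (q + r + s)."""
    return (r + s) / (q + r + s)


# (q, r, s) for each pair, as they appear in the worked formulas
pair_counts = {
    ("jack", "jim"): (1, 1, 1),
    ("jack", "mary"): (2, 0, 1),
    ("jim", "mary"): (1, 1, 2),
}

distances = {
    pair: round(asymmetric_distance(*counts), 2)
    for pair, counts in pair_counts.items()
}
closest_pair = min(distances, key=distances.get)
print(distances, closest_pair)
```

By this measure Jack and Mary are the most similar pair, and Jim and Mary the least.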


🧩 That covers distance measures for binary data as well. There are points to keep in mind, and the formula changes slightly from case to case, but as long as you remember that distance is computed from the mismatches and similarity from the matches, the concept is not that difficult.


🧩 In this post we looked at distance measures for categorical data. There are many kinds, and how they are applied depends on the domain of the data, but working through the examples above should be enough to get through the material fairly smoothly😊. Because this is the foundation for what comes next, I covered it in some detail, and I hope the explanation was sufficient. In the next post, we will compute distances for numerical data🏃‍♂️🏃‍♂️.


💡This post is based on the lecture [Data Mining for Bioinformatics] by Professor 고윤희 of the Division of Biomedical Engineering, Hankuk University of Foreign Studies.
