💊 In the last post, we looked at why QSAR models are used and at the detailed steps involved. To recap, the goal of a QSAR model is to work through several stages to find a drug that binds to the desired target, and then to predict that drug's activity.

💊 As anyone who has studied machine learning or deep learning will know, deciding whether to use a model you have built requires a step where that model is evaluated. That is what this post is about 🙃.


6. Model validation

  • Fully optimizing a model on the training dataset does not guarantee that it will predict accurately for completely new molecules.
  • In other words, how well a model fits the training dataset is not a good indicator of its prediction accuracy on new compounds.

  • For example, if we build an overly complex, high-order model in order to explain every point in the training dataset, it may describe the training dataset perfectly, but it will overfit and can no longer be expected to generalize to new data.
  • So a higher model complexity is not automatically better!!


👉 In the figure above, each point is a training data point, and the model complexity is increased step by step to explain the data. The Sixth Order panel uses a fairly complex model in order to explain every training point, but as explained above, such a model generalizes very poorly to new data, so it cannot be called a good model 😨😨. If anything, this is a case where overfitting is to be expected.
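To make the point concrete, here is a minimal sketch of my own (not from the lecture) that fits polynomials of order 1, 3, and 6 to seven noisy points. The "true" linear relationship, the noise level, and the helper names `fit_polynomial` and `mse` are all assumptions made up for illustration. The order-6 fit passes through every training point, so its training error is essentially zero, but that says nothing about how it behaves on the unseen points.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (V^T V) c = V^T y."""
    n = degree + 1
    rows = [[x ** p for p in range(n)] for x in xs]
    # Augmented normal-equation matrix [V^T V | V^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    preds = [sum(c * x ** p for p, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)

def true_fn(x):
    return 1.0 + 2.0 * x  # hypothetical underlying relationship

train_x = [-1 + 2 * i / 6 for i in range(7)]      # 7 training points
test_x = [-0.85 + 0.28 * i for i in range(7)]     # unseen points, same range
train_y = [true_fn(x) + rng.gauss(0, 0.3) for x in train_x]
test_y = [true_fn(x) + rng.gauss(0, 0.3) for x in test_x]

results = {}
for degree in (1, 3, 6):
    coeffs = fit_polynomial(train_x, train_y, degree)
    results[degree] = (mse(coeffs, train_x, train_y), mse(coeffs, test_x, test_y))
    print(f"order {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The training error can only go down as the order goes up, which is exactly why it is a poor guide to how the model will do on new compounds.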


7. External Test Sets and Cross Validation

  • The dataset is usually split into a Training set and a Test set. When there is enough data, the Test set is further split into a Validation set and a Test set, and both are used to validate the model.
  • In that case, the Validation set is used to optimize the model's hyperparameters.


  • As in the figure above, the most important point is that the model should be trained only up to the point where the error on the held-out set is at its minimum (the Validation set if there is one, so that the Test set stays untouched for the final evaluation).

  • The final model accuracy is computed on a Test set that is completely independent of the training set.
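The "train only until the held-out error stops improving" idea is usually called early stopping. Below is a small sketch under my own assumptions: the per-epoch validation errors are made-up numbers, and `best_epoch` and its `patience` parameter are hypothetical names, not something from the lecture.

```python
def best_epoch(val_errors, patience=3):
    """Return (epoch, error) at which early stopping would keep the model.

    Training stops once the validation error has failed to improve
    for `patience` consecutive epochs.
    """
    best_idx, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_idx, best_err = epoch, err
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_idx, best_err

# Hypothetical validation errors, one per training epoch.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.39]
print(best_epoch(val_errors))  # → (3, 0.4)
```

Note that the run stops before it ever sees the last value, so the choice of `patience` is a trade-off between wasted training time and stopping too early.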


8. Data Splitting

💊 There are several ways to split a dataset into train, validation, and test sets. The figure below should help.


  • Random : split the data at random.

  • Cluster-based (scaffold split) : cluster the original dataset by chemical similarity, then assign each resulting cluster to the training, validation, or test set.
    • Used to check prediction performance when the structure of the compounds changes.
    • Because the structures used for training and those used for evaluation are mostly completely different, prediction accuracy tends to be low.
  • Stratified : split so that the proportion of each class in every resulting set matches its proportion in the full dataset.

  • Temporal : split by time (year), e.g. Chembl20 (training) and Chembl21 (test).
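Two of the strategies above are easy to sketch in plain Python. This is only an illustration under my own assumptions: `stratified_split` and `cluster_split` are hypothetical helpers, and the cluster (scaffold) IDs are assumed to have been computed beforehand with a cheminformatics toolkit, which is not shown here.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) keeping class proportions the same in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def cluster_split(cluster_ids, test_fraction, seed=0):
    """Return (train_idx, test_idx) where no cluster appears on both sides."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)
    ordered = list(clusters.values())
    rng.shuffle(ordered)
    target = len(cluster_ids) * test_fraction
    train_idx, test_idx = [], []
    for members in ordered:
        # Whole clusters go to the test set until it reaches the target size.
        if len(test_idx) < target:
            test_idx.extend(members)
        else:
            train_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy data: 20 actives, 80 inactives, four scaffold clusters.
labels = ["active"] * 20 + ["inactive"] * 80
train, test = stratified_split(labels, test_fraction=0.2)
print(sum(labels[i] == "active" for i in test), "actives out of", len(test))

scaffolds = ["A"] * 40 + ["B"] * 30 + ["C"] * 20 + ["D"] * 10
c_train, c_test = cluster_split(scaffolds, test_fraction=0.2)
print({scaffolds[i] for i in c_train}, {scaffolds[i] for i in c_test})
```

Because whole clusters are moved at once, the cluster-based test set can end up larger than the requested fraction. That is the price for guaranteeing that no scaffold is shared between training and evaluation.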

9. Cross Validation

💊 Used when there is not enough data to fully separate a training set and a test set.

  • Leave-one-out : hold out a single object and use all the rest as the training set.
  • leave-cluster-out : similar to the scaffold split.

💊 In most cases, n-fold cross validation is used rather than the two methods above. The figure below explains it well.


  • n-fold cross validation : split the original data into n equal parts, then

    • the first model uses the first fold as the test set and trains on the remaining folds,
    • the second model uses the second fold as the test set and trains on the remaining folds, and so on.
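    • This produces n models, and the best-performing one is selected as the final model. (More commonly, the n scores are averaged into a single performance estimate and the final model is then retrained on all of the data.)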
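The fold bookkeeping can be sketched as follows. This is my own minimal version: `k_fold_indices` is a hypothetical helper, the activity values are invented, and the "model" is just a stand-in that predicts the mean activity of the training folds.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Hypothetical activity values for ten compounds.
activities = [5.1, 6.3, 4.8, 7.0, 5.5, 6.1, 4.9, 6.8, 5.7, 6.0]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(activities), n_folds=5):
    # Stand-in "model": predict the mean activity of the training folds.
    prediction = mean(activities[i] for i in train_idx)
    fold_errors.append(mean(abs(activities[i] - prediction) for i in test_idx))

cv_mae = mean(fold_errors)
print(f"per-fold MAE: {[round(e, 3) for e in fold_errors]}, mean: {cv_mae:.3f}")
```

Setting `n_folds` equal to the number of samples turns this into leave-one-out.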

10. Issues that affect model performance

💊 This section covers a few of the main issues that can affect the performance of a QSAR model while we are building it.


🚩 10.1. Data Size

  • The number of compounds should be neither too small nor too large.
  • The upper limit on data size depends on the computer and the time available for model building, but with recent advances in hardware and software this constraint has become minor.

  • The more compounds the dataset contains,
    • the more varied the subsets of compounds that can be selected.
    • the dataset can be clustered and a separate model built for each cluster.
    • in classification / category data, compounds that belong to a sparsely populated activity class or category can be left out of model development without causing problems.

🚩 10.2. Lower limit

  • As noted above, the upper limit on data size is not much of a problem, but when the dataset is too small, chance correlation and overfitting become a problem.

  • Continuous response variable (activity)
    • The training set should contain at least 20 compounds.
    • The test set and the external evaluation set should each contain roughly 10 compounds.
    • That is, the dataset needs at least 40 compounds in total (20 + 10 + 10).
  • Classification or category response variable
    • The training set should contain at least 10 compounds per class.
    • The test set and the external evaluation set should each contain at least 5 compounds per class.
    • That is, the dataset needs at least 20 compounds per class (10 + 5 + 5).

🚩 10.3. Activity Values

  • Continuous response variable
    • The total range of activities (maximum minus minimum) should be at least 5 times larger than the experimental error. In other words, the activities should be spread over a wide enough range.

    • No gap between two consecutive activity values (in sorted order) should exceed 10%~15% of the total range of activities. In other words, the values should be spread evenly, and the fewer outliers the better.

  • Classification or category QSAR
    • There should be at least 20 compounds per class.
    • The number of compounds should be roughly the same in every class or category.
      • e.g. positive : 100 / negative : 100
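These rules of thumb are simple enough to turn into a quick sanity check. The sketch below is my own: the function names are hypothetical, the 15% gap limit takes the upper end of the 10%~15% guideline, and `max_imbalance=1.5` is just my guess at what "roughly the same" could mean numerically.

```python
from collections import Counter

def check_continuous(activities, experimental_error, max_gap_fraction=0.15):
    """Apply the size, range, and gap rules of thumb to continuous activities."""
    values = sorted(activities)
    total_range = values[-1] - values[0]
    largest_gap = max(b - a for a, b in zip(values, values[1:]))
    return {
        "enough_compounds": len(values) >= 40,
        "range_ok": total_range >= 5 * experimental_error,
        "gaps_ok": largest_gap <= max_gap_fraction * total_range,
    }

def check_classes(labels, min_per_class=20, max_imbalance=1.5):
    """Apply the per-class size rule and an (assumed) balance threshold."""
    counts = Counter(labels)
    smallest, largest = min(counts.values()), max(counts.values())
    return {
        "enough_per_class": smallest >= min_per_class,
        "balanced": largest <= max_imbalance * smallest,
    }

# Hypothetical data: 40 evenly spaced activities, then the same plus an outlier.
even = [4.0 + 0.1 * i for i in range(40)]
print(check_continuous(even, experimental_error=0.5))
print(check_continuous(even + [12.0], experimental_error=0.5))
print(check_classes(["positive"] * 100 + ["negative"] * 100))
```

The second call shows how a single outlier breaks the gap rule even though the overall range looks comfortable.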

11. Assessing Model Performance

  • Ways to quantify model performance

  • Regression Problems

    • MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error)
    • Pearson correlation coefficient : measures linear correlation
    • Spearman Rank Correlation : measures agreement in rank order
  • Classification Problems

    • Classification Accuracy
    • Precision, Recall, F1 score
    • ROC Curve, AUC, PRC (precision-recall curve)
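For reference, most of these metrics fit in a few lines of plain Python. This is a sketch with hypothetical function names and toy inputs; in real work you would normally reach for a statistics or machine learning library rather than writing them by hand.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(x, y):
    """Pearson correlation coefficient (linear association)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A monotonic but non-linear relationship: Spearman is 1, Pearson is below 1.
x, y = [1, 2, 3, 4], [1, 4, 9, 16]
print(pearson(x, y), spearman(x, y))
print(precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

The first example shows why both correlation measures are listed: Spearman only cares whether the predicted ordering of compounds is right, while Pearson also penalizes non-linearity.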

12. Applicability domain

💊 A QSAR model can only give predictions for compounds that are represented by the structures in its training set. So we need a way to handle data that does not belong to the training set, that is, to judge how far a prediction for such a compound can be trusted.

  • Distance-to-model-based metrics
    • Quantify the distance between the compounds in the training set and the compounds in the test set.
  • Local error methods
  • Bagging methods
  • Sensitivity-based methods
    • Perturb the input descriptors of a test compound to see how the model's prediction is affected.
    • For example, one approach is to change the structure of the model slightly.
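As one possible reading of the distance-to-model idea, the sketch below checks whether a query compound has a sufficiently similar neighbor in the training set, using Tanimoto similarity (so distance is 1 minus similarity). Everything here is my own assumption for illustration: fingerprints are represented as plain sets of on-bit indices, the 0.5 threshold is arbitrary, and `in_domain` is a hypothetical helper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def in_domain(query_fp, training_fps, min_similarity=0.5):
    """Return (inside_domain, similarity_to_nearest_training_compound)."""
    nearest = max(tanimoto(query_fp, fp) for fp in training_fps)
    return nearest >= min_similarity, nearest

# Hypothetical fingerprints: each set holds the indices of the bits that are on.
training_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11, 12}]

print(in_domain({1, 2, 3, 5}, training_fps))    # → (True, 0.6)
print(in_domain({20, 21, 22}, training_fps))    # → (False, 0.0)
```

A prediction for the second query would be flagged as outside the applicability domain, since nothing in the training set resembles it.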

13. Predictive QSAR models

💊 The material covered so far can be summarized as follows.

Chemical Dataset
▶ Curated Dataset
▶ Descriptors Generation
▶ Split into Training, Test, External Validation sets
▶ Machine Learning Techniques
▶ Selection of Models with High Internal & External Accuracy
▶ Assessment of Applicability Domain
▶ Predictive QSAR models

The final QSAR model is chosen by following the pipeline above. Of its stages, Curated Dataset and Predictive QSAR models are covered briefly in this section.


🚩 13.1. Curated Dataset

💊 For machine learning, we have to convert molecular structures into descriptor vectors. Before that, let's look at the preceding process that turns the original Chemical Dataset into a Curated Dataset.

Chemical Dataset
▶ Removal of mixtures, inorganics, and organometallics
▶ Structure conversion. Cleaning / removal of salts
▶ Normalization of specific chemotypes
▶ Treatment of tautomeric forms
▶ Analysis / Removal of structural duplicates
▶ Manual Inspection
▶ Curated Dataset

💊 The molecule data we usually work with comes as SMILES, and the datasets we use are typically already in curated form. The data used with PaDEL Descriptor later on has already been through this whole process, so a rough grasp of the concepts is enough for now.
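Just to give a feel for one of the curation steps, here is a rough sketch of the "analysis / removal of structural duplicates" stage. It rests on several assumptions of mine: the structure keys are taken to be already-canonical identifiers produced elsewhere, duplicates whose activities agree within a tolerance are averaged, and conflicting duplicates are dropped. The `merge_duplicates` helper, the tolerance of 0.5, and the records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def merge_duplicates(records, tolerance=0.5):
    """Collapse duplicate structures.

    records: iterable of (structure_key, activity) pairs, where equal keys
    are assumed to denote the same structure.
    Returns (curated, discarded): averaged activities per structure, and the
    keys whose duplicate measurements disagree by more than `tolerance`.
    """
    grouped = defaultdict(list)
    for key, activity in records:
        grouped[key].append(activity)
    curated, discarded = {}, []
    for key, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            curated[key] = mean(values)
        else:
            discarded.append(key)  # conflicting measurements: needs manual review
    return curated, discarded

# Hypothetical records: (canonical structure key, measured activity).
records = [
    ("CCO", 5.0), ("CCO", 5.25),
    ("c1ccccc1", 6.1),
    ("CC(=O)O", 4.0), ("CC(=O)O", 7.5),
]
curated, discarded = merge_duplicates(records)
print(curated)     # → {'CCO': 5.125, 'c1ccccc1': 6.1}
print(discarded)   # → ['CC(=O)O']
```

The discarded entries are the ones that would go to the manual inspection step rather than being thrown away silently.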


🚩 13.2. Predictive QSAR models

💊 This refers to the process of choosing the final model that comes out of QSAR. Think of it as a summary of the previous material and how it fits together 🙃.

  • We convert our dataset into a form that a machine learning model can take as input, then train and update the model. Next, we select the best-performing model as the final candidate and look for a suitable Applicability Domain method, so that data outside the training set can be handled, and evaluate the results. If performance still holds up at this final evaluation stage, that model is chosen as the final QSAR model.

💊 That covers how to discover drugs with QSAR. The last post went through the detailed steps for building a QSAR model. This post covered how to evaluate the resulting models to pick the best one, and the pipeline for applying it to a variety of data. If you understand the process up to this point, and above all the purpose of each step, I don't think anything in the upcoming posts will be confusing.

💊 I am studying from lectures, writing up my notes, and posting to the blog all at the same time, so it takes more time than I expected. After each post I move on to another lecture, write it up, and post again, so I can't help feeling the posts fall a little short in both time and content 😥😥. I haven't got much done, but with one thing and another there is now less than a month of vacation left. I'll have to push a bit harder 🏃‍♂️🏃‍♂️.


💡 This post is based on the QSAR model development lecture by Professor Dongsup Kim (김동섭) of KAIST, uploaded to LAIDD.
