1.1 Basic Algorithms for Biological Sequences

1.1.1. Checking whether a DNA sequence is valid


🧬 Create the validate_dna() function
🧬 If the given sequence dna_seq contains any symbol other than the bases A, T, G, and C, it is not a valid sequence: the function returns False

def validate_dna(dna_seq):
    # Upper-case first so the check is case-insensitive.
    seqm = dna_seq.upper()
    # Count the four valid bases; any other symbol leaves the total short.
    valid = seqm.count("A") + seqm.count("T") + seqm.count("G") + seqm.count("C")
    return valid == len(seqm)

print(validate_dna("atagagagatctcg"))
print(validate_dna("ATAGAXTAGAT")) 
>> 
True
False
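
The same check can also be written as a set comparison, which scans the sequence once rather than once per base. This is only a sketch and not the book's version: the name validate_dna_set is hypothetical, and rejecting an empty sequence is an assumption made here (validate_dna() above returns True for an empty string).

```python
def validate_dna_set(dna_seq):
    """Return True if dna_seq is non-empty and contains only A, C, G, T."""
    seq = dna_seq.upper()
    # Assumption: an empty string is not treated as a valid DNA sequence.
    return len(seq) > 0 and set(seq) <= set("ACGT")

print(validate_dna_set("atagagagatctcg"))  # True
print(validate_dna_set("ATAGAXTAGAT"))     # False
print(validate_dna_set(""))                # False
```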

1.1.2. Computing the frequency of each symbol in a sequence


🧬 Create the frequency() function
🧬 Start from an empty dictionary, count how often each symbol appears in the sequence, and store the counts as key/value pairs

def frequency(seq):
    dic = {}
    for s in seq.upper():
        # Increment an existing count, or start at 1 for a new symbol.
        dic[s] = dic.get(s, 0) + 1
    return dic

print(frequency("atggatcggtacagagcggatggtgaacgaacatatcatacggctagtactagctactga"))
print(frequency("MVVMKKSHHVLHSQSLIK"))
>> 
{'A': 19, 'T': 13, 'G': 17, 'C': 11}
{'M': 2, 'V': 3, 'K': 3, 'S': 3, 'H': 3, 'L': 2, 'Q': 1, 'I': 1}
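
The standard library's collections.Counter does the same bookkeeping. The sketch below is an alternative, not the book's code, and the name frequency_counter is hypothetical. It relies on dictionaries preserving insertion order (Python 3.7+), so symbols appear in order of first occurrence.

```python
from collections import Counter

def frequency_counter(seq):
    """Count each symbol in seq, ignoring case."""
    # Counter tallies every character; dict() converts it to a plain dictionary.
    return dict(Counter(seq.upper()))

print(frequency_counter("MVVMKKSHHVLHSQSLIK"))
# {'M': 2, 'V': 3, 'K': 3, 'S': 3, 'H': 3, 'L': 2, 'Q': 1, 'I': 1}
```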

1.1.3. Computing symbol frequencies for a user-supplied sequence


🧬 Read a sequence from the user, compute its frequencies with frequency(), and access the resulting keys and values
🧬 Sort the symbols from most to least frequent

seq_aa = input("Protein Sequence: ")
freq_aa = frequency(seq_aa)
list_f = sorted(freq_aa.items(), key = lambda x : x[1], reverse = True)

for (k, v) in list_f:
    print("Aminoacid:", k, ":", v)
>>
Protein Sequence: ATGGGATCGTAGTCGTACTAGCTAGCTGATGGTACTCGATAGTCTACGTAGCTAGTGGTACTGGATGGTACTCAGTAACAT
Aminoacid: T : 24
Aminoacid: G : 23
Aminoacid: A : 20
Aminoacid: C : 14
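
For testing without input(), the sorting step can be wrapped in a function. This is a sketch under assumptions: sorted_frequency is a hypothetical name, and Counter.most_common() is used in place of sorted() with a lambda. Symbols with equal counts stay in order of first appearance.

```python
from collections import Counter

def sorted_frequency(seq):
    """Return (symbol, count) pairs, most frequent first."""
    return Counter(seq.upper()).most_common()

for symbol, count in sorted_frequency("MVVMKKSHHVLHSQSLIK"):
    print(symbol, ":", count)
```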

1.1.4. Returning the percentage of G/C nucleotides in a DNA sequence


🧬 Create the gc_content() function
🧬 Increment gc_count whenever a symbol in the sequence is one of G, C, g, or c
🧬 Return the ratio of that count to the total length

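def gc_content(dna_seq):
    # Assumption: report 0.0 for an empty sequence instead of dividing by zero.
    if not dna_seq:
        return 0.0
    gc_count = 0
    for s in dna_seq:
        if s in "GCgc":
            gc_count += 1
    return gc_count / len(dna_seq)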

print(gc_content("atggatcggtacagagcggatggtgaacgaacatatcatacggctagtactagctactga"))
>> 
0.4666666666666667
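
Note that gc_content() returns a fraction between 0 and 1, not a percentage. If a percentage is wanted, a variant like the following could be used. It is a sketch, not the book's code: gc_percent is a hypothetical name, and returning 0.0 for an empty sequence is an assumption.

```python
def gc_percent(dna_seq):
    """Return the G/C share of dna_seq as a percentage (0 to 100)."""
    if not dna_seq:
        return 0.0  # Assumption: an empty sequence has 0% GC content.
    seq = dna_seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    return 100 * gc_count / len(seq)

seq = "atggatcggtacagagcggatggtgaacgaacatatcatacggctagtactagctactga"
print(round(gc_percent(seq), 2))  # 46.67
```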

1.1.5. Returning the GC percentage of non-overlapping subsequences of length k


🧬 Create the gc_content_subseq() function
🧬 Split the given sequence into consecutive chunks of a fixed size k
🧬 Compute gc_content() for each chunk and return the results

def gc_content_subseq(dna_seq, k = 3):
    res = []
    # Step by k so chunks do not overlap; a trailing chunk shorter than k is skipped.
    for i in range(0, len(dna_seq)-k+1, k):
        subseq = dna_seq[i:i+k]
        gc = gc_content(subseq)
        res.append(gc)
    return res

print(gc_content_subseq("atagataactcgcatagc"))
>> 
[0.0, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 0.3333333333333333, 0.6666666666666666]
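
When the sequence length is not a multiple of k, the loop above drops the leftover symbols at the end. The sketch below makes that choice explicit. It is an illustration rather than the book's code: gc_fraction, gc_content_chunks, and the keep_tail option are all hypothetical names introduced here.

```python
def gc_fraction(seq):
    """Fraction of G/C symbols in a non-empty seq (case-insensitive)."""
    return sum(s in "GCgc" for s in seq) / len(seq)

def gc_content_chunks(dna_seq, k=3, keep_tail=False):
    """GC fraction of each non-overlapping chunk of length k.

    keep_tail is a hypothetical option: when True, a final chunk
    shorter than k is scored as well instead of being dropped.
    """
    stop = len(dna_seq) if keep_tail else len(dna_seq) - k + 1
    return [gc_fraction(dna_seq[i:i + k]) for i in range(0, stop, k)]

# 18 symbols with k=4 leave a 2-symbol tail ("gc").
print(gc_content_chunks("atagataactcgcatagc", k=4))
# [0.25, 0.0, 0.75, 0.25]
print(gc_content_chunks("atagataactcgcatagc", k=4, keep_tail=True))
# [0.25, 0.0, 0.75, 0.25, 1.0]
```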

💡 These notes were compiled for personal study while working through Bioinformatics Algorithms (Acorn Publishing, 2020).
