All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].<Solution>
這題是要在一個 DNA 字串中,找出重複的 DNA 組成
想法如下
- 總共有四個字元,分別編碼 A=0、C=1、G=2、T=3,也就是用 2 bits 來表示四個不同的值
- 先將前面 9 個字元組成一個基底 DNA 字串
- 從第10個字元開始,先從上一次的組成取出後 18位元 (和 0x3ffff 做 and),然後和目前的字元組成一個 10 字元的 DNA 字串
- 檢查目前最新組成的 DNA 字串,是否已經存在 hash map 裡面了,如果有的話,就放到答案裡。這邊要注意一點,不論同一個 DNA 字串重複幾次,都只放一次到答案裡面
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
class Solution { | |
public: | |
vector<string> findRepeatedDnaSequences(string s) { | |
unordered_map<int, int> dnaMap; | |
unordered_map<char, int> dnaTable = {{'A',0}, {'C',1}, {'G',2}, {'T',3}}; | |
vector<string> ans; | |
int code = 0; | |
//>> get first 9-letter-long dna sequence | |
for(int i = 0; i < 9; i++) { | |
code = (code << 2) | dnaTable[s[i]]; | |
} | |
for(int i = 9; i < s.length(); i++) { | |
//>> get 18-bits of previous code and combine with new dna char | |
code = ((code & 0x3ffff) << 2) | dnaTable[s[i]]; | |
if(dnaMap.find(code) != dnaMap.end() && dnaMap[code] == 1) { | |
ans.push_back(s.substr(i-9, 10)); | |
} | |
++dnaMap[code]; | |
} | |
return ans; | |
} | |
}; |
沒有留言:
張貼留言