序列比对算法使用一组字符而不是一个字符

我从关于对齐算法的一些细节开始,最后,我问我的问题。 如果你知道对齐算法通过开头。

考虑我们有两个字符串:

ACCGAATCGA ACCGGTATTAAC 

有一些算法如: Smith-Waterman或Needleman-Wunsch ,它们对齐这两个序列并创建一个矩阵。 看看以下部分的结果:

 Smith-Waterman Matrix § § ACCGAATCGA § 0 0 0 0 0 0 0 0 0 0 0 A 0 4 0 0 0 4 4 0 0 0 4 C 0 0 13 9 4 0 4 3 9 4 0 C 0 0 9 22 17 12 7 3 12 7 4 G 0 0 4 17 28 23 18 13 8 18 13 G 0 0 0 12 23 28 23 18 13 14 18 T 0 0 0 7 18 23 28 28 23 18 14 A 0 4 0 2 13 22 27 28 28 23 22 T 0 0 3 0 8 17 22 32 27 26 23 T 0 0 0 2 3 12 17 27 31 26 26 A 0 4 0 0 2 7 16 22 27 31 30 A 0 4 4 0 0 6 11 17 22 27 35 C 0 0 13 13 8 3 6 12 26 22 30 Optimal Alignments ACCGA - ATCGAACCGGAATTAA 

我的问题很简单,但也许答案并不容易。 我想使用一组字符作为单个字符,如: [A0][C0][A1][B1] 。 但在这些算法中,我们必须使用单个字符。 我们怎样才能做到这一点?

PS考虑我们有这个序列: #read #write #add #write 。 然后我把它转换成类似的东西:#read到A …. #write到B …. #add to C.然后我的序列变为: ABCB 。 但是我有许多以#开头的不同词汇。 并且ASCII表不足以转换所有这些。 然后我需要更多角色。 唯一的方法是为每个单词使用[A0] ... [Z9]类的东西。 或者使用数字。

PS:Smith-Waterman的一些示例代码存在于此链接中

PS:还有另外一篇文章需要这样的东西,但我想要的是不同的。 在这个问题中,我们有一组以[并以]结尾的字符开头。 并且不需要像ee那样使用语义等于i

想象一下,我们有一个包含字母序列的日志文件。 就像你说的那样,我将序列转换为A0A1... 例如,如果存在#read #write #add #write类的序列,则转换为A0A1A2A1 。 每次,我读两个字符并比较它们,但保持得分矩阵像以前一样。 这是我在C#中用于smith-waterman字符串对齐的代码。
请注意, Cell是用户定义的类。

 private void alignment() { string strSeq1; string strSeq2; string strTemp1; string strTemp2; scoreMatrix = new int[Log.Length, Log.Length]; // Lists That Holds Alignments List SeqAlign1 = new List(); List SeqAlign2 = new List(); for (int i = 0; i= intUp_Score) { if (intDiagonal_score>= intLeft_Score) { Temp = new Cell(j, i, intDiagonal_score, Matrix[j - 1, i - 1], Cell.PrevcellType.Diagonal); } else { Temp = new Cell(j, i, intDiagonal_score, Matrix[j , i - 1], Cell.PrevcellType.Left); } } else { if (intUp_Score >= intLeft_Score) { Temp = new Cell(j, i, intDiagonal_score, Matrix[j - 1, i], Cell.PrevcellType.Above); } else { Temp = new Cell(j, i, intDiagonal_score, Matrix[j , i - 1], Cell.PrevcellType.Left); } } if (MaxScore.CellScore <= Temp.CellScore) { MaxScore = Temp; } return Temp; } public static void Traceback_Step(Cell[,] Matrix, string Sq1, string Sq2, List Seq1, List Seq2) { intMaxScore = MaxScore.CellScore; while (MaxScore.CellPointer != null) { if (MaxScore.Type == Cell.PrevcellType.Diagonal) { Seq1.Add(Sq1[MaxScore.CellColumn * 2 + 1]); //Changed: All of the following lines with *2 and +1 Seq1.Add(Sq1[MaxScore.CellColumn * 2]); Seq2.Add(Sq2[MaxScore.CellRow * 2 + 1]); Seq2.Add(Sq2[MaxScore.CellRow * 2]); } if (MaxScore.Type == Cell.PrevcellType.Left) { Seq1.Add(Sq1[MaxScore.CellColumn * 2 + 1]); Seq1.Add(Sq1[MaxScore.CellColumn * 2]); Seq2.Add('-'); } if (MaxScore.Type == Cell.PrevcellType.Above) { Seq1.Add('-'); Seq2.Add(Sq2[MaxScore.CellRow * 2 + 1]); Seq2.Add(Sq2[MaxScore.CellRow * 2]); } MaxScore = MaxScore.CellPointer; } } } 

我修改了Smith-Waterman和Needleman-Wunsch算法的Python实现 (GPL版本3许可),以支持具有多个字符组的序列:

 #This software is a free software. Thus, it is licensed under GNU General Public License. #Python implementation to Smith-Waterman Algorithm for Homework 1 of Bioinformatics class. #Forrest Bao, Sept. 26   # zeros() was origianlly from NumPy. # This version is implemented by alevchuk 2011-04-10 def zeros(shape): retval = [] for x in range(shape[0]): retval.append([]) for y in range(shape[1]): retval[-1].append(0) return retval match_award = 10 mismatch_penalty = -5 gap_penalty = -5 # both for opening and extanding gap = '----' # should be as long as your group of characters space = ' ' # should be as long as your group of characters def match_score(alpha, beta): if alpha == beta: return match_award elif alpha == gap or beta == gap: return gap_penalty else: return mismatch_penalty def finalize(align1, align2): align1 = align1[::-1] #reverse sequence 1 align2 = align2[::-1] #reverse sequence 2 i,j = 0,0 #calcuate identity, score and aligned sequeces symbol = [] found = 0 score = 0 identity = 0 for i in range(0,len(align1)): # if two AAs are the same, then output the letter if align1[i] == align2[i]: symbol.append(align1[i]) identity = identity + 1 score += match_score(align1[i], align2[i]) # if they are not identical and none of them is gap elif align1[i] != align2[i] and align1[i] != gap and align2[i] != gap: score += match_score(align1[i], align2[i]) symbol.append(space) found = 0 #if one of them is a gap, output a space elif align1[i] == gap or align2[i] == gap: symbol.append(space) score += gap_penalty identity = float(identity) / len(align1) * 100 print 'Identity =', "%3.3f" % identity, 'percent' print 'Score =', score print ''.join(align1) # print ''.join(symbol) print ''.join(align2) def needle(seq1, seq2): m, n = len(seq1), len(seq2) # length of two sequences # Generate DP table and traceback path pointer matrix score = zeros((m+1, n+1)) # the DP table # Calculate DP table for i in range(0, m + 1): score[i][0] = gap_penalty * i for j in range(0, n + 1): score[0][j] = gap_penalty * j for i in range(1, m + 1): for j in range(1, n + 1): match = score[i - 1][j - 1] + match_score(seq1[i-1], seq2[j-1]) delete = score[i - 1][j] + gap_penalty insert = score[i][j - 1] + gap_penalty score[i][j] = max(match, delete, insert) # Traceback and compute the alignment align1, align2 = [], [] i,j = m,n # start from the bottom right cell while i > 0 and j > 0: # end toching the top or the left edge score_current = score[i][j] score_diagonal = score[i-1][j-1] score_up = score[i][j-1] score_left = score[i-1][j] if score_current == score_diagonal + match_score(seq1[i-1], seq2[j-1]): align1.append(seq1[i-1]) align2.append(seq2[j-1]) i -= 1 j -= 1 elif score_current == score_left + gap_penalty: align1.append(seq1[i-1]) align2.append(gap) i -= 1 elif score_current == score_up + gap_penalty: align1.append(gap) align2.append(seq2[j-1]) j -= 1 # Finish tracing up to the top left cell while i > 0: align1.append(seq1[i-1]) align2.append(gap) i -= 1 while j > 0: align1.append(gap) align2.append(seq2[j-1]) j -= 1 finalize(align1, align2) def water(seq1, seq2): m, n = len(seq1), len(seq2) # length of two sequences # Generate DP table and traceback path pointer matrix score = zeros((m+1, n+1)) # the DP table pointer = zeros((m+1, n+1)) # to store the traceback path max_score = 0 # initial maximum score in DP table # Calculate DP table and mark pointers for i in range(1, m + 1): for j in range(1, n + 1): score_diagonal = score[i-1][j-1] + match_score(seq1[i-1], seq2[j-1]) score_up = score[i][j-1] + gap_penalty score_left = score[i-1][j] + gap_penalty score[i][j] = max(0,score_left, score_up, score_diagonal) if score[i][j] == 0: pointer[i][j] = 0 # 0 means end of the path if score[i][j] == score_left: pointer[i][j] = 1 # 1 means trace up if score[i][j] == score_up: pointer[i][j] = 2 # 2 means trace left if score[i][j] == score_diagonal: pointer[i][j] = 3 # 3 means trace diagonal if score[i][j] >= max_score: max_i = i max_j = j max_score = score[i][j]; align1, align2 = [], [] # initial sequences i,j = max_i,max_j # indices of path starting point #traceback, follow pointers while pointer[i][j] != 0: if pointer[i][j] == 3: align1.append(seq1[i-1]) align2.append(seq2[j-1]) i -= 1 j -= 1 elif pointer[i][j] == 2: align1.append(gap) align2.append(seq2[j-1]) j -= 1 elif pointer[i][j] == 1: align1.append(seq1[i-1]) align2.append(gap) i -= 1 finalize(align1, align2) 

如果我们使用以下输入运行它:

 seq1 = ['[A0]', '[C0]', '[A1]', '[B1]'] seq2 = ['[A0]', '[A1]', '[B1]', '[C1]'] print "Needleman-Wunsch" needle(seq1, seq2) print print "Smith-Waterman" water(seq1, seq2) 

我们得到这个输出:

 Needleman-Wunsch Identity = 60.000 percent Score = 20 [A0][C0][A1][B1]---- [A0]----[A1][B1][C1] Smith-Waterman Identity = 75.000 percent Score = 25 [A0][C0][A1][B1] [A0]----[A1][B1] 

对于我所做的具体更改,请参阅: 此GitHub存储库 。