Had a little programming homework for "Biological Databases" this week.
The task was:
- Count the quantities of all 5-letter comninations in the human chromosome 21.
There are 5 letters: A T G C and N
Text will be as following: NNNNNNNATAGCGATCGTANNNNNNGCTAGGCNNNNNNNN........
For example: NNNNNNTG has "NNNNN" 2 times"NNNNT" 1 time and "NNNTG" 1 time
The input text is around 800000 lines with mostly 80 letters per line (the last one may have fewer), so the speed is critical here!
My first version would have taken
1,5 DAYS to do this!
This version here < 1 hour. Of course if this would be a marked homework which I must turn in I would seek some ways to do this even faster + improve my code further as it lacks annotation now + has english and german parts. Also the quick-sort algorithm is just a modified copy-pasted google result uptated to sort a pair of data and not just a simgle entry).
So what exactly does the java code?
It makes an array of Strings with all possible combinations (3125) of A,T,G,C,N and an equaly sized integer array with the counter to the 5 letter sequence initialised, with 0 of course.
Then a BufferedReader reads in the first line of the input text file into an input string.
The substring of the first 5 letters is compared to the string array and upon a match the counter is raised by 1.
This could be optimised by stopping the search for a match when a match is indeed found. Also "NNNNN" could be the first sequence to compare as it is by far the most common.
After raising the counter the first letter of the input string is deleted as it can go now.
When the input string has fewer than 5 letters the next line is added to the input string.
Aslo I print out a little status (in german
) in which line the program currently is.
The loop continues until there is no more line to ad in the input text file.
After that I sort the result arrays with the modified copied quick-sort algorithm (that is also in german btw
) and the sorted results are beeing written into an ouput text file "QuantityOfWords" by a BufferedWriter.
Simon mentioned a more robust way by only reading in single characters and count + delete when the string has reached 5 characters.
I attached the java code here + the output file for the whole human chromosom 21 (to be Opened with a text editor forgot the ending
).
The whole chromosome sequence is FASTA format would be ~ 45 MB so you need to find that on your own if you want it
.
Again this code is
FAR FROM OPTIMAL and contains poor writing + mixed language. If this was sth that is marked or I must turn in there would be quite some work still to do