PROGRAMMING ASSIGNMENT #1

Cpt S 471/571, Spring 2021

Due:  February 19, 2021 (Friday), by 11:59pm Pacific Time @Blackboard (WSU) Assignments dropbox


General Guidelines:


Assignment:

Implement two algorithms:
i) Needleman-Wunsch algorithm for computing OPTIMAL GLOBAL ALIGNMENT, and
ii) Smith-Waterman algorithm for computing OPTIMAL LOCAL ALIGNMENT,

both using affine gap penalty function between two input DNA sequences, s1 and s2, of lengths m and n respectively. 

Each cell of your Dynamic Programming table ("DP table") should have the following structure:

    struct DP_cell {
        int Sscore;      
// field for Substitution (S) score       
        int Dscore;     
// field for Deletion (D) score
        int Iscore;        // field for Insertion (I) score 
        ... // add any other field(s) that you may need for the implementation
    }

At the start of the program, you should read the alignment score parameters from a user-specified input file (optional). The default name of the file, if the user does not specify one, should be "parameters.config" in the present working directory. The parameters.config file should allow the user to specify one scoring parameter in each line (space or tab delimited). For example:

    match    1
    mismatch    -2
    h  -5 
    g -1

The command prompt usage for your program should look as follows:

    $ <executable name> <input file containing both s1 and s2> <0: global, 1: local> <optional: path to parameters config file>

Input File Formats:

The two input sequences should be given as input in one text file. The text file should be in what is called the "multi-sequence FASTA format", which is as follows:

For example:

An input file called "input.fasta" could look like:

    >s1 sequence
    acatgctacacgtactccgataccccgtaaccgataacgatacacagacct
    cgtacgcttgctacaacgtactctataaccgagaacgattgaca
    tgcctcgtacacatgctacacgtactccgatgaccccgt

    >s2 sequence
    acattctacgaacctctcgataaccccataaccgataacgattgacacctcgt
    acgctttctacaacttactctctcgataaccccataaccgataacgattgacacctc
    gtacacatggtacatacgtactctcgataccccgt

At completion, the program should output/print the following information:

All output should be to the standard output.


  • Example Input/Output
  • As shown above, the alignment is wrapped after every 60 alignment positions, and on both sides the starting and ending indices of the aligning positions in the respective strings should be displayed. Then, a pipe symbol "|" is shown in columns wherever there is a match. (Other columns contain a whitespace character).

    After implementing and testing, run your program on the following two input files,  redirect the standard output into a global alignment and a local alignment output text files and attach them as part of your submission:

    Opsin1_colorblindness_gene.fasta:  Contains two sequences - Opsin1 gene in human vs. Opsin1 gene in mouse (this is one of the genes responsible for colorblindness)

    Human-Mouse-BRCA2-cds.fasta: Contains two sequences - BRCA2 gene in human vs. BRCA2 gene in mouse (this is one of the breast cancer genes)

    You are welcome to create your own additional test cases using small genes from the NCBI GenBank data. Look for short sequences (<10Kbp per sequence file) for this exercise. You need to download the FASTA formatted files.

    Additional references: NCBI GenBank
  • CHECKLIST FOR SUBMISSION:

            ___  source code

            ___  output files for the above two real world inputs

            ___  All the above zipped as a folder into an archive .zip