chipD is a program that computes sets of oligonucleotide probes for genome-scale microarray applications. chipD can be used to design genome tiling arrays that are used for chromatin immuno-precipitation on a chip (ChIP-chip), or to design expression arrays that are used to monitor changes in transcript abundance.
This guide provides background information about the chipD algorithm and describes the functions of the optional input parameters.
For experiments such as ChIP-chip that require tiling arrays, two factors are particularly important. First, there should be no gaps in the sequence coverage; otherwise, critical information about particular genomic regions could be missed. Second, all probes need to have identical or nearly identical hybridization characteristics to yield consistent data. These two imperatives can be hard to reconcile in some regions, especially those containing sequences of unusual composition, such as stretches of identical bases or stable secondary structures.
The chipD program was created to obtain a chip design that offers complete and uniform sequence coverage of small genomes, such as bacterial and yeast genomes. Candidate probes are scored according to three criteria: melting temperature, number of targets in the genome, and sequence complexity. Then, instead of applying an arbitrary score threshold, the program ranks the probes by score and selects the best probes iteratively until complete coverage is achieved. In this way, no genomic region is left unrepresented, and each region is covered by the best available probe.
A contig refers to a single contiguous stretch of DNA, usually the entire sequence of a specific plasmid or chromosome. chipD reads one or more contigs from a FASTA file. Each contig is introduced in the FASTA file by its own header line, which must have a "greater than" symbol (>) as the first character.
>Contig1
GTCGTACGTAGAT...
To design probes for a tiling array, the full sequences of all the contigs should be used as input for chipD. The user may pre-process the sequences prior to submission to mask repeated sequences or regions irrelevant to the study.
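The FASTA layout described above can be checked before submission with a few lines of Python. This is a minimal sketch, not part of chipD: the function name read_fasta is hypothetical, and taking the first word after ">" as the contig identifier is an assumption.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {contig_id: sequence} from FASTA-formatted lines.

    Hypothetical helper: the identifier is assumed to be the first
    word after the '>' character on each header line.
    """
    contigs: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without an identifier")
            current_id = parts[0]
            if current_id in contigs:
                raise ValueError(f"duplicate contig identifier: {current_id}")
            contigs[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            # Sequence lines are concatenated and upper-cased.
            contigs[current_id] += line.upper()
    return contigs
```

For example, read_fasta(open("genome.fasta")) would return one entry per contig, where genome.fasta is a placeholder file name.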
To design probes for expression microarrays, the FASTA file should contain only the coding strand sequences of the targeted genes. Each gene sequence should be treated as a contig with a unique identifier preceded by the ">" symbol in the FASTA file.
>Locus1
ATGAGATACACAGT...
>Locus2
ATGATATGTCTGAT...
Due to memory limitations, the chipD server cannot handle sequence files larger than around 8 megabases. However, users can partition the sequences, submit the portions to the server as separate jobs, and concatenate the resulting lists of probes.
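One way to partition a large input under the size limit mentioned above is to group whole contigs into batches. The sketch below is only an illustration: partition_contigs and max_bases are hypothetical names, and the exact limit enforced by the server is not known beyond "around 8 megabases".

```python
from typing import Dict, List


def partition_contigs(contigs: Dict[str, str], max_bases: int) -> List[Dict[str, str]]:
    """Group whole contigs, in input order, into batches of at most max_bases."""
    batches: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    current_size = 0
    for name, seq in contigs.items():
        if len(seq) > max_bases:
            # Splitting a single contig is left to the user.
            raise ValueError(f"contig {name} alone exceeds {max_bases} bases")
        if current and current_size + len(seq) > max_bases:
            batches.append(current)
            current, current_size = {}, 0
        current[name] = seq
        current_size += len(seq)
    if current:
        batches.append(current)
    return batches
```

Each batch would be written to its own FASTA file and submitted separately. Because each submission is processed on its own, a probe that also matches a sequence in a different batch would presumably not be penalized for it, so closely related contigs are probably best kept in the same batch.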
The term ShortOligo will be used in this document to refer to a short oligomer of DNA consisting of 15 base pairs occurring in input sequences. The characteristics of the ShortOligos will be used to determine the overall score of the probes.
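As an illustration of the ShortOligo definition, the following sketch enumerates every 15-base window of a sequence and counts how often each one occurs across the input. The function names are hypothetical, and counting only exact matches on the given strand is a simplifying assumption; chipD's own target counting may differ.

```python
from collections import Counter
from typing import Dict, Iterator, Tuple

SHORT_OLIGO_LENGTH = 15  # length stated in this guide


def short_oligos(sequence: str, k: int = SHORT_OLIGO_LENGTH) -> Iterator[Tuple[int, str]]:
    """Yield (0-based start, k-mer) for every window of length k."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def count_short_oligos(contigs: Dict[str, str], k: int = SHORT_OLIGO_LENGTH) -> Counter:
    """Count exact occurrences of each k-mer over all contigs (given strand only)."""
    counts: Counter = Counter()
    for seq in contigs.values():
        for _, oligo in short_oligos(seq, k):
            counts[oligo] += 1
    return counts
```

A ShortOligo with a count greater than one occurs at several places in the input, which is the kind of information a "number of targets" criterion needs.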
The overall score for each ShortOligo is obtained by summing the following 2 parameters:
The term LongOligo will be used in this document to refer to any oligomer that is composed of multiple overlapping ShortOligos. LongOligos are used by the program to determine the set of candidate probes entering the final selection step.
At each position in the contigs, the program extracts the sequences within the range of permissible lengths (40 to 70 bases by default) and calculates their hybridization characteristics. Only the best LongOligo for each position is added to the list of candidate probes.
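The per-position search described above can be sketched as follows. Several things here are assumptions rather than chipD behavior: the scoring function is supplied by the caller, a lower score is treated as better (which is consistent with the sample output later in this guide, where probes close to the target Tm have scores near zero), and gc_tm is a generic %GC-based approximation used only as a stand-in for the melting temperature models chipD actually uses.

```python
import math
from typing import Callable, List, Tuple


def gc_tm(seq: str, na_molar: float = 0.10) -> float:
    """Rough %GC-based Tm estimate in degrees C (stand-in, not chipD's model)."""
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / len(seq)


def best_long_oligo_per_position(
    sequence: str,
    score: Callable[[str], float],
    min_len: int = 40,
    max_len: int = 70,
) -> List[Tuple[int, str, float]]:
    """For each 0-based start, keep the candidate with the lowest score."""
    best = []
    for start in range(len(sequence) - min_len + 1):
        # Near the end of the contig only the shorter lengths still fit.
        longest = min(max_len, len(sequence) - start)
        candidates = [sequence[start:start + n] for n in range(min_len, longest + 1)]
        winner = min(candidates, key=score)
        best.append((start, winner, score(winner)))
    return best
```

A caller might pass score=lambda s: abs(gc_tm(s) - target_tm), where target_tm is a hypothetical target melting temperature, to favor the length whose estimated Tm is closest to the target at each position.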
The overall score for each LongOligo is obtained by adding the following 4 parameters:
Once the best LongOligo has been determined for each position in the sequences, the list of candidate probes is ranked according to their scores and the iterative selection process begins. The best scoring probe is selected and the neighboring probes, according to the interval specifications, are removed from the list. The next best probe is selected and the process continues until the list is depleted. This process ensures that all regions of the contigs are represented by the best possible probes.
The interval is either specified by the user or calculated as the total length of the contigs divided by the maximum number of probes that can be synthesized on an array.
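The ranking and iterative selection steps can be sketched for a single contig as shown below. This is a simplified illustration under stated assumptions, not the chipD implementation: a lower score is treated as better, "neighboring" is taken to mean a start position closer than the interval to a selected probe, and the integer division in default_interval is a guess at how the interval is rounded.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    position: int   # start position on the contig
    sequence: str
    score: float    # assumed: lower is better


def default_interval(total_length: int, max_probes: int) -> int:
    """Interval derived from total contig length and array capacity."""
    return max(1, total_length // max_probes)


def select_probes(candidates: List[Candidate], interval: int) -> List[Candidate]:
    """Pick the best candidate, drop its neighbors, and repeat until none remain."""
    remaining = sorted(candidates, key=lambda c: c.score)
    selected: List[Candidate] = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        # Remove candidates that start closer than `interval` to the chosen probe.
        remaining = [c for c in remaining if abs(c.position - best.position) >= interval]
    return sorted(selected, key=lambda c: c.position)
```

Because selection continues until the candidate list is empty, every stretch of the contig ends up within one interval of a selected probe, whatever the scores in that stretch are.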
For the design of tiling arrays only, every other probe, ordered by its location on the sequences, is converted to its reverse complement. Both strands of the DNA are therefore represented by probes on the array.
For the design of expression arrays, no transformation is done, so the probes remain strand specific.
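The strand alternation for tiling arrays can be illustrated with a short sketch. The function names are hypothetical, and starting with the forward strand is an assumption based on the tiling example in this guide, where the first probe is labeled "+".

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def alternate_strands(probes: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Reverse-complement every other probe, ordered by position.

    probes: (position, forward-strand sequence) pairs.
    Returns (position, sequence, sense) with sense '+' or '-'.
    """
    result = []
    for index, (position, seq) in enumerate(sorted(probes)):
        if index % 2 == 0:
            result.append((position, seq, "+"))
        else:
            result.append((position, reverse_complement(seq), "-"))
    return result


# Check against the tiling example: the 41 bases starting at position 12
# of contig1 give the sequence reported for probe TESTCASE_R000001.
contig_start = "GAACTGTCGCCTCTTCCTGTCGGGACAATGGAGGATCGGCGGCATGGGATGGGTGCTGAT"
probe_r1 = reverse_complement(contig_start[11:52])
```

Here probe_r1 equals CCATCCCATGCCGCCGATCCTCCATTGTCCCGACAGGAAGA, the sequence listed for the second probe in the tiling example.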
Input FASTA file:
>contig1
GAACTGTCGCCTCTTCCTGTCGGGACAATGGAGGATCGGCGGCATGGGATGGGTGCTGAT
GAGCGAGCGCGAACTGAACCGCATCGAGATCCTGTCGAAGGTGCTCGATCGGAGGATGAC
GAGCCGCAACCCACGGCGCCGCCCAATGCAATCCGCGCCCGCCTCCATGCAACATAACTA
TCCTTATCCGTTCTGTCGGTGTAAGCGCAAAGTAGAATTGTCGCATCCAAGCAAAGTAAT
CAACTTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGCGGCAGGCCTAACACATGCAA
GTCGAGCGAAGTCTTCGGACTTAGCGGCGGACGGGTGAGTAACGCGTGGGAACGTGCCCT
GTAACTTGGCACATGGACAGAAAGACCTCGGGCGATGCCCGAGGCAGATGTGCGAAGGTT
CGACGTCAAGGACAGCGCTTCGGCGCTTT
Options:
Statistics output:
Estimating target melting temperature...
ESTIMATING Tm, Total Number bp: 449.0
contig 1 Number bp: 449 NumRandomSamps: 1000
Tm estimate for this contig: 81.07113891257697
FINAL Tm estimate using all contigs: 81.07113891257697
Finished........ Target melting temperature set to: 81.071
Tm minimum offset: 5.0
Tm minimum: 76.07113891257697
Probe statistics...
Probe Length:
mean: 50.029
stdDev: 4.618
Cv: 0.092
Probe Melting Temp:
mean: 80.619
stdDev: 2.159
Cv: 0.027
Probe Score:
mean: 25.011
stdDev: 51.312
Cv: 2.052
Number of probes used: 34
MAX Number of probes: 100
Percent Chip Utilized: 34.000
Probe list output:
PROBE_ID CHROMOSOME POSITION PROBE_SEQUENCE SENSE LENGTH TM SCORE
TESTCASE_F000000 contig1 1 GAACTGTCGCCTCTTCCTGTCGGGACAATGGAGGATCGGCGGCATGGGAT + 50 82.064 10.513
TESTCASE_R000001 contig1 12 CCATCCCATGCCGCCGATCCTCCATTGTCCCGACAGGAAGA - 41 81.838 20.625
TESTCASE_F000002 contig1 24 GACAATGGAGGATCGGCGGCATGGGATGGGTGCTGATGAGCGAGCGCGAACTGAA + 55 81.258 28.569
TESTCASE_R000003 contig1 35 ATCTCGATGCGGTTCAGTTCGCGCTCGCTCATCAGCACCCATCCCATGCCGCCGAT - 56 81.484 24.686
TESTCASE_F000004 contig1 46 GGGATGGGTGCTGATGAGCGAGCGCGAACTGAACCGCATCGAGATCCTGT + 50 82.200 5.547
TESTCASE_R000005 contig1 57 GAGCACCTTCGACAGGATCTCGATGCGGTTCAGTTCGCGCTCGCTCATCA - 50 81.185 0.013
TESTCASE_F000006 contig1 76 GAACCGCATCGAGATCCTGTCGAAGGTGCTCGATCGGAGGATGACGAGCC + 50 81.188 0.014
TESTCASE_R000007 contig1 86 GTGGGTTGCGGCTCGTCATCCTCCGATCGAGCACCTTCGACAGGATCTC - 49 81.841 3.037
TESTCASE_F000008 contig1 96 CGAAGGTGCTCGATCGGAGGATGACGAGCCGCAACCCACG + 40 82.603 15.004
TESTCASE_R000009 contig1 116 TTATGTTGCATGGAGGCGGGCGCGGATTGCATTGGGCGGCGCCGTGGGTTGCGGCTCGTCAT - 62 82.799 32.398
TESTCASE_F000010 contig1 130 CCCACGGCGCCGCCCAATGCAATCCGCGCCCGCCTCCATGCAACATAACTATCCTTA + 57 81.530 20.857
TESTCASE_R000011 contig1 141 CGGATAAGGATAGTTATGTTGCATGGAGGCGGGCGCGGATTGCATTGGGC - 50 80.374 9.024
TESTCASE_F000012 contig1 153 CCGCGCCCGCCTCCATGCAACATAACTATCCTTATCCGTTCTGTCGGTGT + 50 80.412 3.330
TESTCASE_R000013 contig1 164 TGCGCTTACACCGACAGAACGGATAAGGATAGTTATGTTGCATGGA - 46 76.132 28.396
TESTCASE_F000014 contig1 176 AACTATCCTTATCCGTTCTGTCGGTGTAAGCGCAAAGTAGAATTGTCGCA + 50 75.439 178.627
TESTCASE_R000015 contig1 187 TGCTTGGATGCGACAATTCTACTTTGCGCTTACACCGACAGAACGGA - 47 78.233 11.056
TESTCASE_F000016 contig1 197 CGGTGTAAGCGCAAAGTAGAATTGTCGCATCCAAGCAAAGT + 41 76.286 31.898
TESTCASE_R000017 contig1 216 AGCCAGGATCAAACTCTCAAGTTGATTACTTTGCTTGGATGCGACAATT - 49 74.652 265.546
TESTCASE_F000018 contig1 227 CCAAGCAAAGTAATCAACTTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGCGGCAGGCC + 62 77.509 24.689
TESTCASE_R000019 contig1 240 AGGCCTGCCGCCAGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTGA - 50 80.249 0.676
TESTCASE_F000020 contig1 255 GATCCTGGCTCAGAATGAACGCTGGCGGCAGGCCTAACACATGCAAGTCG + 50 81.147 0.006
TESTCASE_R000021 contig1 268 CGAAGACTTCGCTCGACTTGCATGTGTTAGGCCTGCCGCCAGCGTTCATT - 50 80.860 0.045
TESTCASE_F000022 contig1 284 AGGCCTAACACATGCAAGTCGAGCGAAGTCTTCGGACTTAGCGGCGGACG + 50 81.208 0.019
TESTCASE_R000023 contig1 294 GTTACTCACCCGTCCGCCGCTAAGTCCGAAGACTTCGCTCGACTTGCATG - 50 80.614 4.694
TESTCASE_F000024 contig1 304 GAGCGAAGTCTTCGGACTTAGCGGCGGACGGGTGAGTAACGCGTGGGAAC + 50 82.344 11.296
TESTCASE_R000025 contig1 314 TTACAGGGCACGTTCCCACGCGTTACTCACCCGTCCGCCGCTAAGTCCGAA - 51 82.752 21.169
TESTCASE_F000026 contig1 328 CGGACGGGTGAGTAACGCGTGGGAACGTGCCCTGTAACTTGGCACATGGA + 50 82.298 20.224
TESTCASE_R000027 contig1 339 AGGTCTTTCTGTCCATGTGCCAAGTTACAGGGCACGTTCCCACGCGTTAC - 50 79.722 14.643
TESTCASE_F000028 contig1 350 GAACGTGCCCTGTAACTTGGCACATGGACAGAAAGACCTCGGGCGATGCC + 50 81.067 7.887
TESTCASE_R000029 contig1 360 ATCTGCCTCGGGCATCGCCCGAGGTCTTTCTGTCCATGTGCCAAGTTACA - 50 80.971 12.661
TESTCASE_F000030 contig1 370 CACATGGACAGAAAGACCTCGGGCGATGCCCGAGGCAGATGTGCGAAGGTT + 51 81.678 16.956
TESTCASE_R000031 contig1 381 CTTGACGTCGAACCTTCGCACATCTGCCTCGGGCATCGCCCGAGGTCTTT - 50 82.276 14.103
TESTCASE_F000032 contig1 394 GATGCCCGAGGCAGATGTGCGAAGGTTCGACGTCAAGGACAGCGCTTCGG + 50 82.723 5.117
TESTCASE_R000033 contig1 405 AAGCGCCGAAGCGCTGTCCTTGACGTCGAACCTTCGCACATCTG - 44 82.102 7.063
The probe list is given as a tab-delimited text file with one line per probe. Each probe receives a unique ID that indicates which strand of the DNA it represents ('F' for the forward strand, 'R' for the reverse strand); column 5 (SENSE) also indicates the direction.
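A probe list in this format can be loaded with Python's standard csv module. The sketch below assumes the eight columns shown in the header line of the example output and a header row beginning with PROBE_ID; the Probe record and the function name are illustrative, not part of chipD.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Probe(NamedTuple):
    probe_id: str
    contig: str
    position: int
    sequence: str
    sense: str      # '+' or '-'
    length: int
    tm: float
    score: float


def read_probe_list(handle: Iterable[str]) -> List[Probe]:
    """Parse a tab-delimited chipD probe list, skipping the header row."""
    probes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "PROBE_ID":
            continue
        probe_id, contig, position, sequence, sense, length, tm, score = row
        probes.append(Probe(probe_id, contig, int(position), sequence,
                            sense, int(length), float(tm), float(score)))
    return probes


# Usage with one line taken from the tiling example above.
example = ("PROBE_ID\tCHROMOSOME\tPOSITION\tPROBE_SEQUENCE\tSENSE\tLENGTH\tTM\tSCORE\n"
           "TESTCASE_F000008\tcontig1\t96\t"
           "CGAAGGTGCTCGATCGGAGGATGACGAGCCGCAACCCACG\t+\t40\t82.603\t15.004\n")
probes = read_probe_list(io.StringIO(example))
```

When the server output is split over several submissions, the lists returned by read_probe_list can simply be concatenated.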
Input file:
>locus1
GAACTGTCGCCTCTTCCTGTCGGGACAATGGAGGATCGGCGGCATGGGATGGGTGCTGAT
TATGCAGATCAGACGACTCGAGCATCTGAGCTCAGGCAGTACTCAGAGGCATCTCATGAG
GACTTAGAGCGCAGAGGCGCGTCTATTAGCGAGACGGCAGATCTTATCTAGAGCGACTAT
TAGCAGACGGATCTTATATCGCGCGGGCGGCATTATATTATGCGATCATGCAGACTCAGC
>locus2
GAGCGAGCGCGAACTGAACCGCATCGAGATCCTGTCGAAGGTGCTCGATCGGAGGATGAC
GAGCCGCAACCCACGGCGCCGCCCAATGCAATCCGCGCCCGCCTCCATGCAACATAACTA
GTCAGCATCATCAGCAGCTATCATCATCATGCAGTCATCAGCGAGCAGTGACGCGTAGCG
>locus3
TCCTTATCCGTTCTGTCGGTGTAAGCGCAAAGTAGAATTGTCGCATCCAAGCAAAGTAAT
CATCGATGCATGCTGCTGATCGTACGTGCTCGATGCTAGCTGTGCTGATGATCGTAGCTG
ACTGATGCTAGCTGATGTCGCTGCTGATCGTAGCTGATGTGCTGACTGATCGTGATCGTA
>locus4
CAACTTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGCGGCAGGCCTAACACATGCAA
GTCGAGCGAAGTCTTCGGACTTAGCGGCGGACGGGTGAGTAACGCGTGGGAACGTGCCCT
GTAACTTGGCACATGGACAGAAAGACCTCGGGCGATGCCCGAGGCAGATGTGCGAAGGTT
CGACGTCAAGGACAGCGCTTCGGCGCTTT
Options:
Statistics output:
Estimating target melting temperature...
ESTIMATING Tm, Total Number bp: 809.0
FINAL Tm estimate using all contigs: 80.10455328555695
Finished........ Target melting temperature set to: 80.105
Tm minimum offset: 5.0
Tm minimum: 75.10455328555695
Number of replicates manually set to 1
Target number of probes per contig: 5
Using spacer offset of 0 to calculate the spacer for each contig.
Determining the spacer for each contig.......done!
Average spacer size: 40.25
Not reversing any probes. Don't need to do this for expression arrays.
Average number of probes per contig : 3.750
Max number of probes: 4
Min number of probes: 3
Transcripts with fewer than 6 probes:
locus1 240 bases) - spacer: 48 - number of probes: 4
locus2 180 bases) - spacer: 36 - number of probes: 4
locus3 180 bases) - spacer: 36 - number of probes: 3
locus4 209 bases) - spacer: 41 - number of probes: 4
All contigs have at least one probe
Calculating probe statistics.....
Finished.........................Wed May 05 14:30:47 CDT 2010
Probe statistics...
Probe Length:
mean: 50.200
stdDev: 3.331
Cv: 0.066
Probe Melting Temp:
mean: 80.191
stdDev: 1.276
Cv: 0.016
Probe Score:
mean: 6.197
stdDev: 9.048
Cv: 1.460
Number of replicates: 1
Number of unique probes used: 15
Number of total probes used: 15
MAX Number of probes: 20
Percent Chip Utilized: 75.000
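The summary values in this statistics block can be reproduced from the probe list. In the sketch below, the use of the population standard deviation (dividing by n) and the definition of Cv as stdDev divided by the mean are inferred from the reported numbers rather than taken from the chipD source; the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Mean, population standard deviation, and coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    std_dev = math.sqrt(variance)
    return {"mean": mean, "stdDev": std_dev, "Cv": std_dev / mean}


def percent_chip_utilized(probes_used: int, max_probes: int) -> float:
    """Share of the array capacity taken up by the selected probes."""
    return 100.0 * probes_used / max_probes


# LENGTH column of the 15 probes in the expression array example.
lengths = [50, 50, 50, 47, 50, 50, 50, 48, 50, 62, 50, 49, 47, 50, 50]
stats = summarize(lengths)
utilized = percent_chip_utilized(len(lengths), 20)
```

With these inputs the rounded results agree with the reported probe length statistics (mean 50.200, stdDev 3.331, Cv 0.066) and with the 75 percent chip utilization.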
Probe list output:
PROBE_ID CHROMOSOME POSITION PROBE_SEQUENCE SENSE LENGTH TM SCORE
TESTCASE_F000000 locus1 1 GAACTGTCGCCTCTTCCTGTCGGGACAATGGAGGATCGGCGGCATGGGAT + 50 82.064 13.366
TESTCASE_F000001 locus1 62 ATGCAGATCAGACGACTCGAGCATCTGAGCTCAGGCAGTACTCAGAGGCA + 50 79.382 0.522
TESTCASE_F000002 locus1 112 TCTCATGAGGACTTAGAGCGCAGAGGCGCGTCTATTAGCGAGACGGCAGA + 50 80.210 0.011
TESTCASE_F000003 locus1 166 ATCTAGAGCGACTATTAGCAGACGGATCTTATATCGCGCGGGCGGCA + 47 79.112 7.160
TESTCASE_F000004 locus4 12 TTTGATCCTGGCTCAGAATGAACGCTGGCGGCAGGCCTAACACATGCAAG + 50 80.076 0.001
TESTCASE_F000005 locus4 54 CATGCAAGTCGAGCGAAGTCTTCGGACTTAGCGGCGGACGGGTGAGTAAC + 50 80.614 4.745
TESTCASE_F000006 locus4 110 GAACGTGCCCTGTAACTTGGCACATGGACAGAAAGACCTCGGGCGATGCC + 50 81.067 8.813
TESTCASE_F000007 locus4 154 GATGCCCGAGGCAGATGTGCGAAGGTTCGACGTCAAGGACAGCGCTTC + 48 82.116 8.580
TESTCASE_F000008 locus2 14 CTGAACCGCATCGAGATCCTGTCGAAGGTGCTCGATCGGAGGATGACGAG + 50 80.164 0.004
TESTCASE_F000009 locus2 56 ATGACGAGCCGCAACCCACGGCGCCGCCCAATGCAATCCGCGCCCGCCTCCATGCAACATAA + 62 82.799 36.674
TESTCASE_F000010 locus2 94 CGCGCCCGCCTCCATGCAACATAACTAGTCAGCATCATCAGCAGCTATCA + 50 79.660 2.586
TESTCASE_F000011 locus2 131 TCAGCAGCTATCATCATCATGCAGTCATCAGCGAGCAGTGACGCGTAGC + 49 79.147 1.916
TESTCASE_F000012 locus3 7 TCCGTTCTGTCGGTGTAAGCGCAAAGTAGAATTGTCGCATCCAAGCA + 47 78.233 6.503
TESTCASE_F000013 locus3 71 TGCTGCTGATCGTACGTGCTCGATGCTAGCTGTGCTGATGATCGTAGCTG + 50 79.317 0.620
TESTCASE_F000014 locus3 117 GCTGACTGATGCTAGCTGATGTCGCTGCTGATCGTAGCTGATGTGCTGAC + 50 78.901 1.449
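The spacer values reported in the expression array statistics above are consistent with dividing each transcript length by the target number of probes per contig (5) and discarding the remainder. The sketch below reproduces them under that assumption; the rounding rule is inferred from the log, not confirmed from the chipD source, and spacer_for_contig is a hypothetical name.

```python
from typing import Dict


def spacer_for_contig(length: int, target_probes: int) -> int:
    """Spacer between probes on one contig (assumed: integer division)."""
    return length // target_probes


# Transcript lengths as reported in the statistics output.
transcript_lengths: Dict[str, int] = {
    "locus1": 240,
    "locus2": 180,
    "locus3": 180,
    "locus4": 209,
}
spacers = {name: spacer_for_contig(n, 5) for name, n in transcript_lengths.items()}
average_spacer = sum(spacers.values()) / len(spacers)
```

This gives spacers of 48, 36, 36 and 41 bases and an average of 40.25, matching the log.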
NCBI Bacterial Genomes: NCBI
Getting Started in Tiling Microarray Analysis, Liu XS, 2007, PLoS Comput Biol 3(10).
The design strategy, the initial implementation in Perl, and the actual use of a resulting chip design in a study on Rhodobacter sphaeroides are due to the efforts of Yann Dufour in collaboration with Tim Donohue [3]. The code was ported to JAVA by Andrew Tritt, who also added the expression array functionality. The server version of chipD was inspired by Julie Mitchell, who also provided invaluable guidance in its development. Improvements to the JAVA code and server scripts were done by Gary Wesenberg. Special thanks to Madeline Fisher for improving the use of language in this document and assisting in other aspects of web page design.
1. John SantaLucia, Jr. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics, Proc Natl Acad Sci U S A, 95, 1460-1465.
2. James G. Wetmur (1991) DNA Probes: Applications of the Principles of Nucleic Acid Hybridization, Critical Reviews in Biochemistry and Molecular Biology, 26, 227-259.
3. Yann S. Dufour, Robert Landick, Timothy J. Donohue (2008) Organization and Evolution of the Biological Response to Singlet Oxygen Stress, Journal of Molecular Biology, 383, 713-730.
4. Bolton ET, McCarthy BJ (1962) A general method for the isolation of RNA complementary to DNA, Proc Natl Acad Sci U S A, 48, 1390-1397.
5. Yann S. Dufour, Gary E. Wesenberg, Andrew J. Tritt, Jeremy D. Glasner, Nicole T. Perna, Julie C. Mitchell, Timothy J. Donohue (2010) chipD: a web tool to design oligonucleotide probes for high-density tiling arrays, Nucleic Acids Research, 38, W321-W325 (doi:10.1093/nar/gkq517).
Case 1A: chipD default settings: [Na+] = 0.10 M, [DNA excess] = 0.0001 M.
Model 1 (NN) is plotted as a red curve and Model 2 (%GC) as a blue curve. Points from Model 3 (Hybrid) are plotted as black circles.
Case 1B: Closer view, chipD default settings: [Na+] = 0.10 M, [DNA excess] = 0.0001 M.
Model 1 (NN) is plotted as a red curve and Model 2 (%GC) as a blue curve. Points from Model 3 (Hybrid) are plotted as black circles.
Case 2: Higher salt: [Na+] = 0.50 M, [DNA excess] = 0.0001 M.
Model 1 (NN) is plotted as a red curve and Model 2 (%GC) as a blue curve. Points from Model 3 (Hybrid) are plotted as black circles.
Case 3: Very high salt: [Na+] = 1.0 M, [DNA excess] = 0.0001 M.
Model 1 (NN) is plotted as a red curve and Model 2 (%GC) as a blue curve. Points from Model 3 (Hybrid) are plotted as black circles.