Category: bedops

Downloading BLAT

To get BLAT source code:

$ mkdir /tmp/blat && cd /tmp/blat
$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc.zip
$ unzip blatSrc.zip

Patching (optional)

I decided to make blat a static binary to avoid missing xyz.so shared library errors. Here’s a patch you can use to modify the blat makefile:

$ cat > static-blat-makefile.patch
3c3
< L += -lm $(SOCKETLIB)
---
> L += -lm -ldl $(SOCKETLIB) -static-libgcc
10c10
< ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O $(MYLIBS) $L
---
> ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O -static $(MYLIBS) $L

You may need static library packages installed on your system. The names of these packages will depend on your version of Linux.

Then apply the patch:

$ cd /tmp/blat/blatSrc/blat
$ cp makefile makefile.original
$ patch makefile.original -i ../../static-blat-makefile.patch -o makefile

You may decide not to apply this patch. You could probably skip this step. I just don’t like dynamic binaries.

Building BLAT

In any case, you will want to go to the top level of the blatSrc directory and run make to build the kit:

$ cd /tmp/blat/blatSrc && make

This will take a few minutes to build binaries. Grab some coffee or whatevs.

Installing BLAT

To install them into ${HOME}/bin/${MACHTYPE}, run:

$ make install

This destination is a subdirectory of your home directory.

Once it is built and installed, you can copy the binary to /usr/local/bin or somewhere in your shell’s PATH that makes sense to you. For me, my ${MACHTYPE} is x86_64 and I like having binaries in /usr/local/bin:

$ sudo cp ~areynolds/bin/x86_64/blat /usr/local/bin/blat

Adjust this to the particulars of your setup.

Downloading genomes

Once you have installed blat, the next step is to download a FASTA file for your genome of interest.

If you wanted hg38, for instance:

$ for chr in `seq 1 22` X Y; do echo $chr; wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr$chr.fa.gz | gunzip -c - >> hg38.fa; done

Optimizing queries

Once you have this file hg38.fa, you can start doing queries against it to look for sequence matches, but it can help speed up searches if you first make an OOC file:

$ blat /path/to/hg38.fa /dev/null /dev/null -makeOoc=/path/to/hg38.fa.11.ooc -repMatch=1024

When you do searches, you’d pass this OOC file as an option to skip over regions with over-represented sequences.

Querying

Once you have this OOC file made, you can do searches with your FASTA file containing sequences of interest:

$ blat /path/to/hg38.fa /path/to/your-sequences.fa -ooc=/path/to/hg38.fa.11.ooc search-results.psl

The blat binary will write any search results to a PSL-formatted text file called search-results.psl. You can name this whatever you want.

The PSL format is described on the UCSC site.

Parallelization

If you have very many sequences, you can parallelize this by simply splitting up your input sequences file into smaller pieces, and running multiple blat processes, one process for each piece of your original sequences file, writing many PSL files as output.

Set operations

It can help to use a tool like BEDOPS psl2bed to convert PSL to a BED file to do set operations, but that depends on what you want to do with the results. In any case, to convert a PSL file to a sorted BED file:

$ psl2bed < search-results.psl > search-results.bed

Read More

Finishing touches are in place for my convert2bed tool (GitHub site).

This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:

  • It runs about 3-10x as fast as bedtools *ToBed equivalents
  • It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format

As an example, here we use convert2bed on a 14M-read, indexed BAM file to a sorted BED file (data are piped to /dev/null) on a 4 GB, dual-Core 2 (2.4 GHz) workstation running RHEL 6:

$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028

Conversion is performed with default options (sorted BED as output, using BEDOPS sort-bed):

$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.

real 3m5.508s
user 0m25.702s
sys 0m8.602s

Here is the same conversion, performed with bedtools v2.22 bamToBed and sortBed:

$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam | ../bedtools2/bin/sortBed -i stdin > /dev/null

real    28m22.057s
user    2m58.579s
sys     0m41.605s

The use of convert2bed for this file offers a 9.1x speed improvement. Other large BAM files show similar conversion speedups.

Further time reductions are conferred with use of bam2bedcluster and bam2starchcluster scripts (TBA) which make use of GNU Parallel or a Sun Grid Engine job scheduler, reducing conversion time even further by breaking conversion tasks down by chromosome.

When testing is complete, code will be wrapped into the upcoming BEDOPS v2.4.3 release. Source is now available via GitHub.

Read More

I wrote a data extraction utility which uses PolarSSL to export a Base64-encoded SHA-1 digest of some internal metadata (a string of JSON-formatted data), to help validate archive integrity:

$ unstarch --sha1-signature .foo
7HkOxDUBJd2rU/CQ/zigR84MPTc=

So far, so good.

But now I want to validate that the metadata are being digested correctly through some independent means, preferably via the command-line, so that I can perform regression testing. I can use the openssl, xxd and base64 tools together to test that I get the same answer:

$ unstarch --list-json-no-trailing-newline .foo \
| openssl sha1 \
| xxd -r -p \
| base64
7HkOxDUBJd2rU/CQ/zigR84MPTc=

As a note to myself: I end up stripping the trailing newline from the JSON output of unstarch because this is what the PolarSSL library ends up digesting. This very nearly had me doubting whether PolarSSL was working correctly, or whether my command-line test was correct!

Read More

Here’s a one-liner that converts jarch files to starch format, stripping the input file’s extension so that it can be replaced with a new one:

$ for i in `ls *.jarch`; do echo "${i%.*}.starch"; gchr $i | starch - > "${i%.*}.starch"; done

Read More

PolarSSL is a C-based cryptography and SSL library which has a GPL license, which makes it ideal for use with BEDOPS, where I plan to use it for quick SHA-1 hashes of metadata, so as to help validate the integrity of the archive.

I’ve been testing it out in Mac OS X 10.8 and it seems pretty straightforward. Here’s a simple project that hashes the string abc:

#include <stdlib.h>                                                                                                                                                                               
#include <stdio.h>                                                                                                                                                                                
#include "polarssl/config.h"                                                                                                                                                                      
#include "polarssl/sha1.h"                                                                                                                                                                        
                                                                                                                                                                                                  
int main(int argc, char **argv)                                                                                                                                                                   
{                                                                                                                                                                                                 
    unsigned char output[20];                                                                                                                                                                     
    unsigned char *buf;                                                                                                                                                                           
    size_t bufLength;                                                                                                                                                                             
    size_t idx;                                                                                                                                                                                      
                                                                                                                                                                                                  
    buf = strdup("abc");                                                                                                                                                                          
    bufLength = strlen(buf);                                                                                                                                                                      
                                                                                                                                                                                                  
    sha1(buf, bufLength, output);                                                                                                                                                                 
                                                                                                                                                                                                  
    for (idx = 0; idx < 20; idx++) {                                                                                                                                                               
        fprintf(stdout, "%02x", output[idx]);
        if ((idx + 1) % 4 == 0)
            fprintf(stdout, " ");
    }                                                                                                                                                                                                  
    fprintf(stdout, "\n");                                                                                                                                                                        
                                                                                                                                                                                                  
    free(buf);                                                                                                                                                                                    
                                                                                                                                                                                                  
    return EXIT_SUCCESS;                                                                                                                                                                          
}

To compile it:

gcc -Wall -lpolarssl sha1test.c -o sha1test

When we ask for the hash value of the string abc, we get the following result:

$ ./sha1test
a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d

This agrees with the value reported at NIST, which is also:

a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d

Testing output against standards is useful for validation.

Read More

I plan to use a magic number in the second major release of our BEDOPS suite to uniquely identify starch-formatted archive files.

Looking at the first few bytes of the archive will help us because I plan to move the metadata to the back of the archive file, and it would be expensive to seek to the end of the file just to determine the file type. I figure a 64-byte header is enough to hold some extra data that identifies the file type and points the unstarch and starchcat tools to the right byte in the file where the metadata is kept:

This image describes what I’m planning for Starch v2. Basically, we reserve a 64-byte header at the front of the archive that contains the following four items:

  • Magic number – The magic number ca5cade5 identifies this as a Starch v2-formatted file.
  • Offset – A zero-padded 16-digit unsigned integer that marks the byte into the file at which the archive’s metadata starts (including the 64-byte header itself).
  • Hash – Here I make a SHA-1 hash of a concatenated string made from the offset value and metadata string. This value helps validate the integrity of the archive metadata and the offset values. (At a later time, we will add hashes of the chromosome streams to the metadata, to allow full archive validation).
  • Reserved – We keep 20 bytes of space free, in case we need it.

After these 64 bytes, the compressed, per-chromosome streams start, and we then wrap up with the metadata at the end of the file.

Initially, the header will be the magic number and all zeros. At the conclusion of creating the archive, once all the streams are prepared and the metadata is ready to hash, parts of the header are written over with calculated values (offset and hash).

To assist with picking the right magic number, Ned Batchelder put together a Python script to generate all the hex words possible from a small subset of the English dictionary — words like deadbeef, dec0ded and 0ddba11 which can be used as magic numbers to identify a file type. Thanks, Ned! So far, I’m liking the magic word ca5cade5 as it has a nice biological flavor to it.

Read More