Category: bioinformatics

Downloading BLAT

To get BLAT source code:

$ mkdir /tmp/blat && cd /tmp/blat
$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc.zip
$ unzip blatSrc.zip

Patching (optional)

I decided to make blat a static binary to avoid missing xyz.so shared library errors. Here’s a patch you can use to modify the blat makefile:

$ cat > static-blat-makefile.patch
3c3
< L += -lm $(SOCKETLIB)
---
> L += -lm -ldl $(SOCKETLIB) -static-libgcc
10c10
< ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O $(MYLIBS) $L
---
> ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O -static $(MYLIBS) $L

You may need static library packages installed on your system. The names of these packages will depend on your version of Linux.

Then apply the patch:

$ cd /tmp/blat/blatSrc/blat
$ cp makefile makefile.original
$ patch makefile.original -i ../../static-blat-makefile.patch -o makefile

You may decide not to apply this patch. You could probably skip this step. I just don’t like dynamic binaries.

Building BLAT

In any case, you will want to go to the top level of the blatSrc directory and run make to build the kit:

$ cd /tmp/blat/blatSrc && make

This will take a few minutes to build binaries. Grab some coffee or whatevs.

Installing BLAT

To install them into ${HOME}/bin/${MACHTYPE}, run:

$ make install

This destination is a subdirectory of your home directory.

Once it is built and installed, you can copy the binary to /usr/local/bin or somewhere in your shell’s PATH that makes sense to you. For me, my ${MACHTYPE} is x86_64 and I like having binaries in /usr/local/bin:

$ sudo cp ~areynolds/bin/x86_64/blat /usr/local/bin/blat

Adjust this to the particulars of your setup.

Downloading genomes

Once you have installed blat, the next step is to download a FASTA file for your genome of interest.

If you wanted hg38, for instance:

$ for chr in `seq 1 22` X Y; do echo $chr; wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr$chr.fa.gz | gunzip -c - >> hg38.fa; done

Optimizing queries

Once you have this file hg38.fa, you can start doing queries against it to look for sequence matches, but it can help speed up searches if you first make an OOC file:

$ blat /path/to/hg38.fa /dev/null /dev/null -makeOoc=/path/to/hg38.fa.11.ooc -repMatch=1024

When you do searches, you’d pass this OOC file as an option to skip over regions with over-represented sequences.

Querying

Once you have this OOC file made, you can do searches with your FASTA file containing sequences of interest:

$ blat /path/to/hg38.fa /path/to/your-sequences.fa -ooc=/path/to/hg38.fa.11.ooc search-results.psl

The blat binary will write any search results to a PSL-formatted text file called search-results.psl. You can name this whatever you want.

The PSL format is described on the UCSC site.

Parallelization

If you have very many sequences, you can parallelize this by simply splitting up your input sequences file into smaller pieces, and running multiple blat processes, one process for each piece of your original sequences file, writing many PSL files as output.

Set operations

It can help to use a tool like BEDOPS psl2bed to convert PSL to a BED file to do set operations, but that depends on what you want to do with the results. In any case, to convert a PSL file to a sorted BED file:

$ psl2bed < search-results.psl > search-results.bed

Read More

Here’s a quick method to get HGNC symbols and names that draws upon data from UCSC and the open source MyGene.info project:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | cut -f13 | sort | uniq | get_hgnc_names_for_symbols.py > hgnc_symbols_with_names.txt

There’s a Python script in there that I call get_hgnc_names_for_symbols.py:

#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

hgnc_symbols = []
for line in sys.stdin:
    hgnc_symbols.append('%s' % (line.strip()))

mg = MyGeneInfo()
results = mg.querymany(hgnc_symbols, scopes='symbol', species='human', verbose=False)

for result in results:
    sys.stdout.write("%s\t%s\n" % (result['symbol'], result['name']))

The pipeline above writes a two-column text file called hgnc_symbols_with_names.txt that contains the HGNC symbol (e.g., AAR2) and its name (e.g., AAR2 splicing factor homolog), which could be put into a lookup table or, given that it is sorted, could be searched very quickly with a binary search via the Python bisect library.

Read More