Author: alexpreynolds

Downloading BLAT

To get BLAT source code:

$ mkdir /tmp/blat && cd /tmp/blat
$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc.zip
$ unzip blatSrc.zip

Patching (optional)

I decided to make blat a static binary to avoid missing xyz.so shared library errors. Here’s a patch you can use to modify the blat makefile:

$ cat > static-blat-makefile.patch
3c3
< L += -lm $(SOCKETLIB)
---
> L += -lm -ldl $(SOCKETLIB) -static-libgcc
10c10
< ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O $(MYLIBS) $L
---
> ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O -static $(MYLIBS) $L

You may need static library packages installed on your system. The names of these packages will depend on your version of Linux.

Then apply the patch:

$ cd /tmp/blat/blatSrc/blat
$ cp makefile makefile.original
$ patch makefile.original -i ../../static-blat-makefile.patch -o makefile

You may decide not to apply this patch. You could probably skip this step. I just don’t like dynamic binaries.

Building BLAT

In any case, you will want to go to the top level of the blatSrc directory and run make to build the kit:

$ cd /tmp/blat/blatSrc && make

This will take a few minutes to build binaries. Grab some coffee or whatevs.

Installing BLAT

To install them into ${HOME}/bin/${MACHTYPE}, run:

$ make install

This destination is a subdirectory of your home directory.

Once it is built and installed, you can copy the binary to /usr/local/bin or somewhere in your shell’s PATH that makes sense to you. For me, my ${MACHTYPE} is x86_64 and I like having binaries in /usr/local/bin:

$ sudo cp ~areynolds/bin/x86_64/blat /usr/local/bin/blat

Adjust this to the particulars of your setup.

Downloading genomes

Once you have installed blat, the next step is to download a FASTA file for your genome of interest.

If you wanted hg38, for instance:

$ for chr in `seq 1 22` X Y; do echo $chr; wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr$chr.fa.gz | gunzip -c - >> hg38.fa; done

Optimizing queries

Once you have this file hg38.fa, you can start doing queries against it to look for sequence matches, but it can help speed up searches if you first make an OOC file:

$ blat /path/to/hg38.fa /dev/null /dev/null -makeOoc=/path/to/hg38.fa.11.ooc -repMatch=1024

When you do searches, you’d pass this OOC file as an option to skip over regions with over-represented sequences.

Querying

Once you have this OOC file made, you can do searches with your FASTA file containing sequences of interest:

$ blat /path/to/hg38.fa /path/to/your-sequences.fa -ooc=/path/to/hg38.fa.11.ooc search-results.psl

The blat binary will write any search results to a PSL-formatted text file called search-results.psl. You can name this whatever you want.

The PSL format is described on the UCSC site.

Parallelization

If you have very many sequences, you can parallelize this by simply splitting up your input sequences file into smaller pieces, and running multiple blat processes, one process for each piece of your original sequences file, writing many PSL files as output.

Set operations

It can help to use a tool like BEDOPS psl2bed to convert PSL to a BED file to do set operations, but that depends on what you want to do with the results. In any case, to convert a PSL file to a sorted BED file:

$ psl2bed < search-results.psl > search-results.bed

Read More

Here’s a quick method to get HGNC symbols and names that draws upon data from UCSC and the open source MyGene.info project:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz | gunzip -c | cut -f13 | sort | uniq | get_hgnc_names_for_symbols.py > hgnc_symbols_with_names.txt

There’s a Python script in there that I call get_hgnc_names_for_symbols.py:

#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

hgnc_symbols = []
for line in sys.stdin:
    hgnc_symbols.append('%s' % (line.strip()))

mg = MyGeneInfo()
results = mg.querymany(hgnc_symbols, scopes='symbol', species='human', verbose=False)

for result in results:
    sys.stdout.write("%s\t%s\n" % (result['symbol'], result['name']))

The pipeline above writes a two-column text file called hgnc_symbols_with_names.txt that contains the HGNC symbol (e.g., AAR2) and its name (e.g., AAR2 splicing factor homolog), which could be put into a lookup table or, given that it is sorted, could be searched very quickly with a binary search via the Python bisect library.

Read More

This assumes Homebrew is installed and that it installs Graphviz 2.40.1:

$ brew install graphviz
$ git clone https://github.com/pygraphviz/pygraphviz.git
$ cd pygraphviz
$ sudo python setup.py install --user --include-path=/usr/local/Cellar/graphviz/2.40.1/include --library-path=/usr/local/Cellar/graphviz/2.40.1/lib

Read More

Here are ways to get SIMD/SSE flags from machines running either Linux or OS X:

On Linux (CentOS 7):

$ cat /proc/cpuinfo | grep flags | uniq
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local

On Mac OS X 10.12:

$ sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
$ sysctl -a | grep machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID FPU_CSDS

See: https://stackoverflow.com/a/38345423/19410 for a discussion about how to detect instruction sets.

Read More

The following post explains steps I took to install and enable mongoDB 3.2.1 as a service running under CentOS 7.

Install development tools and libraries, download mongoDB and compile source, and install the compiled binaries:

$ sudo yum group install "Development Tools"
$ sudo yum install scons
$ sudo yum install glibc-static
$ curl -O https://fastdl.mongodb.org/src/mongodb-src-r3.2.1.tar.gz
$ tar zxvf mongodb-src-r3.2.1.tar.gz
$ cd mongodb-src-r3.2.1
$ scons --ssl all
$ sudo scons --prefix=/opt/mongo install

Set up a mongod account and relevant directories:

$ sudo groupadd --system mongod
$ sudo useradd --no-create-home --system --gid mongod --home-dir /var/lib/mongo --shell /sbin/nologin --comment 'mongod' mongod
$ sudo mkdir -p /var/lib/mongo
$ sudo chown -R mongod:mongod /var/lib/mongo
$ sudo chmod 0755 /var/lib/mongo/
$ sudo mkdir -p /var/{run,log}/mongodb/
$ sudo chown mongod:mongod /var/{run,log}/mongodb/
$ sudo chmod 0755 /var/{run,log}/mongodb/
$ sudo mkdir -p /data/db
$ sudo chown -R mongod:mongod /data/db
$ sudo chmod -R o+w /data/db

Copy over mongod.conf and mongod.service configuration files with modifications for our setup:

$ sudo cp rpm/mongod.conf /etc/mongod.conf
$ sudo cp rpm/mongod.service /lib/systemd/system/mongod.service
$ sudo sed -i -e 's\/usr/local/bin/mongod\/opt/mongo/bin/mongod\' /lib/systemd/system/mongod.service

Reload daemon templates, and start and enable the mongoDB service:

$ sudo systemctl --system daemon-reload
$ sudo systemctl start mongod.service
$ sudo systemctl enable mongod.service

Confirm that the service is running properly:

$ sudo systemctl status mongod.service
● mongod.service - High-performance, schema-free document-oriented database
   Loaded: loaded (/usr/lib/systemd/system/mongod.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-01-27 14:33:39 PST; 9min ago
 Main PID: 116789 (mongod)
   CGroup: /system.slice/mongod.service
           └─116789 /opt/mongo/bin/mongod --quiet -f /etc/mongod.conf run

Jan 27 14:33:39 example.com systemd[1]: Started High-performance, schema-free document-oriented database.
Jan 27 14:33:39 example.com systemd[1]: Starting High-performance, schema-free document-oriented database...
Jan 27 14:33:39 example.com mongod[116787]: about to fork child process, waiting until server is ready for connections.
Jan 27 14:33:39 example.com mongod[116787]: forked process: 116789
Jan 27 14:33:39 example.com mongod[116787]: child process started successfully, parent exiting

You can also check the file /var/run/mongodb/mongod.pid for a valid process ID value. Sometimes it might be necessary to create the parent folder so that the PID can be created:

$ sudo mkdir /var/run/mongodb/

You could also check the mongoDB log for other errors:

$ tail /var/log/mongodb/mongod.log

If the mongod service is not active, double-check that folders are named correctly in configuration and service files, and that permissions and ownership are set correctly on those folders. If anything is not named and attributed correctly, then the service will likely not start and note something like the following error:

about to fork child process, waiting until server is ready for connections. forked process: 1234 ERROR: child process failed, exited with error number 1

I hope this helps others with setting up mongoDB under CentOS — good luck!

Read More

Our research lab is non-profit, but private GitHub repositories still cost money, so I have been playing with GitLab Community Edition to serve up some private Git repositories from a third-party host on the cheap.

Before using GitLab CE, I had set up a Git repository that, for whatever reason, would not allow users to cache credentials and would also not allow access via https (SSL). It was getting pretty frustrating to have to type in a long string of credentials on every commit, so setting up a proper Git server was one of the goals.

Installing and setting up the server is pretty painless. After installing all the necessary files and editing the server’s configuration file, I go into the GitLab web console and add myself as a user, and then add myself as a master of a test repository called test-repo.

When I try to clone this test repository via https, I get a Peer's Certificate issuer is not recognized error, which prevents cloning.

To debug this, Git uses the curl framework, which I put into verbose mode:

$ export GIT_CURL_VERBOSE=1

When cloning, I get a bit more detail about the certificate issuer error message:

$ git clone https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git
Cloning into 'test-repo'...
* Couldn't find host somehost.lab.org in the .netrc file; using defaults
* About to connect() to somehost.lab.org port 9999 (#0)
* Trying ...
* Connection refused
* Trying ...
* Connected to somehost.lab.org (127.0.0.1) port 9999 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* failed to load '/etc/pki/tls/certs/renew-dummy-cert' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/Makefile' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/localhost.crt' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/make-dummy-cert' from CURLOPT_CAPATH
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: /etc/pki/tls/certs
* Server certificate:
* subject: CN=*.lab.org,OU=Domain Control Validated
* start date: Oct 10 19:14:52 2013 GMT
* expire date: Oct 10 19:14:52 2018 GMT
* common name: *.lab.org
* issuer: CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O="GoDaddy.com, Inc.",L=Scottsdale,ST=Arizona,C=US
* NSS error -8179 (SEC_ERROR_UNKNOWN_ISSUER)
* Peer's Certificate issuer is not recognized.
* Closing connection 0
fatal: unable to access 'https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git/': Peer's Certificate issuer is not recognized.

Something is up with the certificate from Go Daddy. From some Googling around, it looks like nginx doesn’t like using intermediate certificates to validate server certificates.

To fix this, I concatenate my wildcard CRT certificate file with GoDaddy’s intermediate and root certificates, which are available from their certificate repository:

$ sudo su -
# cd /etc/gitlab/ssl
# wget https://certs.godaddy.com/repository/gdroot-g2.crt
# wget https://certs.godaddy.com/repository/gdig2.crt
# cat somehost.lab.org.crt gdig2.crt gdroot-g2.crt > somehost.lab.org.combined-with-gd-root-and-intermediate.crt

I then edit the GitLab configuration file to point its nginx certificate file setting to this combined file:

...
################
# GitLab Nginx #
################
## see: https://gitlab.com/gitlab-org/omnibus-gitlab/tree/629def0a7a26e7c2326566f0758d4a27857b52a3/doc/settings/nginx.md

# nginx['enable'] = true
# nginx['client_max_body_size'] = '250m'
# nginx['redirect_http_to_https'] = true
# nginx['redirect_http_to_https_port'] = 443
nginx['ssl_certificate'] = "/etc/gitlab/ssl/somehost.lab.org.combined-with-gd-root-and-intermediate.crt"
...

Once this is done, I then reconfigure and restart GitLab the usual way:

$ sudo gitlab-ctl reconfigure
$ sudo gitlab-ctl restart

After giving the server a few moments to crank up, I then clone the Git repository:

$ git clone https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git
Password for 'https://areynolds@somehost.lab.org:9999': ...

I can even cache credentials!

$ git config credential.helper store

Much nicer than the previous, non-web setup.

Read More