Archives for category: programming

Here are ways to get SIMD/SSE flags from machines running either Linux or OS X:

On Linux (CentOS 7):

$ cat /proc/cpuinfo | grep flags | uniq
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local

On Mac OS X 10.12:

$ sysctl -a | grep machdep.cpu.features
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
$ sysctl -a | grep machdep.cpu.leaf7_features
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID FPU_CSDS

See: https://stackoverflow.com/a/38345423/19410 for a discussion about how to detect instruction sets.
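For a quick scripted check of a single instruction set on Linux, grepping the flags line works in a pinch (avx2 here is just an example):

$ grep -qw avx2 /proc/cpuinfo && echo "avx2 available" || echo "avx2 not available"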

Our research lab is non-profit, but private GitHub repositories still cost money, so I have been playing with GitLab Community Edition to serve up some private Git repositories from a third-party host on the cheap.

Before using GitLab CE, I had set up a Git repository that, for whatever reason, would not allow users to cache credentials and would not allow access via https (SSL). It was getting pretty frustrating to type in a long string of credentials on every push or pull, so setting up a proper Git server was one of the goals.

Installing and setting up the server is pretty painless. After installing all the necessary files and editing the server’s configuration file, I go into the GitLab web console and add myself as a user, and then add myself as a master of a test repository called test-repo.

When I try to clone this test repository via https, I get a Peer's Certificate issuer is not recognized error, which prevents cloning.

Git uses curl under the hood to talk to remotes, so to debug this I put curl into verbose mode:

$ export GIT_CURL_VERBOSE=1

When cloning, I get a bit more detail about the certificate issuer error message:

$ git clone https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git
Cloning into 'test-repo'...
* Couldn't find host somehost.lab.org in the .netrc file; using defaults
* About to connect() to somehost.lab.org port 9999 (#0)
* Trying ...
* Connection refused
* Trying ...
* Connected to somehost.lab.org (127.0.0.1) port 9999 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* failed to load '/etc/pki/tls/certs/renew-dummy-cert' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/Makefile' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/localhost.crt' from CURLOPT_CAPATH
* failed to load '/etc/pki/tls/certs/make-dummy-cert' from CURLOPT_CAPATH
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: /etc/pki/tls/certs
* Server certificate:
* subject: CN=*.lab.org,OU=Domain Control Validated
* start date: Oct 10 19:14:52 2013 GMT
* expire date: Oct 10 19:14:52 2018 GMT
* common name: *.lab.org
* issuer: CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O="GoDaddy.com, Inc.",L=Scottsdale,ST=Arizona,C=US
* NSS error -8179 (SEC_ERROR_UNKNOWN_ISSUER)
* Peer's Certificate issuer is not recognized.
* Closing connection 0
fatal: unable to access 'https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git/': Peer's Certificate issuer is not recognized.

Something is up with the certificate from Go Daddy. From some Googling around, it looks like nginx isn’t serving GoDaddy’s intermediate certificate alongside the server certificate, so clients can’t build a trust chain back to a recognized root.

To fix this, I concatenate my wildcard CRT certificate file with GoDaddy’s intermediate and root certificates, which are available from their certificate repository:

$ sudo su -
# cd /etc/gitlab/ssl
# wget https://certs.godaddy.com/repository/gdroot-g2.crt
# wget https://certs.godaddy.com/repository/gdig2.crt
# cat somehost.lab.org.crt gdig2.crt gdroot-g2.crt > somehost.lab.org.combined-with-gd-root-and-intermediate.crt
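Before pointing GitLab at the combined file, it’s worth sanity-checking that the chain actually validates, treating the intermediate as untrusted chain material; openssl should report OK:

# openssl verify -CAfile gdroot-g2.crt -untrusted gdig2.crt somehost.lab.org.crt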

I then edit the GitLab configuration file to point its nginx certificate file setting to this combined file:

...
################
# GitLab Nginx #
################
## see: https://gitlab.com/gitlab-org/omnibus-gitlab/tree/629def0a7a26e7c2326566f0758d4a27857b52a3/doc/settings/nginx.md

# nginx['enable'] = true
# nginx['client_max_body_size'] = '250m'
# nginx['redirect_http_to_https'] = true
# nginx['redirect_http_to_https_port'] = 443
nginx['ssl_certificate'] = "/etc/gitlab/ssl/somehost.lab.org.combined-with-gd-root-and-intermediate.crt"
...

Once this is done, I reconfigure and restart GitLab the usual way:

$ sudo gitlab-ctl reconfigure
$ sudo gitlab-ctl restart

After giving the server a few moments to crank up, I then clone the Git repository:

$ git clone https://areynolds@somehost.lab.org:9999/areynolds/test-repo.git
Password for 'https://areynolds@somehost.lab.org:9999': ...

I can even cache credentials!

$ git config credential.helper store
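Note that the store helper writes credentials in plaintext to ~/.git-credentials. If that is a concern, the cache helper holds them in memory for a limited time instead, e.g. an hour:

$ git config credential.helper 'cache --timeout=3600'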

Much nicer than the previous, non-web setup.

Newer versions of Emacs include a JavaScript mode and other editing modes useful for modern app development. Building from the git repository on a RHEL/CentOS-flavored system looks like this:

$ git clone git://git.savannah.gnu.org/emacs.git
$ sudo yum groupinstall "Development Tools"
$ wget ftp://ftp.gnu.org/gnu/autoconf/autoconf-2.68.tar.bz2
$ tar jxvf autoconf-2.68.tar.bz2
$ cd autoconf-2.68
$ ./configure; make; sudo make install
$ sudo yum install texinfo libXpm-devel giflib-devel libtiff-devel libotf-devel  
$ cd ../emacs
$ make bootstrap; sudo make install

This process can take upwards of 20-30 minutes.
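On a multi-core machine, parallelizing the build should shave this down considerably (assuming the nproc utility from GNU coreutils is available):

$ make -j"$(nproc)" bootstrap; sudo make install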

With the git repo state as of 24 March 2015:

$ emacs --version
GNU Emacs 25.0.50.1
Copyright (C) 2015 Free Software Foundation, Inc.
GNU Emacs comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GNU Emacs
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

Via: http://haulynjason.net/weblog/?p=1592

Finishing touches are in place for my convert2bed tool (GitHub site).

This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:

  • It runs about 3-10x as fast as bedtools *ToBed equivalents
  • It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format

As an example, here we use convert2bed to convert a 14M-read, indexed BAM file to a sorted BED file (data are piped to /dev/null) on a 4 GB, dual-core Core 2 (2.4 GHz) workstation running RHEL 6:

$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028

Conversion is performed with default options (sorted BED as output, using BEDOPS sort-bed):

$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.

real 3m5.508s
user 0m25.702s
sys 0m8.602s

Here is the same conversion, performed with bedtools v2.22 bamToBed and sortBed:

$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam | ../bedtools2/bin/sortBed -i stdin > /dev/null

real    28m22.057s
user    2m58.579s
sys     0m41.605s

The use of convert2bed for this file offers a 9.1x speed improvement. Other large BAM files show similar conversion speedups.

Additional time reductions come from the bam2bedcluster and bam2starchcluster scripts (TBA), which use GNU Parallel or a Sun Grid Engine job scheduler to break the conversion task down by chromosome.
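As a rough sketch of the per-chromosome idea, assuming a hypothetical indexed BAM file reads.bam, with samtools and GNU Parallel on the PATH (the actual cluster scripts may differ in their details):

$ samtools view -H reads.bam \
| awk -F'\t' '/^@SQ/ { sub("SN:", "", $2); print $2 }' \
| parallel 'samtools view -b reads.bam {} | convert2bed -i bam > {}.bed'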

When testing is complete, code will be wrapped into the upcoming BEDOPS v2.4.3 release. Source is now available via GitHub.

Say we have a bunch of text files each containing a column of non-negative numerical values that we want to log-transform (base-10):

for i in `ls *.txt`; do echo $i; awk '{system("calc \"log("$1" + 1)\" | sed -e \"s/^[\t~]*//\"");}' $i > $i.transformed; done

Slow, but it seems to work in a pinch.
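The per-line calc and sed processes are what make it slow. A much faster alternative keeps everything inside awk, using its built-in natural log and the change-of-base identity log10(x) = log(x)/log(10):

$ for i in *.txt; do awk '{ print log($1 + 1) / log(10) }' "$i" > "$i.transformed"; done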

See: http://stackoverflow.com/questions/150355/programmatically-find-the-number-of-cores-on-a-machine

Cartograms with d3 + TopoJSON: same-sex marriage

Shawn Allen wrote a d3.js-based implementation of a 2D cartogram, which resizes US states so that each state’s area is proportional to some interesting statistic, like population.

There has been a great deal of progress in the last year in defending the rights of GLBT Americans to marry and have their partnership rights acknowledged: rights like hospital visitation and estate planning, rights that straight couples take for granted when visiting a loved one in the hospital or sharing their lives in the house they own.

It’s easy enough to see a map of the 50 states colored by legal status, but people are not spread evenly across the states. I wanted to see how the United States was progressing as a function of population.

I forked Allen’s project (source code available here) and redid the color scheme, shading the 50 states and the District of Columbia by legal status: whether their laws defend or deny same-sex marriage rights (and associated protections).

Green states allow same-sex marriage, light-green states allow civil unions, orange states allow marriage or civil unions (but rulings are currently held up on appeal), and red states do not defend same-sex marriage rights, whether by explicit law or constitutional amendment.

I based the color assignments initially on data from the Right to Marry site, up-to-date as of May 19th, 2014. But with Pennsylvania’s Gov. Corbett conceding defeat and vowing not to appeal the ruling, I added Pennsylvania to the list of pro-equality states.

Beyond showing how fast things have changed, drawing by area makes clear that over half the country (by 2010 US Census population counts, at least) now enjoys, or will soon enjoy pending appeals, legal protections that were once denied to a minority of Americans.

For scientific work, I have used matrix2png to make a nice PNG image from a text-formatted matrix of data values. PNG looks great on the web, but it doesn’t translate well to publication-quality figures.

My thought was to take matrix2png and, with the help of Haru (libharu), turn it into matrix2pdf. Maybe I can get this going on GitHub.

I wrote a data extraction utility which uses PolarSSL to export a Base64-encoded SHA-1 digest of some internal metadata (a string of JSON-formatted data), to help validate archive integrity:

$ unstarch --sha1-signature .foo
7HkOxDUBJd2rU/CQ/zigR84MPTc=

So far, so good.

But now I want to validate that the metadata are being digested correctly through some independent means, preferably via the command-line, so that I can perform regression testing. I can use the openssl, xxd and base64 tools together to test that I get the same answer:

$ unstarch --list-json-no-trailing-newline .foo \
| openssl sha1 \
| xxd -r -p \
| base64
7HkOxDUBJd2rU/CQ/zigR84MPTc=

As a note to myself: I end up stripping the trailing newline from the JSON output of unstarch because this is what the PolarSSL library ends up digesting. This very nearly had me doubting whether PolarSSL was working correctly, or whether my command-line test was correct!
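As an aside, openssl can emit the raw digest directly with its -binary flag, which drops the xxd round-trip from the pipeline:

$ unstarch --list-json-no-trailing-newline .foo | openssl sha1 -binary | base64
7HkOxDUBJd2rU/CQ/zigR84MPTc=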

Here’s a one-liner that converts jarch files to starch format, stripping the input file’s extension so that it can be replaced with a new one:

$ for i in `ls *.jarch`; do echo "${i%.*}.starch"; gchr $i | starch - > "${i%.*}.starch"; done
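To spot-check the result, unstarch can extract a single chromosome from the new archive (chr1 and the filename here are just examples):

$ unstarch chr1 foo.starch | head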