Speedy BED conversion tool: convert2bed

Finishing touches are in place for my convert2bed tool (GitHub site).

This utility converts common genomics data formats (BAM, GFF, GTF, PSL, SAM, VCF, WIG) to lexicographically-sorted UCSC BED format. It offers two benefits over alternatives:

  • It runs about 3-10x as fast as bedtools *ToBed equivalents
  • It converts all input fields in as non-lossy a way as possible, to allow recovery of data to the original format

As an example, here we use convert2bed to convert a 14M-read, indexed BAM file to a sorted BED file (output is sent to /dev/null) on a 4 GB, dual-Core 2 (2.4 GHz) workstation running RHEL 6. First, the read count:

$ samtools view -c ../DS27127A_GTTTCG_L001.uniques.sorted.bam
14090028

Conversion is performed with default options (sorted BED as output, using BEDOPS sort-bed):

$ time ./convert2bed -i bam < ../DS27127A_GTTTCG_L001.uniques.sorted.bam > /dev/null
[bam_header_read] EOF marker is absent. The input is probably truncated.

real 3m5.508s
user 0m25.702s
sys 0m8.602s

Here is the same conversion, performed with bedtools v2.22 bamToBed and sortBed:

$ time ../bedtools2/bin/bamToBed -i ../DS27127A_GTTTCG_L001.uniques.sorted.bam | ../bedtools2/bin/sortBed -i stdin > /dev/null

real    28m22.057s
user    2m58.579s
sys     0m41.605s

The use of convert2bed for this file offers a 9.1x speed improvement. Other large BAM files show similar conversion speedups.
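To make the arithmetic behind that speedup explicit, here is a minimal Python sketch that parses the two "real" (wall-clock) values reported by `time` above and divides them. The helper name `time_to_seconds` is made up for illustration. The ratio lands a little above 9x, in line with the figure quoted above.

```python
import re


def time_to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '3m5.508s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times copied from the two runs shown above.
convert2bed_real = time_to_seconds("3m5.508s")
bedtools_real = time_to_seconds("28m22.057s")

# Speedup is the ratio of the slower run to the faster run.
speedup = bedtools_real / convert2bed_real
print(f"{speedup:.2f}x")
```

Only the wall-clock times are compared here; the user and sys times would give somewhat different ratios.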

Further time reductions are possible with the bam2bedcluster and bam2starchcluster scripts (TBA), which use GNU Parallel or a Sun Grid Engine job scheduler to split conversion tasks by chromosome.

When testing is complete, code will be wrapped into the upcoming BEDOPS v2.4.3 release. Source is now available via GitHub.

Getting around Google Chrome’s broken printing in OS X

Google once again has moved the print dialog box settings around in its browser, making it purposefully difficult to set the default print option to use the native OS X software. It wouldn’t be a problem if Chrome didn’t mess up what I’m trying to print! Here is a command to issue from Terminal.app, which seems to fix this bug with v37:

defaults write com.google.Chrome DisablePrintPreview -boolean true

For Google Chrome Canary nightly builds (v40-ish?), the following seems to work:

defaults write com.google.Chrome.canary DisablePrintPreview -boolean true

Maybe it’s time to look into Safari again…

Check your trailing newlines, kids

This mistake has caught me before, but I always overlook it: https://github.com/lindenb/magic/issues/1#issuecomment-54685236

Old school GNU glue

Say we have a bunch of text files, each containing a column of non-negative numerical values that we want to log-transform (base-10, after adding 1 so that zeros stay defined):

for i in *.txt; do echo "$i"; awk '{system("calc \"log("$1" + 1)\" | sed -e \"s/^[\t~]*//\"");}' "$i" > "$i.transformed"; done

Slow, but it seems to work in a pinch.
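The slowness comes from spawning calc and sed once per input line. As a faster alternative (not the approach used above), here is a minimal Python sketch of the same log10(x + 1) transform. The function names `transform_file` and `transform_all` are made up for illustration. Two behaviors are assumptions rather than matches to the one-liner: output is written to six decimal places, and blank lines are skipped.

```python
import math
from pathlib import Path


def transform_file(path: Path) -> Path:
    """Write log10(x + 1) of the first column of `path` to `<path>.transformed`."""
    out_path = path.with_name(path.name + ".transformed")
    with path.open() as src, out_path.open("w") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # assumption: blank lines are skipped
            value = float(fields[0])
            dst.write(f"{math.log10(value + 1.0):.6f}\n")
    return out_path


def transform_all(directory: Path) -> list:
    """Transform every *.txt file in `directory`; return the output paths."""
    return [transform_file(p) for p in sorted(directory.glob("*.txt"))]
```

Calling `transform_all(Path("."))` would process the current directory, much like the shell loop does.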

Platform-independent methods to get number of available cores and/or processors with C/C++

See: http://stackoverflow.com/questions/150355/programmatically-find-the-number-of-cores-on-a-machine

Adding same-sex marriage law data to a d3-cartogram


Shawn Allen wrote a d3.js-based implementation of a 2D cartogram, which sizes US states in an area-proportional manner, where area is based on some interesting statistic, like population.

There has been a great deal of progress made in the last year in defending the rights of GLBT Americans to marry and have their partnership rights acknowledged, rights like visitation and estate planning, rights that straight couples take for granted when visiting their loved one in the hospital, or sharing their lives in the house they own, etc.

It’s easy enough to see a map of the 50 states colored by legal status, but people are not spread evenly across the states. I wanted to see how the United States was progressing as a function of population.

I forked Allen’s project (GitHub project source code available here) and I redid the color scheme, which takes the 50 states and the District of Columbia and shades them by their legal status, whether their laws defend or remove same-sex marriage rights (and associated protections).

Green states allow same-sex marriage, light-green states allow civil unions, orange states allow marriage or civil unions (but rulings are currently held up on appeal), and red states do not defend same-sex marriage rights, either by explicit law or constitutional amendment.

I based the color assignments initially on data from the Right to Marry site, up-to-date as of May 19th, 2014. But with Pennsylvania’s Gov. Corbett conceding defeat and vowing not to appeal the ruling, I added Pennsylvania to the list of pro-equality states.

In addition to showing how fast things have changed, sizing the states by population quickly shows that over half the country (by 2010 US Census population counts, at least) now enjoys, or will soon enjoy pending appeals, legal protections that were once denied to a minority of Americans.

matrix2png -to- matrix2pdf

For scientific work, I have used matrix2png to make a nice PNG image from a text-formatted matrix of data values. PNG looks great on the web, but it doesn’t translate well to making publication-quality figures.

My thought was to take matrix2png and, with the help of Haru (libharu), turn it into matrix2pdf. Maybe I can get this going on GitHub.