OCR resources for iOS

Here are some useful resources for open source C and C++ -based OCR libraries that could run under iOS (need to check licensing):

The end goal is to be able to use an iPhone to read LED displays, as commonly found on meters, etc. and then do something useful with that data (upload it somewhere, tagged with geodata). An aggregate of hundreds or thousands of users could conceivably collect data useful for themselves and also for the group as a whole.

Regression testing of SHA-1 signatures via command-line

I wrote a data extraction utility which uses PolarSSL to export a Base64-encoded SHA-1 digest of some internal metadata (a string of JSON-formatted data), to help validate archive integrity:

$ unstarch --sha1-signature .foo
7HkOxDUBJd2rU/CQ/zigR84MPTc=

So far, so good.

But now I want to validate that the metadata are being digested correctly through some independent means, preferably via the command-line, so that I can perform regression testing. I can use the openssl, xxd and base64 tools together to test that I get the same answer:

$ unstarch --list-json-no-trailing-newline .foo \
| openssl sha1 \
| xxd -r -p \
| base64
7HkOxDUBJd2rU/CQ/zigR84MPTc=

As a note to myself: I end up stripping the trailing newline from the JSON output of unstarch because this is what the PolarSSL library ends up digesting. This very nearly had me doubting whether PolarSSL was working correctly, or whether my command-line test was correct!

Selecting a new gcc version installed via MacPorts

MacPorts is useful for installing a variety of command-line utilities and programs for Mac OS X. There are others, e.g. Homebrew. After using MacPorts to update a GNU gcc installation, it is useful to select the new revision. Tips were posted to this Stack Overflow thread. Basically, it boils down to two steps:

  1. sudo port select --list gcc
  2. sudo port select --set gcc mp-gcc47

Playing with SHA-1 hashing in PolarSSL

PolarSSL is a C-based cryptography and SSL library which has a GPL license, which makes it ideal for use with BEDOPS, where I plan to use it for quick SHA-1 hashes of metadata, so as to help validate the integrity of the archive.

I’ve been testing it out in Mac OS X 10.8 and it seems pretty straightforward. Here’s a simple project that hashes the string abc:

#include <stdlib.h>                                                                                                                                                                               
#include <stdio.h>                                                                                                                                                                                
#include "polarssl/config.h"                                                                                                                                                                      
#include "polarssl/sha1.h"                                                                                                                                                                        
                                                                                                                                                                                                  
int main(int argc, char **argv)                                                                                                                                                                   
{                                                                                                                                                                                                 
    unsigned char output[20];                                                                                                                                                                     
    unsigned char *buf;                                                                                                                                                                           
    size_t bufLength;                                                                                                                                                                             
    size_t idx;                                                                                                                                                                                      
                                                                                                                                                                                                  
    buf = strdup("abc");                                                                                                                                                                          
    bufLength = strlen(buf);                                                                                                                                                                      
                                                                                                                                                                                                  
    sha1(buf, bufLength, output);                                                                                                                                                                 
                                                                                                                                                                                                  
    for (idx = 0; idx < 20; idx++) {                                                                                                                                                               
        fprintf(stdout, "%02x", output[idx]);
        if ((idx + 1) % 4 == 0)
            fprintf(stdout, " ");
    }                                                                                                                                                                                                  
    fprintf(stdout, "\n");                                                                                                                                                                        
                                                                                                                                                                                                  
    free(buf);                                                                                                                                                                                    
                                                                                                                                                                                                  
    return EXIT_SUCCESS;                                                                                                                                                                          
}

To compile it:

gcc -Wall -lpolarssl sha1test.c -o sha1test

When we ask for the hash value of the string abc, we get the following result:

$ ./sha1test
a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d

This agrees with the value reported at NIST, which is also:

a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d

Testing output against standards is useful for validation.

Hex words are magic

I plan to use a magic number in the second major release of our BEDOPS suite to uniquely identify starch-formatted archive files.

Looking at the first few bytes of the archive will help us because I plan to move the metadata to the back of the archive file, and it would be expensive to seek to the end of the file just to determine the file type. I figure a 64-byte header is enough to hold some extra data that identifies the file type and points the unstarch and starchcat tools to the right byte in the file where the metadata is kept:

This image describes what I’m planning for Starch v2. Basically, we reserve a 64-byte header at the front of the archive that contains the following four items:

  • Magic number – The magic number ca5cade5 identifies this as a Starch v2-formatted file.
  • Offset – A zero-padded 16-digit unsigned integer that marks the byte into the file at which the archive’s metadata starts (including the 64-byte header itself).
  • Hash – Here I make a SHA-1 hash of a concatenated string made from the offset value and metadata string. This value helps validate the integrity of the archive metadata and the offset values. (At a later time, we will add hashes of the chromosome streams to the metadata, to allow full archive validation).
  • Reserved – We keep 20 bytes of space free, in case we need it.

After these 64 bytes, the compressed, per-chromosome streams start, and we then wrap up with the metadata at the end of the file.

Initially, the header will be the magic number and all zeros. At the conclusion of creating the archive, once all the streams are prepared and the metadata is ready to hash, parts of the header are written over with calculated values (offset and hash).

To assist with picking the right magic number, Ned Batchelder put together a Python script to generate all the hex words possible from a small subset of the English dictionary — words like deadbeef, dec0ded and 0ddba11 which can be used as magic numbers to identify a file type. Thanks, Ned! So far, I’m liking the magic word ca5cade5 as it has a nice biological flavor to it.

Using wget to grab a copy of a website for offline browsing

When I want to make BEDOPS documentation available for offline browsing, here is the command I use:

wget –no-parent –recursive –page-requisites –html-extension –convert-links -E -l 2 http://code.google.com/p/bedops

This handy wget statement fixes up the img and other URL references so that links to images and other resources load up from the local copy. So what the end user sees is almost exactly like what she would see if the documentation was retrieved through a network connection. Very nice.