The interface is defined in ONElib.h. There are 21 functions and 9 macros, with one primary
type, OneFile, which maintains information about the file being read or written, including the current line.
As a brief synopsis, the following reads a sequence file, prints out some simple stats, and writes a binary file containing the reverse-complemented sequences.
{ int totLen = 0;
  int totCount = 0;

  // NULL for read schema from file, 1 for single thread
  OneFile *in = oneFileOpenRead(inFile, NULL, "seq", 1);
  if (in == NULL)
    { fprintf(stderr,"Can't open sequence file %s to read\n",inFile);
      exit (1);
    }

  // true for binary, 1 for single thread
  OneFile *out = oneFileOpenWriteFrom(outFile, in, true, 1);
  if (out == NULL)
    { fprintf(stderr,"Can't open sequence file %s to write\n",outFile);
      exit (1);
    }
  oneAddProvenance(out,"revcomp","1.0","revcomp inFile outFile",0);

  while (oneReadLine(in))
    if (in->lineType == 'S')
      { totLen += oneLen(in);
        reverseComplement(oneString(in), oneLen(in)); // user-provided, assume acts in place
        oneWriteLine(out, 'S', oneLen(in), oneString(in));
      }
    else if (in->lineType == 'C')
      totCount += oneInt(in,0);

  printf("Total sequence length %d and counts %d.\n", totLen, totCount);
  // accum.total is an I64, so print it as long long
  printf("Auto-accumulated length %lld should be the same.\n", (long long) in->info['S']->accum.total);

  oneFileClose(in);
  oneFileClose(out); // NB this writes out the footer as well as closing the file - don't omit!
}
In the above, there is no check that the schema of the file fits the expectations of the code. It would have been possible to carry out such a check using
if (! oneFileCheckSchema (in, "D S 1 3 DNA\nD C 1 3 INT\n")) die ("schema mismatch");
which confirms that there are S lines with a single field encoding DNA, and C lines with a single field encoding an integer. Alternatively, one could define the schema ahead of opening the file as in
OneSchema *schema = oneSchemaCreateFromText ("P 3 seq\nD S 1 3 DNA\nD C 1 3 INT\n");
OneFile *in = oneFileOpenRead (inFile, schema, "seq", 1);
oneSchemaDestroy (schema) ;
Note that in this case it is necessary to define the file type "seq" in the schema, since a general schema can specify multiple file types. The schema can also be read from file using oneSchemaCreateFromFile().
There is also a more subtle difference: the first version checks that the S and C lines are present and specified as required while allowing additional unspecified line types, whereas the second version requires that the file contain only S and C lines. That is, for oneFileCheckSchema() all defined lines must be in the file, and for a schema given as an argument to oneFileOpenRead() all lines in the file must be in the schema.
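The two directions of the check can be pictured as set inclusion over line-type characters. The following is only a sketch of that idea, not ONElib code: the helpers toSet, isSubset, checkStyle and openStyle are hypothetical names, and field definitions are ignored so that each schema is reduced to the string of line types it defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Record which line-type characters occur in types (e.g. "SC"). */
static void toSet(const char *types, bool set[128])
{
    memset(set, 0, 128 * sizeof(bool));
    for (; *types; types++) {
        assert((unsigned char)*types < 128);
        set[(unsigned char)*types] = true;
    }
}

/* True if every line type in a also appears in b. */
static bool isSubset(const char *a, const char *b)
{
    bool inA[128], inB[128];
    toSet(a, inA);
    toSet(b, inB);
    for (int t = 0; t < 128; t++)
        if (inA[t] && !inB[t]) return false;
    return true;
}

/* oneFileCheckSchema direction: everything the program expects is in the file. */
static bool checkStyle(const char *expected, const char *fileTypes)
{ return isSubset(expected, fileTypes); }

/* oneFileOpenRead-with-schema direction: everything in the file is in the schema. */
static bool openStyle(const char *schema, const char *fileTypes)
{ return isSubset(fileTypes, schema); }
```

With a file that defines S, C and Q lines and a program that only knows S and C, checkStyle("SC", "SCQ") succeeds while openStyle("SC", "SCQ") fails.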
The following is derived from the file ONElib.h which provides the entire interface. First we provide the subroutine interface, then the data types (reversing the order in a normal C header file).
OneSchema *oneSchemaCreateFromFile (char *path) ;
OneSchema *oneSchemaCreateFromText (char *text) ;
These functions create a schema handle that can be used to open One-code data files for reading and writing. A schema file is itself a One-code file, consisting of a set of objects, one per primary file type. Valid lines in this file are:
P <primary file type> // a short string, generally lower-case
S <secondary file type> // a short string - any number of these
G <char> <field_list> // definition of group defining line - optional, at most one per P line
O <char> <field_list> // definition of object defining line - once and only once per P line
D <char> <field_list> // definition of standard line
<char> must be a lower or upper case letter. The O line specifies
the record type that defines objects in this file type. D lines define other record types used within objects. If objects are
grouped then the G line specifies the record type that demarcates groups. For
each of O, G and D, <field_list> is a list of field types from:
CHAR, INT, REAL, STRING, INT_LIST, REAL_LIST, STRING_LIST, DNA that
specify the required fields on a record initiated by the <char>.
Any additional text on the line is a comment. By convention comments in schema files explain the definitions.
Example, with lists and strings preceded by their length in OneCode style:
P 3 seq this is a sequence file
O S 1 3 DNA the DNA sequence - each S line starts an object
D Q 1 6 STRING the phred encoded quality score + ASCII 33
D N 4 4 REAL 4 REAL 4 REAL 4 REAL signal to noise ratio in A, C, G, T channels
G g 2 3 INT 6 STRING group designator: number of objects, name
The oneSchemaCreateFromText() alternative writes the text to a temp file and reads it with
oneSchemaCreateFromFile(). This allows code to set the schema.
void oneSchemaDestroy (OneSchema *schema);
Recovers the memory allocated for the schema object.
OneFile *oneFileOpenRead (const char *path, OneSchema *schema, char *type, int nthreads) ;
Open ONE file 'path', either binary or ascii encoded, for reading.
If the file doesn't have a header, then 'type' must be specified,
otherwise, if 'type' is non-zero it must match the header type.
All header information (if present) is read.
'schema' is also optional. If it is NULL then the file must contain its own schema.
If 'schema' is present then it must support 'type', and if the file contains its
own schema, then that must be a subset of the one for this type in 'schema'.
If nthreads > 1 then nthreads OneFiles are generated as an array and the pointer to the first, called the master, is returned. The other nthreads-1 files are called slaves. The package routines are aware of when a OneFile argument is a slave or master in a parallel group. The master receives provenance, counts, etc. The slaves only read data and have the virtue of sharing indices and codecs with the master if relevant.
BOOL oneFileCheckSchema (OneFile *vf, char *textSchema) ; // EXPERIMENTAL
Checks if file schema is consistent with text schema. Mismatches are reported to stderr. Filetype and all linetypes in text must match. File schema can contain additional linetypes. e.g. if (! oneFileCheckSchema (vf, "P 3 seq\nD S 1 3 DNA\nD Q 1 6 STRING\nD P 0\n")) die () ; This is provided to enable a program to ensure that its assumptions about data layout are satisfied.
char oneReadLine (OneFile *vf);
Read the next ONE formatted line returning the line type of the line, or 0 if at the end of the data section.
void *oneList (OneFile *vf); // lazy codec decompression if required
void *oneCompressedList (OneFile *vf); // lazy codec compression if required
#define oneInt(vf,x) ((vf)->field[x].i)
#define oneReal(vf,x) ((vf)->field[x].r)
#define oneChar(vf,x) ((vf)->field[x].c)
#define _LF(vf) ((vf)->info[(int)(vf)->lineType]->listField)
#define oneLen(vf) ((vf)->field[_LF(vf)].len & 0xffffffffffffffll)
#define oneString(vf) (char *) oneList(vf)
#define oneDNAchar(vf) (char *) oneList(vf)
#define oneIntList(vf) (I64 *) oneList(vf)
#define oneRealList(vf) (double *) oneList(vf)
#define oneNextString(vf,s) (s + strlen(s) + 1)
#define oneDNA2bit(vf) (U8 *) oneCompressedList(vf)
The content macros immediately above are
used to access the information of the line most recently read.
Lists are read lazily: when reading a binary 1-code file they remain compressed and are not decoded until you explicitly request them with oneList (or one of the typed macros built on it).
The 2-bit compressed form of DNA sequences is useful for certain common numeric tricks (such
as converting a bar-code into an integer), so oneDNA2bit always returns the compressed form
of a DNA list, compressing it first if necessary, as when reading an ASCII 1-code file.
The index x of a list object is not required as there is
only one list per line, stored in the line type's private buffer.
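As an illustration of the bar-code trick mentioned above, here is a minimal standalone sketch that turns a short DNA string into an integer using 2 bits per base. barcodeToInt is a hypothetical helper, not part of ONElib, and the a=0, c=1, g=2, t=3 assignment is an assumption. It works on the ASCII form; the same integer could presumably be assembled from the bytes returned by oneDNA2bit, but the byte layout of that form is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode up to 32 bases as an integer, 2 bits per base.
   Assumed codes: a=0, c=1, g=2, t=3; any other letter counts as a. */
static uint64_t barcodeToInt(const char *seq, int len)
{
    uint64_t key = 0;
    assert(len >= 0 && len <= 32);      /* 32 bases fill 64 bits */
    for (int i = 0; i < len; i++) {
        unsigned code;
        switch (seq[i]) {
        case 'c': case 'C': code = 1; break;
        case 'g': case 'G': code = 2; break;
        case 't': case 'T': code = 3; break;
        default:            code = 0; break;
        }
        key = (key << 2) | code;        /* earlier bases end up in higher bits */
    }
    return key;
}
```

The resulting key can be used directly as an array index or hash key for fixed-length bar-codes.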
A "string list" is implicitly supported: get the first string with oneString, and
subsequent strings sequentially with oneNextString, e.g.:
char *s = oneString(vf);
for (i = 0; i < oneLen(vf); i++)
{ // do something with i'th string
s = oneNextString(vf,s);
}
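The stepping rule used by oneNextString implies that the strings of a list sit back to back, each with its terminating NUL. The sketch below shows the same walk on an ordinary buffer so it can be tried without the library; nextString and nthString are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Same stepping rule as oneNextString: skip the string and its terminating NUL. */
#define nextString(s) ((s) + strlen(s) + 1)

/* Return the k'th string (0-based) of a packed list of n NUL-terminated strings. */
static const char *nthString(const char *list, long n, long k)
{
    const char *s = list;
    assert(k >= 0 && k < n);
    for (long i = 0; i < k; i++)
        s = nextString(s);
    return s;
}
```

For a buffer holding "alpha", "be" and "gamma" back to back, nthString(buf, 3, 2) returns a pointer to "gamma".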
char *oneReadComment (OneFile *vf);
Can be called after oneReadLine to read any optional comment text after the fixed fields.
Returns NULL if there is no comment.
OneFile *oneFileOpenWriteNew (const char *path, OneSchema *schema, char *type,
BOOL isBinary, int nthreads);
OneFile *oneFileOpenWriteFrom (const char *path, OneFile *vfIn,
BOOL isBinary, int nthreads);
Create a new OneFile that will be written to 'path'. For the 'New' variant supply the file type, subtype (if non-zero), and whether it should be binary or ASCII. For the 'From' variant, specify binary or ASCII; the schema and all other header information are then inherited from 'vfIn'.
If nthreads > 1 then nthreads OneFiles are generated as an array and the pointer to the first, called the master, is returned. The other nthreads-1 files are called slaves. The package routines are aware of when a OneFile argument is a slave or master in a parallel group. The slaves are expected to only write data lines, with the master adding provenance, producing the header, and then some segment of the initial data lines. Upon close the final result is effectively the concatenation of the master, followed by the output of each slave in sequence.
BOOL oneInheritProvenance (OneFile *vf, OneFile *source);
BOOL oneInheritReference (OneFile *vf, OneFile *source);
BOOL oneInheritDeferred (OneFile *vf, OneFile *source);
Add all provenance/reference/deferred entries in source to header of vf.
BOOL oneAddProvenance (OneFile *vf, char *prog, char *version, char *format, ...);
BOOL oneAddReference (OneFile *vf, char *filename, I64 count);
BOOL oneAddDeferred (OneFile *vf, char *filename);
Append provenance/reference/deferred to header information. oneAddProvenance uses the current date and time.
void oneWriteLine (OneFile *vf, char lineType, I64 listLen, void *listBuf);
Set up a line for output just as it would be returned by oneReadLine and then call this routine to output the line (ASCII or binary). Use the macros above on the l.h.s. of assignments to fill fields (e.g. oneInt(vf,2) = 3). For lists, give the length in the listLen argument, and either place the list data in your own buffer and give it as listBuf, or put it in the line's buffer and pass NULL as listBuf.
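Because the field macros expand to plain member accesses, they can appear on the left of an assignment. The sketch below demonstrates this with a stand-in structure so that it compiles without the library: MockFile and fillLine are hypothetical, while OneField and the three macros are copied from the definitions in this document.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t I64;

typedef union { I64 i; double r; char c; I64 len; } OneField;  /* as in ONElib.h */

/* Stand-in for OneFile holding only the member the macros touch. */
typedef struct { OneField *field; } MockFile;

#define oneInt(vf,x)  ((vf)->field[x].i)
#define oneReal(vf,x) ((vf)->field[x].r)
#define oneChar(vf,x) ((vf)->field[x].c)

/* Fill three fields the way one would before calling oneWriteLine. */
static void fillLine(MockFile *vf, I64 count, double score, char strand)
{
    assert(vf && vf->field);
    oneInt(vf,0)  = count;
    oneReal(vf,1) = score;
    oneChar(vf,2) = strand;
}
```

With a real OneFile the same assignments would be followed by a call to oneWriteLine for the corresponding line type.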
void oneWriteComment (OneFile *vf, char *format, ...);
Adds a comment to the current line. Use this rather than fprintf() so that the index stays correct in binary mode. The comment cannot contain internal new-lines ('\n').
void oneFileClose (OneFile *vf);
Close vf (opened either for reading or writing). Merges threaded files, and writes the footer if binary. Frees all non-user memory associated with vf.
BOOL oneGotoObject (OneFile *vf, I64 i);
Goto i'th object in the file. This only works on binary files, which have an index.
I64 oneGotoGroup (OneFile *vf, I64 i);
Goto the first object in group i. Return the size (number of objects) of the group, or 0 if an error (i out of range or vf does not have group type defined). Only works for binary files.
void oneUserBuffer (OneFile *vf, char lineType, void *buffer);
A buffer is used to capture the list element of each line type that has one. This routine allows you to reassign the buffer to one you have allocated, or to revert to a default system buffer if 'buffer' = NULL. The previous buffer (if any) is freed. The user must ensure that a buffer they supply is large enough. By the way, this buffer is overwritten with each new line read of the given type.
typedef int64_t I64;
typedef unsigned char U8;
typedef enum
{ oneINT = 1,
oneREAL,
oneCHAR,
oneSTRING,
oneINT_LIST,
oneREAL_LIST,
oneSTRING_LIST,
oneDNA
} OneType;
static char* oneTypeString[] =
{ 0,
"INT",
"REAL",
"CHAR",
"STRING",
"INT_LIST",
"REAL_LIST",
"STRING_LIST",
"DNA"
};
Basic data types. Integers are all stored as 64-bit, and reals are stored as doubles (8 bytes).
typedef union
{ I64 i;
double r;
char c;
I64 len; // for lists: top 8 bits encode excess bytes, low 56 bits encode length
} OneField;
Encoding of a data value. For integer lists, if all the integers are less than 2^x for some x (i.e. they fit in x-bits), then the 1-code library knows this and stores the excess bits, 64-x, in the first byte of the field len. So be sure to use the macro oneLen above, which masks out this first byte, when fetching the length of a list.
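To see why the mask matters, the following standalone sketch packs a length together with a one-byte tag in the top 8 bits and then recovers each part. packLen, lenOf and tagOf are hypothetical helpers for illustration; only the mask constant is taken from the oneLen macro.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t I64;

#define LEN_MASK 0xffffffffffffffll   /* low 56 bits, same constant as in oneLen */

/* Hypothetical helper: combine a list length with an 8-bit tag in the top byte. */
static I64 packLen(I64 length, unsigned tag)
{
    assert(length >= 0 && length <= LEN_MASK);
    assert(tag < 256);
    return (I64)(((uint64_t)tag << 56) | (uint64_t)length);
}

/* What oneLen does: keep only the low 56 bits. */
static I64 lenOf(I64 raw) { return raw & LEN_MASK; }

/* Recover the tag stored in the top byte. */
static unsigned tagOf(I64 raw) { return (unsigned)((uint64_t)raw >> 56); }
```

Reading the raw len field without the mask would give a wildly wrong length whenever the top byte is non-zero, which is the mistake oneLen guards against.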
typedef struct
{ char *program;
char *version;
char *command;
char *date;
} OneProvenance;
typedef struct
{ char *filename;
I64 count;
} OneReference;
typedef struct
{ I64 count;
I64 max;
I64 total;
I64 groupCount;
I64 groupTotal;
} OneCounts;
Natural data structures for programmatic access to information in the header. Note that all of these should be used read only.
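As a rough picture of how such counts build up while lines are processed, here is a small sketch. It assumes that count is the number of lines of a type, max the longest list seen and total the summed list length, which matches the way accum.total is used in the synopsis but is otherwise an assumption. countLine is a hypothetical helper, and the group fields are left untouched.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t I64;

typedef struct
  { I64 count;
    I64 max;
    I64 total;
    I64 groupCount;
    I64 groupTotal;
  } OneCounts;                          /* layout as in ONElib.h */

/* Hypothetical helper: account for one more line whose list has listLen elements. */
static void countLine(OneCounts *c, I64 listLen)
{
    assert(c && listLen >= 0);
    c->count++;                         /* one more line of this type */
    c->total += listLen;                /* running sum of list lengths */
    if (listLen > c->max)
        c->max = listLen;               /* longest list seen so far */
}
```

Application code only ever reads these values from the library; the update is shown solely to make their assumed meaning concrete.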
typedef void OneCodec; // forward declaration of opaque type for compression codecs
extern OneCodec *DNAcodec; // a special codec for DNA
OneCodecs are a private package for binary one file compression. DNAcodec is a special pre-existing compressor one should use for DNA. It compresses every base to 2-bits, where any non-acgt letter is effectively converted to an a. DNA compression is case insensitive, with decompression always delivering lower case.
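The behaviour described for DNA compression can be imitated in a few lines. This is a sketch under stated assumptions rather than the DNAcodec implementation: the code assignment a=0, c=1, g=2, t=3 and the order of bases within a byte are guesses, and baseCode, packDNA and unpackDNA are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char U8;

/* Map a base to 2 bits, ignoring case; anything that is not a/c/g/t becomes 0 ('a'). */
static unsigned baseCode(char b)
{
    switch (b) {
    case 'c': case 'C': return 1;
    case 'g': case 'G': return 2;
    case 't': case 'T': return 3;
    default:            return 0;
    }
}

/* Pack len bases into out, four per byte, first base in the high bits (assumed order).
   out must hold (len + 3) / 4 bytes. */
static void packDNA(const char *seq, size_t len, U8 *out)
{
    assert(seq && out);
    memset(out, 0, (len + 3) / 4);
    for (size_t i = 0; i < len; i++)
        out[i / 4] |= (U8)(baseCode(seq[i]) << (6 - 2 * (i % 4)));
}

/* Unpack len bases into seq as lower-case letters and NUL-terminate. */
static void unpackDNA(const U8 *in, size_t len, char *seq)
{
    static const char letter[4] = { 'a', 'c', 'g', 't' };
    assert(in && seq);
    for (size_t i = 0; i < len; i++)
        seq[i] = letter[(in[i / 4] >> (6 - 2 * (i % 4))) & 3];
    seq[len] = '\0';
}
```

Packing "ACGTN" and unpacking it again yields "acgta": case is lost and the N comes back as an a, as described above.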
typedef struct
{ OneCounts accum; // counts read or written to this moment
OneCounts given; // counts read from header
int nField; // number of fields
OneType *fieldType; // type of each field
char *comment; // the comment on the definition line in the schema
// plus private fields
} OneInfo;
Record for a particular line type. There is at most one list element per line type. Again, all read only.
typedef struct OneSchema {} OneSchema ;
The schema type, all private to the package. Internally a schema is stored as a linked list of OneSchema objects, with the first holding the (hard-coded) schema for the header and footer, and the remainder each holding the schema definition data for one primary file type.
And, finally, the main OneFile type - this is the primary handle used by the end user.
typedef struct
{
// these fields may be set by the user
BOOL isCheckString; // set if want to validate strings char by char
// - slows down reading
I64 codecTrainingSize; // number of bytes to see before building codec
// - default 100k - can set before writing
// these fields may be read by user - but don't change them!
char fileType[4];
char subType[4];
char objectType; // line designation character for primary objects
char groupType; // line designation character for groups (optional)
I64 line; // current line number
char lineType; // current lineType
I64 object; // current object - incremented when object line read
I64 group; // current group - incremented when group line read
OneProvenance *provenance; // if non-zero then count['!'] entries
OneReference *reference; // if non-zero then count['<'] entries
OneReference *deferred; // if non-zero then count['>'] entries
OneField *field; // used to hold the current line - accessed by macros
OneInfo *info[128]; // all the per-linetype information
// the remainder is private to the package
} OneFile; // the footer will be in the concatenated result.
<bin file> <- <ASCII Prolog> <$-line> <binary data> <footer> <^-line> <footer-size:int64>
The '$'-line flags that the file is binary and gives its endianness. The data block ends with a blank line consisting of '\n' only.
<ASCII Prolog> <- <'1'-line> [<'2'-line>] ( <'!'-line> | <'<'-line> | <'>'-line> )*
The ASCII prolog contains the type, subtype, provenance, reference, and deferred lines and schema in the ASCII format. The ONE count statistic lines for each data line type are found in the footer along with binary ';' lines that encode their compressors as needed. The footer also contains binary '&' and '*' lines that encode the object index and group indices, respectively.
<Binary line> <- <Binary line code + tags> <fields> [<list data>]
Binary line codes are >= 128. The low bit of these is a flag denoting whether any list in the line is compressed with a codec, so each binary-encoded line type has two codes and a table maps these to the ASCII code.
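One way to picture the pairing of codes is sketched below. The actual assignment used by the library is not given here, so the scheme (128 plus twice a table index, plus the flag) and the names asciiOf, binaryCode and decodeCode are purely illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical table: ASCII line type for each pair of binary codes. */
static const char asciiOf[] = { 'S', 'Q', 'C' };
#define N_TYPES ((int)(sizeof asciiOf))

/* Binary code for the k'th line type; low bit set when the list is codec-compressed. */
static unsigned char binaryCode(int k, int isCompressed)
{
    assert(k >= 0 && k < N_TYPES);
    return (unsigned char)(128 + 2 * k + (isCompressed ? 1 : 0));
}

/* Decode: return the ASCII line type and report the compression flag. */
static char decodeCode(unsigned char code, int *isCompressed)
{
    assert(code >= 128 && (code - 128) / 2 < N_TYPES);
    *isCompressed = code & 1;           /* 128 is even, so the low bit is the flag */
    return asciiOf[(code - 128) / 2];
}
```

Under this assumed scheme the second line type maps to codes 130 (uncompressed list) and 131 (compressed list), and both decode back to 'Q'.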
If a field is a list, then the field array element for that field is the list's length, and the list itself is stored separately following all the fields.
Integer lists (type INT_LIST) are stored with a custom compaction step that first takes differences between an element and the preceding element for all elements after the first, and second only uses the number of bytes needed to capture the information in the remaining differences. For example, if all the differences are between -128 and 127 then only one byte will be used to store each of them.
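The two steps described above can be sketched as follows. This is an illustration of the idea, not the library's code: bytesForSigned, deltaEncode and deltaDecode are hypothetical names, the differences are kept in an I64 array rather than actually packed into bytes, and overflow of a difference is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t I64;

/* Smallest number of bytes (1..8) whose signed range holds v. */
static int bytesForSigned(I64 v)
{
    int n = 1;
    while (n < 8) {
        I64 lo = -((I64)1 << (8 * n - 1));
        I64 hi = ((I64)1 << (8 * n - 1)) - 1;
        if (v >= lo && v <= hi) break;
        n++;
    }
    return n;
}

/* Step 1: replace list[1..n-1] by differences from the preceding element (in place).
   Step 2: return the byte width needed to store every difference. */
static int deltaEncode(I64 *list, size_t n)
{
    int width = 1;
    I64 prev;
    assert(list || n == 0);
    if (n == 0) return width;
    prev = list[0];
    for (size_t i = 1; i < n; i++) {
        I64 cur = list[i];
        list[i] = cur - prev;           /* sketch: assumes no overflow */
        prev = cur;
        int w = bytesForSigned(list[i]);
        if (w > width) width = w;
    }
    return width;
}

/* Inverse: rebuild the original values by a running sum. */
static void deltaDecode(I64 *list, size_t n)
{
    for (size_t i = 1; i < n; i++)
        list[i] += list[i - 1];
}
```

For the list 1000, 1010, 1005, 1100 the differences are 10, -5 and 95, so one byte per difference suffices even though the values themselves need two bytes each.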