[Xml-bin] Some central design issues

Stephen D. Williams sdw@lig.net
Fri, 13 Apr 2001 12:03:39 -0400


Al Snell wrote:
> 
> On Fri, 13 Apr 2001, Stephen D. Williams wrote:
> 
> > Exactly what I'm trying to address.  It's hard to solve this problem and
> > I have had a hard time finding project that tried.  I know enough about
> > the solution to know that it will work, although exact performance is
> > still just an estimate.
> 
> You're convincing me :-) Maybe a DOM-like format is better than a SAX-like
> one!

Looks like some very good ideas.  I'll digest it more later.

One important choice is whether the file uses single (UTF-8) or
double-byte Unicode for actual string data.  I always lean toward UTF-8
but it does increase complexity when you actually have double-byte
values.  Java's pervasive use of double-byte makes this a more evenly
balanced decision for me.

I'll have to post my full constraint list shortly, but one weak rule is
to try to use 2 byte integers when possible.  This has to do with
possible optimizations for Java.  I am willing however to consider
scalable sized counts (the 1/2/4+byte method is usually appealing).

Another goal is to make searching ultra-fast.

Modification has to be fast for 'appending'.  For random changes, some
'startup' penalty is assumed that should be amortized if many changes
are made.  The latter point allows element values to be 'variables' used
in normal processing.  It's valid to incur some 'warm-up' as long as
subsequent changes are usually cheap.

Support for repeated tags is required; ordering is assumed to be
important, and 'data' is treated as pseudo-tagged to allow intermingling
of data and tags within an XML 'object'.  It seems that a good default
rule is that presence of non-space data automatically triggers verbatim
mode for a tag data space.  Probably all space should be, by default,
treated as significant to allow totally verbatim transformations.

Your thinking of grouping like fixed sized data items together is good
and will improve efficiency, but that's really a 'compression' level
problem.  The prerequisite problem is how to know that items are scalars
to begin with.  I can think of at least 3 approaches, but that problem
is independant of the structure layers for the most part except for the
idea of efficient pools.  ( A: convert anything that would benefit (i.e.
Regex match integers, floats, IETF dates, etc.), B: allow specification
that would be used in bsXML but lost upon conversion to XML, C:
B+automatic insertion of meta-data typing, etc.)  I would suggest we
solve the problem at a purely string level and then add typed data
optimizations in a later pass.  That would allow us to determine it's
efficiency advantage independantly also.

At one conceptual level, the flat memory space of the XML buffers is
just another Malloc space with relative pointers (which obviously need
to be block/Index).  Care should be taken however to use minimal data
structures when allocating within that space since wasted 'tails' would
be prohibitive.  Additionally, extensive use of nested length counts is
actually prohibitive since changes require that they all be updated.

Following this logic, I would suggest that we start with ideal
structures that can be resized before assuming fixed pointers that
require chained blocks.  On the other hand, at a lower level resizing
and vpointers are handled strictly at a block level to limit work in a
tunable way.

A vpointer (virtual pointer if you will) is best described by comparison
to Emacs's Mark and Point.  Point is where the cursor is at.  Regardless
of what code inserts or deletes within the buffer, Mark and Point still
point to the 'gap' between characters that they started with.  I'm
generalizing it to make 'Elastic Memory' efficient while supporting
'pointers'.  A reference consists of a block/index tuple.  Within a
block, the vpointer table consists of an offset list to 'filled holes',
i.e. characters.  Only when a block splits due to modifications that
make it larger than a desired block size do vpointer references external
to that block have to be changed.  Doing this efficienctly, or avoiding
it altogether, is an interesting problem.  A single global vpointer
table could be used as a double-indirect.  Large block number space
could be used by subdivision to allow a power of 2 number of splits
before a global scan and update would be needed.  (I.e. assign initial
block numbers as every 32.  Splits would then have the same upper bits
so that references to the old single block would be like a 'broadcast',
i.e. search, of the new blocks.  I'm probably not willing to make
vpointers large enough for this but it's interesting.)

'Elastic Memory' is simply the idea that within a block, you can have
holes that are hidden from higher layers but are used, and created, by
modification to the virtually flat and contiguous memory space.  Two
obvious implementations are bitmaps and range lists.  I've chosen to
have a range list per block since optimally there is a single entry that
gives a contiguous range.  Range lists also seem to be more efficient
generally.  Remember that since everything is relative to actual
characters, ignoring holes, you can condense at any time while only
updating the range list for each individual block.

Appending and modification should be assumed to consist of passing a
vpointer and a tag/attribute/text indicator and it's value, receiving
one or two vpointers as a result: one to append inside (for tags) and
one to append after (for tags, attributes, text).  Note that this
implies that vpointers will be held, aliased, outside the reach of the
library and therefore double-indirect is probably the best initial
structure.

It all seems elegant to me, overall.  This is about where I've been at
for more than a year.  I recently became very interested in picking up
development, coincidentally just before you started the binary-XML
thread (again).  With no funding to take alternate paths, this is one of
my personal projects.  It's good to bounce ideas off of the XML
community so we can make this a reality.

More later.

Thanks
sdw

> 
> Mmmm, ok.
> 
> A simple approach is to make the file header contain the serial number
> (S#) of the root node, and any other <?xml version=... <!DOCTYPE... stuff.
> 
> The S#s are distinct from IDs and the like since they are assigned by the
> file format itself, and have meaning only at that layer.
> 
> The header also contains pointers to the Directory blocks. There is a
> directory block for each type code (integer, variable-length-node (a
> subtype flag says whether it's a CDATA, Binary DATA, element or attribute
> name, or element body). Directory blocks will either need to be relocated
> if they overflow, or have provision for a chain pointer to point to an
> extension directory block.
> 
> An S# consists of a type code and an offset into that type code's
> directory block.
> 
> For fixed size types like integers, the data can live inside the directory
> block. For variable-length nodes, there will be different type codes for
> short (8 bit length) and long (64 bit length) versions (why no 16 or 32?
> Because the space saving of a few bytes of length code is negligible if
> the data is more than 256 bytes anyway), meaning that the two sizes live
> in seperate directories. Variable-length nodes directory entries also
> contain a type field:
> 
> 1) It's a CDATA
> 2) It's a BDATA (binary data) - MIME type in the body too?
> 3) It's a symbol (element or attribute name string)
> 4) It's an element (the body contains the S# of the symbol containing the
> element's name, a list of attribute declarations (S# of attribute name
> then S# of attribute content), then a list of S#s of child nodes)
> 5) It's a bit of free space in the file
> 
> Directory entries are always fixed size, and every 32 entries are prefixed
> by a 32 bit free-bitmap, indicating entries that are not in use and can be
> reallocated.
> 
> Reading (be it a tree walk or an XPath query) would involve starting with
> the root node (S# found from the header) and traversing element nodes (via
> the variable-length-node directory, referring to the symbol and element
> nodes). The XPath engine will, at first, be reading every symbol node's
> body and comparing against the string it's looking for a the time, but it
> should cache symbols and (if it turns up a symbol matching a name anywhere
> in the XPath expression) remember it to avoid further string operations.
> 
> Maybe it would be a good idea to have a seperate symbol directory, with
> 16 or 32 bit hash values stored for the strings, to speed up matching.
> 
> Caching directory blocks as they are read from disk would be wise.
> 
> Allocating new directory entries will involve scanning for a
> non-0xffffffff bitmap in the appropriate directory block, then doing a bit
> scan to find the free slot. If the directory block is full, either
> relocate it to the end of the file (updating the pointer in the
> header) and creat a variable-length-node directory entry marking the freed
> up space as free, or if you use the linked list approach, create a new
> directory block of a fixed size at the end of the file and link to it.
> 
> To create new variable-length nodes, or to relocate one that cannot grow
> any further, allocate an S# for the old region (tagged as "empty") then
> find an S# marked as "free" that's big enough and put the data there,
> creating a new S# for the unused region. If the file is full, either
> compact stuff (remember - file offsets live only in directories and the
> master-directory pointers in the header) or grow the file and tack stuff
> onto the end.
> 
> Howzat sound?
> 
> >
> > sdw
> >
> 
> ABS
> 
> --
>                                Alaric B. Snell
>  http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
>    Any sufficiently advanced technology can be emulated in software
> 
> _______________________________________________
> xml-bin mailing list
> xml-bin@warhead.org.uk
> http://lists.warhead.org.uk/mailman/listinfo/xml-bin

-- 
sdw@lig.net  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000