[Xml-bin] Some central design issues

Al Snell alaric@alaric-snell.com
Fri, 13 Apr 2001 16:06:37 +0100 (BST)


On Fri, 13 Apr 2001, Stephen D. Williams wrote:

> Exactly what I'm trying to address.  It's hard to solve this problem and
> I have had a hard time finding project that tried.  I know enough about
> the solution to know that it will work, although exact performance is
> still just an estimate.

You're convincing me :-) Maybe a DOM-like format is better than a SAX-like
one!

Mmmm, ok.

A simple approach is to make the file header contain the serial number
(S#) of the root node, and any other <?xml version=... <!DOCTYPE... stuff.

The S#s are distinct from IDs and the like since they are assigned by the
file format itself, and have meaning only at that layer.

The header also contains pointers to the Directory blocks. There is a
directory block for each type code (integer, variable-length-node (a
subtype flag says whether it's a CDATA, Binary DATA, element or attribute
name, or element body). Directory blocks will either need to be relocated
if they overflow, or have provision for a chain pointer to point to an
extension directory block.

An S# consists of a type code and an offset into that type code's
directory block. 

For fixed size types like integers, the data can live inside the directory
block. For variable-length nodes, there will be different type codes for
short (8 bit length) and long (64 bit length) versions (why no 16 or 32?
Because the space saving of a few bytes of length code is negligible if
the data is more than 256 bytes anyway), meaning that the two sizes live
in seperate directories. Variable-length nodes directory entries also
contain a type field:

1) It's a CDATA
2) It's a BDATA (binary data) - MIME type in the body too?
3) It's a symbol (element or attribute name string)
4) It's an element (the body contains the S# of the symbol containing the
element's name, a list of attribute declarations (S# of attribute name
then S# of attribute content), then a list of S#s of child nodes)
5) It's a bit of free space in the file

Directory entries are always fixed size, and every 32 entries are prefixed
by a 32 bit free-bitmap, indicating entries that are not in use and can be
reallocated.

Reading (be it a tree walk or an XPath query) would involve starting with
the root node (S# found from the header) and traversing element nodes (via
the variable-length-node directory, referring to the symbol and element
nodes). The XPath engine will, at first, be reading every symbol node's
body and comparing against the string it's looking for a the time, but it
should cache symbols and (if it turns up a symbol matching a name anywhere
in the XPath expression) remember it to avoid further string operations.

Maybe it would be a good idea to have a seperate symbol directory, with
16 or 32 bit hash values stored for the strings, to speed up matching.

Caching directory blocks as they are read from disk would be wise.

Allocating new directory entries will involve scanning for a
non-0xffffffff bitmap in the appropriate directory block, then doing a bit
scan to find the free slot. If the directory block is full, either
relocate it to the end of the file (updating the pointer in the
header) and creat a variable-length-node directory entry marking the freed
up space as free, or if you use the linked list approach, create a new
directory block of a fixed size at the end of the file and link to it.

To create new variable-length nodes, or to relocate one that cannot grow
any further, allocate an S# for the old region (tagged as "empty") then
find an S# marked as "free" that's big enough and put the data there,
creating a new S# for the unused region. If the file is full, either
compact stuff (remember - file offsets live only in directories and the
master-directory pointers in the header) or grow the file and tack stuff
onto the end.

Howzat sound?

> 
> sdw
> 

ABS

-- 
                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software