[Xml-bin] Some central design issues

Stephen D. Williams sdw@lig.net
Fri, 13 Apr 2001 13:52:01 -0400


Anjana Manian wrote:
> 
> Hi,
> 
> I have been following closely the thread of mails on this subject  -
> this interest me becoz I am working on an implementation of Binary
> compressed XML generation which will be available in our future release
> of Oracle XML Developer's kit.

You missed a key point I have been making:  My primary goal with binary
structured XML is not compression, although that is a secondary goal. 
The primary goals I just layed out on the xml-dev list are:

o  Binary structured organization with fast traversable indexing using a
relative handle/pointer system.

o  Modifiability in-place.

o  Doing this efficiently in flat, chunked buffers that allow both
memory and wire format to be the same, with an optional 'compress' step
that removes working 'holes'.

o  Efficient support for append or modification overlays (allows
streaming, transactions, and good handling of base/change situations
such as changes in a rulebase for a session).

Efficient compression of XML isn't hard and has a dense solution space. 
Solving my primary goals in a useful way is.  Compression can be layered
on later, beyond what we'll gain, if anything, in a binary structure
without endtags, run length coding of repeated tags, etc.  This allows
us to solve hard and more useful problems and then handle compression in
the relatively sparse solution space that results.

While most apps need to be able to produce text XML, the goal is to
avoid it altogether for some portion of import/export/work processing
steps.


Since you are working on this at Oracle, I'll mention that I also added:

If anyone wants to sponsor me at my consulting rate to complete this
fulltime as an open source project, let me know.

I'll also mention that a longer term goal is the merger of relational,
object, and full-text database paradigms using binary XML, dynamic
indexing and schema, with SQL queries producing bsXML results.  Simple,
fixed, relational tables are primitive and just mapping to them is not
that useful.  I have a whole spiel on this...

> We are using the java serialization (rather externalization) technique
> to generate a binary compressed stream from an XML document. The
> compression is based on tokenizing the XML tag's. The assumption is that
> any XML file has repeated number of tags and so tokenizing the tags will
> give considerable compression. Therefore the compression achieved
> depends on type of input xml document  - larger the tags, lesser the
> text content, better the compression!
> 
> Our goal of compression is to reduce the size of the xm document without
> loosing the structural and hierarchical information of the DOM tree.
> Therefore the compressed stream contains all the "useful" information to
> create the DOM tree back from the binary format. The compressed stream
> can also be generated from the SAX event. The binary stream generated
> from DOM and SAX are compatible i.e  the compressed stream generated
> from SAX could be used to generate the DOM tree and vice versa. One of
> the advantage of compression is that at any time a huge in-memory DOM
> tree could be stored in full or parts in the form of compressed stream
> and whenever it is required we can regenerate the DOM tree back.
> 
> I am wondering whether we really need a binary xml specification? Assume

Yes.  I absolutely need it.  The more N-tier I architect a project, the
worse the overhead of text XML.

> we have one, in which case, our compressed stream output would be
> compliant to the binary xml format. But in any case we are providing
> ways and means to generate the text xml from the binary format and vice
> versa. So the end user may continue working with the text format without
> bothering about the way the binary format is created and stored!

Space-wise and cpu-poor.  Not always the appropriate tradeoff.

Thanks
sdw
> 
> Thanks,
> Anjana
> 
> Al Snell wrote:
> >
> > On Fri, 13 Apr 2001, Stephen D. Williams wrote:
> >
> > > Exactly what I'm trying to address.  It's hard to solve this problem and
> > > I have had a hard time finding project that tried.  I know enough about
> > > the solution to know that it will work, although exact performance is
> > > still just an estimate.
> >
> > You're convincing me :-) Maybe a DOM-like format is better than a SAX-like
> > one!
> >
> > Mmmm, ok.
> >
> > A simple approach is to make the file header contain the serial number
> > (S#) of the root node, and any other <?xml version=... <!DOCTYPE... stuff.
> >
> > The S#s are distinct from IDs and the like since they are assigned by the
> > file format itself, and have meaning only at that layer.
> >
> > The header also contains pointers to the Directory blocks. There is a
> > directory block for each type code (integer, variable-length-node (a
> > subtype flag says whether it's a CDATA, Binary DATA, element or attribute
> > name, or element body). Directory blocks will either need to be relocated
> > if they overflow, or have provision for a chain pointer to point to an
> > extension directory block.
> >
> > An S# consists of a type code and an offset into that type code's
> > directory block.
> >
> > For fixed size types like integers, the data can live inside the directory
> > block. For variable-length nodes, there will be different type codes for
> > short (8 bit length) and long (64 bit length) versions (why no 16 or 32?
> > Because the space saving of a few bytes of length code is negligible if
> > the data is more than 256 bytes anyway), meaning that the two sizes live
> > in seperate directories. Variable-length nodes directory entries also
> > contain a type field:
> >
> > 1) It's a CDATA
> > 2) It's a BDATA (binary data) - MIME type in the body too?
> > 3) It's a symbol (element or attribute name string)
> > 4) It's an element (the body contains the S# of the symbol containing the
> > element's name, a list of attribute declarations (S# of attribute name
> > then S# of attribute content), then a list of S#s of child nodes)
> > 5) It's a bit of free space in the file
> >
> > Directory entries are always fixed size, and every 32 entries are prefixed
> > by a 32 bit free-bitmap, indicating entries that are not in use and can be
> > reallocated.
> >
> > Reading (be it a tree walk or an XPath query) would involve starting with
> > the root node (S# found from the header) and traversing element nodes (via
> > the variable-length-node directory, referring to the symbol and element
> > nodes). The XPath engine will, at first, be reading every symbol node's
> > body and comparing against the string it's looking for a the time, but it
> > should cache symbols and (if it turns up a symbol matching a name anywhere
> > in the XPath expression) remember it to avoid further string operations.
> >
> > Maybe it would be a good idea to have a seperate symbol directory, with
> > 16 or 32 bit hash values stored for the strings, to speed up matching.
> >
> > Caching directory blocks as they are read from disk would be wise.
> >
> > Allocating new directory entries will involve scanning for a
> > non-0xffffffff bitmap in the appropriate directory block, then doing a bit
> > scan to find the free slot. If the directory block is full, either
> > relocate it to the end of the file (updating the pointer in the
> > header) and creat a variable-length-node directory entry marking the freed
> > up space as free, or if you use the linked list approach, create a new
> > directory block of a fixed size at the end of the file and link to it.
> >
> > To create new variable-length nodes, or to relocate one that cannot grow
> > any further, allocate an S# for the old region (tagged as "empty") then
> > find an S# marked as "free" that's big enough and put the data there,
> > creating a new S# for the unused region. If the file is full, either
> > compact stuff (remember - file offsets live only in directories and the
> > master-directory pointers in the header) or grow the file and tack stuff
> > onto the end.
> >
> > Howzat sound?
> >
> > >
> > > sdw
> > >
> >
> > ABS
> >
> > --
> >                                Alaric B. Snell
> >  http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
> >    Any sufficiently advanced technology can be emulated in software
> >
> > _______________________________________________
> > xml-bin mailing list
> > xml-bin@warhead.org.uk
> > http://lists.warhead.org.uk/mailman/listinfo/xml-bin

-- 
sdw@lig.net  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000