[Xml-bin] Some central design issues

Anjana Manian Anjana.Manian@oracle.com
Fri, 13 Apr 2001 11:01:13 -0700


This is a multi-part message in MIME format.
--------------6DAB975DD492BA6C52D12CD3
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi,

I have been following closely the thread of mails on this subject  -
this interest me becoz I am working on an implementation of Binary
compressed XML generation which will be available in our future release
of Oracle XML Developer's kit.

We are using the java serialization (rather externalization) technique
to generate a binary compressed stream from an XML document. The
compression is based on tokenizing the XML tag's. The assumption is that
any XML file has repeated number of tags and so tokenizing the tags will
give considerable compression. Therefore the compression achieved
depends on type of input xml document  - larger the tags, lesser the
text content, better the compression!  

Our goal of compression is to reduce the size of the xm document without
loosing the structural and hierarchical information of the DOM tree.
Therefore the compressed stream contains all the "useful" information to
create the DOM tree back from the binary format. The compressed stream
can also be generated from the SAX event. The binary stream generated
from DOM and SAX are compatible i.e  the compressed stream generated
from SAX could be used to generate the DOM tree and vice versa. One of
the advantage of compression is that at any time a huge in-memory DOM
tree could be stored in full or parts in the form of compressed stream
and whenever it is required we can regenerate the DOM tree back.

I am wondering whether we really need a binary xml specification? Assume
we have one, in which case, our compressed stream output would be
compliant to the binary xml format. But in any case we are providing
ways and means to generate the text xml from the binary format and vice
versa. So the end user may continue working with the text format without
bothering about the way the binary format is created and stored!

Thanks,
Anjana


Al Snell wrote:
> 
> On Fri, 13 Apr 2001, Stephen D. Williams wrote:
> 
> > Exactly what I'm trying to address.  It's hard to solve this problem and
> > I have had a hard time finding project that tried.  I know enough about
> > the solution to know that it will work, although exact performance is
> > still just an estimate.
> 
> You're convincing me :-) Maybe a DOM-like format is better than a SAX-like
> one!
> 
> Mmmm, ok.
> 
> A simple approach is to make the file header contain the serial number
> (S#) of the root node, and any other <?xml version=... <!DOCTYPE... stuff.
> 
> The S#s are distinct from IDs and the like since they are assigned by the
> file format itself, and have meaning only at that layer.
> 
> The header also contains pointers to the Directory blocks. There is a
> directory block for each type code (integer, variable-length-node (a
> subtype flag says whether it's a CDATA, Binary DATA, element or attribute
> name, or element body). Directory blocks will either need to be relocated
> if they overflow, or have provision for a chain pointer to point to an
> extension directory block.
> 
> An S# consists of a type code and an offset into that type code's
> directory block.
> 
> For fixed size types like integers, the data can live inside the directory
> block. For variable-length nodes, there will be different type codes for
> short (8 bit length) and long (64 bit length) versions (why no 16 or 32?
> Because the space saving of a few bytes of length code is negligible if
> the data is more than 256 bytes anyway), meaning that the two sizes live
> in seperate directories. Variable-length nodes directory entries also
> contain a type field:
> 
> 1) It's a CDATA
> 2) It's a BDATA (binary data) - MIME type in the body too?
> 3) It's a symbol (element or attribute name string)
> 4) It's an element (the body contains the S# of the symbol containing the
> element's name, a list of attribute declarations (S# of attribute name
> then S# of attribute content), then a list of S#s of child nodes)
> 5) It's a bit of free space in the file
> 
> Directory entries are always fixed size, and every 32 entries are prefixed
> by a 32 bit free-bitmap, indicating entries that are not in use and can be
> reallocated.
> 
> Reading (be it a tree walk or an XPath query) would involve starting with
> the root node (S# found from the header) and traversing element nodes (via
> the variable-length-node directory, referring to the symbol and element
> nodes). The XPath engine will, at first, be reading every symbol node's
> body and comparing against the string it's looking for a the time, but it
> should cache symbols and (if it turns up a symbol matching a name anywhere
> in the XPath expression) remember it to avoid further string operations.
> 
> Maybe it would be a good idea to have a seperate symbol directory, with
> 16 or 32 bit hash values stored for the strings, to speed up matching.
> 
> Caching directory blocks as they are read from disk would be wise.
> 
> Allocating new directory entries will involve scanning for a
> non-0xffffffff bitmap in the appropriate directory block, then doing a bit
> scan to find the free slot. If the directory block is full, either
> relocate it to the end of the file (updating the pointer in the
> header) and creat a variable-length-node directory entry marking the freed
> up space as free, or if you use the linked list approach, create a new
> directory block of a fixed size at the end of the file and link to it.
> 
> To create new variable-length nodes, or to relocate one that cannot grow
> any further, allocate an S# for the old region (tagged as "empty") then
> find an S# marked as "free" that's big enough and put the data there,
> creating a new S# for the unused region. If the file is full, either
> compact stuff (remember - file offsets live only in directories and the
> master-directory pointers in the header) or grow the file and tack stuff
> onto the end.
> 
> Howzat sound?
> 
> >
> > sdw
> >
> 
> ABS
> 
> --
>                                Alaric B. Snell
>  http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
>    Any sufficiently advanced technology can be emulated in software
> 
> _______________________________________________
> xml-bin mailing list
> xml-bin@warhead.org.uk
> http://lists.warhead.org.uk/mailman/listinfo/xml-bin
--------------6DAB975DD492BA6C52D12CD3
Content-Type: text/x-vcard; charset=us-ascii;
 name="Anjana.Manian.vcf"
Content-Transfer-Encoding: 7bit
Content-Description: Card for Anjana Manian
Content-Disposition: attachment;
 filename="Anjana.Manian.vcf"

begin:vcard 
n:Manian;Anjana
tel;fax:650-654-6208
tel;work:650-607-0153
x-mozilla-html:FALSE
org:Server Technologies;Oracle Corporation
adr:;;;;;;
version:2.1
email;internet:Anjana.Manian@oracle.com
title:Senior Member Technical Staff
x-mozilla-cpt:;-1
fn:Anjana Manian
end:vcard

--------------6DAB975DD492BA6C52D12CD3--