[Xml-bin] Some central design issues

Stephen D. Williams sdw@lig.net
Fri, 13 Apr 2001 00:00:41 -0400


Al Snell wrote:
> 
> On Thu, 12 Apr 2001, Stephen D. Williams wrote:
> 
> > I require both very efficient seekability and efficient in-place
> > modification.  I haven't seen anyone else consider the latter
> > requirement.  It's difficult, but is what is really required in a number
> > of key situations.
> 
> In place modification is a bitch... the files start to get uncompacted
> where things grow and have to have their space reallocated :-(

I think I have a good compromise solution for this.  The idea is pretty
simple, but you have to add some further ideas to make it all fit
together.  The lowest level data structure is something I think of as
'Elastic Memory'.  First you have a virtual memory space that is chunked
into macroblocks and then blocks, with efficient 'holes' along with
cheap pointers (vpointers) relative to the beggining of each block that
track 'points', not memory locations.

On top of that basic structure that solves a lot of problems, is the
question of how to generically and compactly provide hierarchical
indexing that will allow very fast search, and update of something like:
"application.driver[3].accident[2].damage.estimate.total".

I'm currently leaning away from a separate index describing linear XML
since that would pretty much duplicate everything but the actual text. 
I'm experimenting with breadth first organization of tags with vpointers
to an attribute/tag/text table of the next level.  For small lists of
tags, storage would be simple and linear.  For larger lists, full-text
style indexing is used: sorted/bisection searched with an instance
vpointer list.

Additionally, each macroblock optionally has a tag and/or string
dictionary.  With the breadth-first sorted tag list, each tag only
occurs once at each level.  Since tags at different levels are likely to
be different anyway, this could be optimal and a separate tag/string
table is probably not worth on the first pass.

As far as getting uncompacted, the block/elastic stuff is an attempt to
balance copying, allocating, space, and efficiency while staying within
a compact, linear address space chunked into reasonably sized blocks.  I
believe the tradeoff will pay off well, but I don't have all the code
working yet.

Most of the work that needs to be done to keep track of the elastic
memory stuff should be absorbed by the far better locality of reference
this design has.  When each tag is tracked in a different DOM node in
malloc-space, it's likely to be on a separate page for CPU cache
purposes.  With a confined memory space of a chunked buffer, many tags
are guarunteed to be within the same page.  When modifying a value, a
gap is opened up and only grown periodically (usually not at all).  The
size of a block limits the average amount of data that has to be moved. 
When a value shrinks, the 'elastic' structure allows a hole in case that
value or one around it grows again.

When the XML document is 'moved' to a different function or over a flat
transport like a socket, the holes can be compressed out or not as
desired.  This becomes a simple, direct speed/size tradeoff.

An obvious goal is that streaming production of an XML object should
cause almost no data copying.  This is true to a large extent because of
the structure and can be more fully handled by carefully managing where
block boundaries fall.

The DOM api is broken with respect to the design goals here; a new
interface is needed so that a create-in-place paradigm replaces the
create-hook-in idiom used now.

Ideally, the library managing this structure would be at least as easy
to use as any 'native' 'collection'/dictionary/tree system like DOM,
JGL, or STL.

...
> > Cool.  After thinking about it long and hard, I'm more inclined to solve
> > the DOM optimized problem first.  SAX is secondary to me.  True
> > streaming needs to be supported, but I am trying structures that may not
> > make it efficient without some reasonable buffering.
> 
> I can make a format that's great for streaming and random access reads,
> but in-place modification is an arse :-(

Yes, it is!  Solving it in a reasonably optimal way is highly beneficial
everywhere else.

> The ability to provide in-place modification may disallow efficient SAX
> processing, which would be a downer, since in-place modification would
> require the ability to break the document ordering inside the
> file...

That's not necessarily true, however there are always tradeoffs with
different strategies.

> I see SAX emulation as pretty crucial for interoperability and efficiency
> (otherwise it'll be slower than textXML when streaming)

I disagree.  It isn't a matter of speed, per se, it's more of a matter
buffer/memory requirements for processing.  SAX provides two things,
from my point of view: low latency/low memory processing of potentially
large XML streams and use of XML elements for 'custom' purposes.  By
custom I mean parsing into something other than a DOM-like tree.  For
the latter purpose, you could start with DOM and just traverse the tree
to produce SAX events.  What I'm proposing is semantically the same, but
without the wasted parsing step.  The former memory optimization problem
I'm willing to defer to later thought because A) it's not relevant to
lots of problems and B) trying to optimize it blocks too many possible
promising strategies.


sdw

> ABS
> 
> --
>                                Alaric B. Snell
>  http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
>    Any sufficiently advanced technology can be emulated in software
> 
> _______________________________________________
> xml-bin mailing list
> xml-bin@warhead.org.uk
> http://lists.warhead.org.uk/mailman/listinfo/xml-bin

-- 
sdw@lig.net  http://sdw.st
Stephen D. Williams
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 
Dec2000