[Xml-bin] An attempt to sort things out [long]

Kinner, Jason JKinner@Bluestone.com
Tue, 17 Apr 2001 12:18:29 -0400


All -

Okay, I spent most of three hours last night following this thread from
xml-dev, so I guess it's time for my $0.02.

First, I had the opportunity to speak with Charles Goldfarb 2 years ago
regarding binary encoding of XML when I co-authored a chapter of the XML
Handbook, 2nd ed., with him.  He mentioned that there was a similar effort
to "binarize" SGML, which met with very limited success.  His gut feel,
however, was that it was the complexity of SGML that limited their success
and that XML _might_ have simplified things enough for this stuff to
actually work.

Second, I'm glad to see that I'm not crazy in thinking this is a good idea,
or at least if I am crazy, I am in good company. ;)

Third, I am particularly intrigued by this email regarding the OSI stack.
Recently, we've been talking about a generalized data processing model which
maps to the top 3 layers of the OSI stack (session, presentation,
application).  My particular interest (since I'm in a middleware group) is
the interoperability of XML and binary file formats.

With all that, here's what I've been thinking about (and it's been very
informative to see all of the prior discussion on this topic -- it's
corrected a few assertions that I was making).  Originally, the format that
I was working on was targeted at mobile/embedded devices.  As one of the few
people who still has a 2MB PDA, I found it difficult to find a complete XML
parser solution that I could run while still having any applications
installed on my device.  Furthermore, in a wireless scenario with
asynchronous messaging (definitely the Way to Go(tm) here in the States,
given the immaturity of wireless data coverage), store-and-forward queues
of textual XML could easily bloat to unmanageable proportions.

My design goals were as follows:

1) In-place _reading_ of XML data (e.g., on a mobile device)
2) No string element names (cf. the "pointer comparison" posts)
3) The possibility of a pointer-math implementation of a DOM (cf. the
fixed-length/directory posts)
4) The possibility of a DOM implementation that could produce _valid_
documents on a mobile device

Note that I did not have SAX as a priority, since the root of my R&D was
basically XML-RPC (small, compact messages).  I have now seen the error of
my ways...

These design goals led me to the idea that the DTD/Schema, since it defines
the elements, needs to be a separate item.  However, the requirement that
XML be self-describing also calls for the option of an "embedded" DTD, where
a "DTD" in my case was simply a hash of element-id to string name.
Namespaces were headed toward the (URI, element-id) tuple (sorry I can't
make the specific reference, my head is still spinning :) where the true
element name was a tuple of (namespace-id, element-id).

What I came up with was a fixed-size structural block that could encode a
DOM-1 Node in 5 bytes, and an attribute in 6-8 bytes (because I was using
variable-byte-length element- and attribute-ids).  Now, developing this
thing has not been my "job", so I don't know if I've covered the use-cases.
But basically, if a Node (or an Attribute) has a value, I was using a
symbol-table with relative pointers (referred to as "vpointers" elsewhere)
based on the starting location of the "value" block within the overall
document stream.

Structure came first, then attributes, then values, then namespace
definitions (whether referenced or embedded).  This would allow a
lazy-evaluation technique for retrieving values.  Also, it would allow an
XPath implementation to search by element-id instead of by element name,
using a pre-compilation technique (a la IDL).
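
To make the layout above concrete, here is a minimal sketch in Python.  The
field widths are assumptions chosen only so that a node record comes out at
5 bytes: a 1-byte node type, a 2-byte element-id, and a 2-byte offset into
the value block.  The names NODE, NO_VALUE, pack_nodes, read_node and
read_value are hypothetical, and the sketch leaves out attributes,
namespaces and variable-length ids.

```python
import struct

# Hypothetical 5-byte node record (an assumed layout, not the actual format):
# node type (1 byte), element-id (2 bytes), value offset (2 bytes) counted
# from the start of the value block.
NODE = struct.Struct("<BHH")
NO_VALUE = 0xFFFF  # assumed sentinel meaning "this node has no value"


def pack_nodes(nodes):
    """Pack (node_type, element_id, value_or_None) tuples into two blocks."""
    structure = bytearray()
    values = bytearray()
    for node_type, element_id, value in nodes:
        offset = NO_VALUE
        if value is not None:
            offset = len(values)
            data = value.encode("utf-8")
            # Each value is stored as a 2-byte length followed by its bytes.
            values += struct.pack("<H", len(data)) + data
        structure += NODE.pack(node_type, element_id, offset)
    return bytes(structure), bytes(values)


def read_node(structure, index):
    """Fixed-size records allow index arithmetic instead of parsing."""
    return NODE.unpack_from(structure, index * NODE.size)


def read_value(values, offset):
    """Decode a value only when it is asked for (lazy evaluation)."""
    if offset == NO_VALUE:
        return None
    (length,) = struct.unpack_from("<H", values, offset)
    start = offset + 2
    return values[start:start + length].decode("utf-8")
```

Because every record has the same size, the Nth node sits at byte offset
N * 5 in the structure block, and the value block is never touched unless a
caller follows the relative pointer.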

So, the basic pipeline would look like this:

DTD/Schema ----> Compiled schema with namespace support
                             |
                             v
XML Document ----------------------> Compiled XML doc

Which is suspiciously like the IDL pipeline.
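
The two-stage pipeline can be sketched roughly as follows.  This is an
illustration under assumptions, not the format itself: the "compiled
schema" is reduced to a name-to-id dictionary, the "compiled doc" to a list
of (depth, element-id, text) records, and the function names
compile_schema and compile_document are hypothetical.  Namespaces and
attributes are ignored.

```python
import xml.etree.ElementTree as ET


def compile_schema(element_names):
    """Assign each declared element name a small integer id."""
    return {name: i for i, name in enumerate(element_names)}


def compile_document(xml_text, schema):
    """Walk the tree and emit (depth, element_id, text) records."""
    records = []

    def walk(elem, depth):
        # Treat missing or whitespace-only text as "no value".
        text = (elem.text or "").strip() or None
        # An element not declared in the schema raises KeyError here.
        records.append((depth, schema[elem.tag], text))
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return records


schema = compile_schema(["methodCall", "methodName", "params"])
records = compile_document(
    "<methodCall><methodName>ping</methodName><params/></methodCall>",
    schema,
)
# Searching by element-id is now an integer comparison, not a string one.
matches = [r for r in records if r[1] == schema["methodName"]]
```

As with IDL, the schema is compiled once and shared, so the document itself
only needs to carry the ids.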

Okay, I could write more about this, but I think you've suffered enough,
especially if you're on a dial-up line.  One last point, though.  In terms
of XML/Binary interoperability, I believe there are two issues to be worked
out: 1) a binary structural encoding of straight XML (so we don't lose the
"good stuff" of a textual representation of the actual data, including the
byte-swapping problem, the small-value problem, &c), and also 2) using the
DOM/SAX/JDOM API set to interact with generic binary data.  The group that
solves those two problems will get their names in lights.

<opinion>
In summary, my belief is that the fall-out of XML will be the tools and APIs
that have been developed around this "new" way of handling data
processing/interchange.  Whether the message is in a binary encoding or a
textual encoding will be an invisible "flip-the-switch" option related to
the transport (excuse me -- "session") layer.
</opinion>

Jason Kinner
Principal Architect, OEM/ISV
HP Middleware Division

-----Original Message-----
From: Al Snell [mailto:alaric@alaric-snell.com]
Sent: Tuesday, April 17, 2001 5:30 AM
To: xml-bin@warhead.org.uk
Subject: Re: [Xml-bin] An attempt to sort things out


On Tue, 17 Apr 2001, Stefan Zier wrote:

> So the layer model is:
> 
> Application Data
> DOM/SAX Data
> Textual XML Data

> Input on this model would be greatly appreciated. Am I reinventing the
> wheel here?

Well, it looks like a summary of the ISO OSI 7-layer model for networking
:-)

The 7-layer model goes:

Application layer (application data)

Presentation layer (specifies the data types such as strings
	and dates and integers)

Session layer (hmmm... in this context, probably specifies the way the
	XML data is split into a tree; the syntax of "<")

Transport layer (Headers. <?xml version='1.0' ?> <!DOCTYPE ... >)

Network layer (Directory structure)

Data link layer (How the filesystem delimits different files on disk)

Physical layer (how hard disks work)

The transport and network layers were (funnily enough) defined in very
network-centric terms, so my translation is ad libbing a bit, but apart
from that it's fairly isomorphic.

ABS

-- 
                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software  

