[Box Backup] Win32 port (was: BoxBackup Server Side Management Specs (Draft0.01))
Alaric Snell-Pym
boxbackup@fluffy.co.uk
Mon, 11 Oct 2004 15:47:36 +0100
Chris Wilson wrote:
> I fail to see the difference between markup and interchange of tree-
> structured data. To flatten a tree into a stream, with appropriate markers
> to enable the original tree to be reconstructed, seems to be what XML
> does?
Yep, it's just a matter of intent, which has subtle implications for the
details.
Markup was originally about taking a string, and annotating regions of
it with metadata (either for styling, or to enable bits of the text to
be selected according to their purpose, or whatever). This is why XML
has attributes as well as elements - so that the metadata could be
parameterised.
If you came at the design from the viewpoint of exchanging trees, you'd
be unlikely to have attributes - just nodes in a tree. If you wanted to
attach properties to objects in the tree, you'd be just as likely to want
something like attributes both for the nodes and for the links between
nodes and their children...
Also, the markup mindset led to DTDs, which people who want to exchange
data have found too string-oriented; you can't put type constraints on
element content, and only simple constraints on attribute content. XML
schema languages were retrofitted to try and make it more data-oriented,
but there used to be ongoing wars on XML-DEV between the data folks and
the document folks!
Personally, I quite like XML as a document format - DocBook is quite
powerful, IMHO, and it's easy to knock together custom documentation
systems with XSLT...
> In this case, although there isn't yet a need for deep tree structure in
> the protocol, I believe that XML's markup makes it much more extensible
> and future-proof than the tab-and-newline-delimited language which Ben
> proposed. I speak from long and bitter experience with such data formats
> (I designed netservers' firewall to use them throughout, and deeply
> regretted it two years later).
I think that Ben just likes two-dimensional tables ;-)
However, as Ben has suggested it, each message would start with a
message identifier in the first column, which is just as extensible as
an XML tag name - third party extenders can even prefix the names they
create with their domain name, if they like, to make collisions
unlikely, a la namespaces.
To add extra parameters to an existing message - why not just add an
extra message that gives more information about the same object?
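For example (just a sketch, with made-up message names - one message per
line, fields in columns), extra detail about a file could travel as a
follow-up message, and a third party could prefix its own additions:

  file              /home/fred/report.txt   10243
  file-checksum     /home/fred/report.txt   5f3aa81c
  org.example.acl   /home/fred/report.txt   rw-r--r--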
> No, I'm not talking about the client (bbackupd) and server (bbstored). I
> understand that they are tightly coupled, although this in itself is a
> pain in the a**e -- what happens when I want to upgrade my backup server?
> Do all my customers have to upgrade their clients at exactly the same
> time? THAT's painful.
Only if Ben made a backwards-incompatible change, e.g. making the server
require information that it didn't need before. However, my point was
more that people tend to get too carried away with thinking that XML's
claimed extensibility magics away extension problems - keeping the
protocol compatible is usually less of a problem than the higher-level
issues ;-)
> I'm talking about UI clients connecting to bbackupd, which acts as a
> server for information about its own status. I can foresee a wide variety
> of specialised clients, of which the WX GUI that I'm writing is just one.
> I have no wish whatsoever for that program to be tightly coupled to
> bbackupd, as that would give me an ongoing, permanent task of updating it
> to try to keep up with the latest bbackupd versions. On the contrary, I
> very much want it to be loosely coupled.
Ok! My mistake, sorry. If it's just a viewer of information, then surely
key-value pairs will be ideal? If the client sees a key it doesn't know,
it can just ignore it, and you know that no harm will be done?
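Something along these lines (key names made up purely for illustration):

  store-name: backup1.example.com
  last-sync: 2004-10-11 14:32:00
  files-uploaded: 152
  some-future-key: whatever

A viewer that doesn't recognise some-future-key just skips that line.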
>>However, for protocols, you've no way of knowing if extra information is
>>critical or not. Look at the PNG image format - PNG chunks have a flag
>>bit stating whether they are critical to decoding the image or not; if a
>>decoder does not understand a critical chunk, then it must abort.
>
> I had a plan for this: if you break the protocol in an incompatible way
> (i.e. you add information that the client can't ignore), change the major
> version code. Clients which don't understand your new version will
> disconnect when they see this.
This was considered by the PNG group.
The thing about the PNG system, however, is that if a writer supporting
higher-version features outputs a stream that doesn't happen to use those
features, it will still be readable by lower-version readers.
If you have to put a version number at the beginning, a writer either has
to put the highest version number of any feature it is likely to use, or
do potentially complex work to find out which features it is actually
going to use for this particular message, before outputting the header.
Also, who allocates version numbers? You need a central authority to do
that - what about third-party extensions?
> OK, that is a serious issue. I fully intend for the protocol to be
> human-readable (including filenames), even if it means breaching the XML
> standards slightly or inventing a new escaping mechanism, but maybe I
> should just ignore them and invent an "XML-like" protocol?
>
> YAML does seem to have support for encoding arbitrary characters with
> escaping (http://www.yaml.org/spec/#syntax-escape).
Yes! YAML was designed more for data interchange ;-)
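For instance (just a sketch), a double-quoted YAML scalar can carry
control characters and non-ASCII characters via escapes:

  filename: "weird\x07name - caf\xE9.txt"

That's a BEL and an e-acute, without breaking the textual format.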
> But if there was a better alternative, why aren't people talking about it?
> I've never heard anyone propose a better interchange format than XML (for
> my definition of "betterness"). If you have, please could you point me in
> its direction?
If being textual is really important, then YAML or s-expressions would do
(s-expressions happen to let you express arbitrary graphs, not just
trees!).
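An s-expression message might look something like this (field names
invented purely for illustration):

  (file-info (name "report.txt") (size 10243))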
If not, then there's XDR, which is more widely implemented than XML
(it's what NFS uses, so all free Unixes come with the toolset). With
XDR, you write what look like C type declarations for structs and stuff,
and a preprocessor takes that and produces source code to map between
them and native data types.
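A rough sketch of what an XDR definition might look like (names invented
here; rpcgen turns this into C structs plus encode/decode routines):

  struct FileInfo {
      string   filename<256>;
      unsigned hyper size;
      opaque   checksum<>;
  };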
Then there's the grandfather of them all - ASN.1. ASN.1 is basically a
schema language. There are a variety of actual encodings that you then
use to express information, tailored for different applications. The
simplest to describe, BER, is designed to be easy to implement. It has
two variants - DER and CER - which have useful cryptographic properties
relating to the fact that the encoder has no choice as to how to encode
values; in BER, it can choose which of two methods is easiest for
encoding variable-length things, and unordered lists can be written in
whatever order is easiest, while DER and CER constrain it. These are used
where digital signatures are involved; X.509 certificates, for example,
are DER-encoded.
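For flavour, the abstract syntax you write is just a schema - a sketch
with invented names; the "..." is an extension marker, one of the
extensibility features I mention below, letting later versions add fields
without breaking old decoders. BER, DER, CER and PER are then just
different ways of encoding values of these types:

  BackupProtocol DEFINITIONS ::= BEGIN
      FileInfo ::= SEQUENCE {
          filename  UTF8String,
          size      INTEGER,
          checksum  OCTET STRING OPTIONAL,
          ...
      }
  END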
However, BER is inefficient. Not as inefficient as XML, but inefficient
enough to make some people grumble, so they came up with PER - Packed
Encoding Rules. PER is quite complex to apply, but that application is
done once when your application is being built; the result is that the
encoder and decoder are then actually smaller and faster than BER ones,
and the on-the-wire encoding is very simple.
Then, to stop the growing interest in XML for data interchange from
sidelining ASN.1, the ASN.1 group made XER - XML Encoding Rules - which
produces XML representations!
ASN.1 supports extensibility in ways more elegant than XML does -
information object classes and extension markers - which I won't go into
in detail! If you're interested, I can point you at online resources.
However, it's hard to get decent open source ASN.1 toolkits :-( If
people hadn't hyped XML for data interchange, and had instead hyped
ASN.1, then we would have gotten people writing ASN.1 toolkits instead
of SAX/DOM implementations, and we'd be laughing.
If you shell out for a commercial ASN.1 toolkit, you get all sorts of
lovely development tools, like protocol analysers that will render BER
and PER human readable for you (for development) - a BER decoder can
happily read CER and DER since they're just subsets. Thus, the fact that
BER and friends aren't "human readable" isn't a problem in practice;
when a human needs to read them for debugging or whatnot, they use the
correct tool, while day to day the communicating software doesn't need to
burden itself with human readability. Or you can use a transcoder tool
to automatically convert between BER/CER/DER/PER and XER, if you like
angle brackets ;-)
Perhaps the best option when you take availability into account is
CORBA. There are open source CORBA client and server implementations to
be had from the Gnome folks. Part of CORBA is the binary format for
passing object values around, which works just as well as a file format
as it does for messages in a protocol.
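With CORBA you describe the interface in IDL (operation names invented
here just to illustrate), and the ORB generates the stubs and handles the
on-the-wire encoding for you:

  interface BackupDaemon {
      string GetStatus();
      long   GetFilesUploaded();
  };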
So to answer your question as to why people aren't talking about the
better alternatives - in short, because people don't do their research!
Everyone was coming up with homegrown protocols instead of looking for a
data interchange library, because they had never been shown a data
interchange library. Then, when people started hyping XML as one, the
idea seemed novel and amazing, so people flocked to it in droves...
The computer industry, sadly, is driven more by fads and marketing than
by technical stuff! The Web era brought lots of people into programming
with no theoretical background in it. The only programming tools they
saw were the very basics of CGI scripting in Perl, and their knowledge
of programming grew outwards from that by reinventing the wheel, rather
than by looking at what good work had been done in the past. Not to
slate them - they were just doing their job, and their job was to solve
problems in the short timescales of boom-era Web development rather than
to come up with decent software architectures for distributed systems!
>>XML makes for large files, and the parser libraries are large and slow
>>compared to simpler mechanisms...
>
> With gzip compression the files are not significantly larger than any
> other format. XML's syntactic sugar compresses very well in general, and
> there is even a binary format which should be even more lightweight (in
> theory, I haven't studied it).
gzip compression is quite resource intensive! And if you're willing to
expend those resources, you might as well compress a terser format to
begin with - the compression ratio will be lower, because there is less
redundant information to remove, but the result will still be smaller,
and you also avoid the time overhead of parsing XML.
The binary format of which you speak may be one of a number of
contenders - one of which is the Fast Web Services stuff, which uses
ASN.1 at heart. There's also BiM (IIRC, used for streaming metadata in
MPEG) and a few others.
>>1) The character encoding issue. A valid XML file might be supplied in
>>any of a wide range of character sets, including EBCDIC! Parsers are
>>only required to support UTF-*, but encoders are still allowed to
>>produce whatever charset they feel like.
>
>
> Here the encoder is fully under our control, and I would be happy to stick
> with UTF-8 or US-ASCII (given a means for representing filenames, possibly
> non-standard).
When I have this argument with *real* XML zealots, they cannot bear to
think of mandating any subset of XML, since it breaks their dream of
interoperability between everything ;-)
I'd stick with UTF-8 - Windows for one allows Unicode filenames, so
US-ASCII would present a problem...
> I would rather not Base64 encode anything, but instead invent an escaping
> system which at least allows Latin filenames to be read directly from the
> XML stream.
It has been suggested that somebody create a specification for magic
empty elements to represent otherwise-illegal characters, something like:

  ... xmlns:char="..." ...

  Here is some text containing <char:BEL /> illegal characters.
  <char:NUL />
I think the ASN.1 people (who have defined XER, XML Encoding Rules, to
allow XML to interoperate with stuff like BER) came up with something
similar. At least, I suggested they did when I was invited to some of
their meetings as an XML expert ;-)
> Perhaps it would be better to search for, or write afresh, a very simple
> "XML-like" protocol which is easy to parse, has well-defined support for
> representing binary data in an easy-to-read and relatively lightweight
> way, and no support for validation, entities or references to external
> documents? Does YAML fit these requirements?
I don't think YAML offers anything better than hex or base64 for binary data
- making binary and text cooperate in one file is tricky... which is why
I like binary interchange formats; you can put arbitrary binary data in
byte-string fields quite happily, and then use a nice viewer/editor
program rather than a text editor, which will handle the binary data
neatly while still letting you view and edit everything else :-)
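For what it's worth, YAML's !!binary tag is just base64 underneath - a
sketch (the data here is simply "Hello, world!" encoded):

  filename: report.txt
  contents: !!binary |
    SGVsbG8sIHdvcmxkIQ==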
> Cheers, Chris.
ABS
--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/categories/alaric/