[Box Backup] Win32 port (was: BoxBackup Server Side Management Specs (Draft0.01))

Alaric Snell-Pym boxbackup@fluffy.co.uk
Mon, 11 Oct 2004 15:47:36 +0100


Chris Wilson wrote:

> I fail to see the difference between markup and interchange of tree-
> structured data. To flatten a tree into a stream, with appropriate markers
> to enable the original tree to be reconstructed, seems to be what XML
> does?

Yep, it's just a matter of intent - but that intent has subtle
implications for the details.

Markup was originally about taking a string, and annotating regions of 
it with metadata (either for styling, or to enable bits of the text to 
be selected according to their purpose, or whatever). This is why XML 
has attributes as well as elements - so that the metadata could be 
parameterised.
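
For instance, in a DocBook-ish fragment (made up for illustration):

   <para>Press the <guibutton role="dangerous">Reset</guibutton>
   button.</para>

The role attribute parameterises the markup on that one region of
text.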

If you came at the design from the viewpoint of exchanging trees,
you'd be unlikely to have attributes - just nodes in a tree. If you
wanted to attach properties to objects in the tree, you'd be just as
likely to want something like attributes on the links between nodes
and their children as on the nodes themselves...

Also, the markup mindset led to DTDs, which people who want to
exchange data have found too string-oriented; you can't put type
constraints on element content, and can place only simple constraints
on attribute content. XML schema languages were retrofitted to try
and make it more data-oriented, but there used to be ongoing wars on
XML-DEV between the data folks and the document folks!
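
To illustrate the limits, here is about all a DTD can really say
(example declarations invented):

   <!-- element content can only be declared as character data: -->
   <!ELEMENT size (#PCDATA)>
   <!-- an enumeration is about the strongest attribute constraint: -->
   <!ATTLIST size unit (bytes|kb|mb) "kb">

There's no way to declare that the content of <size> must be an
integer.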

Personally, I quite like XML as a document format - DocBook is quite 
powerful, IMHO, and it's easy to knock together custom documentation 
systems with XSLT...

> In this case, although there isn't yet a need for deep tree structure in
> the protocol, I believe that XML's markup makes it much more extensible
> and future-proof than the tab-and-newline-delimited language which Ben
> proposed. I speak from long and bitter experience with such data formats
> (I designed netservers' firewall to use them throughout, and deeply
> regretted it two years later).

I think that Ben just likes two-dimensional tables ;-)

However, as Ben has proposed it, each message would start with a
message identifier in the first column, which is just as extensible
as an XML tag name - third-party extenders can even prefix the names
they create with their domain name, if they like, to make collisions
unlikely, a la namespaces.

To add extra parameters to an existing message - why not just add an 
extra message that gives more information about the same object?
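
So a session might carry something like this (message and field names
invented purely for illustration; <TAB> stands for a literal tab):

   store-info<TAB>12345<TAB>8192
   org.example.quota-info<TAB>12345<TAB>10240

The second message is a hypothetical third-party extension giving
extra information about account 12345; a client that doesn't
recognise the identifier just skips the line.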

> No, I'm not talking about the client (bbackupd) and server (bbstored). I 
> understand that they are tightly coupled, although this in itself is a 
> pain in the a**e -- what happens when I want to upgrade my backup server? 
> Do all my customers have to upgrade their clients at exactly the same 
> time? THAT's painful.

Only if Ben made a backwards-incompatible change, e.g. making the
server require information that it didn't need before. However, my
point was more that people tend to get carried away with the idea
that XML's claimed extensibility magics away extension problems -
keeping the protocol compatible is usually less of a problem than the
higher-level issues ;-)

> I'm talking about UI clients connecting to bbackupd, which acts as a
> server for information about its own status. I can foresee a wide variety
> of specialised clients, of which the WX GUI that I'm writing is just one.
> I have no wish whatsoever for that program to be tightly coupled to
> bbackupd, as that would give me an ongoing, permanent task of updating it
> to try to keep up with the latest bbackupd versions. On the contrary, I
> very much want it to be loosely coupled.

Ok! My mistake, sorry. If it's just a viewer of information, then
surely key-value pairs will be ideal? If the client sees a key it
doesn't know, it can just ignore it, knowing that no harm will be
done.
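
Something like this minimal sketch - assuming a line-per-pair
"key<TAB>value" format and key names I've invented here, not
bbackupd's actual protocol:

   #include <stdio.h>
   #include <string.h>

   /* Read "key<TAB>value" lines and display the ones we know,
    * silently skipping any key we don't recognise - so a newer
    * bbackupd could add keys without breaking this viewer. */
   int main(void)
   {
       char line[1024];
       while (fgets(line, sizeof(line), stdin) != NULL)
       {
           char *value = strchr(line, '\t');
           if (value == NULL) continue;          /* malformed line */
           *value++ = '\0';
           value[strcspn(value, "\r\n")] = '\0'; /* strip newline */

           if (strcmp(line, "state") == 0)
               printf("Daemon state: %s\n", value);
           else if (strcmp(line, "last-backup") == 0)
               printf("Last backup:  %s\n", value);
           /* unknown keys fall through and are ignored */
       }
       return 0;
   }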

>>However, for protocols, you've no way of knowing if extra information is 
>>critical or not. Look at the PNG image format - PNG chunks have a flag 
>>bit stating whether they are critical to decoding the image or not; if a 
>>decoder does not understand a critical chunk, then it must abort.
> 
> I had a plan for this: if you break the protocol in an incompatible way
> (i.e. you add information that the client can't ignore), change the major
> version code. Clients which don't understand your new version will 
> disconnect when they see this.

This was considered by the PNG group.

The thing about the PNG system, however, is that if a writer
supporting higher-version features outputs a stream that doesn't
happen to use those features, it will still be readable by
lower-version readers.
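
(For the record, PNG marks criticality with bit 5 of the first byte
of the chunk type - in ASCII terms, the case of the first letter:)

   #include <ctype.h>

   /* "IHDR" and "IDAT" are critical; "tEXt" and "gAMA" are
    * ancillary. A decoder may skip an unknown ancillary chunk,
    * but must abort on an unknown critical one. */
   int chunk_is_critical(const char type[4])
   {
       return isupper((unsigned char)type[0]);
   }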

If you have to put a version number at the beginning, a writer has
either to declare the highest version number of any feature it might
use, or to do potentially complex work to find out which features it
is actually going to use in this particular message, before
outputting the header.

Also, who allocates version numbers? You need a central authority to
do that - and what about third-party extensions?

> OK, that is a serious issue. I fully intend for the protocol to be
> human-readable (including filenames), even if it means breaching the XML
> standards slightly or inventing a new escaping mechanism, but maybe I
> should just ignore them and invent an "XML-like" protocol?
>
> YAML does seem to have support for encoding arbitrary characters with
> escaping (http://www.yaml.org/spec/#syntax-escape).

Yes! YAML was designed more for data interchange ;-)
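
Its double-quoted style has C-ish escapes for arbitrary code points -
for example (filename invented):

   filename: "caf\xE9 backup \u263A.dat"  # \xNN, \uNNNN, \UNNNNNNNN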

> But if there was a better alternative, why aren't people talking about it?  
> I've never heard anyone propose a better interchange format than XML (for
> my definition of "betterness"). If you have, please could you point me in
> its direction?

If being textual is really important, then YAML or s-expressions
would fit the bill (s-expressions even let you express arbitrary
graphs, not just trees!).

If not, then there's XDR, which is more widely implemented than XML 
(it's what NFS uses, so all free Unixes come with the toolset). With 
XDR, you write what look like C type declarations for structs and stuff, 
and a preprocessor takes that and produces source code to map between 
them and native data types.
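
A sketch of what a .x file might look like (type and field names
invented) - rpcgen would generate C encode/decode routines such as
xdr_BackupStatus() from it:

   struct BackupStatus {
       string client_name<256>;  /* variable-length, max 256 bytes */
       unsigned int files_uploaded;
       bool is_running;
   };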

Then there's the grandfather of them all - ASN.1. ASN.1 is basically
a schema language. There are a variety of actual encodings that you
then use to express information, tailored for different applications.
The simplest to describe, BER, is designed to be easy to implement.
It has two variants - DER and CER - which have useful cryptographic
properties stemming from the fact that the encoder has no choice in
how to encode values; in BER it can choose whichever of two methods
is easiest for encoding variable-length things, and can write
unordered lists in whatever order it pleases, while DER and CER pin
these choices down. That determinism is what digital signatures need;
X.509 certificates, for instance, are DER-encoded.

However, BER is inefficient. Not as inefficient as XML, but
inefficient enough to make some people grumble, so they came up with
PER - the Packed Encoding Rules. PER is quite complex to apply, but
that complexity is dealt with once, when your application is built;
the result is that the encoder and decoder are actually smaller and
faster than BER ones, and the on-the-wire encoding is very simple.

Then, to stop the growing interest in XML for data interchange from
sidelining ASN.1, the ASN.1 group created XER - the XML Encoding
Rules - which produces XML representations!

ASN.1 supports extensibility in ways more elegant than XML does -
information object classes and extension markers - which I won't go
into in detail! If you're interested, I can point you at online
resources.
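
Just to give the flavour of extension markers (field names invented):

   BackupStatus ::= SEQUENCE {
       clientName     UTF8String,
       filesUploaded  INTEGER,
       ...  -- extension marker: later versions may add fields
            -- here, and an old decoder will skip any it
            -- doesn't know
   }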

However, it's hard to get decent open source ASN.1 toolkits :-( If 
people hadn't hyped XML for data interchange, and had instead hyped 
ASN.1, then we would have gotten people writing ASN.1 toolkits instead 
of SAX/DOM implementations, and we'd be laughing.

If you shell out for a commercial ASN.1 toolkit, you get all sorts
of lovely development tools, like protocol analysers that will render
BER and PER human-readable for you (a BER decoder can happily read
CER and DER too, since they're just subsets). Thus the fact that BER
and friends aren't "human readable" isn't a problem in practice: when
a human needs to read them for debugging or whatnot, they use the
right tool, while day-to-day communicating software doesn't burden
itself with human readability. Or you can use a transcoder tool to
automatically convert between BER/CER/DER/PER and XER, if you like
angle brackets ;-)

Perhaps the best option, when you take availability into account, is
CORBA. There are open source CORBA client and server implementations
to be had from the Gnome folks. Part of CORBA is a binary format for
passing object values around, which works just as well as a file
format as it does for messages in a protocol.
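
A sketch in CORBA IDL (interface and operations invented) - an IDL
compiler such as ORBit's generates the C stubs, and GIOP/CDR does the
on-the-wire encoding for you:

   interface BackupDaemon {
       string GetState();
       unsigned long GetFilesUploaded();
   };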

So, to answer your question as to why people aren't talking about
the better alternatives - in short, because people don't do their
research! Everyone was coming up with homegrown protocols instead of
looking for a data interchange library, because they had never been
shown one. Then when people started hyping XML as such a library, the
idea seemed novel and amazing, so everyone flocked to it in
droves...

The computer industry, sadly, is driven more by fads and marketing
than by technical merit! The Web era brought lots of people into
programming with no theoretical background in it. The only
programming tools they saw were the very basics of CGI scripting in
Perl, and their knowledge of programming grew outwards from that by
reinventing the wheel, rather than by looking at what good work had
been done in the past. Not to slate them - they were just doing their
job, and their job was to solve problems on the short timescales of
boom-era Web development, not to come up with decent software
architectures for distributed systems!

>>XML makes for large files, and the parser libraries are large and slow 
>>compared to simpler mechanisms...
> 
> With gzip compression the files are not significantly larger than any
> other format. XML's syntactic sugar compresses very well in general, and
> there is even a binary format which should be even more lightweight (in 
> theory, I haven't studied it).

gzip compression is quite resource-intensive! And if you're willing
to expend those resources, you might as well compress a terser format
to begin with - the compression ratio will be lower, because there is
less redundant information to remove, but the result will still be
smaller, and you also avoid the time overhead of parsing XML.

The binary format of which you speak may be one of a number of 
contenders - one of which is the Fast Web Services stuff, which uses 
ASN.1 at heart. There's also BiM (IIRC, used for streaming metadata in 
MPEG) and a few others.

>>1) The character encoding issue. A valid XML file might be supplied in 
>>any of a wide range of character sets, including EBCDIC! Parsers are 
>>only required to support UTF-*, but encoders are still allowed to 
>>produce whatever charset they feel like.
> 
> Here the encoder is fully under our control, and I would be happy to stick 
> with UTF-8 or US-ASCII (given a means for representing filenames, possibly 
> non-standard).

When I have this argument with *real* XML zealots, they cannot bear to 
think of mandating any subset of XML, since it breaks their dream of 
interoperability between everything ;-)

I'd stick with UTF-8 - Windows, for one, allows Unicode filenames,
so US-ASCII would present a problem...

> I would rather not Base64 encode anything, but instead invent an escaping
> system which at least allows Latin filenames to be read directly from the
> XML stream.

It has been suggested that somebody create a specification for magic
empty elements to represent characters:

...xmlns:char="..."...

    Here is some text containing <char:BEL /> illegal characters. 
<char:NUL />.

I think the ASN.1 people (who have defined XER, the XML Encoding
Rules, to allow XML to interoperate with stuff like BER) came up with
something similar. At least, I suggested it when I was invited to
some of their meetings as an XML expert ;-)

> Perhaps it would be better to search for, or write afresh, a very simple
> "XML-like" protocol which is easy to parse, has well-defined support for
> representing binary data in an easy-to-read and relatively lightweight 
> way, and no support for validation, entities or references to external 
> documents? Does YAML fit these requirements?

I don't think YAML offers anything better than hex or base64 for
binary data - making binary and text cooperate in one file is
tricky... which is why I like binary interchange formats: you can put
arbitrary binary data in byte-string fields quite happily, and then
use a nice viewer/editor program rather than a text editor, which
will handle the binary data neatly while still letting you view and
edit everything else :-)

> Cheers, Chris.

ABS

-- 
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/categories/alaric/