[Box Backup] Win32 port (was: BoxBackup Server Side Management Specs (Draft0.01))

Sat, 9 Oct 2004 18:02:46 +0100 (BST)

Hi Alaric,

> XML's subtly broken for data interchange applications

I agree with you now, thanks for your explanation. But please see my 
comments below.

> bear in mind it was designed as a markup language, not a data
> interchange format!  Although the term "markup" seems to have been
> re-defined in hindsight to mean data interchange, mainly due to people
> misunderstanding what Tim Berners-Lee was going on about when he was
> basically explaining that XML is meant to be more machine-processable
> than HTML, to simplify "screen scraping" of web pages.

I fail to see the difference between markup and interchange of tree-
structured data. To flatten a tree into a stream, with appropriate markers
to enable the original tree to be reconstructed, seems to be what XML
does? (but perhaps with too much complexity in the general case, and
certainly too much for this application, I agree.)
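
To make the "flatten a tree into a stream and reconstruct it" idea
concrete, here is a minimal sketch using Python's standard
xml.etree.ElementTree module. The element names (status, client, store
and so on) are hypothetical and only illustrate the round trip; they are
not part of any existing Box Backup protocol.

```python
import xml.etree.ElementTree as ET

def tree_to_xml(name, children):
    """Flatten a (name, children) tree into an XML string.

    children is a list of (name, children) pairs; a leaf has an empty list.
    """
    def build(node_name, kids):
        element = ET.Element(node_name)
        for kid_name, kid_children in kids:
            element.append(build(kid_name, kid_children))
        return element
    return ET.tostring(build(name, children), encoding="unicode")

def xml_to_tree(text):
    """Rebuild the (name, children) tree from the XML stream."""
    def walk(element):
        return (element.tag, [walk(child) for child in element])
    return walk(ET.fromstring(text))

# Hypothetical status tree, used only to show the round trip.
status = ("status", [("client", [("state", []), ("last-sync", [])]),
                     ("store", [])])
stream = tree_to_xml(*status)
assert xml_to_tree(stream) == status
```

The start and end tags are the "markers" here: they carry enough
information to rebuild the nesting without any out-of-band schema.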

In this case, although there isn't yet a need for deep tree structure in
the protocol, I believe that XML's markup makes it much more extensible
and future-proof than the tab-and-newline-delimited language which Ben
proposed. I speak from long and bitter experience with such data formats
(I designed netservers' firewall to use them throughout, and deeply
regretted it two years later).

> You're thinking here in terms of loosely-coupled systems. By design, the 
> box backup client and server are tightly coupled - they co-evolve.

No, I'm not talking about the client (bbackupd) and server (bbstored). I 
understand that they are tightly coupled, although this in itself is a 
pain in the a**e -- what happens when I want to upgrade my backup server? 
Do all my customers have to upgrade their clients at exactly the same 
time? THAT's painful.

I'm talking about UI clients connecting to bbackupd, which acts as a
server for information about its own status. I can foresee a wide variety
of specialised clients, of which the WX GUI that I'm writing is just one.
I have no wish whatsoever for that program to be tightly coupled to
bbackupd, as that would give me an ongoing, permanent task of updating it
to try to keep up with the latest bbackupd versions. On the contrary, I
very much want it to be loosely coupled.

> What this means in practice is that if there is divergence between the 
> two halves, then lexical protocol incompatability is the least of your 
> problems.

I don't believe this will be an issue when the client and server are 
designed from the ground up to be decoupled, but if I am wrong, then I 
would really appreciate an explanation.

> This is one of the flaws in XML's "extensibility" for data interchange. 
> For the original application - human-readable documents that allowed 
> easy extraction of parts of them for machine processing - the "ignore 
> elements you don't understand" paradigm was applicable, since Web pages 
> would often add extra stuff (adverts, special offers, navigation links) 
> alongside the core data that could then be ignored.
> 
> However, for protocols, you've no way of knowing if extra information is 
> critical or not. Look at the PNG image format - PNG chunks have a flag 
> bit stating whether they are critical to decoding the image or not; if a 
> decoder does not understand a critical chunk, then it must abort.

I had a plan for this: if you break the protocol in an incompatible way
(i.e. you add information that the client can't ignore), change the major
version code. Clients which don't understand your new version will 
disconnect when they see this.
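
A minimal sketch of that major-version check, from the UI client's side.
The greeting format ("BBSTATUS 1.4"), the name BBSTATUS and the exception
class are all assumptions made up for illustration; nothing like this is
defined yet.

```python
SUPPORTED_MAJOR = 1  # hypothetical: the major version this client understands

class IncompatibleProtocol(Exception):
    """Raised when the client should disconnect rather than guess."""

def check_greeting(line, supported_major=SUPPORTED_MAJOR):
    """Parse a hypothetical greeting such as 'BBSTATUS 1.4'.

    Returns (major, minor) if the major version is understood. A different
    major version means the stream may contain information that cannot be
    safely ignored, so the client gives up. A different minor version is
    accepted: minor changes are assumed to add only ignorable information.
    """
    name, _, version = line.strip().partition(" ")
    if name != "BBSTATUS":
        raise IncompatibleProtocol("unexpected greeting: %r" % line)
    major_text, _, minor_text = version.partition(".")
    try:
        major, minor = int(major_text), int(minor_text)
    except ValueError:
        raise IncompatibleProtocol("malformed version: %r" % version)
    if major != supported_major:
        raise IncompatibleProtocol("unsupported major version %d" % major)
    return major, minor
```

This plays roughly the role of PNG's critical-chunk flag, but at the
granularity of the whole connection rather than of individual elements.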

> XML has just the same complexity. Except that it can't actually 
> represent some of the characters that might turn up in file names (even 
> using numeric character entities), such as nulls, so you'd have to 
> base64-encode the file names anyway (see the XSD type defined exactly 
> for this).

OK, that is a serious issue. I fully intend for the protocol to be
human-readable (including filenames), even if it means breaching the XML
standards slightly or inventing a new escaping mechanism. But maybe I
should just set the standards aside and invent an "XML-like" protocol?

YAML does seem to have support for encoding arbitrary characters with
escaping (http://www.yaml.org/spec/#syntax-escape).

> Personally, I think it became popular because it was hyped by a lot of 
> software companies as the ultimate solution to data interchange, in 
> front of a lot of people (Web developers who had little network 
> programming experience outside of the Web arena) who hadn't ever seen a 
> proper data interchange format before so sucked up the hype...

But if there were a better alternative, why aren't people talking about it?
I've never heard anyone propose a better interchange format than XML (for
my definition of "betterness"). If you have, please could you point me in
its direction?

> XML makes for large files, and the parser libraries are large and slow 
> compared to simpler mechanisms...

With gzip compression, XML files are not significantly larger than the
same data in any other format. XML's syntactic sugar generally compresses
very well, and there is even a binary format which should be more
lightweight still (in theory; I haven't studied it).
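
One way to check the compression claim for a given payload is simply to
measure it. The sketch below builds the same made-up file list (names and
sizes are invented) as XML and as tab-and-newline-delimited text, and
reports raw and gzip-compressed sizes for each. The outcome depends on
the data, so no particular ratio is asserted here.

```python
import gzip

# Invented sample data: 1000 (name, size) records in two encodings.
rows = [("doc%04d.txt" % i, i * 512) for i in range(1000)]

xml_doc = ("<files>"
           + "".join("<file><name>%s</name><size>%d</size></file>" % row
                     for row in rows)
           + "</files>").encode("utf-8")
tab_doc = "".join("%s\t%d\n" % row for row in rows).encode("utf-8")

def sizes(data):
    """Return (raw length, gzip-compressed length) in bytes."""
    return len(data), len(gzip.compress(data))

xml_raw, xml_gz = sizes(xml_doc)
tab_raw, tab_gz = sizes(tab_doc)
print("xml: %d raw, %d gzipped" % (xml_raw, xml_gz))
print("tab: %d raw, %d gzipped" % (tab_raw, tab_gz))
```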

> 1) The character encoding issue. A valid XML file might be supplied in 
> any of a wide range of character sets, including EBCDIC! Parsers are 
> only required to support UTF-*, but encoders are still allowed to 
> produce whatever charset they feel like.

Here the encoder is fully under our control, and I would be happy to stick 
with UTF-8 or US-ASCII (given a means for representing filenames, possibly 
non-standard).

> 2) There are characters that are banned in XML text content. This is 
> because it's designed for marking up human readable text for display, 
> not for data interchange - and the display semantics of characters like 
> "bell" and "shift in" aren't defined. This caused problems for a lot of 
> companies that rushed out and produced XML interfaces to their databases 
> - the databases could contain characters that XML cannot express, so 
> they ended up naively creating non well-formed documents! What they 
> should have done is to base64-encode all element content that cannot be 
> guaranteed to be constrained to valid characters, but that would have 
> made a mockery of their "human readable!" claims...

I would rather not Base64 encode anything, but instead invent an escaping
system which at least allows Latin filenames to be read directly from the
XML stream.
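
A minimal sketch of such an escaping system, as one possible design and
not a settled proposal. It assumes filenames arrive as raw bytes, passes
printable ASCII through unchanged, and writes everything else (including
nulls, the backslash itself, and the XML-significant "<" and "&") as a
\xNN hex escape. The function names are hypothetical. For simplicity this
version also escapes bytes above 0x7F, so accented characters would need
a further rule to stay readable.

```python
import re

# Bytes that are written as themselves: printable ASCII, minus the escape
# character and the two characters that are significant in XML text.
_SAFE = set(range(0x20, 0x7F)) - {ord("\\"), ord("<"), ord("&")}

def escape_name(raw: bytes) -> str:
    """Encode arbitrary filename bytes as readable ASCII text."""
    return "".join(chr(b) if b in _SAFE else "\\x%02x" % b for b in raw)

def unescape_name(text: str) -> bytes:
    """Invert escape_name, recovering the original bytes exactly."""
    decoded = re.sub(r"\\x([0-9a-f]{2})",
                     lambda m: chr(int(m.group(1), 16)),
                     text)
    return decoded.encode("latin-1")

name = b"report\x00final.txt"
assert escape_name(name) == "report\\x00final.txt"
assert unescape_name(escape_name(name)) == name
```

Because the backslash is itself always escaped, every backslash in the
output starts a \xNN sequence, which keeps decoding unambiguous.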

> 3) Debate over schema languages. Do you go for the verbose hell of W3C 
> XML Schema, or adopt Relax NG which is sensibly designed and nice to 
> use, but not as widely accepted?

I would be happy with Relax NG, as I have no intention of making extra 
effort to validate the output, at least to start with.

> 4) XML parsers are prone to DoS attacks (intentional or accidental). For 
> example, one can reference arbitrary URIs as external entities (itself 
> requiring an XML parser to contain a URL parser and at least an HTTP 
> client...), and you can use pathological nested internal entities to 
> cause a parser to allocate huge amounts of memory, unless it contains 
> special resource usage limitation code that just adds complexity (see 
> "http://www.xs4all.nl/~irmen/comp/billionlaughs.xml").

I have no intention of supporting such nonsense in the client anyway :-) 
But well pointed out, I had forgotten about that.
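
A minimal sketch of how a client could refuse this class of input,
assuming the parser is Python's standard xml.parsers.expat module (the
real client might well use something else). Entities can only be
declared inside a DOCTYPE, so rejecting the DOCTYPE as soon as it starts
rules out both nested-entity expansion and external entity references.
The function and exception names are hypothetical.

```python
import xml.parsers.expat

class DoctypeForbidden(Exception):
    """Raised when a document tries to declare a DTD."""

def parse_without_dtd(data: bytes):
    """Parse data and return its element names in document order.

    Any DOCTYPE declaration aborts the parse before its entity
    declarations are processed.
    """
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(doctype_name, system_id, public_id, has_internal_subset):
        raise DoctypeForbidden("DOCTYPE declarations are not accepted")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)
    return names

assert parse_without_dtd(b"<status><client/></status>") == ["status", "client"]
```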

Perhaps it would be better to search for, or write afresh, a very simple
"XML-like" protocol which is easy to parse, has well-defined support for
representing binary data in an easy-to-read and relatively lightweight 
way, and no support for validation, entities or references to external 
documents? Does YAML fit these requirements?

Cheers, Chris.
-- 
_ ___ __     _
 / __/ / ,__(_)_  | Chris Wilson <0000 at qwirx.com> - Cambs UK |
/ (_/ ,\/ _/ /_ \ | Security/C/C++/Java/Perl/SQL/HTML Developer |
\ _/_/_/_//_/___/ | We are GNU-free your mind-and your software |