[Box Backup] Win32 port (was: BoxBackup Server Side Management Specs (Draft0.01))

Alaric Snell-Pym boxbackup@fluffy.co.uk
Fri, 08 Oct 2004 23:19:38 +0100


Chris Wilson wrote:
>>How about a simple \n terminated line, with tab separated 
>>message name and arguments? Nice and easy to parse, nice and easy to 
>>write, extend by adding new messages or arguments.
>
> It lacks human-readability, extensibility and robustness. In order to 
> understand the output when "parsing" by hand, you have to learn by heart 
> what each column stands for. With XML, the columns are self-describing.

XML's subtly broken for data interchange applications - bear in mind it 
was designed as a markup language, not a data interchange format! 
Although the term "markup" seems to have been re-defined in hindsight to 
mean data interchange, mainly due to people misunderstanding what Tim 
Berners-Lee was going on about when he was basically explaining that XML 
is meant to be more machine-processable than HTML, to simplify "screen 
scraping" of web pages.

> How do you deal with third-party extensions to the protocol? With XML, you
> can just create a new namespace to hold those attributes and tags, and the
> client can ignore them. Any sensible solution to the problem is bound to
> be similar to XML's.
>
> How about if Box was forked one day, or independent, incompatible protocol
> changes made in the stable and development versions? Such changes would
> normally make the forks incompatible with each other, from the view of the
> client, because they would re-use the same columns for new purposes.
> Labelling columns makes this almost not an issue, and the same client
> could probably work with both forks, with maximal shared code, and minimal
> client complexity.

You're thinking here in terms of loosely-coupled systems. By design, the 
box backup client and server are tightly coupled - they co-evolve.

What this means in practice is that if there is divergence between the 
two halves, then lexical protocol incompatibility is the least of your 
problems.

This is one of the flaws in XML's "extensibility" for data interchange. 
For the original application - human-readable documents that allowed 
easy extraction of parts of them for machine processing - the "ignore 
elements you don't understand" paradigm was applicable, since Web pages 
would often add extra stuff (adverts, special offers, navigation links) 
alongside the core data that could then be ignored.

However, for protocols, you've no way of knowing if extra information is 
critical or not. Look at the PNG image format - PNG chunks have a flag 
bit stating whether they are critical to decoding the image or not; if a 
decoder does not understand a critical chunk, then it must abort.
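
As a rough illustration of that PNG rule (a sketch, not code from Box
Backup or from any PNG library): the "ancillary" bit is bit 5 of the
first byte of the four-byte chunk type, so an uppercase first letter
means critical. The handle_chunk policy function and its return values
are hypothetical names chosen for this example.

```python
def is_critical_chunk(chunk_type: bytes) -> bool:
    """Return True if a 4-byte PNG chunk type names a critical chunk.

    Bit 5 (0x20) of the first byte is the ancillary bit: clear (an
    uppercase letter) means critical, set (lowercase) means ancillary.
    """
    if len(chunk_type) != 4:
        raise ValueError("PNG chunk types are exactly four bytes")
    return (chunk_type[0] & 0x20) == 0


def handle_chunk(chunk_type: bytes, known: set) -> str:
    """Hypothetical decoder policy: decode, skip, or abort."""
    if chunk_type in known:
        return "decode"
    if is_critical_chunk(chunk_type):
        # Unknown but critical: the decoder cannot safely continue.
        return "abort"
    # Unknown and ancillary: safe to ignore.
    return "skip"
```

A protocol with the same flag can add optional messages freely, while
an old peer still refuses anything it must not silently drop.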

> For robustness, there are many special cases to deal with, in terms of 
> data encapsulation. For example, filenames can contain tabs, newlines and 
> null bytes, all of which need to be escaped. The escape character itself 
> needs to be escaped. Once you have dealt with that, the reader needs a 
> parser of considerable complexity. I would not be at all sure that I had 
> got it right, especially not without the extra syntactic sugar of the tag 
> <start> and </end> markers.

XML has just the same complexity. Except that it can't actually 
represent some of the characters that might turn up in file names (even 
using numeric character entities), such as nulls, so you'd have to 
base64-encode the file names anyway (see the XSD base64Binary type, 
defined exactly for this).
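
For comparison, here is a minimal sketch of the escaping that the
tab-separated line format would need. It is only an illustration under
assumptions: the backslash escape scheme, the function names and the
"FileChanged" message name are all made up for this example and are
not part of any Box Backup protocol.

```python
# Characters that would break the framing, and their escaped forms.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\0": "\\0"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "0": "\0"}


def escape_field(field: str) -> str:
    """Escape backslash, tab, newline and NUL inside one field."""
    return "".join(_ESCAPES.get(ch, ch) for ch in field)


def unescape_field(text: str) -> str:
    """Reverse escape_field, rejecting unknown or truncated escapes."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)
        if code not in _UNESCAPES:
            raise ValueError("bad escape sequence in field")
        out.append(_UNESCAPES[code])
    return "".join(out)


def encode_message(name: str, *args: str) -> str:
    """Build one newline-terminated, tab-separated message line."""
    return "\t".join(escape_field(f) for f in (name, *args)) + "\n"


def decode_message(line: str):
    """Split a message line back into (name, list of arguments)."""
    if not line.endswith("\n"):
        raise ValueError("message is not newline-terminated")
    fields = [unescape_field(f) for f in line[:-1].split("\t")]
    return fields[0], fields[1:]
```

After escaping, the only raw tabs in a line are field separators and
the only raw newline is the terminator, so the reader can split first
and unescape afterwards.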

> I'm having difficulty understanding whether your objection to XML is 
> philosophical or practical. Myself, I think that its practical benefits 
> far outweigh its disadvantages, in all but the most extreme circumstances 
> (tiny systems, huge systems, huge data volume; of which this protocol 
> needs none), and I don't understand a philosophical objection. XML became 
> popular for a reason, not because it was popular.

Personally, I think it became popular because it was hyped by a lot of 
software companies as the ultimate solution to data interchange, in 
front of a lot of people (Web developers who had little network 
programming experience outside of the Web arena) who hadn't ever seen a 
proper data interchange format before so sucked up the hype...

> I really dislike over-simple protocols that make too little concession to
> being easy for both humans and machines to read, future-proofing or
> robustness. A little effort at design time makes it possible to satisfy
> all these criteria. The only happy medium that I have found is XML (and
> YAML, but it's almost the same as XML). Everybody is using it. It's not
> big or ugly or heavy if you do it right. Good libraries are available.
> What's not to like?

XML makes for large files, and the parser libraries are large and slow 
compared to simpler mechanisms...

XML has lots of problems for data interchange. Off the top of my head:

1) The character encoding issue. A valid XML file might be supplied in 
any of a wide range of character sets, including EBCDIC! Parsers are 
only required to support UTF-8 and UTF-16, but encoders are still 
allowed to produce whatever charset they feel like. Parsers have to 
implement a complex algorithm to decode the character set. Because of 
this, the text/xml MIME type is now being officially deprecated, since 
the complex interactions between this and MIME text/* character set 
support were deemed too messy to fix.

2) There are characters that are banned in XML text content. This is 
because it's designed for marking up human readable text for display, 
not for data interchange - and the display semantics of characters like 
"bell" and "shift in" aren't defined. This caused problems for a lot of 
companies that rushed out and produced XML interfaces to their databases 
- the databases could contain characters that XML cannot express, so 
they ended up naively creating non well-formed documents! What they 
should have done is to base64-encode all element content that cannot be 
guaranteed to be constrained to valid characters, but that would have 
made a mockery of their "human readable!" claims...

   <queryResult>
     <row>
       <field name="bmFtZQ==" value="QWxhcmljIEIuIFNuZWxsLVB5bQ==" />
     </row>
   </queryResult>

3) Debate over schema languages. Do you go for the verbose hell of W3C 
XML Schema, or adopt Relax NG which is sensibly designed and nice to 
use, but not as widely accepted?

4) XML parsers are prone to DoS attacks (intentional or accidental). For 
example, one can reference arbitrary URIs as external entities (itself 
requiring an XML parser to contain a URL parser and at least an HTTP 
client...), and you can use pathological nested internal entities to 
cause a parser to allocate huge amounts of memory, unless it contains 
special resource usage limitation code that just adds complexity (see 
"http://www.xs4all.nl/~irmen/comp/billionlaughs.xml").
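
Point 2 in the list above can be checked with a few lines of Python
and its standard xml.etree.ElementTree parser. This is a sketch rather
than anything from Box Backup: the helper names parses and make_field
are invented here, and the element layout simply mirrors the
queryResult example.

```python
import base64
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


def make_field(name: bytes, value: bytes) -> ET.Element:
    """Build a <field> element with base64-encoded name and value."""
    field = ET.Element("field")
    field.set("name", base64.b64encode(name).decode("ascii"))
    field.set("value", base64.b64encode(value).decode("ascii"))
    return field


# A raw "bell" character is rejected, and so is its numeric
# character reference; an ordinary tab reference is accepted.
bell_raw_ok = parses("<v>\x07</v>")
bell_ref_ok = parses("<v>&#7;</v>")
tab_ref_ok = parses("<v>&#9;</v>")

# Arbitrary bytes survive only once they are base64-encoded.
payload = b"bell\x07 and nul\x00"
xml_text = ET.tostring(make_field(b"name", payload), encoding="unicode")
round_trip = base64.b64decode(ET.fromstring(xml_text).get("value"))
```

The encoded attribute values are exactly as unreadable to a human as
the ones in the example under point 2.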

Note also the fact that even the W3C are still designing *non-XML* 
formats for things.

  - XPath isn't written in XML.
  - CSS isn't written in XML.
  - In SVG, path expressions aren't written in XML.
  - In RDF, N3 notation is more expressive than the RDF/XML serialisation.

Reading the original specifications, XML was designed as a replacement 
for HTML that could be more easily harvested by machines, due to 
descriptive element names. The data interchange thing is just marketing 
hype...

http://www.w3.org/XML/:
"Extensible Markup Language (XML) is a simple, very flexible text format 
derived from SGML (ISO 8879). Originally designed to meet the challenges 
of large-scale electronic publishing, XML is also playing an 
increasingly important role in the exchange of a wide variety of data on 
the Web and elsewhere."

> Cheers, Chris.

ABS

-- 
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/categories/alaric/