[Box Backup] Win32 port (was: BoxBackup Server Side Management
Specs (Draft0.01))
Alaric Snell-Pym
boxbackup@fluffy.co.uk
Fri, 08 Oct 2004 23:19:38 +0100
Chris Wilson wrote:
>>How about a simple \n terminated line, with tab separated
>>message name and arguments? Nice and easy to parse, nice and easy to
>>write, extend by adding new messages or arguments.
>
> It lacks human-readability, extensibility and robustness. In order to
> understand the output when "parsing" by hand, you have to learn by heart
> what each column stands for. With XML, the columns are self-describing.
XML's subtly broken for data interchange applications - bear in mind it
was designed as a markup language, not a data interchange format!
Although the term "markup" seems to have been re-defined in hindsight to
mean data interchange, mainly due to people misunderstanding what Tim
Berners-Lee was going on about when he was basically explaining that XML
is meant to be more machine-processable than HTML, to simplify "screen
scraping" of web pages.
> How do you deal with third-party extensions to the protocol? With XML, you
> can just create a new namespace to hold those attributes and tags, and the
> client can ignore them. Any sensible solution to the problem is bound to
> be similar to XML's.
>
> How about if Box was forked one day, or independent, incompatible protocol
> changes made in the stable and development versions? Such changes would
> normally make the forks incompatible with each other, from the view of the
> client, because they would re-use the same columns for new purposes.
> Labelling columns makes this almost not an issue, and the same client
> could probably work with both forks, with maximal shared code, and minimal
> client complexity.
You're thinking here in terms of loosely-coupled systems. By design, the
box backup client and server are tightly coupled - they co-evolve.
What this means in practice is that if there is divergence between the
two halves, then lexical protocol incompatibility is the least of your
problems.
This is one of the flaws in XML's "extensibility" for data interchange.
For the original application - human-readable documents that allowed
easy extraction of parts of them for machine processing - the "ignore
elements you don't understand" paradigm was applicable, since Web pages
would often add extra stuff (adverts, special offers, navigation links)
alongside the core data that could then be ignored.
However, for protocols, you've no way of knowing if extra information is
critical or not. Look at the PNG image format - PNG chunks have a flag
bit stating whether they are critical to decoding the image or not; if a
decoder does not understand a critical chunk, then it must abort.
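
To make the PNG comparison concrete, here is a minimal sketch of that rule in Python. In PNG the flag is bit 5 of the first byte of the four-letter chunk type, so an upper-case first letter means "critical". The function names and the idea of passing in a set of known chunk types are made up for illustration; they are not taken from any PNG library.

```python
def is_critical_chunk(chunk_type: bytes) -> bool:
    """Return True if a 4-byte PNG chunk type is marked critical."""
    if len(chunk_type) != 4:
        raise ValueError("PNG chunk types are exactly 4 bytes")
    # Bit 5 (value 0x20) of the first byte is the "ancillary" bit:
    # 0 means critical (upper-case letter), 1 means ancillary (lower-case).
    return (chunk_type[0] & 0x20) == 0


def handle_chunk(chunk_type: bytes, known: set) -> str:
    """Decide what a decoder should do with a chunk (hypothetical helper)."""
    if chunk_type in known:
        return "process"
    if is_critical_chunk(chunk_type):
        # Unknown but critical: the decoder cannot safely continue.
        raise ValueError("unknown critical chunk %r" % chunk_type)
    # Unknown and ancillary: safe to ignore.
    return "skip"
```

An unknown "tEXt" chunk would be skipped, while an unknown "IDAT" chunk would make the decoder give up. XML's "ignore what you don't understand" rule has no equivalent of the second case.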
> For robustness, there are many special cases to deal with, in terms of
> data encapsulation. For example, filenames can contain tabs, newlines and
> null bytes, all of which need to be escaped. The escape character itself
> needs to be escaped. Once you have dealt with that, the reader needs a
> parser of considerable complexity. I would not be at all sure that I had
> got it right, especially not without the extra syntactic sugar of the tag
> <start> and </end> markers.
XML has just the same complexity. Except that it can't actually
represent some of the characters that might turn up in file names (even
using numeric character entities), such as nulls, so you'd have to
base64-encode the file names anyway (see the XSD base64Binary type,
defined exactly for this).
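
For comparison, the escaping needed by the tab-separated line format could look roughly like the sketch below. It is Python, and it works on text strings rather than raw bytes for brevity. The backslash escape scheme and the function names are assumptions made for illustration, not part of any Box Backup protocol.

```python
# Hypothetical escape scheme: backslash plus one letter for each special byte.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r", "\0": "\\0"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r", "0": "\0"}


def escape_field(text):
    """Replace tab, newline, CR, NUL and backslash with two-character escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def unescape_field(text):
    """Inverse of escape_field; rejects unknown or truncated escapes."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)  # None if the field ends after a backslash
        if code not in _UNESCAPES:
            raise ValueError("bad escape sequence in field")
        out.append(_UNESCAPES[code])
    return "".join(out)


def encode_message(name, *args):
    """Build one newline-terminated line: message name, then tab-separated args."""
    return "\t".join(escape_field(f) for f in (name,) + args) + "\n"


def decode_message(line):
    """Split a line produced by encode_message back into its fields."""
    if not line.endswith("\n"):
        raise ValueError("message line must end with a newline")
    return [unescape_field(f) for f in line[:-1].split("\t")]
```

Because every raw tab and newline inside a field is escaped before joining, the reader can split on tabs first and unescape afterwards, and a file name containing tabs, newlines, nulls or backslashes survives the round trip.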
> I'm having difficulty understanding whether your objection to XML is
> philosophical or practical. Myself, I think that its practical benefits
> far outweigh its disadvantages, in all but the most extreme circumstances
> (tiny systems, huge systems, huge data volume; of which this protocol
> needs none), and I don't understand a philosophical objection. XML became
> popular for a reason, not because it was popular.
Personally, I think it became popular because it was hyped by a lot of
software companies as the ultimate solution to data interchange, in
front of a lot of people (Web developers who had little network
programming experience outside of the Web arena) who hadn't ever seen a
proper data interchange format before so sucked up the hype...
> I really dislike over-simple protocols that make too little concession to
> being easy for both humans and machines to read, future-proofing or
> robustness. A little effort at design time makes it possible to satisfy
> all these criteria. The only happy medium that I have found is XML (and
> YAML, but it's almost the same as XML). Everybody is using it. It's not
> big or ugly or heavy if you do it right. Good libraries are available.
> What's not to like?
XML makes for large files, and the parser libraries are large and slow
compared to simpler mechanisms...
XML has lots of problems for data interchange. Off the top of my head:
1) The character encoding issue. A valid XML file might be supplied in
any of a wide range of character sets, including EBCDIC! Parsers are
only required to support UTF-*, but encoders are still allowed to
produce whatever charset they feel like. Parsers have to implement a
complex algorithm to decode the character set. Because of this, the
text/xml MIME type is now being officially deprecated, since the complex
interactions between this and MIME text/* character set support were
deemed too messy to fix.
2) There are characters that are banned in XML text content. This is
because it's designed for marking up human readable text for display,
not for data interchange - and the display semantics of characters like
"bell" and "shift in" aren't defined. This caused problems for a lot of
companies that rushed out and produced XML interfaces to their databases
- the databases could contain characters that XML cannot express, so
they ended up naively creating non well-formed documents! What they
should have done is to base64-encode all element content that cannot be
guaranteed to be constrained to valid characters, but that would have
made a mockery of their "human readable!" claims...
<queryResult>
<row>
<field name="bmFtZQ==" value="QWxhcmljIEIuIFNuZWxsLVB5bQ==" />
</row>
</queryResult>
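
As a rough illustration, the sketch below (Python, standard library only) decodes the base64 attributes in the example above and checks that character references to NUL or "bell" are rejected as not well-formed by the expat-based parser that ships with Python. The helper names are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

DOC = """<queryResult>
<row>
<field name="bmFtZQ==" value="QWxhcmljIEIuIFNuZWxsLVB5bQ==" />
</row>
</queryResult>"""


def decode_fields(xml_text):
    """Return (name, value) byte-string pairs from base64-encoded attributes."""
    root = ET.fromstring(xml_text)
    pairs = []
    for field in root.iter("field"):
        name = base64.b64decode(field.get("name"))
        value = base64.b64decode(field.get("value"))
        pairs.append((name, value))
    return pairs


def is_well_formed(xml_text):
    """True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Here decode_fields(DOC) recovers the field name "name" and the value "Alaric B. Snell-Pym", neither of which a human could read from the document itself, and is_well_formed("<f>&#0;</f>") comes back False because XML 1.0 has no way to express that character.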
3) Debate over schema languages. Do you go for the verbose hell of W3C
XML Schema, or adopt Relax NG which is sensibly designed and nice to
use, but not as widely accepted?
4) XML parsers are prone to DoS attacks (intentional or accidental). For
example, one can reference arbitrary URIs as external entities (itself
requiring an XML parser to contain a URL parser and at least an HTTP
client...), and you can use pathological nested internal entities to
cause a parser to allocate huge amounts of memory, unless it contains
special resource usage limitation code that just adds complexity (see
"http://www.xs4all.nl/~irmen/comp/billionlaughs.xml").
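
The nested-entity problem is easy to quantify without parsing anything. The sketch below (Python; the function names and the "lol" entity naming are illustrative assumptions modelled on the well-known "billion laughs" document) builds such a document and computes how large the fully expanded text would be.

```python
def laughs_document(levels, fanout, seed="lol"):
    """Build an XML document whose entities each reference the previous one."""
    lines = ['<?xml version="1.0"?>', "<!DOCTYPE lolz ["]
    lines.append('<!ENTITY lol0 "%s">' % seed)
    for i in range(1, levels + 1):
        # Entity i is `fanout` references to entity i-1.
        refs = ("&lol%d;" % (i - 1)) * fanout
        lines.append('<!ENTITY lol%d "%s">' % (i, refs))
    lines.append("]>")
    lines.append("<lolz>&lol%d;</lolz>" % levels)
    return "\n".join(lines)


def expanded_size(levels, fanout, seed="lol"):
    """Number of characters the root element holds after full expansion."""
    return len(seed) * fanout ** levels
```

With 9 levels and a fan-out of 10, the document itself is under a kilobyte, but a parser that expands entities naively would have to produce 10^9 copies of "lol", about three billion characters. Do not feed the full-size document to a parser without expansion limits.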
Note also the fact that even the W3C are still designing *non-XML*
formats for things.
- XPath isn't written in XML.
- CSS isn't written in XML.
- In SVG, path expressions aren't written in XML.
- In RDF, N3 notation is more expressive than the RDF/XML serialisation.
Reading the original specifications, XML was designed as a replacement
for HTML that could be more easily harvested by machines, due to
descriptive element names. The data interchange thing is just marketing
hype...
http://www.w3.org/XML/:
"Extensible Markup Language (XML) is a simple, very flexible text format
derived from SGML (ISO 8879). Originally designed to meet the challenges
of large-scale electronic publishing, XML is also playing an
increasingly important role in the exchange of a wide variety of data on
the Web and elsewhere."
> Cheers, Chris.
ABS
--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/categories/alaric/