dtRdr::doc::Book::whitespace - issues with whitespace
Weird things happen when whitespace doesn't count, but sort of counts.
The annotations depend on a reliable character position, which can be very different from the byte offset due to character encoding and whitespace collapsing. Thus, we have to establish conventions for whitespace which can be consistently applied in all of these situations.
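To make the difference concrete, here is a minimal Python sketch (not the project's Perl code). The helper name char_position and the sample string are made up for illustration; the sketch only shows that a byte offset, a raw character offset, and a whitespace-collapsed character offset can all disagree for the same word.

```python
import re

def char_position(text: str, needle: str) -> int:
    """Character offset of needle once every whitespace run is one space.

    Hypothetical helper for illustration only.
    """
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.index(needle)

# "cafe" with an accented e (two bytes in UTF-8), then messy whitespace.
raw = "caf\u00e9   au\n\tlait"

byte_offset = raw.encode("utf-8").index(b"lait")  # counts bytes
raw_char_offset = raw.index("lait")               # counts characters, whitespace as-is
char_offset = char_position(raw, "lait")          # counts characters after collapsing

print(byte_offset, raw_char_offset, char_offset)
```

Only the last of the three numbers is stable across re-encoding and re-indentation, which is why the conventions below are stated in terms of collapsed characters.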
All your spaces are belong to one position.
The general rule is that any amount of whitespace, whether spanning a tag or not, is treated as a single space character.
This becomes a little difficult with book formats that contain (rendered) nested content nodes. Because of these types of books, a position needs to be able to map from global to local so that the position in a parent can be calculated given the position in a child. See dtRdr::doc::Book::annotree for "math is fun."
As for whitespace, we have to adopt a convention that a space at the end or beginning of a node needs to belong somewhere. In these examples, I'll use square brackets to represent the opening and closing of node xml tags.
  [a[b][c[d]]]
  [a [b][c[d]]]
  [ a [b][c[d]]]
  [ a[ b][c[d] ] ]
The above are not intended to be necessarily equivalent. Just representative situations.
Because linebreaks and/or indentation from manual editing and/or conversion tools are so common, the situation almost always looks like this in reality.
[ a [ b ] [ c [ d ] ] ]
This should basically reduce into the following:
[a [b ][c [d ]]]
In this reduced form, no node starts with a space, and there are no consecutive spaces, regardless of tag boundaries.
This convention is important because it needs to be shared between the book base class (which does the annotation-insertion XML munging) and the individual book plugins (which build the annotation offset table to allow for position math).
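As a rough illustration of the two rules, here is a small Python sketch (not the project's Perl code) that works directly on the square-bracket notation used in these examples. The function name normalize is hypothetical, and the sketch assumes that "[" and "]" only ever appear as node boundaries.

```python
def normalize(notation: str) -> str:
    """Reduce bracket notation so that no node starts with a space and
    no two spaces are consecutive, even across tag boundaries.

    Illustrative only: '[' opens a node, ']' closes one, and everything
    else is text.
    """
    out = []
    prev_space = True  # pretend a space precedes the document: leading whitespace is dropped
    for ch in notation:
        if ch == "[":
            out.append(ch)
            prev_space = True   # a node may not start with a space
        elif ch == "]":
            out.append(ch)      # closing a tag does not end a whitespace run
        elif ch.isspace():
            if not prev_space:
                out.append(" ")
                prev_space = True
        else:
            out.append(ch)
            prev_space = False
    return "".join(out)

print(normalize("[ a [ b ] [ c [ d ] ] ]"))
```

Running it on the heavily spaced example yields the reduced form shown earlier, and running it on the reduced form changes nothing. Note that this sketch simply drops a space found at the start of a node; it makes no attempt to move that space into the parent.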
I still need to prove it, but I believe that even this should be equivalent to the canonical example above.
[ a[ b][ c[ d] ]]
And, to be pragmatic, this is not really worth chasing, since nested content nodes which are accessible both individually and from within the parent are an impossible-to-resolve-into-a-pagewise-reader concept.
Nonbreaking space is treated as a space and collapsed. Thus "&nbsp;" is equivalent to " ". This is because a given HTML widget may pass the space (e.g. from get_selection()) as either a plain space or 0xA0. (NOTE: if it passes 0xA0, the widget shim should replace it with a plain space at the get_selection() call.)
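A minimal Python sketch of that shim behavior follows (not the project's Perl code). The function name normalize_selection is hypothetical; it stands in for whatever cleanup a widget shim would apply to the string returned by get_selection().

```python
import re

NBSP = "\u00a0"  # U+00A0, what a widget may return instead of a plain space

def normalize_selection(text: str) -> str:
    """Replace nonbreaking spaces with plain spaces, then collapse runs.

    Hypothetical helper: assumes the selection arrives as a decoded string.
    """
    text = text.replace(NBSP, " ")
    return re.sub(r"[ \t\r\n]+", " ", text)

# The same selection, as two different widgets might report it.
from_widget_a = "two\u00a0words"
from_widget_b = "two words"
print(normalize_selection(from_widget_a) == normalize_selection(from_widget_b))
```

Doing the replacement before the collapse means a mixed run of plain and nonbreaking spaces still ends up as one position.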
Not sure how to deal with the other space-like characters yet.
http://en.wikipedia.org/wiki/Non-breaking_space
  U+00A0 -- is just &nbsp; / &#160;
  U+202F -- narrow (?)
  U+FEFF -- zero-width (but has issues)
  U+2060 -- word-joiner (replaces FEFF)
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
  ensp   U+2002 (8194) -- en space
  emsp   U+2003 (8195) -- em space
  thinsp U+2009 (8201) -- thin space
  zwnj   U+200C (8204) -- zero width non-joiner
  zwj    U+200D (8205) -- zero width joiner