The StartX Files: In-Depth With StarOffice Filters
Source: Linux Planet
By: Brian Proffitt
The Word to the Wise tour bus has officially pulled off to the side of the road while the driver begs the indulgence of the tourists to continue to examine the word processing juggernaut that is OpenOffice. It is not the norm to spend three columns' worth of time reviewing what is essentially one product among many, but I will submit that OpenOffice (and its near-clone StarOffice) is a product special enough to deserve the attention.
For those of you who are growing tired of this part of the tour, you are invited to visit the gift shop to purchase the lovely souvenirs. Before we begin, let me remind you that for the purposes of this column, unless I make specific references to each product, when I write "OpenOffice" I am referring to both StarOffice and OpenOffice.
Two weeks ago, I reviewed the StarOffice 6.0 Writer Beta, and last week I examined OpenOffice Build 638C's version of Writer. In both cases, I strongly emphasized (translation: gushed like a gossipy spinster about) the fact that OpenOffice had finally managed to put together a Word document filter that carried over all of the information stored within a Word document, beyond the straight text: formatting, annotations, comments, and revision marks. This is an exceptional breakthrough, because it allows full collaboration between Those Who Use Windows and Those Who Use Linux.
Admittedly, though, I may have been caught up in my own excitement just a wee bit too much. I feel obligated to back away from this cool technology to tell you that since last week's column, I have learned quite a bit more about how these filters work within OpenOffice. It seems that the filtering tools are not perfect in what they pull over from Word to OpenOffice. Formats and styles in complex Word documents are not flawlessly brought across in every case.
And there is a tendency for documents saved in Word format within OpenOffice to actually become larger in file size than the original Word document itself--even if no changes were made in the OpenOffice file.
These two problems were brought to my attention by diligent reader Neil Cohen, who stopped using his Windows machine at his Cisco office in favor of Linux some months back, and turned to OpenOffice to collaborate with his Windows-using colleagues. Cohen indicated that some of the documents used in his office can carry very complex templates and formats, which OpenOffice does not always display properly. I did not notice this during my tests because the documents I was moving from Word to OpenOffice, while very heavy in comments and revisions, were not very complex in terms of formatting.
The second concern, that of ballooning file size, I completely missed during my reviews. I could offer some convenient excuse, like my "l" and "s" keys weren't working or the ever-popular "my Linux machine was hacked," but the simple truth is, I neglected to check this myself out of sheer brain-fade. But now that Mr. Cohen has brought it to my attention, I want to clear up some of the mystery surrounding these phenomena.
Of Bugs and Features
To get answers on why users were experiencing problems with Word filters in OpenOffice Writer, I decided to go to the source itself and dropped a line to Juergen Pingel, who is the project owner for the word processor project in OpenOffice. He kindly referred me to Dublin resident Caolán McNamara, who is the filters developer for Writer.
McNamara answered my questions about what exactly was going on with these issues and even pointed me to a pretty good document explaining how the overall filtering process works. He did this rather than describe the process himself, which he characterized as "a set of incredibly complex mind-numbing processes to find text, graphics, and attributes in a Word document to map them to our own document features."
Still, if you have an interest in this sort of thing, I recommend you go peruse that filtering document. It's a good read.
When I asked McNamara about loss of formatting in Word documents, he was understandably hesitant to speak directly to the problem, since he had not seen the documents I was referring to. But he did say the cause was likely to be one of three main possibilities. First off, he conceded that formatting loss could be caused by a bug in the filtering process, since the sheer complexity of mapping document features from one format to another could lead to the occasional error.
Secondly, he stated that formatting loss or changes would occur in a situation where Word had a document feature that OpenOffice did not yet support. If there is nothing to map a feature to, then the next best alternative will be displayed.
Finally, the formatting changes could be a conscious decision on the part of the OpenOffice team. "Sometimes a Word layout misfeature... is so alien to common sense that it just isn't supported," he explained.
McNamara was quick to point out that whenever a filtering problem is discovered, users are strongly encouraged to report their findings and attach the Word document in question to the report. This kind of reporting leads to a faster resolution of the problem, no matter what the cause.
As for file size, there is a simple explanation for why OpenOffice-saved Word documents can grow so much. McNamara explained that when OpenOffice saves to Word format, all text is written out in 16-bit Unicode. This translates into two bytes per character. This is quite different from the way Word itself saves documents.
"We always save to Unicode on export, but in Word if it detects that you are saving an English document (or some other language it feels safe in doing so with) it will save in 8-bit mode," McNamara explained. "Word can store its text in a number of separate pieces and has the ability to save some chunks in 8 bit and some in 16 bit, which gets complex and nasty. We avoid the whole can of worms and save as straightforward 16 bit throughout when saving to [Word] 97 and above."
Word's mixed storage method does shrink the file size of Word documents, but the process of scanning a document for characters that are safe to save in 8-bit mode takes up more time. By saving consistently in 16-bit Unicode mode, the OpenOffice developers have traded disk space for speed and simplicity.
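For the curious, here is a minimal Python sketch of that trade-off. It is an illustration only, not the actual Word or OpenOffice export code: the function names are made up for this column, and Windows-1252 is simply an assumed stand-in for "an 8-bit character set."

```python
# Illustrative sketch only. The function names are hypothetical and
# Windows-1252 (cp1252) is an assumed example of an 8-bit character set.

def fits_in_8_bit(text):
    """Scan the text to see whether every character has an 8-bit form."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


def encoded_sizes(text):
    """Return (size at one byte per character, size at two bytes per character).

    Raises UnicodeEncodeError if the text cannot be stored in 8 bits.
    """
    eight_bit = len(text.encode("cp1252"))
    # utf-16-le writes no byte-order mark; each of these characters takes 2 bytes.
    sixteen_bit = len(text.encode("utf-16-le"))
    return eight_bit, sixteen_bit


sample = "Plain English text doubles in size at 16 bits per character."
if fits_in_8_bit(sample):
    narrow, wide = encoded_sizes(sample)
    print(narrow, "bytes in 8-bit mode versus", wide, "bytes in 16-bit mode")
```

The check in fits_in_8_bit is the extra scanning step described above. Always writing 16-bit text skips that step and pays for it with twice the bytes for plain English text.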
There are other techniques the OpenOffice developers use that could inflate the file size even more. The OpenOffice developers, it seems, are a conservative bunch of folk and they refuse to take shortcuts when saving formatting properties for a document.
"There are a few other cases e.g., table properties where we take a conservative approach and explicitly write out all the necessary properties to regenerate it, but where Microsoft may use some (dubious) optimizations to avoid writing part of the properties," McNamara explained. "Again we take a safe, guaranteed to work approach."
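A rough way to picture the difference is sketched below in Python. This is my own simplified illustration, not the real Word table format and not a description of what Microsoft's optimizations actually are: the property names, the default values, and the idea of skipping values that match a default are all assumptions made for the example.

```python
# Illustrative sketch only. The property names, the defaults, and the
# "skip anything equal to a default" shortcut are all hypothetical.

DEFAULTS = {"alignment": "left", "border_width": 1, "cell_padding": 2}


def write_all(props):
    """Conservative approach: write every property, defaults included."""
    merged = {**DEFAULTS, **props}
    return ";".join(f"{key}={merged[key]}" for key in sorted(merged))


def write_non_defaults(props):
    """Shortcut approach: leave out anything that matches a default."""
    return ";".join(
        f"{key}={value}"
        for key, value in sorted(props.items())
        if DEFAULTS.get(key) != value
    )


table = {"alignment": "center"}
print(write_all(table))
print(write_non_defaults(table))
```

The conservative version is longer, but a reader of the file never has to know what the defaults were in order to rebuild the table.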
Graphics are another area where file sizes can get big in a hurry, McNamara added.
"We store each graphic separately on saving to Word but Word itself may not save a graphic twice if it is used twice. But this is another reasonably complex thing to implement which isn't all that common in the general view. There are only a few border cases where the same graphic is used a large number of times in one document."
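To see how much that matters, here is a small Python sketch that compares the two strategies. It is only an illustration under my own assumptions: McNamara did not describe how Word detects repeated graphics, so the content hash used here is a stand-in technique, and the function names and byte strings are invented for the example.

```python
import hashlib

# Illustrative sketch only. Hashing the image bytes is an assumed way to
# spot repeats; the function names and sample data are hypothetical.


def size_stored_separately(graphics):
    """Every use of a graphic is written out in full."""
    return sum(len(data) for data in graphics)


def size_deduplicated(graphics):
    """Each distinct graphic is written once, however often it is used."""
    unique = {}
    for data in graphics:
        unique.setdefault(hashlib.sha256(data).hexdigest(), data)
    return sum(len(data) for data in unique.values())


logo = b"LOGO" + bytes(1000)    # stand-in for a 1,004-byte image
chart = b"CHART" + bytes(500)   # stand-in for a 505-byte image
document_graphics = [logo, chart, logo, logo]

print(size_stored_separately(document_graphics), "bytes stored separately")
print(size_deduplicated(document_graphics), "bytes with repeats stored once")
```

With only one or two repeats the savings are modest, which fits McNamara's point that the extra complexity pays off only in a few border cases.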
But McNamara also went on to explain how in the overall scheme of things, saving Word documents in Word format with OpenOffice might decrease the overall file size of the typical Word user's documents.