HTML Documentation UTF8-Latin1 errors

Please use this forum to report bugs.

Postby Serge » Tue 20 Mar 2018 17:12

Hello,

I think there is a mismatch in the HTML documentation generation between UTF8 and ISO-8859-1.

First my configuration :
    * BOUML release 7.2
    * Qt 4.8.0
    * Windows 10 64 bits

Tools to check the results :
    * Notepad++
    * Hexedit (to confirm byte encoding)

Project character encoding setting is UTF8 ( key "gxmi encoding" value "UTF-8" in project file).
However it's impossible to use accented characters in entity names. But this is not a real problem.
In description sections, there is no problem.

Tested character : é (one can also use "à", "è", "ç" or any character encoded with at least 2 bytes in UTF-8)
    * UTF8 code : C3 A9 (2 bytes)
    * latin1 code (ISO-8859-1) : E9 (single byte)
In project file (UTF8 encoded) "é" character is correctly encoded (C3 A9).
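
As a quick cross-check of the byte values quoted above, here is a minimal Python sketch (Python is not part of the original report, it is only used for illustration). It prints the bytes of "é" in both encodings; note that the ISO-8859-1 byte is E9.

```python
# Byte values of "é" (U+00E9) in the two encodings discussed in this thread.
char = "é"

utf8_bytes = char.encode("utf-8")
latin1_bytes = char.encode("iso-8859-1")

print(utf8_bytes.hex(" ").upper())    # C3 A9
print(latin1_bytes.hex(" ").upper())  # E9
```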


First test : Generate as UTF8
HTML files start with : <?xml version="1.0" encoding="UTF-8"?>
In the bodies, "é" is wrongly encoded with 4 bytes (C3 83 C2 A9).

Second test : Generate as latin1
HTML files start with : <?xml version="1.0" encoding="ISO-8859-1"?>
In the bodies, "é" is encoded with 2 bytes (C3 A9), which is UTF8, so the browser displays "Ã©" because it assumes the character is ISO-8859-1 encoded.
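
Both observations can be reproduced with a short Python sketch, under one assumption that the report itself does not confirm: that the generator reads the stored bytes C3 A9 as ISO-8859-1 text before writing the HTML. The helper name `hexdump` is invented for this example.

```python
# Assumption: the generator interprets the stored bytes as ISO-8859-1 text.
stored = bytes.fromhex("C3 A9")        # "é" as found in the project file
as_read = stored.decode("iso-8859-1")  # becomes two characters: "Ã" and "©"

# First test: output written as UTF-8, so each character is encoded again.
out_utf8 = as_read.encode("utf-8")
# Second test: output written as ISO-8859-1, so the bytes pass through as-is.
out_latin1 = as_read.encode("iso-8859-1")


def hexdump(data: bytes) -> str:
    """Format bytes as upper-case hex pairs separated by spaces."""
    return data.hex(" ").upper()


print(hexdump(out_utf8))    # C3 83 C2 A9
print(hexdump(out_latin1))  # C3 A9

# What a browser shows when it trusts the declared encoding of each file.
print(out_utf8.decode("utf-8"))         # Ã©
print(out_latin1.decode("iso-8859-1"))  # Ã©
```

Under that assumption both generated files display the same two wrong characters, which matches the 4-byte and 2-byte sequences seen in the two tests.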

Remarks
Somewhere in the whole process, I think one step expects an "ISO-8859-1" encoded string but receives a "UTF-8" encoded one when you generate the HTML documentation in "UTF-8", and the opposite when you generate it using "latin1".
Serge
 
Posts: 3
Joined: Mon 19 Mar 2018 18:13

Re: HTML Documentation UTF8-Latin1 errors

Postby Bruno Pagès » Tue 20 Mar 2018 18:14

Hello,
Serge wrote:...
Project character encoding setting is UTF8 ( key "gxmi encoding" value "UTF-8" in project file).

I don't know why the key "gxmi encoding" exists in the HTML generator project, probably a side effect, but this key is only used by the XMI generator, so it has no effect on the HTML generator.

Serge wrote:However it's impossible to use accented characters in entity names.

Yes, only a few parts, like the descriptions, allow accented characters.

It would have been better to remove that restriction when I moved from Qt3 to Qt4 for instance, but the restriction is still present :oops:

Serge wrote:In project file (UTF8 encoded) "é" character is correctly encoded (C3 A9).

:shock: I never encode characters in UTF8 in the project files. This means these two bytes were not introduced by an edit made inside BOUML, but by editing a project file outside BOUML, or maybe through a plug-out that modifies the project, such as a reverse or an XMI import. This is not expected: only latin1 characters must be used.

If I create a dedicated project "accent", just set the description of the package-project "accent" to "aaaéaaa", save it, then load accent.prj with HexEdit and search for "aaa", I find the hex codes 61 61 61 E9 61 61 61, so "é" is encoded in ISO-8859-1/latin-1 as expected.

Using the project "accent", if I generate HTML in flat mode asking for UTF8 (button "yes" in the corresponding dialog), the description in the file index.html is encoded as 61 61 61 C3 A9 61 61 61, and if I ask for latin1 (button "no" in the corresponding dialog) the description is encoded as 61 61 61 E9 61 61 61, so for me all is OK.
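
The expected behaviour described in this test can be sketched in Python as follows. This is only an illustration and not the generator's actual code; the function name `generate_description` and its `use_utf8` parameter are hypothetical stand-ins for the "yes"/"no" choice in the dialog.

```python
def generate_description(project_bytes: bytes, use_utf8: bool) -> bytes:
    """Hypothetical model: read latin1 project text, write it in the chosen encoding."""
    text = project_bytes.decode("iso-8859-1")
    return text.encode("utf-8" if use_utf8 else "iso-8859-1")


# Description "aaaéaaa" as stored in a latin1 project file.
project_bytes = "aaaéaaa".encode("iso-8859-1")

html_utf8 = generate_description(project_bytes, use_utf8=True)
html_latin1 = generate_description(project_bytes, use_utf8=False)

print(project_bytes.hex(" ").upper())  # 61 61 61 E9 61 61 61
print(html_utf8.hex(" ").upper())      # 61 61 61 C3 A9 61 61 61
print(html_latin1.hex(" ").upper())    # 61 61 61 E9 61 61 61
```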

For BOUML and the plug-outs, the sequence C3 A9 in your project file is not one UTF8 character but two latin1 characters, which are managed separately: when the HTML generator produces latin1 they are left unchanged, and when it produces UTF8 it encodes each of them, producing 2 times 2 bytes.
Author of Bouml
Bruno Pagès
 
Posts: 603
Joined: Mon 20 Feb 2012 09:23
Location: France

Re: HTML Documentation UTF8-Latin1 errors

Postby Serge » Mon 26 Mar 2018 17:01

Hello,

Sorry for the late answer, but I didn't receive any mail from the site when you answered. Is that normal? :o
Thanks for your reply 8-)

First of all, I never edited the BOUML project file with an external text editor. I only used Notepad++ and Hexedit (in read-only mode) to look at what was wrong with the doc generation.
I develop a lot with PHP and MySQL, and you get this type of error when you use UTF-8 encoding for PHP and MySQL but forget to tell the MySQL connector (mysql/mysqli) with "SET NAMES 'utf8';" just after the connection to the database has been opened. This is a very common mistake, and all "é" become "Ã©"...
In my environment settings, the external editor is set to Notepad++. I don't know if this has any effect here.

Bruno Pagès wrote:I never encode characters in UTF8 in the project files, this means these two characters wasn't introduced by an edition inside bouml,...

I did not mention that the environment setting for the charset was set to UTF-8 before I created my project.

    1 - You're right if it is set to ISO-8859-1.
    In this case accented characters are encoded with a single byte in the project file.

    2 - But if you set it to "UTF-8", which is my case, then they are encoded with 2 bytes (UTF-8), at least for the "éèêéàç" characters.
    So I don't agree with your statement. This is probably why the doc plug-out expects ISO-8859-1 but receives UTF-8, which explains the bug.

I did some other tests, and I saw that if you change the environment setting for the charset after the project has been created, the new charset is applied to new texts but the older ones are still encoded with the original charset. :!: :!: :!:
Of course, it makes no sense to change this during a project's life, but if you work on several projects from different sources with different settings, it can become a nightmare because BOUML knows nothing about the charset originally used :x

It would be a good improvement to store the charset used in the project file itself, set the environment setting accordingly and lock the setting on the current project.

Moreover storing it in the project file could help to generate the doc correctly.
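
Until the charset is recorded somewhere, a tool that reads such a project can only guess it for each text. The Python sketch below shows one common heuristic: try strict UTF-8 first and fall back to ISO-8859-1. It is not something BOUML does, as far as this thread shows, and the function names `guess_charset` and `decode_stored` are invented for this example.

```python
def guess_charset(raw: bytes) -> str:
    """Heuristic guess: 'ascii', 'utf-8' or 'iso-8859-1' for one stored text."""
    try:
        raw.decode("ascii")
        return "ascii"  # pure ASCII is identical in both encodings
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"  # any byte sequence decodes as latin1


def decode_stored(raw: bytes) -> str:
    """Decode a stored text using the guessed charset."""
    return raw.decode(guess_charset(raw))


# Texts written under different charset settings, as they could coexist in one project.
samples = [b"aaa\xe9aaa", b"aaa\xc3\xa9aaa", b"aaa"]
for raw in samples:
    print(guess_charset(raw), decode_stored(raw))
```

The guess can be wrong: the latin1 text "Ã©" is stored as C3 A9, which is also valid UTF-8 for "é". That ambiguity is one reason why storing the charset in the project file is more reliable than detecting it.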
Serge
 

Re: HTML Documentation UTF8-Latin1 errors

Postby Serge » Mon 26 Mar 2018 17:04

Serge wrote:Sorry for the late answer, but I didn't receive any mail from the site when you answered. Is that normal?


Sorry, I forgot to monitor the topic :?
Serge
 

