Discussion:
[developer] Review 4006 Certain printable unicode characters misclassified as nonprintable
Lauri Tirkkonen via illumos-developer
2014-10-15 08:41:11 UTC
Issues: https://www.illumos.org/issues/4006
https://www.illumos.org/issues/5227

Webrev http://www.niksula.hut.fi/~ltirkkon/webrev/4006/

I will note that this diff is *huge*, because it consists of importing
locale data that is correctly formatted for localedef. I'm not 100%
comfortable with this; it would be possible to do this conversion at
build-time to greatly reduce the size of the diff, but since I
implemented the conversion utility in Python3 [0], that would either add
a build-time dependency or require further work. However, since there is
a precedent for this kind of solution in localedef (commit
2da1cd3a39e2d3da7f9d15071ea9462919c011ac) I thought I'd ask what the
list thinks.

This changeset adds a script 'mkclasses.py' to convert data from the
Unicode Character Database (UCD) into the character classification data
format localedef expects in LC_CTYPE, and also imports that data into
the gate so that localedef can use it for all UTF-8 locales. In addition
I had to update the UTF-8.cm charmap file from CLDR because the latest
UCD data references characters that weren't present in the charmap
currently in the gate.
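
As a rough illustration of the conversion described above, the following is a minimal sketch and not the actual mkclasses.py. It assumes the input follows the UnicodeData.txt field layout (code point in field 0, general category in field 2). The category-to-class table and the <Uxxxx> symbol style are hypothetical; real symbolic names have to match whatever the charmap defines.

```python
# Hypothetical, simplified mapping from UCD general categories to POSIX
# LC_CTYPE class names; the real script may classify characters differently.
CATEGORY_TO_CLASSES = {
    "Lu": ["upper", "alpha"],
    "Ll": ["lower", "alpha"],
    "Lo": ["alpha"],
    "Po": ["punct"],
    "Zs": ["space"],
}


def parse_unicode_data(lines):
    """Yield (code point, general category) from UnicodeData.txt-style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        yield int(fields[0], 16), fields[2]


def build_classes(lines):
    """Group code points by the class names their category maps to."""
    classes = {}
    for code_point, category in parse_unicode_data(lines):
        for name in CATEGORY_TO_CLASSES.get(category, []):
            classes.setdefault(name, []).append(code_point)
    return classes


def format_class(name, code_points):
    """Render one class as a semicolon-separated list of symbolic names."""
    # Assumption: <Uxxxx> symbols; substitute the names the charmap really uses.
    symbols = ";".join("<U%04X>" % cp for cp in sorted(code_points))
    return "%s\t%s" % (name, symbols)


# A few entries copied from UnicodeData.txt for demonstration.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;",
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
    "LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
]

if __name__ == "__main__":
    for class_name, cps in sorted(build_classes(SAMPLE).items()):
        print(format_class(class_name, cps))
```

The sketch ignores the First/Last range entries that UnicodeData.txt uses for large blocks, which a complete converter would need to expand.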

This changeset does not touch case mapping data; that still comes from
the CLDR data files. While out of scope for this issue, that data might
also need some love.

Lastly I moved some code from mkwidths.py to utf8_util.py to facilitate
reuse, and regenerated widths.txt with the new UTF-8.cm (to verify that
that script still works after my changes).

[0]: I would have used Python 2, like mkwidths.py does, but the copies
readily available to me (those in OI and OmniOS) were what Python calls
"narrow builds", which limits the valid argument range of unichr [1].
Python 3 does not have this problem.
[1]: https://docs.python.org/2/library/functions.html#unichr
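
To illustrate footnote [0]: a narrow Python 2 build stores strings as 16-bit units, so unichr() rejects arguments above 0xFFFF with ValueError, and such characters have to be represented as surrogate pairs. The sketch below runs under Python 3 (3.3 or later is assumed for the length check); the helper name to_surrogate_pair is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Return the UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    cp = 0x1F600  # GRINNING FACE, outside the Basic Multilingual Plane
    # chr() accepts the full range in Python 3; unichr(cp) on a narrow
    # Python 2 build raises ValueError instead.
    ch = chr(cp)
    print(len(ch), ch.encode("utf-8").hex())
    print(["%04X" % unit for unit in to_surrogate_pair(cp)])
```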
--
Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet


-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175072-86d49504
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175072&id_secret=21175072-abdf7b7e
Powered by Listbox: http://www.listbox.com
Garrett D'Amore via illumos-developer
2014-10-15 12:54:44 UTC
Huh.

I had a version in tree of similar code, but using C instead of Python.
Also, I had intended to add the upper/lower case conversions.

Mostly this looks good, but I'm really unhappy about adding python3 -- we
don't use it *anywhere* else in our build. Let's hold off on this. I'll
supply a C version of the programs, and upper/lower conversions, if you
give me a little more time. (A few days at least.)


Lauri Tirkkonen via illumos-developer
2014-10-15 13:17:00 UTC
Post by Garrett D'Amore via illumos-developer
I had a version in tree of similar code, but using C instead of Python.
Also, I had intended to add the upper/lower case conversions.
I didn't because I wasn't entirely sure whether or not some case
mappings *are* locale dependent. POSIX ctype classification supposedly
isn't for UTF-8 locales.
Post by Garrett D'Amore via illumos-developer
Mostly this looks good, but I'm really unhappy about adding python3 -- we
don't use it *anywhere* else in our build. Let's hold off on this. I'll
supply a C version of the programs, and upper/lower conversions, if you
give me a little more time. (A few days at least.)
Sure, that sounds like a better solution. Let me know if I can help.
--
Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet


Gordon Ross via illumos-developer
2014-10-15 16:11:05 UTC
Why the preference for C over python for a build time tool?
Performance is not an issue, right?

Keith Wesolowski via illumos-developer
2014-10-15 16:21:45 UTC
Post by Gordon Ross via illumos-developer
Why the preference for C over python for a build time tool?
Performance is not an issue, right?
At least for me, the preference is for reducing, not increasing, the
number of build-time dependencies. The less you force people to have
lying around, the easier it is to build a complete and working proto
area. Many if not most people do not have python3 on their system, and
many have no python at all; those who do may not have or want to go
build and install the particular version we would require. It's really
better all around if the entire gate can be built starting with the
smallest essential set of tools practical. No way is python included in
that, and we shouldn't add new dependencies on it. If it were
unavoidable, I would insist on adding a copy of the specific version
required to usr/src/tools and adding the necessary glue to build the
tool before using it.


Garrett D'Amore via illumos-developer
2014-10-15 16:23:58 UTC
Agreed.

We *do* have python deps at build time (IPS damn you!), but nowhere do we
force version 3.

I would like to eliminate the python build dependencies that already exist.
Same for perl, btw.

Gordon Ross via illumos-developer
2014-10-15 16:35:25 UTC
Well, I respect your preference, though I'd point out that having
build-time dependencies in an interpreted language reduces the number
of build-time tools you need to compile and run "native".
Garrett D'Amore via illumos-developer
2014-10-15 16:37:48 UTC
That's true, but the tools that we run "native" are generally small, and
anyway the dependency graph is self-contained.

There is no "universal" interpreted language, except perhaps basic POSIX
shell. And indeed, some things can and should be expressed that way. But
for more complex processing, something else is called for.
Garrett D'Amore via illumos-developer
2014-10-15 18:47:10 UTC
Actually, going through Lauri's work, there are unfortunately a number of
errors. For example, it's actually an error to class anything except ASCII
digits 0-9 in the digit class, even though Unicode marks them as
numeric digits.

I'm working through this with a fine tooth comb. Stay tuned.
Lauri Tirkkonen via illumos-developer
2014-10-15 19:03:58 UTC
Post by Garrett D'Amore via illumos-developer
Actually, going through Lauri's work, there are unfortunately a number of
errors. For example, it's actually an error to class anything except ASCII
digits 0-9 in the digit class, even though Unicode marks them as
numeric digits.
Hmm, good catch. I thought that applied only to the POSIX locale, but
apparently the next two sentences make it pointless to specify that
character class in a locale definition file at all (since only <zero>
through <nine> can be specified and those are automatically included [0]).
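
The mismatch can be seen with Python's standard unicodedata module. This is only a sketch of the distinction, with made-up helper names: the UCD gives many characters the general category Nd (decimal digit), while the POSIX digit class admits only the ten ASCII digits.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def is_unicode_decimal_digit(ch):
    """True if the UCD general category is Nd (Number, decimal digit)."""
    return unicodedata.category(ch) == "Nd"


def is_posix_digit(ch):
    """True only for the ten ASCII digits allowed in the POSIX digit class."""
    return len(ch) == 1 and ch in ASCII_DIGITS


if __name__ == "__main__":
    # ASCII seven, ARABIC-INDIC DIGIT SEVEN, FULLWIDTH DIGIT SEVEN
    for ch in ("7", "\u0667", "\uff17"):
        print("U+%04X" % ord(ch), is_unicode_decimal_digit(ch), is_posix_digit(ch))
```

A converter that derives classes from general categories would therefore need to drop Nd characters from digit; which other class they should land in instead is not settled in this thread.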

I'm interested in hearing about any other mistakes I've made, too.
There are probably some more ambiguous cases, too.

[0]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01
--
Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

