[Bug c++/49973] New: Column numbers count special characters as multiple columns

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] New: Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

           Summary: Column numbers count special characters as multiple
                    columns
           Product: gcc
           Version: 4.5.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
        AssignedTo: [hidden email]
        ReportedBy: [hidden email]


int main()
{
    /* 中 */ asdf;
}

g++ -finput-charset=utf8 hello.cpp

hello.cpp: In function ‘int main()’:
hello.cpp:3:12: error: ‘asdf’ was not declared in this scope

The column number should be 10, not 12.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-08-04 10:27:53 UTC ---
Depends on how the column numbers are defined.  I think gcc uses bytes from the
beginning of the line, then 12 is correct (and e.g. for tab characters gcc
counts them as one instead of 1-8 depending on position too).
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #2 from Timothy Liang <timothy003 at msn dot com> 2011-08-04 10:36:53 UTC ---
(In reply to comment #1)
> Depends on how the column numbers are defined.  I think gcc uses bytes from the
> beginning of the line, then 12 is correct (and e.g. for tab characters gcc
> counts them as one instead of 1-8 depending on position too).

That isn't the case here.  Substituting the '中' for another character makes the
column number 10.  Setting -finput-charset=latin1 makes the column number 15.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #3 from Andreas Schwab <[hidden email]> 2011-08-04 10:55:13 UTC ---
Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #4 from Timothy Liang <timothy003 at msn dot com> 2011-08-04 11:03:32 UTC ---
(In reply to comment #3)
> Why 10? "    /* 中 */ " has 12 characters (and 14 bytes as utf8).

The four spaces is supposed to be a tab.  Also, the column number starts from
one.  So:

 /* 中 */ asdf
         |
1234567890

Since I set the input charset as UTF-8, g++ should count the '中' as one
character, not three.  And when I set it to latin1, g++ should count the '中' as
three, not six.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Manuel López-Ibáñez <manu at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |manu at gcc dot gnu.org

--- Comment #5 from Manuel López-Ibáñez <manu at gcc dot gnu.org> 2011-08-04 12:18:19 UTC ---
GNU Emacs 23.2.1 counts it as two, and puts the cursor at s.

For the simpler case of:

/* ñ */ asdf;

we print

test.c:3:10: error: ‘asdf’ was not declared in this scope

whereas emacs counts only 1 char, so it again puts the cursor at s. I am not
sure whether Emacs is following some GNU standard, but the case of ñ versus n,
should at least produce the same result.

Unfortunately, I don't have time to work on this.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #6 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2011-08-04 14:38:16 UTC ---
The GCS says "column numbers should start from 1 at the beginning of the
line ... Calculate column numbers assuming that space and all ASCII
printing characters have equal width, and assuming tab stops every 8
columns.".  This doesn't say how other characters should be counted,
although for the results of wcswidth seem appropriate.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Manuel López-Ibáñez <manu at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.08.04 15:18:20
     Ever Confirmed|0                           |1

--- Comment #7 from Manuel López-Ibáñez <manu at gcc dot gnu.org> 2011-08-04 15:18:20 UTC ---
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the
> line ... Calculate column numbers assuming that space and all ASCII
> printing characters have equal width, and assuming tab stops every 8
> columns.".  This doesn't say how other characters should be counted,
> although for the results of wcswidth seem appropriate.

Then GCC is not using wcswidth to count or it is setting the locale
inappropriately because it is counting 2 for ñ and 3 for 中, while it should be
1 and 2.
Reply | Threaded
Open this post in threaded view
|

[Bug c++/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #8 from Timothy Liang <timothy003 at msn dot com> 2011-08-04 19:52:31 UTC ---
(In reply to comment #7)

> (In reply to comment #6)
> > The GCS says "column numbers should start from 1 at the beginning of the
> > line ... Calculate column numbers assuming that space and all ASCII
> > printing characters have equal width, and assuming tab stops every 8
> > columns.".  This doesn't say how other characters should be counted,
> > although for the results of wcswidth seem appropriate.
>
> Then GCC is not using wcswidth to count or it is setting the locale
> inappropriately because it is counting 2 for ñ and 3 for 中, while it should be
> 1 and 2.

I'm confused.  Shouldn't 中 be 1?
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Tom Tromey <tromey at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tromey at gcc dot gnu.org
          Component|c++                         |preprocessor

--- Comment #9 from Tom Tromey <tromey at gcc dot gnu.org> 2011-12-07 17:59:52 UTC ---
(In reply to comment #6)
> The GCS says "column numbers should start from 1 at the beginning of the
> line ... Calculate column numbers assuming that space and all ASCII
> printing characters have equal width, and assuming tab stops every 8
> columns.".  This doesn't say how other characters should be counted,
> although for the results of wcswidth seem appropriate.

Note that GCC also handles the tab case incorrectly here.
This shows up if you M-x next-error in Emacs in the case where
gcc emits column numbers and your source code includes tabs.\

Refiling this to preprocessor.
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #10 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2011-12-07 20:56:01 UTC ---
On Wed, 7 Dec 2011, tromey at gcc dot gnu.org wrote:

> Note that GCC also handles the tab case incorrectly here.

Yes, GCC should be fixed to follow the GCS there as well.

The GCS now explicitly say "For non-ASCII characters, Unicode character
widths should be used when in a UTF-8 locale; GNU libc and GNU gnulib
provide suitable @code{wcwidth} functions."
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #11 from Manuel López-Ibáñez <manu at gcc dot gnu.org> ---
This should be fixed in libcpp, probably in lex.c, but maybe other places also.
A good testcase to start with would be:

/* ñ /* */
/* a /* */

cc1 -Wcomment

prog.cc:1:7: warning: "/*" within comment [-Wcomment]
 /* ñ /* */
       ^
prog.cc:2:6: warning: "/*" within comment [-Wcomment]
 /* a /* */
      ^

Both locations should point to column 6. Look for places where column info is
converted to source_location (linemap_position_for_column or
linemap_position_for_line_and_colum). Figure out where the column info got the
wrong value. Use something like wcwidth() to measure the width of non-ASCII
characters and get the right width of 'ñ'.

Unfortunately, GCC 6 seems to be broken for the above testcase
(https://gcc.gnu.org/PR69664). Revision 229665 seems fine.
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count special characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Eric Gallager <egallager at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |egallager at gcc dot gnu.org
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=69664

--- Comment #12 from Eric Gallager <egallager at gcc dot gnu.org> ---
(In reply to Manuel López-Ibáñez from comment #11)

> This should be fixed in libcpp, probably in lex.c, but maybe other places
> also. A good testcase to start with would be:
>
> /* ñ /* */
> /* a /* */
>
> cc1 -Wcomment
>
> prog.cc:1:7: warning: "/*" within comment [-Wcomment]
>  /* ñ /* */
>        ^
> prog.cc:2:6: warning: "/*" within comment [-Wcomment]
>  /* a /* */
>       ^
>
> Both locations should point to column 6. Look for places where column info
> is converted to source_location (linemap_position_for_column or
> linemap_position_for_line_and_colum). Figure out where the column info got
> the wrong value. Use something like wcwidth() to measure the width of
> non-ASCII characters and get the right width of 'ñ'.
>
> Unfortunately, GCC 6 seems to be broken for the above testcase
> (https://gcc.gnu.org/PR69664). Revision 229665 seems fine.

Bug 69664 is fixed now.
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

Lewis Hyatt <lhyatt at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lhyatt at gmail dot com

--- Comment #13 from Lewis Hyatt <lhyatt at gmail dot com> ---
Created attachment 46882
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46882&action=edit
partial patch illustrating the intended approach

Hello-

I would like to help fix this up. I have another patch, hopefully being
approved soon, that will implement support for extended characters in
identifiers. If that sees some use, then this problem will become more
noticeable, so it would be nice to address the column numbering issue at the
same time.

It's worth noting that there is a related problem not specifically mentioned so
far in this bug report -- in addition to the column number, the caret output in
diagnostic-show-locus.c also goes to the same wrong column.

Attached for comment is a preliminary patch (no tests yet) that fixes the
column number only (not the caret). If the overall approach seems sound, I am
happy to flesh it out to handle all the fancy diagnostics too. I have a couple
questions about the implementation, though, which I thought worthwhile to clear
up before proceeding further.

Here is a useful test case for this:

=======
const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
=======

Currently it outputs:

t.cpp:1:65: error: ‘y’ was not declared in this scope
 const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
                                                                 ^

And indeed the column number and caret are both incorrect (but consistent with
each other). Not visible here is that the 'y' is properly colorized in the
source line. That works as-is because the ANSI color escapes should be inserted
at the proper byte, not the proper display column. So the machinery in
diagnostic-show-locus.c requires both the byte count and the logical column
count in order to both colorize the source line and put the caret at the right
place.

Therefore the approach I chose was to keep most things exactly as they are --
all locations are tracked via byte counts just as now. Then all that's
necessary is to compute the logical display column just when it's needed, at
the time the diagnostic is output. This seems tenable because
location_get_source_line() already lets us retrieve the line and work with it
efficiently. That's the idea I went with here to adjust the column number;
happy to hear any feedback before diving in to applying the same idea
throughout diagnostic-show-locus.c.

I have one other question though. This quick attempt uses wchar.h, namely the
mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to introduce a
dependency on those, which may be problematic for Windows, etc. (I borrowed the
same configure checks used in intl.c for that.) But also, this will always
convert multibyte input files using the user's locale, whereas GCC doesn't work
like this when lexing files -- it ignores the locale and rather defaults to
UTF-8 unless overridden by -finput-charset. It seems that in a perfect world
diagnostics parsing of the source should work the same way.

I feel like the most portable solution is just to use directly the necessary
code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth()
functionality, and tweak it for this purpose. It's in essence just a binary
search in a table. Basically I would convert the source line from the input
charset to UTF-8 the same way the file is read on original input (using the
facilities in libcpp/charset.c), and then I would just need a variant of
wcwidth() that, rather than dealing with wchar_t and locales, simply takes
UTF-8 input and returns the necessary width. Does this seem like a good
approach? Otherwise, if I keep the use of wchar APIs, I think it would be
necessary to make a temporary change to the locale when computing the display
column, as dictated by the input charset.

Note that this also brings up an unrelated question with -finput-charset. Say
right now, I have a file in latin1 encoding and it is processed with
-finput-charset=latin1, and then I compile it from a UTF-8 locale. If a source
line is output in diagnostics, currently it gets output in the latin1 encoding
and produces unreadable characters on the terminal. With changes along the
lines I propose in the previous paragraph, the diagnostics would end up being
output in UTF-8 (or whatever the current locale demands), which I think is
maybe an improvement as well.

Thanks for any suggestions.

-Lewis
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #14 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Sun, 15 Sep 2019, lhyatt at gmail dot com wrote:

> I feel like the most portable solution is just to use directly the necessary
> code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth()
> functionality, and tweak it for this purpose. It's in essence just a binary

Both of those use data generated in some way from Unicode data (stored in
the locale in the glibc case).

A standalone generator implementing UAX#11 rules for character width
should be straightforward (we'd need to check in the generator sources as
well as the generated table).

> search in a table. Basically I would convert the source line from the input
> charset to UTF-8 the same way the file is read on original input (using the
> facilities in libcpp/charset.c), and then I would just need a variant of

Yes, sources need to be processed consistently (converted from input
charset to UTF-8).  And then of course converted from UTF-8 to the locale
character set for the final output on the terminal (with some form of
graceful degradation if the source character set has characters that can't
be represented in the locale character set - extended identifiers
diagnostics use UCNs in that case, but I don't know if that's best in
general).

If a source line in the default -finput-charset=utf-8 case contains
non-UTF-8 bytes in strings or comments, we can't safely display them on
the terminal, so my inclination in such a case would be to treat such
bytes as width-1 and output them as '?'.
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #15 from Manuel López-Ibáñez <manu at gcc dot gnu.org> ---
(In reply to Lewis Hyatt from comment #13)
> I have one other question though. This quick attempt uses wchar.h, namely
> the mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to
> introduce a dependency on those, which may be problematic for Windows, etc.

Apart from Joseph suggestion, in a more general sense, there is no issue with
introducing a dependency on gnulib. It is a long-term goal to replace parts of
libiberty with gnulib: https://gcc.gnu.org/wiki/replacelibibertywithgnulib
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #16 from Lewis Hyatt <lhyatt at gmail dot com> ---
Thank you both for the feedback so far. Regarding the use of wcwidth(), one
thing I noticed is that glibc has a much different result than gnulib does, say
for instance emojis return width 2 in the former rather than 1. (Which seems
better based on what I can tell.) It seems that glibc has undergone a fair
amount of tweaking to match what applications expect and so what it provides is
not coming directly from parsing the Unicode specs, although that's probably
the bulk of it. But I wonder, perhaps this is a sign that it might be better to
just make use of glibc and not try to add in a third implementation to the mix?

In any case, the underlying source of wcwidth() could easily be changed as a
drop-in replacement so I guess it can also be decided later. The use of
mbrtowc() is the bigger problem, since this converts from the user's locale and
it needs to convert from what -finput-charset asked for (or else UTF-8)
instead.

I have a more or less fully-baked patch at this point, that fixes up all
diagnostics that I am aware of (changes mostly in diagnostic.c and
diagnostic-show-locus.c) to be multi-byte aware. That includes column numbers,
carets, annotations, notes, fixit hints, etc. The patch still ignores the
input-charset issue and uses mbrtowc(), so that is the last thing for me to add
before I think it is worth sharing. I was wondering if I could get some advice
as to where to start here please?

It seems that basically location_get_source_line() in input.c needs to return
the lines converted to UTF-8, since all parsing has been working with the lines
in this form, and all the byte offsets they populated rich_locations with, etc,
are relative to the converted data too. I am not sure what's the correct way
though for location_get_source_line() to know the value of the -finput-charset
option. Typically this is inspected from a cpp_reader object, but none is
available in the context where this runs, that I understand anyway. It seems
that in order to make use of the existing conversion machinery in
libcpp/charset.c, I need to have a cpp_reader instance available too.
Appreciate any suggestions here. Thanks!

-Lewis
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #17 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Tue, 17 Sep 2019, lhyatt at gmail dot com wrote:

> In any case, the underlying source of wcwidth() could easily be changed as a
> drop-in replacement so I guess it can also be decided later. The use of
> mbrtowc() is the bigger problem, since this converts from the user's locale and
> it needs to convert from what -finput-charset asked for (or else UTF-8)
> instead.

If __STDC_ISO_10646__ is defined, wchar_t is Unicode and so local code
converting from UTF-8 to wchar_t can be used (together with wcwidth from
libc if available).

If __STDC_ISO_10646__ is not defined, the encoding of wchar_t is unknown.  
Maybe in that case it's best to avoid libc's wcwidth (if any) and just use
a local implementation of wcwidth on the results of converting UTF-8 to
Unicode code points.
Reply | Threaded
Open this post in threaded view
|

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

rguenth at gcc dot gnu.org
In reply to this post by rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973

--- Comment #18 from Lewis Hyatt <lhyatt at gmail dot com> ---
(In reply to [hidden email] from comment #17)

> On Tue, 17 Sep 2019, lhyatt at gmail dot com wrote:
>
> > In any case, the underlying source of wcwidth() could easily be changed as a
> > drop-in replacement so I guess it can also be decided later. The use of
> > mbrtowc() is the bigger problem, since this converts from the user's locale and
> > it needs to convert from what -finput-charset asked for (or else UTF-8)
> > instead.
>
> If __STDC_ISO_10646__ is defined, wchar_t is Unicode and so local code
> converting from UTF-8 to wchar_t can be used (together with wcwidth from
> libc if available).
>
> If __STDC_ISO_10646__ is not defined, the encoding of wchar_t is unknown.  
> Maybe in that case it's best to avoid libc's wcwidth (if any) and just use
> a local implementation of wcwidth on the results of converting UTF-8 to
> Unicode code points.

It seems to erase a lot of complexity just to always use an internal wcwidth(),
so that's what I ended up doing. Patch was posted to
https://gcc.gnu.org/ml/gcc-patches/2019-09/msg01558.html for your
consideration. This one just addresses diagnostics, not the input-charset or
user locale conversion stuff. I will submit those separately after this one is
reviewed. Thanks!