Author Topic: Using ncursesw with unicode wide characters. (Read 30117 times)

Bear · « **on:** August 10, 2014, 06:42:51 PM »

Today's blog post at http://dillingers.com/blog/ is actually a cleaned-up and extended version of an old article I wrote on RGRD, which has also been picked up on roguebasin. In hopes that it is helpful, I'm going to paste/post it here as well.

Quote

The version of the popular ncurses library that handles wide characters, or Unicode, is surprisingly hard to get working correctly with C programs. This article is intended to be a checklist for developers so that they can effectively use the library. This is material I learned in programming a roguelike game, but it’s useful to everybody who wants to use ncurses with a full unicode repertoire.

As with most development articles, this will be a bit too specific in terms of platform. This article was written with respect to a Linux development platform running Debian Linux. To the extent that your platform is different, there are likely to be important things I don’t know about getting development on your platform working with this library.

First, you have to be using a UTF-8 locale (Mine is en_US.UTF-8; I imagine others will have different choices). Type ‘locale’ at a shell prompt to be sure.

Second, you have to have a term program that can display non-ASCII characters. Most of them can handle that these days, but there are still a few holdouts. rxvt-unicode and konsole, popular term programs on Linux, are both good.

Third, you have to use a console font which contains glyphs for the non-ASCII characters that you use. Again, most default console fonts can handle that these days, but it’s still another gotcha, and if you routinely pick some random blambot font to use on the console you’re likely to miss out.

Try typing a non-ASCII character at the console prompt just to make sure you see it. If you don’t know how to type non-ASCII characters from the keyboard, that’s beyond the scope of what’s covered here and you’ll need to go and read some documentation and possibly set some keyboard preferences. Anyway, if you see it, then you’ve got the first, second, and third things covered.

Fourth, you have to have ncurses configured to deal with wide characters. For most Linux distributions, that means: Your ncurses distribution is based on version 5.4 or later (mine is 5.9) but NOT on version 11. I have no idea where version 11 came from, but it’s definitely a fork based on a pre-5.4 ncurses version, and hasn’t got the Unicode extensions. Also, you must have the ‘ncursesw’ versions, which are configured and compiled for wide characters.

How this works depends on your distribution, but for Debian, you have to get both the ‘ncursesw’ package to run ncurses programs that use wide characters and the ‘ncursesw-dev’ package to compile them. The current package names are libncursesw5 and libncursesw5-dev.

But there’s an apparent packaging mistake where the wide-character dev package, ncursesw-dev, does not contain any documentation for the wide-character functions. If you want the man pages for the wide-character curses functions, you must also install ncurses-dev, which comes with a “wrong” version of ncurses that doesn’t have the wide-character functions. Don’t think too much about why anyone would do this; you’ll only break your head. The short version of the story is that you pretty much have to install ncurses, ncurses-dev, ncursesw, and ncursesw-dev, all at the same time, and then just be very very careful about not ever using the library versions that don’t actually have the wide character functions in them.

Fifth, your program has to call “setlocale” immediately after it starts up, before it starts curses or does any I/O. If it doesn’t call setlocale, your program will remain in the ‘C’ locale, which assumes that the terminal cannot display any characters outside the ASCII set. If you do any input or output, or start curses before calling setlocale, you will force your runtime to commit to some settings before it knows the locale, and then setlocale when you do call it won’t have all of the desired effects. Your program is likely to print ASCII transliterations for characters outside the ASCII range if this happens.
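
A minimal sketch of this fifth point, assuming a POSIX libc that provides nl_langinfo(CODESET). The helper names is_utf8_codeset and init_locale_utf8 are made up for illustration. The idea is to adopt the environment's locale first thing, then check that the resulting encoding really is UTF-8 so the program can complain before curses starts.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does a codeset name, as reported by
 * nl_langinfo(CODESET), mean UTF-8?  Some systems spell it "utf8". */
int is_utf8_codeset(const char *codeset)
{
    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Hypothetical helper: call at the very top of main(), before any I/O
 * and before initscr().  Returns 1 if the locale now in effect is UTF-8,
 * 0 otherwise. */
int init_locale_utf8(void)
{
    /* "" means: take the locale from LANG / LC_* in the environment. */
    if (setlocale(LC_ALL, "") == NULL)
        return 0;   /* the requested locale is not installed */
    return is_utf8_codeset(nl_langinfo(CODESET));
}
```

If init_locale_utf8() returns 0, the program is still in a non-UTF-8 locale (often plain ‘C’), and it is probably better to print a hint about the locale settings than to start curses anyway.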

Sixth, you have to #define _XOPEN_SOURCE_EXTENDED in your source before any library #include statements. The wide character curses functions are part of a standard called the XOPEN standard, and preprocessing conditionals check this symbol to see whether your program expects to use that standard. If this symbol is found, and you’ve included the right headers (see item Seven) then macroexpansion will configure the headers you include to actually contain definitions for the documented wide-character functions. But it’s not just the ‘curses’ headers that depend on it; you will get bugs and linking problems with other libraries if you have this symbol defined for some includes but not others, so put it before all include statements.

Unfortunately, the _XOPEN_SOURCE_EXTENDED macro is not mentioned in the man pages of many of the functions that won’t link if you don’t define it. You’d have to hunt through a bunch of not-very-obviously related ‘see also’ pages before you find one that mentions it, and then it might not be clear that it relates to the function you were interested in. Trust me, it does. Without this macro, you can use the right headers and still find that there are no wide-curses definitions in them to link to.

Seventh, you have to include the right header file rather than the one the documentation tells you to include. This isn’t a joke. The man page tells you that you have to include “curses.h” to get any of the wide-character functions working, but the header that actually contains the wide-character function definitions is “ncursesw/curses.h“. I hope this gets fixed soon but it’s been this way for several years so some idiot may think this isn’t a bug.

Eighth, you have to use the -lncursesw linker option (as opposed to the -lncurses option) when you’re linking your executable. Earlier versions of gcc contained a bug where -Werror and -Wall would cause linking to fail on the ncursesw library, but this appears to have been fixed.

Ninth, use the wide-character versions of everything, not just a few things. This is harder than it ought to be, because the library doesn’t issue link warnings to tell you that you are mixing functionality, and the documentation doesn’t specifically say which of the things it recommends won’t work correctly with wide characters. That means cchar_t rather than chtype, wide video attributes rather than standard video attributes, and setcchar rather than OR to combine attributes with character information.

Use cchar_t rather than chtype. cchar_t is a record type that contains colorpair information, video attributes, and a short unicode string. The only thing about this unicode string that affects your display is the first spacing character, which must also be the first character. So the rest of the string is pretty useless until someone implements a term program that handles unicode combining characters, but you still have to build a null-terminated unicode string to make a cchar_t.

Use the new WA_* video attributes rather than the older A_* video attributes. That is, WA_STANDOUT rather than A_STANDOUT, WA_UNDERLINE rather than A_UNDERLINE, and so on. The WA_* attributes are of the newly defined attr_t type and have their bits aligned correctly for using in cchar_t rather than chtype. On my platform, attr_t is an unsigned long int. If you have code that casts video attributes to or from int or short int, it will fail with wide video attributes.

Use get_wch rather than getch to get input from the keyboard. If the keyboard driver delivers unicode characters, you want the whole character rather than just the last 8 bits of it, right?

Use setcchar to combine character, wide video attributes and colorpair number together into a cchar_t. Your existing curses code probably uses logical OR. The documentation says you can use OR, but the documentation is talking about single ASCII characters, chtype and narrow attributes rather than unicode strings, cchar_t and wide attributes, and it will definitely do the wrong thing if you try to use it here. You can still use logical OR to combine wide video attributes, but don’t attempt to combine them with the character values or with narrow attributes. Note that the color pair number can be converted into a video attribute using the COLOR_PAIR(n) macro provided by ncurses, and can then be correctly combined with wide or narrow video attributes.

Now, if you jumped through all the hoops, you can compile and use an ncursesw application with support for Unicode characters.

mushroom patch · « **Reply #1 on:** August 11, 2014, 12:58:12 AM »

Hey, thanks for the article. I've been dabbling (very lightly =\) lately with this circle of ideas in connection with roguelikes. I've been particularly interested in what "tkatchev" is doing with incavead and his telnet client for it.

I've been fiddling around with unicode and wide unicode characters in the standard curses python module. It seems to work on my system without any particular effort (just encoding the characters and writing them to the terminal using the usual curses functions seems to work on gnome terminal). Do you think I should worry that this kind of thing will not work consistently on other systems/terminal emulators that have wide unicode support?

Bear · « **Reply #2 on:** August 11, 2014, 04:50:35 AM »

Nope. As far as I can tell all the hoops to jump through are on the development side. If you have a program that works with one unicode-enabled terminal, it should work the same way anywhere.

mushroom patch · « **Reply #3 on:** August 12, 2014, 03:01:08 AM »

Alright, that's reassuring. Do you know of any roguelike projects other than tkatchev's that use wide ASCII and other wide unicode characters for dungeon representation?

Kevin Granade · « **Reply #4 on:** August 12, 2014, 03:47:15 AM »

DDA is heavily integrated with Unicode, we use it for very nearly everything to support translation.

Wide characters on the dungeon map though, that's rough. Haven't even considered doing that.

Btw, the main gotchas other than library integration in my experience are making sure all your character handling is wide enough (e.g. use an unsigned int for a single character instead of a char) and making sure you use unicode-aware methods for string manipulation, such as checking length etc.
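
To illustrate the length gotcha, here is a minimal sketch in C. It is not taken from any project mentioned in this thread, and the function name utf8_codepoints is hypothetical. It counts code points in a UTF-8 string by skipping continuation bytes, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Count Unicode code points in a UTF-8 string.  Continuation bytes have
 * the bit pattern 10xxxxxx, so every byte that does NOT match that
 * pattern starts a new code point.  Assumes valid UTF-8 input. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the string "d\xC3\xA9j\xC3\xA0 vu" (that is, "déjà vu" in UTF-8), strlen reports 9 bytes while utf8_codepoints reports 7 characters. Note that even the code point count is not the display width: some characters occupy two terminal columns and combining marks occupy none.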

Article needs a bit of an update, I don't recall any issues with point 5, 6 or 8.
We call setlocale multiple times at runtime to dynamically change selected language and it works fine.
We don't define _XOPEN_SOURCE_EXTENDED and everything works great.
We use -Werror and -Wall and ncursesw with no issues. On the other hand, you actually have to use -lncursesw OR -lncurses, depending on the platform. Some platforms have decided to be helpful and make ncurses link against ncursesw without also aliasing ncursesw (Mac). Better systems (Linux) supply ncursesw5-config, which is invoked like pkg-config.

Most of the rest I agree with, though I wouldn't recommend cchar_t, seems really cumbersome.
The header file and library thing continues to be a pain, and getting things compiling on arbitrary systems (Windows and Mac) can be frustrating, no worse than gettext or SDL though.

Bear · « **Reply #5 on:** August 12, 2014, 03:02:09 PM »

On checking: You're right about -Wall -Werror no longer being a problem with -lncursesw. Yay. A couple or three years ago, that was a very mysterious thing which had me stymied and which I only discovered by accident - trying to make the simplest possible case where the bug that was biting me arose, I discovered things that worked when compiled via the command line but not via the makefile with my build options.

So I guess either the curses headers or gcc has been fixed.

And wide characters on the dungeon map is actually kind of nice. For example U+2592 (medium shade character) for dungeon walls that are lit and in view, U+2591 (light shade character) for walls that are known but not currently visible. Or U+2020 (dagger) for, well, daggers. Or U+2192 (rightwards arrow) for arrows. Or U+21F6 (three rightwards arrows) for a pile of arrows. And so on.
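
One way to keep such glyphs in one place is a small lookup table. This is only a sketch: the enum and the names tile_glyphs and glyph_for are hypothetical, and the code points are the ones listed above.

```c
#include <wchar.h>

/* Hypothetical tile kinds for a roguelike map. */
enum tile {
    TILE_WALL_LIT,
    TILE_WALL_REMEMBERED,
    TILE_DAGGER,
    TILE_ARROW,
    TILE_ARROW_PILE,
    TILE_COUNT
};

/* One Unicode code point per tile kind. */
static const wchar_t tile_glyphs[TILE_COUNT] = {
    [TILE_WALL_LIT]        = 0x2592, /* MEDIUM SHADE */
    [TILE_WALL_REMEMBERED] = 0x2591, /* LIGHT SHADE */
    [TILE_DAGGER]          = 0x2020, /* DAGGER */
    [TILE_ARROW]           = 0x2192, /* RIGHTWARDS ARROW */
    [TILE_ARROW_PILE]      = 0x21F6, /* THREE RIGHTWARDS ARROWS */
};

/* Return the glyph for a tile, or '?' for an unknown tile kind. */
wchar_t glyph_for(enum tile t)
{
    return ((unsigned)t < TILE_COUNT) ? tile_glyphs[t] : L'?';
}
```

The wchar_t that comes back can be placed in a two-element null-terminated array and handed to setcchar to build the cchar_t for that map cell.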

Bear · « **Reply #6 on:** August 12, 2014, 03:19:11 PM »

Quote from: Kevin Granade on August 12, 2014, 03:47:15 AM

We call setlocale multiple times at runtime to dynamically change selected language and it works fine.

That's expected, and it's no problem provided you get it into a UTF8 locale before you do any I/O or initialize curses. You can shift to other UTF8 locales arbitrarily after that.

Quote from: Kevin Granade on August 12, 2014, 03:47:15 AM

We don't define _XOPEN_SOURCE_EXTENDED and everything works great.

That is a feature of your build environment (defined by default) that is not also a feature of mine. Nor is it probably a feature of yours if you compile with -std=c99 or similar. It's still something people can trip on.

Quote from: Kevin Granade on August 12, 2014, 03:47:15 AM

We use -Werror and -Wall and ncursesw with no issues. On the other hand, you actually have to use -lncursesw OR -lncurses, depending on the platform. Some platforms have decided to be helpful and make ncurses link against ncursesw without also aliasing ncursesw (Mac). Better systems (Linux) supply ncursesw5-config, which is invoked like pkg-config.

Such help is appreciated, except insofar as it makes solutions that work in one place break in others. The emphasis is on using a library that you know has the wide-character definitions in it, and using the standard, macros, and compile options, whatever they are, that let those definitions be visible. If your build environment helps, that's nice. But then you're going to have different rules in effect on other build platforms, and you're going to have to figure them out.

Quote from: Kevin Granade on August 12, 2014, 03:47:15 AM

Most of the rest I agree with, though I wouldn't recommend cchar_t, seems really cumbersome.
The header file and library thing continues to be a pain, and getting things compiling on arbitrary systems (Windows and Mac) can be frustrating, no worse than gettext or SDL though.

cchar_t is the only data structure defined by ncurses that has room for a wide character and can be displayed at a particular location on the screen. It's the right answer for using on the map (in the place that a different build will replace with a graphical tile). Other than that you can position the cursor and write a wide-character string, but that's the right answer for text which would be text even in a graphical version, because with text you don't need such precise character-by-character control over its positioning.
