How to count Unicode glyphs in Vala using Gtk
Want to try something really scary this Halloween? Try counting the characters in a string in Vala.
For example, how many characters do you think there are in the following string?
var zombie = "🧟♀️️";
One?
You’re right!
So how would you go about getting this result in Vala? Use zombie.length, did I hear you say?
Oh, you poor, dear, naïve, sweet thing…
"🧟♀️️".length // 13
What’s that? What about char_count ()?
"🧟♀️️".char_count (); // 4
What do you think this is? Some toy language and standard library like Swift where you can just do zombie.characters.count and get the right answer (1) straight away?
This is Vala and Gtk, motherfucker, get ready for some pain!
So which is correct? 13, 4, or 1?
Yes.
(They’re all “correct” based on what question you happen to be asking and what you need to do with the answer.)
If we break it down, a female zombie (🧟♀️️) is, in UTF-8:
Count | Type | Details |
---|---|---|
1 | glyph | as a typographer would call it |
1 | grapheme cluster | as Unicode calls it |
4 | code points | 🧟 (U+1F9DF, bytes F0 9F A7 9F) + Zero-Width Joiner (U+200D, bytes E2 80 8D) + ♀ (U+2640, bytes E2 99 80) + Variation Selector 16 (U+FE0F, bytes EF B8 8F) |
13 | bytes | F0 9F A7 9F E2 80 8D E2 99 80 EF B8 8F |
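To make the table reproducible, here is a minimal sketch that builds the zombie from its four code points with GLib's StringBuilder, so the result doesn't depend on how your editor or browser stored the emoji. The build_zombie () helper is a made-up name for illustration, not part of GLib.

```vala
// Compile with something like: valac zombie.vala

// Hypothetical helper: assemble the female zombie from the four code points
// listed in the table above.
string build_zombie () {
    var builder = new StringBuilder ();
    builder.append_unichar (0x1F9DF); // zombie
    builder.append_unichar (0x200D);  // zero-width joiner
    builder.append_unichar (0x2640);  // female sign
    builder.append_unichar (0xFE0F);  // variation selector 16
    return builder.str;
}

void main () {
    var zombie = build_zombie ();

    // length counts UTF-8 bytes; char_count () counts code points.
    print ("bytes: %d\n", zombie.length);                // bytes: 13
    print ("code points: %d\n", zombie.char_count ());   // code points: 4

    // Walk the string one code point at a time.
    int index = 0;
    unichar c;
    while (zombie.get_next_char (ref index, out c)) {
        print ("U+%04X\n", (uint) c);
    }
}
```

Neither number is the 1 a human expects, which is the whole problem.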
How to count (ha, ha, ha) in Vala
First off, let me tell you, I still don’t know what the equivalent of the Swift one-liner zombie.characters.count is in Vala/GLib. But after several days of banging my head against a wall, I figured out a way to count glyphs properly using a Gtk.TextIter with a Gtk.TextBuffer [1]:
public int glyph_count (string text) {
    var text_buffer = new Gtk.TextBuffer (null);
    text_buffer.set_text (text);

    Gtk.TextIter glyph_count_iter;
    text_buffer.get_start_iter (out glyph_count_iter);

    var glyphs = 0;
    while (!glyph_count_iter.is_end ()) {
        glyphs++;
        glyph_count_iter.forward_cursor_position ();
    }

    return glyphs;
}
Using this function, you can now do:
glyph_count ("🧟♀️️"); // 1
And get the result approved by most human beings who think like human beings and not like computers.
Iterators and cursors are your friend
When working with Unicode in Vala/Gtk, iterators and cursors are your friends, so remember to always work with cursor positions and not characters.
To add to the mindfuck, you don’t always get a cursor position when you ask for one. Sometimes you get a “character” offset instead.
For example, the cursor_position property on a Gtk.TextBuffer instance isn’t a cursor position, it’s a “character” offset. (Because they hate you and your little dog too.) So you cannot compare the cursor position from a text buffer directly with the count you get from the glyph_count () function above. The former counts Vala “characters” (Unicode code points) while the latter counts glyphs (what humans call characters).
Imagine the following is a string in a Gtk.TextView and the cursor is at the end of the string:
123🧟♀️️5│
In that situation:
glyph_count (text_buffer.text) // 5
text_buffer.cursor_position // 8 (3 + 4 code points for the zombie + 1)
So we can’t use the cursor_position property directly when we’re dealing with glyphs.
What we can do is to create and use an iterator based on the “character” offset stored in the cursor_position property:
var cursor_position = text_buffer.cursor_position;
Gtk.TextIter cursor_position_iter;
text_buffer.get_iter_at_offset (out cursor_position_iter, cursor_position);
Then, using its compare () method we can find out where in the string it is.
For example, to check if the cursor is within the first line of text as a person types in a Gtk.TextView:
var text_view = new Gtk.TextView ();
var text_buffer = new Gtk.TextBuffer (null);
text_buffer.set_text ("First line\nSecond line");
text_view.set_buffer (text_buffer);

// end_user_action is a signal on the buffer, not on the view.
text_buffer.end_user_action.connect (() => {
    // Iterator at the end of the first line (line 0).
    Gtk.TextIter end_of_first_line_iter;
    text_buffer.get_iter_at_line (out end_of_first_line_iter, 0);
    end_of_first_line_iter.forward_to_line_end ();

    // Iterator at the cursor, built from the "character" offset.
    var cursor_position = text_buffer.cursor_position;
    Gtk.TextIter cursor_position_iter;
    text_buffer.get_iter_at_offset (out cursor_position_iter, cursor_position);

    // compare () returns a negative number, 0, or a positive number.
    if (cursor_position_iter.compare (end_of_first_line_iter) <= 0) {
        print ("Cursor is on the first line!\n");
    }
});
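The same iterator trick can turn the cursor’s “character” offset into a glyph offset that you can compare with glyph_count (). What follows is only a sketch: cursor_glyph_offset () is a name I made up, and it assumes the cursor never sits in the middle of a glyph.

```vala
// Compile with something like: valac --pkg gtk+-3.0 example.vala

// Hypothetical helper: how many glyphs are in front of the cursor?
public int cursor_glyph_offset (Gtk.TextBuffer buffer) {
    // Iterator at the cursor, built from the "character" offset.
    Gtk.TextIter cursor_iter;
    buffer.get_iter_at_offset (out cursor_iter, buffer.cursor_position);

    // Walk from the start, one cursor position (glyph) at a time,
    // until we reach the cursor.
    Gtk.TextIter walker;
    buffer.get_start_iter (out walker);

    var glyphs = 0;
    while (walker.compare (cursor_iter) < 0) {
        walker.forward_cursor_position ();
        glyphs++;
    }

    return glyphs;
}
```

The loop ignores the return value of forward_cursor_position () on purpose: stepping onto the end iterator reports false even though the iterator did move.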
Finally, if you want to go to a specific position within a string, get the start iterator and use the forward_cursor_positions () method to move to your desired offset.
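Going to a glyph position might look something like the sketch below. iter_at_glyph () is a hypothetical helper name, not a Gtk function; the only Gtk calls it relies on are get_start_iter () and forward_cursor_positions ().

```vala
// Compile with something like: valac --pkg gtk+-3.0 example.vala

// Hypothetical helper: return an iterator sitting in front of the glyph at
// glyph_offset (0 is the start of the buffer).
public Gtk.TextIter iter_at_glyph (Gtk.TextBuffer buffer, int glyph_offset) {
    Gtk.TextIter iter;
    buffer.get_start_iter (out iter);

    // Move forward by whole cursor positions (glyphs), not code points.
    if (glyph_offset > 0) {
        iter.forward_cursor_positions (glyph_offset);
    }

    return iter;
}
```

To put the cursor there, store the result in a local variable and hand it to place_cursor (), for example: var target = iter_at_glyph (text_buffer, 4); text_buffer.place_cursor (target);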
Hopefully, these tips will save you some time should you need to work with Unicode glyphs in Vala/Gtk [2].
Hope the dents in my head save you from ones in yours.
Oh, and happy Halloween!
[1] If you know of an easier way, please do let me know.
[2] Perhaps while making apps for elementary OS (hint, hint!)