
An issue with words in Tibetan script



Postby CFynn » Sun May 19, 2013 5:26 pm

In Tibetan, Dzongkha, and other languages written in the Tibetan script, there is no word delimiter (like the space in English).

There is a syllable delimiter ་ (Unicode character U+0F0B), which occurs between the syllables of a multi-syllable word and between words in a phrase. There is also a phrase delimiter ། (Unicode character U+0F0D), which is something like a comma or a full stop, but not precisely the same.

The problem is this: at the end of key words, some Tibetan and Dzongkha dictionaries put no character, some put ་ (U+0F0B), and others put ། (U+0F0D).
Where U+0F0D is used, it gets even more complicated: when a word ends in the consonant ཀ (U+0F40) or ག (U+0F42), the ། (U+0F0D) is dropped (not used), and where a word ends in the consonant ང (U+0F44), both U+0F0B and U+0F0D are used together (ང་།).
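
To make the rule above concrete, here is a rough Python sketch of how a dictionary that uses ། might form the ending of a key word. The function name with_shad is made up for illustration, and the sketch only looks at the last code point of the word, which is an assumption that may not cover every spelling.

```python
TSHEG = "\u0F0B"  # ་ syllable delimiter
SHAD = "\u0F0D"   # ། phrase delimiter
KA, GA, NGA = "\u0F40", "\u0F42", "\u0F44"  # ཀ, ག, ང


def with_shad(word: str) -> str:
    """Return the key word with the ending a །-style dictionary would use.

    Hypothetical helper: it inspects only the final code point of the word.
    """
    base = word.rstrip(TSHEG + SHAD)  # start from the bare word
    if not base:
        return base
    last = base[-1]
    if last in (KA, GA):
        return base                   # ། is dropped after ཀ or ག
    if last == NGA:
        return base + TSHEG + SHAD    # ང takes both characters: ང་།
    return base + SHAD                # every other ending just takes །
```

So the same bare word can end in nothing, in ། or in ་། depending only on its final letter, which is why an exact string match on key words fails so often.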

This means that in Tibetan and Dzongkha dictionaries a key word, e.g. སངས་རྒྱས (meaning "Buddha"), may occur as any of སངས་རྒྱས, སངས་རྒྱས་ or སངས་རྒྱས།
Currently these are treated as separate entries in GoldenDict; ideally all three would be treated as equivalent. This particularly matters when looking up words contained in definitions, because within a definition one cannot select a word with the cursor without also selecting a final ་ (U+0F0B) or ། (U+0F0D). So, if the word in a definition does not end with the same Tibetan punctuation character as the keyword (and it usually doesn't), then we cannot look up the word after selecting it with the cursor.

Would it be possible to make GoldenDict ignore (or trim) the characters U+0F0B and U+0F0D when doing lookups? (In other words treat these characters as ignorables when they occur at the end of a word.) The character U+0F0B cannot be ignored when it occurs anywhere in the middle of a word.
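
As a rough illustration of what I mean (a Python sketch only, not GoldenDict's actual code; the name normalize_tibetan_key is made up), trimming just the trailing delimiters would make the three forms compare equal while leaving U+0F0B inside the word alone:

```python
TSHEG = "\u0F0B"  # ་ syllable delimiter
SHAD = "\u0F0D"   # ། phrase delimiter


def normalize_tibetan_key(word: str) -> str:
    """Strip ་ and ། from the end of a word only.

    Hypothetical helper: delimiters inside the word are left untouched,
    because U+0F0B between syllables is significant.
    """
    return word.rstrip(TSHEG + SHAD)


# The three spellings of the same key word from the example above.
variants = ["སངས་རྒྱས", "སངས་རྒྱས་", "སངས་རྒྱས།"]
keys = {normalize_tibetan_key(v) for v in variants}
```

If the same trimming were applied both to the dictionary key words and to the text selected in a definition, all three variants would map to one lookup key. This also covers the ་། ending, since both trailing characters are removed.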

[I've also filed this as Issue #317 on the bug tracker]

Thanks

- Chris