
How to Create a DSL dictionary for Goldendict

Posted: Wed Mar 24, 2010 3:22 am
by fast_rizwaan
Hi, I would like to share how to create and deploy a DSL dictionary for GoldenDict 0.9+.
-----------
Requirements:
-----------------

1. Linux/Windows
2. Text editor
3. Goldendict installed

About DSL format:
----------------------

It is ABBYY Lingvo's DSL source format, and with it we can easily turn any kind of list into a dictionary.


Using DSL:

We could create a telephone directory for ourselves:
joe 1234567
amy 2345678
yogi 3456789

but let's try country-capital first;

Quick start:
--------------

Let's say we are making a country-capital dictionary, where each country name is followed by its capital (or by several major cities).
Let's take the data from here: http://geography.about.com/od/countryin ... pitals.htm
-------------
suppose we have only these six countries in our list:
-------------
Afghanistan - Kabul
Albania - Tirane
Algeria - Algiers
Andorra - Andorra la Vella
Angola - Luanda
Antigua and Barbuda - Saint John's
--------------

Now the dictionary in DSL format will look like this (note the header, i.e. the first 3 lines):
------------------------------------<file begins here>----------------------
Code: Select all
#NAME "Country Capital Dictionary [en-en]"
#INDEX_LANGUAGE "English"
#CONTENTS_LANGUAGE "English"

Afghanistan
      [m1][trn]Kabul[/trn][/m1]
Albania
      [m1][trn]Tirane[/trn][/m1]
Algeria
      [m1][trn]Algiers[/trn][/m1]
Andorra
      [m1][trn]Andorra la Vella[/trn][/m1]
Angola
      [m1][trn]Luanda[/trn][/m1]
Antigua and Barbuda
      [m1][trn]Saint John's[/trn][/m1]

------------------------------------<file ends here>----------------------
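For a longer list, typing the cards by hand gets tedious. Here is a minimal Python 3 sketch of the same conversion, assuming the input has one "Country - Capital" pair per line as in the list above. The function name pairs_to_dsl is made up for this illustration, and no escaping of special DSL characters is attempted.

```python
def pairs_to_dsl(lines, name="Country Capital Dictionary [en-en]"):
    """Turn 'Country - Capital' lines into DSL text (header plus one card per line)."""
    out = [
        '#NAME "%s"' % name,
        '#INDEX_LANGUAGE "English"',
        '#CONTENTS_LANGUAGE "English"',
        "",
    ]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the input list
        # Split on the first " - " only, so a capital containing " - " survives.
        country, capital = line.split(" - ", 1)
        out.append(country.strip())
        # Card body lines must start with whitespace (a tab here).
        out.append("\t[m1][trn]%s[/trn][/m1]" % capital.strip())
    return "\n".join(out) + "\n"


sample = ["Afghanistan - Kabul", "Antigua and Barbuda - Saint John's"]
print(pairs_to_dsl(sample))
```

The result is plain UTF-8 text with LF line endings, so it still needs whatever encoding conversion your GoldenDict version expects.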

Now we can save the DSL source file as country-capital.dsl and add it to GoldenDict's dictionaries folder.

Linux/Mac/Unix users:
must convert the file from Unix format to Windows format, i.e. from UTF-8 to UTF-16 with CRLF line terminators. Here's how to do it in bash:

Convert utf-8 to utf-16 for dsl dictionary [Linux users]
--------------

sed 's/$/\r/' myfile-with-dsl-code.txt > myfile.crlf-added.dsl # GNU sed: append a CR to every line
iconv -f UTF-8 -t UTF-16 myfile.crlf-added.dsl -o myfile.utf16.dsl

Now we can put myfile.utf16.dsl into GoldenDict's search path (the dictionaries folder).
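For anyone without sed and iconv at hand (on Windows, for example), the same conversion can be sketched in Python 3. The function name utf8_to_utf16_crlf is made up for this illustration, and the sketch assumes the input is valid UTF-8.

```python
def utf8_to_utf16_crlf(data: bytes) -> bytes:
    """Re-encode UTF-8 DSL text as UTF-16 (with BOM) using CRLF line endings."""
    text = data.decode("utf-8-sig")  # also tolerates an optional UTF-8 BOM
    # Normalise existing CRLF first so that we never produce CR CR LF.
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return text.encode("utf-16")  # Python's "utf-16" codec prepends a BOM


converted = utf8_to_utf16_crlf("Afghanistan\n\t[m1]Kabul[/m1]\n".encode("utf-8"))
print(len(converted), "bytes")
```

To convert a real file, read it with open(path, "rb").read(), pass the bytes through the function, and write the result back in binary mode.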

Finally, to save space, we can compress the file [Linux users]:
------------

dictzip myfile.utf16.dsl
------------
This gets us 'myfile.utf16.dsl.dz', which is a much smaller file.

To extract the dictzipped .dsl.dz file back to .dsl:
-----------

dictunzip myfile.utf16.dsl.dz
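dictzip output is designed to stay compatible with the gzip file format, so a .dz file can also be unpacked with an ordinary gzip reader. A minimal Python 3 sketch follows; the helper name read_dz is hypothetical.

```python
import gzip


def read_dz(path):
    """Return the decompressed bytes of a dictzip (.dz) file via the gzip module."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

For example, open("myfile.utf16.dsl", "wb").write(read_dz("myfile.utf16.dsl.dz")) would restore the uncompressed dictionary next to the compressed one.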


and to switch back to utf8 format
-----------

iconv -f UTF-16 -t UTF-8 myfile.utf16.dsl -o myfile.utf8.stage1.dsl # UTF-16 is now UTF-8, but we still have the '\r' Windows line terminators
sed 's/\r$//' myfile.utf8.stage1.dsl > myfile.utf8.dsl # now we can edit the file in Linux using gedit/kedit etc.
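The way back can be sketched in Python 3 as well, for systems without iconv and sed. The function name utf16_crlf_to_utf8 is made up for this illustration; it assumes the file starts with a byte order mark, as iconv's UTF-16 output normally does.

```python
def utf16_crlf_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 DSL text and return it as UTF-8 with LF line endings."""
    text = data.decode("utf-16")       # the BOM tells the codec the byte order
    text = text.replace("\r\n", "\n")  # drop the Windows carriage returns
    return text.encode("utf-8")


original = "一\r\n\t[m1]one[/m1]\r\n".encode("utf-16")
print(utf16_crlf_to_utf8(original).decode("utf-8"))
```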

Hope this helps. I will add more info later :-)
Happy creating new dictionaries!

Re: How to Create a DSL dictionary for Goldendict

Posted: Tue Apr 06, 2010 4:36 pm
by panho10
1. DSL tag
I found that the DSL format can be very useful for making a user dictionary, because it is uncompiled plain text. So I can make changes any time I want and add new articles incrementally.
Then, when I want to make an article richer and more readable, the DSL markup tags become necessary. From Lingvo's help file I found that the following tags are supported in DSL.

Code: Select all
[b],[/b] - boldfaced font
[i],[/i] - italics
[u],[/u] - underlined font
[c],[/c] - coloured (highlighted) font
[mN],[/m]- the left paragraph margin. N is the number of spaces(0-9).
[s],[/s] - multimedia zone (used to add pictures or sound files into a dictionary entries ).
[url],[/url] - link to a Web page.
[p],[/p] - labels (clicking a label displays its full text)
[ref],[/ref]- hyperlink to a card in the same dictionary (or <<, >>)
[sub][/sub] - subscript
[sup][/sup] -  superscript
['],[/'] - a stressed vowel in a word.
[ex], [/ex] - examples zone.


I tested and confirmed that above tags are internally recognized in GD.
Some Lingvo tags seem to do nothing in GD.
These tags seem to have no references in article-style.css and don't show any recognizable effect.

Code: Select all
[*], [/*]  - the text between these tags is only displayed in full translation mode
[trn], [/trn] - translations zone.
[com], [/com] - comments zone.
[!trs], [/!trs] - the text between these tags will not be indexed

If I have misunderstood something, please let me know.

2. representative headword for multiple ones
In a DSL dictionary, I want only one headword to appear for several synonyms.
That is, assuming several headwords (e.g. yi, いち, 일, 一) share one article, I want GD to show it like this even when I search for "yi", いち or 일:
Code: Select all
一
one, single; individual; undivided


I used the Babylon program before and it only showed the main headword, and BGL dictionaries still behave the same way as in Babylon. So I am hoping that this function is also possible in the DSL format. Am I wrong?

Re: How to Create a DSL dictionary for Goldendict

Posted: Tue Apr 06, 2010 6:08 pm
by fast_rizwaan
panho10 wrote:
2. representative headword for multiple ones
In a DSL dictionary, I want only one headword to appear for several synonyms.
That is, assuming several headwords (e.g. yi, いち, 일, 一) share one article, I want GD to show it like this even when I search for "yi", いち or 일:

一
one, single; individual; undivided


I used the Babylon program before and it only showed the main headword, and BGL dictionaries still behave the same way as in Babylon. So I am hoping that this function is also possible in the DSL format. Am I wrong?


I think the software is not designed around "many headwords -> many meanings" but around "one headword -> many meanings"! So, you want yi, いち, 일 and 一 all to have "one, single; individual; undivided".

We should convert it to a single-headword-to-meaning format. I know this will increase the dictionary size, but even with a huge 200,000 entries the zipped DSL is only about 2 MB. So it would be practical to switch to one headword with one or many meanings, like this:

Code: Select all
yi     - one, single; individual; undivided
いち   - one, single; individual; undivided
일     - one, single; individual; undivided
一     - one, single; individual; undivided


Or even better: one headword, one word. This will really help us make a reverse dictionary.
Code: Select all
yi     - one
yi     - single
yi     - individual
yi     - undivided

いち     - one
いち     - single
いち     - individual
いち     - undivided

일     - one
일     - single
일     - individual
일     - undivided

一     - one
一     - single
一     - individual
一     - undivided
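Writing out every headword-meaning pair by hand is error-prone. Here is a minimal Python 3 sketch that generates the pairs, assuming the meanings are separated by commas or semicolons as in the example. The function name expand_pairs is made up for this illustration.

```python
import re
from itertools import product


def expand_pairs(headwords, meaning_text):
    """Return (headword, meaning) tuples: every headword paired with every single meaning."""
    meanings = [m.strip() for m in re.split(r"[,;]", meaning_text) if m.strip()]
    return list(product(headwords, meanings))


pairs = expand_pairs(["yi", "いち", "일", "一"], "one, single; individual; undivided")
for head, meaning in pairs:
    print("%s\t- %s" % (head, meaning))
```

Keeping the result as tuples makes it easy to swap the two columns later when building the reverse dictionary.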



On Linux I use my riz2dsl script, whose input has three parts per line: word, part of speech, and meaning. I put tags around the words, parts of speech (pos) and meanings, like this:

[rizvan@chakra-desktop ~]$ cat file.txt
Code: Select all
<wb>yi<we>      <pb>adj.<pe>    <mb>one<me>
<wb>yi<we>      <pb>adj.<pe>    <mb>single<me>
<wb>yi<we>      <pb>adj.<pe>    <mb>individual<me>
<wb>yi<we>      <pb>adj.<pe>    <mb>undivided<me>

<wb>いち<we>    <pb>adj.<pe>    <mb>one<me>
<wb>いち<we>    <pb>adj.<pe>    <mb>single<me>
<wb>いち<we>    <pb>adj.<pe>    <mb>individual<me>
<wb>いち<we>    <pb>adj.<pe>    <mb>undivided<me>

<wb>일<we>      <pb>adj.<pe>    <mb>one<me>
<wb>일<we>      <pb>adj.<pe>    <mb>single<me>
<wb>일<we>      <pb>adj.<pe>    <mb>individual<me>
<wb>일<we>      <pb>adj.<pe>    <mb>undivided<me>

<wb>一<we>      <pb>adj.<pe>    <mb>one<me>
<wb>一<we>      <pb>adj.<pe>    <mb>single<me>
<wb>一<we>      <pb>adj.<pe>    <mb>individual<me>
<wb>一<we>      <pb>adj.<pe>    <mb>undivided<me>


And after running ./riz2dsl-final-vgood.sh file.txt, I get:
Code: Select all

        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]individual[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]one[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]single[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]undivided[/ref][/trn][/m2]
いち
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]individual[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]one[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]single[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]undivided[/ref][/trn][/m2]
yi
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]individual[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]one[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]single[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]undivided[/ref][/trn][/m2]

        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]individual[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]one[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]single[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]undivided[/ref][/trn][/m2]
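The source of riz2dsl is not shown in this post, so the following Python 3 sketch is not that script. It is only an illustration of the same idea, assuming input lines tagged with <wb>...<we>, <pb>...<pe> and <mb>...<me> as above. The name tagged_to_dsl is made up.

```python
import re

# One tagged input line: word, part of speech, meaning.
LINE_RE = re.compile(r"<wb>(.*?)<we>\s*<pb>(.*?)<pe>\s*<mb>(.*?)<me>")


def tagged_to_dsl(lines):
    """Group tagged lines by headword and emit one numbered DSL card per headword."""
    cards = {}  # headword -> part of speech -> list of meanings
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip blank or malformed lines
        word, pos, meaning = (part.strip() for part in match.groups())
        cards.setdefault(word, {}).setdefault(pos, []).append(meaning)

    out = []
    for word in sorted(cards):
        out.append(word)
        for pos, meanings in cards[word].items():
            out.append("\t[m1][p]%s[/p][/m1]" % pos)
            for number, meaning in enumerate(sorted(meanings), start=1):
                out.append("\t[m2][b]%d.[/b] [trn][ref]%s[/ref][/trn][/m2]"
                           % (number, meaning))
    return "\n".join(out) + "\n"


print(tagged_to_dsl([
    "<wb>yi<we>\t<pb>adj.<pe>\t<mb>one<me>",
    "<wb>yi<we>\t<pb>adj.<pe>\t<mb>individual<me>",
]))
```

Skipping lines that do not match the pattern avoids emitting cards with an empty headword or empty [p][/p] and [ref][/ref] tags.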


I also have a "reverse-mean2word.sh" script, which swaps word and meaning while keeping the part of speech (I'll make a new file for word-meaning only).
Code: Select all
./reverse-mean2-word.sh file.txt
reverse.txt <-output file


<wb><we>, <pb><pe> and <mb><me> are the tags I use in my files.
[rizvan@chakra-desktop ~]$ cat reverse.txt
Code: Select all
 
<wb>individual<we>      <pb>adj.<pe>    <mb>일<me>
<wb>individual<we>      <pb>adj.<pe>    <mb>いち<me>
<wb>individual<we>      <pb>adj.<pe>    <mb>yi<me>
<wb>individual<we>      <pb>adj.<pe>    <mb>一<me>
<wb>one<we>     <pb>adj.<pe>    <mb>일<me>
<wb>one<we>     <pb>adj.<pe>    <mb>いち<me>
<wb>one<we>     <pb>adj.<pe>    <mb>yi<me>
<wb>one<we>     <pb>adj.<pe>    <mb>一<me>
<wb>single<we>  <pb>adj.<pe>    <mb>일<me>
<wb>single<we>  <pb>adj.<pe>    <mb>いち<me>
<wb>single<we>  <pb>adj.<pe>    <mb>yi<me>
<wb>single<we>  <pb>adj.<pe>    <mb>一<me>
<wb>undivided<we>       <pb>adj.<pe>    <mb>일<me>
<wb>undivided<we>       <pb>adj.<pe>    <mb>いち<me>
<wb>undivided<we>       <pb>adj.<pe>    <mb>yi<me>


And running riz2dsl on the reverse file gets us:
./riz2dsl-final-vgood.sh reverse.txt
Code: Select all
        [m1][p][/p][/m1]
        [m2][b]1.[/b] [trn][ref][/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]5.[/b] [trn][ref]一[/ref][/trn][/m2]

        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]一[/ref][/trn][/m2]
individual
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]一[/ref][/trn][/m2]
one
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]一[/ref][/trn][/m2]
single
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]一[/ref][/trn][/m2]
undivided
        [m1][p]adj.[/p][/m1]
        [m2][b]1.[/b] [trn][ref]일[/ref][/trn][/m2]
        [m2][b]2.[/b] [trn][ref]いち[/ref][/trn][/m2]
        [m2][b]3.[/b] [trn][ref]yi[/ref][/trn][/m2]
        [m2][b]4.[/b] [trn][ref]一[/ref][/trn][/m2]
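The reversal step itself is simple enough to sketch in Python 3. This is not the reverse-mean2word script (its source is not shown in the post), just an illustration that assumes the same <wb>/<pb>/<mb> tagged input. The name reverse_lines is made up.

```python
import re

LINE_RE = re.compile(r"<wb>(.*?)<we>\s*<pb>(.*?)<pe>\s*<mb>(.*?)<me>")


def reverse_lines(lines):
    """Swap word and meaning on every tagged line, then sort by the new headword."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip blank or malformed lines
        word, pos, meaning = match.groups()
        rows.append((meaning, pos, word))
    # sort() is stable, so lines with the same meaning keep their input order.
    rows.sort(key=lambda row: row[0])
    return ["<wb>%s<we>\t<pb>%s<pe>\t<mb>%s<me>" % row for row in rows]


for row in reverse_lines([
    "<wb>yi<we>\t<pb>adj.<pe>\t<mb>one<me>",
    "<wb>一<we>\t<pb>adj.<pe>\t<mb>one<me>",
]):
    print(row)
```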

------------
For your convenience, I'll attach the script files. You need Linux to run them. I'm learning Python to make these bash scripts work on Windows and Mac OS X and to improve the performance of the dictionary creation process.

Here are the scripts:
scripts.tar.gz (1.51 KiB)

Re: How to Create a DSL dictionary for Goldendict

Posted: Tue Apr 06, 2010 7:34 pm
by panho10
Thanks for the reply. It's a pity I can't use Linux.
I am not well versed in computers, just a simple user.

My concern is making use of Hanyudacidian (a sort of Chinese counterpart to the unabridged Oxford dictionary).
It has tens of thousands of Chinese characters and more than 300,000 headwords.

Searching for keywords by typing them on the keyboard is therefore difficult. So I want to give each headword its corresponding pronunciations in Roman letters, Japanese and Korean, to make searching easy and fast.

Naturally the search result must show the original Chinese word, because the other headwords are just pronunciations and each pronunciation has several corresponding words.

In Babylon Pro this function is supported (with limitations, of course: each meaning can have only one headword, and variant forms of a headword can't be displayed unless you make a separate article, which is very inefficient in a large database).

If DSL can't support a representative headword function, I hope ikm will make a unique GD dictionary format that incorporates the merits of the other dictionary formats.

Re: How to Create a DSL dictionary for Goldendict

Posted: Tue Apr 06, 2010 11:18 pm
by fast_rizwaan

OK, I've created a Python 3 program which you can run on any operating system to get the following result:

Say we have "file.txt" which has this data:
Code: Select all
#NAME "roman-chinese-korean"
#INDEX_LANGUAGE "chinese"
#CONTENTS_LANGUAGE "english"

yi, いち, 일, 一        one, single; individual; undivided


When you run the program on Windows with "python3 korean.py" (I named the program korean.py), we get the output in "file-output.txt":
Code: Select all
#NAME "roman-chinese-korean"
#INDEX_LANGUAGE "chinese"
#CONTENTS_LANGUAGE "english"

yi
        [m1]one, single; individual; undivided[/m1]
いち
        [m1]one, single; individual; undivided[/m1]
일
        [m1]one, single; individual; undivided[/m1]
一
        [m1]one, single; individual; undivided[/m1]
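The source of korean.py is not reproduced in the thread, so the following is only a sketch of the conversion described here, not the attached program. It assumes each data line has comma-separated headwords, one tab, then the meaning. The name tabbed_to_dsl is made up.

```python
def tabbed_to_dsl(lines):
    """Convert 'word1, word2<TAB>meaning' lines into one DSL card per headword."""
    out = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#") or not line.strip():
            out.append(line)  # copy header lines and blank lines unchanged
            continue
        heads, tab, meaning = line.partition("\t")
        if not tab:
            continue  # no tab: not a word<TAB>meaning line, so skip it
        for head in heads.split(","):
            head = head.strip()
            if head:
                out.append(head)
                out.append("\t[m1]%s[/m1]" % meaning.strip())
    return "\n".join(out) + "\n"


print(tabbed_to_dsl([
    '#NAME "roman-chinese-korean"',
    "",
    "yi, いち, 일, 一\tone, single; individual; undivided",
]))
```

Because headwords are split on commas, a headword that itself contains a comma would need a different separator in the input file.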




You want to access chinese words with roman, japanese, korean words, right?

let's create our new database with "chinese word" in the meaning
Code: Select all
#NAME "roman-chinese-korean"
#INDEX_LANGUAGE "chinese"
#CONTENTS_LANGUAGE "english"

yi, いち, 일, 一        [b]一[/b];one, single; individual; undivided

Observe that the Chinese character is inserted into the meaning!
Run "python3 korean.py" in the c:\python3 folder, where you need to keep your input file as "file.txt", to get file-output.txt in UTF-8 format:

Code: Select all
#NAME "roman-chinese-korean"
#INDEX_LANGUAGE "chinese"
#CONTENTS_LANGUAGE "english"

[b]yi [/b]
     [m1] [b]一;[/b]one, single; individual; undivided [/m1]
[b]いち [/b]
     [m1] [b]一;[/b]one, single; individual; undivided [/m1]
[b]일 [/b]
     [m1] [b]一;[/b]one, single; individual; undivided [/m1]
[b]一 [/b]
     [m1] [b]一;[/b]one, single; individual; undivided [/m1]


I hope this somehow solves your problem;

Another solution is to have 2 dictionaries:
1. one for Roman, Japanese, Korean to Chinese, and
2. a second dictionary for Chinese to English/other language.
3. When we search for yi in the Roman/Japanese/Korean dictionary, we will be shown 一, and we can click on 一 to get the English meanings from the 2nd dictionary. The meaning needs to be wrapped in a [ref] ... [/ref] tag to make it a clickable search.

Please use this program. There are ways to convert the .py to an .exe for easier deployment, but I am yet to try that. Here is the Python 3 program, which converts word-meaning lines of the form word1,word2,word3<tab>meanings:
korean.py.tar.gz (729 Bytes)
This Python 3 program, korean.py, can be used to convert any tab-separated word<tab>meaning file to DSL format.

Re: How to Create a DSL dictionary for Goldendict

Posted: Wed Apr 07, 2010 6:14 am
by panho10
I appreciate your helpful reply.
Actually the original data is very complex and large (more than 200 MB of text alone).
Since the Chinese headwords also have variants in some cases and the meaning sections span many lines,
your script can't be applied directly. Also, some Chinese words have no corresponding pronunciation, because they are found only in ancient texts and it is not yet known how to pronounce them.
Anyway, I can use your script for some other dictionaries in the future. Thank you.
I just wanted to add querying by pronunciation. I am sorry to find that we can't do it in DSL itself.

Re: How to Create a DSL dictionary for Goldendict

Posted: Wed Apr 07, 2010 7:21 am
by ikm
In dsl you can have multiple headwords per article, e.g.

Code: Select all
one
two
three
    Body for one, two and three
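When the cards are generated by a script, sharing one body between several headwords just means writing all the headwords first. Here is a minimal Python 3 sketch; the function name shared_card is made up for this illustration.

```python
def shared_card(headwords, body_lines):
    """Build one DSL card whose body is shared by all the given headwords."""
    lines = list(headwords)                        # headwords start in column 0
    lines += ["\t" + body for body in body_lines]  # body lines are indented
    return "\n".join(lines) + "\n"


print(shared_card(["one", "two", "three"], ["Body for one, two and three"]))
```

Compared with repeating the body under each headword, this keeps the source file smaller and the article text in one place.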

Re: How to Create a DSL dictionary for Goldendict

Posted: Wed Apr 07, 2010 8:02 am
by panho10
I know that, too. As I mentioned above, what I want is a representative headword among multiple ones.
Example:
Code: Select all
yi
いち
일
一
   [trn][m1][c green]yī[/c] [/m]
   [m1][b]1.[/b]數詞。 大寫作“壹”。 最小的正整數。 常用以表示人或事、物的最少數量。  [/m]
   [m2][*][ex]《詩‧鄭風‧野有蔓草》: “有美一人, 清揚婉兮。” [/ex][/*][/m]
   .....

In the above code, three headwords (yi, いち, 일) are only pronunciation forms in Roman letters, Japanese and Korean.
So I hope to make it possible that, even if I search by any of yi, いち or 일, only 一 is displayed as the representative headword in the query result.

Re: How to Create a DSL dictionary for Goldendict

Posted: Wed Apr 07, 2010 9:02 am
by ikm
There are a couple of tricks for this. First, you can use ~ (tilde), which always expands to the first headword. E.g.

Code: Select all
一
yi
    The main headword is ~

The body will always display as "The main headword is 一"

Second, you can specify so-called "unsorted parts" in headword. They go enclosed in curly braces and will display in the article body, but not in index. E.g.

Code: Select all
一
{一 \(}yi{\)}
  Article body

This will make two headwords appear in the index: 一 and yi. However, the "yi" headword in the card itself will appear as "一 (yi)". Try it out for yourself. Note that the parentheses (round braces) are escaped with backslashes. They need to be, since unescaped parentheses have another meaning (so-called "optional parts").
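If such headwords are generated by a script, the escaping is easy to get wrong. Here is a minimal Python 3 sketch that builds this kind of headword. The function names are made up, and the set of escaped characters is an assumption: it covers only the characters mentioned in this thread (parentheses, curly braces, tilde) plus the backslash itself, not necessarily every character that DSL treats specially.

```python
# Assumption: only the characters discussed in this thread, plus the backslash.
SPECIAL = "\\(){}~"


def escape_headword(text):
    """Put a backslash in front of every character listed in SPECIAL."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


def labelled_headword(main, alias):
    """Build '{main \\(}alias{\\)}': indexed as alias, displayed as 'main (alias)'."""
    return "{%s \\(}%s{\\)}" % (escape_headword(main), escape_headword(alias))


print(labelled_headword("一", "yi"))
```

The printed line can be placed under the main headword as a second headword, as in the example above.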

DSL has a number of little-known features like this.

p.s. About creating custom format -- there's little value in creating a format when there are no dictionaries in that format, it's a chicken-and-egg problem. It takes quite some time to develop one, but it won't gain any significant adoption fast.

Re: How to Create a DSL dictionary for Goldendict

Posted: Wed Apr 07, 2010 9:48 am
by panho10
Yeah, actually I found that method myself in Lingvo's help file and tried it.
But I didn't like the result because it made the article look somewhat cluttered.
By the way, if a new custom format is pointless, how about considering support for the GLS format?
GLS is the raw source format before it is compiled to BGL. Its syntax is as follows, and HTML tags are supported.
Code: Select all
Term1 | Alternate1 | Alternate2| ...
Definition

Term2 | ...
...

As you probably know, alternative terms are searchable but not shown in the article display.
But I would prefer it if GD could extend the DSL syntax internally and allow some sort of Babylon-like hidden alternative headword function.