
Finding duplicate entries and fixing tags in a dictionary


Post by betwee » Thu Jul 07, 2011 3:22 pm

I'm using a small dictionary in DSL format, found here: http://www.mediafire.com/?cmc2d17crn2ai2n
Occasionally an entry appears twice (both the headword and the description). Is there a program or script I can use to find and remove all the duplicate entries in this dictionary?

A second problem I noticed is the closing tags. Italic text, for example, should start with (i) and end with (/i), but in this dictionary it sometimes ends with (/ex) or (/p), or has no closing tag at all. It looks like a manually maintained dictionary, hence all the errors, but it would be nice to know whether there is a way to fix these too.

Thanks
betwee
 
Posts: 33
Joined: Tue May 17, 2011 8:10 am

Re: Finding duplicate entries and fixing tags in a dictionary

Post by Tvangeste » Thu Jul 07, 2011 4:51 pm

betwee wrote:I'm using a small dictionary in DSL format, found here: http://www.mediafire.com/?cmc2d17crn2ai2n
Occasionally an entry appears twice (both the headword and the description). Is there a program or script I can use to find and remove all the duplicate entries in this dictionary?

Are those *completely* identical entries? I see some duplicate headwords, but their entries differ.

betwee wrote:A second problem I noticed is the closing tags. Italic text, for example, should start with (i) and end with (/i), but in this dictionary it sometimes ends with (/ex) or (/p), or has no closing tag at all. It looks like a manually maintained dictionary, hence all the errors, but it would be nice to know whether there is a way to fix these too.

Please name a headword where the tags are incorrect. From a quick look at the DSL, most of the i tags appear to be properly closed.
Tvangeste
 
Posts: 893
Joined: Thu Jun 02, 2011 11:42 am

Post by betwee » Thu Jul 07, 2011 5:36 pm

Tvangeste wrote:Are those *completely* identical entries? I see some duplicate headwords, but their entries differ.

Yes. Look for the words "shoshare 1" or "shpërshpije".

Tvangeste wrote:Please name a headword where the tags are incorrect. From a quick look at the DSL, most of the i tags appear to be properly closed.

Look at the word "shumëfish" and notice how the examples start with 'ex' but end with '/p'.

Code:
shumëfish
 [p]mb.[/p]
 [m1][b]1.[/b] Që përbëhet prej shumë njësish, pjesësh, elementesh etj. [ex]Emërtime shumëfishe. Formë shumëfishe. Fryt shumëfish.[/p] [i][p][c peru]bot.[/c][/p][/i]

Post by Tvangeste » Thu Jul 07, 2011 6:59 pm

betwee wrote:Yes. Look for the words "shoshare 1" or "shpërshpije".

For exact duplicates, there are scripts available that can eliminate them. I ran one such script on your dictionary and found only 93 duplicates.
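
For illustration, here is a minimal Python sketch of how a script of this kind could work. It is not the script that was actually run on the dictionary. It assumes the usual DSL layout (directive lines starting with #, headwords starting in the first column, body lines indented with a space or tab, no blank lines inside a body) and that the file is UTF-16 encoded, which is common for DSL but worth checking. The names parse_dsl, drop_exact_duplicates and dedupe_file are made up for this example. Two entries count as duplicates only when both the headword and the whole body match, so entries that share a headword but differ in the body are kept.

```python
def parse_dsl(text):
    """Split DSL text into (header_lines, entries); each entry is a list of lines."""
    header, entries = [], []
    previous_complete = True  # True once the previous entry has a body
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate entries (assumption)
        if line.startswith("#") and not entries:
            header.append(line)  # directives such as #NAME
        elif line[0] in " \t":
            if entries:
                entries[-1].append(line)  # body line of the current entry
                previous_complete = True
        elif previous_complete:
            entries.append([line])  # a new headword starts a new entry
            previous_complete = False
        else:
            entries[-1].append(line)  # extra headword sharing the same body
    return header, entries


def drop_exact_duplicates(entries):
    """Keep the first copy of each entry whose headword and body both match."""
    seen, unique = set(), []
    for entry in entries:
        key = tuple(line.rstrip() for line in entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def dedupe_file(src, dst, encoding="utf-16"):
    """Write a de-duplicated copy of src to dst; return the number removed."""
    with open(src, encoding=encoding) as f:
        header, entries = parse_dsl(f.read())
    unique = drop_exact_duplicates(entries)
    with open(dst, "w", encoding=encoding) as f:
        f.write("\n".join(header) + "\n\n")
        for entry in unique:
            f.write("\n".join(entry) + "\n")
    return len(entries) - len(unique)
```

Writing to a separate output file keeps the original untouched, so the result can be compared against it before anything is replaced.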

betwee wrote:Look at the word "shumëfish" and notice how the examples start with 'ex' but end with '/p'.


Ah, yeah. Structure problems. These are hard to fix automatically, unfortunately, so you might consider just fixing them manually.
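
Even if the repair itself stays manual, finding the broken lines can be scripted. The following is a rough Python sketch, not one of the scripts mentioned in this thread. It assumes that tags are expected to open and close within a single body line, that a tag name is the part before any attribute or digit (so [m1] pairs with [/m] and [c peru] with [/c]), and that a backslash escapes a literal bracket. The names check_line and find_tag_problems are invented for this example.

```python
import re

# An opening or closing DSL tag that is not escaped with a backslash.
TAG_RE = re.compile(r"(?<!\\)\[(/?)([a-z*']+)[^\]]*\]")


def check_line(line):
    """Return a list of tag problems found in one body line."""
    stack, problems = [], []
    for match in TAG_RE.finditer(line):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        elif stack:
            problems.append(f"found [/{name}], expected [/{stack[-1]}]")
            stack.pop()  # treat the wrong closer as closing the open tag
        else:
            problems.append(f"found [/{name}] with no open tag")
    problems.extend(f"[{name}] is never closed" for name in stack)
    return problems


def find_tag_problems(text):
    """Return (line_number, headword, problem) for every suspicious body line."""
    report, headword = [], None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives
        if line[0] not in " \t":
            headword = line  # unindented line: remember the headword
            continue
        for problem in check_line(line):
            report.append((number, headword, problem))
    return report
```

The report only points at lines to review. Deciding whether a stray [/p] should become [/ex] or be removed still needs a human eye.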

I quickly tried to apply some of the scripts I have collected over the years, and here's the result, but use with care! 8-)
http://www.multiupload.com/C4VWLWJ9FR

Post by betwee » Fri Jul 08, 2011 1:23 pm

Tvangeste wrote:I quickly tried to apply some of the scripts I have collected over the years, and here's the result, but use with care! 8-)

That was a great help, Tvangeste, thank you very much :). I would never have found them all by eye!

Post by betwee » Tue Aug 16, 2011 1:01 pm

I think I may need your help once again, :).

I have a new bilingual DSL dictionary that is full of repeated labels, in this case [p]em.[/p] (noun).
As you can see, all the words are nouns, but I would like to keep only the first [p]em.[/p] and delete the others.

So, I want to turn this:

Code:
abdikim
   [m1][p]em.[/p] abdication; [p]em.[/p] disclaimer; [p]em.[/p] disclamation; [p]em.[/p] disavowal; [p]em.[/p] denial; [p]em.[/p] demise[/m]


into this:

Code:
abdikim
   [m1][p]em.[/p] abdication, disclaimer, disclamation, disavowal, denial, demise[/m]


Is there a script that could do this?

Post by Tvangeste » Tue Aug 16, 2011 3:06 pm

There are no special scripts for that, but such replacements are easy to do in any text editor that properly supports regexps (EmEditor, for example). Maybe regexps are not even needed here: just find every occurrence of the following string and delete it or replace it with a comma:
Code:
; [p]em.[/p]
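
If a text editor is not at hand, the same plain find-and-replace can be sketched in a few lines of Python. The function name collapse_repeated_label is made up for this example, and the default label is the one from this thread. The first label on a line is preceded by [m1] rather than by a semicolon, so it is left alone.

```python
def collapse_repeated_label(text, label="[p]em.[/p]"):
    """Replace every '; <label>' with a comma, keeping the first label intact."""
    return text.replace("; " + label, ",")


before = ("   [m1][p]em.[/p] abdication; [p]em.[/p] disclaimer; "
          "[p]em.[/p] disclamation; [p]em.[/p] disavowal; "
          "[p]em.[/p] denial; [p]em.[/p] demise[/m]")
print(collapse_repeated_label(before))
```

Applied to the abdikim line quoted earlier in the thread, this yields the single-label form that was asked for.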

Post by betwee » Tue Aug 16, 2011 4:08 pm

That was easy, yes. Thanks, :).