Author Topic: [Ati.eu] Pagenames and Pali diacritics (Read 3397 times)

Dhammañāṇa · « **on:** May 15, 2019, 08:46:09 AM »

While still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the pagenames. The software, as it is now, cuts certain characters with diacritics, those included in the Roman table, down to standard Roman characters. Others are left as they are, because they are not included. This is fine when using the search tools, since one can now search with or without diacritics for these characters, but it would produce many double or triple names.
If one were now, where possible, to disable the cutting of diacritics for the Roman table as well, the search would no longer be that pleasing and would possibly need to be recoded to match both with and without diacritic characters.

Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.

Moritz · « **Reply #1 on:** May 18, 2019, 07:50:24 AM »

Quote from: Johann on May 15, 2019, 08:46:09 AM

While still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the pagenames. The software, as it is now, cuts certain characters with diacritics, those included in the Roman table, down to standard Roman characters. Others are left as they are, because they are not included. This is fine when using the search tools, since one can now search with or without diacritics for these characters, but it would produce many double or triple names.

Vandami Bhante $_/\_$

I do not really understand the problem.
The dictionary now includes pages like, for example, http://accesstoinsight.eu/en/dictionary/ṭhiti-bhāgiya-samādhi, which has diacritics in the pagename.
Searching for the phrase ṭhiti-bhāgiya-samādhi yields 211 matches on 2 pages. But searching for thiti-bhagiya-samadhi (without diacritics) yields no matches at all.

Where does it happen that Roman characters with diacritics are stripped of diacritics?

Quote from: Johann on May 15, 2019, 08:46:09 AM

If one were now, where possible, to disable the cutting of diacritics for the Roman table as well, the search would no longer be that pleasing and would possibly need to be recoded to match both with and without diacritic characters.

This seems to be already the case here: One would have to enter any phrase exactly correct with diacritics to find matching results.
Could Bhante give an example of a search where this is not the case? Or have I misunderstood something?

In any case, yes, it seems very useful, being able to search for Pali phrases without all the exact diacritics. But would also be good to have the option to search for exact matching diacritics. Not sure how easy or difficult it would be to implement.
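To make the idea concrete, here is a minimal sketch in Python of a search that matches with or without diacritics by default and offers an exact mode as an option. It is not how the wiki's search actually works; the names `fold` and `search` and the small entry list are assumptions made up for illustration.

```python
import unicodedata

def fold(text):
    """Lowercase and drop combining marks, so 'samādhi' becomes 'samadhi'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(entries, query, exact=False):
    """Return the entries that contain the query.

    exact=False compares diacritic-folded forms (hypothetical default);
    exact=True compares the spelling as typed, after NFC normalization.
    """
    if exact:
        needle = unicodedata.normalize("NFC", query)
        return [e for e in entries if needle in unicodedata.normalize("NFC", e)]
    needle = fold(query)
    return [e for e in entries if needle in fold(e)]

# Sample entries, for illustration only.
entries = ["ṭhiti-bhāgiya-samādhi", "ācariya"]
print(search(entries, "thiti-bhagiya-samadhi"))              # folded match
print(search(entries, "thiti-bhagiya-samadhi", exact=True))  # exact match only
```

Folding both the query and the entry means the stored spelling can stay correct while the typed spelling may be sloppy.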

Quote from: Johann on May 15, 2019, 08:46:09 AM

Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.

Not sure what choice in regard to filenames is needed? At the moment, it seems dictionary filenames/page names include diacritics etc., which I think makes sense. If one were to strip diacritics from filenames, then one might have some conflicting words which would be the same without diacritics.

So I think it would be best to have all diacritics included in the file names / page names. Still not sure where anything is stripped of diacritics. Is it the case that one would have a file name with diacritics, but the URL would be without diacritics? Would be good to see an example.
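One way to see how many such conflicts a word list would produce is to group the words by their stripped form. The following is only a sketch: the function names and the three sample words are assumptions, and a real run would read the full dictionary word list instead.

```python
import unicodedata
from collections import defaultdict

def strip_marks(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_collisions(words):
    """Map each stripped name to the words sharing it, keeping only real clashes."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_marks(word)].append(word)
    return {name: same for name, same in groups.items() if len(same) > 1}

# Illustrative input only: 'mala' and 'mālā' would end up as the same pagename.
print(find_collisions(["mala", "mālā", "ṭhiti"]))
```

If the result is empty for the whole word list, stripping would be harmless; every key it does return is a pagename that two or more words would compete for.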

$_/\_$

Dhammañāṇa · « **Reply #2 on:** May 18, 2019, 10:33:20 AM »

Quote from: Moritz on May 18, 2019, 07:50:24 AM

Quote from: Johann on May 15, 2019, 08:46:09 AM
While still preparing the PTS dictionary for ati's dictionary, which has some tens of thousands of words, Atma is thinking about how best to handle the pagenames. The software, as it is now, cuts certain characters with diacritics, those included in the Roman table, down to standard Roman characters. Others are left as they are, because they are not included. This is fine when using the search tools, since one can now search with or without diacritics for these characters, but it would produce many double or triple names.

Vandami Bhante $_/\_$

I do not really understand the problem.
The dictionary now includes pages like, for example, http://accesstoinsight.eu/en/dictionary/ṭhiti-bhāgiya-samādhi, which has diacritics in the pagename.
Searching for the phrase ṭhiti-bhāgiya-samādhi yields 211 matches on 2 pages. But searching for thiti-bhagiya-samadhi (without diacritics) yields no matches at all.

Where does it happen that Roman characters with diacritics are stripped of diacritics?

There is currently no problem, because Atma took care to have no "same" names within the standard, but that will no longer be possible, or only with tricks like using aa instead of ā in pagenames. When uploading the "whole" set of words, this can become troublesome.

The filename is ṭhitibhagiyasamadhi.txt, or more precisely %E1%B9%ADhitibhagiyasamadhi.txt. Whenever a new page is created, the program cuts the name down to that standard.
Since ṭ is not included in that standard, it can be distinguished from t. The cutting down is fine as long as there is no need for both a page a and a page ā. If there is, it meets its limits. The quicksearch searches header and filename, which in the current situation is great, because in that way two different kinds of spelling are found. If filename = header (the real spelling), one would currently need to type the remaining diacritics correctly in order to find the word.
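The behaviour described above can be imitated in a few lines of Python. This is only a sketch of the idea, not the wiki's own code: the table below is a made-up subset (which characters the real table covers is not listed here), and the function names are hypothetical.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical subset of the "Roman table": only these characters are reduced.
ROMAN_TABLE = {
    "\u0101": "a",  # ā
    "\u012b": "i",  # ī
    "\u016b": "u",  # ū
}

def cut_down(pagename):
    """Replace characters found in the table; leave all others (e.g. ṭ) untouched."""
    composed = unicodedata.normalize("NFC", pagename)
    return "".join(ROMAN_TABLE.get(ch, ch) for ch in composed)

def url_encoded_filename(pagename):
    """Percent-encode the UTF-8 bytes of the reduced name and append .txt."""
    return quote(cut_down(pagename), safe="") + ".txt"

print(url_encoded_filename("ṭhitibhāgiyasamādhi"))
print(cut_down("ācariya") == cut_down("acariya"))  # both reduce to the same name
```

The second line of output shows the limit mentioned above: once a and ā are reduced to the same character, two different words can no longer have two different pages.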

Atma has not tried it yet, but as far as understood, uploading files that do not follow the pagename standard, say ṭhitibhāgiyasamādhi.txt, would make them invalid files that cause errors and are not displayed.

Quote from: Moritz on May 18, 2019, 07:50:24 AM

Quote from: Johann on May 15, 2019, 08:46:09 AM
If one were now, where possible, to disable the cutting of diacritics for the Roman table as well, the search would no longer be that pleasing and would possibly need to be recoded to match both with and without diacritic characters.

This seems to be already the case here: One would have to enter any phrase exactly correct with diacritics to find matching results.
Could Bhante give an example of a search where this is not the case? Or have I misunderstood something?

Not in the case where pagename and header differ, at least for quick-search (as it searches both header and filename). The standard search, as far as known, looks for whatever is there, as it is.

Quote from: Moritz on May 18, 2019, 07:50:24 AM

In any case, yes, it seems very useful, being able to search for Pali phrases without all the exact diacritics. But would also be good to have the option to search for exact matching diacritics. Not sure how easy or difficult it would be to implement.

Quote from: Johann on May 15, 2019, 08:46:09 AM
Every idea and suggestion welcome here.

Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.

Not sure what choice in regard to filenames is needed? At the moment, it seems dictionary filenames/page names include diacritics etc., which I think makes sense. If one were to strip diacritics from filenames, then one might have some conflicting words which would be the same without diacritics.

So I think it would be best to have all diacritics included in the file names / page names. Still not sure where anything is stripped of diacritics. Is it the case that one would have a file name with diacritics, but the URL would be without diacritics? Would be good to see an example.

$_/\_$

My person thinks it would be best if the filename = right spelling, with the search engines and the programs handling the files adapted accordingly, but my person also thinks that this might require a huge amount of programming work and would probably have much impact on applications and plugins as well.

Another option would be to turn the deaccent config to 0; currently it is 2 (remove diacritics in Latin, which matches some of the Pali diacritics as well). Using 0 now may make some current pagenames problematic and cause trouble.
Removing the whole set of Pali diacritics from the cut-away list in the responsible script might possibly be better. Yet the fact that the searches are not optimized for different spellings is a separate matter, and it remains.

In the current system Atma used tricks like "aa", but for m with an upper dot, for example, that is more problematic; m with a dot below would stay as it is, since it is not in the Latin block of diacritics.

While always finding ways to make the best use of what is available, in this case my person has no idea of the amount of programming effort and skill needed for a good rendering, and would therefore not really ask anyone to go after this or that better solution. He is just aware that it would possibly require huge and skilled work, and would need a lot of sacrifice, concentration and devotion to this matter.

Dhammañāṇa · « **Reply #3 on:** May 18, 2019, 10:39:54 AM »

The filenames currently use "url" encoding. Although that is safer for use with certain Windows programs, my person thinks that utf-8 would be better and easier, avoiding the long and not human-readable filenames.
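To illustrate the difference in length and readability, the two styles can be compared for a single name. This is only a sketch in Python under the assumption that "url" means plain percent-encoding of the UTF-8 bytes; the function names and the Khmer sample word are chosen for illustration.

```python
from urllib.parse import quote, unquote

def url_style(pagename):
    """Filename as stored with percent-encoding (assumed 'url' behaviour)."""
    return quote(pagename, safe="") + ".txt"

def utf8_style(pagename):
    """Filename as stored directly in UTF-8."""
    return pagename + ".txt"

name = "សីល"  # Khmer sample input, three code points
print(len(utf8_style(name)), utf8_style(name))
print(len(url_style(name)), url_style(name))
# The encoded form can always be turned back into the readable one.
print(unquote(url_style(name)) == utf8_style(name))
```

Every Khmer character takes three bytes in UTF-8 and each byte becomes three characters when percent-encoded, so such names grow ninefold on disk.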

Also here

Quote from: https://www.dokuwiki.org/config:fnencode

Warning: Changing this option could cause unintended behaviour. By changing it you can make pages created under a previous setting inaccessible.

Please also note that storing UTF-8 filenames might not be possible with all file systems. Windows systems have been reported to not work with this setting.

Windows might possibly follow soon, since utf-8 is easier for many.

At the moment, relatively few pages would still have to be re-saved, mainly the Khmer ones.

Dhammañāṇa · « **Reply #4 on:** May 19, 2019, 06:26:51 AM »

The "easiest" way with the given, existing possibilities might be that my person changes the filename mode to utf-8, sets deaccent to 0 and systematically adds alternative spellings as page content. That would make the pages searchable and matchable in quick search as well. Writing the header without diacritics would also make the quicksearch work fine.

Linking the pages will then require the right spelling.

The work on the existing content would be to rename and re-save mostly Khmer pages. Regarding the impact on Windows systems, my person is not informed for now.

Moritz · « **Reply #5 on:** May 19, 2019, 10:14:42 AM »

Quote from: Johann on May 19, 2019, 06:26:51 AM

The "easiest" way with the given, existing possibilities might be that my person changes the filename mode to utf-8, sets deaccent to 0 and systematically adds alternative spellings as page content. That would make the pages searchable and matchable in quick search as well. Writing the header without diacritics would also make the quicksearch work fine.

Linking the pages will then require the right spelling.

The work on the existing content would be to rename and re-save mostly Khmer pages. Regarding the impact on Windows systems, my person is not informed for now.

That seems like the cleanest solution for a start.
Regarding the impact on Windows systems, I think there is no need to consider it, because the server, like most servers, is running some variant of Linux.
$_/\_$

Dhammañāṇa · « **Reply #6 on:** May 19, 2019, 10:50:42 AM »

Good, so Atma will follow that vision when preparing further over the next days.

Dhammañāṇa · « **Reply #7 on:** September 17, 2019, 11:24:52 AM »

My person has now turned "deaccent" off, and it will take some time until all pages touched by it (especially the dictionary) are reproduced with proper filenames (many might not be found in the meantime).

To make words in the dictionary easier to find via quick-search suggestions, my person adds "[dic]" at the beginning of the header and, in brackets, the spelling without diacritics. For example "[dic] ācariya (acariya)" or "[dic] āciṇṇakakamma (acinnakakamma)".
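For anyone who wants to produce such headers in bulk, the pattern can be sketched in Python as below. The function names are hypothetical, and leaving out the brackets for words that have no diacritics at all is an assumption of this sketch, not something decided above.

```python
import unicodedata

def plain(word):
    """Spelling without diacritics: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def dic_header(word):
    """Build a header like '[dic] ācariya (acariya)'."""
    word = unicodedata.normalize("NFC", word)
    stripped = plain(word)
    if stripped == word:
        # Assumption: no bracketed form when nothing would change.
        return f"[dic] {word}"
    return f"[dic] {word} ({stripped})"

for w in ["ācariya", "āciṇṇakakamma"]:
    print(dic_header(w))
```

Because the quicksearch looks at the header, a query typed without diacritics can then match the bracketed spelling while the page itself keeps the correct name.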

Dhammañāṇa · « **Reply #8 on:** September 17, 2019, 03:25:26 PM »

...and set the filename mode to utf-8, of course.

Meanwhile all (more or less) new files have been uploaded (incl. the new words from Bhante Varado's Glossary, though the source files are not made yet).

Redirects from old to new filenames, where they have changed, are still to be made, and surely countless corrections here and there, incl. a new template file for the dic section.

Whether the files are found depends on the status of the indexing, which may take a while.
