Still busy preparing the PTS dictionary for ati's dictionary, with some tens of thousands of words, Atma is thinking about how best to handle the pagenames. The software, as it is now, reduces certain characters with diacritics, namely those included in the Roman table, to standard Roman characters. Others are left as they are, because they are not included. This is fine for the search tools, since one can then search with or without diacritics for these characters, but as a result there would be many double or triple names.
Vandami Bhante
I do not really understand the problem.
The dictionary now includes pages like, for example, http://accesstoinsight.eu/en/dictionary/ṭhiti-bhāgiya-samādhi , which has diacritics in the pagename.
Searching for the phrase ṭhiti-bhāgiya-samādhi yields 211 matches on 2 pages. But searching for thiti-bhagiya-samadhi (without diacritics) yields no matches at all.
Where does it happen that Roman characters with diacritics are stripped of diacritics?
There is currently no problem, because Atma took care to have no "same" names within the standard, but that will no longer be possible, or only with tricks like using aa instead of ā in pagenames. When uploading the "whole" set of words, this can become troublesome.
The filename is ṭhitibhagiyasamadhi.txt, or better %E1%B9%ADhitibhagiyasamadhi.txt. Whenever a new page is created, the program cuts the name down to that standard.
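The second filename above is the first one with ṭ written as its percent-encoded UTF-8 bytes. A minimal Python sketch of that encoding follows. The function names are hypothetical, and it is only an assumption, based on the filename quoted above, that the wiki software stores names this way.

```python
from urllib.parse import quote, unquote


def encode_pagename(name: str) -> str:
    """Percent-encode every non-ASCII character of a page name as UTF-8 bytes."""
    # safe="" means no reserved characters are exempted; ASCII letters,
    # digits and "_.-~" are never encoded by quote().
    return quote(name, safe="")


def decode_pagename(encoded: str) -> str:
    """Turn a percent-encoded name back into readable text."""
    return unquote(encoded)


name = "\u1e6dhitibhagiyasamadhi"  # ṭhitibhagiyasamadhi, with ṭ = U+1E6D
encoded = encode_pagename(name)
print(encoded + ".txt")            # %E1%B9%ADhitibhagiyasamadhi.txt
print(decode_pagename(encoded))    # ṭhitibhagiyasamadhi
```

Only ṭ is encoded here because the other letters in this filename have already been reduced to plain Roman characters.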
Since ṭ is not included in that standard, it can be distinguished from t. The cutting down is fine as long as there is no need for separate pages a and ā. If they are needed, it meets its limits. The quicksearch searches header and filename, and in the current situation that is great, because in that way it finds two different kinds of spelling. If filename = header (the real spelling), one would currently need to type correctly in order to find words with the remaining diacritics.
Not having tried it yet, but uploading files that are not in the standard of the page names, say ṭhitibhāgiyasamādhi.txt, would make them invalid files, causing errors and not being displayed, as far as understood.
If now, where possible, one disabled the cutting off of diacritics also for the Roman table, the search would no longer be as pleasing; possibly it would need to be recoded to match either with or without diacritic characters.
This seems to be already the case here: One would have to enter any phrase exactly correct with diacritics to find matching results.
Could Bhante give an example of a search where this is not the case? Or have I misunderstood something?
Not in the case where pagename and header are different, at least for the quicksearch (as it searches both header and filename). The standard search looks for whatever is there, as it is, for now, as far as known.
In any case, yes, it seems very useful, being able to search for Pali phrases without all the exact diacritics. But would also be good to have the option to search for exact matching diacritics. Not sure how easy or difficult it would be to implement.
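As a rough illustration of that idea, here is a minimal Python sketch of a search that ignores diacritics by default and offers an exact mode. The names `fold` and `search` and the sample titles are assumptions made for the example; this is not how the site's search is implemented.

```python
import unicodedata


def fold(text: str) -> str:
    """Lower-case the text and drop all combining marks (macrons, dots, tildes)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query: str, titles: list[str], exact: bool = False) -> list[str]:
    """Return the titles containing the query, with or without exact diacritics."""
    if exact:
        # Compare in one normalization form so that equal spellings always match.
        q = unicodedata.normalize("NFC", query)
        return [t for t in titles if q in unicodedata.normalize("NFC", t)]
    q = fold(query)
    return [t for t in titles if q in fold(t)]


titles = ["ṭhiti-bhāgiya-samādhi", "samādhi", "kamma"]
print(search("thiti-bhagiya-samadhi", titles))              # found without diacritics
print(search("thiti-bhagiya-samadhi", titles, exact=True))  # nothing in exact mode
print(search("samādhi", titles, exact=True))                # both titles containing it
```

Folding both the query and the stored text at search time leaves the stored page names untouched, so the exact spelling stays available for the exact mode.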
Every idea and suggestion welcome here.
Atma thinks it will take another week or two until a choice regarding filenames has to be made in order to progress.
Not sure what choice in regard to filenames is needed? At the moment, it seems dictionary filenames/page names include diacritics etc., which I think makes sense. If one were to strip diacritics from filenames, then one might have some conflicting words which would be the same without diacritics.
So I think it would be best to have all diacritics included in the file names / page names. Still not sure where anything is stripped of diacritics. Is it the case that one would have a file name with diacritics, but the URL would be without diacritics? Would be good to see an example.
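Before uploading the whole word list, one could check how many such conflicts there would be. The following Python sketch groups words by their diacritic-free form and reports the groups with more than one member. It assumes that all diacritics are stripped, which is stricter than the current software, and the short word list is only illustrative.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Drop all combining marks from a word (assumption: every diacritic is stripped)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(words: list[str]) -> dict[str, list[str]]:
    """Map each stripped name to the original words that would share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    # Keep only the names claimed by more than one word.
    return {name: group for name, group in groups.items() if len(group) > 1}


words = ["kama", "kāma", "mala", "mālā", "samādhi"]  # illustrative sample only
for name, group in find_collisions(words).items():
    print(name, "<-", ", ".join(group))
```

Run over the full dictionary list, the size of the result would show whether tricks like "aa" are needed for only a few words or for a large share of them.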
My person thinks that it would be best if filename = right spelling, and if search engines as well as the programs handling files were adapted accordingly, but my person thinks that this might require a huge amount of programming work and would probably have much impact on applications and plugins as well.
Another option would be turning the deaccent config to 0; currently it is 2 (remove diacritics in Latin, which matches some of the Pali diacritics as well). Using 0 now may make some current pagenames problematic and cause trouble.
Taking the whole set of Pali diacritics out of the stripping in the responsible script would possibly be better. Yet the matter that the searches are not optimized for different spellings is a separate one and remains.
In the current system Atma used tricks like "aa", but for m with a dot above, for example, this is more problematic; m with a dot below would stay as it is, since it is not in the Latin block of diacritics.
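To see where the Pali letters sit, the following Python sketch prints the code point and Unicode block of each one. The block ranges are those of the Unicode standard; whether the software's table of "Latin" diacritics lines up with these blocks is an assumption not confirmed here, and the letter list is chosen only for illustration.

```python
import unicodedata

# Unicode block ranges relevant to romanized Pali.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]


def unicode_block(ch: str) -> str:
    """Return the Unicode block name of a single (precomposed) character."""
    code_point = ord(unicodedata.normalize("NFC", ch))
    for start, end, name in BLOCKS:
        if start <= code_point <= end:
            return name
    return "other"


PALI_LETTERS = "āīūñṅṭḍṇḷṃṁ"  # illustrative list of Pali letters with diacritics
for ch in unicodedata.normalize("NFC", PALI_LETTERS):
    print(f"{ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}  [{unicode_block(ch)}]")
```

In this listing ā, ī and ū fall into Latin Extended-A and ñ into Latin-1 Supplement, while the dotted letters, including both forms of m, fall into Latin Extended Additional.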
While always finding a way to make the best use of what is available, in this case my person has no idea of the amount of programming effort and skill needed for a good rendering, and would therefore not really ask for pursuing this or that better solution. He is just aware that it would possibly require huge and skilled work, and would need a lot of sacrifice, concentration and giving oneself to this matter.