Virtual Dhamma-Vinaya Vihara

Topic started by: Johann on February 27, 2019, 07:04:53 PM

Title: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 27, 2019, 07:04:53 PM

Aramika   *

This new topic (or these posts) was split off from [ATI.eu] CSCD xml to ati.eu format: converting, editing (http://sangham.net/index.php/topic,8672.msg18113.html#msg18113). For any additional information, please also see the topic of origin. Anumodana!
[Original post:]


Sadhu

Vandami Bhante _/\_

Ohh... btw., did Nyom Moritz (and that is just a question, not a demand in any way) run the replacement another time for the newly uploaded, previously "broken" 169 pages? Just thought there might actually be some included or missed in this match.

I just added the list of files (http://sangham.net/index.php/topic,8672.msg17960.html#msg17960) manually to the index now.

Oh, it was more about the placeholder replacement that Nyom did with a script. Indexing seems fine.


hmm...

Quote from: Moritz
I just added the list of files manually to the index now.

Could it be that the search engine's index was lost through that, Nyom Moritz? Since search is now finding only very, very few results... possibly just in the 169 files.

(As far as known: BatchEdit uses the index just to know which files there are in the wiki; it does not search the index with regard to content. Once a file is in the index, that is enough for it to be matched. But Nyom probably knows such things much better.)

It's gone: searching the search engine for the word "Buddha" gives 1 hit in all pages of ati.eu: "Cūḷavaggo @cs-rm:tika
    1 Hits, Last modified: 27 minutes ago
    allako siyā;</div> <div gathalast>Nāropetabbakaṃ buddha-vacanaṃ aññathā pana.</div> <div hangnum><span p"
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on February 27, 2019, 08:44:40 PM
Quote from: Johann
Could it be that the search engine's index was lost through that, Nyom Moritz? Since search is now finding only very, very few results... possibly just in the 169 files.

Oh... no idea how this could happen.
Trying to re-index everything again now by the same "manual" method. (Thinking it would go faster.)
Will look again in two hours how far anything has changed or if there are errors. If it is not working this way, I will have to run a complete re-index by the "normal" method which will probably need more than one day to run.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 27, 2019, 08:48:02 PM
Many errors and empty pages, although the pages have content. Various other issues are not listed for now, as that is possibly not required.

Take your time Nyom.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on February 27, 2019, 11:14:00 PM
After starting to re-index and searching for some words that should already be indexed by now (the first indexed pages, starting with cs-km), not much is found. Also, trying some random words from the index brings wrong results (e.g. searching for the romanized Pali word tanādayo finds result pages in cs-km/atthakatha, although these only contain Khmer-Pali words).

It seems the index was corrupted somehow. So I'm doing a full index rebuild now. Started at 17:11 German time.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 28, 2019, 06:26:19 AM
Sadhu
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on February 28, 2019, 09:31:00 AM
The server has been running into disk space shortage again. Hopefully the index has not been corrupted by this.
I have readjusted the disk space limits, taken away 1000 MB from sangham.net and 100 MB from zugangzureinsicht.org to allocate it to accesstoinsight.eu.
Now 11.92 GB of 12.99 GB used for accesstoinsight.eu, according to server panel stats. (sangham.net and zugangzureinsicht.org also running near the limit)

(The index does not really use huge amounts of space. Currently it is at ~65 MB, still building, about half complete at page 10000 of 20000.)


Probably unrelated to the disk space issue, there was also an error on indexing the page http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v01_att or http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v02_att:
Quote
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/clients/client2157/web5417/web/inc/indexer.php on line 614

That is not about disk space but about RAM. Apparently the page contains too many distinct words to fit in working memory. So the page could not be indexed. Not sure if that has happened for any other pages.
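For orientation, the two numbers in the error message can be converted into more readable units. This is only a small Python sketch of the arithmetic; the helper name `to_mib` is made up for illustration.

```python
def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


limit = 134217728      # "Allowed memory size" from the error message
requested = 67108872   # "tried to allocate" from the error message

print(f"limit:     {to_mib(limit):.2f} MiB")
print(f"requested: {to_mib(requested):.2f} MiB")
```

So the limit is 128 MiB of working memory, and the single allocation that failed was for about 64 MiB, on top of whatever the indexer was already holding at that moment.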


If nothing goes wrong now I think the index rebuild should be finished in another 12-14 hours.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 28, 2019, 09:46:38 AM
The two pages had been indexed before. Actually, those two pages were some of the last which Atma modified. He made the headers, one page containing 100, the other a good 50. Not sure for now what the problem could be, and whether it can be solved other than by server settings.

Indexing is also triggered by hitting a link, meaning that such a problem would arise again. Storing the old version and hitting a link, then putting the current version back, would possibly get it into the search index.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on February 28, 2019, 10:05:14 AM
Okay. Good to know.
Now the same error for http://accesstoinsight.eu/cs-th/tipitaka/sut.kn.jat.v22

Yes, page indexing should happen automatically whenever a page is opened. As far as I have read, this happens in some background process, and I don't know what else might happen there or when it will happen. Maybe that process can be very busy and things might go wrong without notice.

I tried to search for some words from http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v01_att and http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v02_att but without any results, so far.

Maybe better to not stress the server too much now while it's still indexing and test further the next days, or next week.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 28, 2019, 10:20:47 AM
Just for info: these are all pages within the 169 pages which have been replaced, and 51 of them, in cs-rm tipitaka, were replaced anew yesterday, since some notes had been misplaced. So there might be a certain confusion between the system's ways and the manual ones, and Nyom's hint is surely good. Clearing the cache by saving the admin settings often helps then.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on February 28, 2019, 05:46:05 PM
Indexing is now already finished, not needing as much time as I thought. No more fatal errors for the last ~8000 files.
The only thing I noticed is that files with capital letters in their names cannot be indexed and are also not displayed by DokuWiki, e.g. http://accesstoinsight.eu/de/lib/authors/nanamoli/PathofPurification2011. The file exists, but it seems DokuWiki wants to have everything in lower case to make it work.

I also replaced the {file}, {no} and so on in this list of files (http://sangham.net/index.php/topic,8672.msg17960.html#msg17960). Replaced, not re-indexed now.

Have not tried to search and find words in them again. Not sure if they are indexed and searchable and if re-indexing would give an error again. Have to test later.


If there are too many problems with the DokuWiki built-in search, which seems like maybe working at the limit here with this number of pages and content, maybe one could look at alternative solutions at some point. I found this topic (https://forum.dokuwiki.org/thread/14282) in the DokuWiki forum for some ideas. But I think most of the solutions there would not be supported on this server. A really strong and fast search engine for a large scale can probably not exist in only PHP and we can only use PHP on this server.

I will probably not have much time the next days. So maybe will be mostly away for one week at least.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on February 28, 2019, 07:08:38 PM
Quote from: Moritz
Indexing is now already finished, not needing as much time as I thought. No more fatal errors for the last ~8000 files.
The only thing I noticed is that files with capital letters in their names cannot be indexed and are also not displayed by DokuWiki, e.g. http://accesstoinsight.eu/de/lib/authors/nanamoli/PathofPurification2011. The file exists, but it seems DokuWiki wants to have everything in lower case to make it work.

Yes. Looks like my person had possibly overlooked some. Atma will change the case.


Quote from: Moritz
I also replaced the {file}, {no} and so on in this list of files (http://sangham.net/index.php/topic,8672.msg17960.html#msg17960). Replaced, not re-indexed now.

Have not tried to search and find words in them again. Not sure if they are indexed and searchable and if re-indexing would give an error again. Have to test later.

Sadhu

Quote from: Moritz
If there are too many problems with the DokuWiki built-in search, which seems like maybe working at the limit here with this number of pages and content, maybe one could look at alternative solutions at some point. I found this topic (https://forum.dokuwiki.org/thread/14282) in the DokuWiki forum for some ideas. But I think most of the solutions there would not be supported on this server. A really strong and fast search engine for a large scale can probably not exist in only PHP and we can only use PHP on this server.

I will probably not have much time the next days. So maybe will be mostly away for one week at least.

_/\_

The search worked fine so far, and also at the stage of implementation.

Search is set up so that the namespaces from where one stands are chosen, so normally, if the @lang is not changed, the amount is not that large. "Buddha" now gave "50541 matches on 5303 pages" on the whole wiki, for orientation on the status. Searching "div", which should probably be on all pages, gives results on more than 14,000 pages.

May Nyom always spend his time with joy on good deeds, wherever that might be.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 01, 2019, 03:00:23 PM
A BatchEdit search which should match all cs-.. pages gives 5824 page matches, where there should be 2698x4 pages. As only cs-ru is totally missing in the search, while ati pages and others can be found, it can be assumed that the indexing broke off somewhere in between.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 02, 2019, 10:47:13 AM
Since there is clear weather and good internet today (actually, when it runs, surely even faster than in Europe), Atma is trying to reindex. Just to let it be known.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 02, 2019, 06:00:04 PM
25% of the 20310 pages till now. Let's see whether the rest of battery will last till morning light and the connection does not break up over night.

May all spend a blessed day, a blessed night. (battery-saving time now)
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 03, 2019, 07:41:10 AM
Energy was off at about 7.000 pages.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 04, 2019, 08:43:25 AM
Quote from: serverpanel
sangham.net   5605.8 MB   6000 MB   6001 MB   5605.8 MB 6000 MB
zugangzureinsicht.org   689.27 MB   700 MB   701 MB   689.27 MB 700 MB
accesstoinsight.eu   13297.37 MB   13300 MB   13301 MB

My person would say that if the history storage (attic) and maintenance don't work well, it might not be possible to continue working.

So, although time will probably no longer be available later, my person has to stop all deeds once again.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on March 04, 2019, 08:01:27 PM
Quote from: Johann
Energy was off at about 7.000 pages.

I am starting a re-indexing now (complete rebuild), now that enough storage space should be freed up again. I think the last try was probably broken because of running out of storage space.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 04, 2019, 08:10:06 PM
Sadhu
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on March 05, 2019, 02:53:26 AM
Progress at ~4000/20000 pages now. (seems a lot slower than before)

http://accesstoinsight.eu/cs-km/tipitaka/sut.an02.v01 had a zero-width whitespace in the filename. So it could not be indexed before. Now the zero-width whitespace is removed.
If there have been any batchedits before which should have affected this file, they should probably be made again.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 05, 2019, 06:30:29 AM
Zero-width spaces "should" be no problem. Every page in km contains masses of them, thousands, and it would "destroy" a lot to remove them.

It is different outside of Khmer Pali, where they are not usually used.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on March 05, 2019, 08:54:47 AM
Only in the file name it was a problem. The file name was not "sut.an02.v01.txt" but "[zero whitespace]sut.an02.v01.txt", so it could not be found or indexed. Now it has been renamed (zero-whitespace removed from file name) and indexed. Nothing was changed in the content of the file.
Same also for "sut.an02.v01.txt" in cs-ru/tipitaka, cs-rm/tipitaka and cs-th/tipitaka, has now been fixed.

Only wanted to make aware that possible previous batchedits could have had no effect on the "sut.an02.v01.txt" files, because they were not indexed before. (They look fine, though.)
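File name problems like the ones reported in this thread (capital letters, invisible zero-width characters) are hard to spot by eye, so a small check could help. The following is only a Python sketch, not the script that was actually used here: the function names and the directory passed to `scan` are hypothetical, and checking for whitespace is an added assumption that it would cause the same kind of trouble.

```python
import unicodedata
from pathlib import Path


def filename_problems(name: str) -> list[str]:
    """Return reasons why a page file name may not be indexable."""
    problems = []
    if name != name.lower():
        problems.append("contains capital letters")
    if any(ch.isspace() for ch in name):
        problems.append("contains whitespace")
    # Unicode category "Cf" covers invisible format characters such as
    # U+200B (zero-width space) and U+FEFF (byte order mark).
    hidden = [f"U+{ord(ch):04X}" for ch in name
              if unicodedata.category(ch) == "Cf"]
    if hidden:
        problems.append("contains invisible characters: " + ", ".join(hidden))
    return problems


def scan(pages_dir: str) -> dict[str, list[str]]:
    """Check every .txt file below pages_dir (hypothetical path)."""
    report = {}
    for path in Path(pages_dir).rglob("*.txt"):
        found = filename_problems(path.name)
        if found:
            report[str(path)] = found
    return report
```

Calling `filename_problems("\u200bsut.an02.v01.txt")` reports the hidden U+200B, while the clean name "sut.an02.v01.txt" gives an empty list.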

Current indexing progress: about 7000 of 20000.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 05, 2019, 09:05:14 AM
Oh yes. Remembering this file now. It sometimes appeared as ?sut.an02.v01.txt while renaming... Sadhu. (Did not read the file/filename attentively.)

Related to it: Atma told Mr. Andreas about the 'general problem with zero whitespaces in filenames (https://forum.dokuwiki.org/thread/16720)' and he asked to raise a bug issue at some other place, which my person could not do.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on March 06, 2019, 03:14:18 AM
Indexing still running, now at ~18700 of 20456
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on March 06, 2019, 09:04:19 AM
Indexing finished now.

Not sure if not broken again.

The same error that happened before (http://sangham.net/index.php/topic,8672.msg18001.html#msg18001) happened again with two other files from cs-th.

Quote from: Moritz
Probably unrelated to the disk space issue, there was also an error on indexing the page http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v01_att or http://accesstoinsight.eu/cs-th/atthakatha/sut.kn.jat.v02_att:
Quote
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/clients/client2157/web5417/web/inc/indexer.php on line 614

That is not about disk space but about RAM. Apparently the page contains too many distinct words to fit in working memory. So the page could not be indexed. Not sure if that has happened for any other pages.

It happened also for:


Not sure if missed some more, but I believe that are all.

After such an error, the indexer does not work for a few minutes and every try to index fails. But seems to work again after waiting long enough. Not sure if something really bad happened during these errors which would break the index.



The following file could not be indexed before because it had a space in the file name. Now renamed (space removed) and indexed. Any BatchEdits applied to all other files would not have been applied to this file. So it still has all the HTML tags from ZzE in it etc. (link (http://www.accesstoinsight.eu/en:lib:authors:thanissaro:thinkingcure)):



The following is a list of (hopefully all) files which could not be indexed because of disallowed capital letters in the file name. Any BatchEdits applied to all other files would not have been applied to these files. So they probably still have all the HTML tags from ZzE in them etc.:



Some look like they don't belong on ATI (like Google stuff).
Some might be PDFs? (like en:lib:authors:brahmali:Warders_Lessons_Key?) Not sure.

And then there is the SLTP Tipitaka from ZzE. Maybe good trying to bring in similar format like the CSCD edition? Would probably not be easily done.


Hoping the search index works without problems now, but not sure about it.
Index size: currently 376 MB

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on March 06, 2019, 10:13:43 AM
Sadhu

Atma is aware of most of the capital-letter issues and most of the others (just took no time to fix them till now). Renaming of the capital-letter files is done according to the list. The lowercase files: ? no idea.

Quote from: Moritz
And then there is the SLTP Tipitaka from ZzE. Maybe good trying to bring in similar format like the CSCD edition?

That's another task Atma has desired to do for a long time, but it will probably take some more years. Everybody is always welcome; at least there could be an end to the problems with PTS references and others after having brought it into the standard here.

Since the most might be indexed Atma can then continue to finish the implementation of the CSCD-files, correcting the last xml-tags and the sutta-codes.

Title: [ati.eu] Indexing, search engine
Post by: Johann on March 06, 2019, 01:30:16 PM
Given what was heard about indexing, Atma thought to give it its own topic (hopefully not twice).

A moderator on DW-forum generously gave a hint in the topic Not public pages, search-index renew, search engines (https://forum.dokuwiki.org/post/65165) to

Quote from: https://forum.dokuwiki.org/post/65161
If you have shell access on your server, you can try to use the indexer-script, which is in the bin-directory of your wiki: https://www.dokuwiki.org/cli#indexerphp

Not so much into that, may it be of help for the indexing issues, Nyom Moritz .
Title: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 25, 2019, 10:08:33 AM

Aramika   *

This new topic (or these posts) was split off from [ATI.eu] CSCD xml to ati.eu format: converting, editing (http://sangham.net/index.php/topic,8672.msg18472.html#msg18472). For any additional information, please also see the topic of origin. Anumodana!
[Original post:]


After 48 h, 9600 pages were indexed; then the energy supply failed because of clouds, and the connection as well. Atma may try again later.

Since the weather is cool, it's possibly good to work on giving conditions here by "finishing" certain buildings and dwellings in the coming time, with what has been given or left behind.
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 26, 2019, 05:58:09 PM
Atma just started another try to "update index". Let's see.
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 27, 2019, 07:46:51 PM
Indexing has finished, but a quick search with batchedit shows that not all pages might be included in the search index for now. Let's see.
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Moritz on March 28, 2019, 02:22:00 PM
If it is helpful at this time, I could start the process to rebuild the search index from here, which should be finished after one day.

So, if it is not necessary to search much in the next 24 hours, I would just start to re-index now.

_/\_

/me I think there might be solutions with more professional search engines that would make it less time consuming to keep a search index working and maintained. Maybe I can find a possibility with some time. But that is another topic. _/\_
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 28, 2019, 02:32:56 PM
Currently not using search or batchedit, so however Nyom thinks best.

(There is a built-in indexer.php; it was told that it can be executed directly on the server to rebuild the index. Maybe that helps. https://www.dokuwiki.org/cli#indexerphp )
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Moritz on March 28, 2019, 02:48:11 PM
Quote from: Johann
Currently not using search or batchedit, so however Nyom thinks best.

(There is a built-in indexer.php; it was told that it can be executed directly on the server to rebuild the index. Maybe that helps. https://www.dokuwiki.org/cli#indexerphp )

Rebuilding index started.

The helper scripts listed on https://www.dokuwiki.org/cli are only usable if one has shell access on the server. But that is not the case for the Greensta server here. (But still possibly useful to look into and adapt something maybe when having more time for it.) So just using the previous approach now.

_/\_
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 28, 2019, 02:51:04 PM
Sadhu
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Moritz on March 29, 2019, 02:25:48 PM
I accidentally restarted rebuilding the index again from scratch. So now, progress is again at about 5000/20000 pages.

I wrote a new script, adapting methods from the CLI script (https://www.dokuwiki.org/cli), so that the whole process would run on the server, not needing to have a connection and open browser window all the time to send commands for every single page to be indexed one by one.
This should at least be a little bit faster, without the sending commands and responses back and forth, but the speed difference is not really noticeable. So it should, again, be finished in one day.

The current progress can be seen by opening http://accesstoinsight.eu/indexer.success.log (listing pages that were indexed successfully) and http://accesstoinsight.eu/indexer.error.log (listing pages which could not be indexed for some reason, currently empty).
There is a counting number before each page name in the lists, so one can see how many pages have already been processed.
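Reading the progress off such a log could also be automated. The exact line format of the log files is not shown in this thread, so the Python sketch below simply assumes one entry per line, with the counting number first and the page name after it; the function name `progress` and that format are assumptions.

```python
import re

# Assumed line format: "<counter> <page name>", e.g. "4021 cs-km:tipitaka:x".
LINE = re.compile(r"^\s*(\d+)\W+(\S+)")


def progress(log_text: str, total: int) -> tuple[int, float]:
    """Return the highest counter seen and the percentage of total pages."""
    last = 0
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            last = max(last, int(match.group(1)))
    percent = 100.0 * last / total if total else 0.0
    return last, percent
```

With a log of three entries and a total of 20 pages, `progress(log, 20)` would give `(3, 15.0)`.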

_/\_
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 29, 2019, 03:44:28 PM
Sadhu
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Moritz on March 30, 2019, 02:48:15 PM
The indexing script I had started on the server (which should be doing just the same as the CLI indexer script) stopped at some point due to running out of memory (working memory, not storage memory). It seems that certain pages simply cannot be indexed because the indexer would need too much memory for it.
For example http://accesstoinsight.eu/cs-th:tika:sut.dn.0_tik and following pages always fail with
Code: [Select]
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/clients/client2157/web5417/web/inc/indexer.php on line 612
or similar.

Line 612 is here:
Code: [Select]
$wordlist = explode(' ', $text);
splitting the whole text of a page into single words by spaces.

But I really do not understand why this would take so much memory. Also, replicating this same operation on my computer, splitting the same page text with the same methods into single words and storing in a variable in PHP, does not need nearly as much memory here.

Trying to find a way to work around it, I gave up now.

Continued indexing with the other method (which runs locally on my computer and sends a command for every single page to be indexed through the network, and does not stop if a page fails to be indexed), currently indexed until ~11000 pages (with many "holes" of pages which just cannot be indexed with the current server).
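The idea of this second method (one request per page, carrying on when a page fails) can be sketched roughly as below. This is not the actual script: `index_page` stands for whatever sends the indexing command for one page to the server and is passed in as a plain function, and `retries` and `wait_seconds` are made-up parameters.

```python
import time
from typing import Callable, Iterable


def index_all(pages: Iterable[str],
              index_page: Callable[[str], bool],
              retries: int = 2,
              wait_seconds: float = 0.0) -> list[str]:
    """Index pages one by one; collect failures instead of stopping."""
    failed = []
    for page in pages:
        for attempt in range(1 + retries):
            try:
                if index_page(page):
                    break          # indexed, go on to the next page
            except Exception:
                pass               # treat any error like a failed attempt
            if attempt < retries:
                time.sleep(wait_seconds)
        else:
            # No attempt succeeded: remember the page and carry on.
            failed.append(page)
    return failed
```

The returned list corresponds to the "holes" mentioned above. A longer `wait_seconds` might help in cases where the indexer needs some time to recover after a fatal error.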

Should be finished in some 16 hours maybe if now just let to run. But with the current server infrastructure it seems the search index will always be incomplete.

_/\_
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on March 30, 2019, 04:07:07 PM
Sadhu for effort and care. May Nyom always give/take himself his time.

(The "big pages", Atma thinks about 10 %, like the others of the CSCD Tipitaka, would not change later on with regard to content. Atma remembers that back when there was still a search engine on ZzE, it was likewise never possible to index all Pali Tipitaka pages of the original ATI; there were always errors.

On the other side, on ZzE once and also now on ati.eu, there have been times where the index was obviously complete.)

Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Moritz on April 01, 2019, 12:23:59 AM
Quote
May Nyom always give/take himself his time.
_/\_

Indexing finished some time this morning.

Quote
On the other side, on ZzE once and also now on ati.eu, there have been times where the index was obviously complete.)
Obviously (offensichtlich)? Or apparently (offenbar, scheinbar, anscheinend)?

I think maybe the latter, because these errors would never appear in the Searchindex Manager plugin. It would just say "page already up to date" or something, when a page could not be indexed.

After retrying several times to index the files which failed, all files which still could not be indexed are just 474 pages in Thai script (listed below). I think the reason is the way DokuWiki handles some Asian scripts, including Thai, treating every character as a single word, which would take a lot of memory for the indexer. Quote from inc/indexer.php file, line 18 and following:
Code: [Select]
// Asian characters are handled as words. The following regexp defines the
// Unicode-Ranges for Asian characters
// Ranges taken from http://en.wikipedia.org/wiki/Unicode_block
// I'm no language expert. If you think some ranges are wrongly chosen or
// a range is missing, please contact me
define('IDX_ASIAN1','[\x{0E00}-\x{0E7F}]'); // Thai
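To illustrate what "handled as words" would mean for the number of words per page, here is a small Python sketch. It assumes, going only by the comment quoted above, that each character in the Thai range is separated out as its own word before the text is split by spaces; the function name `tokenize` is made up and this is not DokuWiki's actual code.

```python
import re

# Same Unicode range as IDX_ASIAN1 in the quoted PHP (Thai block).
THAI = re.compile(r"([\u0E00-\u0E7F])")


def tokenize(text: str) -> list[str]:
    """Put spaces around every Thai character, then split on whitespace."""
    spaced = THAI.sub(r" \1 ", text)
    return spaced.split()


roman = "buddho"
# The same Pali word in Thai script: six code points, vowel signs included.
thai = "\u0e1e\u0e38\u0e17\u0e3a\u0e42\u0e18"

print(len(tokenize(roman)), "word(s) for the romanized form")
print(len(tokenize(thai)), "word(s) for the Thai-script form")
```

Under this assumption, a page in Thai script yields one entry per character instead of one per word, so the word list for a long page grows many times larger than for the same text in roman script.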

I have deleted all files in en:s and de:s which were just examples on how to integrate Google Site Search and comments about other search engines tested by Mr. Bullitt for accesstoinsight.org in the past.

List of unindexed Thai script files:
Code: [Select]
cs-th:atthakatha:sut.kn.jat.v01_att
cs-th:atthakatha:sut.kn.jat.v02_att
cs-th:atthakatha:sut.kn.jat.v03_att
cs-th:atthakatha:sut.kn.jat.v04_att
cs-th:atthakatha:sut.kn.jat.v05_att
cs-th:atthakatha:sut.kn.jat.v06_att
cs-th:atthakatha:sut.kn.jat.v07_att
cs-th:atthakatha:sut.kn.jat.v08_att
cs-th:atthakatha:sut.kn.jat.v09_att
cs-th:atthakatha:sut.kn.jat.v10_att
cs-th:atthakatha:sut.kn.jat.v11_att
cs-th:atthakatha:sut.kn.jat.v12_att
cs-th:atthakatha:sut.kn.jat.v13_att
cs-th:atthakatha:sut.kn.jat.v14_att
cs-th:atthakatha:sut.kn.jat.v15_att
cs-th:atthakatha:sut.kn.jat.v16_att
cs-th:atthakatha:sut.kn.jat.v17_att
cs-th:atthakatha:sut.kn.jat.v18_att
cs-th:atthakatha:sut.kn.jat.v19_att
cs-th:atthakatha:sut.kn.jat.v20_att
cs-th:atthakatha:sut.kn.jat.v21_att
cs-th:atthakatha:sut.kn.jat.v22_att
cs-th:atthakatha:sut.kn.jat.v23_att
cs-th:atthakatha:sut.kn.khp.0_att
cs-th:atthakatha:sut.kn.khp.1_att
cs-th:atthakatha:sut.kn.khp.2_att
cs-th:atthakatha:sut.kn.khp.3_att
cs-th:atthakatha:sut.kn.khp.4_att
cs-th:atthakatha:sut.kn.khp.5_att
cs-th:atthakatha:sut.kn.khp.6_att
cs-th:atthakatha:sut.kn.khp.7_att
cs-th:atthakatha:sut.kn.khp.8_att
cs-th:atthakatha:sut.kn.khp.9_att
cs-th:atthakatha:sut.kn.man.00_att
cs-th:atthakatha:sut.kn.man.01_att
cs-th:atthakatha:sut.kn.man.02_att
cs-th:atthakatha:sut.kn.man.03_att
cs-th:atthakatha:sut.kn.man.04_att
cs-th:atthakatha:sut.kn.man.05_att
cs-th:atthakatha:sut.kn.man.06_att
cs-th:atthakatha:sut.kn.man.07_att
cs-th:atthakatha:sut.kn.man.08_att
cs-th:atthakatha:sut.kn.man.09_att
cs-th:atthakatha:sut.kn.man.10_att
cs-th:atthakatha:sut.kn.man.11_att
cs-th:atthakatha:sut.kn.man.12_att
cs-th:atthakatha:sut.kn.man.13_att
cs-th:atthakatha:sut.kn.man.14_att
cs-th:atthakatha:sut.kn.man.15_att
cs-th:atthakatha:sut.kn.man.16_att
cs-th:atthakatha:sut.kn.net.0_att
cs-th:atthakatha:sut.kn.net.1_att
cs-th:atthakatha:sut.kn.net.2_att
cs-th:atthakatha:sut.kn.net.3_att
cs-th:atthakatha:sut.kn.net.4_att
cs-th:atthakatha:sut.kn.net.5_att
cs-th:atthakatha:sut.kn.net.6_att
cs-th:atthakatha:sut.kn.pat.v0_att
cs-th:atthakatha:sut.kn.pat.v1.01_att
cs-th:atthakatha:sut.kn.pat.v1.02_att
cs-th:atthakatha:sut.kn.pat.v1.03_att
cs-th:atthakatha:sut.kn.pat.v1.04_att
cs-th:atthakatha:sut.kn.pat.v1.05_att
cs-th:atthakatha:sut.kn.pat.v1.06_att
cs-th:atthakatha:sut.kn.pat.v1.07_att
cs-th:atthakatha:sut.kn.pat.v1.08_att
cs-th:atthakatha:sut.kn.pat.v1.09_att
cs-th:atthakatha:sut.kn.pat.v1.10_att
cs-th:atthakatha:sut.kn.pat.v1_att
cs-th:atthakatha:sut.kn.pat.v2_att
cs-th:atthakatha:sut.kn.pat.v3.01_att
cs-th:atthakatha:sut.kn.pat.v3.02_att
cs-th:atthakatha:sut.kn.pat.v3.03_att
cs-th:atthakatha:sut.kn.pat.v3.04_att
cs-th:atthakatha:sut.kn.pat.v3.05_att
cs-th:atthakatha:sut.kn.pat.v3.06_att
cs-th:atthakatha:sut.kn.pat.v3.07_att
cs-th:atthakatha:sut.kn.pat.v3.08_att
cs-th:atthakatha:sut.kn.pat.v3.09_att
cs-th:atthakatha:sut.kn.pat.v3.10_att
cs-th:atthakatha:sut.kn.pat.v3_att
cs-th:atthakatha:sut.kn.pev.0_att
cs-th:atthakatha:sut.kn.pev.1_att
cs-th:atthakatha:sut.kn.pev.2_att
cs-th:atthakatha:sut.kn.pev.3_att
cs-th:atthakatha:sut.kn.pev.4_att
cs-th:atthakatha:sut.kn.snp.1_att
cs-th:atthakatha:sut.kn.snp.2_att
cs-th:atthakatha:sut.kn.snp.3_att
cs-th:atthakatha:sut.kn.snp.4_att
cs-th:atthakatha:sut.kn.snp.5_att
cs-th:atthakatha:sut.kn.tha.00_att
cs-th:atthakatha:sut.kn.tha.01_att
cs-th:atthakatha:sut.kn.tha.02_att
cs-th:atthakatha:sut.kn.tha.03_att
cs-th:atthakatha:sut.kn.tha.04_att
cs-th:atthakatha:sut.kn.tha.05_att
cs-th:atthakatha:sut.kn.tha.06_att
cs-th:atthakatha:sut.kn.tha.07_att
cs-th:atthakatha:sut.kn.tha.08_att
cs-th:atthakatha:sut.kn.tha.09_att
cs-th:atthakatha:sut.kn.tha.10_att
cs-th:atthakatha:sut.kn.tha.11_att
cs-th:atthakatha:sut.kn.tha.12_att
cs-th:atthakatha:sut.kn.tha.13_att
cs-th:atthakatha:sut.kn.tha.14_att
cs-th:atthakatha:sut.kn.tha.15_att
cs-th:atthakatha:sut.kn.tha.16_att
cs-th:atthakatha:sut.kn.tha.17_att
cs-th:atthakatha:sut.kn.tha.18_att
cs-th:atthakatha:sut.kn.tha.19_att
cs-th:atthakatha:sut.kn.tha.20_att
cs-th:atthakatha:sut.kn.tha.21_att
cs-th:atthakatha:sut.kn.thi.01_att
cs-th:atthakatha:sut.kn.thi.02_att
cs-th:atthakatha:sut.kn.thi.03_att
cs-th:atthakatha:sut.kn.thi.04_att
cs-th:atthakatha:sut.kn.thi.05_att
cs-th:atthakatha:sut.kn.thi.06_att
cs-th:atthakatha:sut.kn.thi.07_att
cs-th:atthakatha:sut.kn.thi.08_att
cs-th:atthakatha:sut.kn.thi.09_att
cs-th:atthakatha:sut.kn.thi.10_att
cs-th:atthakatha:sut.kn.thi.11_att
cs-th:atthakatha:sut.kn.thi.12_att
cs-th:atthakatha:sut.kn.thi.13_att
cs-th:atthakatha:sut.kn.thi.14_att
cs-th:atthakatha:sut.kn.thi.15_att
cs-th:atthakatha:sut.kn.thi.16_att
cs-th:atthakatha:sut.kn.uda.0_att
cs-th:atthakatha:sut.kn.uda.1_att
cs-th:atthakatha:sut.kn.uda.2_att
cs-th:atthakatha:sut.kn.uda.3_att
cs-th:atthakatha:sut.kn.uda.4_att
cs-th:atthakatha:sut.kn.uda.5_att
cs-th:atthakatha:sut.kn.uda.6_att
cs-th:atthakatha:sut.kn.uda.7_att
cs-th:atthakatha:sut.kn.uda.8_att
cs-th:atthakatha:sut.kn.viv.v0_att
cs-th:atthakatha:sut.kn.viv.v1_att
cs-th:atthakatha:sut.kn.viv.v2_att
cs-th:atthakatha:sut.mn.v00_att
cs-th:atthakatha:sut.mn.v01_att
cs-th:atthakatha:sut.mn.v02_att
cs-th:atthakatha:sut.mn.v03_att
cs-th:atthakatha:sut.mn.v04_att
cs-th:atthakatha:sut.mn.v05_att
cs-th:atthakatha:sut.mn.v06_att
cs-th:atthakatha:sut.mn.v07_att
cs-th:atthakatha:sut.mn.v08_att
cs-th:atthakatha:sut.mn.v09_att
cs-th:atthakatha:sut.mn.v10_att
cs-th:atthakatha:sut.mn.v11_att
cs-th:atthakatha:sut.mn.v12_att
cs-th:atthakatha:sut.mn.v13_att
cs-th:atthakatha:sut.mn.v14_att
cs-th:atthakatha:sut.mn.v15_att
cs-th:atthakatha:sut.sn.00_att
cs-th:atthakatha:sut.sn.01_att
cs-th:atthakatha:sut.sn.02_att
cs-th:atthakatha:sut.sn.03_att
cs-th:atthakatha:sut.sn.04_att
cs-th:atthakatha:sut.sn.05_att
cs-th:atthakatha:sut.sn.06_att
cs-th:atthakatha:sut.sn.07_att
cs-th:atthakatha:sut.sn.08_att
cs-th:atthakatha:sut.sn.09_att
cs-th:atthakatha:sut.sn.10_att
cs-th:atthakatha:sut.sn.11_att
cs-th:atthakatha:sut.sn.12_att
cs-th:atthakatha:sut.sn.13_att
cs-th:atthakatha:sut.sn.14_att
cs-th:atthakatha:sut.sn.15_att
cs-th:atthakatha:sut.sn.16_att
cs-th:atthakatha:sut.sn.17_att
cs-th:atthakatha:sut.sn.18_att
cs-th:atthakatha:sut.sn.19_att
cs-th:atthakatha:sut.sn.20_att
cs-th:atthakatha:sut.sn.21_att
cs-th:atthakatha:sut.sn.22_att
cs-th:atthakatha:sut.sn.23_att
cs-th:atthakatha:sut.sn.24_att
cs-th:atthakatha:sut.sn.25_att
cs-th:atthakatha:sut.sn.26_att
cs-th:atthakatha:sut.sn.27_att
cs-th:atthakatha:sut.sn.28_att
cs-th:atthakatha:sut.sn.29_att
cs-th:atthakatha:sut.sn.30_att
cs-th:atthakatha:sut.sn.31_att
cs-th:atthakatha:sut.sn.32_att
cs-th:atthakatha:sut.sn.33_att
cs-th:atthakatha:sut.sn.34_att
cs-th:atthakatha:sut.sn.35_att
cs-th:atthakatha:sut.sn.36_att
cs-th:atthakatha:sut.sn.37_att
cs-th:atthakatha:sut.sn.38_att
cs-th:atthakatha:sut.sn.39_att
cs-th:atthakatha:sut.sn.40_att
cs-th:atthakatha:sut.sn.41_att
cs-th:atthakatha:sut.sn.42_att
cs-th:atthakatha:sut.sn.43_att
cs-th:atthakatha:sut.sn.44_att
cs-th:atthakatha:sut.sn.45_att
cs-th:atthakatha:sut.sn.46_att
cs-th:atthakatha:sut.sn.47_att
cs-th:atthakatha:sut.sn.48_att
cs-th:atthakatha:sut.sn.49_att
cs-th:atthakatha:sut.sn.50_att
cs-th:atthakatha:sut.sn.51_att
cs-th:atthakatha:sut.sn.52_att
cs-th:atthakatha:sut.sn.53_att
cs-th:atthakatha:sut.sn.54_att
cs-th:atthakatha:sut.sn.55_att
cs-th:atthakatha:sut.sn.56_att
cs-th:atthakatha:vin.cv.01_att
cs-th:atthakatha:vin.cv.02_att
cs-th:atthakatha:vin.cv.03_att
cs-th:atthakatha:vin.cv.04_att
cs-th:atthakatha:vin.cv.05_att
cs-th:atthakatha:vin.cv.06_att
cs-th:atthakatha:vin.cv.07_att
cs-th:atthakatha:vin.cv.08_att
cs-th:atthakatha:vin.cv.09_att
cs-th:atthakatha:vin.cv.10_att
cs-th:atthakatha:vin.cv.11_att
cs-th:atthakatha:vin.cv.12_att
cs-th:atthakatha:vin.mv.01_att
cs-th:atthakatha:vin.mv.02_att
cs-th:atthakatha:vin.mv.03_att
cs-th:atthakatha:vin.mv.04_att
cs-th:atthakatha:vin.mv.05_att
cs-th:atthakatha:vin.mv.06_att
cs-th:atthakatha:vin.mv.07_att
cs-th:atthakatha:vin.mv.08_att
cs-th:atthakatha:vin.mv.09_att
cs-th:atthakatha:vin.mv.10_att
cs-th:atthakatha:vin.pac.ak_att
cs-th:atthakatha:vin.pac.nii_att
cs-th:atthakatha:vin.pac.pc_att
cs-th:atthakatha:vin.pac.pci_att
cs-th:atthakatha:vin.pac.pd_att
cs-th:atthakatha:vin.pac.pdi_att
cs-th:atthakatha:vin.pac.pri_att
cs-th:atthakatha:vin.pac.sgi_att
cs-th:atthakatha:vin.pac.sk_att
cs-th:atthakatha:vin.par.ay_att
cs-th:atthakatha:vin.par.ga_att
cs-th:atthakatha:vin.par.ni_att
cs-th:atthakatha:vin.par.pr_att
cs-th:atthakatha:vin.par.sg_att
cs-th:atthakatha:vin.par.ve_att
cs-th:atthakatha:vin.pv.01_att
cs-th:atthakatha:vin.pv.02_att
cs-th:atthakatha:vin.pv.03_att
cs-th:atthakatha:vin.pv.04_att
cs-th:atthakatha:vin.pv.05_att
cs-th:atthakatha:vin.pv.06_att
cs-th:atthakatha:vin.pv.07_att
cs-th:atthakatha:vin.pv.08_att
cs-th:atthakatha:vin.pv.09_att
cs-th:atthakatha:vin.pv.10_att
cs-th:atthakatha:vin.pv.11_att
cs-th:atthakatha:vin.pv.12_att
cs-th:atthakatha:vin.pv.13_att
cs-th:atthakatha:vin.pv.14_att
cs-th:atthakatha:vin.pv.15_att
cs-th:atthakatha:vin.pv.16_att
cs-th:atthakatha:vin.pv.17_att
cs-th:atthakatha:vin.pv.18_att
cs-th:tika:abh.ava-pura.01_tik
cs-th:tika:abh.ava-pura.02_tik
cs-th:tika:abh.ava-pura.03_tik
cs-th:tika:abh.ava-pura.04_tik
cs-th:tika:abh.ava-pura.05_tik
cs-th:tika:abh.ava-pura.06_tik
cs-th:tika:abh.ava-pura.07_tik
cs-th:tika:abh.ava-pura.08_tik
cs-th:tika:abh.ava-pura.09_tik
cs-th:tika:abh.ava-pura.10_tik
cs-th:tika:abh.ava-pura.11_tik
cs-th:tika:sut.dn.01_abh_tik
cs-th:tika:sut.dn.01_tik
cs-th:tika:sut.dn.02_abh_tik
cs-th:tika:sut.dn.02_tik
cs-th:tika:sut.dn.03_abh_tik
cs-th:tika:sut.dn.03_tik
cs-th:tika:sut.dn.04_abh_tik
cs-th:tika:sut.dn.04_tik
cs-th:tika:sut.dn.05_abh_tik
cs-th:tika:sut.dn.05_tik
cs-th:tika:sut.dn.06_abh_tik
cs-th:tika:sut.dn.06_tik
cs-th:tika:sut.dn.07_abh_tik
cs-th:tika:sut.dn.07_tik
cs-th:tika:sut.dn.08_abh_tik
cs-th:tika:sut.dn.08_tik
cs-th:tika:sut.dn.09_abh_tik
cs-th:tika:sut.dn.09_tik
cs-th:tika:sut.dn.0_tik
cs-th:tika:sut.dn.10_abh_tik
cs-th:tika:sut.dn.10_tik
cs-th:tika:sut.dn.11_abh_tik
cs-th:tika:sut.dn.11_tik
cs-th:tika:sut.dn.12_abh_tik
cs-th:tika:sut.dn.12_tik
cs-th:tika:sut.dn.13_abh_tik
cs-th:tika:sut.dn.13_tik
cs-th:tika:sut.dn.14_tik
cs-th:tika:sut.dn.15_tik
cs-th:tika:sut.dn.16_tik
cs-th:tika:sut.dn.17_tik
cs-th:tika:sut.dn.18_tik
cs-th:tika:sut.dn.19_tik
cs-th:tika:sut.dn.20_tik
cs-th:tika:sut.dn.21_tik
cs-th:tika:sut.dn.22_tik
cs-th:tika:sut.dn.23_tik
cs-th:tika:sut.dn.24_tik
cs-th:tika:sut.dn.25_tik
cs-th:tika:sut.dn.26_tik
cs-th:tika:sut.dn.27_tik
cs-th:tika:sut.dn.28_tik
cs-th:tika:sut.dn.29_tik
cs-th:tika:sut.dn.30_tik
cs-th:tika:sut.dn.31_tik
cs-th:tika:sut.dn.32_tik
cs-th:tika:sut.dn.33_tik
cs-th:tika:sut.dn.34_tik
cs-th:tika:sut.kn.paka00_tik
cs-th:tika:sut.kn.paka01_tik
cs-th:tika:sut.kn.paka02_tik
cs-th:tika:sut.kn.paka03_tik
cs-th:tika:sut.kn.paka04_tik
cs-th:tika:sut.kn.paka05_tik
cs-th:tika:sut.kn.paka06_tik
cs-th:tika:sut.kn.vibh01_tik
cs-th:tika:sut.kn.vibh02_tik
cs-th:tika:sut.kn.vibh03_tik
cs-th:tika:sut.kn.vibh04_tik
cs-th:tika:sut.kn.vibh05_tik
cs-th:tika:sut.kn.vibh06_tik
cs-th:tika:sut.mn.0_tik
cs-th:tika:sut.mn.v01_tik
cs-th:tika:sut.mn.v02_tik
cs-th:tika:sut.mn.v03_tik
cs-th:tika:sut.mn.v04_tik
cs-th:tika:sut.mn.v05_tik
cs-th:tika:sut.mn.v06_tik
cs-th:tika:sut.mn.v07_tik
cs-th:tika:sut.mn.v08_tik
cs-th:tika:sut.mn.v09_tik
cs-th:tika:sut.mn.v10_tik
cs-th:tika:sut.mn.v11_tik
cs-th:tika:sut.mn.v12_tik
cs-th:tika:sut.mn.v13_tik
cs-th:tika:sut.mn.v14_tik
cs-th:tika:sut.mn.v15_tik
cs-th:tika:sut.sn.01_tik
cs-th:tika:sut.sn.02_tik
cs-th:tika:sut.sn.03_tik
cs-th:tika:sut.sn.04_tik
cs-th:tika:sut.sn.05_tik
cs-th:tika:sut.sn.06_tik
cs-th:tika:sut.sn.07_tik
cs-th:tika:sut.sn.08_tik
cs-th:tika:sut.sn.09_tik
cs-th:tika:sut.sn.0_tik
cs-th:tika:sut.sn.10_tik
cs-th:tika:sut.sn.11_tik
cs-th:tika:sut.sn.12_tik
cs-th:tika:sut.sn.13_tik
cs-th:tika:sut.sn.14_tik
cs-th:tika:sut.sn.15_tik
cs-th:tika:sut.sn.16_tik
cs-th:tika:sut.sn.17_tik
cs-th:tika:sut.sn.18_tik
cs-th:tika:sut.sn.19_tik
cs-th:tika:sut.sn.20_tik
cs-th:tika:sut.sn.21_tik
cs-th:tika:sut.sn.22_tik
cs-th:tika:sut.sn.23_tik
cs-th:tika:sut.sn.24_tik
cs-th:tika:sut.sn.25_tik
cs-th:tika:sut.sn.26_tik
cs-th:tika:sut.sn.27_tik
cs-th:tika:sut.sn.28_tik
cs-th:tika:sut.sn.29_tik
cs-th:tika:sut.sn.30_tik
cs-th:tika:sut.sn.31_tik
cs-th:tika:sut.sn.32_tik
cs-th:tika:sut.sn.33_tik
cs-th:tika:sut.sn.34_tik
cs-th:tika:sut.sn.35_tik
cs-th:tika:sut.sn.36_tik
cs-th:tika:sut.sn.37_tik
cs-th:tika:sut.sn.38_tik
cs-th:tika:sut.sn.39_tik
cs-th:tika:sut.sn.40_tik
cs-th:tika:sut.sn.41_tik
cs-th:tika:sut.sn.42_tik
cs-th:tika:sut.sn.43_tik
cs-th:tika:sut.sn.44_tik
cs-th:tika:sut.sn.45_tik
cs-th:tika:sut.sn.46_tik
cs-th:tika:sut.sn.47_tik
cs-th:tika:sut.sn.48_tik
cs-th:tika:sut.sn.49_tik
cs-th:tika:sut.sn.50_tik
cs-th:tika:sut.sn.51_tik
cs-th:tika:sut.sn.52_tik
cs-th:tika:sut.sn.53_tik
cs-th:tika:sut.sn.54_tik
cs-th:tika:sut.sn.55_tik
cs-th:tika:sut.sn.56_tik
cs-th:tika:vin.bhi.0_dvem_tik
cs-th:tika:vin.bhi.0_kank_tik
cs-th:tika:vin.bhi.0_vima_tik
cs-th:tika:vin.bhi.v_dvem_tik
cs-th:tika:vin.bhu.0_dvem_tik
cs-th:tika:vin.bhu.0_kank_tik
cs-th:tika:vin.bhu.ni_kank_tik
cs-th:tika:vin.bhu.pc_kank_tik
cs-th:tika:vin.bhu.pr_kank_tik
cs-th:tika:vin.bhu.sg_kank_tik
cs-th:tika:vin.cv.01_sara_tik
cs-th:tika:vin.cv.01_vima_tik
cs-th:tika:vin.cv.02_sara_tik
cs-th:tika:vin.cv.02_vima_tik
cs-th:tika:vin.cv.03_sara_tik
cs-th:tika:vin.cv.03_vima_tik
cs-th:tika:vin.cv.04_sara_tik
cs-th:tika:vin.cv.04_vima_tik
cs-th:tika:vin.cv.05_sara_tik
cs-th:tika:vin.cv.05_vima_tik
cs-th:tika:vin.cv.06_sara_tik
cs-th:tika:vin.cv.06_vima_tik
cs-th:tika:vin.cv.07_sara_tik
cs-th:tika:vin.cv.07_vima_tik
cs-th:tika:vin.cv.08_sara_tik
cs-th:tika:vin.cv.08_vima_tik
cs-th:tika:vin.cv.09_sara_tik
cs-th:tika:vin.cv.09_vima_tik
cs-th:tika:vin.cv.0_paci_tik
cs-th:tika:vin.cv.0_sara_tik
cs-th:tika:vin.cv.0_vaji_tik
cs-th:tika:vin.cv.0_vima_tik
cs-th:tika:vin.cv.10_sara_tik
cs-th:tika:vin.cv.10_vima_tik
cs-th:tika:vin.cv.11_sara_tik
cs-th:tika:vin.cv.11_vima_tik
cs-th:tika:vin.cv.12_sara_tik
cs-th:tika:vin.cv.12_vima_tik
cs-th:tika:vin.kank.0_kank_tik
cs-th:tika:vin.kankha.0_dvem_tik
cs-th:tika:vin.khud.01_khud_tik
cs-th:tika:vin.khud.02_khud_tik
cs-th:tika:vin.pac.pci_vima_tik
cs-th:tika:vin.vila.08_vila_tik
cs-th:tika:vin.vila.09_vila_tik
cs-th:tika:vin.vila.10_vila_tik
cs-th:tika:vin.vila.11_vila_tik
cs-th:tika:vin.vila.12_vila_tik
cs-th:tika:vin.vila.13_vila_tik
cs-th:tika:vin.vila.14_vila_tik
cs-th:tika:vin.vila.15_vila_tik
cs-th:tika:vin.vila.16_vila_tik
cs-th:tika:vin.vila.17_vila_tik
cs-th:tika:vin.vila.18_vila_tik
cs-th:tika:vin.vila.19_vila_tik
cs-th:tika:vin.vila.20_vila_tik
cs-th:tika:vin.vila.21_vila_tik
cs-th:tika:vin.vila.22_vila_tik
cs-th:tika:vin.vila.23_vila_tik
cs-th:tika:vin.vila.24_vila_tik

_/\_
Title: Re: from: [ATI.eu] CSCD xml to ati.eu format: converting, editing
Post by: Johann on April 01, 2019, 06:17:50 AM
Sadhu

Atma noticed that the search works less well for scripts other than Latin, yet does not understand why certain ranges are separated character by character.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on April 01, 2019, 06:34:00 PM
Atma has attached the Khmer and Thai Unicode tables so that special characters like stops, computations ... can possibly be excluded; breaking words into single characters seems to be meaningless.

Not sure if Upasaka Vorapol may like to assist here for Thai. Atma will try to list special characters in Khmer but not sure now how the indexer or better the search engine works with compunctions and so on generally (simply cut them all away?)
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on April 01, 2019, 07:19:51 PM
17D4 KHMER SIGN KHAN • functions as a full stop, period
(→ 0E2F   thai character paiyannoi
→ 104A   myanmar sign little section)

17D5 KHMER SIGN BARIYOOSAN • indicates the end of a section or a text
(→ 0E5A   thai character angkhankhu
→ 104B   myanmar sign section)

17D6 KHMER SIGN CAMNUC PII KUUH • functions as colon • the preferred transliteration is camnoc pii kuuh
(→ 00F7 ÷  division sign
→ 0F14   tibetan mark gter tsheg )

17D7 KHMER SIGN LEK TOO • repetition sign
(→ 0E46   thai character maiyamok )

17D8 KHMER SIGN BEYYAL • et cetera
• use of this character is discouraged; other abbreviations for et cetera also exist • preferred spelling: ។ល។

17D9 KHMER SIGN PHNAEK MUAN • indicates the beginning of a book or a treatise • the preferred transliteration is phnek moan
(→ 0E4F   thai character fongman )

17DA KHMER SIGN KOOMUUT • indicates the end of a book or treatise • this forms a pair with 17D9   • the preferred transliteration is koomoot
(→ 0E5B   thai character khomut )

17DB KHMER CURRENCY SYMBOL RIEL

17E0 KHMER DIGIT ZERO
17E1 KHMER DIGIT ONE
17E2 KHMER DIGIT TWO
17E3 KHMER DIGIT THREE
17E4 KHMER DIGIT FOUR
17E5 KHMER DIGIT FIVE
17E6 KHMER DIGIT SIX
17E7 KHMER DIGIT SEVEN
17E8 KHMER DIGIT EIGHT
17E9 KHMER DIGIT NINE

0E50 THAI DIGIT ZERO
0E51 THAI DIGIT ONE
0E52 THAI DIGIT TWO
0E53 THAI DIGIT THREE
0E54 THAI DIGIT FOUR
0E55 THAI DIGIT FIVE
0E56 THAI DIGIT SIX
0E57 THAI DIGIT SEVEN
0E58 THAI DIGIT EIGHT
0E59 THAI DIGIT NINE

Word breaks are either white spaces or zero width spaces in both scripts, Khmer and Thai.
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Moritz on April 01, 2019, 09:58:43 PM
Sadhu, I was just reading the searching code, trying to understand a bit how it works.

Probably the breaking into single characters idea came from Chinese or Japanese, where I think characters usually represent a complete word. Maybe Thai was included by mistake, thinking that all Asian scripts use such logographic "full word" characters.

I think it is best to just remove the Thai block from the "Asian" set and treat it "normally" like Roman etc. The Khmer block (1780–17FF) is not included there either, and search is working well. The most important change needed would probably be to use zero-width spaces as separators.
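
To illustrate the difference, here is a minimal sketch in Python (not the wiki's actual search code). The names tokenize, THAI and CJK and the chosen ranges are assumptions made only for this example. Characters inside the "Asian" set become one token each; everything else stays together until whitespace or a zero-width space.

```python
import re

# Hypothetical ranges for illustration only; the real "Asian" set may differ.
THAI = "\u0E00-\u0E7F"
CJK = "\u4E00-\u9FFF"


def tokenize(text, asian_ranges):
    """Split text into tokens; characters in asian_ranges become one token each."""
    # Either a single "Asian" character, or a run of any other characters.
    pattern = re.compile(f"[{asian_ranges}]|[^{asian_ranges}]+")
    tokens = []
    # Ordinary whitespace and the zero-width space (U+200B) both end a word.
    for chunk in re.split(r"[\s\u200B]+", text):
        if chunk:
            tokens.extend(pattern.findall(chunk))
    return tokens


# Two Thai words separated by a zero-width space.
sample = "\u0e1e\u0e38\u0e17\u0e18\u200b\u0e18\u0e23\u0e23\u0e21"

# Thai inside the "Asian" set: every code point becomes its own token.
per_character = tokenize(sample, THAI + CJK)

# Thai removed from the set: the two words stay whole.
per_word = tokenize(sample, CJK)
```

With Thai in the set, a search can only match single characters, which would explain the poor results; with Thai removed, whole words are indexed.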

Quote from: Johann
not sure now how the indexer or better the search engine works with compunctions and so on generally (simply cut them all away?)
/me ("punctuation", not "compunction" = "Gewissenhaftigkeit", as Bhante Thanissaro translates "ottappa")
Punctuation marks are removed during indexing, and the resulting pieces are stored as "words" in the index, along with some reference tables to store which page has how many occurrences of each word.
When searching, first the tables are searched to find all pages which contain all the single words in the search phrase. And then, if the search phrase (or parts of the search phrase) has been put into quotes "", also the whole search phrase (or the quoted parts) is matched to find the exact occurrence in the text, including punctuation marks and so on.
For example, searching for "ist den Drei Juwelen, dem Buddha, dem Dhamma, der Sangha, gewidmet" will find one result on the page http://www.accesstoinsight.eu/km/index now.
/me (Strange: It should also find the same on http://www.accesstoinsight.eu/de/index but it does not. Seems like the index is somehow incomplete again.)
If one comma or one word is left out, like "ist den Drei Juwelen Buddha, dem Dhamma, der Sangha, gewidmet", still put in quotes, no result will be found, because there is no exact match of the quotation.
However, when searching for the same without quotation marks around it, results are found again, because the search then just looks for every word, not for the whole phrase, and ignores all punctuation.
/me (Strange: Searching in this way also gives http://www.accesstoinsight.eu/de/index as a result, as it should be. But it did not find the match when searching for the whole quoted phrase.)

So it should still be possible then to find exact text passages including punctuation marks, if quoted. Although, as just seen, sometimes the search engine might not work as it should. ^-^
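
The two-step lookup described above could be sketched roughly like this in Python. This is only an illustration of the idea, not the wiki's real code: the function names (words, build_index, search), the lower-casing and the page ids are assumptions, and the simple \w+ word pattern is only adequate for the Latin-script example used here.

```python
import re
from collections import Counter, defaultdict


def words(text):
    # \w+ keeps letters and digits and drops punctuation; lower-cased for matching.
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map each word to {page_id: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(words(text)).items():
            index[word][page_id] = count
    return index


def search(query, pages, index):
    """Return ids of pages containing every query word; quoted parts must match verbatim."""
    phrases = re.findall(r'"([^"]+)"', query)
    terms = words(query)
    if not terms:
        return []
    # Step 1: pages that contain all the single words (punctuation ignored).
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))
    # Step 2: every quoted part must occur exactly in the raw page text.
    return sorted(
        p for p in hits
        if all(ph.lower() in pages[p].lower() for ph in phrases)
    )


# Hypothetical page ids and text, for illustration only.
pages = {
    "de:index": "Diese Seite ist den Drei Juwelen, dem Buddha, dem Dhamma, der Sangha, gewidmet.",
    "other": "Buddha, Dhamma, Sangha",
}
index = build_index(pages)

exact = search('"ist den Drei Juwelen, dem Buddha, dem Dhamma, der Sangha, gewidmet"', pages, index)
broken = search('"ist den Drei Juwelen Buddha, dem Dhamma, der Sangha, gewidmet"', pages, index)
unquoted = search("ist den Drei Juwelen Buddha, dem Dhamma, der Sangha, gewidmet", pages, index)
```

In this sketch the quoted phrase with the missing "dem" passes step 1 (all words are on the page) but fails step 2, while the unquoted search stops after step 1 and still finds the page.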

I think I know now how to make the necessary changes to include the Khmer and Thai punctuation marks and zero-width spaces as separators. After adding these separators, the pages will have to be re-indexed.
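
As a rough idea of what such a separator set could look like, here is a small Python sketch. The code points are taken from the list posted earlier in this thread; whether each of them (for example the repetition signs) should really act as a separator is an assumption, and SEPARATORS and split_words are names made up for this example.

```python
import re

# Assumed separator set, built from the code points listed earlier in the thread.
SEPARATORS = (
    "\u200B"                          # zero width space
    "\u17D4-\u17DA"                   # Khmer signs khan .. koomuut
    "\u0E2F\u0E46\u0E4F\u0E5A\u0E5B"  # Thai paiyannoi, maiyamok, fongman, angkhankhu, khomut
)

# Whitespace or any of the separators above ends a word.
SPLIT_RE = re.compile(f"[\\s{SEPARATORS}]+")


def split_words(text):
    """Split text into words at whitespace, zero-width spaces and the listed signs."""
    return [w for w in SPLIT_RE.split(text) if w]


# Khmer: two words joined by a zero-width space, a khan sign, a space, a third word.
khmer = "\u1796\u17bb\u1791\u17d2\u1792\u200b\u1792\u1798\u17cc\u17d4 \u179f\u1784\u17d2\u1783"
khmer_words = split_words(khmer)

# Thai: two words joined by a zero-width space, followed by paiyannoi.
thai = "\u0e1e\u0e38\u0e17\u0e18\u200b\u0e18\u0e23\u0e23\u0e21\u0e2f"
thai_words = split_words(thai)
```

The Khmer digits and the riel sign from the list are deliberately left out here, since they may well be wanted as searchable characters.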

I will try to do it later this week. Not today anymore.

_/\_
Title: Re: [ATI.eu] Indexing and search engine issues
Post by: Johann on April 01, 2019, 10:11:35 PM
Sadhu