Sadhu, I was just reading the searching code, trying to understand a bit how it works.
The idea of breaking text into single characters probably came from Chinese or Japanese, where, I think, a character usually represents a complete word. Maybe Thai was included by mistake, on the assumption that all Asian scripts use such logographic "full word" characters.
I think it is best to just remove the Thai block from the "Asian" set and treat it "normally", like Roman script etc. The Khmer block (U+1780–U+17FF) is not included there either, and search is working well. The most important change needed would probably be to use zero-width spaces as separators.
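To illustrate what being in the "Asian" set means for Thai, here is a minimal Python sketch. It is not the actual search code: the function names and the exact ranges are my own assumptions, only the Thai block (U+0E00–U+0E7F) and the per-character splitting idea are taken from the discussion above.

```python
# Hypothetical code-point ranges; the real "Asian" set in the search code may differ.
CJK = (0x4E00, 0x9FFF)   # CJK Unified Ideographs
KANA = (0x3040, 0x30FF)  # Hiragana and Katakana
THAI = (0x0E00, 0x0E7F)  # Thai block


def in_ranges(ch, ranges):
    """True if the character's code point lies in one of the (low, high) ranges."""
    return any(low <= ord(ch) <= high for low, high in ranges)


def tokenize(text, asian_ranges):
    """Split on whitespace; characters in asian_ranges become one token each."""
    tokens = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if in_ranges(ch, asian_ranges):
                if word:              # flush the ordinary word collected so far
                    tokens.append(word)
                    word = ""
                tokens.append(ch)     # single-character token
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens


thai = "พุทธ ธรรม"
with_thai = tokenize(thai, [CJK, KANA, THAI])   # every Thai character is its own token
without_thai = tokenize(thai, [CJK, KANA])      # two ordinary words
```

With Thai in the set, each consonant and each vowel sign ends up as a separate "word"; without it, the text is split only at the spaces, like Roman script.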
I am not sure now how the indexer, or rather the search engine, generally handles compunctions and so on (are they simply all cut away?).
* Moritz ("punctuation" - not "compunction" = "Gewissenhaftigkeit", as Bhante Thanissaro translates "otappa")
Punctuation marks are removed during indexing, and the resulting pieces are stored as "words" in the index, along with some reference tables to store which page has how many occurrences of each word.
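Roughly, I picture the indexing step like the following Python sketch. This is only my reading of it, not the real indexer: the names `words_of` and `build_index`, the lowercasing, and the sample page texts are all made up for illustration.

```python
import re
from collections import Counter, defaultdict


def words_of(text):
    # Assumption: a "word" is any run of \w characters, so punctuation is dropped.
    return [w.lower() for w in re.findall(r"\w+", text)]


def build_index(pages):
    """Map each word to {page_id: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(words_of(text)).items():
            index[word][page_id] = count
    return index


# Made-up sample pages, not real site content.
pages = {
    "page_a": "Dem Buddha, dem Dhamma, der Sangha gewidmet.",
    "page_b": "Buddha, Dhamma, Sangha.",
}
index = build_index(pages)
# index["dem"] is {"page_a": 2}; the commas and full stops are not stored at all.
```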
When searching, the tables are searched first to find all pages that contain every single word of the search phrase. Then, if the search phrase (or parts of it) has been put into quotes "", the whole phrase (or each quoted part) is also matched against the text to find the exact occurrence, including punctuation marks and so on.
For example, searching for
"ist den Drei Juwelen, dem Buddha, dem Dhamma, der Sangha, gewidmet" will find one result on the page
http://www.accesstoinsight.eu/km/index now.
If one comma or one word is left out, as in
"ist den Drei Juwelen Buddha, dem Dhamma, der Sangha, gewidmet", still put in quotes, no result would be found, because there is no exact match of the quotation.
However, if the same is searched for without the surrounding quotation marks, results are found again, because then only the individual words are looked up, not the whole phrase, and all punctuation is ignored.
* Moritz (Strange: Searching in this way also gives
http://www.accesstoinsight.eu/de/index as a result, as it should be. But it did not find the match when searching for the whole quoted phrase.)
So it should still be possible then to find exact text passages including punctuation marks, if quoted. Although, as just seen, sometimes the search engine might not work as it should.
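The two-stage search described above can be sketched in Python as follows. Again this is only a sketch under my own assumptions: the word sets are recomputed per page here instead of being read from the stored tables, the phrase match is a plain case-sensitive substring test, and the sample page is made up.

```python
import re


def words_of(text):
    # Same assumption as for indexing: words are runs of \w, punctuation is dropped.
    return [w.lower() for w in re.findall(r"\w+", text)]


def search(query, pages):
    """Return ids of pages that contain all query words and all quoted phrases."""
    phrases = re.findall(r'"([^"]+)"', query)   # the parts put into quotes
    words = set(words_of(query))                # every single word, quoted or not
    hits = []
    for page_id, text in pages.items():
        # Stage 1: the page must contain every single word.
        if not words <= set(words_of(text)):
            continue
        # Stage 2: every quoted phrase must occur exactly, punctuation included.
        if all(phrase in text for phrase in phrases):
            hits.append(page_id)
    return hits


pages = {"page_a": "Gewidmet dem Buddha, dem Dhamma, der Sangha."}

exact = search('"dem Buddha, dem Dhamma"', pages)       # quoted, comma in place
no_comma = search('"dem Buddha dem Dhamma"', pages)     # quoted, comma left out
unquoted = search('dem Buddha dem Dhamma', pages)       # words only
```

Here `exact` and `unquoted` both find `page_a`, while `no_comma` finds nothing, which is the behaviour I would expect from the real search as well.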
I think I know now how to make the necessary changes to include the Khmer and Thai punctuation marks and zero-width spaces as separators. After adding these separators, the pages will have to be re-indexed.
I will try to do it later this week. Not today anymore.
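Just to show the idea of the separator change, here is a small Python sketch. It is not the actual patch: the selection of punctuation marks is illustrative and certainly not exhaustive, and `split_words` is a made-up name.

```python
import re

ZWSP = "\u200b"                    # ZERO WIDTH SPACE
# Illustrative selection only, not a complete list of punctuation marks.
KHMER_PUNCT = "\u17d4\u17d5"       # KHMER SIGN KHAN, KHMER SIGN BARIYOOSAN
THAI_PUNCT = "\u0e4f\u0e5a\u0e5b"  # THAI FONGMAN, ANGKHANKHU, KHOMUT

# \s does not match U+200B in Python, so it has to be listed explicitly.
SEPARATORS = re.compile("[\\s" + re.escape(ZWSP + KHMER_PUNCT + THAI_PUNCT) + "]+")


def split_words(text):
    """Split text at whitespace, zero-width spaces and the listed punctuation."""
    return [w for w in SEPARATORS.split(text) if w]


khmer = "ពុទ្ធ" + ZWSP + "ធម៌" + "\u17d4"
thai = "พุทธ" + ZWSP + "ธรรม" + "\u0e5a"
khmer_words = split_words(khmer)   # two words, the final KHAN is dropped
thai_words = split_words(thai)     # two words, the final ANGKHANKHU is dropped
```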