Topic Summary

Posted by: Dhammañāṇa
« on: April 01, 2019, 10:11:35 PM »

Sadhu
Posted by: Moritz
« on: April 01, 2019, 09:58:43 PM »

Sadhu, I was just reading the searching code, trying to understand a bit how it works.

Probably the breaking into single characters idea came from Chinese or Japanese, where I think characters usually represent a complete word. Maybe Thai was included by mistake, thinking that all Asian scripts use such logographic "full word" characters.

I think it is best to just remove the Thai block from the "Asian" set and treat it "normally" like Roman etc. The Khmer block (1780–17FF) is not included there either, and search is working well for it. The most important change needed would probably be to use zero-width spaces as separators.

Quote
not sure now how the indexer or better the search engine works with compunctions and so on generally (simply cut them all away?)
* Moritz ("punctuation" - not "compunction" = "Gewissenhaftigkeit", as Bhante Thanissaro translates "otappa")
Punctuation marks are removed during indexing, and the resulting pieces are stored as "words" in the index, along with some reference tables to store which page has how many occurrences of each word.
When searching, first the tables are searched to find all pages which contain all the single words in the search phrase. And then, if the search phrase (or parts of the search phrase) has been put into quotes "", also the whole search phrase (or the quoted parts) is matched to find the exact occurrence in the text, including punctuation marks and so on.
For example, searching for "ist den Drei Juwelen, dem Buddha, dem Dhamma, der Sangha, gewidmet" will find one result on the page http://www.accesstoinsight.eu/km/index now.
* Moritz (Strange: it should also find the same on http://www.accesstoinsight.eu/de/index, but it does not. Seems like the index is somehow incomplete again.)
If leaving out one comma or one word, like "ist den Drei Juwelen Buddha, dem Dhamma, der Sangha, gewidmet", still put in quotes, no result would be found, because there is no exact match of the quotation.
However, if searching for the same without quotation marks around, results would be found again, just looking for every word, not for the whole phrase, and ignoring all punctuation.
* Moritz (Strange: Searching in this way also gives http://www.accesstoinsight.eu/de/index as a result, as it should be. But it did not find the match when searching for the whole quoted phrase.)

So it should still be possible then to find exact text passages including punctuation marks, if quoted. Although, as just seen, sometimes the search engine might not work as it should. ^-^
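
The two-stage lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the DokuWiki code (which is PHP); the page names, the sample texts and the function names `tokenize`, `build_index` and `search` are all made up for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical page store; names and texts are invented for illustration.
PAGES = {
    "de:index": "Gewidmet dem Buddha, dem Dhamma, der Sangha.",
    "en:index": "Dedicated to the Buddha, the Dhamma, the Sangha.",
}

def tokenize(text):
    """Lower-case the text and keep only runs of word characters (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def build_index(pages):
    """Map each word to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][page] = count
    return index

def search(query, pages, index):
    """Stage 1: pages containing every word. Stage 2: quoted parts must match literally."""
    phrases = re.findall(r'"([^"]+)"', query)
    words = tokenize(query)
    if not words:
        return []
    candidates = set(pages)
    for word in words:
        candidates &= set(index.get(word, {}))
    return sorted(
        page for page in candidates
        if all(phrase.lower() in pages[page].lower() for phrase in phrases)
    )

INDEX = build_index(PAGES)
```

With this sketch, `search('"dem Buddha dem Dhamma"', PAGES, INDEX)` (comma left out, still quoted) finds nothing, while the same words without quotes find the German page again, which mirrors the behaviour described above.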

I think I know now how to make the necessary changes to include the Khmer and Thai punctuation marks and zero-width spaces as separators. After adding these separators, the pages would have to be re-indexed.

I will try to do it later this week. Not today anymore.

_/\_
Posted by: Dhammañāṇa
« on: April 01, 2019, 07:19:51 PM »

17D4 KHMER SIGN KHAN • functions as a full stop, period
(→ 0E2F thai character paiyannoi
→ 104A myanmar sign little section)

17D5 KHMER SIGN BARIYOOSAN • indicates the end of a section or a text
(→ 0E5A thai character angkhankhu
→ 104B myanmar sign section)

17D6 KHMER SIGN CAMNUC PII KUUH • functions as colon • the preferred transliteration is camnoc pii kuuh
(→ 00F7 ÷ division sign
→ 0F14 tibetan mark gter tsheg)

17D7 KHMER SIGN LEK TOO • repetition sign
(→ 0E46 thai character maiyamok)

17D8 KHMER SIGN BEYYAL • et cetera • use of this character is discouraged; other abbreviations for et cetera also exist • preferred spelling: ។ល។

17D9 KHMER SIGN PHNAEK MUAN • indicates the beginning of a book or a treatise • the preferred transliteration is phnek moan
(→ 0E4F thai character fongman)

17DA KHMER SIGN KOOMUUT • indicates the end of a book or treatise • this forms a pair with 17D9 • the preferred transliteration is koomoot
(→ 0E5B thai character khomut)

17DB KHMER CURRENCY SYMBOL RIEL

17E0 KHMER DIGIT ZERO
17E1 KHMER DIGIT ONE
17E2 KHMER DIGIT TWO
17E3 KHMER DIGIT THREE
17E4 KHMER DIGIT FOUR
17E5 KHMER DIGIT FIVE
17E6 KHMER DIGIT SIX
17E7 KHMER DIGIT SEVEN
17E8 KHMER DIGIT EIGHT
17E9 KHMER DIGIT NINE

0E50 THAI DIGIT ZERO
0E51 THAI DIGIT ONE
0E52 THAI DIGIT TWO
0E53 THAI DIGIT THREE
0E54 THAI DIGIT FOUR
0E55 THAI DIGIT FIVE
0E56 THAI DIGIT SIX
0E57 THAI DIGIT SEVEN
0E58 THAI DIGIT EIGHT
0E59 THAI DIGIT NINE

In both scripts, Khmer and Thai, word breaks are either white spaces or zero-width spaces.
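
As a rough illustration of such a word splitter (in Python, not the PHP of the DokuWiki indexer), the sketch below splits on whitespace, the zero-width space U+200B and the punctuation marks from the list above. Treating the repetition signs (17D7, 0E46) and "et cetera" (17D8) as separators is an assumption made for the example; the digits are left alone. The name `split_words` is made up.

```python
import re

# Separators: whitespace, zero-width space (U+200B), and the Khmer/Thai
# punctuation marks listed above. Which marks belong here is an assumption.
SEPARATORS = re.compile(
    "[\\s\u200b"
    "\u17d4-\u17da"                    # Khmer signs KHAN .. KOOMUUT
    "\u0e2f\u0e46\u0e4f\u0e5a\u0e5b"   # Thai paiyannoi, maiyamok, fongman, angkhankhu, khomut
    "]+"
)

def split_words(text):
    """Split text at separators and drop the empty pieces."""
    return [word for word in SEPARATORS.split(text) if word]
```

The zero-width space has to be named explicitly, because Python's `\s` does not match U+200B.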
Posted by: Dhammañāṇa
« on: April 01, 2019, 06:34:00 PM »

Atma has attached the Khmer and Thai Unicode tables, to possibly exclude special characters like full stops and other punctuation marks ... breaking the text into single characters seems to be meaningless.

Not sure if Upasaka Vorapol may like to assist here for Thai. Atma will try to list special characters in Khmer but not sure now how the indexer or better the search engine works with compunctions and so on generally (simply cut them all away?)
Posted by: Dhammañāṇa
« on: April 01, 2019, 06:17:50 AM »

Sadhu

Atma noticed that the search works less well for scripts other than Latin, but does not yet understand why certain ranges are separated character by character.
Posted by: Moritz
« on: April 01, 2019, 12:23:59 AM »

Quote
May Nyom always give/take himself his time.
_/\_

Indexing finished some time this morning.

Quote
On the other side, on ZzE once and also now on ati.eu, there have been times where the index was obviously complete.)
Obviously (offensichtlich)? Or apparently (offenbar, scheinbar, anscheinend)?

I think maybe the latter, because these errors would never appear in the Searchindex Manager plugin. It would just say "page already up to date" or something, when a page could not be indexed.

After retrying several times to index the files which failed, the only files which still could not be indexed are 474 pages in Thai script (listed below). I think the reason is the way DokuWiki handles some Asian scripts, including Thai: it treats every character as a single word, which would take a lot of memory for the indexer. Quote from the inc/indexer.php file, line 18 and following:
Code:
// Asian characters are handled as words. The following regexp defines the
// Unicode-Ranges for Asian characters
// Ranges taken from http://en.wikipedia.org/wiki/Unicode_block
// I'm no language expert. If you think some ranges are wrongly chosen or
// a range is missing, please contact me
define('IDX_ASIAN1','[\x{0E00}-\x{0E7F}]'); // Thai
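
To see why this matters for memory, here is a small Python sketch (not the actual PHP code) comparing the two ways of cutting up Thai text. It assumes that "handled as words" means each character in the range is padded with spaces before the text is split; the function names are made up.

```python
import re

THAI = re.compile("([\u0e00-\u0e7f])")  # same range as IDX_ASIAN1 in the quote above

def words_per_character(text):
    """Pad every Thai character with spaces, then split at spaces."""
    padded = THAI.sub(r" \1 ", text)
    return [word for word in padded.split(" ") if word]

def words_by_separator(text):
    """Split only at whitespace and zero-width spaces (U+200B)."""
    return [word for word in re.split("[\\s\u200b]+", text) if word]
```

With per-character handling the number of "words" equals the number of Thai characters on the page, so a long commentary page yields a word list many times longer than one split at real word breaks.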

I have deleted all files in en:s and de:s which were just examples on how to integrate Google Site Search and comments about other search engines tested by Mr. Bullitt for accesstoinsight.org in the past.

List of unindexed Thai script files:
Code:
cs-th:atthakatha:sut.kn.jat.v01_att
cs-th:atthakatha:sut.kn.jat.v02_att
cs-th:atthakatha:sut.kn.jat.v03_att
cs-th:atthakatha:sut.kn.jat.v04_att
cs-th:atthakatha:sut.kn.jat.v05_att
cs-th:atthakatha:sut.kn.jat.v06_att
cs-th:atthakatha:sut.kn.jat.v07_att
cs-th:atthakatha:sut.kn.jat.v08_att
cs-th:atthakatha:sut.kn.jat.v09_att
cs-th:atthakatha:sut.kn.jat.v10_att
cs-th:atthakatha:sut.kn.jat.v11_att
cs-th:atthakatha:sut.kn.jat.v12_att
cs-th:atthakatha:sut.kn.jat.v13_att
cs-th:atthakatha:sut.kn.jat.v14_att
cs-th:atthakatha:sut.kn.jat.v15_att
cs-th:atthakatha:sut.kn.jat.v16_att
cs-th:atthakatha:sut.kn.jat.v17_att
cs-th:atthakatha:sut.kn.jat.v18_att
cs-th:atthakatha:sut.kn.jat.v19_att
cs-th:atthakatha:sut.kn.jat.v20_att
cs-th:atthakatha:sut.kn.jat.v21_att
cs-th:atthakatha:sut.kn.jat.v22_att
cs-th:atthakatha:sut.kn.jat.v23_att
cs-th:atthakatha:sut.kn.khp.0_att
cs-th:atthakatha:sut.kn.khp.1_att
cs-th:atthakatha:sut.kn.khp.2_att
cs-th:atthakatha:sut.kn.khp.3_att
cs-th:atthakatha:sut.kn.khp.4_att
cs-th:atthakatha:sut.kn.khp.5_att
cs-th:atthakatha:sut.kn.khp.6_att
cs-th:atthakatha:sut.kn.khp.7_att
cs-th:atthakatha:sut.kn.khp.8_att
cs-th:atthakatha:sut.kn.khp.9_att
cs-th:atthakatha:sut.kn.man.00_att
cs-th:atthakatha:sut.kn.man.01_att
cs-th:atthakatha:sut.kn.man.02_att
cs-th:atthakatha:sut.kn.man.03_att
cs-th:atthakatha:sut.kn.man.04_att
cs-th:atthakatha:sut.kn.man.05_att
cs-th:atthakatha:sut.kn.man.06_att
cs-th:atthakatha:sut.kn.man.07_att
cs-th:atthakatha:sut.kn.man.08_att
cs-th:atthakatha:sut.kn.man.09_att
cs-th:atthakatha:sut.kn.man.10_att
cs-th:atthakatha:sut.kn.man.11_att
cs-th:atthakatha:sut.kn.man.12_att
cs-th:atthakatha:sut.kn.man.13_att
cs-th:atthakatha:sut.kn.man.14_att
cs-th:atthakatha:sut.kn.man.15_att
cs-th:atthakatha:sut.kn.man.16_att
cs-th:atthakatha:sut.kn.net.0_att
cs-th:atthakatha:sut.kn.net.1_att
cs-th:atthakatha:sut.kn.net.2_att
cs-th:atthakatha:sut.kn.net.3_att
cs-th:atthakatha:sut.kn.net.4_att
cs-th:atthakatha:sut.kn.net.5_att
cs-th:atthakatha:sut.kn.net.6_att
cs-th:atthakatha:sut.kn.pat.v0_att
cs-th:atthakatha:sut.kn.pat.v1.01_att
cs-th:atthakatha:sut.kn.pat.v1.02_att
cs-th:atthakatha:sut.kn.pat.v1.03_att
cs-th:atthakatha:sut.kn.pat.v1.04_att
cs-th:atthakatha:sut.kn.pat.v1.05_att
cs-th:atthakatha:sut.kn.pat.v1.06_att
cs-th:atthakatha:sut.kn.pat.v1.07_att
cs-th:atthakatha:sut.kn.pat.v1.08_att
cs-th:atthakatha:sut.kn.pat.v1.09_att
cs-th:atthakatha:sut.kn.pat.v1.10_att
cs-th:atthakatha:sut.kn.pat.v1_att
cs-th:atthakatha:sut.kn.pat.v2_att
cs-th:atthakatha:sut.kn.pat.v3.01_att
cs-th:atthakatha:sut.kn.pat.v3.02_att
cs-th:atthakatha:sut.kn.pat.v3.03_att
cs-th:atthakatha:sut.kn.pat.v3.04_att
cs-th:atthakatha:sut.kn.pat.v3.05_att
cs-th:atthakatha:sut.kn.pat.v3.06_att
cs-th:atthakatha:sut.kn.pat.v3.07_att
cs-th:atthakatha:sut.kn.pat.v3.08_att
cs-th:atthakatha:sut.kn.pat.v3.09_att
cs-th:atthakatha:sut.kn.pat.v3.10_att
cs-th:atthakatha:sut.kn.pat.v3_att
cs-th:atthakatha:sut.kn.pev.0_att
cs-th:atthakatha:sut.kn.pev.1_att
cs-th:atthakatha:sut.kn.pev.2_att
cs-th:atthakatha:sut.kn.pev.3_att
cs-th:atthakatha:sut.kn.pev.4_att
cs-th:atthakatha:sut.kn.snp.1_att
cs-th:atthakatha:sut.kn.snp.2_att
cs-th:atthakatha:sut.kn.snp.3_att
cs-th:atthakatha:sut.kn.snp.4_att
cs-th:atthakatha:sut.kn.snp.5_att
cs-th:atthakatha:sut.kn.tha.00_att
cs-th:atthakatha:sut.kn.tha.01_att
cs-th:atthakatha:sut.kn.tha.02_att
cs-th:atthakatha:sut.kn.tha.03_att
cs-th:atthakatha:sut.kn.tha.04_att
cs-th:atthakatha:sut.kn.tha.05_att
cs-th:atthakatha:sut.kn.tha.06_att
cs-th:atthakatha:sut.kn.tha.07_att
cs-th:atthakatha:sut.kn.tha.08_att
cs-th:atthakatha:sut.kn.tha.09_att
cs-th:atthakatha:sut.kn.tha.10_att
cs-th:atthakatha:sut.kn.tha.11_att
cs-th:atthakatha:sut.kn.tha.12_att
cs-th:atthakatha:sut.kn.tha.13_att
cs-th:atthakatha:sut.kn.tha.14_att
cs-th:atthakatha:sut.kn.tha.15_att
cs-th:atthakatha:sut.kn.tha.16_att
cs-th:atthakatha:sut.kn.tha.17_att
cs-th:atthakatha:sut.kn.tha.18_att
cs-th:atthakatha:sut.kn.tha.19_att
cs-th:atthakatha:sut.kn.tha.20_att
cs-th:atthakatha:sut.kn.tha.21_att
cs-th:atthakatha:sut.kn.thi.01_att
cs-th:atthakatha:sut.kn.thi.02_att
cs-th:atthakatha:sut.kn.thi.03_att
cs-th:atthakatha:sut.kn.thi.04_att
cs-th:atthakatha:sut.kn.thi.05_att
cs-th:atthakatha:sut.kn.thi.06_att
cs-th:atthakatha:sut.kn.thi.07_att
cs-th:atthakatha:sut.kn.thi.08_att
cs-th:atthakatha:sut.kn.thi.09_att
cs-th:atthakatha:sut.kn.thi.10_att
cs-th:atthakatha:sut.kn.thi.11_att
cs-th:atthakatha:sut.kn.thi.12_att
cs-th:atthakatha:sut.kn.thi.13_att
cs-th:atthakatha:sut.kn.thi.14_att
cs-th:atthakatha:sut.kn.thi.15_att
cs-th:atthakatha:sut.kn.thi.16_att
cs-th:atthakatha:sut.kn.uda.0_att
cs-th:atthakatha:sut.kn.uda.1_att
cs-th:atthakatha:sut.kn.uda.2_att
cs-th:atthakatha:sut.kn.uda.3_att
cs-th:atthakatha:sut.kn.uda.4_att
cs-th:atthakatha:sut.kn.uda.5_att
cs-th:atthakatha:sut.kn.uda.6_att
cs-th:atthakatha:sut.kn.uda.7_att
cs-th:atthakatha:sut.kn.uda.8_att
cs-th:atthakatha:sut.kn.viv.v0_att
cs-th:atthakatha:sut.kn.viv.v1_att
cs-th:atthakatha:sut.kn.viv.v2_att
cs-th:atthakatha:sut.mn.v00_att
cs-th:atthakatha:sut.mn.v01_att
cs-th:atthakatha:sut.mn.v02_att
cs-th:atthakatha:sut.mn.v03_att
cs-th:atthakatha:sut.mn.v04_att
cs-th:atthakatha:sut.mn.v05_att
cs-th:atthakatha:sut.mn.v06_att
cs-th:atthakatha:sut.mn.v07_att
cs-th:atthakatha:sut.mn.v08_att
cs-th:atthakatha:sut.mn.v09_att
cs-th:atthakatha:sut.mn.v10_att
cs-th:atthakatha:sut.mn.v11_att
cs-th:atthakatha:sut.mn.v12_att
cs-th:atthakatha:sut.mn.v13_att
cs-th:atthakatha:sut.mn.v14_att
cs-th:atthakatha:sut.mn.v15_att
cs-th:atthakatha:sut.sn.00_att
cs-th:atthakatha:sut.sn.01_att
cs-th:atthakatha:sut.sn.02_att
cs-th:atthakatha:sut.sn.03_att
cs-th:atthakatha:sut.sn.04_att
cs-th:atthakatha:sut.sn.05_att
cs-th:atthakatha:sut.sn.06_att
cs-th:atthakatha:sut.sn.07_att
cs-th:atthakatha:sut.sn.08_att
cs-th:atthakatha:sut.sn.09_att
cs-th:atthakatha:sut.sn.10_att
cs-th:atthakatha:sut.sn.11_att
cs-th:atthakatha:sut.sn.12_att
cs-th:atthakatha:sut.sn.13_att
cs-th:atthakatha:sut.sn.14_att
cs-th:atthakatha:sut.sn.15_att
cs-th:atthakatha:sut.sn.16_att
cs-th:atthakatha:sut.sn.17_att
cs-th:atthakatha:sut.sn.18_att
cs-th:atthakatha:sut.sn.19_att
cs-th:atthakatha:sut.sn.20_att
cs-th:atthakatha:sut.sn.21_att
cs-th:atthakatha:sut.sn.22_att
cs-th:atthakatha:sut.sn.23_att
cs-th:atthakatha:sut.sn.24_att
cs-th:atthakatha:sut.sn.25_att
cs-th:atthakatha:sut.sn.26_att
cs-th:atthakatha:sut.sn.27_att
cs-th:atthakatha:sut.sn.28_att
cs-th:atthakatha:sut.sn.29_att
cs-th:atthakatha:sut.sn.30_att
cs-th:atthakatha:sut.sn.31_att
cs-th:atthakatha:sut.sn.32_att
cs-th:atthakatha:sut.sn.33_att
cs-th:atthakatha:sut.sn.34_att
cs-th:atthakatha:sut.sn.35_att
cs-th:atthakatha:sut.sn.36_att
cs-th:atthakatha:sut.sn.37_att
cs-th:atthakatha:sut.sn.38_att
cs-th:atthakatha:sut.sn.39_att
cs-th:atthakatha:sut.sn.40_att
cs-th:atthakatha:sut.sn.41_att
cs-th:atthakatha:sut.sn.42_att
cs-th:atthakatha:sut.sn.43_att
cs-th:atthakatha:sut.sn.44_att
cs-th:atthakatha:sut.sn.45_att
cs-th:atthakatha:sut.sn.46_att
cs-th:atthakatha:sut.sn.47_att
cs-th:atthakatha:sut.sn.48_att
cs-th:atthakatha:sut.sn.49_att
cs-th:atthakatha:sut.sn.50_att
cs-th:atthakatha:sut.sn.51_att
cs-th:atthakatha:sut.sn.52_att
cs-th:atthakatha:sut.sn.53_att
cs-th:atthakatha:sut.sn.54_att
cs-th:atthakatha:sut.sn.55_att
cs-th:atthakatha:sut.sn.56_att
cs-th:atthakatha:vin.cv.01_att
cs-th:atthakatha:vin.cv.02_att
cs-th:atthakatha:vin.cv.03_att
cs-th:atthakatha:vin.cv.04_att
cs-th:atthakatha:vin.cv.05_att
cs-th:atthakatha:vin.cv.06_att
cs-th:atthakatha:vin.cv.07_att
cs-th:atthakatha:vin.cv.08_att
cs-th:atthakatha:vin.cv.09_att
cs-th:atthakatha:vin.cv.10_att
cs-th:atthakatha:vin.cv.11_att
cs-th:atthakatha:vin.cv.12_att
cs-th:atthakatha:vin.mv.01_att
cs-th:atthakatha:vin.mv.02_att
cs-th:atthakatha:vin.mv.03_att
cs-th:atthakatha:vin.mv.04_att
cs-th:atthakatha:vin.mv.05_att
cs-th:atthakatha:vin.mv.06_att
cs-th:atthakatha:vin.mv.07_att
cs-th:atthakatha:vin.mv.08_att
cs-th:atthakatha:vin.mv.09_att
cs-th:atthakatha:vin.mv.10_att
cs-th:atthakatha:vin.pac.ak_att
cs-th:atthakatha:vin.pac.nii_att
cs-th:atthakatha:vin.pac.pc_att
cs-th:atthakatha:vin.pac.pci_att
cs-th:atthakatha:vin.pac.pd_att
cs-th:atthakatha:vin.pac.pdi_att
cs-th:atthakatha:vin.pac.pri_att
cs-th:atthakatha:vin.pac.sgi_att
cs-th:atthakatha:vin.pac.sk_att
cs-th:atthakatha:vin.par.ay_att
cs-th:atthakatha:vin.par.ga_att
cs-th:atthakatha:vin.par.ni_att
cs-th:atthakatha:vin.par.pr_att
cs-th:atthakatha:vin.par.sg_att
cs-th:atthakatha:vin.par.ve_att
cs-th:atthakatha:vin.pv.01_att
cs-th:atthakatha:vin.pv.02_att
cs-th:atthakatha:vin.pv.03_att
cs-th:atthakatha:vin.pv.04_att
cs-th:atthakatha:vin.pv.05_att
cs-th:atthakatha:vin.pv.06_att
cs-th:atthakatha:vin.pv.07_att
cs-th:atthakatha:vin.pv.08_att
cs-th:atthakatha:vin.pv.09_att
cs-th:atthakatha:vin.pv.10_att
cs-th:atthakatha:vin.pv.11_att
cs-th:atthakatha:vin.pv.12_att
cs-th:atthakatha:vin.pv.13_att
cs-th:atthakatha:vin.pv.14_att
cs-th:atthakatha:vin.pv.15_att
cs-th:atthakatha:vin.pv.16_att
cs-th:atthakatha:vin.pv.17_att
cs-th:atthakatha:vin.pv.18_att
cs-th:tika:abh.ava-pura.01_tik
cs-th:tika:abh.ava-pura.02_tik
cs-th:tika:abh.ava-pura.03_tik
cs-th:tika:abh.ava-pura.04_tik
cs-th:tika:abh.ava-pura.05_tik
cs-th:tika:abh.ava-pura.06_tik
cs-th:tika:abh.ava-pura.07_tik
cs-th:tika:abh.ava-pura.08_tik
cs-th:tika:abh.ava-pura.09_tik
cs-th:tika:abh.ava-pura.10_tik
cs-th:tika:abh.ava-pura.11_tik
cs-th:tika:sut.dn.01_abh_tik
cs-th:tika:sut.dn.01_tik
cs-th:tika:sut.dn.02_abh_tik
cs-th:tika:sut.dn.02_tik
cs-th:tika:sut.dn.03_abh_tik
cs-th:tika:sut.dn.03_tik
cs-th:tika:sut.dn.04_abh_tik
cs-th:tika:sut.dn.04_tik
cs-th:tika:sut.dn.05_abh_tik
cs-th:tika:sut.dn.05_tik
cs-th:tika:sut.dn.06_abh_tik
cs-th:tika:sut.dn.06_tik
cs-th:tika:sut.dn.07_abh_tik
cs-th:tika:sut.dn.07_tik
cs-th:tika:sut.dn.08_abh_tik
cs-th:tika:sut.dn.08_tik
cs-th:tika:sut.dn.09_abh_tik
cs-th:tika:sut.dn.09_tik
cs-th:tika:sut.dn.0_tik
cs-th:tika:sut.dn.10_abh_tik
cs-th:tika:sut.dn.10_tik
cs-th:tika:sut.dn.11_abh_tik
cs-th:tika:sut.dn.11_tik
cs-th:tika:sut.dn.12_abh_tik
cs-th:tika:sut.dn.12_tik
cs-th:tika:sut.dn.13_abh_tik
cs-th:tika:sut.dn.13_tik
cs-th:tika:sut.dn.14_tik
cs-th:tika:sut.dn.15_tik
cs-th:tika:sut.dn.16_tik
cs-th:tika:sut.dn.17_tik
cs-th:tika:sut.dn.18_tik
cs-th:tika:sut.dn.19_tik
cs-th:tika:sut.dn.20_tik
cs-th:tika:sut.dn.21_tik
cs-th:tika:sut.dn.22_tik
cs-th:tika:sut.dn.23_tik
cs-th:tika:sut.dn.24_tik
cs-th:tika:sut.dn.25_tik
cs-th:tika:sut.dn.26_tik
cs-th:tika:sut.dn.27_tik
cs-th:tika:sut.dn.28_tik
cs-th:tika:sut.dn.29_tik
cs-th:tika:sut.dn.30_tik
cs-th:tika:sut.dn.31_tik
cs-th:tika:sut.dn.32_tik
cs-th:tika:sut.dn.33_tik
cs-th:tika:sut.dn.34_tik
cs-th:tika:sut.kn.paka00_tik
cs-th:tika:sut.kn.paka01_tik
cs-th:tika:sut.kn.paka02_tik
cs-th:tika:sut.kn.paka03_tik
cs-th:tika:sut.kn.paka04_tik
cs-th:tika:sut.kn.paka05_tik
cs-th:tika:sut.kn.paka06_tik
cs-th:tika:sut.kn.vibh01_tik
cs-th:tika:sut.kn.vibh02_tik
cs-th:tika:sut.kn.vibh03_tik
cs-th:tika:sut.kn.vibh04_tik
cs-th:tika:sut.kn.vibh05_tik
cs-th:tika:sut.kn.vibh06_tik
cs-th:tika:sut.mn.0_tik
cs-th:tika:sut.mn.v01_tik
cs-th:tika:sut.mn.v02_tik
cs-th:tika:sut.mn.v03_tik
cs-th:tika:sut.mn.v04_tik
cs-th:tika:sut.mn.v05_tik
cs-th:tika:sut.mn.v06_tik
cs-th:tika:sut.mn.v07_tik
cs-th:tika:sut.mn.v08_tik
cs-th:tika:sut.mn.v09_tik
cs-th:tika:sut.mn.v10_tik
cs-th:tika:sut.mn.v11_tik
cs-th:tika:sut.mn.v12_tik
cs-th:tika:sut.mn.v13_tik
cs-th:tika:sut.mn.v14_tik
cs-th:tika:sut.mn.v15_tik
cs-th:tika:sut.sn.01_tik
cs-th:tika:sut.sn.02_tik
cs-th:tika:sut.sn.03_tik
cs-th:tika:sut.sn.04_tik
cs-th:tika:sut.sn.05_tik
cs-th:tika:sut.sn.06_tik
cs-th:tika:sut.sn.07_tik
cs-th:tika:sut.sn.08_tik
cs-th:tika:sut.sn.09_tik
cs-th:tika:sut.sn.0_tik
cs-th:tika:sut.sn.10_tik
cs-th:tika:sut.sn.11_tik
cs-th:tika:sut.sn.12_tik
cs-th:tika:sut.sn.13_tik
cs-th:tika:sut.sn.14_tik
cs-th:tika:sut.sn.15_tik
cs-th:tika:sut.sn.16_tik
cs-th:tika:sut.sn.17_tik
cs-th:tika:sut.sn.18_tik
cs-th:tika:sut.sn.19_tik
cs-th:tika:sut.sn.20_tik
cs-th:tika:sut.sn.21_tik
cs-th:tika:sut.sn.22_tik
cs-th:tika:sut.sn.23_tik
cs-th:tika:sut.sn.24_tik
cs-th:tika:sut.sn.25_tik
cs-th:tika:sut.sn.26_tik
cs-th:tika:sut.sn.27_tik
cs-th:tika:sut.sn.28_tik
cs-th:tika:sut.sn.29_tik
cs-th:tika:sut.sn.30_tik
cs-th:tika:sut.sn.31_tik
cs-th:tika:sut.sn.32_tik
cs-th:tika:sut.sn.33_tik
cs-th:tika:sut.sn.34_tik
cs-th:tika:sut.sn.35_tik
cs-th:tika:sut.sn.36_tik
cs-th:tika:sut.sn.37_tik
cs-th:tika:sut.sn.38_tik
cs-th:tika:sut.sn.39_tik
cs-th:tika:sut.sn.40_tik
cs-th:tika:sut.sn.41_tik
cs-th:tika:sut.sn.42_tik
cs-th:tika:sut.sn.43_tik
cs-th:tika:sut.sn.44_tik
cs-th:tika:sut.sn.45_tik
cs-th:tika:sut.sn.46_tik
cs-th:tika:sut.sn.47_tik
cs-th:tika:sut.sn.48_tik
cs-th:tika:sut.sn.49_tik
cs-th:tika:sut.sn.50_tik
cs-th:tika:sut.sn.51_tik
cs-th:tika:sut.sn.52_tik
cs-th:tika:sut.sn.53_tik
cs-th:tika:sut.sn.54_tik
cs-th:tika:sut.sn.55_tik
cs-th:tika:sut.sn.56_tik
cs-th:tika:vin.bhi.0_dvem_tik
cs-th:tika:vin.bhi.0_kank_tik
cs-th:tika:vin.bhi.0_vima_tik
cs-th:tika:vin.bhi.v_dvem_tik
cs-th:tika:vin.bhu.0_dvem_tik
cs-th:tika:vin.bhu.0_kank_tik
cs-th:tika:vin.bhu.ni_kank_tik
cs-th:tika:vin.bhu.pc_kank_tik
cs-th:tika:vin.bhu.pr_kank_tik
cs-th:tika:vin.bhu.sg_kank_tik
cs-th:tika:vin.cv.01_sara_tik
cs-th:tika:vin.cv.01_vima_tik
cs-th:tika:vin.cv.02_sara_tik
cs-th:tika:vin.cv.02_vima_tik
cs-th:tika:vin.cv.03_sara_tik
cs-th:tika:vin.cv.03_vima_tik
cs-th:tika:vin.cv.04_sara_tik
cs-th:tika:vin.cv.04_vima_tik
cs-th:tika:vin.cv.05_sara_tik
cs-th:tika:vin.cv.05_vima_tik
cs-th:tika:vin.cv.06_sara_tik
cs-th:tika:vin.cv.06_vima_tik
cs-th:tika:vin.cv.07_sara_tik
cs-th:tika:vin.cv.07_vima_tik
cs-th:tika:vin.cv.08_sara_tik
cs-th:tika:vin.cv.08_vima_tik
cs-th:tika:vin.cv.09_sara_tik
cs-th:tika:vin.cv.09_vima_tik
cs-th:tika:vin.cv.0_paci_tik
cs-th:tika:vin.cv.0_sara_tik
cs-th:tika:vin.cv.0_vaji_tik
cs-th:tika:vin.cv.0_vima_tik
cs-th:tika:vin.cv.10_sara_tik
cs-th:tika:vin.cv.10_vima_tik
cs-th:tika:vin.cv.11_sara_tik
cs-th:tika:vin.cv.11_vima_tik
cs-th:tika:vin.cv.12_sara_tik
cs-th:tika:vin.cv.12_vima_tik
cs-th:tika:vin.kank.0_kank_tik
cs-th:tika:vin.kankha.0_dvem_tik
cs-th:tika:vin.khud.01_khud_tik
cs-th:tika:vin.khud.02_khud_tik
cs-th:tika:vin.pac.pci_vima_tik
cs-th:tika:vin.vila.08_vila_tik
cs-th:tika:vin.vila.09_vila_tik
cs-th:tika:vin.vila.10_vila_tik
cs-th:tika:vin.vila.11_vila_tik
cs-th:tika:vin.vila.12_vila_tik
cs-th:tika:vin.vila.13_vila_tik
cs-th:tika:vin.vila.14_vila_tik
cs-th:tika:vin.vila.15_vila_tik
cs-th:tika:vin.vila.16_vila_tik
cs-th:tika:vin.vila.17_vila_tik
cs-th:tika:vin.vila.18_vila_tik
cs-th:tika:vin.vila.19_vila_tik
cs-th:tika:vin.vila.20_vila_tik
cs-th:tika:vin.vila.21_vila_tik
cs-th:tika:vin.vila.22_vila_tik
cs-th:tika:vin.vila.23_vila_tik
cs-th:tika:vin.vila.24_vila_tik

_/\_
Posted by: Dhammañāṇa
« on: March 30, 2019, 04:07:07 PM »

Sadhu for effort and care. May Nyom always give/take himself his time.

(The "big pages", Atma thinks about 10 %, like the others of the CSCD Tipitaka, will not change later on in regard to content. Atma remembers that when there was still a search engine on ZzE, it was likewise never possible to index all Pali Tipitaka pages of the original ATI; there were always errors.

On the other side, on ZzE once and also now on ati.eu, there have been times where the index was obviously complete.)

Posted by: Moritz
« on: March 30, 2019, 02:48:15 PM »

The indexing script I had started on the server (which should do just the same as the CLI indexer script) stopped at some point because it ran out of memory (RAM, not disk storage). It seems that certain pages simply cannot be indexed because the indexer would need too much memory for them.
For example, http://accesstoinsight.eu/cs-th:tika:sut.dn.0_tik and the following pages always fail with
Code:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/clients/client2157/web5417/web/inc/indexer.php on line 612
or similar.

Line 612 is here:
Code:
$wordlist = explode(' ', $text);
which splits the whole text of a page into single words at spaces.

But I really do not understand why this would take so much memory. Also, replicating this same operation on my computer, splitting the same page text with the same methods into single words and storing in a variable in PHP, does not need nearly as much memory here.
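
To put the numbers from the error message into perspective, here is a small arithmetic sketch in Python. The two values are copied from the error above; the helper name `describe` is made up.

```python
MIB = 1024 * 1024

memory_limit = 134217728    # "Allowed memory size" from the error message
failed_request = 67108872   # "tried to allocate" from the error message

def describe(n_bytes):
    """Express a byte count as whole MiB plus a remainder in bytes."""
    whole, rest = divmod(n_bytes, MIB)
    return f"{whole} MiB + {rest} bytes"

# Memory that must already have been in use for the request to exceed the limit.
minimum_in_use = memory_limit - failed_request
```

So the limit is 128 MiB, and the single allocation that failed was for just over 64 MiB. This suggests that the script was already holding roughly 64 MiB or more when it asked for that block.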

I have given up trying to find a way to work around it for now.

I continued indexing with the other method, which runs locally on my computer, sends a command over the network for every single page to be indexed, and does not stop if a page fails to be indexed. It has currently processed about 11000 pages (with many "holes" where pages just cannot be indexed on the current server).

It should be finished in some 16 hours if it is just left to run. But with the current server infrastructure it seems the search index will always be incomplete.

_/\_
Posted by: Dhammañāṇa
« on: March 29, 2019, 03:44:28 PM »

Sadhu
Posted by: Moritz
« on: March 29, 2019, 02:25:48 PM »

I accidentally restarted rebuilding the index again from scratch. So now, progress is again at about 5000/20000 pages.

I wrote a new script, adapting methods from the CLI script, so that the whole process runs on the server, without needing a connection and an open browser window all the time to send commands for every single page to be indexed one by one.
This should at least be a little bit faster, without the sending commands and responses back and forth, but the speed difference is not really noticeable. So it should, again, be finished in one day.

The current progress can be seen by opening http://accesstoinsight.eu/indexer.success.log (listing pages that were indexed successfully) and http://accesstoinsight.eu/indexer.error.log (listing pages which could not be indexed for some reason, currently empty).
There is a counting number before each page name in the lists, so one can see how many pages have already been processed.
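
The general shape of such a batch run with numbered log entries might look like the Python sketch below. It is not the script running on the server (that one is PHP); `index_all` and `fake_index_page` are hypothetical names, and unlike the server script described here, this sketch keeps going after a failed page.

```python
import io

def index_all(page_ids, index_page, success_log, error_log):
    """Index every page, writing a running number before each page name."""
    ok, failed = [], []
    for number, page_id in enumerate(page_ids, start=1):
        try:
            index_page(page_id)
        except Exception as exc:
            # Record the failure and carry on with the next page.
            failed.append(page_id)
            error_log.write(f"{number} {page_id}: {exc}\n")
        else:
            ok.append(page_id)
            success_log.write(f"{number} {page_id}\n")
    return ok, failed

def fake_index_page(page_id):
    """Stand-in for the real indexing call; fails for Thai pages (hypothetical)."""
    if page_id.startswith("cs-th:"):
        raise MemoryError("allowed memory size exhausted")

success_log, error_log = io.StringIO(), io.StringIO()
ok, failed = index_all(
    ["en:index", "cs-th:tika:sut.dn.0_tik", "de:index"],
    fake_index_page, success_log, error_log,
)
```

Because the counter runs over all pages, gaps in the numbering of the success log show directly where pages failed.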

_/\_
Posted by: Dhammañāṇa
« on: March 28, 2019, 02:51:04 PM »

Sadhu
Posted by: Moritz
« on: March 28, 2019, 02:48:11 PM »

Quote
Currently not using search or batchedit, however Nyom might think.

(There is a built-in indexer.php which, as told, can be executed directly on the server to rebuild the index. Maybe that helps. https://www.dokuwiki.org/cli#indexerphp )

Rebuilding index started.

The helper scripts listed on https://www.dokuwiki.org/cli are only usable if one has shell access on the server. But that is not the case for the Greensta server here. (But still possibly useful to look into and adapt something maybe when having more time for it.) So just using the previous approach now.

_/\_
Posted by: Dhammañāṇa
« on: March 28, 2019, 02:32:56 PM »

Currently not using search or batchedit, however Nyom might think.

(There is a built-in indexer.php which, as told, can be executed directly on the server to rebuild the index. Maybe that helps. https://www.dokuwiki.org/cli#indexerphp )
Posted by: Moritz
« on: March 28, 2019, 02:22:00 PM »

If it is helpful at this time, I could start the process to rebuild the search index from here, which should be finished after one day.

So, if it is not necessary to search much in the next 24 hours, I would just start to re-index now.

_/\_

*  I think there might be solutions with more professional search engines that would make it less time consuming to keep a search index working and maintained. Maybe I can find a possibility with some time. But that is another topic. _/\_
Posted by: Dhammañāṇa
« on: March 27, 2019, 07:46:51 PM »

Indexing has finished, but a quick search with batchedit suggests that not all pages are included in the search index for now. Let's see.