+1 vote
1.8k views
in Q2A Core

In Japanese there are no spaces between words, so the search engine can't pick up any words except those in tags.

I found the following Q&A:

Q2A converts text into words (for searching) based on separating out words by word delimiters, like spaces (obviously), commas, quote marks, etc...

You can see the full set used at the top of qa-util-string.php, in the constant QA_PREG_INDEX_WORD_SEPARATOR and also in the mapping $qa_utf8punctuation which converts UTF-8 punctuation characters.

In the case of Chinese and similar languages, there are no word delimiters per se, but rather each multibyte (UTF-8) character is essentially a separate word.
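
For illustration, the splitting that this quoted answer describes looks roughly like the snippet below (the same lines are quoted again in the second answer further down); every CJK ideograph gets a space on either side, so it is indexed as its own one-character word. Check your own qa-include/util/string.php for the exact code:

// Inside qa_string_to_words() in qa-include/util/string.php (Q2A 1.7, roughly):
// put spaces around CJK ideographs so each one is treated as a separate word
if ($splitideographs)
    $string = preg_replace('/'.QA_PREG_CJK_IDEOGRAPHS_UTF8.'/', ' \0 ', $string);

// e.g. "漢字を書く" becomes roughly " 漢  字 を 書 く": the kanji are split into
// single characters while the hiragana are left alone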

I don't know exactly how to do that. It would be very much appreciated if someone could write the code for me and tell me which file to put it in.

One of the users wrote the following code, but it seems to be a problem because of slowness.

This is what he did:

Open the qa-include/qa-db-selects.php file and go to the qa_db_search_posts_selectspec() function.

Just before the line below (almost at the end of the function)

if ($selectparts==0)

add these lines:

// join the search words into a single string to use as a LIKE substring
if (!empty($handlewords))
    $aaa = implode('', $handlewords);
else
    $aaa = '';

// add an extra UNION part that does a substring search over question titles and content
// (the original snippet also joined ^users here, but that join is not needed)
$selectspec['source'] .= ($selectparts++ ? " UNION ALL " : "").
    "(SELECT postid AS questionid, 0 AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid".
    " FROM ^posts WHERE (^posts.title LIKE _utf8 '%".$aaa."%' OR ^posts.content LIKE _utf8 '%".$aaa."%') AND type='Q')";


This will work fine, except if your database starts containing a large amount of content, in which case it could become quite slow.

2 Answers

+1 vote
by
Thanks for the answer.
Yeah, I did see them, but I couldn't work out what I should do.

I saw one way to take care of this problem, but my site would probably become slow.

If you have solved this problem, please tell me what code to put where.
+3 votes
by

I have the same problem. I searched and found that Q2A 1.7.0 already puts a space between every two kanji in a Japanese sentence, and that makes the search results very bad.

So I added some code to the qa_string_to_words() function in qa-include/util/string.php.

--------------------------------

/* Removed this, because a Japanese word can contain one or more kanji; it is replaced by the code below.
if ($splitideographs) // put spaces around CJK ideographs so they're treated as separate words
    $string=preg_replace('/'.QA_PREG_CJK_IDEOGRAPHS_UTF8.'/', ' \0 ', $string);
*/

// split around the particle を (\x{3092})
$string=preg_replace("/(\x{3092})/u", ' \1 ', $string);

// split after れる or れた (\x{308c} followed by \x{308b} or \x{305f})
$string=preg_replace("/(\x{308c}[\x{308b}\x{305f}])/u", '\1 ', $string);

// split the particle の (\x{306e}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{306e})(\p{Han})/u", '\1 \2 \3', $string);

// split the particle は (\x{306f}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{306f})(\p{Han})/u", '\1 \2 \3', $string);

// split the particle が (\x{304c}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{304c})(\p{Han})/u", '\1 \2 \3', $string);

// split between non-kanji and kanji
$string=preg_replace("/(\P{Han})(\p{Han})/u", '\1 \2', $string);
$string=preg_replace("/(\p{Han})(?=\P{Han})(\P{Hiragana})/u", '\1 \2', $string);

// split between katakana and non-katakana (but not at the long-vowel mark ー, \x{30fc})
$string=preg_replace("/(\p{Katakana})(?=\P{Katakana})(?=[^\x{30fc}])/u", '\1 \2', $string);
$string=preg_replace("/(?=[^\x{30fc}])(\P{Katakana})(\p{Katakana})/u", '\1 \2', $string);

// split between hiragana and katakana/romaji
$string=preg_replace("/(\p{Hiragana})(\P{Hiragana})/u", '\1 \2', $string);
//$string=preg_replace("/([a-zA-Z0-9])(\p{Hiragana})/u", '\1 \2', $string);
$string=preg_replace("/(?=\P{Han})(\P{Hiragana})(\p{Hiragana})/u", '\1 \2', $string);

// remove Japanese punctuation
$string=preg_replace("/[\x{3000}-\x{3004}]|[\x{3008}-\x{303f}]|\x{30fb}|[\x{ff00}-\x{ff0f}]|[\x{ff1a}-\x{ff20}]|[\x{ff3b}-\x{ff40}]|[\x{ff5b}-\x{ff65}]|[\x{ffa0}-\x{ffef}]/u", ' ', $string);

--------------------------------

Here is my page: Kiwidic.com, a Japanese-Vietnamese dictionary.

I think it is not the best solution, but the results are improved a lot.

P.S. Remember to reindex the database after modifying this function.

by
edited by
My product supports Japanese. Unfortunately, since my product is for the commercial use of a Japanese corporation, I cannot provide the plugin.

You seem to have programming skills. As a Japanese person I find your work interesting, so I would like to offer you one hint. If the post count of your site is not too high, you should be able to use Yahoo's Morphological Analysis Engine.
http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
http://www.powerqa.org/qa/275/yahoo-morphological-analysis
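
For reference, calling that service from PHP could look something like the rough sketch below. The endpoint, parameter names (appid, sentence) and XML response shape are assumptions based on the V1 documentation linked above, so please verify them before using this:

// Rough, untested sketch: split a Japanese sentence into words with Yahoo's
// Morphological Analysis Web API (verify the request format against the docs above).
function yahoo_ma_split($sentence, $appid)
{
    $url = 'https://jlp.yahooapis.jp/MAService/V1/parse?'.http_build_query(array(
        'appid' => $appid,
        'sentence' => $sentence,
    ));

    $response = @file_get_contents($url);
    if ($response === false)
        return array(); // request failed; the caller should fall back to the regex approach

    // crude extraction of the <surface> values; a real implementation should
    // parse the XML properly (the response may use an XML namespace)
    preg_match_all('|<surface>(.*?)</surface>|u', $response, $matches);
    return $matches[1];
}

// Example: yahoo_ma_split('トゥアンさんは漢字を書けます。', 'YOUR_APP_ID')
// should return something like array('トゥアン', 'さん', 'は', '漢字', 'を', '書け', 'ます', '。')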

Here is one example of my Japanese sites (it is not Q2A):
http://ja.powerqa.org/qa/
This site has been customized for Japan; you can try searching in Japanese there.

I wish you success.
by
Also, why use so many preg_replace calls instead of combining most of them to improve performance? See the sketch below.
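
Something like this, untested and just to illustrate the idea: the three kanji-particle rules above could be merged into one pattern with a character class, and preg_replace() also accepts arrays of patterns and replacements when the substitutions differ:

// one pattern instead of three: の, は or が sitting between two kanji
$string = preg_replace("/(\p{Han})([\x{306e}\x{306f}\x{304c}])(\p{Han})/u", '\1 \2 \3', $string);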
by
Thank you, sama55, I will do some research based on your information.
@arjunsuresh: because I am a newbie, I didn't know how.
by
@Nguyễn Đăng Phú
If your site is on a dedicated server or VPS, you may be able to use MeCab (php-mecab) or Igo (igo_php). These libraries must be compiled with gcc. Recently, Google has also started providing a morphological analysis API; however, it charges a usage fee. There is a rough MeCab sketch after the links below.
https://ja.wikipedia.org/wiki/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90#.E5.85.A5.E6.89.8B.E5.8F.AF.E8.83.BD.E3.81.AA.E6.97.A5.E6.9C.AC.E8.AA.9E.E3.81.AE.E5.BD.A2.E6.85.8B.E7.B4.A0.E8.A7.A3.E6.9E.90.E3.82.A8.E3.83.B3.E3.82.B8.E3.83.B3.E3.80.80
https://techcrunch.com/2016/07/20/google-launches-new-api-to-help-you-parse-natural-language/
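
As a rough illustration only, assuming the mecab command-line tool and a dictionary are installed on the server (which is usually only possible on a dedicated server or VPS), word splitting can also be done by shelling out to MeCab's wakati output mode instead of using the php-mecab extension:

// Rough sketch: let MeCab segment a sentence into space-separated words.
// -Owakati prints the input with a single space between each word.
function mecab_wakati_split($sentence)
{
    $output = shell_exec('echo '.escapeshellarg($sentence).' | mecab -Owakati');
    if (!is_string($output) || trim($output) === '')
        return array();

    return preg_split('/\s+/u', trim($output), -1, PREG_SPLIT_NO_EMPTY);
}

// Example: mecab_wakati_split('トゥアンさんは漢字を書けます。')
// => something like array('トゥアン', 'さん', 'は', '漢字', 'を', '書け', 'ます', '。')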
by
After trying both the Yahoo API and Igo, it is a pity that I cannot use either of them.
The Yahoo API does not work if my sentence contains both Vietnamese and Japanese.
Igo does not work on my host (though it does work on my local server).
by
That's really unfortunate. Google's Cloud Natural Language (CNL) API seems to work properly, but the usage fee is very expensive...

Example sentence:
Anh Tuấn viết được chữ Hán. トゥアンさんは漢字を書けます。

Analysis result from the Google CNL API:
https://cloud.google.com/natural-language/
http://www.powerqa.org/qa/?qa=blob&qa_blobid=4504330844865995434
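
For anyone who does want to try it, a call from PHP could look roughly like the sketch below. It assumes the documents:analyzeSyntax REST method and an API key, and requests are billed, so check the pricing at the link above; treat this as an untested example:

// Rough sketch: tokenize text with Google's Cloud Natural Language API (analyzeSyntax)
function google_cnl_split($text, $apikey)
{
    $url = 'https://language.googleapis.com/v1/documents:analyzeSyntax?key='.urlencode($apikey);
    $body = json_encode(array(
        'document' => array('type' => 'PLAIN_TEXT', 'content' => $text),
        'encodingType' => 'UTF8',
    ));

    $context = stream_context_create(array('http' => array(
        'method' => 'POST',
        'header' => "Content-Type: application/json\r\n",
        'content' => $body,
    )));

    $response = @file_get_contents($url, false, $context);
    if ($response === false)
        return array();

    // collect the surface text of each token from the JSON response
    $words = array();
    $result = json_decode($response, true);
    if (isset($result['tokens'])) {
        foreach ($result['tokens'] as $token)
            $words[] = $token['text']['content'];
    }

    return $words;
}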
by
There seems to be another such API: the "Japanese Keyword Extraction API" by Goo Labs.
https://labs.goo.ne.jp/api/en/keyword-extraction/
I offer it for your reference.
by
Thanks for your kind help, sama55. But I also found that, for every search result, Q2A calls qa_string_to_words() at least once, so every search (which may return many results) would make several requests to Yahoo or any other API provider. That would reduce search performance a lot.
So I will try another approach. Anyway, at the moment, using regular expressions seems to work well.
by
If it can be done with regular expressions, that is good. Word division has a great influence on the search feature: when a sentence is not divided correctly, the search results are not narrowed down correctly. I recommend testing the search feature thoroughly. I expect you will get good results.
...