+1 vote
1.8k views
in Q2A Core

In Japanese there are no spaces between words, so the search engine can't pick up any words except those in tags.

I found the following Q&A:

Q2A converts text into words (for searching) based on separating out words by word delimiters, like spaces (obviously), commas, quote marks, etc...

You can see the full set used at the top of qa-util-string.php, in the constant QA_PREG_INDEX_WORD_SEPARATOR and also in the mapping $qa_utf8punctuation which converts UTF-8 punctuation characters.

In the case of Chinese and similar languages, there are no word delimiters per se, but rather each multibyte (UTF-8) character is essentially a separate word.
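
For illustration, the splitting that this quoted answer describes looks roughly like the snippet below (the same lines are quoted again in the second answer further down); every CJK ideograph gets a space on either side, so it is indexed as its own one-character word. Check your own qa-include/util/string.php for the exact code:

// Inside qa_string_to_words() in qa-include/util/string.php (Q2A 1.7, roughly):
// put spaces around CJK ideographs so each one is treated as a separate word
if ($splitideographs)
    $string = preg_replace('/'.QA_PREG_CJK_IDEOGRAPHS_UTF8.'/', ' \0 ', $string);

// e.g. "漢字を書く" becomes roughly " 漢  字 を 書 く": the kanji are split into
// single characters while the hiragana are left alone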

I don't know exactly how to do that. It would be very much appreciated if someone could write the code for me and tell me which file to put it in.

One of the users wrote the following code, but it seems to be a problem because of slowness.

This is what he did:

Open the qa-include/qa-db-selects.php file and go to the qa_db_search_posts_selectspec() function.

Just before the line below (almost at the end of the function)

if ($selectparts==0)

add these lines:

// join the search words into a single string to use as a LIKE substring
if (!empty($handlewords))
    $aaa = implode('', $handlewords);
else
    $aaa = '';

// add an extra UNION part that does a substring search over question titles and content
// (the original snippet also joined ^users here, but that join is not needed)
$selectspec['source'] .= ($selectparts++ ? " UNION ALL " : "").
    "(SELECT postid AS questionid, 0 AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid".
    " FROM ^posts WHERE (^posts.title LIKE _utf8 '%".$aaa."%' OR ^posts.content LIKE _utf8 '%".$aaa."%') AND type='Q')";


This will work fine, except if your database starts containing a large amount of content, in which case it could become quite slow.

2 Answers

+1 vote
by
Thanks for the answer.
Yeah, I did see them, but I couldn't work out what I should do.

I saw one way to take care of this problem, but my site would probably become slow.

If you have solved this problem, please tell me what code to put where.
+3 votes
by

I have the same problem. I searched and found that Q2A 1.7.0 already puts a space between every two kanji in a Japanese sentence, and that makes the search results very bad.

So I added some code to the qa_string_to_words() function in qa-include/util/string.php.

--------------------------------

/* Removed this, because a Japanese word can contain one or more kanji; it is replaced by the code below.
if ($splitideographs) // put spaces around CJK ideographs so they're treated as separate words
    $string=preg_replace('/'.QA_PREG_CJK_IDEOGRAPHS_UTF8.'/', ' \0 ', $string);
*/

// split around the particle を (\x{3092})
$string=preg_replace("/(\x{3092})/u", ' \1 ', $string);

// split after れる or れた (\x{308c} followed by \x{308b} or \x{305f})
$string=preg_replace("/(\x{308c}[\x{308b}\x{305f}])/u", '\1 ', $string);

// split the particle の (\x{306e}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{306e})(\p{Han})/u", '\1 \2 \3', $string);

// split the particle は (\x{306f}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{306f})(\p{Han})/u", '\1 \2 \3', $string);

// split the particle が (\x{304c}) when it sits between two kanji
$string=preg_replace("/(\p{Han})(\x{304c})(\p{Han})/u", '\1 \2 \3', $string);

// split between non-kanji and kanji
$string=preg_replace("/(\P{Han})(\p{Han})/u", '\1 \2', $string);
$string=preg_replace("/(\p{Han})(?=\P{Han})(\P{Hiragana})/u", '\1 \2', $string);

// split between katakana and non-katakana (but not at the long-vowel mark ー, \x{30fc})
$string=preg_replace("/(\p{Katakana})(?=\P{Katakana})(?=[^\x{30fc}])/u", '\1 \2', $string);
$string=preg_replace("/(?=[^\x{30fc}])(\P{Katakana})(\p{Katakana})/u", '\1 \2', $string);

// split between hiragana and katakana/romaji
$string=preg_replace("/(\p{Hiragana})(\P{Hiragana})/u", '\1 \2', $string);
//$string=preg_replace("/([a-zA-Z0-9])(\p{Hiragana})/u", '\1 \2', $string);
$string=preg_replace("/(?=\P{Han})(\P{Hiragana})(\p{Hiragana})/u", '\1 \2', $string);

// remove Japanese punctuation
$string=preg_replace("/[\x{3000}-\x{3004}]|[\x{3008}-\x{303f}]|\x{30fb}|[\x{ff00}-\x{ff0f}]|[\x{ff1a}-\x{ff20}]|[\x{ff3b}-\x{ff40}]|[\x{ff5b}-\x{ff65}]|[\x{ffa0}-\x{ffef}]/u", ' ', $string);

--------------------------------

Here is my page: Kiwidic.com, a Japanese-Vietnamese dictionary.

I think it is not the best solution, but the results are improved a lot.

P.S. Remember to reindex the database after modifying this function.

by
edited by
My product supports Japanese. Unfortunately, since my product is for the commercial use of a Japanese corporation, I cannot provide the plugin.

You seem to have programming skills. As a Japanese person I find your work interesting, so I would like to offer you one hint. If the post count of your site is not too high, you should be able to use Yahoo's Morphological Analysis Engine.
http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
http://www.powerqa.org/qa/275/yahoo-morphological-analysis
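
For reference, calling that service from PHP could look something like the rough sketch below. The endpoint, parameter names (appid, sentence) and XML response shape are assumptions based on the V1 documentation linked above, so please verify them before using this:

// Rough, untested sketch: split a Japanese sentence into words with Yahoo's
// Morphological Analysis Web API (verify the request format against the docs above).
function yahoo_ma_split($sentence, $appid)
{
    $url = 'https://jlp.yahooapis.jp/MAService/V1/parse?'.http_build_query(array(
        'appid' => $appid,
        'sentence' => $sentence,
    ));

    $response = @file_get_contents($url);
    if ($response === false)
        return array(); // request failed; the caller should fall back to the regex approach

    // crude extraction of the <surface> values; a real implementation should
    // parse the XML properly (the response may use an XML namespace)
    preg_match_all('|<surface>(.*?)</surface>|u', $response, $matches);
    return $matches[1];
}

// Example: yahoo_ma_split('トゥアンさんは漢字を書けます。', 'YOUR_APP_ID')
// should return something like array('トゥアン', 'さん', 'は', '漢字', 'を', '書け', 'ます', '。')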

Here is one example of my Japanese sites (it is not Q2A):
http://ja.powerqa.org/qa/
This site has been customized for Japan; you can try searching in Japanese there.

I wish you success.
by
Also, why use so many preg_replace calls instead of combining most of them to improve performance? See the sketch below.
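
Something like this, untested and just to illustrate the idea: the three kanji-particle rules above could be merged into one pattern with a character class, and preg_replace() also accepts arrays of patterns and replacements when the substitutions differ:

// one pattern instead of three: の, は or が sitting between two kanji
$string = preg_replace("/(\p{Han})([\x{306e}\x{306f}\x{304c}])(\p{Han})/u", '\1 \2 \3', $string);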
by
Thank you, sama55, I will do some research based on your information.
@arjunsuresh: because I am a newbie, I didn't know how.
by
@Nguyễn Đăng Phú
If your site is on a dedicated server or VPS, you may be able to use MeCab (php-mecab) or Igo (igo_php). These libraries must be compiled with gcc. Recently, Google has also started providing a morphological analysis API; however, it charges a usage fee. There is a rough MeCab sketch after the links below.
https://ja.wikipedia.org/wiki/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90#.E5.85.A5.E6.89.8B.E5.8F.AF.E8.83.BD.E3.81.AA.E6.97.A5.E6.9C.AC.E8.AA.9E.E3.81.AE.E5.BD.A2.E6.85.8B.E7.B4.A0.E8.A7.A3.E6.9E.90.E3.82.A8.E3.83.B3.E3.82.B8.E3.83.B3.E3.80.80
https://techcrunch.com/2016/07/20/google-launches-new-api-to-help-you-parse-natural-language/
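
As a rough illustration only, assuming the mecab command-line tool and a dictionary are installed on the server (which is usually only possible on a dedicated server or VPS), word splitting can also be done by shelling out to MeCab's wakati output mode instead of using the php-mecab extension:

// Rough sketch: let MeCab segment a sentence into space-separated words.
// -Owakati prints the input with a single space between each word.
function mecab_wakati_split($sentence)
{
    $output = shell_exec('echo '.escapeshellarg($sentence).' | mecab -Owakati');
    if (!is_string($output) || trim($output) === '')
        return array();

    return preg_split('/\s+/u', trim($output), -1, PREG_SPLIT_NO_EMPTY);
}

// Example: mecab_wakati_split('トゥアンさんは漢字を書けます。')
// => something like array('トゥアン', 'さん', 'は', '漢字', 'を', '書け', 'ます', '。')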
by
After trying both the Yahoo API and Igo, it is a pity that I cannot use either of them.
The Yahoo API does not work if my sentence contains both Vietnamese and Japanese.
Igo does not work on my host (though it does work on my local server).
by
That's really unfortunate. Google's Cloud Natural Language (CNL) API seems to work properly, but the usage fee is very expensive...

Example sentence:
Anh Tuấn viết được chữ Hán. トゥアンさんは漢字を書けます。

Analysis result from the Google CNL API:
https://cloud.google.com/natural-language/
http://www.powerqa.org/qa/?qa=blob&qa_blobid=4504330844865995434
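
For anyone who does want to try it, a call from PHP could look roughly like the sketch below. It assumes the documents:analyzeSyntax REST method and an API key, and requests are billed, so check the pricing at the link above; treat this as an untested example:

// Rough sketch: tokenize text with Google's Cloud Natural Language API (analyzeSyntax)
function google_cnl_split($text, $apikey)
{
    $url = 'https://language.googleapis.com/v1/documents:analyzeSyntax?key='.urlencode($apikey);
    $body = json_encode(array(
        'document' => array('type' => 'PLAIN_TEXT', 'content' => $text),
        'encodingType' => 'UTF8',
    ));

    $context = stream_context_create(array('http' => array(
        'method' => 'POST',
        'header' => "Content-Type: application/json\r\n",
        'content' => $body,
    )));

    $response = @file_get_contents($url, false, $context);
    if ($response === false)
        return array();

    // collect the surface text of each token from the JSON response
    $words = array();
    $result = json_decode($response, true);
    if (isset($result['tokens'])) {
        foreach ($result['tokens'] as $token)
            $words[] = $token['text']['content'];
    }

    return $words;
}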
by
There seems to be another such API: the "Japanese Keyword Extraction API" by Goo Labs.
https://labs.goo.ne.jp/api/en/keyword-extraction/
I offer it for your reference.
by
Thanks for your kind help, sama55. But I also found that, for every search result, Q2A calls qa_string_to_words() at least once, so every search (which may return many results) would make several requests to Yahoo or any other API provider. That would reduce search performance a lot.
So I will try another approach. Anyway, at the moment, using regular expressions seems to work well.
by
If it can be done with regular expressions, that is good. Word division has a great influence on the search feature: when a sentence is not divided correctly, the search results are not narrowed down correctly. I recommend testing the search feature thoroughly. I expect you will get good results.
...