Welcome to the Question2Answer Q&A. There's also a demo if you just want to try it out.
+5 votes
34.3k views
in Q2A Core by
For example, let's say you are asking something about word1. Question A and B both have word1 in their title but question A has only 2 words and question B has 6, I'd like Question A to have a better score.
related to an answer for: Usage and regularity of the search-box

1 Answer

+1 vote
by

The short answer is... not easily at all. You can see the query that Q2A uses to find related posts in function qa_db_related_qs_selectspec(...) in qa-db-selects.php - there's no straightforward way to modify this query to take account of the number of words in the title of each potentially related post, though no doubt it could be done, with a little more nesting of subqueries.

by
Well... I worked with that a long while ago! :)

The code I have modified is from Q2A 1.4 (now 1.7.4) But here is what I have in that method and it seems to put short answer first.  I'm not working on that anymore so I cannot help you much, but if you are in it maybe it will help you.  Some code modified was for moderation too, so you will need to see what may or may not help you.
by
File: qa-db-selects.php

    function qa_db_related_qs_selectspec($voteuserid, $questionid, $count=QA_DB_RETRIEVE_QS_AS)
/*
    Return the selectspec to retrieve the $count most closely related questions to $questionid,
    with the corresponding vote made by $voteuserid (if not null). This works by looking for other
    questions which have title words, tag words or an (exact) category in common.
*/
    {
        $selectspec=qa_db_posts_basic_selectspec($voteuserid);
       
        $selectspec['columns'][]='score';
       
        // added LOG(postid)/1000000 here to ensure ordering is deterministic even if several posts have same score
       
        $selectspec['source'].=" JOIN (SELECT postid, SUM(score)+LOG(postid)/1000000 AS score FROM ((SELECT ^titlewords.postid, LOG(#/titlecount) AS score FROM ^titlewords JOIN ^words ON ^titlewords.wordid=^words.wordid JOIN ^titlewords AS source ON ^titlewords.wordid=source.wordid WHERE source.postid=# AND titlecount<#) UNION ALL (SELECT ^posttags.postid, 2*LOG(#/tagcount) AS score FROM ^posttags JOIN ^words ON ^posttags.wordid=^words.wordid JOIN ^posttags AS source ON ^posttags.wordid=source.wordid WHERE source.postid=# AND tagcount<#) UNION ALL (SELECT ^posts.postid, LOG(#/^categories.qcount) FROM ^posts JOIN ^categories ON ^posts.categoryid=^categories.categoryid AND ^posts.type='Q' WHERE ^categories.categoryid=(SELECT categoryid FROM ^posts WHERE postid=#) AND ^categories.qcount<#)) x GROUP BY postid ORDER BY score DESC LIMIT #) y ON ^posts.postid=y.postid";
       
        array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $questionid, QA_IGNORED_WORDS_FREQ, QA_IGNORED_WORDS_FREQ,
            $questionid, QA_IGNORED_WORDS_FREQ, QA_IGNORED_WORDS_FREQ, $questionid, QA_IGNORED_WORDS_FREQ, $count);
           
        $selectspec['sortdesc']='score';
           
        return $selectspec;
    }
}
by
function qa_db_search_posts_selectspec($voteuserid, $titlewords, $contentwords, $tagwords, $handlewords, $handle, $start, $full=false, $count=QA_DB_RETRIEVE_QS_AS)
/*
    Return the selectspec to retrieve the $count top question matches, starting from the offset $start,
    with the corresponding vote made by $voteuserid (if not null) and including $full content or not.
    The search is performed for any of $titlewords in the title, $contentwords in the content (of the
    question or an answer or comment for whom that is the antecedent question), $tagwords in tags, for
    question author usernames which match a word in $handlewords or which match $handle as a whole.
    The results also include a 'score' column based on the matching strength and post hotness,
    and a 'matchparts' column that tells us where the score came from (since a question could get weight
    from a match in the question itself, and/or weight from a match in its answers, comments, or
    comments on answers). The 'matchparts' is a comma-separated list of tuples
    matchtype:matchpostid:matchscore to be used with qa_search_max_match_anchor().
*/
    {
        // add LOG(postid)/1000000 here to ensure ordering is deterministic even if several posts have same score
        // The score also gives a bonus for hot questions, where the bonus scales linearly with hotness. The hottest
        // question gets a bonus equivalent to a matching unique tag, and the least hot question gets zero bonus.

        $selectspec=qa_db_posts_basic_selectspec($voteuserid, $full);
       
        $selectspec['columns'][]='score';
        $selectspec['columns'][]='matchparts';
        $selectspec['source'].=" JOIN (SELECT questionid, SUM(score)+2*(LOG(#)*(^posts.hotness-(SELECT MIN(hotness) FROM ^posts WHERE type='Q'))/((SELECT MAX(hotness) FROM ^posts WHERE type='Q')-(SELECT MIN(hotness) FROM ^posts WHERE type='Q')))+LOG(questionid)/1000000 AS score, GROUP_CONCAT(CONCAT_WS(':', matchposttype, matchpostid, ROUND(score,3))) AS matchparts FROM (";
        $selectspec['sortdesc']='score';
        array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ);
       
        $selectparts=0;
       
        if (!empty($titlewords)) {
            // At the indexing stage, duplicate words in title are ignored, so this doesn't count multiple appearances.
           
            $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                "(SELECT postid AS questionid, LOG(#/titlecount) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^titlewords JOIN ^words ON ^titlewords.wordid=^words.wordid WHERE word IN ($) AND titlecount<#)";

            array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $titlewords, QA_IGNORED_WORDS_FREQ);
        }
       
        if (!empty($contentwords)) {
            // (1-1/(1+count)) weights words in content based on their frequency: If a word appears once in content
            // it's equivalent to 1/2 an appearance in the title (ignoring the contentcount/titlecount factor).
            // If it appears an infinite number of times, it's equivalent to one appearance in the title.
            // This will discourage keyword stuffing while still giving some weight to multiple appearances.
            // On top of that, answer matches are worth half a question match, and comment matches half again.
           
            $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                "(SELECT questionid, (1-1/(1+count))*LOG(#/contentcount)*(CASE ^contentwords.type WHEN 'Q' THEN 1.0 WHEN 'A' THEN 0.5 ELSE 0.25 END) AS score, ^contentwords.type AS matchposttype, ^contentwords.postid AS matchpostid FROM ^contentwords JOIN ^words ON ^contentwords.wordid=^words.wordid WHERE word IN ($) AND contentcount<#)";

            array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $contentwords, QA_IGNORED_WORDS_FREQ);
        }
       
        if (!empty($tagwords)) {
            // Appearances in the tag words count like 2 appearances in the title (ignoring the tagcount/titlecount factor).
            // This is because tags express explicit semantic intent, whereas titles do not necessarily.
           
            $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                "(SELECT postid AS questionid, 2*LOG(#/tagwordcount) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^tagwords JOIN ^words ON ^tagwords.wordid=^words.wordid WHERE word IN ($) AND tagwordcount<#)";

            array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $tagwords, QA_IGNORED_WORDS_FREQ);
        }
       
        if (!empty($handlewords)) {
            if (QA_FINAL_EXTERNAL_USERS) {
                $userids=qa_get_userids_from_public($handlewords);
               
                if (count($userids)) {
                    $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                        "(SELECT postid AS questionid, LOG(#/qposts) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^posts JOIN ^userpoints ON ^posts.userid=^userpoints.userid WHERE ^posts.userid IN ($) AND type='Q')";
                   
                    array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $userids);
                }

            } else {
                $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                    "(SELECT postid AS questionid, LOG(#/qposts) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^posts JOIN ^users ON ^posts.userid=^users.userid JOIN ^userpoints ON ^userpoints.userid=^users.userid WHERE handle IN ($) AND type='Q')";

                array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $handlewords);
            }
        }
       
        if (strlen($handle)) { // to allow searching for multi-word usernames (only works if search query contains full username and nothing else)
            if (QA_FINAL_EXTERNAL_USERS) {
                $userids=qa_get_userids_from_public(array($handle));
                $userid=@$userids[$handle];
               
                if (strlen($userid)) {
                    $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                        "(SELECT postid AS questionid, LOG(#/qposts) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^posts JOIN ^userpoints ON ^posts.userid=^userpoints.userid WHERE ^posts.userid=$ AND type='Q')";
                   
                    array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $userid);
                }

            } else {
                $selectspec['source'].=($selectparts++ ? " UNION ALL " : "").
                    "(SELECT postid AS questionid, LOG(#/qposts) AS score, _utf8 'Q' AS matchposttype, postid AS matchpostid FROM ^posts JOIN ^users ON ^posts.userid=^users.userid JOIN ^userpoints ON ^userpoints.userid=^users.userid WHERE handle=$ AND type='Q')";

                array_push($selectspec['arguments'], QA_IGNORED_WORDS_FREQ, $handle);
            }
        }
       
        if ($selectparts==0)
            $selectspec['source'].='(SELECT NULL as questionid, 0 AS score, NULL AS matchposttype, NULL AS matchpostid FROM ^posts WHERE postid=NULL)';

        $selectspec['source'].=") x LEFT JOIN ^posts ON ^posts.postid=questionid GROUP BY questionid ORDER BY score DESC LIMIT #,#) y ON ^posts.postid=y.questionid";
       
        array_push($selectspec['arguments'], $start, $count);
       
        return $selectspec;
    }
by
edited by
Im wondering if I could score buff using +.5 by counting the length of the question string too.

To see this via sql, you can use:

This will show the title of the post (Question) and the character count, including spaces:
SELECT title, length(title) FROM `qa_posts` WHERE title is not null;

This will show the title of the post (withoutSpaces) and the character count, excluding spaces as they will not be present.
SELECT replace(title,' ','') as title, length(replace(title,' ','')) as len_Count FROM `qa_posts` WHERE title is not null;

So, the issue now is that...
Hippopotamus, Eye, Toe will return 18 (for three words)
Toe, Eye, Hand, Nail, Neck will return 18 (for five words)

Overall it may be a bit better at best, but not 100% simply because two questions can have the same count, but at least the keywords COULD be different, keeping similar within this.  for example, if I searched for Eye...both of these would be returned.  But it would work if I searched for Hippo.

Now the trick is somehow implementing this count into the search result evaluation.

I just realized this question was asked in 2011.  :-(
...