Thanks, @pupi1985, for your thoughts.
https://stackoverflow.com/a/35046840/268273
This link is about turning Unicode characters into HTML entities, which is useful for applying the replacement when questions, answers, or comments are posted or edited. The layer, however, would have to turn escaped HTML entities (like `&amp;#128512;`) back into unescaped HTML entities (like `&#128512;`) in order to get the right title (this is dangerous, by the way).
The downside is that users will be unable to enter literal strings with this format (like `&#128512;`) in a question's title themselves.
Another (less desirable) solution would be an override for `qa_post_html_fields` that swaps HTML entities (like `&#128512;`) out for emojis (like 😀) before the title is processed by `qa_html` in qa-include/app/format.php:346.
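A minimal sketch of the round trip described above, assuming PHP 7.2+ with the mbstring extension. The function names `emoji_to_entities` and `restore_emoji_entities` are hypothetical (they are not Q2A functions), and `htmlspecialchars` only stands in for whatever escaping Q2A applies to the title.

```php
<?php
// Hypothetical helper: replace every character above U+FFFF (four bytes in
// UTF-8, which covers most emojis) with a decimal numeric entity.
function emoji_to_entities(string $text): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function (array $m): string {
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $text
    );
    return $result ?? $text; // fall back to the input on invalid UTF-8
}

// Hypothetical helper: turn escaped numeric entities (&amp;#NNN;) back into
// real entities (&#NNN;) after the title has been HTML-escaped.
function restore_emoji_entities(string $escapedHtml): string
{
    return preg_replace('/&amp;#(\d+);/', '&#$1;', $escapedHtml) ?? $escapedHtml;
}

$stored  = emoji_to_entities("Hello \u{1F600}");           // saved on post/edit
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'); // stand-in for title escaping
$shown   = restore_emoji_entities($escaped);               // what the layer would emit
```

The same round trip shows the downside: a title in which the user literally typed `&#128512;` is stored unchanged, escaped, and then restored to a real entity, so it renders as an emoji instead of the literal text.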
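The override approach could look roughly like the sketch below, assuming PHP 7.2+ with mbstring. `entities_to_emoji` is a hypothetical helper, and how it would be wired into a `qa_post_html_fields` override is not shown. It only touches decimal entities for code points above U+FFFF, so other entity-like text still gets escaped normally.

```php
<?php
// Hypothetical helper: replace decimal numeric entities that encode code
// points above U+FFFF with the actual character; leave everything else alone.
function entities_to_emoji(string $text): string
{
    $result = preg_replace_callback(
        '/&#(\d{5,7});/',
        function (array $m): string {
            $codePoint = (int) $m[1];
            if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
                return $m[0]; // not a four-byte character: keep the literal text
            }
            return mb_chr($codePoint, 'UTF-8');
        },
        $text
    );
    return $result ?? $text;
}

// Swap first, escape afterwards (htmlspecialchars stands in for qa_html here).
$title = entities_to_emoji('Hello &#128512; &#60;b&#62;');
$html  = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
```

Because the swap happens before escaping, the emoji reaches the page as a raw character and the remaining `&#60;` style strings are escaped as usual.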
> after some processing

Yes, and that processing is performed every time a single question page or a listing is visited (once per list item), including search result listings; this might be a drawback for large Q2A sites.
> how Q2A splits in words the search term

HTML entities are split at the semicolon; emojis, on the other hand, are interpreted as regular characters. For example, the emoji in the picture above would be a one-character word, while an emoji plus one more character would be a two-character word.
Since the emoji (a one-character word) is different from `&#128512` (an eight-character word), the search will never find the post listed in the picture above, unless the search page is overridden so that emojis are swapped out for HTML entities before the search is processed.
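The mismatch can be illustrated with the sketch below, assuming PHP 7.2+ with mbstring. `split_words_simplified` is a deliberately simplified stand-in, not Q2A's actual tokenizer, and `query_emoji_to_entities` is a hypothetical helper for the search-page override.

```php
<?php
// Simplified stand-in for word splitting (NOT Q2A's real tokenizer):
// break the text on whitespace and semicolons.
function split_words_simplified(string $text): array
{
    return preg_split('/[\s;]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Hypothetical helper: convert characters above U+FFFF in the search query
// into decimal numeric entities, mirroring what was stored in the database.
function query_emoji_to_entities(string $query): string
{
    $result = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function (array $m): string {
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $query
    );
    return $result ?? $query;
}

$indexed    = split_words_simplified('party &#128512;');   // words from the stored title
$typed      = split_words_simplified("party \u{1F600}");   // words from the raw query
$normalized = split_words_simplified(query_emoji_to_entities("party \u{1F600}"));
```

Here `$indexed[1]` is the eight-character word `&#128512`, `$typed[1]` is a one-character word, and only the normalized query produces the same words as the stored title.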
The takeaways from all of this are:
1. If the database uses utf8mb4, it will take up more bytes to represent characters and some queries might run slower.
2. If the database uses utf8 + HTML entities, then some side effects and performance issues are expected for large Q2A websites.
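For reference, the byte counts behind the two options can be checked with a few lines of PHP (7.0+ for the `\u{...}` escapes). MySQL's `utf8` character set stores at most three bytes per character, which is why a raw emoji needs utf8mb4, while the entity form is plain ASCII.

```php
<?php
$emoji  = "\u{1F600}";   // raw emoji
$entity = '&#128512;';   // the same emoji as a decimal numeric entity
$accent = "\u{00E9}";    // "é", an ordinary non-ASCII character for comparison

$emojiBytes  = strlen($emoji);   // bytes when stored as raw UTF-8
$entityBytes = strlen($entity);  // bytes when stored as an ASCII entity
$accentBytes = strlen($accent);  // fits within utf8's three-byte limit
```

Per emoji, the entity is actually the longer of the two (9 bytes against 4). The extra cost of utf8mb4 is presumably the larger per-character maximum the server has to assume when sizing columns and indexes, not the size of any individual emoji.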
Pick your poison, but if option 2 is chosen, then the following steps are needed:
1. Keep the database as it is: it still uses utf8
2. [Core hack] Modify `qa_remove_utf8mb4` according to this comment
3. Create a filter plugin for replacing emojis with HTML entities in questions, answers, and comments
4. Add a layer (or an override) for representing HTML entities correctly in question titles
5. [Almost a core hack] Override qa-include/pages/search.php so that emojis are swapped out for HTML entities before invoking the search engine
6. [Core hack] Make qa-include/qa-feed.php restore escaped HTML entities in question titles, as in step 4
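The filter plugin step above could be sketched as follows, assuming PHP 7.2+ with mbstring. The class name `emoji_entity_filter` is hypothetical, and the method signatures and field names (`title`, `content`, `text`) follow the Q2A filter module interface as I understand it; please verify them against your Q2A version. The module would still need to be registered from the plugin's qa-plugin.php.

```php
<?php
// Sketch of a Q2A filter module (signatures are an assumption; verify them).
class emoji_entity_filter
{
    public function filter_question(&$question, &$errors, $oldquestion)
    {
        $this->convert_fields($question, ['title', 'content', 'text']);
    }

    public function filter_answer(&$answer, &$errors, $question, $oldanswer)
    {
        $this->convert_fields($answer, ['content', 'text']);
    }

    public function filter_comment(&$comment, &$errors, $question, $parent, $oldcomment)
    {
        $this->convert_fields($comment, ['content', 'text']);
    }

    // Convert only the string fields that are actually present in the post.
    private function convert_fields(array &$post, array $fields)
    {
        foreach ($fields as $field) {
            if (isset($post[$field]) && is_string($post[$field])) {
                $post[$field] = $this->to_entities($post[$field]);
            }
        }
    }

    // Replace characters above U+FFFF with decimal numeric entities.
    private function to_entities(string $text): string
    {
        $result = preg_replace_callback(
            '/[\x{10000}-\x{10FFFF}]/u',
            function (array $m): string {
                return '&#' . mb_ord($m[0], 'UTF-8') . ';';
            },
            $text
        );
        return $result ?? $text;
    }
}

// Standalone usage example with a made-up question array.
$question = ['title' => "Hi \u{1F600}", 'content' => 'No emoji here'];
$errors = [];
(new emoji_entity_filter())->filter_question($question, $errors, null);
```

After the call, `$question['title']` holds the entity form, so nothing above U+FFFF reaches the utf8 tables.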
These are the steps for option 1:
1. [Core hack] Update the database encoding as explained here and here
2. [Core hack] Modify `qa_remove_utf8mb4` according to this comment
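As a rough illustration of the database side of option 1, the sketch below only generates the `ALTER TABLE` statements; it does not run them. The table list is a hypothetical subset (a real installation has more tables and may use a different prefix), and the collation is an assumption.

```php
<?php
// Hypothetical subset of tables; adjust the prefix and list to the installation.
$tables = ['qa_posts', 'qa_words'];

$statements = [];
foreach ($tables as $table) {
    // CONVERT TO changes the table default and every text column in one go.
    $statements[] = sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;',
        $table
    );
}

echo implode("\n", $statements), "\n";
```

Take a backup before running anything like this. Depending on the MySQL version and row format, indexed text columns may exceed the index key length limit once each character can take four bytes. The connection character set would also have to be utf8mb4, which is presumably why the step is marked as a core hack.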