Title: Using original source instead of embedding text for RAG context
Last modified: February 18, 2025

---

# Using original source instead of embedding text for RAG context

 *  Resolved [Michael](https://wordpress.org/support/users/michael8888/)
 * (@michael8888)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/)
 * Is there a specific reason for choosing the text that creates the embedding vector
   instead of the original text as context for RAG?
 * We have been replacing line 542 (`$context["content"] .= $data['content'] . "\n";`)
   in embeddings.php with this:
 *     ```php
         $post = get_post($data['refId']);
         $context["content"] .= $post->post_content . "\n";
       ```
   
 * Once we implemented these changes, the responses were significantly better. Using
   the embedding text as context can lead to losing important context if the original
   text is long.
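
The workaround above assumes the referenced post still exists. A minimal sketch of a more defensive variant follows, assuming `$post` is whatever `get_post($data['refId'])` returned (it returns `null` when the post cannot be found) and that `$data['content']` holds the stored embedding text. The helper name `context_text_for` is hypothetical and not part of AI Engine.

```php
<?php

/**
 * Hypothetical helper (not part of AI Engine): choose the text to append
 * to the RAG context for one embedding match.
 *
 * @param object|null $post          Result of get_post(); null if the post is gone.
 * @param string      $embeddingText Text stored alongside the embedding vector.
 */
function context_text_for(?object $post, string $embeddingText): string
{
    // No post behind this embedding: fall back to the stored embedding text.
    if ($post === null) {
        return $embeddingText . "\n";
    }

    // Missing properties are treated as empty strings.
    $excerpt = trim((string) ($post->post_excerpt ?? ''));
    $content = trim((string) ($post->post_content ?? ''));
    $full = trim($excerpt . "\n" . $content);

    // An empty post also falls back to the embedding text.
    return ($full !== '' ? $full : $embeddingText) . "\n";
}
```

At the line mentioned in the post, this could be called as `$context["content"] .= context_text_for(get_post($data['refId']), $data['content']);`, so that embeddings without a matching post keep the current behavior.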

Viewing 5 replies - 1 through 5 (of 5 total)

 *  Thread Starter [Michael](https://wordpress.org/support/users/michael8888/)
 * (@michael8888)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18312816)
 * Actually, this is the correct code:
 *     ```php
       $post = get_post($data['refId']);
       $context["content"] .= $post->post_excerpt . $post->post_content . "\n";
       ```
   
 *  Plugin Support [Val Meow](https://wordpress.org/support/users/valwa/)
 * (@valwa)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18322939)
 * Hey [@michael8888](https://wordpress.org/support/users/michael8888/)! 👋
 * If you think your needs might benefit other users, you can submit a feature request.
   To learn more, please read this article: [Where Can I Submit A Feature Request ?](https://docs.meowapps.com/frequently-asked-questions/uSPfwM4p7r2J7uusWcPPBh/where-can-i-submit-a-feature-request-/bLMLNHQsTV84Z5uwrNoDJE)
   Thank you very much!
 *  Thread Starter [Michael](https://wordpress.org/support/users/michael8888/)
 * (@michael8888)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18322972)
 * Hey Val,
 * I don’t just think my findings would benefit Meow Apps; I am fairly sure of
   it, because the embedding feature is exclusive to the commercial Pro version.
   Many organizations buy the Pro version because they believe they are getting
   proper RAG support, which is definitely not the case. That is why I posted here.
 * I am also not submitting a feature request but rather highlighting a serious
   design flaw, so it’s unnecessary to post this again elsewhere. If you care about
   the success of Meow Apps, please pass this post along to whoever developed
   the embedding feature. I believe they will see right away how important this
   design flaw is.
 *  Plugin Author [Jordy Meow](https://wordpress.org/support/users/tigroumeow/)
 * (@tigroumeow)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18323910)
 * Hi [@michael8888](https://wordpress.org/support/users/michael8888/),
 * Thanks for sharing your thoughts! But actually, it’s not a design flaw—on the
   contrary, using the raw post content directly _would_ be a design flaw in many
   cases. The reason is that post content isn’t always clean, readable text. If 
   you’re using page builders, for example, the content is often full of shortcodes
   and pseudo-code that wouldn’t work well for retrieval.
 * Also, articles can be long, repetitive, or contain details that you might not
   want to expose (like specific names). The goal of context in RAG is to be optimized:
   concise, relevant, and free of unnecessary repetition. That’s why AI Engine processes
   the content, cleaning and rewriting it to be shorter and more effective.
 * That being said, if you prefer to use the raw content, you _don’t_ need to modify
   the plugin’s code. You can simply disable the AI rewriting option in the Embeddings
   section and use the **mwai_post_content** filter to customize how content is 
   retrieved (more details here: [https://meowapps.com/ai-engine/api/](https://meowapps.com/ai-engine/api/)).
 * Adding the excerpt might help in some cases, but for most setups, using optimized
   content works better and gives faster, higher-quality responses. Hope that clarifies
   things! 😊
 * Cheers!
 *  Thread Starter [Michael](https://wordpress.org/support/users/michael8888/)
 * (@michael8888)
 * [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18324108)
 * Jordy, thanks for the reply!
 * To begin with, I want to emphasize that our solution is only a quick, practical
   workaround. Even so, it is a significant improvement over your solution.
   The correct approach to RAG with lengthy articles is to break the text
   into smaller segments and embed each segment separately (semantic chunking).
 * Regarding your points:
 * > The reason is that post content isn’t always clean, readable text. If you’re
   > using page builders, for example, the content is often full of shortcodes and
   > pseudo-code that wouldn’t work well for retrieval.
 * Numerous tools exist for cleaning HTML or WordPress markup, so I don’t believe
   this is a significant issue. I built an add-on for AI Engine that enhances chatbots
   with Google Search capabilities. This add-on scrapes the complete content from
   external sites and then inserts a cleaned version into the chatbot’s context.
   If we can accomplish this with unfamiliar third-party content, you can definitely
   do it with your own content as well. All model providers and AI apps like BoltAI
   do this when they integrate search or add context.
 * > Also, articles can be long, repetitive, or contain details that you might not
   > want to expose (like specific names). The goal of context in RAG is to be optimized:
   > concise, relevant, and free of unnecessary repetition. That’s why AI Engine
   > processes the content, cleaning and rewriting it to be shorter and more effective.
 * The goal of RAG is certainly not to provide concise context. Why do you believe
   Gemini supports a context window of up to 2 million tokens?
 * The goal of RAG is to provide as much relevant context as possible, and it is
   not the responsibility of the embedding model to ensure the relevance of the
   content that is used during retrieval. You don’t want to lose information by filtering
   it before it even enters your RAG solution. For example, many of our articles
   include code that gets entirely lost if you summarize them.
 * In reality, a repetition of 10-20% is crucial to maintain overlap when employing
   semantic chunking. The model can handle contexts containing repetitive or irrelevant
   information by simply disregarding such elements. This is why Perplexity and ChatGPT
   Search are effective. You need to avoid giving the model concise, incomplete context.
   It’s the model’s responsibility during inference to sift through the content.
 * > You can simply disable the AI rewriting option in the Embeddings section and
   > use the mwai_post_content filter to customize how content is retrieved
 * Disabling the AI rewriting option in the Embeddings section isn’t a good idea.
   We use 1536-dimensional embedding vectors, so the token count should stay between
   400 and 800. Increasing the number of tokens leads to semantic dilution.
 * So, as I mentioned earlier, the best way to tackle this issue is through semantic
   chunking, but that would mean making substantial changes to your code. A practical
   workaround is to use the embedding model to generate the optimal embedding vector
   with minimal semantic loss, as these models are built for this purpose. Then,
   retrieve the full content for context to ensure no important information is missed
   during inference.
 * I am unsure how to use mwai_post_content to retrieve all the embedding IDs and
   their linked posts. I guess it’s feasible if we combine it with other filters.
 * To sum up, your embedding solution definitely has a design flaw. I am posting
   this because AI Engine is by far the best and most important WordPress plugin I’ve
   seen in decades. This flaw is an eyesore.
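
Two ideas from this reply, cleaning markup before use and splitting long text into overlapping segments, can be sketched in plain PHP. This is only an illustration under stated assumptions: it uses fixed-size word windows as a simpler stand-in for semantic chunking, counts words rather than tokens, and the helper names `clean_markup` and `chunk_with_overlap`, the 300-word default, and the 15% overlap default are all hypothetical. Inside WordPress, `strip_shortcodes()` and `wp_strip_all_tags()` would be the more usual cleaning calls; plain PHP is used here so the sketch runs on its own.

```php
<?php

/**
 * Hypothetical helper: reduce post markup to plain text.
 * The shortcode pattern is a rough approximation and will also remove
 * bracketed words inside code samples (e.g. "$a[key]").
 */
function clean_markup(string $html): string
{
    // Drop shortcode tags such as [gallery id="1"] or [/vc_row]; inner text is kept.
    $text = (string) preg_replace('/\[\/?[a-zA-Z][\w-]*[^\]]*\]/', ' ', $html);
    // Add a space before each tag so adjacent blocks do not merge into one word.
    $text = strip_tags(str_replace('<', ' <', $text));
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace.
    return trim((string) preg_replace('/\s+/u', ' ', $text));
}

/**
 * Hypothetical helper: split text into word-based chunks in which each chunk
 * repeats the last words of the previous one.
 *
 * @return string[] Chunks in document order; empty array for empty input.
 */
function chunk_with_overlap(string $text, int $chunkWords = 300, float $overlap = 0.15): array
{
    if ($chunkWords < 1 || $overlap < 0.0 || $overlap >= 1.0) {
        throw new InvalidArgumentException('chunkWords must be >= 1 and overlap in [0, 1).');
    }

    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    // Advance by the chunk size minus the overlapping words, at least one word.
    $step = max(1, $chunkWords - (int) floor($chunkWords * $overlap));

    $chunks = [];
    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $chunkWords));
        if ($start + $chunkWords >= $total) {
            break; // this chunk already reaches the end of the text
        }
    }
    return $chunks;
}
```

Each chunk would then be embedded separately, for example via `chunk_with_overlap(clean_markup($post->post_content))`. Whether the retrieved context should be the matching chunk or the full post is the design choice debated in this thread.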

The topic ‘Using original source instead of embedding text for RAG context’ is closed
to new replies.

 * 5 replies
 * 3 participants
 * Last reply from: [Michael](https://wordpress.org/support/users/michael8888/)
 * Last activity: [1 year, 3 months ago](https://wordpress.org/support/topic/using-original-source-instead-embedding-text-for-rag-context/#post-18324108)
 * Status: resolved