Using original source instead of embedding text for RAG context

Resolved Michael
(@michael8888)

1 year, 3 months ago
Is there a specific reason for choosing the text that creates the embedding vector instead of the original text as context for RAG?

We have been replacing line 542 (`$context["content"] .= $data['content'] . "\n";`) in embeddings.php with this:
```
  $post = get_post($data['refId']);
  $context["content"] .= $post->post_content . "\n";
```
After this change, the responses were significantly better. Using the embedding text as context can lose important information when the original text is long.

Viewing 5 replies - 1 through 5 (of 5 total)

Thread Starter Michael
(@michael8888)

1 year, 3 months ago
Actually, this is the correct code:
```
$post = get_post($data['refId']);
if ($post) { // get_post() returns null if the referenced post no longer exists
  $context["content"] .= $post->post_excerpt . $post->post_content . "\n";
}
```
Plugin Support Val Meow
(@valwa)

1 year, 3 months ago

Hey @michael8888! 👋

If you think your needs might benefit other users, you can submit a feature request. To learn more, please read this article: Where Can I Submit A Feature Request? Thank you very much!

Thread Starter Michael
(@michael8888)

1 year, 3 months ago

Hey Val,

I don’t just think my findings would benefit Meow Apps; I am quite sure of it, because the embedding feature is exclusive to the commercial Pro version. Many organizations buy the Pro version because they believe they are getting proper RAG support, which is definitely not the case. That is why I posted here.

I am also not submitting a feature request but rather highlighting a serious design flaw. It’s unnecessary to post this again elsewhere. If you care about the success of Meow Apps, please pass this post along to the person who developed the embedding feature. I believe they will immediately understand how important this design flaw is.

Plugin Author Jordy Meow
(@tigroumeow)

1 year, 3 months ago

Hi @michael8888,

Thanks for sharing your thoughts! But actually, it’s not a design flaw—on the contrary, using the raw post content directly would be a design flaw in many cases. The reason is that post content isn’t always clean, readable text. If you’re using page builders, for example, the content is often full of shortcodes and pseudo-code that wouldn’t work well for retrieval.

Also, articles can be long, repetitive, or contain details that you might not want to expose (like specific names). The goal of context in RAG is to be optimized—concise, relevant, and free of unnecessary repetition. That’s why AI Engine processes the content, cleaning and rewriting it to be shorter and more effective.

That being said, if you prefer to use the raw content, you don’t need to modify the plugin’s code. You can simply disable the AI rewriting option in the Embeddings section and use the mwai_post_content filter to customize how content is retrieved (more details here: https://meowapps.com/ai-engine/api/).

Adding the excerpt might help in some cases, but for most setups, using optimized content works better and gives faster, higher-quality responses. Hope that clarifies things! 😊

Cheers!

Thread Starter Michael
(@michael8888)

1 year, 3 months ago

Jordy, thanks for the reply!

To begin with, I want to emphasize that our solution is only a quick and practical workaround. However, it is still a significant improvement over your solution. The correct approach to RAG with lengthy articles is to break the text into smaller segments and embed these segments separately (semantic chunking).

Regarding your points:

> The reason is that post content isn’t always clean, readable text. If you’re using page builders, for example, the content is often full of shortcodes and pseudo-code that wouldn’t work well for retrieval.

Numerous tools exist for cleaning HTML or WordPress code, and I believe it’s not a significant issue. I built an add-on for AI Engine that enhances chatbots with Google Search capabilities. This add-on scrapes the complete content from external sites and then inserts a cleaned version into the chatbot’s context. If we can accomplish this with unfamiliar third-party content, you can definitely do it with your own content as well. All model providers and AI apps like BoltAI do this when they integrate search or add context.
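
As a rough illustration of the kind of cleanup meant here, a minimal PHP sketch might look like the following. The function name `clean_post_markup` is hypothetical and is not part of AI Engine or WordPress; it uses only built-in PHP functions. Inside WordPress, core helpers such as `strip_shortcodes()` and `wp_strip_all_tags()` would probably be a better starting point.

```php
<?php
// Hypothetical helper (an assumption for illustration, not AI Engine code):
// reduce raw post markup to plain text using only built-in PHP functions.
function clean_post_markup(string $raw): string {
    // Drop <script> and <style> blocks together with their contents.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $raw);
    // Remove shortcode tags such as [gallery id="1"] or [/vc_row],
    // while keeping any text enclosed between them.
    $text = preg_replace('/\[\/?[a-zA-Z][\w-]*[^\]]*\]/', ' ', $text);
    // Strip the remaining HTML tags, then decode entities like &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Note that the bracket pattern is deliberately crude: it would also remove bracketed text inside code samples (for example an array index), so articles that contain code would need more careful handling.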

> Also, articles can be long, repetitive, or contain details that you might not want to expose (like specific names). The goal of context in RAG is to be optimized: concise, relevant, and free of unnecessary repetition. That’s why AI Engine processes the content, cleaning and rewriting it to be shorter and more effective.

The goal of RAG is certainly not to provide concise context. Why do you believe Gemini supports a context window of up to 2 million tokens?

The goal of RAG is to provide as much relevant context as possible, and it is not the responsibility of the embedding model to ensure the relevance of the content that is used during retrieval. You don’t want to lose information by filtering it before it even enters your RAG solution. For example, many of our articles include code that gets entirely lost if you summarize them.

In reality, a repetition of 10-20% is crucial to maintain overlap when employing semantic chunking.
The model can handle contexts containing repetitive or irrelevant information by simply disregarding such elements. This is why Perplexity or ChatGPT Search is effective. You need to avoid giving the model concise, incomplete context. It’s the model’s responsibility during inference to sift through the content.
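
To make the overlap idea concrete, here is a minimal PHP sketch. It is a simplification: it cuts fixed-size word windows rather than splitting on semantic boundaries, so it only illustrates the overlap, not true semantic chunking. The function name `chunk_with_overlap` and its parameters are hypothetical and not part of AI Engine.

```php
<?php
// Hypothetical helper (an assumption for illustration, not AI Engine code):
// split text into windows of $size words, where each chunk starts with the
// last $overlap words of the previous chunk.
function chunk_with_overlap(string $text, int $size = 200, int $overlap = 30): array {
    if ($size <= 0 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size > 0 and 0 <= overlap < size.');
    }
    // Split on any whitespace and drop empty tokens.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $step = $size - $overlap; // how far each new window advances
    $chunks = [];
    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $size));
        if ($start + $size >= $total) {
            break; // this window already reached the end of the text
        }
    }
    return $chunks;
}
```

With the defaults above, each chunk shares 30 of its 200 words (15%) with its predecessor. Each chunk would then be embedded separately, so retrieval can return the matching segment itself rather than a summary of the whole article.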

> You can simply disable the AI rewriting option in the Embeddings section and use the mwai_post_content filter to customize how content is retrieved

Disabling the AI rewriting option in the Embeddings section isn’t a good idea. We use 1536-dimensional embedding vectors, so the token count per embedded text should stay between 400 and 800. Increasing the number of tokens leads to semantic dilution.

So, as I mentioned earlier, the best way to tackle this issue is through semantic chunking, but that would mean making substantial changes to your code. A practical workaround is to use the embedding model to generate the optimal embedding vector with minimal semantic loss, as these models are built for this purpose. Then, retrieve the full content for context to ensure no important information is missed during inference.

I am unsure how to use mwai_post_content to retrieve all the embedding IDs and their linked posts. I guess it’s feasible if we combine it with other filters.

To sum up, your embedding solution definitely has a design flaw. I am posting this because AI Engine is by far the best and most important WordPress plugin I’ve seen in decades. This flaw is an eyesore.

