bbole's Replies | ww.wp.xz.cn

Forum Replies Created

Viewing 6 replies - 1 through 6 (of 6 total)

Forum: Plugins
In reply to: [Website LLMs.txt] How to handle potential conflict: duplicated content

Thread Starter bbole
(@bbole)

11 months, 4 weeks ago

Absolutely. Discoverability means a crawler can find a file (e.g., via links, sitemaps, or direct access), while indexation is what a page needs in order to appear in search engine results, so for traditional search engines indexation is a key part of discoverability. The llms.txt specification is designed specifically for large language models (LLMs), not for traditional search engine indexing. Including these files in the sitemap index increases the likelihood that Googlebot will crawl and index them, treating them as regular web pages.

Since these files (especially the .md versions of blog posts) often have content identical or near-identical to their HTML versions, Google may flag them as duplicate content. Without proper canonicalization, it may interpret the .md files as alternate versions of the same page. This could dilute our SEO rankings, confuse search algorithms, confuse people if they ever click on such results, or even lead to penalties in Google Search Console.
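
For what it's worth, one way to test canonicalization for non-HTML files is an HTTP `Link` header with `rel="canonical"`, since a .md file has no `<head>` to carry a tag. A minimal Python sketch, assuming the .md URL is simply the HTML URL with `.md` appended; `markdown_response_headers` is a hypothetical helper, not something the plugin provides:

```python
def markdown_response_headers(md_url: str) -> dict[str, str]:
    """Build response headers for a .md alternate that point back at its HTML page."""
    if not md_url.endswith(".md"):
        raise ValueError("expected a URL ending in .md")
    # Assumption: stripping the ".md" suffix yields the canonical HTML URL.
    html_url = md_url[: -len(".md")]
    return {
        "Content-Type": "text/markdown; charset=utf-8",
        "Link": f'<{html_url}>; rel="canonical"',
    }


headers = markdown_response_headers("https://example.com/blog/post.html.md")
print(headers["Link"])
```

How these headers get attached depends on the server or plugin, so treat this as something to test rather than a confirmed fix.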

By excluding llms.txt, llms-full.txt, and related .md files from the sitemap index, we reduce the risk of traditional search engines indexing them, thereby minimizing their discoverability in search results.

This aligns with the llms.txt proposal, which intends these files to be accessed directly by LLMs or AI agents (e.g., via https://example.com/llms.txt) rather than surfaced in Google’s search results. LLMs don’t rely on sitemaps for discovery.

So, these steps:
– Excluding the files from sitemap indexes
– Testing robots.txt rules that allow specific AI crawlers while disallowing traditional crawlers
– Testing canonicals

reduce the likelihood of anything described above happening. To me, the first step is very straightforward: the plugin shouldn’t modify the sitemap index at all. Risk mitigation.
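
The robots.txt idea can be tried out locally before touching a live site. A rough Python sketch using the standard library's `urllib.robotparser`; the rules and the choice of user agents (Googlebot, Bingbot, GPTBot) are only an assumed example, and the standard-library parser does not understand wildcard patterns, so only plain paths are used:

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: block two traditional crawlers from the LLM files,
# leave everything open for all other user agents.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /llms.txt
Disallow: /llms-full.txt

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "Bingbot", "GPTBot"):
    for path in ("/llms.txt", "/llms-full.txt", "/blog/some-post/"):
        allowed = parser.can_fetch(agent, f"https://example.com{path}")
        print(f"{agent:10} {path:20} {'allowed' if allowed else 'blocked'}")
```

Keep in mind that robots.txt governs crawling rather than indexing, so this step alone may not keep a URL out of search results if other pages link to it.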

Forum: Plugins
In reply to: [Website LLMs.txt] How to handle potential conflict: duplicated content
Thread Starter bbole
(@bbole)

11 months, 4 weeks ago
Most people have their sitemap index added to Google Search Console, etc. After activating this plugin, a new sitemap for llms is added to that sitemap index, when it shouldn’t be. That’s why it gets picked up by GSC and other webmaster tools.
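
To check whether a sitemap index has picked up such an entry, something like the following Python sketch could help. The sample XML and the `llms-sitemap.xml` file name are made up for illustration, and matching on the substring "llms" is just an assumption about how the entry is named:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index, shaped like what SEO plugins typically emit.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/llms-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def find_llms_sitemaps(sitemap_index_xml: str) -> list[str]:
    """Return the <loc> URLs in a sitemap index whose URL mentions 'llms'."""
    root = ET.fromstring(sitemap_index_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [loc for loc in locs if "llms" in loc.lower()]


print(find_llms_sitemaps(SAMPLE_INDEX))
```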

I have deactivated the plugin for that reason.
- This reply was modified 11 months, 4 weeks ago by bbole.
Forum: Plugins
In reply to: [Website LLMs.txt] Why no single post .md files are generated?
bbole
(@bbole)

11 months, 4 weeks ago
Me too. I was expecting, as per Jeremy Howard’s webpage on this, 2 files:

llms.txt – The main index file that goes in our website’s root directory (/llms.txt). A structured markdown file providing:
- A project overview
- Links to detailed markdown files
- Organized sections of resources
llms-full.txt (or similar) – This is a processed/expanded version containing the actual content from all the URLs referenced in llms.txt.

So we first need to create .md versions of each blog article before actually generating llms.txt and llms-full.txt, with each article having a corresponding .md path/alternate. The spec suggests that each HTML page should have a corresponding .md version at the same URL with .md appended (e.g., page.html → page.html.md).

The plugin doesn’t generate these single md files.
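
To make the URL convention concrete, here is a small Python sketch of the mapping as I understand it from llmstxt.org: append `.md` to the page URL, and for URLs without a file name append `index.html.md`. `markdown_url` is a hypothetical helper, and treating "ends with a slash" as "has no file name" is my assumption:

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Map an HTML page URL to the URL of its markdown alternate."""
    parts = urlsplit(page_url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: a trailing slash means the URL has no file name.
        path += "index.html.md"
    else:
        path += ".md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


for url in ("https://example.com/page.html", "https://example.com/blog/my-post/"):
    print(url, "->", markdown_url(url))
```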

Ref:
https://jina.ai/reader/

It’s actually llms.txt that works like the ‘traditional’ sitemap, just without being called a sitemap. It’s a file that references the markdown versions you’ve created beforehand and provides structure.

And the “full” file should be a concatenated version of all the content in those md pages.
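
Roughly, the two files could be produced like this once the per-article markdown exists. This is only a Python sketch of the idea, not how the plugin works: the `Page` class, the function names, the single "Blog" section and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    md_url: str    # URL of the markdown alternate of the article
    markdown: str  # markdown content of the article


def build_llms_txt(site_name: str, summary: str, pages: list[Page]) -> str:
    """Index file: H1 title, blockquote summary, then a section of links."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Blog", ""]
    lines += [f"- [{p.title}]({p.md_url})" for p in pages]
    return "\n".join(lines) + "\n"


def build_llms_full_txt(site_name: str, pages: list[Page]) -> str:
    """Full file: the content of every referenced markdown page, concatenated."""
    sections = [f"# {site_name}"] + [p.markdown.strip() for p in pages]
    return "\n\n".join(sections) + "\n"


pages = [
    Page("First post", "https://example.com/blog/first-post/index.html.md",
         "## First post\n\nHello."),
    Page("Second post", "https://example.com/blog/second-post/index.html.md",
         "## Second post\n\nWorld."),
]
print(build_llms_txt("Example Blog", "A made-up summary of the site.", pages))
print(build_llms_full_txt("Example Blog", pages))
```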
Forum: Plugins
In reply to: [Website LLMs.txt] How to handle potential conflict: duplicated content

Thread Starter bbole
(@bbole)

11 months, 4 weeks ago

Why is such a sitemap-llms created at all? It’s not part of the procedure suggested by Jeremy Howard: https://llmstxt.org/

Forum: Plugins
In reply to: [Website LLMs.txt] How to handle potential conflict: duplicated content

Thread Starter bbole
(@bbole)

11 months, 4 weeks ago

Also, @ryhowa, I’m asking because https://beebole.com/blog/sitemap_index.xml is added to GSC, Bing Webmasters, etc., and the new llms sitemap is there.

Forum: Plugins
In reply to: [Website LLMs.txt] How to handle potential conflict: duplicated content

Thread Starter bbole
(@bbole)

11 months, 4 weeks ago

Hi Ryan,

Thanks a lot for responding! How do we know 100% that Google’s and other ‘traditional’ bots won’t index such files?

Also regarding this plugin: does it generate a markdown version of each article too?

Thank YOU
