Yan
Forum Replies Created
On the subject, can you please elaborate on this note from the Advanced Publication blog post?
=================
Beyond controlling which URLs are crawled, this filter can also be used to manage how URLs are presented in the static version of your site. For example, when you configure Staatic to use relative URLs for portability, there might be cases where you need to maintain absolute URLs. This is common for canonical URL references or links in XML sitemaps, where the absolute URL is crucial for SEO purposes.
<?php
add_filter( 'staatic_should_crawl_url', function ( $value, $url, $context ) {
    if ( ( $context[ 'htmlTagName' ] ?? '' ) === 'link' &&
        ( $context[ 'htmlAttributeName' ] ?? '' ) === 'href' &&
        ( str_contains( $context[ 'htmlElement' ] ?? '', 'canonical' ) ) ) {
        return false;
    }
    return $value;
}, 10, 3 );
=================
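To see which elements that condition actually matches, here is a standalone sketch of the same check that runs without WordPress. Only the three context keys come from the quoted snippet; the function name and the sample `$context` arrays are my own assumptions about what the crawler might pass, and `str_contains` needs PHP 8.

```php
<?php
// Same condition as the quoted filter, extracted into a plain function.
// isCanonicalLinkContext() is a hypothetical name, not a Staatic API.
function isCanonicalLinkContext(array $context): bool
{
    return ($context['htmlTagName'] ?? '') === 'link'
        && ($context['htmlAttributeName'] ?? '') === 'href'
        && str_contains($context['htmlElement'] ?? '', 'canonical');
}

// Hypothetical sample contexts.
$canonical = [
    'htmlTagName'       => 'link',
    'htmlAttributeName' => 'href',
    'htmlElement'       => '<link rel="canonical" href="https://xyz.com/">',
];
$ogUrl = [
    'htmlTagName'       => 'meta',
    'htmlAttributeName' => 'content',
    'htmlElement'       => '<meta property="og:url" content="https://xyz.com/">',
];

var_dump(isCanonicalLinkContext($canonical)); // bool(true)
var_dump(isCanonicalLinkContext($ogUrl));     // bool(false)
var_dump(isCanonicalLinkContext([]));         // bool(false)
```

If the contexts look anything like these, the condition only ever fires for canonical link tags, so og:url meta tags would need their own condition.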
How would excluding a URL via staatic_should_crawl_url preserve the absolute URL if this filter would exclude the page entirely? Maybe there should be a separate exclude_from_transformation hook or something?
Perhaps that documentation is outdated, but then how could I achieve the following:
- Inside of pages, rewrite all paths to begin with / except for
  - <meta property="og:url" content="https:https://xyz.com/">
  - the JSON inside of yoast-schema-graph
- Preserve absolute URL inside of generated sitemap files.
So far the only solution that I can think of is to define a custom Transformer to find & replace a placeholder URL based on my own criteria?
After searching through the plugin source, I guess the issue is happening in the FallbackUrlTransformer / FallbackUrlExtractor?
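For reference, the placeholder idea could look roughly like the sketch below. These are plain PHP helpers with hypothetical names, not Staatic APIs, and I have not worked out where they would hook into the publication process. The `str_replace` in the middle is only a stand-in for the plugin's own rewriting, and the sketch covers the og:url tag only, not the yoast-schema-graph JSON.

```php
<?php
// Hypothetical helpers, not part of Staatic. Where they would hook into the
// publication process is left open; this only shows the string handling.
const ABSOLUTE_URL_PLACEHOLDER = '%%KEEP_ABSOLUTE%%';

// Step 1 (before URL rewriting): mask the origin inside og:url meta tags.
// Assumes the attribute order property="og:url" content="...".
function protectOgUrl(string $html, string $sourceOrigin): string
{
    $pattern = '~(<meta\s+property="og:url"\s+content=")' . preg_quote($sourceOrigin, '~') . '~i';

    return preg_replace($pattern, '${1}' . ABSOLUTE_URL_PLACEHOLDER, $html);
}

// Step 2 (after URL rewriting): swap the placeholder for the public origin.
function restoreAbsoluteUrls(string $html, string $publicOrigin): string
{
    return str_replace(ABSOLUTE_URL_PLACEHOLDER, $publicOrigin, $html);
}

$html = '<meta property="og:url" content="https://mysite.local/about/">'
    . '<a href="https://mysite.local/about/">About</a>';

$masked = protectOgUrl($html, 'https://mysite.local');
// Stand-in for the plugin's own rewriting to root-relative paths.
$rewritten = str_replace('https://mysite.local', '', $masked);
$result = restoreAbsoluteUrls($rewritten, 'https://xyz.com');

echo $result, "\n";
```

In this sketch the anchor ends up root-relative while the og:url value keeps an absolute URL on the public origin. Whether the real transformers would leave the placeholder untouched is something I have not verified.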
I don’t understand the intended logic here:
protected function getPatterns(): array
{
$formats = ['plain' => ['encode' => function (string $value) {
return $value;
}, 'decode' => function (string $value) {
return $value;
}], 'jsonEncoded' => ['encode' => function (string $value) {
return str_replace('/', '\/', $value);
}, 'decode' => function (string $value) {
return str_replace('\/', '/', $value);
}], 'urlEncoded' => ['encode' => function (string $value) {
return rawurlencode($value);
}, 'decode' => function (string $value) {
return rawurldecode($value);
}]];
$patterns = [];
foreach ($formats as $format => $options) {
$slash = preg_quote($options['encode']('/'), '~');
$doubleColon = preg_quote($options['encode'](':'), '~');
$authority = preg_quote($options['encode']($this->baseUrl->getAuthority()), '~');
$filterBasePath = $this->filterBasePath === null ? '' : preg_quote($options['encode'](trim($this->filterBasePath, '/')), '~');
$patterns[] = ['pattern' => '~' . ($this->extendedUrlContext ? '(?P<before>.{0,100})' : '') . '(?P<url>
(?P<scheme>https?' . $doubleColon . ')?' . $slash . $slash . $authority . '
(?P<port>' . $doubleColon . '(?:80|443))?
(?P<path>' . (empty($filterBasePath) ? '' : $slash . $filterBasePath) . '
# Either the URL has an extra path or in the future it has a non-path char.
(' . $slash . '|(?![a-z0-9-._]))
# Rest of the path/query chars.
(?:' . $slash . '|[a-z0-9-._\~%])*
)
)' . ($this->extendedUrlContext ? '(?P<after>.{0,100})' : '') . '~ix', 'encode' => $options['encode'], 'decode' => $options['decode']];
}
I added a log statement to FallbackUrlTransformer & saw the following:
[05-Sep-2025 20:22:37 UTC] [FallbackUrlTransformer] - {
"effectiveUrl": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
"$transformResult": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
"$url": "https://mysite.local/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
"$foundOnUrl": "https://mysite.local/",
"$context": {
"before": "</style><link rel=\"modulepreload\" as=\"script\" crossorigin=\"\" href=\"https:",
"scheme": "",
"port": "",
"path": "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
"after": "\"><link rel=\"stylesheet\" href=\"https://mysite.local/wp-content/themes/aibms/src/dist/assets/maincss.c",
"extractor": "Staatic\\Crawler\\UrlExtractor\\FallbackUrlExtractor"
}
}
Thanks for the quick reply.
Items 1 & 3 seem to be related to each other, but it’s a bit of an edge case to reproduce:
For background, I am working on a pre-existing client WP installation built with a theme developed by somebody else, so I’m not super familiar with all of the nuances.
This website has hundreds of resources, so to troubleshoot the generation, I added the following to limit generation to just the index:
// URL crawling control
add_filter('staatic_should_crawl_url', function ($value, $url, $context): mixed {
    if (true) {
        // TODO: enable to just debug 1 page
        $sitePrefix = "https://mysite.local";
        $allowedUrls = array(
            "/"
            // "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js"
        );
        $value = false;
        $path = $url->getPath();
        foreach ($allowedUrls as $allowedPath) {
            // Note: $url is an object, so the strict comparison on the right
            // never matches; in practice only the path check takes effect.
            if ($path === $allowedPath || $sitePrefix . $path === $url) {
                $value = true;
                break;
            }
        }
    }
    return $value;
}, 5, 3);
I configured
staatic_override_site_url to be "/".
IF staatic_extended_url_context is enabled, then all links pointing to excluded resources inside of the static HTML output get prefixed with https:/
ex: script src="https:/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js
However, if I uncomment "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js" from the filter above then the output is correct: src="/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js"
NOTE: the same thing will happen to anchor href & image src attributes for excluded files: <img src="https:/wp-content/uploads/2025/...."
IF staatic_override_site_url is set to something like "https://xyz.com/", then the output for excluded resources becomes like this: <img src="https:https://xyz.com/wp-content/uploads/2025/06/...
Setting staatic_extended_url_context then fixes the output.
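One possible reading of the log output and the symptoms above: with the extended context on, the greedy `before` group swallows the `https:` part, and because the scheme group is optional, the URL is then matched as a protocol-relative `//mysite.local/...`. The sketch below is a heavily simplified model of the pattern (plain encoding only, no port or base path), with hypothetical function names and sample markup. How the plugin actually performs the replacement is an assumption on my side.

```php
<?php
// Simplified model of the plugin's pattern: plain encoding only, no port or
// base-path handling. Function names and sample markup are hypothetical.
function buildPattern(string $authority, bool $extendedContext): string
{
    $before = $extendedContext ? '(?P<before>.{0,100})' : '';
    $after  = $extendedContext ? '(?P<after>.{0,100})' : '';

    return '~' . $before
        . '(?P<url>(?P<scheme>https?:)?//' . preg_quote($authority, '~')
        . '(?P<path>(?:/|[a-z0-9._\~%-])*))'
        . $after . '~ix';
}

// Replace each matched URL with $newBase . path, re-emitting the context groups.
function rewriteUrls(string $html, string $pattern, string $newBase): string
{
    return preg_replace_callback($pattern, function (array $m) use ($newBase): string {
        return ($m['before'] ?? '') . $newBase . $m['path'] . ($m['after'] ?? '');
    }, $html);
}

$html = '<script src="https://mysite.local/wp-content/app.js"></script>';

$plain    = buildPattern('mysite.local', false);
$extended = buildPattern('mysite.local', true);

echo rewriteUrls($html, $plain, ''), "\n";
// <script src="/wp-content/app.js"></script>
echo rewriteUrls($html, $extended, ''), "\n";
// <script src="https:/wp-content/app.js"></script>
echo rewriteUrls($html, $extended, 'https://xyz.com'), "\n";
// <script src="https:https://xyz.com/wp-content/app.js"></script>
```

The second and third outputs have the same shape as the broken attributes I am seeing, and the first matches the output with the extended context off. So the optional scheme combined with the greedy `before` group looks like a plausible cause, at least in this model.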
At this point, I am quite confused…
Can you help me understand whether I should enable/disable extendedContext? I had originally set it to be true after reading your guide here:
https://staatic.com/blog/tutorials/advanced-publication-process-customization/
Thanks for the fast reply.
I can confirm that this error no longer appears
- Inside of pages, rewrite all paths to begin with