Forum Replies Created

Viewing 4 replies - 1 through 4 (of 4 total)
  • Thread Starter Yan

    (@donamyk)

    On the subject, can you please elaborate on this note from the Advanced Publication blog post?

    =================

    Beyond controlling which URLs are crawled, this filter can also be used to manage how URLs are presented in the static version of your site. For example, when you configure Staatic to use relative URLs for portability, there might be cases where you need to maintain absolute URLs. This is common for canonical URL references or links in XML sitemaps, where the absolute URL is crucial for SEO purposes.

    <?php

    add_filter( 'staatic_should_crawl_url', function ( $value, $url, $context ) {
        if ( ( $context['htmlTagName'] ?? '' ) === 'link' &&
            ( $context['htmlAttributeName'] ?? '' ) === 'href' &&
            str_contains( $context['htmlElement'] ?? '', 'canonical' ) ) {
            return false;
        }

        return $value;
    }, 10, 3 );

    =================

    How would excluding a URL via staatic_should_crawl_url preserve the absolute URL if this filter would exclude the page entirely?

    Maybe there should be a separate exclude_from_transformation hook or something?

    Perhaps that documentation is outdated, but then how could I achieve the following:

    • Inside pages, rewrite all paths to begin with / except for
      • <meta property="og:url" content="https://xyz.com/">
      • the JSON inside of yoast-schema-graph
    • Preserve absolute URLs inside generated sitemap files.

    So far the only solution I can think of is to define a custom Transformer that finds and replaces a placeholder URL based on my own criteria?
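To make the placeholder idea concrete, here is a rough sketch in plain PHP. It deliberately uses no Staatic API: the function names, the placeholder token, and the `str_replace` call standing in for the real URL transformation are all assumptions for illustration. How such a step would be wired into the plugin's publication process is left open.

```php
<?php
// Hypothetical placeholder token; any string that cannot occur in real content works.
const ABSOLUTE_URL_PLACEHOLDER = '__ABSOLUTE_SITE_URL__';

// Swap the site URL for the placeholder, but only inside the og:url meta tag.
function protectOgUrl(string $html, string $siteUrl): string
{
    return preg_replace_callback(
        '~<meta\s+property="og:url"\s+content="[^"]*"~i',
        function (array $m) use ($siteUrl) {
            return str_replace($siteUrl, ABSOLUTE_URL_PLACEHOLDER, $m[0]);
        },
        $html
    );
}

// After the URL transformation has run, put the public URL back.
function restorePlaceholders(string $html, string $publicUrl): string
{
    return str_replace(ABSOLUTE_URL_PLACEHOLDER, $publicUrl, $html);
}

$html = '<meta property="og:url" content="https://mysite.local/about/">'
    . '<a href="https://mysite.local/about/">About</a>';

$protected = protectOgUrl($html, 'https://mysite.local');
// Stand-in for the real transformation step: make remaining URLs root-relative.
$transformed = str_replace('https://mysite.local', '', $protected);

echo restorePlaceholders($transformed, 'https://xyz.com'), "\n";
// <meta property="og:url" content="https://xyz.com/about/"><a href="/about/">About</a>
```

The same protect/restore pair could be extended to the yoast-schema-graph JSON by matching that script block instead of the meta tag.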

  • Thread Starter Yan

    (@donamyk)

    After searching through the plugin source, I guess the issue is happening in the FallbackUrlTransformer / FallbackUrlExtractor?

    I don’t understand the intended logic here:

    protected function getPatterns(): array
    {
        $formats = [
            'plain' => [
                'encode' => function (string $value) {
                    return $value;
                },
                'decode' => function (string $value) {
                    return $value;
                },
            ],
            'jsonEncoded' => [
                'encode' => function (string $value) {
                    return str_replace('/', '\/', $value);
                },
                'decode' => function (string $value) {
                    return str_replace('\/', '/', $value);
                },
            ],
            'urlEncoded' => [
                'encode' => function (string $value) {
                    return rawurlencode($value);
                },
                'decode' => function (string $value) {
                    return rawurldecode($value);
                },
            ],
        ];
        $patterns = [];
        foreach ($formats as $format => $options) {
            $slash = preg_quote($options['encode']('/'), '~');
            $doubleColon = preg_quote($options['encode'](':'), '~');
            $authority = preg_quote($options['encode']($this->baseUrl->getAuthority()), '~');
            $filterBasePath = $this->filterBasePath === null ? '' : preg_quote($options['encode'](trim($this->filterBasePath, '/')), '~');
            $patterns[] = ['pattern' => '~' . ($this->extendedUrlContext ? '(?P<before>.{0,100})' : '') . '(?P<url>
                (?P<scheme>https?' . $doubleColon . ')?' . $slash . $slash . $authority . '
                (?P<port>' . $doubleColon . '(?:80|443))?
                (?P<path>' . (empty($filterBasePath) ? '' : $slash . $filterBasePath) . '

                # Either the URL has an extra path or in the future it has a non-path char.
                (' . $slash . '|(?![a-z0-9-._]))

                # Rest of the path/query chars.
                (?:' . $slash . '|[a-z0-9-._\~%])*
                )

                )' . ($this->extendedUrlContext ? '(?P<after>.{0,100})' : '') . '~ix', 'encode' => $options['encode'], 'decode' => $options['decode']];
        }

    I added a log statement to FallbackUrlTransformer and saw the following:

    [05-Sep-2025 20:22:37 UTC] [FallbackUrlTransformer] - {
    "effectiveUrl": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$transformResult": "https://xyz.com/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$url": "https://mysite.local/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "$foundOnUrl": "https://mysite.local/",
    "$context": {
    "before": "</style><link rel=\"modulepreload\" as=\"script\" crossorigin=\"\" href=\"https:",
    "scheme": "",
    "port": "",
    "path": "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js",
    "after": "\"><link rel=\"stylesheet\" href=\"https://mysite.local/wp-content/themes/aibms/src/dist/assets/maincss.c",
    "extractor": "Staatic\\Crawler\\UrlExtractor\\FallbackUrlExtractor"
    }
    }
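One possible reading of that log entry: "before" ends with `https:` while "scheme" is empty, which is what a greedy `(?P<before>.{0,100})` in front of an optional scheme group would produce. The sketch below uses a simplified stand-in for the quoted pattern (single "plain" format, no port or base path). `buildPattern()`, `captureScheme()` and the `$mode` values are hypothetical names, and the lazy variant is there only to show that greediness is what matters, not to suggest how the plugin should be changed.

```php
<?php
// Simplified stand-in for the quoted pattern; not the plugin's actual code.
function buildPattern(string $authority, string $mode): string
{
    $contexts = [
        'none'   => '',
        'greedy' => '(?P<before>.{0,100})',
        'lazy'   => '(?P<before>.{0,100}?)',
    ];
    $host = preg_quote($authority, '~');

    return '~' . $contexts[$mode] . '(?P<url>(?P<scheme>https?:)?//' . $host
        . '(?P<path>(/|(?![a-z0-9-._]))(?:/|[a-z0-9-._\~%])*))~ix';
}

// Returns whatever ended up in the "scheme" group for the first match.
function captureScheme(string $html, string $mode): string
{
    preg_match(buildPattern('mysite.local', $mode), $html, $m);

    return $m['scheme'] ?? '';
}

$html = '<link rel="modulepreload" href="https://mysite.local/wp-content/x.js">';

echo captureScheme($html, 'none'), "\n";   // https:
echo captureScheme($html, 'greedy'), "\n"; // empty line: "https:" was consumed by "before"
echo captureScheme($html, 'lazy'), "\n";   // https:
```

Because the scheme group is optional, the greedy context group is free to take `https:` for itself, and the match still succeeds starting at `//mysite.local`.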



  • Thread Starter Yan

    (@donamyk)

    Thanks for the quick reply.

    Items 1 & 3 seem to be related to each other, but it’s a bit of an edge case to reproduce:

    For background, I am working on a pre-existing client WP installation built with a theme developed by somebody else, so I’m not super familiar with all of the nuances.


    This website has hundreds of resources, so to troubleshoot the generation, I added the following to limit generation to just the index:

    // URL crawling control
    add_filter('staatic_should_crawl_url', function ($value, $url, $context): mixed {
        // TODO: debugging toggle to crawl just 1 page; set to false to crawl everything again
        if (true) {
            $sitePrefix = "https://mysite.local";
            $allowedUrls = array(
                "/",
                // "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js"
            );
            $value = false;
            $path = $url->getPath();
            foreach ($allowedUrls as $allowedPath) {
                // $url is an object, so cast it before comparing it to a string
                // (assumes it implements __toString(), as PSR-7 URIs do).
                if ($path === $allowedPath || $sitePrefix . $allowedPath === (string) $url) {
                    $value = true;
                    break;
                }
            }
        }
        return $value;
    }, 5, 3);

    I configured staatic_override_site_url to be "/".
    If staatic_extended_url_context is enabled, then all links pointing to excluded resources inside the static HTML output get prefixed with https:/


    e.g. <script src="https:/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js">


    However, if I uncomment "/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js" in the filter above, then the output is correct:

    src="/wp-content/themes/aibms/src/dist/script-CYL-dtLF.js" 

    NOTE: the same thing happens to anchor href and image src attributes for excluded files:

    <img src="https:/wp-content/uploads/2025/...."

    If staatic_override_site_url is set to something like "https://xyz.com/", then the output for excluded resources looks like this:

    <img src="https:https://xyz.com/wp-content/uploads/2025/06/...

    Setting staatic_extended_url_context then fixes the output.
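Both malformed shapes above (`https:/wp-content/...` and `https:https://xyz.com/...`) are what one would get if only the matched URL were replaced while a context group that had already absorbed `https:` was written back unchanged. The sketch below reproduces just that shape with a simplified pattern. `rewriteUrls()` is a hypothetical helper, not plugin code, and it does not explain why only excluded resources are affected or which combination of settings triggers the behaviour in the plugin itself.

```php
<?php
// Hypothetical helper: keeps the captured context and swaps only the URL itself.
function rewriteUrls(string $html, string $replacementBase, bool $extendedContext): string
{
    $before = $extendedContext ? '(?P<before>.{0,100})' : '';
    $pattern = '~' . $before
        . '(?P<url>(?P<scheme>https?:)?//mysite\.local(?P<path>[a-z0-9-._\~%/]*))~i';

    return preg_replace_callback($pattern, function (array $m) use ($replacementBase) {
        // "before" is written back verbatim, including any scheme it swallowed.
        return ($m['before'] ?? '') . rtrim($replacementBase, '/') . $m['path'];
    }, $html);
}

$html = '<img src="https://mysite.local/wp-content/uploads/a.png">';

echo rewriteUrls($html, '/', true), "\n";
// <img src="https:/wp-content/uploads/a.png">
echo rewriteUrls($html, 'https://xyz.com/', true), "\n";
// <img src="https:https://xyz.com/wp-content/uploads/a.png">
echo rewriteUrls($html, '/', false), "\n";
// <img src="/wp-content/uploads/a.png">
```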

    At this point, I am quite confused…

    Can you help me understand whether I should enable or disable extendedContext? I had originally set it to true after reading your guide here:

    https://staatic.com/blog/tutorials/advanced-publication-process-customization/

  • Thread Starter Yan

    (@donamyk)

    Thanks for the fast reply.

    I can confirm that this error no longer appears.
