I temporarily added this to .htaccess:
<If "%{QUERY_STRING} =~ /^(s|search)=/">
AuthName "WordPress Admin"
AuthType Basic
AuthUserFile /home/admin/web/.htpasswd
require valid-user
</If>
This made it so that nobody can search (bot or human) without signing in.
And yet, those spam requests were STILL showing up in Algolia (about 1 search per second).
Any ideas would be appreciated.
Admittedly, I am in about the same place as in that other thread you pointed out: I don’t have any solid leads on this.
Did you check out https://www.algolia.com/doc/faq/basics/too-many-false-unexplained-search-operations/ at all and try some of the suggestions?
Hey Michael,
Thank you for your reply!
Yes, I did try everything that I could. We can’t use the IP from “Search API Logs” in Algolia as it is always the IP of our hosting provider that’s making the request in Algolia. So we used our logs to track down the IP and it’s… Googlebot! Worth noting we had the same issue with Majestic’s bot but we could block that one.
This is strange because I am blocking all bots from accessing search pages with robots.txt and verified the block is in place through Google’s Search Console (inspected a search URL and confirmed it can’t be crawled and fetched).
Tried blocking with .htaccess but no luck. This is our latest attempt (in case it works for someone else):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot) [NC]
RewriteCond %{QUERY_STRING} ^s= [OR]
RewriteCond %{REQUEST_URI} ^/search/
RewriteRule ^ - [F]
This should be blocking Googlebot and Bingbot from accessing search pages, but it’s not. They are still getting through, ignoring both that and robots.txt.
Tried using an API key with rate limiting but that didn’t seem to do anything at all. We copied that key in the plugin and it still works (so it is valid) but rate limiting doesn’t seem to apply. Perhaps that’s because Googlebot is using far too many IPs.
You don’t have a way of excluding specific queries (contain keyword) with a custom function, do you? Or limiting the query characters (they are always long)?
None that I know of offhand, but I do know we kept and provide a lot of WordPress filters in the codebase to help with customization, and some in there may be good candidates to help out with this. I just don’t know them off the top of my head.
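One option that does not depend on the plugin at all is to reject overly long search queries at the server level, before WordPress (and therefore Algolia) ever sees them. This is only a rough sketch, assuming Apache with mod_rewrite enabled and that the rules sit above the standard WordPress rewrite block in .htaccess. The 60-character threshold is an arbitrary assumption, not a value from this thread, and it counts URL-encoded characters (a space sent as %20 counts as three), so tune it against your own logs.

```apache
RewriteEngine On
# Block requests whose "s" parameter (anywhere in the query string)
# is 60 or more characters long. 60 is an assumed, arbitrary threshold.
RewriteCond %{QUERY_STRING} (^|&)s=[^&]{60,} [OR]
# Same idea for pretty search URLs of the form /search/<query>.
RewriteCond %{REQUEST_URI} ^/search/.{60,}
# Answer matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

Legitimate visitors who type a very long query would also get a 403, so this trades some usability for fewer billed search operations.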
Michael,
Thank you, I’ll take a look!
Do you have a link to said codebase?
In the meantime, I’ll update this post if I figure this out.
@evita086 it’s just the plugin code itself. If you want the GitHub link, it can be found at https://github.com/WebDevStudios/wp-search-with-algolia/; otherwise, just crack open the plugin in your IDE or text editor.
For anyone possibly facing the same issue, here’s what the spammers are doing:
1. They are not actually conducting searches. They are linking to the search pages from an indexed site they control. This, however, does trigger a query in Algolia.
2. This causes Google to index the pages even if you are blocking access to /?s or /search/ via robots.txt.
Now, this won’t stop them from spamming your search bar but here’s what you can do to at least not let them achieve their goal.
1. Remove your Disallow rules for search pages from robots.txt
2. Make sure you are using noindex for search pages.
We were doing BOTH and that was the problem. Because our robots.txt was blocking search engines from crawling search pages, Google (for example) couldn’t see the “noindex” tag, so it indexed the URLs anyway even though they were blocked by robots.txt.
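If you can’t easily add the noindex meta tag through your theme or SEO plugin, the same signal can be sent as an HTTP response header, which Google also honors. A minimal sketch, assuming Apache 2.4 with mod_headers available; it only covers the /?s= form of search URLs, and the pattern is my assumption rather than something we tested on our site:

```apache
<IfModule mod_headers.c>
    # Send a noindex header on any response whose query string
    # contains an "s" parameter (WordPress search results).
    <If "%{QUERY_STRING} =~ /(^|&)s=/">
        Header set X-Robots-Tag "noindex"
    </If>
</IfModule>
```

As with the meta tag, crawlers only see this header if robots.txt lets them fetch the page, so the Disallow rules for search pages still have to go.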
Hope this helps someone.
Worth noting: spam searches are almost non-existent now, but not due to any action we took. They just stopped. My guess is that’s because they were able to get the pages they wanted indexed. Those pages will get de-indexed with the new setup, so they may come back; we’ll see.
This reply was modified 4 years, 8 months ago by evita086.
Update: spammers are back, unfortunately. Getting thousands of queries again.
Sadly I haven’t determined anything new since last week when I last responded.
That is fine, thank you for trying. Just to clarify, this code works if you place it in your .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot) [NC]
RewriteCond %{QUERY_STRING} ^s= [OR]
RewriteCond %{REQUEST_URI} ^/search/
RewriteRule ^ - [F]
To anyone interested in that solution:
1. You may have to add more bots in the second line.
2. Keep in mind this will completely block the bots you specify from accessing your search pages, which means the pages won’t get indexed even if you want them to. Make sure spam search queries haven’t already been indexed by Google/Bing before you add these directives to your .htaccess file. If you see spam queries indexed, it’d be best to get them removed first. In our case, search pages with spam queries got indexed because we were blocking bots with robots.txt and, therefore, they couldn’t see the “noindex” tag we were applying to search pages. What we did was: remove the robots.txt block, then allow Google time to crawl the pages, see the noindex tag and de-index them, and THEN add the above code to .htaccess.
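As an illustration of point 1, here is a variant with one more bot added. This is a sketch, not the exact rules we run: MJ12bot is the user-agent token of Majestic’s crawler, and the (^|&) prefix is an assumed tweak so that s= is also matched when it is not the first query parameter.

```apache
RewriteEngine On
# Case-insensitive match on the crawler user agents to block;
# extend the alternation with any other bots seen in your logs.
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|mj12bot) [NC]
# Search requests: an "s" parameter anywhere in the query string...
RewriteCond %{QUERY_STRING} (^|&)s= [OR]
# ...or a pretty search URL under /search/.
RewriteCond %{REQUEST_URI} ^/search/
# Answer matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

The user-agent condition has to match AND at least one of the two search conditions has to match, because [OR] only joins the condition it is attached to with the one right after it.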
Good luck!