Hi SaboSmith,
My first step would be to go completely default with WordPress (default theme, no extra plugins) and simulate X concurrent users against a TEST site, in a controlled environment, that mirrors your PRODUCTION site’s hardware and software as closely as possible. Then, if the default site can handle your peak expected load (100,000 concurrent users?!), I would introduce the non-default theme and then one plugin at a time until you find the offender(s).
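In case it helps, here is a rough sketch of that one-at-a-time procedure in Python. Everything in it is hypothetical: `activate`, `deactivate` and `passes_load_test` are placeholder callables you would wire up to your own tooling (your plugin management and your load tester); none of them are part of WordPress.

```python
from typing import Callable, Iterable, List


def find_offenders(
    components: Iterable[str],
    activate: Callable[[str], None],
    deactivate: Callable[[str], None],
    passes_load_test: Callable[[], bool],
) -> List[str]:
    """Activate components one at a time on top of a known-good baseline.

    Any component that makes the load test fail is recorded and switched
    back off, so later components are tested against a healthy site.
    """
    # The default site must survive the load before anything is added.
    if not passes_load_test():
        raise RuntimeError("baseline fails the load test; fix that first")

    offenders = []
    for name in components:
        activate(name)
        if not passes_load_test():
            offenders.append(name)
            deactivate(name)  # restore the last healthy state
    return offenders
```

Put the theme first in `components`, then the plugins, and run it only against the TEST site.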
As for the nuts and bolts of the issue, I’d read http://codex.ww.wp.xz.cn/High_Traffic_Tips_For_WordPress if you haven’t already. 😉
P.S. For load testing, I’ve heard Tsung rocks (see http://tsung.erlang-projects.org), and I am currently learning how to test with Apache’s JMeter (see http://jmeter.apache.org).
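If you just want a quick smoke test before setting up one of those tools, a minimal sketch using only Python’s standard library might look like the following. This is not how Tsung or JMeter work, and a thread pool on one machine will get nowhere near 100,000 concurrent users; it only shows the basic idea of many simulated users hitting a page at once. The function names are my own.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def fetch_url(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status == 200


def run_load_test(
    fetch: Callable[[], bool], users: int, requests_per_user: int
) -> Dict[str, float]:
    """Run `users` threads, each calling `fetch` `requests_per_user` times."""

    def simulate_user(user_id: int):
        results = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            try:
                ok = bool(fetch())
            except Exception:
                ok = False  # errors and timeouts count as failed requests
            results.append((ok, time.perf_counter() - started))
        return results

    # One thread per simulated user, all started together.
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(simulate_user, range(users)))

    samples = [sample for user in per_user for sample in user]
    latencies = [elapsed for _, elapsed in samples]
    return {
        "requests": len(samples),
        "failures": sum(1 for ok, _ in samples if not ok),
        "mean_seconds": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_seconds": max(latencies, default=0.0),
    }
```

You would call it with something like `run_load_test(lambda: fetch_url("http://test.example.com/"), users=50, requests_per_user=10)`, where the URL is a made-up placeholder for your TEST site. Never point it at production.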
Hope something here helps!
I’d also start off by asking what sort of hosting account you’re running your site on. The load that you’re seeing is pretty big; not huge, but still big. It’s more than enough to bring a shared hosting service to its knees pretty quickly, so if you’re not running at least a mid-range VPS or dedicated server, then it’s time to upgrade. From the sounds of it you don’t have any access to server logs, process logs, stats, etc., which points to you being on a shared service, and those are normally either underpowered or over-utilised by the many other sites on the same system.
If you are already running a VPS or dedicated server, and it’s looking like it’s not enough, then talk to your hosting company about what else they can offer to handle the load. If they are any good they’ll be able to help you out by offering a server with more resources. If they are really good they’ll be able to look at something like multiple servers and load balancing. If they are amazing they’ll be able to manage all of that for you so you don’t need to worry about it yourself.
Remember that optimising your site can only get you so far before you need to throw more hardware at it. Facebook is a prime example. They run that single site from multiple data centres, not just multiple servers, because no amount of optimisation would let their system run on a single web server.
If you end up needing to look for a new host, maybe talk with the folks at Arvixe dot com. I ended up leaving there after a 30-day trial a few weeks ago because the CPU usage ceiling was too low for me on their lowest-priced shared account, but I had zero downtime with them and their servers are far faster than what I have known at BlueHost (shared) over the past two years.
Thanks for your time everyone, some really helpful info here!
We were able to get the host to monitor the site during our attempts at fixing it, bringing it back up should it go down. We’ve narrowed it down to Jetpack and Next Gen (regretted paying for this a long time ago). All 3 are currently disabled and the site is operating perfectly. We may not bother testing to find the exact one. There’s a 2/3 chance it’s Next Gen, and it’s not the first problem we’ve had with it.
Thanks again, guys. Especially for the info about the test environment; that was a big help.
Just for the record, Cata: we’re on a managed VPS with the following specs:
RAM: 4GB
CPU Cores: 2
CPU MHz: 5,000
SSD Cached Storage: 100GB
Bandwidth: 4TB
This might be helpful in case you don’t already know about it:
https://ww.wp.xz.cn/plugins/slimjetpack/
The regular JetPack must remain installed, but not activated.