First, I'll say that like anything else, synthetic performance tests are great tools, but they're just that: another tool for the toolbelt. You can gain some fantastic information from them, and like all tools, not all performance tests are created equally. Some are better than others for certain things, and vice-versa.
The main downfall of performance tests is that they're purely mechanical, and the technology just isn't available to automatically measure how a human being using a site perceives it. Here are a couple of example reports from one tool, Pingdom (which uses page speed as part of its scoring). One is for Dwell Phoenix (one of our sites), and the other is for a site Agent Image did.
At first glance, both sites look identical based on their scores. But digging into the details reveals some interesting differences. For example:
Dwell Phoenix is ranked faster than 77% of tested websites at a 1.72s load time. Egypt Sherrod is faster than 57% at just over 2.8s. That's a significant gap: the slower site takes over a second longer, roughly 60% more time, to load (lower is better/faster).
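A quick back-of-the-envelope check of that gap, using the two load times reported above:

```python
# Load times reported by Pingdom for the two sites (seconds).
dwell_phoenix = 1.72
egypt_sherrod = 2.8

# How much longer the slower site takes, relative to the faster one.
extra_time = egypt_sherrod - dwell_phoenix          # 1.08 seconds
percent_slower = extra_time / dwell_phoenix * 100   # ~63%

print(f"Egypt Sherrod takes {extra_time:.2f}s ({percent_slower:.0f}%) longer to load")
```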
Looking at the other marks, our low scores for Dwell are 'remove query strings from static files' and 'serve files from cookieless domain', an F and a C respectively. This is where test bias comes into play. They're scored poorly, but in actuality these were infrastructure design decisions. We're not doing what Pingdom suggests for security reasons: we can do a better job of protecting our sites from DDoS attacks and hacks by doing things the way we are. So, a lower score, but a more reliable and safer hosting environment as a result, with no actual decrease in site load speed. Arguably, the site is faster in part because of this decision.
Egypt Sherrod scores high on those two points. Their low score is 'leverage browser caching'. This means that assets like images, scripts, and stylesheets aren't cached in the browser, so everything loads fresh from the server every time you visit the same page. That increases server load, makes pages load slower because the browser has to redownload everything, and, most importantly, costs your visitors more out of pocket to use the site. Most browsing is done on mobile these days, and most mobile devices have data caps as well as slower connections. We chose to focus on this for the end user's benefit, which is why we score an A/B on that point.
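For reference, 'leveraging browser caching' mostly just means sending a Cache-Control header with a long max-age on static assets so the browser can reuse its local copy. A minimal sketch of the idea; the extensions and lifetimes here are illustrative, not anyone's actual configuration:

```python
# Map static file types to cache lifetimes (in seconds). These values
# are illustrative; long lifetimes for static assets are a common choice.
CACHE_LIFETIMES = {
    ".css": 31536000,   # 1 year
    ".js": 31536000,    # 1 year
    ".png": 2592000,    # 30 days
    ".jpg": 2592000,    # 30 days
}

def cache_headers(path):
    """Return the Cache-Control header a server might attach to `path`."""
    for ext, seconds in CACHE_LIFETIMES.items():
        if path.endswith(ext):
            # The browser may reuse its local copy for this long
            # without contacting the server again at all.
            return {"Cache-Control": f"public, max-age={seconds}"}
    # HTML changes often, so force revalidation on every visit.
    return {"Cache-Control": "no-cache"}

print(cache_headers("/assets/site.css"))
# {'Cache-Control': 'public, max-age=31536000'}
```

With headers like these, a repeat visit pulls images and scripts from the local cache instead of redownloading them, which is exactly the behavior Pingdom's 'leverage browser caching' check is looking for.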
There are some things we're working on that we scored lower on. Query strings on static resources is one of them. There are some resources we can safely remove them from (and score higher as a result); we're just playing it safe while we monitor. Better 'too safe' than sorry, in my opinion.
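For context on why Pingdom flags this: some proxies and CDNs won't cache a URL like style.css?ver=4.9 at all, so the usual fix is to bake the version into the filename instead. A hypothetical sketch of that rewrite (the URL and helper name are made up; real setups typically do this at build time or with a server rewrite rule):

```python
from urllib.parse import urlsplit, parse_qs

def version_into_filename(url):
    """Rewrite style.css?ver=4.9 -> style.4.9.css so proxies will cache it.

    Hypothetical illustration of the cache-busting technique, not a
    drop-in tool.
    """
    parts = urlsplit(url)
    version = parse_qs(parts.query).get("ver")
    if not version:
        return url  # no version query string, nothing to rewrite
    base, dot, ext = parts.path.rpartition(".")
    if not dot:
        return url  # no file extension to splice the version into
    return f"{base}.{version[0]}.{ext}"

print(version_into_filename("/wp-content/themes/x/style.css?ver=4.9"))
# /wp-content/themes/x/style.4.9.css
```

The versioned filename still busts the cache whenever the asset changes, but intermediaries treat it as a plain static file they're happy to cache.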
I'm not going to speak to this next set of links too much, but I wanted to give you an example of a totally different test:
Comparing the same two sites again, the scores are totally different. Pingdom made the sites look nearly identical, but WebPageTest shows major differences in results.