For a scientist, significance is very important. Is using P300 more pleasant than using SSVEP? Does this analysis pipeline yield a higher detection accuracy than the other? Does brain activity during this mental task differ from general baseline activity?
Generally, we compute p: the probability of seeing a difference at least as large as the one we observed if the second set of samples came from the same population as the first. If p is small, we reject the hypothesis that both sets come from the same population (of course, there is still a chance that we are wrong in this). As a convention, if p is smaller than 0.05 we say the difference is statistically significant. If p is smaller than 0.10, you can speak of a trend, which generally means that you should try to do more tests, because with more samples it is easier to get a significant difference.
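To make the idea of p a bit more concrete, here is a minimal sketch of a permutation test in plain Python. It is only one of many ways to get a p-value, and the function name and the accuracy numbers are made up for illustration. The idea: if both sets come from the same population, the group labels are arbitrary, so we shuffle them many times and count how often a shuffled difference in means is at least as large as the observed one.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical detection accuracies for two analysis pipelines (invented numbers).
pipeline_a = [0.71, 0.68, 0.74, 0.70, 0.73, 0.69, 0.72, 0.75]
pipeline_b = [0.66, 0.70, 0.65, 0.69, 0.67, 0.64, 0.68, 0.66]

p = permutation_p_value(pipeline_a, pipeline_b)
print(f"p = {p:.4f}")
```

With these invented numbers the shuffled differences rarely reach the observed one, so p comes out well below 0.05. Feeding the same list in twice gives p = 1, since every shuffle is then at least as extreme as an observed difference of zero.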
There is a nice article on measuringusability.com about this convention of needing p < 0.05 for significance, which argues that you should not blindly follow this convention, but look at what your results really mean.
“With a large enough sample size almost any difference is statistically different. It’s more interesting to have a difference of 6% at a p=.06 than a difference of 1 % at p =.01 when comparing reading speeds. Only the latter would make it into a peer-review journal.”
“Strictly speaking, […], p is a statement about data rather than about any hypothesis, and hence it is not inferential. This raises the question, though, of how science has been able to advance using significance testing. The reason is that, in many situations, p approximates some useful post-experimental probabilities about hypotheses, such as the post-experimental probability of the null hypothesis. When this approximation holds, it could help a researcher to judge the post-experimental plausibility of a hypothesis.[…] Even so, this approximation does not eliminate the need for caution in interpreting p inferentially, as shown in the Jeffreys–Lindley paradox […].”
More criticism of statistical hypothesis testing can be found on this Wikipedia page. Definitely a recommended read.
Then the question arises: what would be a good way to determine whether things are different? My personal opinion for now is to keep using p-values, but to also take into account sample size and the magnitude of the difference, and be careful about the assumptions for the null hypothesis (that there is no difference). I’d love to hear alternatives!
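A small sketch of why I want the magnitude and the sample size next to the p-value. It uses a large-sample two-sample z-test (a normal approximation, not necessarily the test you would run on real data) together with Cohen's d as the effect size. The numbers, a mean difference of 1 with a standard deviation of 10, are assumptions chosen purely for illustration, as are the function names.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, normal approximation.

    Assumes both groups have the same known standard deviation and size.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def cohens_d(mean_diff, sd):
    """Effect size: the difference expressed in standard deviations."""
    return mean_diff / sd


mean_diff, sd = 1.0, 10.0  # illustrative values, not real measurements
results = [
    (n, cohens_d(mean_diff, sd), two_sample_z_p(mean_diff, sd, n))
    for n in (20, 200, 2000)
]
for n, d, p_value in results:
    print(f"n per group = {n:5d}   d = {d:.2f}   p = {p_value:.4f}")
```

The effect size stays at 0.1 in every row, while p drops from far above 0.05 at 20 samples per group to far below it at 2000. The difference did not become any more interesting, we just became more certain that it is not exactly zero.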
The measuringusability.com article ends with an interesting call to scientists to post their results on the web and see if their readers are convinced. Starting the discussion in this manner might be a lot more valuable, and the results might even be read by more people, than if they were published in a peer-reviewed journal. That sounds like a fun idea!
P.S. Another potential breaker for scientific results is having to correct for doing multiple tests, for example when looking into the differences for multiple variables. While the easy solution is to apply Bonferroni correction, it is a very strict adjustment of your significance level, perhaps more strict than is really necessary. Perhaps an article for another time?
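As a small teaser, here is a sketch comparing Bonferroni with the Holm step-down procedure, which is one commonly used, less strict alternative (my choice for illustration, not something the articles above prescribe). Both control the chance of at least one false positive across all tests. The p-values are invented.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test the smallest p first, relaxing the threshold each step."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Thresholds are alpha/m, alpha/(m-1), ..., alpha for the sorted p-values.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Invented p-values for four variables tested on the same data.
p_values = [0.010, 0.013, 0.020, 0.300]

print("Bonferroni:", bonferroni_reject(p_values))
print("Holm:      ", holm_reject(p_values))
```

With these numbers Bonferroni keeps only the first result, because every p has to be below 0.05 / 4 = 0.0125, while Holm keeps the first three. Holm never rejects fewer hypotheses than Bonferroni, which is why it is often suggested as a drop-in replacement.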