Upping your game using methods borrowed from high-energy collider physics to test for the existence of spikes and holes in distributions
Introduction
Unless you work in collider physics, you have probably never heard of scaled factorial moments (SFMs), and there’s no reason why you should have. To learn about the elementary quark-gluon species created during and after collisions of accelerated atomic particles, physicists generate histograms for millions of “events” to assess single-particle kinematic variables such as rapidity, transverse momentum, and azimuthal angle, as well as event-shape variables. It’s therefore typical for large collider collaborations to have “histogram freaks” with decades of experience in how measurement precision, rounding, and truncation affect histogram results. Essentially, what is being pursued is “bump-hunting,” the search for spikes and holes in distributions, a phenomenon called intermittency.
Intermittency has been studied in a variety of forms including non-Gaussian tails of distributions in turbulent fluid and heat transport, spikes and holes in quantum chromodynamic (QCD) rapidity distributions, 1/f flicker noise in electrical components, period doubling and tangent bifurcations, and fractals and long-range correlations in DNA sequences. The QCD formalism for intermittency was introduced by Bialas and Peschanski [1–3] for understanding spikes and holes in rapidity distributions, which were unexpected and difficult to explain with conventional models. This formalism led to the study of distributions that are discontinuous in the limit of very high resolution, with spectacular features that represent a genuine physical (dynamical) effect rather than a statistical fluctuation.
What Do SFMs Measure?
In order to understand what SFMs are used for, you first need to consider what you already know about statistical hypothesis tests to determine if an object’s value is significantly different from the mean, i.e., a Z-score. If you tell a physicist about a stock buy recommendation, the response will be: “I just need to know if the return is more than 2 standard deviations from the mean — that’s all.” This is no different from what you have been taught regarding a Gaussian or standard normal distribution, i.e., standard deviations from the mean. In molecular biology this is called “differential display,” that is, the pursuit of finding significant differences in expression of mRNA transcripts, proteins, or metabolites in e.g. tumor vs. normal or tissue vs. serum plasma that are beyond chance occurrence at p<0.05.
Next, in contrast to the above, consider a method that asks whether significantly more values of a feature fall within a narrow range than expected, which would show up as a spike at that range in a histogram. In other words, are there, for example, significantly more values between 1.2 and 1.21 than there would be in a smooth distribution of the same shape? SFMs can be used to identify the existence of a spike and a hole in a distribution. Spikes are typically accompanied by holes, so SFMs exploit both phenomena.
Calculation of SFMs
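The fundamental equation for the SFM of order q, in the standard single-event form of Bialas and Peschanski [1], is

$$F_q = M^{q-1} \sum_{m=1}^{M} \frac{n_m (n_m - 1) \cdots (n_m - q + 1)}{N (N - 1) \cdots (N - q + 1)}$$

where M is the number of equal-width bins, n_m is the number of values in bin m, and N is the total number of values.
When intermittency is present in the distribution, Fq will be proportional to M according to the power-law

$$F_q \propto M^{\nu_q}$$

where the exponent, written here as ν_q, is the intermittency exponent.
By introducing a proportionality constant A and taking the natural logarithm we obtain the line-slope formula

$$\ln F_q = \nu_q \ln M + \ln A$$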
If there is no intermittency in the distribution, then ln(Fq) will be independent of ln(M): the slope is zero and ln(Fq) equals the constant term ln(A). Another important consideration is that if the slope is non-vanishing in the limit of very high resolution (very large M), then the distribution is discontinuous and should reveal an unusually rich structure of spikes and holes.
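As a minimal sketch of the slope check described above, the Python below computes Fq from bin counts and fits a straight line to ln(Fq) vs. ln(M). It assumes the standard single-event form of the SFM (falling factorials of the bin counts divided by the falling factorial of N, scaled by M^(q-1)). The function names, the choice q=2, the random seed, and the handful of M values are illustrative assumptions, not taken from the article.

```python
import numpy as np

def scaled_factorial_moment(counts, q=2):
    """Single-event SFM of order q from a vector of bin counts.

    Assumed form: F_q = M^(q-1) * sum_m n_m(n_m-1)...(n_m-q+1)
                                  / [N(N-1)...(N-q+1)]
    """
    n = np.asarray(counts, dtype=float)
    M, N = n.size, n.sum()
    num = np.ones_like(n)
    den = 1.0
    for k in range(q):
        num *= n - k   # falling factorial per bin; zero whenever n_m < q
        den *= N - k   # falling factorial of the sample size
    return M ** (q - 1) * num.sum() / den

def intermittency_slope(x, bin_numbers, q=2):
    """Least-squares slope and intercept of ln(Fq) against ln(M)."""
    x = np.asarray(x, dtype=float)
    # np.histogram with an integer `bins` spans min(x)..max(x) by default
    fq = [scaled_factorial_moment(np.histogram(x, bins=M)[0], q)
          for M in bin_numbers]
    slope, intercept = np.polyfit(np.log(bin_numbers), np.log(fq), 1)
    return slope, intercept

# Illustrative run on a smooth sample (all settings below are assumptions)
rng = np.random.default_rng(42)
x = rng.standard_normal(20_000)
bin_numbers = [32, 64, 128, 256, 512]
slope, intercept = intermittency_slope(x, bin_numbers)
print(f"slope = {slope:.3f}, ln(A) = {intercept:.3f}")
```

For a smooth sample like this one the fitted slope should come out close to zero. As a sanity check on the formula, putting all N values into a single bin gives F2 = M exactly.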
Calculating Fq as a Function of Number of Bins, M
First, the sample size N of x-values should be in the thousands, and preferably greater than 10,000. The probability distribution of x doesn’t really matter, and skewness and kurtosis are also not a problem. We have learned that reliable results can be obtained with values of M=30,32,…,512. Thus, bin counts are first obtained for the maximum value of M=512 using a bin width of d=(max(x)-min(x))/512. Once the counts for the 512 bins are determined, bins are collapsed by summing the counts of neighboring bins to obtain the counts for each smaller value of M. If collapsing bins is not desired, then simply calculate bin counts for M bins directly using a bin width d=(max(x)-min(x))/M.
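The binning step can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used for the article: the helper name `collapse_counts` is hypothetical, and the sketch only collapses to values of M that divide 512 evenly (here, powers of two), because summing whole neighboring bins requires that. For other values of M, binning directly with width d=(max(x)-min(x))/M is the simpler route.

```python
import numpy as np

def collapse_counts(fine_counts, M):
    """Sum neighboring fine bins so that M coarser bins remain.

    Assumes M divides the number of fine bins evenly (hypothetical helper).
    """
    fine_counts = np.asarray(fine_counts)
    if fine_counts.size % M != 0:
        raise ValueError("M must divide the number of fine bins")
    # Each row of the reshaped array holds the fine bins of one coarse bin
    return fine_counts.reshape(M, -1).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

# Finest histogram: 512 bins of width d = (max(x) - min(x)) / 512
fine_counts, edges = np.histogram(x, bins=512, range=(x.min(), x.max()))

# Collapse to coarser binnings (illustrative subset of M values)
collapsed = {M: collapse_counts(fine_counts, M) for M in (32, 64, 128, 256, 512)}

# Alternative without collapsing: bin directly for any M
direct_counts, _ = np.histogram(x, bins=30, range=(x.min(), x.max()))

for M, counts in collapsed.items():
    print(M, counts.sum())
```

Collapsing guarantees that every coarse binning is built from the same 512 fine counts, so the total count is preserved at each M.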
Simulations
To reveal the behavior of Fq, 5 simulations of standard normal distributions were performed, and then plots of ln(Fq) vs ln(M) were constructed. The chart below shows the 5 distributions and a final plot of ln(Fq) vs. ln(M) for each of the 5 distributions:
In the lower right panel of the image above, results indicate that when using a standard normal distribution with a hole of width 0.64 and spike of width 0.02, the slope of ln(Fq) vs. ln(M) was the greatest (blue squares). For the same hole width but larger spike width (0.08) the slope was positive but lower (red squares). Lastly, regression models of ln(Fq) vs. ln(M) for unimodal, bimodal, and trimodal distributions yielded slopes of zero, but with different y-intercepts.
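A rough way to reproduce the qualitative behavior is sketched below. The article does not say how the hole and spike were constructed or where they were placed, so everything about the construction here is an assumption: values falling in a hole starting at 0 with width 0.64 are moved into a narrow spike placed immediately to the right of the hole. The helper names, q=2, the seed, and the M values are likewise illustrative.

```python
import numpy as np

def scaled_factorial_moment(counts, q=2):
    """F_q = M^(q-1) * sum_m n_m(n_m-1)...(n_m-q+1) / [N(N-1)...(N-q+1)]."""
    n = np.asarray(counts, dtype=float)
    M, N = n.size, n.sum()
    num, den = np.ones_like(n), 1.0
    for k in range(q):
        num *= n - k
        den *= N - k
    return M ** (q - 1) * num.sum() / den

def fitted_slope(x, bin_numbers, q=2):
    """Slope of the least-squares line through ln(Fq) vs. ln(M)."""
    fq = [scaled_factorial_moment(np.histogram(x, bins=M)[0], q)
          for M in bin_numbers]
    return np.polyfit(np.log(bin_numbers), np.log(fq), 1)[0]

def add_hole_and_spike(x, hole_start, hole_width, spike_width, rng):
    """Assumed construction: empty the hole and pile its values into a
    spike of the given width located just past the hole's right edge."""
    x = np.array(x, dtype=float)
    in_hole = (x >= hole_start) & (x < hole_start + hole_width)
    spike_start = hole_start + hole_width
    x[in_hole] = rng.uniform(spike_start, spike_start + spike_width,
                             in_hole.sum())
    return x

rng = np.random.default_rng(7)
smooth = rng.standard_normal(20_000)
narrow = add_hole_and_spike(smooth, 0.0, 0.64, 0.02, rng)  # spike width 0.02
wide = add_hole_and_spike(smooth, 0.0, 0.64, 0.08, rng)    # spike width 0.08

bin_numbers = [32, 64, 128, 256, 512]
slope_smooth = fitted_slope(smooth, bin_numbers)
slope_narrow = fitted_slope(narrow, bin_numbers)
slope_wide = fitted_slope(wide, bin_numbers)
print(f"smooth: {slope_smooth:.3f}  narrow spike: {slope_narrow:.3f}  "
      f"wide spike: {slope_wide:.3f}")
```

Under these assumptions the smooth sample should give a slope near zero, while both modified samples should give positive slopes. The exact values depend on where the bin edges fall relative to the spike, so they will not match the figure.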
Overall, one or more prominent, strong spikes need to be present in order for the slope to be non-zero. In 2004, we developed a statistical randomization test [4] to determine the significance of the slope, which is described and available here.
References
[1] Bialas, A., Peschanski, R. Nucl. Phys. B. 273, 1986, 703.
[2] Bialas, A., Peschanski, R. Nucl. Phys. B. 308, 1988, 857.
[3] Bialas, A. Nucl. Phys. A. 525, 1991, 345c.
[4] Peterson, L.E. Statistical randomization test for QCD intermittency in a single-event distribution. 2004, arXiv:physics/0404016, https://doi.org/10.48550/arXiv.physics/0404016