What is the Mann-Whitney U test?

The Mann-Whitney U test is a non-parametric statistical procedure for determining whether two independent samples are drawn from populations with the same distribution. The test operates on the ranks of the observations rather than the raw values, making it applicable when the assumption of normality cannot be justified. It was developed by Henry B. Mann and Donald R. Whitney in 1947 as a generalization of Frank Wilcoxon's 1945 rank-sum procedure.

When should I use the Mann-Whitney U test instead of an independent samples t-test?

The Mann-Whitney U test is appropriate when the data violate the normality assumption required by the t-test, when the dependent variable is measured on an ordinal scale, when sample sizes are small and normality cannot be verified, or when outliers are present that cannot be legitimately removed. For data measured on at least an interval scale from a normal population, the independent samples t-test is preferred because it is more powerful. The asymptotic relative efficiency of the Mann-Whitney U test relative to the t-test is approximately 0.955 under normality, meaning it is only slightly less powerful when normality holds.

How do I report Mann-Whitney U test results in APA 7th edition format?

In APA 7th edition format, report the test as follows: 'A Mann-Whitney U test indicated that [Group 1] (Mdn = X, n = X) differed significantly from [Group 2] (Mdn = X, n = X), U = X, z = X.XX, p = .XXX (two-tailed), r = .XX.' Report the median for each group, the U statistic, the z approximation, the exact p-value to three decimal places (with a leading zero omitted), and the effect size r (rank-biserial correlation).

What is the null hypothesis of the Mann-Whitney U test?

The null hypothesis of the Mann-Whitney U test is that the two populations are stochastically equal, formally stated as P(X greater than Y) equals P(Y greater than X) equals one-half. This is the condition of stochastic homogeneity. Under the additional assumption that the two distributions have the same shape (differing only in location), this is equivalent to asserting equality of medians. Rejection of the null hypothesis indicates that observations from one population tend to exceed observations from the other.

What effect size should I report for a Mann-Whitney U test?

The most commonly reported effect size for the Mann-Whitney U test is r, computed as r = z divided by the square root of the total sample size N. This quantity is often labelled the rank-biserial correlation, although strictly the rank-biserial correlation is (U1 minus U2) divided by (n1 times n2). Cohen's (1988) benchmarks for r are: 0.10 (small), 0.30 (medium), and 0.50 (large). An alternative measure is the common language effect size (also called the probability of superiority), computed as U1 divided by (n1 times n2), which directly expresses the probability that a randomly selected observation from one group exceeds a randomly selected observation from the other group.

What is the difference between the exact and asymptotic Mann-Whitney U test?

The exact Mann-Whitney U test computes the p-value by enumerating the exact distribution of the U statistic under the null hypothesis, determining the proportion of all possible rank arrangements that produce a U value at least as extreme as the observed one. The asymptotic test uses a normal approximation to the distribution of U, which becomes increasingly accurate as sample sizes increase. For samples where both n1 and n2 are at most 20, the exact test is preferable. For larger samples, the asymptotic test with a correction for ties provides sufficiently accurate p-values.

Mann-Whitney U Test: Calculator, Philosophical Foundation, and APA Reporting

Theoretical and Philosophical Basis

Philosophical Foundations of the Mann-Whitney U Test

The Mann-Whitney U test occupies a distinctive position in the history of statistical methodology. It emerged at a moment when applied scientists had begun to interrogate the philosophical premises underlying the dominant tradition of parametric inference, and it offered a coherent alternative grounded in ordinal information rather than distributional assumptions about population moments. Understanding the test at the doctoral level requires not only facility with its mechanics but also an appreciation of the epistemological commitments that motivated its development and continue to govern its appropriate use.

Historical Origins and Intellectual Context

Frank Wilcoxon, working as a chemist and statistician at the American Cyanamid Company, published a 1945 paper in Biometrics Bulletin introducing signed-rank and rank-sum procedures for paired and unpaired comparisons. Wilcoxon's motivation was practical and philosophically grounded: the Student t-test, which had been the standard tool for comparing two groups since Gosset's 1908 paper in Biometrika, rested on the assumption that the observed data arose from a normally distributed population. In biological research, this assumption was frequently unjustified, and the consequences of violation were poorly understood at the time. Wilcoxon's rank-based procedures dispensed with the normality requirement entirely, working instead with the ordinal structure of the data.

Two years later, Henry B. Mann, a mathematician at Ohio State University, and his doctoral student Donald R. Whitney published a more rigorous generalization of Wilcoxon's work. Their 1947 paper, "On a Test of Whether One of Two Random Variables is Stochastically Larger than the Other," published in The Annals of Mathematical Statistics, derived the exact distribution of a statistic U based on counting the number of times observations from one sample exceed observations from the other. Mann and Whitney proved the test's consistency (power approaching 1 as sample sizes increase under any fixed departure from the null hypothesis) and established the asymptotic normality of U, enabling practical use with moderate sample sizes.

The test has since been referred to variously as the Mann-Whitney U test, the Wilcoxon rank-sum test, and the Wilcoxon-Mann-Whitney test. These names reflect the same underlying procedure: Mann and Whitney's U statistic and Wilcoxon's W rank-sum statistic are linearly related, with W = U + n₁(n₁+1)/2, and they produce identical p-values. The equivalence was recognized early, and either formulation may appear in published software output.

The Non-Parametric Paradigm and Its Epistemological Basis

The term "non-parametric" requires careful examination. In strict usage, a non-parametric procedure is one whose validity does not depend on the specification of a finite-dimensional family of distributions for the population. The Mann-Whitney U test qualifies under this definition: when both samples come from the same continuous distribution, the null distribution of U is the same whether that parent distribution is normal, exponential, logistic, or anything else. The preferred alternative designation, "distribution-free," captures this property more precisely.

This distributional agnosticism carries epistemological weight. Parametric tests impose what philosophers of science might call strong background assumptions about the data-generating process. The assertion that a population is normally distributed is an ontological claim about the world, not merely a methodological convenience. When researchers fit a normal model to data that demonstrably violates it, they risk drawing inferences that rest on false premises. The Mann-Whitney test makes inference possible without these commitments, operating instead on the ordinal relationships among observations, relationships that are preserved under any monotonic transformation of the measurement scale.

S. S. Stevens's 1946 taxonomy of measurement scales is relevant here. Stevens distinguished nominal, ordinal, interval, and ratio scales, arguing that the permissible statistical operations on data depend on the scale of measurement. For ordinal-level data, where numbers encode only relative ordering without meaningful equal intervals, interval arithmetic operations such as computing means and standard deviations have no natural justification. The Mann-Whitney test's exclusive use of ranks respects ordinal-level information, making it the appropriate choice when measurements are intrinsically ordinal (such as Likert-scale ratings, satisfaction scores, or ranked preferences) or when interval-level measurements have been transformed in ways that might distort distributional properties.

The Null Hypothesis and Its Precise Formulation

The null hypothesis of the Mann-Whitney U test is more nuanced than its common presentation in applied textbooks. Mann and Whitney formulated it precisely as:

Null and Alternative Hypotheses

H₀: P(X > Y) = P(Y > X) = 1/2

H₁ (two-tailed): P(X > Y) ≠ P(Y > X)

H₁ (upper one-tailed): P(X > Y) > P(Y > X)

H₁ (lower one-tailed): P(X > Y) < P(Y > X)

Here X and Y are randomly selected observations from populations 1 and 2, respectively. The original formulation assumes continuous distributions, so that ties have probability zero (P(X = Y) = 0); in practice, ties are accommodated.

This formulation asserts stochastic equality, the condition that a randomly drawn observation from population 1 is equally likely to exceed or be exceeded by a randomly drawn observation from population 2. Rejection of this null hypothesis implies stochastic dominance: that observations from one population tend systematically to be larger than observations from the other.

A distinction of considerable practical importance is that stochastic dominance does not necessarily imply a difference in medians. This point was clarified by Fay and Proschan (2010), among others. If the two distributions have the same shape and differ only in a location parameter (the "location shift" model), then stochastic dominance is equivalent to a difference in medians, and the Mann-Whitney test can legitimately be interpreted as a test of median equality. If the shape of the distributions differs, however, the two medians might be equal while stochastic dominance still holds, because one distribution might have a heavier upper tail. Researchers who report the Mann-Whitney result as evidence of a median difference implicitly invoke the location shift assumption. This assumption is weaker than normality, but it is an assumption nonetheless, and its plausibility should be evaluated in each application.

A rigorous Mann-Whitney U test report specifies whether the location shift assumption is invoked. When distributions are expected to differ in shape, the result should be described in terms of stochastic dominance rather than as evidence of a median difference.

Mathematical Foundations

Let X₁, X₂, ..., X_n₁ be a random sample from population 1, and Y₁, Y₂, ..., Y_n₂ be an independent random sample from population 2. The Mann-Whitney U statistic is defined as:

Definition of U

U₁ = Σᵢ Σⱼ ψ(Xᵢ, Yⱼ)

where ψ(Xᵢ, Yⱼ) = 1 if Xᵢ > Yⱼ, = 0.5 if Xᵢ = Yⱼ, = 0 if Xᵢ < Yⱼ

U₂ = n₁n₂ - U₁

U = min(U₁, U₂)

U₁ counts the number of (X, Y) pairs in which the X observation exceeds the Y observation, with each tied pair counted as one half. U₁ and U₂ are complementary: their sum always equals n₁n₂.

The equivalence between U and the Wilcoxon rank-sum statistic W follows from recognizing that U₁ = W₁ − n₁(n₁+1)/2, where W₁ is the sum of the ranks of the X observations in the combined and sorted dataset. This connection provides a computationally convenient way to obtain U: rank all N = n₁ + n₂ observations jointly, sum the ranks for each group, and apply the formula.
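The two routes to U₁ described above (direct pair counting and the rank-sum identity) can be checked against each other with a short sketch. The function names and the sample data are illustrative assumptions, not part of the original material; ties receive average ranks, as is conventional.

```python
from itertools import product

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def u_by_pairs(x, y):
    """U1 from the definition: 1 if X > Y, 0.5 if tied, 0 otherwise."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in product(x, y))

def u_by_ranks(x, y):
    """U1 from the identity U1 = W1 - n1(n1+1)/2 on jointly ranked data."""
    ranks = average_ranks(list(x) + list(y))
    w1 = sum(ranks[:len(x)])  # rank sum of the first sample
    return w1 - len(x) * (len(x) + 1) / 2

# Hypothetical data for illustration only.
x = [12, 15, 11, 18, 14]
y = [10, 14, 9, 13]
u1 = u_by_ranks(x, y)
u2 = len(x) * len(y) - u1
print(u1, u2, min(u1, u2))
```

Both functions should agree on any input; the rank-based route is usually preferred for larger samples because it avoids examining all n₁n₂ pairs.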

Under the null hypothesis of stochastic equality, the distribution of U₁ is symmetric about n₁n₂/2. The mean and variance of U₁ are:

Expected Value and Variance of U (with Tie Correction)

E(U) = n₁n₂ / 2

Var(U) = (n₁n₂/12) × [n₁ + n₂ + 1 − Σⱼ(tⱼ³ − tⱼ) / ((n₁+n₂)(n₁+n₂−1))]

Without ties: Var(U) = n₁n₂(n₁+n₂+1) / 12

tⱼ is the number of observations in the j-th group of tied values. The correction term reduces the variance when ties are present.
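As a rough illustration of the tie correction, the sketch below computes Var(U) from the formula above by counting how often each value occurs in the pooled data. The function name and the data are assumptions made for this example.

```python
from collections import Counter

def u_variance(x, y):
    """Tie-corrected variance of U under the null hypothesis."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Size of each group of tied values in the pooled sample;
    # untied values have t = 1 and contribute t**3 - t = 0.
    tie_sizes = Counter(list(x) + list(y)).values()
    tie_term = sum(t**3 - t for t in tie_sizes)
    return (n1 * n2 / 12) * (n + 1 - tie_term / (n * (n - 1)))

# Hypothetical data: the value 14 appears once in each group.
x = [12, 15, 11, 18, 14]
y = [10, 14, 9, 13]
print(u_variance(x, y))            # with one tied pair
print(u_variance([1, 2, 3], [4, 5, 6]))  # no ties: reduces to n1*n2*(N+1)/12
```

With no ties the correction term vanishes and the function returns the uncorrected variance, which is a convenient sanity check.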

The Z approximation for large samples is obtained by standardizing U:

Asymptotic Test Statistic

Z = (U − E(U)) / √Var(U)

With continuity correction: Z = (|U − E(U)| − 0.5) / √Var(U)

Z is approximately standard normally distributed as min(n₁, n₂) → ∞. The continuity correction improves accuracy for moderate sample sizes.
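The standardization can be sketched as follows, using only the Python standard library. One assumption differs slightly from the formula as printed: the continuity correction here shrinks the difference toward zero while keeping its sign, so that Z still indicates direction. The function name and input values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_test(u, n1, n2, var_u, continuity=True):
    """Return (z, two-tailed p) for the normal approximation to U."""
    diff = u - n1 * n2 / 2  # U minus its null expectation
    if continuity:
        # Shrink |diff| by 0.5 (not below zero) but preserve the sign.
        shrunk = max(abs(diff) - 0.5, 0.0)
        diff = shrunk if diff >= 0 else -shrunk
    z = diff / sqrt(var_u)
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, min(p_two_tailed, 1.0)

# Hypothetical inputs: U1 = 15.5 with n1 = 5, n2 = 4 and the
# tie-corrected variance 595/36 for that illustrative data set.
z, p = z_test(15.5, 5, 4, 595 / 36)
print(round(z, 3), round(p, 3))
```

Samples this small would normally be handled by the exact test; they are used here only to keep the arithmetic easy to follow.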

Exact versus Asymptotic Methods

The original Mann-Whitney paper derived the exact distribution of U under the null hypothesis. When the null hypothesis is true and all observations come from the same continuous distribution, each of the C(n₁+n₂, n₁) possible rank arrangements is equally likely. The number of arrangements yielding U₁ = u is denoted w(u, n₁, n₂) and satisfies the recursion:

Exact Distribution Recursion (Mann and Whitney, 1947)

w(u, n₁, n₂) = w(u − n₂, n₁−1, n₂) + w(u, n₁, n₂−1)

Boundary: w(0, n₁, n₂) = 1 for all n₁, n₂ ≥ 0

Boundary: w(u, n₁, n₂) = 0 for u < 0, and w(u, 0, n₂) = w(u, n₁, 0) = 0 for u > 0

The recursion follows from considering whether the observation with the highest overall rank belongs to group 1 (contributing n₂ to U₁) or group 2 (contributing 0 to U₁).

The exact one-tailed p-value is then:

Exact p-value

p(one-tailed) = Σ_{k=0}^{U_obs} w(k, n₁, n₂) / C(n₁+n₂, n₁)

p(two-tailed) = 2 × p(one-tailed), not exceeding 1

U_obs is taken as min(U₁, U₂) when computing the two-tailed p-value. This calculator provides exact p-values when both n₁ ≤ 20 and n₂ ≤ 20.

For sample sizes beyond this range, the central limit theorem ensures that the distribution of U converges to a normal distribution, and the asymptotic Z test provides accurate p-values. The tie-corrected variance formula should be used whenever ties are present in the data, as the uncorrected variance overestimates the true variance under ties, resulting in a conservative test.
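The recursion and the exact p-value formulas in this section can be sketched with memoization. This is a minimal illustration rather than the calculator's actual implementation; it assumes no ties, so the observed U is an integer, and the function names are hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def w(u, n1, n2):
    """Number of rank arrangements of n1 X's and n2 Y's with U1 == u."""
    if u < 0:
        return 0
    if u == 0:
        return 1          # all X's below all Y's
    if n1 == 0 or n2 == 0:
        return 0          # u > 0 is impossible with an empty group
    # Largest observation is an X (adds n2 to U1) or a Y (adds 0).
    return w(u - n2, n1 - 1, n2) + w(u, n1, n2 - 1)

def exact_p(u_obs, n1, n2):
    """Return (one-tailed, two-tailed) exact p for u_obs = min(U1, U2)."""
    total = comb(n1 + n2, n1)
    lower_tail = sum(w(k, n1, n2) for k in range(int(u_obs) + 1))
    one_tailed = lower_tail / total
    return one_tailed, min(2 * one_tailed, 1.0)

# The counts over all possible U values must total C(n1+n2, n1).
print(sum(w(u, 5, 4) for u in range(5 * 4 + 1)), comb(9, 5))
print(exact_p(0, 3, 3))
```

For n₁ = n₂ = 3, only one of the 20 arrangements gives U = 0, so the smallest attainable two-tailed p-value is 0.1; this shows why very small samples cannot reach conventional significance levels.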

Assumptions of the Test

Independence of Observations

The two samples must be independent of each other, and observations within each sample must be independent. The test does not accommodate paired or clustered data; the Wilcoxon signed-rank test is appropriate for paired designs.

Ordinal or Continuous Measurement Scale

The dependent variable must be measured on at least an ordinal scale, meaning that observations can be meaningfully ranked. Nominal-scale data (unordered categories) do not satisfy this requirement.

Continuous Distribution (for Exact Test)

The exact test derivation assumes that the underlying distributions are continuous, which implies zero probability of tied observations. In practice, ties are handled by the average-rank convention and the variance correction.

Same Distributional Shape (for Median Interpretation)

Interpreting the test as a comparison of medians requires the location shift assumption: the two distributions must have the same shape and differ only in their location parameter. This assumption is not required for the stochastic dominance interpretation.

Effect Size Measurement

Statistical significance alone provides no information about the practical magnitude of an observed difference. Effect size measures are essential complements to hypothesis tests, and several are available for the Mann-Whitney U test.

Measure	Formula	Interpretation	Benchmarks (Cohen, 1988)
Rank-biserial correlation (r)	r = Z / √N, where N = n₁ + n₂	Direction and magnitude of group difference on the rank scale; ranges from −1 to +1	0.10 small, 0.30 medium, 0.50 large
Common language effect size (CL)	CL = U₁ / (n₁ × n₂)	Probability that a randomly drawn observation from group 1 exceeds a randomly drawn observation from group 2; ranges from 0 to 1	0.56 small, 0.64 medium, 0.71 large (Ruscio, 2008)
Eta-squared analogue (η²)	η² = r²	Proportion of variance in the ranks accounted for by group membership; analogous to r² in correlation	0.01 small, 0.06 medium, 0.14 large

Among these measures, r = Z / √N is the most widely reported in published research, and reporting it is consistent with the APA Publication Manual's (7th edition) call for effect sizes. It is often labelled the rank-biserial correlation, although strictly the rank-biserial correlation is (U₁ − U₂) / (n₁n₂), which equals 2 × CL − 1. The common language effect size has the advantage of direct interpretability in probability terms, making it accessible to non-statistician audiences. Researchers reporting Mann-Whitney results should include at least one effect size measure alongside the test statistic and p-value.
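The measures in the table can be computed together from U₁, the sample sizes, and Z. The sketch below is illustrative only: the function name and inputs are assumptions, and it additionally returns (U₁ − U₂)/(n₁n₂), the strict rank-biserial correlation, which is not listed in the table but follows directly from the common language effect size.

```python
from math import sqrt

def effect_sizes(u1, n1, n2, z):
    """Effect sizes for a Mann-Whitney comparison of two groups."""
    n = n1 + n2
    u2 = n1 * n2 - u1
    r = z / sqrt(n)                      # r = Z / sqrt(N)
    cl = u1 / (n1 * n2)                  # probability of superiority
    return {
        "r": r,
        "CL": cl,
        "eta_squared": r**2,
        # Strict rank-biserial correlation; equals 2 * CL - 1.
        "rank_biserial": (u1 - u2) / (n1 * n2),
    }

# Hypothetical values for illustration.
es = effect_sizes(u1=15.5, n1=5, n2=4, z=1.23)
for name, value in es.items():
    print(f"{name}: {value:.3f}")
```

Passing a signed Z keeps the sign of r aligned with the direction of the group difference.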

Comparative Methodology

Statistical Power and the Mann-Whitney Test

The statistical power of the Mann-Whitney U test relative to its parametric counterpart, the independent samples t-test, is a matter of both theoretical elegance and practical importance. The relevant quantity is the asymptotic relative efficiency (ARE), defined as the limiting ratio of the sample sizes required by the two tests to achieve the same power against the same alternative. When both tests are valid and the data are drawn from normal distributions, the ARE of the Mann-Whitney test relative to the t-test is 3/π ≈ 0.955, so the Mann-Whitney test requires π/3 ≈ 1.047 times as many observations as the t-test to achieve the same power. Stated differently, the Mann-Whitney test is approximately 95.5% as efficient as the t-test under normality. This result, due to Pitman (1948), established that the Mann-Whitney test sacrifices very little statistical efficiency even in the best-case scenario for the t-test.

When the data are drawn from non-normal distributions, the Mann-Whitney test can be substantially more powerful than the t-test. For data from a Laplace (double exponential) distribution, the ARE is 1.5; for a logistic distribution, it is π²/9 ≈ 1.097. The t-test is more powerful than the Mann-Whitney test only for normal data and certain other light-tailed symmetric distributions. For heavy-tailed, skewed, or contaminated distributions, which are common in behavioral and social science research, the Mann-Whitney test is the superior choice.

The conventional wisdom that non-parametric tests are always less powerful than their parametric counterparts is false. The Mann-Whitney test is about 95.5% as efficient as the t-test under normality and is more efficient for many other distributions commonly encountered in behavioral research.

Sample size planning for the Mann-Whitney test typically proceeds through simulation or by adjusting a t-test sample size calculation by the ARE. G*Power supports power calculations for the Wilcoxon-Mann-Whitney test using an ARE-based method; in R, a t-test calculation (for example, with the pwr package) can be adjusted in the same way, or power can be estimated by simulation. Researchers conducting a priori power analysis should justify their effect size estimate from prior literature, preferably using the effect size r rather than Cohen's d, as r is the native effect size for the Mann-Whitney test.
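An ARE-based adjustment can be sketched in a few lines. The example below is a rough planning aid under stated assumptions, not a substitute for dedicated software: it uses the normal-approximation formula for a two-sided, two-sample t-test with equal group sizes, takes a standardized mean difference d as input, and divides the resulting sample size by the ARE under normality (3/π). All function names are hypothetical.

```python
from math import ceil, pi
from statistics import NormalDist

ARE_NORMAL = 3 / pi  # ARE of Mann-Whitney relative to t under normality

def n_per_group_t(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample t-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d**2

def n_per_group_mw(d, alpha=0.05, power=0.80, are=ARE_NORMAL):
    """Inflate the t-test sample size by 1/ARE and round up."""
    return ceil(n_per_group_t(d, alpha, power) / are)

print(ceil(n_per_group_t(0.5)), n_per_group_mw(0.5))
```

A different `are` value can be passed when a non-normal distribution is expected; simulation remains the safer route when the distributional shape is uncertain.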

Statistical Calculator

Mann-Whitney U Test Calculator

The calculator is designed to produce results consistent with SPSS, SAS, and R output. Exact p-values are computed for n₁, n₂ ≤ 20 using the Mann-Whitney recursion; the tie-corrected normal approximation is used for larger samples.

Academic Writing Standards

Guidelines for Reporting Mann-Whitney U Test Results

APA 7th Edition Reporting

The APA Publication Manual (7th edition, 2020) requires that reports of non-parametric tests include the test statistic, degrees of freedom (if applicable), p-value, and effect size. For the Mann-Whitney U test, the recommended elements are the U statistic, the standardized test statistic (Z), the p-value to three decimal places, and the effect size r. Medians and their ranges or interquartile ranges for each group should be reported in text or in a table.

APA 7th Edition Template

A Mann-Whitney U test was conducted to examine differences between [Group 1 label] and [Group 2 label]. Results indicated that [Group 1] (Mdn = XX.XX, IQR = [XX.XX, XX.XX], n = X) [differed significantly from / did not differ significantly from] [Group 2] (Mdn = XX.XX, IQR = [XX.XX, XX.XX], n = X), U = XX.XX, z = X.XX, p = .XXX, r = .XX.

Report p-values to three decimal places. If p < .001, write "p < .001." Omit the leading zero before the decimal point for p-values and effect sizes (write ".05" not "0.05"). Report U to two decimal places when ties produce a non-integer U.
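The formatting rules above (three decimals for p, "p < .001" for very small values, no leading zero for statistics bounded by 1) are easy to get wrong by hand, so a small helper may be useful. The sketch below is illustrative; the function names are assumptions, and the example values are the ones used in the reporting examples on this page.

```python
def apa_decimal(value, places):
    """Format a statistic bounded by 1 without its leading zero."""
    text = f"{value:.{places}f}"
    if text.startswith("0."):
        return text[1:]            # "0.34" -> ".34"
    if text.startswith("-0."):
        return "-" + text[2:]      # "-0.34" -> "-.34"
    return text

def apa_p(p):
    """Format a p-value: exact to three decimals, or '< .001'."""
    return "p < .001" if p < 0.001 else f"p = {apa_decimal(p, 3)}"

def apa_report(u, z, p, r):
    """Assemble the statistics portion of an APA-style sentence."""
    return f"U = {u:.2f}, z = {z:.2f}, {apa_p(p)}, r = {apa_decimal(r, 2)}"

print(apa_report(142.5, 2.34, 0.019, 0.34))
```

The group descriptives (medians, interquartile ranges, sample sizes) would be added around this string following the template above.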

Chicago 17th Edition Reporting

Chicago style does not specify a format for reporting statistical results as precisely as APA. In humanities and social science manuscripts following Chicago style, statistical results are typically presented in parentheses within the running text. The elements are the same as in APA style, but p-values may include the leading zero, and sources are typically cited in numbered footnotes rather than in-text author-date citations.

MLA 9th Edition Reporting

MLA style is used primarily in humanities disciplines and provides minimal guidance on statistical reporting. Researchers in fields that primarily use MLA (literary studies, linguistics) who report statistical findings typically adopt a modified version of APA numerical reporting conventions within MLA's general prose style. A Works Cited entry should be included for any statistical software used.

Key Elements Required in All Formats

Element	What to Report	Example
Group descriptives	Median and interquartile range (or range) for each group, sample sizes	Mdn = 14.50, IQR = [11.25, 18.75], n = 24
U statistic	The obtained U value (use U, not U₁ or U₂)	U = 142.50
Test statistic	The Z approximation (or exact statistic if used)	z = 2.34
P-value	Exact p to three decimal places, or "< .001"	p = .019 (two-tailed)
Effect size	Rank-biserial r with interpretation	r = .34 (medium effect)
Test type	Specify whether exact or asymptotic, and whether ties are present	"asymptotic, with tie correction"

References

American Psychological Association. (2020). Publication manual of the American Psychological Association (7th ed.). American Psychological Association.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Fay, M. P., and Proschan, M. A. (2010). Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Statistics Surveys, 4, 1–39. https://doi.org/10.1214/09-SS051

Hollander, M., Wolfe, D. A., and Chicken, E. (2013). Nonparametric statistical methods (3rd ed.). John Wiley and Sons.

Mann, H. B., and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60. https://doi.org/10.1214/aoms/1177730491

Pitman, E. J. G. (1948). Lecture notes on nonparametric statistics. Columbia University.

Ruscio, J. (2008). A probability-based measure of effect size: Robustness to base rates and other factors. Psychological Methods, 13(1), 19–30. https://doi.org/10.1037/1082-989X.13.1.19

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968

Zieffler, A. S., Harring, J. R., and Long, J. D. (2011). Comparing groups: Randomization and bootstrap methods using R. John Wiley and Sons.