Sampling Survey (抽样调查)

Sampling Design Notes

Jack H

June 5, 2025


Introduction

Coefficient of Variation

$$\mathrm{cv} = \frac{\sqrt{\mathrm{Var}(\hat\theta)}}{E(\hat\theta)}$$

It measures the relative dispersion of an estimator: the standard deviation expressed as a fraction of the mean.

Mean Square Error

$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + (\mathrm{bias}(\hat{\theta}))^2$$

Point Estimation Error Control: (V, C)
When $\hat\theta$ is unbiased, $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$. We control the error by requiring, for a given variance upper bound $V$ or a given cv upper bound $C$, that $$\mathrm{Var}(\hat\theta) \leq V \quad \text{or} \quad \mathrm{cv}(\hat\theta) \leq C.$$

Margin of Error:
If $\hat\theta$ is a point estimator of $\theta$, then for a given $\alpha \in (0,1)$, if $$P(|\hat\theta - \theta| \leq d) = 1-\alpha,$$ we call $d$ the margin of error of $\hat\theta$ at confidence level $1-\alpha$.
Relative Margin of Error:
If $$P\left(\frac{|\hat\theta - \theta|}{\theta} \leq r\right) = 1-\alpha,$$ we call $r$ the relative margin of error of $\hat\theta$ at confidence level $1-\alpha$.

Error Limit (d, r) Estimation Control:
For a given $\alpha \in (0,1)$ and a given absolute margin of error $d$ or relative margin of error $r$, require $$P(|\hat\theta - \theta| \leq d) = 1-\alpha \quad \text{or} \quad P\left(\frac{|\hat\theta - \theta|}{\theta} \leq r\right) = 1-\alpha.$$

Assumptions in this course:

  1. Consistency: $\hat\theta_n \xrightarrow{P} \theta$ as $n \to \infty$.
     Definition: if $\hat\theta$ is a consistent estimator of $\theta$, then for any $\epsilon > 0$,

$$P(|\hat\theta - \theta| > \epsilon) \to 0 \text{ as } n \to \infty$$

  2. Asymptotic normality (together with consistency: CAN): $$\frac{\hat\theta_n - E(\hat\theta_n)}{\sqrt{\mathrm{Var}(\hat\theta_n)}} \xrightarrow{d} N(0,1)$$
     Denote $\sqrt{\mathrm{Var}(\hat\theta_n)}$ by $\mathrm{sd}(\hat\theta_n)$.

In this course you will need to verify that an estimator is UE or AUE.
Theorem: If $\hat\theta$ is consistent and asymptotically normal (CAN), and $\hat\theta$ is an unbiased estimator (UE) or asymptotically unbiased estimator (AUE), then the distribution of $\hat\theta$ is approximately normal:

$$\frac{\hat\theta - \theta}{\sqrt{\mathrm{Var}(\hat\theta)}} = \frac{\hat\theta - \theta}{\mathrm{sd}(\hat\theta)} \xrightarrow{d} N(0,1)\quad \text{as } n \to \infty$$

Therefore $$P\left(\left|\frac{\hat\theta - \theta}{\mathrm{sd}(\hat\theta)}\right| \leq z_{\alpha/2}\right) \approx 1-\alpha$$

$$d = z_{\alpha/2}\,\mathrm{sd}(\hat\theta), \qquad r = z_{\alpha/2}\,\frac{\mathrm{sd}(\hat\theta)}{\theta}$$

$$\hat\theta_L = \hat\theta - z_{\alpha/2}\,\mathrm{sd}(\hat\theta), \qquad \hat\theta_R = \hat\theta + z_{\alpha/2}\,\mathrm{sd}(\hat\theta)$$

If $P(\hat\theta_L \leq \theta \leq \hat\theta_R) = 1-\alpha$, then $\hat\theta_L$ and $\hat\theta_R$ are the endpoints of a $(1-\alpha)$ confidence interval for $\theta$.
When $n$ is large, $$P(\theta \in [\hat\theta \pm z_{\alpha/2}\,\mathrm{sd}(\hat\theta)]) \approx 1-\alpha$$

In ci.r

```r
conf.interval=function(para.hat, SD.hat, alpha)
```
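Only the signature of conf.interval appears in the notes; the body below is an assumed sketch of the large-sample normal interval derived above.

```r
# Sketch of ci.r's conf.interval (signature from the notes; body assumed):
# a large-sample normal confidence interval theta-hat +/- z_{alpha/2} * sd-hat.
conf.interval <- function(para.hat, SD.hat, alpha) {
  z <- qnorm(1 - alpha / 2)            # upper alpha/2 normal quantile
  c(lower = para.hat - z * SD.hat,
    upper = para.hat + z * SD.hat)
}

conf.interval(para.hat = 10, SD.hat = 2, alpha = 0.05)
# lower ~ 6.08, upper ~ 13.92
```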

Simple Random Sampling

In this course, we consider picking $n$ units out of a population of $N$ without replacement; each of the $C_N^n$ possible samples has probability $p = 1/C_N^n$.

In srs sampling.r

```r
## simple random sampling without replacement
mysrs=sample(1:N, n)
print(mysrs)
## simple random sampling with replacement
mysrs=sample(1:N, n, replace = TRUE)
print(mysrs)
```

Mean:

$$\bar Y = \frac{1}{N} \sum_{i=1}^N Y_i$$

Total:

$$Y_T = N \bar Y$$

Variance:

$$S^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar Y)^2$$

Estimation of Population Mean $\bar Y$

  1. Point estimation:

$$\bar y = \frac{1}{n} \sum_{i=1}^n y_i$$

  2. $\bar y$ is an unbiased estimator of $\bar Y$:

$$\mathrm E(\bar y) = \bar Y \quad (\text{UE})$$

  3. Variance of the estimator:

$$\mathrm{Var}(\bar y) = \frac{1-f}{n}S^2$$

where $f = \frac{n}{N}$ is the sampling fraction and $S^2$ is the (unknown) population variance of $Y$.

  4. Estimated variance:

$$\widehat{\mathrm{Var}}(\bar y) = \frac{1-f}{n} s^2$$

where

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2$$

  5. Confidence interval:

$$\left[ \bar y \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar y)}\right]$$

$$d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar y)}, \qquad r = \frac{d}{\bar y}$$

In srs.r

```r
srs.mean=function(N, mysample, alpha)
```
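Only the signature of srs.mean appears in the notes; a minimal sketch of what it might compute, assuming the five formulas above (the element names ybar and ybar.var follow the later examples):

```r
# Hedged sketch of srs.r's srs.mean (body assumed): point estimate,
# estimated variance, margins of error, and CI for the mean under SRSWOR.
srs.mean <- function(N, mysample, alpha) {
  n <- length(mysample)
  f <- n / N                                   # sampling fraction
  ybar <- mean(mysample)
  ybar.var <- (1 - f) / n * var(mysample)      # (1-f)/n * s^2
  z <- qnorm(1 - alpha / 2)
  d <- z * sqrt(ybar.var)                      # absolute margin of error
  list(ybar = ybar, ybar.var = ybar.var,
       ci = c(ybar - d, ybar + d), d = d, r = d / ybar)
}
```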

Proof that $E(s^2) = S^2$

Step 1: Express $S^2$ and $s^2$

The population variance $S^2$ is defined as:

$$S^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2$$

The sample variance $s^2$ is:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2$$

Step 2: Expand the Sum of Squares

First, note that:

$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n \bar{y}^2$$

Step 3: Take the Expectation of $s^2$

$$E(s^2) = E\left( \frac{1}{n-1} \left[ \sum_{i=1}^n y_i^2 - n \bar{y}^2 \right] \right) = \frac{1}{n-1} \left[ \sum_{i=1}^n E(y_i^2) - n E(\bar{y}^2) \right]$$

Step 4: Compute $E(y_i^2)$ and $E(\bar{y}^2)$

For any $y_i$ (since $\mathrm{Var}(y_i) = \frac{N-1}{N}S^2$):

$$E(y_i^2) = \mathrm{Var}(y_i) + [E(y_i)]^2 = S^2 \left(1 - \frac{1}{N}\right) + \bar{Y}^2$$

For $\bar{y}$:

$$E(\bar{y}^2) = \mathrm{Var}(\bar{y}) + [E(\bar{y})]^2 = \frac{1-f}{n} S^2 + \bar{Y}^2$$

where $f = \frac{n}{N}$.

Step 5: Substitute Back into $E(s^2)$

$$E(s^2) = \frac{1}{n-1} \left[ n \left( S^2 \left(1 - \frac{1}{N}\right) + \bar{Y}^2 \right) - n \left( \frac{1-f}{n} S^2 + \bar{Y}^2 \right) \right]$$

Simplify:

$$E(s^2) = \frac{1}{n-1} \left[ n S^2 \left(1 - \frac{1}{N}\right) - (1 - f) S^2 \right] = \frac{1}{n-1} \left[ n S^2 - \frac{n}{N} S^2 - S^2 + \frac{n}{N} S^2 \right] = \frac{1}{n-1}(n-1)S^2 = S^2$$

Conclusion

Thus $E(s^2) = S^2$: the sample variance is an unbiased estimator of the population variance under SRSWOR.

Estimation of Population Total $Y_T = N\bar Y = \sum_{i=1}^N Y_i$

  1. Point estimation:

$$\hat y_T = N \bar y$$

  2. Unbiased estimator:

$$\mathrm E(\hat y_T) = N\,\mathrm E(\bar y) = N \bar Y = Y_T \quad (\text{UE})$$

  3. Variance of the estimator: $$\mathrm{Var}(\hat y_T) = N^2\,\mathrm{Var}(\bar y) = N^2 \frac{1-f}{n} S^2$$
  4. Estimated variance: $$\widehat{\mathrm{Var}}(\hat y_T) = N^2 \frac{1-f}{n} s^2$$
  5. Confidence interval:

$$\left[\hat y_T \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat y_T)}\right]$$

$$d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat y_T)}, \qquad r = \frac{d}{\hat y_T}$$

In srs.r

```r
srs.total=function(N, mysample, alpha)
```

Estimation of Population Proportion $P$

Define:

  • Population proportion: $P = \frac{1}{N}\sum_{i=1}^N Y_i = \bar Y$ (each $Y_i \in \{0,1\}$ is an attribute indicator)
  • Population total: $A = \sum Y_i = NP$
  • Population variance: $$S^2 = \frac{N}{N-1} P(1-P) = \frac{N}{N-1} PQ \quad \text{where } Q = 1-P$$

Let $a$ be the number of sampled units $y_1, \ldots, y_n$ that possess the attribute.

  1. Point estimation: $$\hat p = \bar y = \frac{a}{n}$$
  2. $\hat p$ is an unbiased estimator (UE) of $P$.
  3. Variance of the estimator: $$\mathrm{Var}(\hat p) = \frac{1-f}{n}\left(\frac{N}{N-1}PQ\right)$$
  4. Estimated variance: $$\widehat{\mathrm{Var}}(\hat p) = \frac{1-f}{n-1}\hat p \hat q$$

In srs.r

```r
srs.prop=function(N=NULL, n, event.num, alpha)
```
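A sketch of what srs.prop might do, assuming the formulas above (only the signature is given in the notes; treating a NULL population size N as "N very large", i.e. f = 0, is an assumption):

```r
# Hedged sketch of srs.r's srs.prop (body assumed): estimate a population
# proportion from the event count, with variance (1-f)/(n-1) * p-hat*q-hat.
srs.prop <- function(N = NULL, n, event.num, alpha) {
  p.hat <- event.num / n
  f <- if (is.null(N)) 0 else n / N          # f = 0 when N unknown/large
  p.var <- (1 - f) / (n - 1) * p.hat * (1 - p.hat)
  z <- qnorm(1 - alpha / 2)
  d <- z * sqrt(p.var)
  list(p.hat = p.hat, p.var = p.var, ci = c(p.hat - d, p.hat + d))
}
```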

Estimation of Population Total $A$

  1. Point estimation: $$\hat A = N\bar y = N\hat p$$
  2. $\hat A$ is an unbiased estimator (UE) of $A$.
  3. $$\mathrm{Var}(\hat A) = N^2 \frac{1-f}{n} \frac{N}{N-1} PQ$$
  4. $$\widehat{\mathrm{Var}}(\hat A) = N^2 \frac{1-f}{n} \frac{n}{n-1} \hat p \hat q$$

In srs.r

```r
srs.num=function(N=NULL, n, event.num, alpha)
```

Determining the Sample size

The sample size is determined by the required accuracy:

$$(V, C, d, r) \implies n_{\min}$$

  • $V$: variance upper bound
  • $C$: cv upper bound
  • $d$: absolute margin of error upper bound
  • $r$: relative margin of error upper bound

Sample Size $n_{\min}$ for Estimating Population Mean $\bar Y$

Step 1: Calculate $n_0$. Here $S^2$ and $\bar Y$ are taken from historical data.

$$n_0 = \frac{S^2}{V} = \begin{cases} \frac{S^2}{V} & V\\ \frac{S^2}{C^2\bar Y^2} & C = \sqrt V/\bar Y \\ \frac{z_{\alpha/2}^2 S^2}{d^2} & d = z_{\alpha/2}\sqrt V\\ \frac{z_{\alpha/2}^2 S^2}{r^2 \bar Y^2} & r = z_{\alpha/2}\sqrt V/\bar Y \end{cases}$$

Step 2:

$$n_{\min} = \begin{cases}\frac{n_0}{1+ \frac{n_0}{N}} & \text{given a reasonable } N \\ n_0 & \text{when } N \text{ is very large}\end{cases}$$

In srs size.r

```r
size.mean=function(N=NULL, Mean.his=NULL, Var.his, method, bound, alpha)
```
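The two steps above can be sketched as one function. Only the signature appears in the notes; the body below is an assumed implementation of the $n_0$ cases and the finite-population correction:

```r
# Hedged sketch of srs size.r's size.mean (body assumed):
# Step 1 computes n0 from the chosen bound, Step 2 applies n0/(1 + n0/N).
size.mean <- function(N = NULL, Mean.his = NULL, Var.his, method, bound,
                      alpha = NULL) {
  n0 <- switch(method,
    V = Var.his / bound,
    C = Var.his / (bound^2 * Mean.his^2),
    d = qnorm(1 - alpha / 2)^2 * Var.his / bound^2,
    r = qnorm(1 - alpha / 2)^2 * Var.his / (bound^2 * Mean.his^2))
  n <- if (is.null(N)) n0 else n0 / (1 + n0 / N)   # finite-population correction
  list(size = ceiling(n), n0 = n0)
}
```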

Sample Size for Estimating Proportion $P$

Here $P$ and $Q = 1-P$ are taken from historical data.

$$n_0 = \frac{PQ}{V} = \begin{cases} \frac{PQ}{V} & V\\ \frac{Q}{C^2P} & C \\ \frac{z_{\alpha/2}^2 PQ}{d^2} & d\\ \frac{z_{\alpha/2}^2 Q}{r^2 P} & r \end{cases}$$

$$n_{\min} = \begin{cases} \frac{n_0}{1 + \frac{n_0-1}{N}} & \text{given } N\\ n_0 & N \gg n_0 \end{cases}$$

In srs size.r

```r
size.prop=function(N=NULL, Prop.his, method, bound, alpha)
```

Sample Size for Estimating Population Total $Y_T$

Use size.mean with adjusted inputs, applying the methods for the population mean $\bar Y$.
Bounding the total is equivalent to bounding $\bar Y$ with rescaled bounds:

$$\mathrm{Var}(\hat y_T) \leq V \iff \mathrm{Var}(\bar y) \leq \frac{V}{N^2}$$

$$\mathrm{CV}(\hat y_T) \leq C \iff \mathrm{CV}(\bar y) \leq C$$

$$\text{Absolute error}(\hat y_T) \leq d \iff \text{Absolute error}(\bar y) \leq \frac{d}{N}$$

$$\text{Relative error}(\hat y_T) \leq r \iff \text{Relative error}(\bar y) \leq r$$

Stratified Random Sampling

Stratified Random Sampling Formulas

| Concept | Population ($Y_{h1}, \ldots, Y_{hN_h}$) | Sample ($y_{h1}, \ldots, y_{hn_h}$) |
| --- | --- | --- |
| Size | $N_h$ ($\sum_{h=1}^L N_h = N$) | $n_h$ ($\sum_{h=1}^L n_h = n$) |
| Mean | $\overline{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$ | $\overline{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$ |
| Variance | $S_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \overline{Y}_h)^2$ | $s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \overline{y}_h)^2$ |
| Stratum weight / sampling fraction | $W_h = \frac{N_h}{N}$ | $f_h = \frac{n_h}{N_h}$ |

Estimation of Population Mean $\bar Y$

  1. $$\bar y_{st} = \sum_{h=1}^L W_h \bar y_h$$

  2. $$E(\bar y_{st}) = \bar Y \quad (\text{UE})$$

  3. $$\mathrm{Var}(\bar y_{st}) = \sum_{h=1}^L W_h^2 \frac{1-f_h}{n_h} S_h^2$$

  4. $$\widehat{\mathrm{Var}}(\bar y_{st}) = \sum_{h=1}^L W_h^2 \frac{1-f_h}{n_h} s_h^2$$

See stratified mean.r

```r
stra.srs.mean1=function(Nh, nh, yh, s2h, alpha)
stra.srs.mean2=function(Nh, mysample, stra.index, alpha)
```
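A sketch of the summary-statistics version, stra.srs.mean1, assuming the four formulas above (only the signatures appear in the notes):

```r
# Hedged sketch of stratified mean.r's stra.srs.mean1 (body assumed):
# stratified estimate of the population mean from per-stratum summaries.
stra.srs.mean1 <- function(Nh, nh, yh, s2h, alpha) {
  Wh <- Nh / sum(Nh)                    # stratum weights W_h = N_h / N
  fh <- nh / Nh                         # stratum sampling fractions
  ybar.st <- sum(Wh * yh)
  var.st <- sum(Wh^2 * (1 - fh) / nh * s2h)
  z <- qnorm(1 - alpha / 2)
  list(ybar.st = ybar.st, var.st = var.st,
       ci = ybar.st + c(-1, 1) * z * sqrt(var.st))
}
```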

Estimation of Population Total $Y_T$

  1. Total estimator:

$$\hat{y}_{st} = N \cdot \overline{y}_{st} = N \left( \sum_{h=1}^{L} W_h \overline{y}_h \right)$$

  2. Expected value of the estimator:

$$E(\hat{y}_{st}) = Y_T \quad (\text{UE})$$

  3. Variance of the estimator:

$$\operatorname{Var}(\hat{y}_{st}) = N^2 \left( \sum_{h=1}^{L} W_h^2 \cdot \frac{1 - f_h}{n_h} \cdot S_h^2 \right)$$

  4. Estimated variance:

$$\widehat{\operatorname{Var}}(\hat{y}_{st}) = N^2 \left( \sum_{h=1}^{L} W_h^2 \cdot \frac{1 - f_h}{n_h} \cdot s_h^2 \right)$$

See stratified mean.r


Estimation of Proportion

| Symbol | Population ($Y_{h1}, \ldots, Y_{hN_h}$) | Sample ($y_{h1}, \ldots, y_{hn_h}$) |
| --- | --- | --- |
| Size | $N_h$ ($N = \sum_{h=1}^L N_h$) | $n_h$ ($n = \sum_{h=1}^L n_h$) |
| Count with attribute | $A_h$ | $a_h$ |
| Proportion | $P_h = \frac{A_h}{N_h}$ | $\hat{p}_h = \frac{a_h}{n_h}$ |
| Variance | $S_h^2 = \frac{N_h}{N_h - 1} P_h Q_h$ | $s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h$ |
| Weight / sampling fraction | $W_h = \frac{N_h}{N}$ | $f_h = \frac{n_h}{N_h}$ |

Stratified Sampling Estimation of Population Proportion $P$

  1. Estimator of the population proportion:

$$\hat{p}_{st} = \sum_{h=1}^L W_h \hat{p}_h = \sum_{h=1}^L W_h \cdot \frac{a_h}{n_h}$$

  2. Expected value:

$$E(\hat{p}_{st}) = P \quad (\text{UE})$$

  3. Variance:

$$\mathrm{Var}(\hat{p}_{st}) = \sum_{h=1}^L W_h^2 \mathrm{Var}(\hat{p}_h) = \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{N_h}{N_h - 1} P_h Q_h \right)$$

  4. Estimated variance:

$$\widehat{\mathrm{Var}}(\hat{p}_{st}) = \sum_{h=1}^L W_h^2 \widehat{\mathrm{Var}}(\hat{p}_h) = \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h \right)$$

  5. Confidence interval, $d$, and $r$ follow from the estimated variance as before.

Stratified Sampling Estimation for Total $A$

  1. Estimator of the population total:

$$\hat{A}_{st} = N \sum_{h=1}^L W_h \hat{p}_h = N \sum_{h=1}^L W_h \cdot \frac{a_h}{n_h}$$

  2. Expected value:

$$E(\hat{A}_{st}) = A \quad (\text{UE})$$

  3. Variance:

$$\mathrm{Var}(\hat{A}_{st}) = N^2 \sum_{h=1}^L W_h^2 \mathrm{Var}(\hat{p}_h) = N^2 \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{N_h}{N_h - 1} P_h Q_h \right)$$

  4. Estimated variance:

$$\widehat{\mathrm{Var}}(\hat{A}_{st}) = N^2 \sum_{h=1}^L W_h^2 \widehat{\mathrm{Var}}(\hat{p}_h) = N^2 \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h \right)$$

  5. Confidence interval, $d$, and $r$ follow as before.

Determining Sample Size

When given nn, determine nhn_h for each stratum

Use

```r
strata.weight=function(Wh, S2h, Ch=NULL, allocation)
# returns wh
```

allocation = "Prop" or "Opt" or "Neyman"

Use

```r
strata.size=function(n, Wh, S2h, Ch=NULL, allocation)
# returns list(n=n, allocation=allocation, wh=wh, nh=ceiling(nh))
```

The sample size for each stratum, $n_h$, can be determined using different allocation methods. The general formula is

$$n_h = w_h \cdot n$$

where $w_h$ is the allocation weight of stratum $h$ and $n$ is the total sample size.

1. Proportional Allocation (Prop):

The stratum weight $ W_h $ is proportional to the stratum size $ N_h $:

$$W_h = \frac{N_h}{N}$$

Thus, the sample size for stratum $ h $ is:

$$n_h = \frac{N_h}{N} \cdot n$$

This method ensures that the sample size in each stratum is proportional to the stratum’s size in the population.


2. Optimal Allocation (Opt):

The stratum weight $ W_h $ is adjusted based on the stratum’s variability and cost. The formula is:

$$W_h = \frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^L N_h S_h/\sqrt{c_h}}$$

Thus, the sample size for stratum $ h $ is:

$$n_h = \frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^L N_h S_h/\sqrt{c_h}} \cdot n$$

This method minimizes the variance of the estimator by allocating more samples to strata with higher variability or lower costs.


3. Neyman Allocation:

The stratum weight $ W_h $ is adjusted based on the stratum’s variability. The formula is:

$$W_h = \frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^L N_h S_h/\sqrt{c_h}}$$

If the cost per unit is the same across all strata ($ c_h = c $), this simplifies to:

$$W_h = \frac{N_h S_h/\sqrt{c}}{\sum_{h=1}^L N_h S_h/\sqrt{c}} = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h}$$

Thus, the sample size for stratum $ h $ is:

$$n_h = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h} \cdot n$$

This method minimizes the variance of the estimator by allocating more samples to strata with higher variability.


Summary

  • Proportional Allocation: Simple and easy to implement, but does not account for variability.
  • Optimal Allocation: Minimizes variance by considering both variability and cost.
  • Neyman Allocation: A special case of optimal allocation when costs are equal across strata.

$$\boxed{\begin{aligned} &\text{Proportional Allocation: } n_h = \frac{N_h}{N} \cdot n \\ &\text{Optimal Allocation: } n_h = \frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^L N_h S_h/\sqrt{c_h}} \cdot n \\ &\text{Neyman Allocation: } n_h = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h} \cdot n \quad (\text{when } c_h = c) \end{aligned}}$$
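The three boxed allocation rules can be sketched as a working version of strata.weight (signature from the notes; body assumed; Ch, the per-unit cost, is needed only for "Opt"):

```r
# Hedged sketch of strata.weight (body assumed): allocation weights w_h
# for proportional, optimal, and Neyman allocation, normalized to sum to 1.
strata.weight <- function(Wh, S2h, Ch = NULL, allocation) {
  wh <- switch(allocation,
    Prop   = Wh,                         # w_h proportional to N_h/N
    Opt    = Wh * sqrt(S2h) / sqrt(Ch),  # more samples: high S_h, low cost
    Neyman = Wh * sqrt(S2h))             # Opt with equal costs
  wh / sum(wh)
}
```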

See stratified size.r


When given $V, C, d, r$ of $\bar Y$, determine $n$ and $n_h$

Use

```r
strata.mean.size=function(Nh, S2h, Ch=NULL, allocation, method, bound, Ybar=NULL, alpha=NULL)
```

Step 1: Calculate $w_h$ with the chosen allocation method; then $n_h = w_h n$.

$$w_h = \begin{cases}W_h & \text{Prop} \\ \frac{W_h S_h/\sqrt{C_h}}{\sum_h W_h S_h/\sqrt{C_h}} & \text{Opt} \\ \frac{W_h S_h}{\sum_h W_h S_h} & \text{Neyman}\end{cases}$$

Step 2: Calculate $n_{\min}$:

$$n_{\min} = \frac{\sum_h W_h^2 S_h^2/w_h}{V + \frac{1}{N}\sum_h W_h S_h^2}$$

where

$$V = \begin{cases} V & V\\ C^2\bar Y^2 & C \\ (d/z_{\alpha/2})^2 & d \\ (r\bar Y/z_{\alpha/2})^2 & r\end{cases}$$

$S_h^2$ and $\bar Y$ are taken from historical data.

Step 3: $$n_{h,\min} = w_h n_{\min}$$

Given $V, C, d, r$ of $P$, determine $n$ and $n_h$

Use

```r
strata.prop.size=function(Nh, Ph, Ch=NULL, allocation, method, bound, Ybar=NULL, alpha=NULL)
```

Here

$$S_h^2 = \frac{N_h}{N_h - 1} P_h Q_h$$

Given $V, C, d, r$ of Total $Y_T$, determine $n$ and $n_h$

Adjust the bound input: a bound on $Y_T$ becomes a bound on $\bar Y$ as in the table below.
Use

```r
strata.mean.size=function(Nh, S2h, Ch=NULL, allocation, method, bound, Ybar=NULL, alpha=NULL)
```

| | V | C | d | r |
| --- | --- | --- | --- | --- |
| Population total $(Y_T)\ \hat y_T$ | $V$ | $C$ | $d$ | $r$ |
| Population mean $(\bar Y)\ \bar y_{st}$ | $\frac{V}{N^2}$ | $C$ | $\frac{d}{N}$ | $r$ |

Given $V, C, d, r$ of Total $A$, determine $n$ and $n_h$

Adjust the bound input: a bound on $A$ becomes a bound on $P$ as in the table below.
Use

```r
strata.prop.size=function(Nh, Ph, Ch=NULL, allocation, method, bound, Ybar=NULL, alpha=NULL)
```

| | V | C | d | r |
| --- | --- | --- | --- | --- |
| Population total $(A)\ \hat a$ | $V$ | $C$ | $d$ | $r$ |
| Population mean $(P)\ \hat p_{st}$ | $\frac{V}{N^2}$ | $C$ | $\frac{d}{N}$ | $r$ |

Design Efficiency - Comparison of Sampling Methods

Design efficiency compares the variance of a given design with that of simple random sampling at the same sample size; it is defined as the ratio

$$\text{Deff} = \frac{\mathrm{Var}(\hat\theta_p)}{\mathrm{Var}(\hat\theta_{SRS})}$$

Ratio Estimation and Regression Estimation

Notations
For population use UPPER CASE characters and for sample use lower case.

$$\begin{aligned} & S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \quad &&\text{($Y$ population variance)} \\ & S_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 \quad &&\text{($X$ population variance)} \\ & S_{yx} = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X}) \quad &&\text{($Y, X$ population covariance)} \\ & \rho = \frac{S_{yx}}{\sqrt{S_y^2 S_x^2}} = \frac{S_{yx}}{S_y S_x} \quad &&\text{($Y, X$ population correlation)} \\ & C_y^2 = \frac{S_y^2}{\bar{Y}^2} \quad &&\text{($Y$ relative variance)}\\ & C_x^2 = \frac{S_x^2}{\bar{X}^2} \quad &&\text{($X$ relative variance)}\\ & C_{yx} = \rho \cdot \frac{S_y}{\bar{Y}} \cdot \frac{S_x}{\bar{X}} \quad &&\text{($Y, X$ relative covariance)} \end{aligned}$$

Estimation of Ratio

The ratio is defined as

$$R = \frac{\bar Y}{\bar X} = \frac{Y_T}{X_T}$$

  1. Point estimation: $$\hat R = \frac{\bar y}{\bar x}$$
  2. AUE: $$\lim_{n\to \infty} E(\hat R) = R$$
  3. Variance of the estimator

     Proposition: $$MSE(\hat R) \overset{AUE}{\simeq} \mathrm{Var}(\hat R) \overset{n\to \infty}{\simeq} \frac{1-f}{n\bar X^2} \cdot \frac{1}{N-1}\sum_{i=1}^N (Y_i - RX_i)^2$$
     where $$\begin{aligned}S_g^2 &= \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - R X_i)^2\\ &= S_y^2 + R^2 S_x^2 - 2R S_{yx}\\ &= \bar Y^2 (C_y^2 + C_x^2 - 2C_{yx})\end{aligned}$$

  4. Estimation of variance

     Method 1 (when $\bar X$ is known):

$$\widehat{\mathrm{Var}}_1(\hat{R}) = \frac{1-f}{n} \cdot \frac{1}{\bar{X}^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 = \frac{1-f}{n} \cdot \frac{1}{\bar{X}^2} \left(s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}\right)$$

     Method 2 (when $\bar X$ is unknown, use the sample mean $\bar x$):

$$\widehat{\mathrm{Var}}_2(\hat{R}) = \frac{1-f}{n} \cdot \frac{1}{\bar{x}^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 = \frac{1-f}{n} \cdot \frac{1}{\bar{x}^2} \left(s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}\right)$$

Note: when $\bar X$ is known, either method may be used; when $\bar X$ is unknown, use method 2.

  5. Confidence interval:

$$\left[\hat R \pm z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat R)}\right]$$

with $\widehat{\mathrm{Var}}$ taken from method 1 or method 2.

Use ratio.r

```r
ratio = function(y.sample, x.sample, N=NULL, auxiliary=FALSE, Xbar=NULL, alpha)
# when auxiliary = FALSE, Xbar = NULL; when auxiliary = TRUE, supply Xbar
```
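A sketch of what the ratio function might compute, assuming the formulas above (only the signature is in the notes; the auxiliary flag is taken to switch between method 1, denominator $\bar X^2$, and method 2, denominator $\bar x^2$):

```r
# Hedged sketch of ratio.r's ratio (body assumed): R-hat with its
# estimated variance, using Xbar when known (method 1) or xbar (method 2).
ratio <- function(y.sample, x.sample, N = NULL, auxiliary = FALSE,
                  Xbar = NULL, alpha) {
  n <- length(y.sample)
  f <- if (is.null(N)) 0 else n / N
  R.hat <- mean(y.sample) / mean(x.sample)
  ss <- sum((y.sample - R.hat * x.sample)^2) / (n - 1)  # residual variance
  denom2 <- if (auxiliary) Xbar^2 else mean(x.sample)^2
  R.var <- (1 - f) / n * ss / denom2
  z <- qnorm(1 - alpha / 2)
  list(R.hat = R.hat, R.var = R.var,
       ci = R.hat + c(-1, 1) * z * sqrt(R.var))
}
```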

Ratio Estimation of Population Mean $\bar Y$ and Total $Y_T$

SRSF (Simple Random Sampling with Fixed Ratio Estimation) of Population Mean $\bar Y$

  1. Estimator of the population mean:

$$\bar{y}_R = \frac{\bar{y}}{\bar{x}} \cdot \bar{X} = \hat{R} \cdot \bar{X}$$

  2. Expected value of the estimator:

$$E(\bar{y}_R) = E(\hat{R}) \cdot \bar{X} \approx R \cdot \bar{X} = \bar{Y} \quad (\text{AUE})$$

  3. Variance of the estimator:

$$\mathrm{Var}(\bar{y}_R) = \bar{X}^2 \cdot \mathrm{Var}(\hat{R})$$

  4. Estimated variance of the estimator:

$$\widehat{\mathrm{Var}}(\bar{y}_R) = \bar{X}^2 \cdot \widehat{\mathrm{Var}}_1(\hat{R})$$

  5. Confidence interval (rescale the endpoints of the CI for $\hat R$ by $\bar X$):

$$\text{CI} = \left[ \bar{X} \cdot \text{left}, \ \bar{X} \cdot \text{right} \right]$$

In ratio.r use:

```r
ratio.mean=function(y.sample, x.sample, N=NULL, Xbar, alpha)
```

Example:

```r
mean.simple.result=srs.mean(N, y.sample, alpha)

mean.ratio.result=ratio.mean(y.sample, x.sample, N, Xbar, alpha)

var.result=c(mean.simple.result$ybar.var, mean.ratio.result$ybarR.var)
deff.result=deff(var.result)

rownames(deff.result)=c("Simple", "Ratio")
print(deff.result)
```

SRSF Estimation of Population Total $ Y_T $

  1. Estimator of the population total:

$$\hat{Y}_R = N \cdot \bar{y}_R$$

  2. Approximately unbiased estimator (AUE):

$$E(\hat{Y}_R) \approx Y_T$$

  3. Variance of the estimator:

$$\mathrm{Var}(\hat{Y}_R) = N^2 \cdot \mathrm{Var}(\bar{y}_R)$$

  4. Estimated variance of the estimator:

$$\widehat{\mathrm{Var}}(\hat{Y}_R) = N^2 \cdot \widehat{\mathrm{Var}}(\bar{y}_R)$$

  5. Confidence interval:

$$\text{CI} = \left[ N \cdot \text{left}, \ N \cdot \text{right} \right]$$

use:

```r
ratio.total=function(y.sample, x.sample, N, Xbar, alpha)
```

Example

```r
total.simple.result=srs.total(N, y.sample, alpha)

total.ratio.result=ratio.total(y.sample, x.sample, N, Xbar, alpha)

var.result=c(total.simple.result$ytot.var, total.ratio.result$ytotal.var)
deff.result=deff(var.result)
rownames(deff.result)=c("Simple", "Ratio")
print(deff.result)
```

Design Efficiency

Ratio and regression estimation are called complex estimation methods, while the plain SRS estimator is called the simple estimation method. When comparing a complex method to the simple method, the design efficiency is defined as the ratio

$$\text{Deff} = \frac{\mathrm{Var}(\bar y_R)}{\mathrm{Var}(\bar y)} = \begin{cases} <1 & \bar y_R \text{ is more efficient}\\ \geq 1 & \bar y \text{ is more efficient} \end{cases}$$

When $$\rho > \frac{C_x}{2C_y},$$ $\bar y_R$ is more efficient than $\bar y$. In particular, when $Y$ and $X$ are highly correlated, $\bar y_R$ is more efficient than $\bar y$.

Determining Sample Size

Step 1: Given a bound $(V, C, d, r)$ on $\bar Y$, determine the simple sample size $n_{\text{simple}}$ using the function size.mean.
Step 2: Determine the ratio sample size $n_R$:

$$n_R = \text{Deff} \cdot n_{\text{simple}}$$

Use deff=function(var.result) to calculate the design efficiency and deff.size=function(deff.result, n.simple) to calculate the size.
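Only the signatures of deff and deff.size appear in the notes; one assumed sketch, consistent with how the examples set rownames on their results, is:

```r
# Hedged sketches (bodies assumed): deff compares each variance to the
# first (simple) one; deff.size converts design effects to sample sizes.
deff <- function(var.result) {
  cbind(variance = var.result, deff = var.result / var.result[1])
}
deff.size <- function(deff.result, n.simple) {
  cbind(deff.result, size = ceiling(deff.result[, "deff"] * n.simple))
}
```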

Example

```r
mean.simple.result=srs.mean(N, y.sample, alpha)
mean.ratio.result=ratio.mean(y.sample, x.sample, N, Xbar, alpha)

var.result=c(mean.simple.result$ybar.var, mean.ratio.result$ybarR.var)
deff.result=deff(var.result)

n.simple=size.mean(N, Mean.his=NULL, Var.his=var(y.sample), method="d", bound=0.05, alpha)$size
size.result=deff.size(deff.result, n.simple)

rownames(size.result)=c("Simple", "Ratio")
print(size.result)
```

Regression Estimation of Population Mean $\bar Y$ and Total $Y_T$

The linear regression estimator is defined as

$$\bar y_{lr} = \bar y + \beta(\bar X - \bar x)$$

$$\hat Y_{lr} = N \bar y_{lr}$$

Normally $\beta$ is either a constant or the regression coefficient $B$ of $Y$ on $X$.

When $\beta = 1$, we obtain the difference estimator: $$\bar y_d = \bar y + (\bar X - \bar x)$$

When $\beta = 0$, it degenerates to the simple estimator $\bar y$.
When $\beta = \bar y/\bar x = \hat R$, we obtain the ratio estimator $\bar y_R$.

Regression Estimation of Population Mean $\bar Y$

Case 1: $\beta = \beta_0$ is a constant

  1. Estimator of the population mean:

$$\bar{y}_{lr}(\beta_0) = \bar{y} + \beta_0 (\bar{X} - \bar{x})$$

  2. Unbiased estimator:

$$E(\bar{y}_{lr}) = \bar{Y} + \beta_0 (\bar{X} - E(\bar{x})) = \bar{Y} \quad \text{(UE)}$$

  3. Variance of the estimator:

$$\mathrm{Var}(\bar{y}_{lr}) = \frac{1-f}{n} \left( S_y^2 + \beta_0^2 S_x^2 - 2 \beta_0 S_{yx} \right)$$

Minimum variance condition: the variance is minimized when $\beta_0$ equals the population regression coefficient $B$ of $Y$ on $X$:

$$\beta_0 = B = \frac{S_{yx}}{S_x^2} \ \Rightarrow\ \mathrm{Var}_{\min}(\bar y_{lr}) = \frac{1-f}{n} S_e^2 = \frac{1-f}{n} S_y^2(1-\rho^2)$$

where

$$S_e^2 \triangleq S_y^2(1-\rho^2), \qquad \rho = \frac{S_{yx}}{S_y S_x}$$

  4. Estimated variance of the estimator:

$$\widehat{\mathrm{Var}}(\bar{y}_{lr}) = \frac{1-f}{n} \left( s_y^2 + \beta_0^2 s_x^2 - 2 \beta_0 s_{yx} \right)$$

  5. Confidence interval:

$$\left[ \bar{y}_{lr} \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar{y}_{lr})} \right]$$


Case 2: $\beta = \hat b$ is the sample regression coefficient of $y$ on $x$

$$\beta = \hat b = \frac{s_{yx}}{s_x^2}$$

  1. Estimator of the population mean:

$$\bar{y}_{lr} = \bar{y} + \hat{b} (\bar{X} - \bar{x})$$

  2. Approximately unbiased estimator:

$$E(\bar{y}_{lr}) \approx \bar{Y} \quad \text{(AUE)}$$

  3. Mean squared error (MSE) and variance:

$$\text{MSE}(\bar{y}_{lr}) \approx \mathrm{Var}(\bar{y}_{lr}) \approx \frac{1-f}{n} S_e^2$$

This attains the theoretical minimum variance.

  4. Estimated variance of the estimator:

$$\widehat{\mathrm{Var}}(\bar{y}_{lr}) = \frac{1-f}{n} s_e^2 = \frac{1-f}{n} \cdot \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)$$

where

$$s_e^2 = \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)$$

In regression.r :

```r
regression.mean=function(y.sample, x.sample, N=NULL, Xbar, alpha, method="Min", beta0=NULL)
```
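A sketch of what regression.mean might do, assuming Cases 1 and 2 above (only the signature is in the notes; taking method="Min" to mean Case 2 with $\hat b = s_{yx}/s_x^2$, and any other method to use the supplied constant beta0, is an assumption):

```r
# Hedged sketch of regression.r's regression.mean (body assumed):
# linear regression estimator of the mean with a constant beta0 or the
# minimum-variance sample regression coefficient b-hat (method = "Min").
regression.mean <- function(y.sample, x.sample, N = NULL, Xbar, alpha,
                            method = "Min", beta0 = NULL) {
  n <- length(y.sample)
  f <- if (is.null(N)) 0 else n / N
  if (method == "Min") beta0 <- cov(x.sample, y.sample) / var(x.sample)
  ybar.lr <- mean(y.sample) + beta0 * (Xbar - mean(x.sample))
  if (method == "Min") {
    s2e <- (n - 1) / (n - 2) *
      (var(y.sample) - cov(x.sample, y.sample)^2 / var(x.sample))
    v <- (1 - f) / n * s2e                       # Case 2 variance estimate
  } else {
    v <- (1 - f) / n * (var(y.sample) + beta0^2 * var(x.sample) -
                          2 * beta0 * cov(x.sample, y.sample))  # Case 1
  }
  z <- qnorm(1 - alpha / 2)
  list(ybar.lr = ybar.lr, Var = v, ci = ybar.lr + c(-1, 1) * z * sqrt(v))
}
```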

Regression Estimation of Population Total $Y_T$

Notice that the mean estimator converts to the total estimator by multiplying by $N$: $$\bar y_{lr} \ (\text{mean}) \ \overset{\times N}{\longrightarrow} \ \hat Y_{lr} \ (\text{total})$$

```r
regression.total=function(y.sample, x.sample, N=NULL, Xbar, alpha, method="Min", beta0=NULL)
```

Comparison of Simple, Ratio, and Regression Estimation

Their approximate variances are:

$$\begin{aligned} \mathrm{Var}(\bar y) &= \frac{1-f}{n} \cdot S_y^2 \\ \mathrm{Var}(\bar y_R) &\approx \frac{1-f}{n} \cdot (S_y^2 + R^2 S_x^2 - 2 R\rho S_y S_x) \\ \mathrm{Var}(\bar y_{lr}) &\approx \frac{1-f}{n} \cdot S_y^2 (1-\rho^2) \end{aligned}$$

The regression estimator is (asymptotically) at least as efficient as the ratio estimator, since

$$\mathrm{Var}(\bar y_R) - \mathrm{Var}(\bar y_{lr}) \approx \frac{1-f}{n}(B-R)^2 S_x^2 \geq 0$$

When $n$ is not large, the estimators may be biased; in practice, for small $n$ the regression estimator can be more biased than the ratio estimator.

Example: Comparing Simple, Ratio, and Regression Estimation of Population Total $Y_T$

```r
total.simple.result=srs.total(N, y.sample, alpha)
print(total.simple.result)

total.ratio.result=ratio.total(y.sample, x.sample, N, Xbar, alpha)
print(total.ratio.result)

total.reg.result=regression.total(y.sample, x.sample, N, Xbar, alpha, method="Min", beta0=NULL)
print(total.reg.result)

var.result=c(total.simple.result$ytot.var, total.ratio.result$ytotal.var, total.reg.result$Var)
deff.result=deff(var.result)
rownames(deff.result)=c("Simple", "Ratio", "Regression")
print(deff.result)
```

Determining Sample Size

The design efficiency is defined as the ratio

$$\text{Deff} = \frac{\mathrm{Var}(\bar y_{lr})}{\mathrm{Var}(\bar y)}$$

Given the bound $(V, C, d, r)$ on $\bar Y$, determine the simple sample size $n_{\text{simple}}$, then

$$n_{lr} = \text{Deff} \cdot n_{\text{simple}}$$

just as for ratio estimation.

Stratified Ratio and Regression Estimation

There are two approaches to stratified estimation:

  1. Separate estimation: first estimate within each stratum, then take the weighted average or sum.
  2. Combined estimation: first take the weighted average or sum across strata, then estimate from the combined quantities.

Stratified Ratio Estimation

For the $h$-th stratum ($h = 1, \dots, L$):

| Notation | Population $\overset{SRS}{\longrightarrow}$ | Sample |
| --- | --- | --- |
| Data | $\begin{pmatrix}Y_{h1} & \cdots & Y_{hN_h} \\ X_{h1} & \cdots & X_{hN_h}\end{pmatrix}$ | $\begin{pmatrix}y_{h1} & \cdots & y_{hn_h} \\ x_{h1} & \cdots & x_{hn_h}\end{pmatrix}$ |
| Mean | $\bar{Y}_h,\ \bar X_h$ | $\bar y_h,\ \bar{x}_h$ |
| Var, Cov, $\rho$ | $S_{yh}^2, S_{xh}^2, S_{yxh}, \rho_h$ | $s_{yh}^2, s_{xh}^2, s_{yxh}, \hat\rho_h$ |
| Separate ratio (per stratum) | $R_h = \frac{\bar{Y}_h}{\bar{X}_h}$ | $\hat R_h = \frac{\bar y_h}{\bar x_h}$ |
| Combined ratio | $R_c = \frac{\bar{Y}}{\bar{X}}$ | $\hat R_c = \frac{\bar y_{st}}{\bar x_{st}}$ |

Separate Ratio Estimation of Population Mean $\bar Y$

  1. Estimator of the population mean:

$$\bar{y}_{RS} = \sum_h W_h \bar{y}_{Rh} = \sum_h W_h \left( \frac{\bar{y}_h}{\bar{x}_h} \cdot \bar{X}_h\right)$$

Notice that $$\bar y_{Rh} = \frac{\bar{y}_h}{\bar{x}_h} \cdot \bar{X}_h$$ is the ratio estimator of the $h$-th stratum.

  2. Approximate unbiasedness:

$$E(\bar{y}_{RS}) \approx \bar{Y} \quad (\text{AUE})$$

  3. Variance of the estimator:

$$\mathrm{Var}(\bar{y}_{RS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h S_{yxh} \right)$$

  4. Estimated variance of the estimator:

$$\widehat{\mathrm{Var}}(\bar{y}_{RS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{R}_h^2 s_{xh}^2 - 2\hat{R}_h s_{yxh} \right)$$

where $\hat{R}_h = \frac{\bar{y}_h}{\bar{x}_h}$.

In stra ratio.r

```r
separate.ratio.mean=function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha)
```

Combined Ratio Estimation of Population Mean Yˉ\bar Y

  1. Estimator for the Population Mean

yˉRC=yˉstxˉstXˉ=R^cXˉ \bar{y}_{RC} = \frac{\bar{y}_{st}}{\bar{x}_{st}} \cdot \bar{X} = \hat{R}_c \cdot \bar{X}

  2. Approximate Unbiasedness

E(yˉRC)Yˉ(AUE) E(\bar{y}_{RC}) \approx \bar{Y} \quad (\text{AUE})

  3. Variance of the Estimator

Var(yˉRC)=hWh21fhnh(Syh2+Rh2Sxh22RhSyxh) \text{Var}(\bar{y}_{RC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{y_h}^2 + R_h^2 S_{x_h}^2 - 2R_h S_{yx_h} \right)

  4. Estimated Variance of the Estimator

Var^(yˉRC)=hWh21fhnh(syh2+R^c2sxh22R^csyxh) \widehat{\text{Var}}(\bar{y}_{RC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{y_h}^2 + \hat{R}_c^2 s_{x_h}^2 - 2\hat{R}_c s_{yx_h} \right)

where:

R^c=yˉstxˉst \hat{R}_c = \frac{\bar{y}_{st}}{\bar{x}_{st}}

In stra ratio.r

```r
combined.ratio.mean=function(Nh, y.sample, x.sample, stra.index, Xbar, alpha)
```
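For comparison, a hypothetical Python sketch of the combined version, which forms one ratio from the stratified means (the data are invented):

```python
# Combined ratio estimate: ybar_RC = (ybar_st / xbar_st) * Xbar.
def combined_ratio_mean(Nh, Xbar, y, x):
    N = sum(Nh)
    ybar_st = sum((N_h / N) * sum(yh) / len(yh) for N_h, yh in zip(Nh, y))
    xbar_st = sum((N_h / N) * sum(xh) / len(xh) for N_h, xh in zip(Nh, x))
    return (ybar_st / xbar_st) * Xbar       # Rhat_c * Xbar

est = combined_ratio_mean(
    Nh=[60, 40], Xbar=14.0,
    y=[[12.1, 11.8, 12.5], [24.0, 23.5, 24.5, 23.0]],
    x=[[10.2, 9.9, 10.4], [19.8, 20.1, 20.3, 19.6]])
print(round(est, 4))  # → 16.6847
```

Only the overall $\bar X$ is needed here, not the per-stratum $\bar X_h$, which is exactly when the combined estimator is useful.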

Stratified Regression Estimation

Separate Regression Estimation of Population Mean Yˉ\bar Y

Case I: When βh\beta_h is constant

  1. Estimator for the Population Mean

yˉlrS=hWhyˉlrh=hWh(yˉh+βh(Xˉhxˉh)) \bar{y}_{lrS} = \sum_hW_h \bar y_{lrh }=\sum_h W_h \left( \bar{y}_h + \beta_h (\bar{X}_h - \bar{x}_h) \right)

Notice that $$\bar y_{lrh} = \bar{y}_h + \beta_h (\bar{X}_h - \bar{x}_h)$$ is the regression estimator of the $ h $-th stratum.

  2. Unbiasedness

E(yˉlrS)=Yˉ(UE) E(\bar{y}_{lrS}) = \bar{Y} \quad (\text{UE})

  3. Variance of the Estimator

Var(yˉlrS)=hWh21fhnh(Syh2+βh2Sxh22βhSyxh)\text{Var}(\bar{y}_{lrS}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{y_h}^2 + \beta_h^2 S_{x_h}^2 - 2\beta_h S_{yx_h} \right)

Minimum Variance Condition
When $ \beta_h = B_h = \frac{S_{yx_h}}{S_{x_h}^2} $:

Varmin=hWh21fhnhSeh2\text{Var}_{\text{min}} = \sum_h W_h^2 \frac{1-f_h}{n_h} S_{eh}^2

where:

Seh2=Syh2(1ρh2)S_{eh}^2 = S_{y_h}^2 (1 - \rho_h^2)

  4. Estimated Variance of the Estimator

Var^(yˉlrS)=hWh21fhnh(syh2+β^h2sxh22β^hsyxh) \widehat{\text{Var}}(\bar{y}_{lrS}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{y_h}^2 + \hat{\beta}_h^2 s_{x_h}^2 - 2\hat{\beta}_h s_{yx_h} \right)


Case II: When $ \beta_h = \hat{b}_h = \frac{s_{yx_h}}{s_{x_h}^2} $ (the sample regression coefficient of stratum $h$)

  1. Estimator for the Population Mean

yˉlrS=hWh(yˉh+b^h(Xˉhxˉh)) \bar{y}_{lrS} = \sum_h W_h \left( \bar{y}_h + \hat{b}_h (\bar{X}_h - \bar{x}_h) \right)

  2. Asymptotically Unbiased Estimator

E(yˉlrS)Yˉ(AUE) E(\bar{y}_{lrS}) \approx \bar{Y} \quad (\text{AUE})

  3. Variance of the Estimator

Var(yˉlrS)hWh21fhnhSyh2(1ρh2) \text{Var}(\bar{y}_{lrS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} S_{y_h}^2 (1 - \rho_h^2)

  4. Estimated Variance of the Estimator

$$ \widehat{\text{Var}}(\bar{y}_{lrS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \cdot \frac{n_h-1}{n_h-2} \left( s_{y_h}^2 - \frac{s_{yx_h}^2}{s_{x_h}^2} \right) $$

In stra regression.r

```r
seperate.regression.mean=function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha, method = "Min", beta0 = NULL)
```
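A minimal Python sketch of the $\hat b_h$ case may help; the helper below and its data are invented, not the course's R code:

```python
# Separate regression estimate with per-stratum slope b_h = s_yxh / s_xh^2.
def separate_regression_mean(Nh, Xbarh, y, x):
    N = sum(Nh)
    est = 0.0
    for N_h, Xbar_h, yh, xh in zip(Nh, Xbarh, y, x):
        n = len(yh)
        ybar, xbar = sum(yh) / n, sum(xh) / n
        s_xx = sum((v - xbar) ** 2 for v in xh) / (n - 1)
        s_yx = sum((u - ybar) * (v - xbar) for u, v in zip(yh, xh)) / (n - 1)
        b_h = s_yx / s_xx                   # least-squares slope in stratum h
        est += (N_h / N) * (ybar + b_h * (Xbar_h - xbar))
    return est

est = separate_regression_mean(
    Nh=[60, 40], Xbarh=[10.0, 20.0],
    y=[[12.1, 11.8, 12.5], [24.0, 23.5, 24.5, 23.0]],
    x=[[10.2, 9.9, 10.4], [19.8, 20.1, 20.3, 19.6]])
print(round(est, 4))  # → 16.6742
```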

Combined Regression Estimation of Population Mean Yˉ\bar Y

Case I: When β\beta is constant

  1. Estimator for the Population Mean

yˉlrC=yˉst+β(Xˉxˉst) \bar{y}_{lrC} = \bar{y}_{st} + \beta (\bar{X} - \bar{x}_{st})

  2. Unbiasedness

E(yˉlrC)=Yˉ(UE) E(\bar{y}_{lrC}) = \bar{Y} \quad (\text{UE})

  3. Variance of the Estimator

Var(yˉlrC)=hWh21fhnh(Syh2+β2Sxh22βSyxh) \text{Var}(\bar{y}_{lrC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + \beta^2 S_{xh}^2 - 2\beta S_{yxh} \right)

Minimum Variance Condition
When $$ \beta = B_c = \frac{\sum_h W_h^2 \frac{1-f_h}{n_h} S_{yxh}}{\sum_h W_h^2 \frac{1-f_h}{n_h} S_{xh}^2} $$

The Variance achieves its minimum.

  4. Estimated Variance of the Estimator

Var^(yˉlrC)=hWh21fhnh(syh2+β^2sxh22β^syxh) \widehat{\text{Var}}(\bar{y}_{lrC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{\beta}^2 s_{xh}^2 - 2\hat{\beta} s_{yxh} \right)


Case II

When $$ \beta = \hat{b}_c = \frac{\sum_h W_h^2 \frac{1-f_h}{n_h} s_{yx_h}}{\sum_h W_h^2 \frac{1-f_h}{n_h} s_{x_h}^2} $$

  1. Estimator for the Population Mean

yˉlrC=yˉst+b^c(Xˉxˉst) \bar{y}_{lrC} = \bar{y}_{st} + \hat{b}_c (\bar{X} - \bar{x}_{st})

  2. Approximate Unbiasedness

E(yˉlrC)Yˉ(AUE) E(\bar{y}_{lrC}) \approx \bar{Y} \quad (\text{AUE})

  3. Variance of the Estimator

Var(yˉlrC)hWh21fhnh(Syh2+Bc2Sxh22BcSyxh) \text{Var}(\bar{y}_{lrC}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{y_h}^2 + B_c^2 S_{x_h}^2 - 2B_c S_{yx_h} \right)

  4. Estimated Variance of the Estimator

Var^(yˉlrC)hWh21fhnh(syh2+b^c2sxh22b^csyxh) \widehat{\text{Var}}(\bar{y}_{lrC}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{y_h}^2 + \hat{b}_c^2 s_{x_h}^2 - 2\hat{b}_c s_{yx_h} \right)

In stra regression.r

```r
combined.regression.mean = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha, method = "Min", beta0 = NULL)
```
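The combined coefficient $\hat b_c$ pools the within-stratum covariances with weights $W_h^2(1-f_h)/n_h$; a hypothetical Python sketch with invented data:

```python
# Combined regression estimate with the pooled slope b_c.
def combined_regression_mean(Nh, Xbar, y, x):
    N = sum(Nh)
    num = den = ybar_st = xbar_st = 0.0
    for N_h, yh, xh in zip(Nh, y, x):
        n, W = len(yh), N_h / N
        ybar, xbar = sum(yh) / n, sum(xh) / n
        s_xx = sum((v - xbar) ** 2 for v in xh) / (n - 1)
        s_yx = sum((u - ybar) * (v - xbar) for u, v in zip(yh, xh)) / (n - 1)
        g = W ** 2 * (1 - n / N_h) / n      # W_h^2 * (1 - f_h) / n_h
        num += g * s_yx
        den += g * s_xx
        ybar_st += W * ybar
        xbar_st += W * xbar
    b_c = num / den
    return ybar_st + b_c * (Xbar - xbar_st)

est = combined_regression_mean(
    Nh=[60, 40], Xbar=14.0,
    y=[[12.1, 11.8, 12.5], [24.0, 23.5, 24.5, 23.0]],
    x=[[10.2, 9.9, 10.4], [19.8, 20.1, 20.3, 19.6]])
print(round(est, 4))  # → 16.6658
```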

Estimation of Population Total YTY_T

Each estimator of the population total is $N$ times the corresponding estimator of the mean: $$\text{mean }\quad \bar Y \overset{\times N}{\longrightarrow} Y_T \quad \text{total}$$

In stra ratio.r

```r
separate.ratio.total=function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha)

combined.ratio.total=function(Nh, y.sample, x.sample, stra.index, Xbar, alpha)
```

In stra regression.r

```r
seperate.regression.total = function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha, method = "Min", beta0 = NULL)

combined.regression.total = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha, method = "Min", beta0 = NULL)
```

Determining Sample Size

Deff=Var(yˉprop)Var(yˉst)    nprop=Deffnyˉst\text{Deff} = \frac{Var(\bar y_{\text{prop}})}{Var(\bar y_{\text{st}})} \implies n_{\text{prop}} = \text{Deff} \cdot n_{\bar y_{st}}

Given a precision bound $(V,C,d,r)$ for $\bar y_{st}$, first calculate the required sample size $n_{\bar y_{st}}$, then scale it by $\text{Deff}$. Here $\bar y_{\text{prop}}$ denotes the proposed estimator (RS, RC, lrS, or lrC).
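Numerically this is just a rescaling; a tiny sketch (all numbers below are invented):

```python
import math

# Scale the sample size required by the baseline estimator by the design effect.
def required_sample_size(n_base, deff):
    return math.ceil(deff * n_base)

# If the baseline needs n = 400 and the proposed estimator has Deff = 0.75,
# roughly 300 units give the same precision.
print(required_sample_size(400, 0.75))  # → 300
```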

Double Sampling

Double Sampling (also called Two-phase Sampling) proceeds in two phases:

  1. First, draw a large sample from the population to collect auxiliary information. In this course, the first-phase sampling is always SRS.
  2. Second, draw a smaller sample. In this course, the second-phase sample is always drawn from the first-phase sample.

Process

Population:

Y1,,YNY_1, \dots, Y_N

Step 1:

  • Sampling Method: SRS (Simple Random Sampling)
  • Sample Drawn:

y1,,yn(First Sample) y_1', \dots, y_{n'}' \quad \text{(First Sample)}

Step 2:

  • Second Sample:

y1,,yn(Second Sample) y_1, \dots, y_n \quad \text{(Second Sample)}

Estimation:

  • Estimator:

θ^=θ^(y1,,yn) \hat{\theta} = \hat{\theta}(y_1, \dots, y_n)

Expectation and Variance Decomposition

Expectation of the Estimator

E(θ^)=E(θ^(y1,,yn))=E1[E2(θ^(y1,,yn)y1,,yn)]=E1[E2(θ^)]\begin{aligned} E(\hat{\theta}) &= E\left( \hat{\theta}(y_1, \dots, y_n) \right)\\ &= E_1 \left[ E_2 \left( \hat{\theta} (y_1,\ldots,y_n) \big| y_1', \dots, y_{n'}' \right) \right]\\ &= E_1 \left[ E_2 (\hat \theta)\right]\end{aligned}

Variance of the Estimator

The variance of the estimator $ \hat{\theta} $ can be decomposed as:

Var(θ^)=Var1(E2(θ^y1,,yn))+E1(Var2(θ^y1,,yn))\text{Var}(\hat{\theta}) = \text{Var}_1 \left( E_2 \left( \hat{\theta} \mid y_1', \dots, y_{n'}' \right) \right) + E_1 \left( \text{Var}_2 \left( \hat{\theta} \mid y_1', \dots, y_{n'}' \right) \right)
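This is the law of total expectation and total variance applied to the two phases. It can be checked exactly on a toy discrete model (all numbers below are invented):

```python
# Exact check of Var(theta) = Var1(E2(theta | phase 1)) + E1(Var2(theta | phase 1))
# on a tiny two-stage model: phase 1 yields state s with probability p[s];
# phase 2 then draws theta uniformly from outcomes[s].
p = {"a": 0.5, "b": 0.5}
outcomes = {"a": [1.0, 3.0], "b": [2.0, 6.0]}

def mean(v):
    return sum(v) / len(v)

grand_mean = sum(p[s] * mean(outcomes[s]) for s in p)

# Left-hand side: variance over the whole two-stage experiment.
total_var = sum(p[s] * mean([(t - grand_mean) ** 2 for t in outcomes[s]]) for s in p)

# Right-hand side: variance of conditional means + mean of conditional variances.
var_of_means = sum(p[s] * (mean(outcomes[s]) - grand_mean) ** 2 for s in p)
mean_of_vars = sum(p[s] * mean([(t - mean(outcomes[s])) ** 2 for t in outcomes[s]])
                   for s in p)

print(total_var, var_of_means + mean_of_vars)  # → 3.5 3.5
```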

Double Stratified Sampling

Sampling Process

Step 1

SRS sample from the population to obtain the first-phase sample. For known $N$ and given $n'$:

(Y1,,YN)SRS(y1,,yn)(Y_1,\ldots,Y_N) \overset{SRS}{\longrightarrow} (y_1',\ldots,y_{n'}' )

Step 2

Stratify the first-phase sample $(y_1',\ldots,y_{n'}')$ into $L$ strata; stratum $h$ contains $n_h'$ units.
The samples are: $(y_{h1}',\ldots, y_{hn_h'}'),\quad h = 1,\ldots, L$

Step 3

Estimate the stratum weight of stratum $h$, since $W_h = \frac{N_h}{N}$ is unknown.
Using samples from the first-phase, we have:

wh=nhn,h=1,,Lw_h' = \frac{n_h'}{n'},\quad h = 1,\ldots ,L

Step 4

Perform stratified sampling from the first-phase sample $(y_1',\ldots,y_{n'}')$ to obtain the second-phase sample, drawing $n_h$ units from stratum $h$:

$$(y_{h1}',\ldots, y_{hn_h'}') \longrightarrow (y_{h1},\ldots, y_{hn_h}),\quad h = 1,\ldots,L$$


Two-Phase Sampling Formulas

1. Second-phase sampling proportion:

vh=nhnhv_h = \frac{n_h}{n_h'}

  • $ n_h $: size of the second-phase sample in stratum $h$.
  • $ n_h' $: size of the first-phase sample in stratum $h$.

2. Second-phase Sample Mean for the $h$-th stratum ($ \bar{y}_h $):

yˉh=1nhj=1nhyhj\bar{y}_h = \frac{1}{n_h} \sum_{j=1}^{n_h} y_{hj}

  • $ y_{hj} $: Value of the target variable $ y $ for the $ j $-th unit in the second-phase sample of stratum $ h $.

3. Second-phase Sample Variance for the $h$-th stratum ($ s_h^2 $):

$$s_h^2 = \frac{1}{n_h - 1} \sum_{j=1}^{n_h} (y_{hj} - \bar{y}_h)^2$$

  • $ \bar{y}_h $: Sample mean of the target variable $ y $ for the second-phase sample of stratum $ h $.

Double Stratified Sampling Estimation of Population Mean $ \bar{Y}$

  1. Estimator for the Population Mean:

yˉstD=h=1Lwhyˉh \bar{y}_{stD} = \sum_{h=1}^{L} w_h' \cdot \bar{y}_h

  2. Unbiased Estimation:

E(yˉstD)=Yˉ(UE) E(\bar{y}_{stD}) = \bar{Y} \quad (\text{UE})

  3. Variance of the Estimator:

$$ \text{Var}(\bar{y}_{stD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S^2 + \sum_{h=1}^{L} \frac{W_h S_h^2}{n'} \left( \frac{1}{v_h} - 1 \right) $$

  4. Estimated Variance of the Estimator:

Var^(yˉstD)=h=1L(1nh1nh)whsh2+(1n1N)h=1Lwh(yˉhyˉstD)2 \widehat{\text{Var}}(\bar{y}_{stD}) = \sum_{h=1}^{L} \left( \frac{1}{n_h} - \frac{1}{n_h'} \right) w_h' s_h^2 + \left( \frac{1}{n'} - \frac{1}{N} \right) \sum_{h=1}^{L} w_h' \left( \bar{y}_h - \bar{y}_{stD} \right)^2

In two phase stra.r

```r
twophase.stra.mean1=function(N=NULL, nh.1st, nh.2nd, ybarh, s2h, alpha)

twophase.stra.total1=function(N, nh.1st, nh.2nd, ybarh, s2h, alpha)
```
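The point estimate itself is just the first-phase-weighted average of the second-phase stratum means; a hypothetical Python sketch (numbers invented):

```python
# ybar_stD = sum_h w'_h * ybar_h, with w'_h = n'_h / n' from phase 1.
def twophase_stra_mean(nh_1st, ybarh):
    n1 = sum(nh_1st)
    return sum((nh / n1) * yb for nh, yb in zip(nh_1st, ybarh))

# Invented example: phase 1 classified 30 and 70 units into two strata.
est = twophase_stra_mean(nh_1st=[30, 70], ybarh=[10.0, 20.0])
print(round(est, 4))  # → 17.0
```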

Double Ratio Estimation and Double Regression Estimation

Sampling Process

$Y$ is the target variable and $X$ is the auxiliary variable.

Step 1

SRS sample from the population to obtain the first-phase sample. For known $N$ and given $n'$:

$$\begin{pmatrix}Y_1 &\cdots &Y_N\\X_1 & \cdots &X_N \end{pmatrix} \overset{SRS}{\longrightarrow} \begin{pmatrix}y_1' &\cdots &y_{n'}'\\x_1' & \cdots &x_{n'}' \end{pmatrix}$$

Step 2

Since the auxiliary mean $\bar X$ is unknown, use the first-phase sample to estimate it:

$$\hat{\bar X} = \bar x' = \frac{1}{n'} \sum_{j=1}^{n'} x_{j}'$$

Step 3

SRS from the first-phase samples to obtain the second-phase samples:

$$\begin{pmatrix}y_1' &\cdots &y_{n'}'\\x_1' & \cdots &x_{n'}' \end{pmatrix} \overset{SRS}{\longrightarrow} \begin{pmatrix}y_1 &\cdots &y_{n}\\x_1 & \cdots &x_{n} \end{pmatrix}$$

Notations for second-phase samples:

yˉ,xˉ,sy2,sx2,syx\bar y , \bar x, s_y^2, s_x^2, s_{yx}

Double Ratio Estimation of Population Mean $ \bar{Y}$

  1. Estimator for the Population Mean:

$$ \bar{y}_{RD} = \hat{R} \cdot \bar{x}' = \frac{\bar{y}}{\bar{x}} \cdot \bar{x}' $$

  2. Asymptotically Unbiased Estimation:

E(yˉRD)YˉAUE E(\bar{y}_{RD}) {\approx} \bar{Y}\quad \text{AUE}

  3. Variance of the Estimator:

Var(yˉRD)=(1n1N)Sy2+(1n1n)(Sy2+R2Sx22RSyx) \text{Var}(\bar{y}_{RD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) (S_y^2 + R^2 S_x^2 - 2RS_{yx})

  4. Estimated Variance of the Estimator:

Var^(yˉRD)=(1n1N)sy2+(1n1n)(sy2+R^2sx22R^syx) \widehat{\text{Var}}(\bar{y}_{RD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( s_y^2 + \hat{R}^2 s_x^2 - 2\hat{R}s_{yx} \right)

In twophase ratio.r

```r
twophase.ratio.mean=function(N=NULL, n.1st, xbar.1st, y.sample, x.sample, alpha)

twophase.ratio.total=function(N, n.1st, xbar.1st, y.sample, x.sample, alpha)
```
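A hypothetical Python sketch of the point estimate: the ratio comes from the second phase, the auxiliary mean from the first (data invented):

```python
# ybar_RD = (ybar / xbar) * xbar', with ybar, xbar from the second phase.
def twophase_ratio_mean(xbar_1st, y, x):
    ybar = sum(y) / len(y)
    xbar = sum(x) / len(x)
    return (ybar / xbar) * xbar_1st

est = twophase_ratio_mean(xbar_1st=14.0, y=[12.0, 13.0, 11.0], x=[10.0, 11.0, 9.0])
print(round(est, 4))  # → 16.8
```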

Double Regression Estimation of Population Mean $ \bar{Y}$

Case I: When $ \beta $ is a Constant ($ \beta = \beta_0 $, e.g. $ \beta_0 = 1 $)

  1. Estimator for the Population Mean:

yˉlrD=yˉ+β(xˉxˉ) \bar{y}_{lrD} = \bar{y} + \beta (\bar{x}' - \bar{x})

  2. Unbiasedness:

E(yˉlrD(β0))=Yˉ(UE) E(\bar{y}_{lrD}(\beta_0)) = \bar{Y} \quad (\text{UE})

  3. Variance of the Estimator:

Var(yˉlrD(β0))=(1n1N)Sy2+(1n1n)(Sy2+β02Sx22β0Syx) \text{Var}(\bar{y}_{lrD}(\beta_0)) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( S_y^2 + \beta_0^2 S_x^2 - 2\beta_0 S_{yx} \right)

  4. Estimated Variance of the Estimator:

Var^(yˉlrD(β0))=(1n1N)sy2+(1n1n)(sy2+β02sx22β0syx) \widehat{\text{Var}}(\bar{y}_{lrD}(\beta_0)) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( s_y^2 + {\beta}_0^2 s_x^2 - 2{\beta}_0 s_{yx} \right)



Case II: When $ \beta$ is the regression coefficient of the second-phase sample

$$\beta = \hat{b} = \frac{s_{yx}}{s_x^2}$$

  1. Estimator for the Population Mean

yˉlrD=yˉ+b^(xˉxˉ) \bar{y}_{lrD} = \bar{y} + \hat{b} (\bar{x}' - \bar{x})

  2. Asymptotically Unbiased Estimation

E(yˉlrD)Yˉ(AUE) E(\bar{y}_{lrD}) \approx \bar{Y} \quad (\text{AUE})

  3. Variance of the Estimator

Var(yˉlrD)=(1n1N)Sy2+(1n1n)Sy2(1ρ2) \text{Var}(\bar{y}_{lrD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) S_y^2 (1 - \rho^2)

  4. Estimated Variance of the Estimator

Var^(yˉlrD)=(1n1N)sy2+(1n1n)se2 \widehat{\text{Var}}(\bar{y}_{lrD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) s_e^2

where:

se2=n1n2(sy2syx2sx2) s_e^2 = \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)

In twophase regression.r

```r
twophase.regression.mean=function(N=NULL, n.1st, xbar.1st, y.sample, x.sample, alpha, beta0=NULL)

twophase.regression.total=function(N=NULL, n.1st, xbar.1st, y.sample, x.sample, alpha, beta0=NULL)
```
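Sketching Case II in Python, with the slope fit on the second-phase pairs (everything below is invented for illustration):

```python
# ybar_lrD = ybar + b * (xbar' - xbar), with b = s_yx / s_x^2 from phase 2.
def twophase_regression_mean(xbar_1st, y, x):
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    s_xx = sum((v - xbar) ** 2 for v in x) / (n - 1)
    s_yx = sum((u - ybar) * (v - xbar) for u, v in zip(y, x)) / (n - 1)
    b = s_yx / s_xx
    return ybar + b * (xbar_1st - xbar)

est = twophase_regression_mean(xbar_1st=14.0, y=[12.0, 13.0, 11.0], x=[10.0, 11.0, 9.0])
print(round(est, 4))  # → 16.0
```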

Cluster Sampling

The population is formed by clusters. Cluster sampling draws whole clusters and then observes every unit within each sampled cluster.

Cluster Sampling Estimation of Population Mean $ \bar{Y}$

Sampling Process

Population is formed by NN clusters:

Y11,,Y1M11Yi1,,YiMiiYN1,,YNMNN\boxed{Y_{11},\ldots ,Y_{1M_1}}_1\quad \cdots\quad \boxed{Y_{i1},\ldots ,Y_{iM_i}}_i\quad \cdots\quad \boxed{Y_{N1},\ldots ,Y_{NM_N}}_N

SRS from the cluster indices:

(1,,N)SRS(1,,n)(1,\ldots,N) \overset{SRS}{\longrightarrow} (1,\ldots,n)

We obtain the samples:

y11,,y1m11yi1,,yimiiyn1,,ynmnn\boxed{y_{11},\ldots,y_{1m_1}}_1\quad \cdots\quad \boxed{y_{i1},\ldots,y_{im_i}}_i\quad \cdots\quad \boxed{y_{n1},\ldots,y_{nm_n}}_n

For a given $n$, the sampling fraction is $$f = \frac n N$$

Clusters with the same size (Mi=M=mi)(M_i = M = m_i)

Notations

UPPER CASE: population; lower case: sample.

Yi=1Mj=1MYij(cluster mean)Y=1MN(i=1Nj=1MYij)(unit mean)Y=1N(i=1N1Mj=1MYij)(mean by cluster)Sw2=1N(M1)i=1Nj=1M(YijYi)2(Within-cluster variance)S2=1NM1i=1Nj=1M(YijY)2(Total variance)Sb2=MN1i=1N(YiYˉ)2(Between-cluster variance)\begin{aligned} \overline{Y}_{i} &= \frac{1}{M} \sum_{j=1}^{M} Y_{ij} & \text{(cluster mean)} \\ \overline{\overline{Y}} &= \frac{1}{MN} \left( \sum_{i=1}^{N} \sum_{j=1}^{M} Y_{ij} \right) & \text{(unit mean)} \\ \overline{Y} &= \frac{1}{N} \left( \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} Y_{ij} \right) & \text{(mean by cluster)}\\ S_{w}^{2} &= \frac{1}{N(M-1)} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( Y_{ij} - \overline{Y}_{i} \right)^{2} & \text{(Within-cluster variance)} \\ S^{2} &= \frac{1}{NM-1} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( Y_{ij} - \overline{Y} \right)^{2} & \text{(Total variance)} \\ S_{b}^{2} &= \frac{M}{N-1} \sum_{i=1}^{N} \left( \overline{Y}_{i} - \bar{Y} \right)^{2} & \text{(Between-cluster variance)} \end{aligned}

$\bar y_i , \bar{\bar{y}}, \bar y , s_w^2, s^2, s_b^2$ can be defined similarly for the sample.

Proposition:

  • Decomposition of population variance S2S^2:

$$\begin{aligned}S^2 &= \frac{1}{NM-1} \left[(N-1)S_b^2 + N(M-1) S_w^2\right]\\&=\frac{N-1}{NM-1}S_b^2 + \frac{N(M-1)}{NM-1}S_w^2\end{aligned}$$

  • Decomposition of sample variance s2s^2:

s2=n1nM1sb2+n(M1)nM1sw2\begin{aligned}s^2 &=\frac{n-1}{nM-1}s_b^2 + \frac{n(M-1)}{nM-1}s_w^2\end{aligned}

Estimation of the unit mean Y\overline{\overline{Y} }

  1. Estimation

y=1nMi=1nj=1Myij=(1M)yˉ \overline{\overline{y}} = \frac{1}{nM} \sum_{i=1}^n \sum_{j=1}^M y_{ij} = \left( \frac{1}{M} \right) \bar{y}

(=1ni=1nyˉi=mean(yˉ1,,yˉn)) \left( = \frac{1}{n} \sum_{i=1}^n \bar{y}_i = \text{mean}(\bar{y}_1, \ldots, \bar{y}_n) \right)

  2. Unbiased

E(y)=Y(UE) E(\overline{\overline{y}}) = \overline{\overline{Y}} \quad (\text{UE})

  3. Variance of Estimation

Var(y)=1fnMSb2 \text{Var}(\overline{\overline{y}}) = \frac{1-f}{nM} S_b^2

  4. Estimation of Variance

$$ \widehat{\text{Var}}(\overline{\overline{y}}) = \frac{1-f}{nM} s_b^2 $$

In cluster srs.r

```r
cluster.srs.mean = function(N, M.ith, ybar.ith, s2.ith, alpha)
```
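For equal-size clusters the computation reduces to statistics of the cluster means; a hypothetical Python sketch with toy numbers:

```python
# Unit-mean estimate for equal-size clusters: the average of cluster means,
# with estimated variance (1-f)/(nM) * s_b^2 = (1-f)/n * var(cluster means).
def cluster_srs_mean(N, cluster_means):
    n = len(cluster_means)
    f = n / N
    ybarbar = sum(cluster_means) / n
    s2_means = sum((m - ybarbar) ** 2 for m in cluster_means) / (n - 1)
    var_hat = (1 - f) * s2_means / n
    return ybarbar, var_hat

# Invented example: 3 of N = 10 clusters sampled.
est, v = cluster_srs_mean(N=10, cluster_means=[4.0, 6.0, 5.0])
print(est, round(v, 4))  # → 5.0 0.2333
```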

Estimation of Population Variance S2S^2

Proposition
Recall:

S2=N1NM1Sb2+N(M1)NM1Sw2\begin{aligned}S^2 &=\frac{N-1}{NM-1}S_b^2 + \frac{N(M-1)}{NM-1}S_w^2\end{aligned}

s2=n1nM1sb2+n(M1)nM1sw2\begin{aligned}s^2 &=\frac{n-1}{nM-1}s_b^2 + \frac{n(M-1)}{nM-1}s_w^2\end{aligned}

We have:

  1. sb2s_b^2 is an Unbiased Estimator of Sb2S_b^2
  2. sw2s_w^2 is an Unbiased Estimator of Sw2S_w^2
  3. s2s^2 is NOT an Unbiased Estimator of S2S^2

The unbiased estimator of $S^2$ is:

$$\begin{aligned}\hat S^2 &= \frac{N-1}{NM-1}s_b^2 + \frac{N(M-1)}{NM-1}s_w^2 \quad &(N\text{ known})\\ \hat S^2 &= \frac{1}{M} s_b^2 + \frac{M-1}{M} s_w^2 \quad &(N \to +\infty)\end{aligned}$$


Design Efficiency

Definition
Within Cluster Correlation Coefficient:

ρc=def2i=1Nj<kM(YijY)(YikY)(M1)(NM1)S2=1NMSw2(NM1)S2\begin{aligned}\rho_c &\overset{\text{def}}{=} \frac{2\sum_{i=1}^N \sum_{j<k}^M (Y_{ij} - \overline{\overline{Y}})(Y_{ik} - \overline{\overline{Y}})}{(M-1)(NM-1) S^2}\\&= 1 - \frac{NM S_w^2}{(NM-1) S^2}\end{aligned}

Note that:

ρc[1M1,1]\rho_c \in \left[ -\frac{1}{M-1}, 1 \right]

To calculate the design efficiency, we compare the variance of the cluster estimator with the variance of SRS.

  1. Variance of the Cluster Estimator

Var(y)=1fnMSb2=1fnMNM1M(N1)S2(1+(M1)ρc)(N is given)=1fnMS2(1+(M1)ρc)(N=+) \begin{aligned}\text{Var}(\overline{\overline{y}}) &= \frac{1-f}{nM} S_b^2\\ &=\frac{1-f}{nM} \cdot \frac{NM-1}{M(N-1)} S^2 (1 + (M-1)\rho_c) \quad &(N \text{ is given})\\ &=\frac{1-f}{nM} S^2 (1 + (M-1)\rho_c) \quad & (N = +\infty)\end{aligned}

Now let us estimate $\rho_c$:

$$\begin{aligned}\hat\rho_c &= 1 - \frac{NM s_w^2}{(NM-1)\hat S^2} \quad & (N\text{ known})\\ &= \frac{s_b^2 - s_w^2}{s_b^2 + (M-1)s_w^2} \quad & (N = +\infty)\end{aligned}$$

  2. Variance for SRS from a population of $NM$ units with sample size $nM$

Var(ySRS)=1fnMS2\text{Var}(\overline{y}_{SRS}) = \frac{1 - f}{nM} S^2

Hence the Design Efficiency can be derived as:

$$\widehat{\text{Deff}} = \frac{\text{Var}(\overline{\overline{y}})}{\text{Var}(\overline{y}_{SRS})} = \begin{cases} \frac{NM-1}{M(N-1)} \left(1 + (M-1)\hat\rho_c\right) & N \text{ finite} \\ 1 + (M-1)\hat\rho_c & N = +\infty \end{cases}$$
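For example (numbers invented), with clusters of size $M = 5$ and sample variances $s_b^2 = 8$, $s_w^2 = 2$, the $N = +\infty$ branch gives:

```python
# Design-effect estimate for N = +infinity: rho_c from between/within variances.
def deff_cluster(M, sb2, sw2):
    rho_hat = (sb2 - sw2) / (sb2 + (M - 1) * sw2)
    return 1 + (M - 1) * rho_hat

print(deff_cluster(M=5, sb2=8.0, sw2=2.0))  # → 2.5
```

A Deff above 1 means clustering costs precision relative to SRS of the same total size, which is typical when units within a cluster are positively correlated.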

Determining the Sample Size

Given a precision bound $(V,C,d,r)$ for $\bar y_{SRS}$, we can determine the sample size $n_{SRS}$; therefore:

nmin=Deff^nSRSn_{\min} = \widehat{\text{Deff}} \cdot n_{SRS}

The minimum number of clusters (rounded up to an integer) is:

$$n_{\text{cluster}} = \left\lceil \frac{n_{\min}}{M} \right\rceil$$

Clusters with different sizes

  1. If the $M_i$ are close to each other, use their mean $\bar M$ as a proxy for $M$:

Mˉ=1Ni=1NMi\bar M = \frac 1 N \sum_{i=1}^N M_i

  2. When the $M_i$ differ widely, stratify the clusters by size so that clusters within each stratum have similar sizes, then use each stratum's mean size as the proxy.

抽样调查 (Sampling Survey) — Jack H, published June 7, 2025. https://blog.jacklit.com/2025/06/07/抽样调查/