Sampling Design Notes
Jack H
June 5, 2025
Introduction
Coefficient of Variation
$$\mathrm{cv} = \frac{\sqrt{\mathrm{Var}(\hat\theta)}}{E(\hat\theta)}$$
It measures the relative dispersion of an estimator: its standard deviation relative to its mean.
Mean Square Error
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + (\mathrm{bias}(\hat{\theta}))^2$$
Point Estimation Error Control: MSE(V,C)
When $\hat\theta$ is unbiased, $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$. Precision is controlled by a given upper bound $V$ on the variance or a given upper bound $C$ on the cv, requiring $\mathrm{Var}(\hat\theta) \leq V$ or $\mathrm{cv}(\hat\theta) \leq C$.
Margin of Error:
If $\hat\theta$ is a point estimator of $\theta$, then for a given $\alpha \in (0,1)$, if $$P(|\hat\theta - \theta| \leq d) = 1-\alpha,$$ we call $d$ the margin of error of $\hat\theta$ at confidence level $1-\alpha$.
Relative Margin of Error:
If $$P\left(\frac{|\hat\theta - \theta|}{\theta} \leq r\right) = 1-\alpha,$$ we call $r$ the relative margin of error of $\hat\theta$ at confidence level $1-\alpha$.
Error Limit (d, r) Estimation Control:
For a given $\alpha \in (0,1)$ and a given absolute margin of error $d$ or relative margin of error $r$, require $$P(|\hat\theta - \theta| \leq d) = 1-\alpha \quad\text{or}\quad P\left(\frac{|\hat\theta - \theta|}{\theta} \leq r\right) = 1-\alpha.$$
Assumptions in this course:
Consistent Estimation: $\hat\theta_n \xrightarrow{P} \theta$ as $n \to \infty$.
Definition: If $\hat\theta$ is a consistent estimator of $\theta$, then for any $\epsilon > 0$,
$$P(|\hat\theta - \theta| > \epsilon) \to 0 \text{ as } n \to \infty.$$
Asymptotic Normality (CAN): $$\frac{\hat\theta_n - E(\hat\theta_n)}{\sqrt{\mathrm{Var}(\hat\theta_n)}} \xrightarrow{d} N(0,1)$$
Denote $\sqrt{\mathrm{Var}(\hat\theta_n)}$ by $\mathrm{sd}(\hat\theta_n)$.
In this course you are expected to prove whether an estimator is UE or AUE.
Theorem: When $\hat\theta$ is consistent and asymptotically normal, if $\hat\theta$ is an unbiased estimator (UE) or asymptotically unbiased estimator (AUE), then the distribution of $\hat\theta$ is approximately normal:
$$\frac{\hat\theta - \theta}{\sqrt{\mathrm{Var}(\hat\theta)}} = \frac{\hat\theta - \theta}{\mathrm{sd}(\hat\theta)} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty$$
Therefore $$P\left(\left|\frac{\hat\theta - \theta}{\mathrm{sd}(\hat\theta)}\right| \leq z_{\alpha/2}\right) \approx 1-\alpha$$
$$d = z_{\alpha/2}\, \mathrm{sd}(\hat\theta), \qquad r = z_{\alpha/2}\, \frac{\mathrm{sd}(\hat\theta)}{\theta}$$
$$\hat\theta_L = \hat\theta - z_{\alpha/2}\, \mathrm{sd}(\hat\theta), \qquad \hat\theta_R = \hat\theta + z_{\alpha/2}\, \mathrm{sd}(\hat\theta)$$
If $P(\hat\theta_L \leq \theta \leq \hat\theta_R) = 1-\alpha$, then $\hat\theta_L$ and $\hat\theta_R$ are the endpoints of a $(1-\alpha)$ confidence interval for $\theta$.
When $n$ is large, $$P(\theta \in [\hat\theta \pm z_{\alpha/2}\, \mathrm{sd}(\hat\theta)]) \approx 1-\alpha$$
In ci.r:

```r
conf.interval = function(para.hat, SD.hat, alpha)
```
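The course helper is in R; as a cross-check, here is a minimal Python sketch of the same normal-approximation interval (the function name and arguments mirror the R signature and are otherwise illustrative):

```python
from statistics import NormalDist

def conf_interval(para_hat, sd_hat, alpha):
    """Normal-approximation (1 - alpha) CI: para_hat +/- z_{alpha/2} * sd_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for alpha = 0.05
    return (para_hat - z * sd_hat, para_hat + z * sd_hat)

lo, hi = conf_interval(10.0, 2.0, 0.05)  # roughly (6.08, 13.92)
```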
Simple Random Sampling
In this course, we consider picking $n$ units out of a population of $N$ without replacement; each of the $C_N^n$ possible samples is chosen with probability $p = 1/C_N^n$.
In srs sampling.r:

```r
mysrs = sample(1:N, n)
print(mysrs)
mysrs = sample(1:N, n, replace = TRUE)
print(mysrs)
```
Mean:
$$\bar Y = \frac{1}{N} \sum_{i=1}^N Y_i$$
Total:
$$Y_T = N \bar Y$$
Variance:
$$S^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar Y)^2$$
Estimation of Population Mean $\bar Y$
Point Estimation
$$\bar y = \frac{1}{n} \sum_{i=1}^n y_i$$
Unbiased estimator of $\bar Y$:
$$E(\bar y) = \bar Y \quad (\text{UE})$$
Variance of estimation:
$$\mathrm{Var}(\bar y) = \frac{1-f}{n} S^2$$
where $f = \frac{n}{N}$ and $S^2$ is the population variance of $Y$ (unknown).
Estimation of variance:
$$\widehat{\mathrm{Var}}(\bar y) = \frac{1-f}{n} s^2$$
where
$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2$$
Confidence Interval
$$\left[\bar y \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\bar y)}\right]$$
$$d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\bar y)}$$
$$r = \frac{d}{\bar y}$$
In srs.r:

```r
srs.mean = function(N, mysample, alpha)
```
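A hedged Python sketch of the computations srs.mean performs (the return fields are illustrative, not the R function's actual output):

```python
from statistics import NormalDist, mean, variance

def srs_mean(N, sample, alpha):
    """Point estimate, variance estimate, and CI for Ybar under SRS."""
    n = len(sample)
    ybar = mean(sample)
    f = n / N                                  # sampling fraction
    var_hat = (1 - f) / n * variance(sample)   # (1 - f)/n * s^2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    d = z * var_hat ** 0.5                     # absolute margin of error
    return {"ybar": ybar, "var": var_hat, "ci": (ybar - d, ybar + d), "r": d / ybar}

res = srs_mean(N=1000, sample=[4, 6, 5, 7, 3, 5], alpha=0.05)
```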
Proof of $E(s^2) = S^2$
Step 1: Express $S^2$ and $s^2$
The population variance $S^2$ is defined as:
$$S^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2$$
The sample variance $s^2$ is:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2$$
Step 2: Expand the Sum of Squares
First, note that:
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n \bar{y}^2$$
Step 3: Take the Expectation of $s^2$
Compute $E(s^2)$:
$$E(s^2) = E\left( \frac{1}{n-1} \left[ \sum_{i=1}^n y_i^2 - n \bar{y}^2 \right] \right) = \frac{1}{n-1} \left[ \sum_{i=1}^n E(y_i^2) - n E(\bar{y}^2) \right]$$
Step 4: Compute $E(y_i^2)$ and $E(\bar{y}^2)$
For any $y_i$:
$$E(y_i^2) = \mathrm{Var}(y_i) + [E(y_i)]^2 = S^2\left(1 - \frac{1}{N}\right) + \bar{Y}^2$$
For $\bar{y}$:
$$E(\bar{y}^2) = \mathrm{Var}(\bar{y}) + [E(\bar{y})]^2 = \frac{1-f}{n} S^2 + \bar{Y}^2$$
where $f = \frac{n}{N}$.
Step 5: Substitute Back into $E(s^2)$
$$E(s^2) = \frac{1}{n-1} \left[ n \left( S^2\left(1 - \frac{1}{N}\right) + \bar{Y}^2 \right) - n \left( \frac{1-f}{n} S^2 + \bar{Y}^2 \right) \right]$$
Simplify the expression:
$$E(s^2) = \frac{1}{n-1} \left[ n S^2\left(1 - \frac{1}{N}\right) - (1 - f) S^2 \right]$$
$$= \frac{1}{n-1} \left[ n S^2 - \frac{n}{N} S^2 - S^2 + \frac{n}{N} S^2 \right]$$
$$= \frac{1}{n-1} \left[ (n - 1) S^2 \right] = S^2$$
Conclusion
Thus, we have shown that:
$$E(s^2) = S^2$$
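The identity can also be verified numerically for a toy population by enumerating all equally likely SRS draws and averaging $s^2$ (a sketch; the population values are made up):

```python
from itertools import combinations
from statistics import variance

Y = [2, 5, 7, 8, 11]   # toy population, N = 5
n = 3
S2 = variance(Y)       # population S^2 with divisor N - 1, here 11.3

# Average s^2 over all C(5, 3) = 10 equally likely samples.
samples = list(combinations(Y, n))
avg_s2 = sum(variance(s) for s in samples) / len(samples)
# avg_s2 equals S2 up to floating-point error, confirming E(s^2) = S^2
```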
Estimation of Population Total $Y_T = N \bar Y = \sum_{i=1}^N Y_i$
Point Estimation
$$\hat y_T = N \bar y$$
Unbiased Estimator
$$E(\hat y_T) = N \cdot E(\bar y) = N \bar Y = Y_T$$
Variance of Estimation $$Var(\hat y_T) = N^2 Var (\bar y ) = N^2 \frac{1-f}n S^2$$
Estimation of Variance $$\widehat{\mathrm{Var}}(\hat y_T) = N^2 \frac{1-f}{n} s^2$$
Confidence Interval
$$\left[\hat y_T \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat y_T)}\right]$$
$$d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat y_T)}$$
$$r = \frac{d}{\hat y_T}$$
In srs.r:

```r
srs.total = function(N, mysample, alpha)
```
Estimation of Population Proportion $P$
Define:
Population Proportion $P = \frac{1}N \sum_{i=1}^N Y_i = \bar Y $
Population Total $A = \sum_{i=1}^N Y_i = NP$
Population Variance $$S^2 = \frac{N}{N-1} P(1-P) =\frac{N}{N-1} PQ \quad \text{where } Q = 1-P$$
Suppose $a$ of the observed $y_1, \ldots, y_n$ have the attribute. Then
$$\hat p = \bar y = \frac{a}{n}$$
UE
Variance of Estimation $$Var(\hat p) = \frac{1-f}n (\frac{N}{N-1}PQ)$$
Estimation of Variance $$\hat Var(\hat p ) = \frac{1-f}{n-1}\hat p \hat q$$
In srs.r:

```r
srs.prop = function(N = NULL, n, event.num, alpha)
```
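A Python sketch of what srs.prop computes (field names are illustrative):

```python
from statistics import NormalDist

def srs_prop(N, n, event_num, alpha):
    """Estimate a population proportion from an SRS of size n with event_num successes."""
    p_hat = event_num / n
    f = n / N
    # (1 - f)/n * n/(n - 1) * p^ q^  simplifies to (1 - f)/(n - 1) * p^ q^
    var_hat = (1 - f) / (n - 1) * p_hat * (1 - p_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    d = z * var_hat ** 0.5
    return {"p": p_hat, "var": var_hat, "ci": (p_hat - d, p_hat + d)}

res = srs_prop(N=2000, n=100, event_num=30, alpha=0.05)
```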
Estimation of Population Total $A$
$$\hat A = N \bar y = N \hat p$$
UE
$$\mathrm{Var}(\hat A) = N^2 \frac{1-f}{n} \frac{N}{N-1} PQ$$
$$\widehat{\mathrm{Var}}(\hat A) = N^2 \frac{1-f}{n} \frac{n}{n-1} \hat p \hat q$$
In srs.r:

```r
srs.num = function(N = NULL, n, event.num, alpha)
```
Determining the Sample size
The sample size is determined by the accuracy needed
$$(V, C, d, r) \implies n_{\min}$$
V: Variance upper bound
C: CV upper bound
d: Error upper bound
r: Relative error upper bound
Sample Size $n_{\min}$ for Estimating Population Mean $\bar Y$
Step 1: Calculate $n_0$
Here $S^2$ and $\bar Y$ are given from historical data.
$$n_0 = \frac{S^2}{V} = \begin{cases} \frac{S^2}{V} & \text{given } V \\ \frac{S^2}{C^2 \bar Y^2} & \text{given } C,\ C = \sqrt{V}/\bar Y \\ \frac{z_{\alpha/2}^2 S^2}{d^2} & \text{given } d,\ d = z_{\alpha/2}\sqrt{V} \\ \frac{z_{\alpha/2}^2 S^2}{r^2 \bar Y^2} & \text{given } r,\ r = z_{\alpha/2}\sqrt{V}/\bar Y \end{cases}$$
Step 2
$$n_{\min} = \begin{cases} \frac{n_0}{1 + \frac{n_0}{N}} & \text{given reasonable } N \\ n_0 & \text{when } N \text{ is very large} \end{cases}$$
In srs size.r:

```r
size.mean = function(N = NULL, Mean.his = NULL, Var.his, method, bound, alpha)
```
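The logic of size.mean can be sketched in Python: convert the given bound into a variance bound $V$, set $n_0 = S^2/V$, then apply the finite-population correction (names are illustrative, not the R function's interface):

```python
import math
from statistics import NormalDist

def size_mean(N, mean_his, var_his, method, bound, alpha=None):
    """n_min for estimating Ybar; method is one of 'V', 'C', 'd', 'r'."""
    if method == "V":
        V = bound
    elif method == "C":
        V = (bound * mean_his) ** 2
    else:
        z = NormalDist().inv_cdf(1 - alpha / 2)
        V = (bound / z) ** 2 if method == "d" else (bound * mean_his / z) ** 2
    n0 = var_his / V
    n = n0 if N is None else n0 / (1 + n0 / N)   # finite-population correction
    return math.ceil(n)

n_min = size_mean(N=1000, mean_his=None, var_his=25.0, method="d", bound=1.0, alpha=0.05)
```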
Sample Size for Estimating Proportion $P$
Here $P$ and $Q = 1 - P$ are given from historical data.
$$n_0 = \frac{PQ}{V} = \begin{cases} \frac{PQ}{V} & \text{given } V \\ \frac{Q}{C^2 P} & \text{given } C \\ \frac{z_{\alpha/2}^2 PQ}{d^2} & \text{given } d \\ \frac{z_{\alpha/2}^2 Q}{r^2 P} & \text{given } r \end{cases}$$
$$n_{\min} = \begin{cases} \frac{n_0}{1 + \frac{n_0 - 1}{N}} & \text{given } N \\ n_0 & N \gg n_0 \end{cases}$$
In srs size.r:

```r
size.prop = function(N = NULL, Prop.his, method, bound, alpha)
```
Sample Size for Estimating Population Total $Y_T$
Use size.mean with adjusted inputs.
Apply the sample-size methods for estimating the population mean $\bar Y$.
Bounding the total is the same as bounding $\bar Y$ with rescaled bounds:
$$\mathrm{Var}(\hat y_T) \leq V \iff \mathrm{Var}(\bar y) \leq \frac{V}{N^2}$$
$$\mathrm{cv}(\hat y_T) \leq C \iff \mathrm{cv}(\bar y) \leq C$$
$$\text{Absolute error}(\hat y_T) \leq d \iff \text{Absolute error}(\bar y) \leq \frac{d}{N}$$
$$\text{Relative error}(\hat y_T) \leq r \iff \text{Relative error}(\bar y) \leq r$$
Stratified Random Sampling
Concept
| | Population | Sample |
|---|---|---|
| Units | $Y_{h1}, \ldots, Y_{hN_h}$ | $y_{h1}, \ldots, y_{hn_h}$ |
| Size | $N_h$ ($\sum_{h=1}^L N_h = N$) | $n_h$ ($\sum_{h=1}^L n_h = n$) |
| Mean | $\overline{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$ | $\overline{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$ |
| Variance | $S_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \overline{Y}_h)^2$ | $s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \overline{y}_h)^2$ |
| Stratum weight / sampling fraction | $W_h = \frac{N_h}{N}$ | $f_h = \frac{n_h}{N_h}$ |
Estimation of Population Mean $\bar Y$
$$\bar y_{st} = \sum_{h=1}^L W_h \bar y_h$$
$$E(\bar y_{st}) = \bar Y$$
$$\mathrm{Var}(\bar y_{st}) = \sum_{h=1}^L W_h^2 \frac{1 - f_h}{n_h} S_h^2$$
$$\widehat{\mathrm{Var}}(\bar y_{st}) = \sum_{h=1}^L W_h^2 \frac{1 - f_h}{n_h} s_h^2$$
See stratified mean.r:

```r
stra.srs.mean1 = function(Nh, nh, yh, s2h, alpha)
stra.srs.mean2 = function(Nh, mysample, stra.index, alpha)
```
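A Python sketch of the computation behind stra.srs.mean1, taking per-stratum sizes, sample sizes, sample means, and sample variances (argument names mirror the R version; the return fields are illustrative):

```python
from statistics import NormalDist

def stra_srs_mean(Nh, nh, yh, s2h, alpha):
    """Stratified estimate of Ybar with its estimated variance and CI."""
    N = sum(Nh)
    Wh = [x / N for x in Nh]                       # stratum weights
    ybar_st = sum(w * y for w, y in zip(Wh, yh))
    var_hat = sum(w ** 2 * (1 - n / Nstr) / n * s2
                  for w, n, Nstr, s2 in zip(Wh, nh, Nh, s2h))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    d = z * var_hat ** 0.5
    return {"ybar_st": ybar_st, "var": var_hat, "ci": (ybar_st - d, ybar_st + d)}

res = stra_srs_mean([600, 400], [30, 20], [10.0, 20.0], [4.0, 9.0], 0.05)
```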
Estimation of Population Total $Y_T$
Total Estimate:
$$\hat{y}_{st} = N \cdot \overline{y}_{st} = N \left( \sum_{h=1}^{L} W_h \overline{y}_h \right)$$
Expected Value of the Estimator:
$$E(\hat{y}_{st}) = Y_T$$
Variance of the Estimator:
$$\operatorname{Var}(\hat{y}_{st}) = N^2 \left( \sum_{h=1}^{L} W_h^2 \cdot \frac{1 - f_h}{n_h} \cdot S_h^2 \right)$$
Estimated Variance:
$$\widehat{\operatorname{Var}}(\hat{y}_{st}) = N^2 \left( \sum_{h=1}^{L} W_h^2 \cdot \frac{1 - f_h}{n_h} \cdot s_h^2 \right)$$
See stratified mean.r
Estimation of Proportion
Symbols:

| | Population | Sample |
|---|---|---|
| Units | $Y_{h1}, \ldots, Y_{hN_h}$ | $y_{h1}, \ldots, y_{hn_h}$ |
| Size | $N_h$ ($N = \sum_{h=1}^L N_h$) | $n_h$ ($n = \sum_{h=1}^L n_h$) |
| Count with attribute | $A_h$ | $a_h$ |
| Proportion | $P_h = \frac{A_h}{N_h}$ | $\hat{p}_h = \frac{a_h}{n_h}$ |
| Variance | $S_h^2 = \frac{N_h}{N_h - 1} P_h Q_h$ | $s_h^2 = \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h$ |
| Weight / sampling fraction | $W_h = \frac{N_h}{N}$ | $f_h = \frac{n_h}{N_h}$ |
Stratified Sampling Estimation of Population Proportion $P$
Estimator for Population Proportion :
$$\hat{p}_{st} = \sum_{h=1}^L W_h \hat{p}_h = \sum_{h=1}^L W_h \cdot \frac{a_h}{n_h}$$
Expected Value :
$$E(\hat{p}_{st}) = P$$
Variance :
$$\mathrm{Var}(\hat{p}_{st}) = \sum_{h=1}^L W_h^2\, \mathrm{Var}(\hat{p}_h) = \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{N_h}{N_h - 1} P_h Q_h \right)$$
Estimated Variance :
$$\widehat{\mathrm{Var}}(\hat{p}_{st}) = \sum_{h=1}^L W_h^2\, \widehat{\mathrm{Var}}(\hat{p}_h) = \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h \right)$$
Confidence Interval (CI), with $d$ and $r$ as before:
$$\left[\hat p_{st} \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat p_{st})}\right], \qquad d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat p_{st})}, \qquad r = \frac{d}{\hat p_{st}}$$
Stratified Sampling Estimation for Total $A$
Estimator for Population Total :
$$\hat{A}_{st} = N \sum_{h=1}^L W_h \hat{p}_h = N \sum_{h=1}^L W_h \cdot \frac{a_h}{n_h}$$
Expected Value :
$$E(\hat{A}_{st}) = A$$
Variance :
$$\mathrm{Var}(\hat{A}_{st}) = N^2 \sum_{h=1}^L W_h^2\, \mathrm{Var}(\hat{p}_h) = N^2 \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{N_h}{N_h - 1} P_h Q_h \right)$$
Estimated Variance :
$$\widehat{\mathrm{Var}}(\hat{A}_{st}) = N^2 \sum_{h=1}^L W_h^2\, \widehat{\mathrm{Var}}(\hat{p}_h) = N^2 \sum_{h=1}^L W_h^2 \left( \frac{1 - f_h}{n_h} \cdot \frac{n_h}{n_h - 1} \hat{p}_h \hat{q}_h \right)$$
Confidence Interval (CI), with $d$ and $r$ as before:
$$\left[\hat A_{st} \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat A_{st})}\right], \qquad d = z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat A_{st})}, \qquad r = \frac{d}{\hat A_{st}}$$
Determining Sample Size
When given $n$, determine $n_h$ for each stratum.
Use

```r
strata.weight = function(Wh, S2h, Ch = NULL, allocation)
  return(wh)
```

with allocation = "Prop", "Opt", or "Neyman".
Use

```r
strata.size = function(n, Wh, S2h, Ch = NULL, allocation)
  return(list(n = n, allocation = allocation, wh = wh, nh = ceiling(nh)))
```
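The three allocation rules all reduce to normalizing a per-stratum priority score. A Python sketch of the strata.weight logic (assuming the stratum weights Wh already sum to 1; names mirror the R arguments):

```python
import math

def strata_weight(Wh, S2h, Ch=None, allocation="Prop"):
    """Allocation fractions wh (summing to 1) for Prop, Opt, or Neyman allocation."""
    if allocation == "Prop":
        raw = list(Wh)                          # wh proportional to stratum weight
    elif allocation == "Opt":                   # needs per-unit costs Ch
        raw = [w * math.sqrt(s2 / c) for w, s2, c in zip(Wh, S2h, Ch)]
    else:                                       # "Neyman": optimal with equal costs
        raw = [w * math.sqrt(s2) for w, s2 in zip(Wh, S2h)]
    total = sum(raw)
    return [x / total for x in raw]

wh = strata_weight([0.6, 0.4], [4.0, 25.0], allocation="Neyman")  # [0.375, 0.625]
```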
The sample size for each stratum, $ n_h $, can be determined using different allocation methods. The general formula is:
$$n_h = W_h \cdot n$$
where $ W_h $ is the stratum weight and $ n $ is the total sample size.
1. Proportional Allocation (Prop) :
The stratum weight $ W_h $ is proportional to the stratum size $ N_h $:
$$W_h = \frac{N_h}{N}$$
Thus, the sample size for stratum $ h $ is:
$$n_h = \frac{N_h}{N} \cdot n$$
This method ensures that the sample size in each stratum is proportional to the stratum’s size in the population.
2. Optimal Allocation (Opt) :
The stratum weight $ W_h $ is adjusted based on the stratum’s variability and cost. The formula is:
$$W_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^L N_h S_h / \sqrt{c_h}}$$
Thus, the sample size for stratum $ h $ is:
$$n_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^L N_h S_h / \sqrt{c_h}} \cdot n$$
This method minimizes the variance of the estimator by allocating more samples to strata with higher variability or lower costs.
3. Neyman Allocation :
The stratum weight $ W_h $ is adjusted based on the stratum’s variability. The formula is:
$$W_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^L N_h S_h / \sqrt{c_h}}$$
If the cost per unit is the same across all strata ($ c_h = c $), this simplifies to:
$$W_h = \frac{N_h S_h / \sqrt{c}}{\sum_{h=1}^L N_h S_h / \sqrt{c}} = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h}$$
Thus, the sample size for stratum $ h $ is:
$$n_h = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h} \cdot n$$
This method minimizes the variance of the estimator by allocating more samples to strata with higher variability.
Summary
Proportional Allocation : Simple and easy to implement, but does not account for variability.
Optimal Allocation : Minimizes variance by considering both variability and cost.
Neyman Allocation : A special case of optimal allocation when costs are equal across strata.
$$\boxed{
\begin{aligned}
&\text{Proportional Allocation: } n_h = \frac{N_h}{N} \cdot n \\
&\text{Optimal Allocation: } n_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^L N_h S_h / \sqrt{c_h}} \cdot n \\
&\text{Neyman Allocation: } n_h = \frac{N_h S_h}{\sum_{h=1}^L N_h S_h} \cdot n \quad (\text{when } c_h = c)
\end{aligned}
}$$
See stratified size.r
When given $V, C, d, r$ of $\bar Y$, determine $n$ and $n_h$
Use

```r
strata.mean.size = function(Nh, S2h, Ch = NULL, allocation, method, bound, Ybar = NULL, alpha = NULL)
```
Step 1: Calculate $w_h$ with the chosen allocation method, so that $n_h = w_h n$:
$$w_h = \begin{cases} W_h & \text{prop} \\ \frac{W_h S_h / \sqrt{c_h}}{\sum_h W_h S_h / \sqrt{c_h}} & \text{opt} \\ \frac{W_h S_h}{\sum_h W_h S_h} & \text{Neyman} \end{cases}$$
Step 2: Calculate $n_{\min}$:
$$n_{\min} = \frac{\sum_h W_h^2 S_h^2 / w_h}{V + \frac{1}{N} \sum_h W_h S_h^2}$$
where
$$V = \begin{cases} V & \text{given } V \\ C^2 \bar Y^2 & \text{given } C \\ (d / z_{\alpha/2})^2 & \text{given } d \\ (r \bar Y / z_{\alpha/2})^2 & \text{given } r \end{cases}$$
$S_h^2$ and $\bar Y$ are given from historical data.
Step 3: $$n_{h,\min} = w_h\, n_{\min}$$
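Steps 1–3 can be sketched together in Python (a hedged sketch of the strata.mean.size logic, with the bound already converted into a variance bound V; names are illustrative):

```python
def strata_mean_size(Nh, S2h, wh, V):
    """n_min = (sum_h Wh^2 Sh^2 / wh) / (V + (1/N) sum_h Wh Sh^2), then nh = wh * n."""
    N = sum(Nh)
    Wh = [x / N for x in Nh]
    num = sum(W ** 2 * s2 / w for W, s2, w in zip(Wh, S2h, wh))
    den = V + sum(W * s2 for W, s2 in zip(Wh, S2h)) / N
    n_min = num / den
    return n_min, [w * n_min for w in wh]   # round each nh up in practice

n_min, nh = strata_mean_size([600, 400], [4.0, 25.0], [0.6, 0.4], V=0.05)
```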
Given $V, C, d, r$ of $P$, determine $n$ and $n_h$
Use

```r
strata.prop.size = function(Nh, Ph, Ch = NULL, allocation, method, bound, Ybar = NULL, alpha = NULL)
```
Here
$$S_h^2 = \frac{N_h}{N_h - 1} P_h Q_h$$
Given $V, C, d, r$ of the Total $Y_T$, determine $n$ and $n_h$
Rescale the input bound and reuse the routines for $\bar Y$.
Use

```r
strata.mean.size = function(Nh, S2h, Ch = NULL, allocation, method, bound, Ybar = NULL, alpha = NULL)
```

| Bound for Total $(Y_T)$, $\hat y_T$ | Bound for Mean $(\bar Y)$, $\bar y_{st}$ |
|---|---|
| $V$ | $\frac{V}{N^2}$ |
| $C$ | $C$ |
| $d$ | $\frac{d}{N}$ |
| $r$ | $r$ |
Given $V, C, d, r$ of the Total $A$, determine $n$ and $n_h$
Rescale the input bound and reuse the routines for $P$.
Use

```r
strata.prop.size = function(Nh, Ph, Ch = NULL, allocation, method, bound, Ybar = NULL, alpha = NULL)
```

| Bound for Total $(A)$, $\hat A_{st}$ | Bound for Proportion $(P)$, $\hat p_{st}$ |
|---|---|
| $V$ | $\frac{V}{N^2}$ |
| $C$ | $C$ |
| $d$ | $\frac{d}{N}$ |
| $r$ | $r$ |
Design Efficiency - Comparison of Sampling Methods
Comparing the variance of a given design against simple random sampling at the same sample size, the design efficiency is defined as the ratio:
$$\text{Deff} = \frac{\mathrm{Var}(\hat\theta_p)}{\mathrm{Var}(\hat\theta_{SRS})}$$
Ratio Estimation and Regression Estimation
Notations
For population use UPPER CASE characters and for sample use lower case.
$$\begin{aligned}
& S_y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \quad &\text{(population variance of $Y$)} \\
& S_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 \quad &\text{(population variance of $X$)} \\
& S_{yx} = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X}) \quad &\text{(population covariance of $Y$, $X$)} \\
& \rho = \frac{S_{yx}}{\sqrt{S_y^2 \cdot S_x^2}} = \frac{S_{yx}}{S_y \cdot S_x} \quad &\text{(population correlation of $Y$, $X$)} \\
& C_y^2 = \frac{S_y^2}{\bar{Y}^2} \quad &\text{(relative variance of $Y$)} \\
& C_x^2 = \frac{S_x^2}{\bar{X}^2} \quad &\text{(relative variance of $X$)} \\
& C_{yx} = \rho \cdot \frac{S_y}{\bar{Y}} \cdot \frac{S_x}{\bar{X}} \quad &\text{(relative covariance of $Y$, $X$)}
\end{aligned}$$
Estimation of Ratio
The ratio is defined as
$$R = \frac{\bar Y}{\bar X} = \frac{Y_T}{X_T}$$
Point Estimation: $$\hat R = \frac{\bar y}{\bar x}$$
AUE $$\lim_{n\to \infty} E(\hat R) = R$$
Variance of Estimation
Proposition: $$\mathrm{MSE}(\hat R) \overset{AUE}{\simeq} \mathrm{Var}(\hat R) \overset{n\to \infty}{\simeq} \frac{1-f}{n\bar X^2} \cdot \frac{1}{N-1}\sum_{i=1}^N (Y_i - RX_i)^2$$
where
$$\begin{aligned} S_g^2 &\overset{0}{=} \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - R X_i)^2 \\ &\overset{1}{=} S_y^2 + R^2 S_x^2 - 2R S_{yx} \\ &\overset{2}{=} \bar Y^2 (C_y^2 + C_x^2 - 2C_{yx}) \end{aligned}$$
Estimation of Variance
Method 1: When $\bar X$ is given,
$$\begin{aligned} \widehat{\mathrm{Var}}_1(\hat{R}) &\overset{0}{=} \frac{1-f}{n} \cdot \frac{1}{\bar{X}^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 \\ &\overset{1}{=} \frac{1-f}{n} \cdot \frac{1}{\bar{X}^2} \cdot (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}) \end{aligned}$$
Method 2: When $\bar X$ is unknown, use $\bar x$ from the sample:
$$\begin{aligned} \widehat{\mathrm{Var}}_2(\hat{R}) &\overset{0}{=} \frac{1-f}{n} \cdot \frac{1}{\bar{x}^2} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \hat{R} x_i)^2 \\ &\overset{1}{=} \frac{1-f}{n} \cdot \frac{1}{\bar{x}^2} \cdot (s_y^2 + \hat{R}^2 s_x^2 - 2 \hat{R} s_{yx}) \end{aligned}$$
Note: When $\bar X$ is given, both methods 1 and 2 can be used. When $\bar X$ is unknown, use method 2.
Confidence Interval
CI1, CI2, CI3
placeholder for confidence interval
Use ratio.r:

```r
ratio = function(y.sample, x.sample, N = NULL, auxiliary = FALSE, Xbar = NULL, alpha)
```
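A Python sketch of the ratio estimate $\hat R$ and its variance estimate, switching between methods 1 and 2 depending on whether $\bar X$ is supplied (the arguments mirror ratio; the return value is illustrative):

```python
from statistics import mean

def ratio_hat(y, x, N, Xbar=None):
    """R^ = ybar/xbar; Var^ uses Xbar (method 1) if given, else xbar (method 2)."""
    n = len(y)
    ybar, xbar = mean(y), mean(x)
    R = ybar / xbar
    f = n / N
    base = Xbar if Xbar is not None else xbar
    ss = sum((yi - R * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)  # residual SS
    var_hat = (1 - f) / n / base ** 2 * ss
    return R, var_hat

R, v = ratio_hat([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], N=100)  # y = 2x exactly
```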
Ratio Estimation of Population Mean $\bar Y$ and Total $Y_T$
SRSF (Simple Random Sampling with Fixed Ratio Estimation) of Population Mean $\bar Y$
Estimator for the Population Mean :
$$\bar{y}_R = \frac{\bar{y}}{\bar{x}} \cdot \bar{X} = \hat{R} \cdot \bar{X}$$
Expected Value of the Estimator :
$$E(\bar{y}_R) = E(\hat{R}) \cdot \bar{X} \approx R \cdot \bar{X} = \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator :
$$\text{Var}(\bar{y}_R) = \bar{X}^2 \cdot \text{Var}(\hat{R})$$
Estimated Variance of the Estimator :
$$\widehat{\text{Var}}(\bar{y}_R) = \bar{X}^2 \cdot \widehat{\text{Var}}_1(\hat{R})$$
Confidence Interval :
$$\text{CI} = \left[ \bar{X} \cdot \text{left},\ \bar{X} \cdot \text{right} \right]$$
In ratio.r use:

```r
ratio.mean = function(y.sample, x.sample, N = NULL, Xbar, alpha)
```
Example:

```r
mean.simple.result = srs.mean(N, y.sample, alpha)
mean.ratio.result = ratio.mean(y.sample, x.sample, N, Xbar, alpha)
var.result = c(mean.simple.result$ybar.var, mean.ratio.result$ybarR.var)
deff.result = deff(var.result)
rownames(deff.result) = c("Simple", "Ratio")
print(deff.result)
```
SRSF Estimation of Population Total $ Y_T $
Estimator for the Population Total :
$$\hat{Y}_R = N \cdot \bar{y}_R$$
Approximately Unbiased Estimator (AUE) :
$$E(\hat{Y}_R) \approx Y_T$$
Variance of the Estimator :
$$\text{Var}(\hat{Y}_R) = N^2 \cdot \text{Var}(\bar{y}_R)$$
Estimated Variance of the Estimator :
$$\widehat{\text{Var}}(\hat{Y}_R) = N^2 \cdot \widehat{\text{Var}}(\bar{y}_R)$$
Confidence Interval :
$$\text{CI} = \left[ N \cdot \text{left},\ N \cdot \text{right} \right]$$
use:

```r
ratio.total = function(y.sample, x.sample, N, Xbar, alpha)
```
Example:

```r
total.simple.result = srs.total(N, y.sample, alpha)
total.ratio.result = ratio.total(y.sample, x.sample, N, Xbar, alpha)
var.result = c(total.simple.result$ytot.var, total.ratio.result$ytotal.var)
deff.result = deff(var.result)
rownames(deff.result) = c("Simple", "Ratio")
print(deff.result)
```
Design Efficiency
Ratio and regression estimation are called complex estimation methods, while the plain SRS estimator is called the simple method. When comparing a complex method to the simple method, design efficiency is defined as the ratio:
$$\text{Deff} = \frac{\text{Var}(\bar y_R)}{\text{Var}(\bar y)} = \begin{cases} <1 & \bar y_R \text{ is more efficient}\\ \geq 1 & \bar y \text{ is more efficient} \end{cases}$$
When $$\rho > \frac{C_x}{2C_y},$$ $\bar y_R$ is more efficient than $\bar y$.
When $Y$ and $X$ are highly correlated, $\bar y_R$ is more efficient than $\bar y$.
Determining Sample Size
Step 1
Given a bound $(V,C,d,r)$ for $\bar Y$, determine the simple sample size $n_{\text{simple}}$ using the function size.mean.
Step 2
Determine the ratio sample size $n_R$:
$$n_R = \text{Deff} \cdot n_{\text{simple}}$$
Use deff = function(var.result) to calculate the design efficiency, and deff.size = function(deff.result, n.simple) to calculate the sample size.
Example:

```r
mean.simple.result = srs.mean(N, y.sample, alpha)
mean.ratio.result = ratio.mean(y.sample, x.sample, N, Xbar, alpha)
var.result = c(mean.simple.result$ybar.var, mean.ratio.result$ybarR.var)
deff.result = deff(var.result)
n.simple = size.mean(N, Mean.his = NULL, Var.his = var(y.sample), method = "d", bound = 0.05, alpha)$size
size.result = deff.size(deff.result, n.simple)
rownames(size.result) = c("Simple", "Ratio")
print(size.result)
```
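As a quick numeric sketch of the second step (in Python rather than the course's R; `deff_size` below is a hypothetical analogue of deff.size, and the numbers are invented):

```python
# Sketch: the complex-design sample size is the SRS size scaled by the
# design effect; round up because a sample size must be an integer.
import math

def deff_size(deff, n_simple):
    return math.ceil(deff * n_simple)

n_ratio = deff_size(deff=0.75, n_simple=201)
print(n_ratio)
```

With Deff < 1 the ratio design needs fewer units than SRS for the same precision bound.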
Regression Estimation of Population Mean $\bar Y$ and Total $Y_T$
The Linear Regression Estimator is defined as
$$\bar y_{lr} = \bar y + \beta(\bar X - \bar x)$$
$$\hat Y_{lr} = N \bar y_{lr}$$
Normally $\beta$ is either a constant or the regression coefficient $B$ of $Y$ on $X$.
When $\beta = 1$, we obtain the difference estimator $$\bar y_d = \bar y + (\bar X - \bar x)$$
When $\beta = 0$, it reduces to the simple estimator $\bar y$.
When $\beta = \bar y/\bar x = \hat R$, we obtain the ratio estimator $\bar y_R$.
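A tiny Python check (illustrative numbers, not course code) that the three special choices of $\beta$ recover the familiar estimators:

```python
# Sketch: the linear regression estimator ybar_lr = ybar + beta*(Xbar - xbar)
# specializes to the difference, simple, and ratio estimators.
def reg_estimate(ybar, xbar, Xbar, beta):
    """Linear regression estimator of the population mean."""
    return ybar + beta * (Xbar - xbar)

ybar, xbar, Xbar = 50.0, 20.0, 22.0          # hypothetical sample/population means

diff_est   = reg_estimate(ybar, xbar, Xbar, beta=1.0)          # difference estimator
simple_est = reg_estimate(ybar, xbar, Xbar, beta=0.0)          # simple estimator
ratio_est  = reg_estimate(ybar, xbar, Xbar, beta=ybar / xbar)  # ratio estimator

print(diff_est, simple_est, ratio_est)
```

Note that the ratio case collapses to $(\bar y/\bar x)\bar X$, exactly the ratio estimator.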
Regression Estimation of Population Mean $\bar Y$
Case 1: $\beta = \beta_0$ is constant
Estimator for the Population Mean
$$\bar{y}_{lr}(\beta_0) = \bar{y} + \beta_0 (\bar{X} - \bar{x})$$
Unbiased Estimator
$$E(\bar{y}_{lr}) = \bar{Y} + \beta_0 (\bar{X} - E(\bar{x})) = \bar{Y} \quad \text{(UE)}$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lr}) = \frac{1-f}{n} \left( S_y^2 + \beta_0^2 S_x^2 - 2 \beta_0 S_{yx} \right)$$
Minimum Variance Condition
$$\text{Minimum when } \beta_0 = B = \frac{S_{yx}}{S_x^2} \Rightarrow \text{Var}_{\text{min}} = \frac{1-f}{n} S_e^2$$
Here $B$ is the population regression coefficient of $Y$ on $X$: $$B = \frac{S_{yx}}{S_x^2}$$
$$\text{Var}_{\text{min}}(\bar y_{lr}) = \frac{1-f}{n} S_y^2(1-\rho^2)$$
$$S_e^2 \triangleq S_y^2(1-\rho^2),\qquad \rho = \frac{S_{yx}}{S_y S_x}$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lr}) = \frac{1-f}{n} \left( s_y^2 + \beta_0^2 s_x^2 - 2 \beta_0 s_{yx} \right)$$
Confidence Interval
$$\left[ \bar{y}_{lr} \pm z_\alpha \sqrt{\widehat{\text{Var}}(\bar{y}_{lr})} \right]$$
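A minimal Python sketch of Case 1, assuming SRS without replacement and a normal-approximation interval; the helper name and data are invented, and z = 1.96 stands in for the normal quantile at a 95% level:

```python
# Sketch of the constant-beta0 regression estimator: point estimate,
# estimated variance, and a normal-approximation confidence interval.
import math
from statistics import mean, variance

def reg_mean_const_beta(y, x, N, Xbar, beta0, z=1.96):
    n = len(y)
    f = n / N                                          # sampling fraction
    ybar, xbar = mean(y), mean(x)
    s2y, s2x = variance(y), variance(x)                # sample variances (n-1 denom)
    syx = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
    est = ybar + beta0 * (Xbar - xbar)
    var_hat = (1 - f) / n * (s2y + beta0**2 * s2x - 2 * beta0 * syx)
    half = z * math.sqrt(var_hat)
    return est, var_hat, (est - half, est + half)

y = [12.0, 15.0, 14.0, 10.0, 13.0]                     # invented second sample
x = [3.0, 4.0, 4.0, 2.0, 3.0]
est, var_hat, ci = reg_mean_const_beta(y, x, N=100, Xbar=3.5, beta0=2.0)
print(est, var_hat, ci)
```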
Case 2: $\beta = \hat b$ is the sample regression coefficient of $y$ on $x$
$$\beta = \hat b = \frac{s_{yx}}{s_x^2}$$
Estimator for the Population Mean
$$\bar{y}_{lr} = \bar{y} + \hat{b} (\bar{X} - \bar{x})$$
Approximate Unbiased Estimator
$$E(\bar{y}_{lr}) \approx \bar{Y} \quad \text{(AUE)}$$
Mean Squared Error (MSE) and Variance
$$\text{MSE}(\bar{y}_{lr}) \approx \text{Var}(\bar{y}_{lr}) \approx \frac{1-f}{n} S_e^2$$
This is the theoretically minimum variance estimator.
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lr}) = \frac{1-f}{n} s_e^2 = \frac{1-f}{n} \cdot \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)$$
where
$$s_e^2 = \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)$$
In regression.r, use:

```r
regression.mean = function(y.sample, x.sample, N = NULL, Xbar, alpha, method = "Min", beta0 = NULL)
```
Regression Estimation of Population Total $Y_T$
Notice that the total estimator is obtained from the mean estimator by multiplying by $N$: $$\bar y_{lr} \overset{\times N}{\longrightarrow} \hat Y_{lr}$$
```r
regression.total = function(y.sample, x.sample, N = NULL, Xbar, alpha, method = "Min", beta0 = NULL)
```
Comparison of Simple, Ratio, and Regression Estimation
Their corresponding variances are:
$$\begin{aligned}
\text{Var}(\bar y) &= \frac{1-f}{n} \cdot S_y^2 \\
\text{Var}(\bar y_R) &\approx \frac{1-f}{n} \cdot (S_y^2 + R^2 S_x^2 - 2 R\rho S_y S_x) \\
\text{Var}(\bar y_{lr}) &\approx \frac{1-f}{n} \cdot S_y^2 (1-\rho^2)
\end{aligned}$$
The regression estimator is (asymptotically) at least as efficient as the ratio estimator, since the difference of their variances is proportional to
$$(B-R)^2 \geq 0$$
When $n$ is not large, these estimators may be biased. Empirical studies show that when $n$ is small, the regression estimator can be more biased than the ratio estimator.
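A quick Python comparison of the three variance formulas under hypothetical population quantities (not course code):

```python
# Sketch: plug invented population quantities into the three variance
# formulas above and compare efficiency.
def var_simple(f, n, S2y):
    return (1 - f) / n * S2y

def var_ratio(f, n, S2y, S2x, R, rho):
    Sy, Sx = S2y**0.5, S2x**0.5
    return (1 - f) / n * (S2y + R**2 * S2x - 2 * R * rho * Sy * Sx)

def var_reg(f, n, S2y, rho):
    return (1 - f) / n * S2y * (1 - rho**2)

f, n = 0.1, 50
S2y, S2x, R, rho = 25.0, 4.0, 2.0, 0.9     # hypothetical values

v_s = var_simple(f, n, S2y)
v_r = var_ratio(f, n, S2y, S2x, R, rho)
v_lr = var_reg(f, n, S2y, rho)
print(v_s, v_r, v_lr)
```

As expected, the regression variance never exceeds the ratio variance, and both beat SRS when the correlation is strong enough.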
Example: Comparing Simple, Ratio, and Regression Estimation of Population Total $Y_T$

```r
total.simple.result = srs.total(N, y.sample, alpha)
print(total.simple.result)
total.ratio.result = ratio.total(y.sample, x.sample, N, Xbar, alpha)
print(total.ratio.result)
total.reg.result = regression.total(y.sample, x.sample, N, Xbar, alpha, method = "Min", beta0 = NULL)
print(total.reg.result)
var.result = c(total.simple.result$ytot.var, total.ratio.result$ytotal.var, total.reg.result$Var)
deff.result = deff(var.result)
rownames(deff.result) = c("Simple", "Ratio", "Regression")
print(deff.result)
```
Determining Sample Size
The design efficiency is defined as the ratio
$$\text{Deff} = \frac{\text{Var}(\bar y_{lr})}{\text{Var}(\bar y)}$$
Given a bound $(V,C,d,r)$ for $\bar y$, determine the simple sample size $n_{\text{simple}}$, then
$$n_{lr} = \text{Deff} \cdot n_{\text{simple}}$$
This parallels the procedure for ratio estimation.
Stratified Ratio and Regression Estimation
There are two approaches to stratified estimation:
Separate Estimation: first estimate within each stratum, then take the weighted average or sum.
Combined Estimation: first take the weighted average or sum, then estimate from the combined sample.
Stratified Ratio Estimation
For the $h$-th stratum ($h = 1, \dots, L$):
Notations

Population $\overset{SRS}{\longrightarrow}$ Sample:
$$\begin{pmatrix}Y_{h1} & \cdots & Y_{hN_h} \\ X_{h1} & \cdots & X_{hN_h}\end{pmatrix} \overset{SRS}{\longrightarrow} \begin{pmatrix}y_{h1} & \cdots & y_{hn_h} \\ x_{h1} & \cdots & x_{hn_h}\end{pmatrix}$$
Mean: $\bar{Y}_h,\ \bar X_h$ (population); $\bar y_h,\ \bar{x}_h$ (sample)
Var, Cov, $\rho$: $S_{yh}^2, S_{xh}^2, S_{yxh}, \rho_h$ (population); $s_{yh}^2, s_{xh}^2, s_{yxh}, \hat \rho_h$ (sample)
Separate ratio for each stratum: $$R_h = \frac{\bar{Y}_h}{\bar{X}_h}, \qquad \hat R_h = \frac{\bar y_h}{\bar x_h}$$
Combined ratio: $$R_c = \frac{\bar{Y}}{\bar{X}}, \qquad \hat R_c = \frac{\bar y_{st}}{\bar x_{st}}$$
Separate Ratio Estimation of Population Mean $\bar Y$
Estimator for the Population Mean
$$\bar{y}_{RS} = \sum_h W_h \bar{y}_{Rh} = \sum_h W_h \left( \frac{\bar{y}_h}{\bar{x}_h} \cdot \bar{X}_h\right)$$
Notice that $$\bar y_{Rh} = \frac{\bar{y}_h}{\bar{x}_h} \cdot \bar{X}_h$$ is the ratio estimator of the $ h $-th stratum.
Approximate Unbiasedness
$$E(\bar{y}_{RS}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{RS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h S_{yxh} \right)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{RS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{R}_h^2 s_{xh}^2 - 2\hat{R}_h s_{yxh} \right)$$
where $\hat{R}_h = \frac{\bar{y}_h}{\bar{x}_h}$.
In stra ratio.r:

```r
separate.ratio.mean = function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha)
```
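A minimal Python sketch of the separate ratio point estimate (hypothetical stratum summaries; the course's separate.ratio.mean presumably also reports variance and CI):

```python
# Sketch: weight each stratum's own ratio estimate (ybar_h / xbar_h) * Xbar_h
# by the stratum weight W_h = N_h / N.
def separate_ratio_mean(Wh, ybars, xbars, Xbarh):
    return sum(W * (yb / xb) * Xb
               for W, yb, xb, Xb in zip(Wh, ybars, xbars, Xbarh))

Wh    = [0.4, 0.6]        # stratum weights N_h / N
ybars = [10.0, 20.0]      # sample y-means per stratum
xbars = [5.0, 8.0]        # sample x-means per stratum
Xbarh = [5.5, 7.5]        # known population x-means per stratum
est = separate_ratio_mean(Wh, ybars, xbars, Xbarh)
print(est)
```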
Combined Ratio Estimation of Population Mean $\bar Y$
Estimator for the Population Mean
$$\bar{y}_{RC} = \frac{\bar{y}_{st}}{\bar{x}_{st}} \cdot \bar{X} = \hat{R}_c \cdot \bar{X}$$
Approximate Unbiasedness
$$E(\bar{y}_{RC}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{RC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + R_c^2 S_{xh}^2 - 2R_c S_{yxh} \right)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{RC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{R}_c^2 s_{xh}^2 - 2\hat{R}_c s_{yxh} \right)$$
where:
$$\hat{R}_c = \frac{\bar{y}_{st}}{\bar{x}_{st}}$$
In stra ratio.r:

```r
combined.ratio.mean = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha)
```
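By contrast, the combined estimate forms the stratified means first and applies a single ratio at the end; a Python sketch with invented numbers:

```python
# Sketch: first form the stratified means ybar_st and xbar_st, then
# apply one ratio against the known overall Xbar.
def combined_ratio_mean(Wh, ybars, xbars, Xbar):
    ybar_st = sum(W * yb for W, yb in zip(Wh, ybars))   # stratified mean of y
    xbar_st = sum(W * xb for W, xb in zip(Wh, xbars))   # stratified mean of x
    return ybar_st / xbar_st * Xbar

Wh    = [0.4, 0.6]
ybars = [10.0, 20.0]
xbars = [5.0, 8.0]
est = combined_ratio_mean(Wh, ybars, xbars, Xbar=6.7)
print(est)
```

Only the overall $\bar X$ is needed here, not the per-stratum $\bar X_h$, which is the practical advantage of the combined form.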
Stratified Regression Estimation
Separate Regression Estimation of Population Mean $\bar Y$
Case I: When $\beta_h$ is constant
Estimator for the Population Mean
$$\bar{y}_{lrS} = \sum_h W_h \bar y_{lrh} = \sum_h W_h \left( \bar{y}_h + \beta_h (\bar{X}_h - \bar{x}_h) \right)$$
Notice that $$\bar y_{lrh} = \bar{y}_h + \beta_h (\bar{X}_h - \bar{x}_h)$$ is the regression estimator of the $ h $-th stratum.
Unbiasedness
$$E(\bar{y}_{lrS}) = \bar{Y} \quad (\text{UE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lrS}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + \beta_h^2 S_{xh}^2 - 2\beta_h S_{yxh} \right)$$
Minimum Variance Condition
When $ \beta_h = B_h = \frac{S_{yx_h}}{S_{x_h}^2} $:
$$\text{Var}_{\text{min}} = \sum_h W_h^2 \frac{1-f_h}{n_h} S_{eh}^2$$
where:
$$S_{eh}^2 = S_{yh}^2 (1 - \rho_h^2)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lrS}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \beta_h^2 s_{xh}^2 - 2\beta_h s_{yxh} \right)$$
Case II: When $\beta_h = \hat{b}_h = \frac{s_{yxh}}{s_{xh}^2}$ (sample regression coefficient)
Estimator for the Population Mean
$$\bar{y}_{lrS} = \sum_h W_h \left( \bar{y}_h + \hat{b}_h (\bar{X}_h - \bar{x}_h) \right)$$
Asymptotically Unbiased Estimator
$$E(\bar{y}_{lrS}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lrS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} S_{yh}^2 (1 - \rho_h^2)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lrS}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \cdot \frac{n_h-1}{n_h-2} \left( s_{yh}^2 - \frac{s_{yxh}^2}{s_{xh}^2} \right)$$
In stra regression.r:

```r
seperate.regression.mean = function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha, method = "Min", beta0 = NULL)
```
Combined Regression Estimation of Population Mean $\bar Y$
Case I: When $\beta$ is constant
Estimator for the Population Mean
$$\bar{y}_{lrC} = \bar{y}_{st} + \beta (\bar{X} - \bar{x}_{st})$$
Unbiasedness
$$E(\bar{y}_{lrC}) = \bar{Y} \quad (\text{UE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lrC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + \beta^2 S_{xh}^2 - 2\beta S_{yxh} \right)$$
Minimum Variance Condition
When $$\beta = B_c = \frac{\sum_h W_h^2 \frac{1-f_h}{n_h} S_{yxh}}{\sum_h W_h^2 \frac{1-f_h}{n_h} S_{xh}^2},$$ the variance attains its minimum.
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lrC}) = \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{\beta}^2 s_{xh}^2 - 2\hat{\beta} s_{yxh} \right)$$
Case II
When $$\beta = \hat{b}_c = \frac{\sum_h W_h^2 \frac{1-f_h}{n_h} s_{yxh}}{\sum_h W_h^2 \frac{1-f_h}{n_h} s_{xh}^2}$$
Estimator for the Population Mean
$$\bar{y}_{lrC} = \bar{y}_{st} + \hat{b}_c (\bar{X} - \bar{x}_{st})$$
Approximate Unbiasedness
$$E(\bar{y}_{lrC}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lrC}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( S_{yh}^2 + B_c^2 S_{xh}^2 - 2B_c S_{yxh} \right)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lrC}) \approx \sum_h W_h^2 \frac{1-f_h}{n_h} \left( s_{yh}^2 + \hat{b}_c^2 s_{xh}^2 - 2\hat{b}_c s_{yxh} \right)$$
In stra regression.r:

```r
combined.regression.mean = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha, method = "Min", beta0 = NULL)
```
Estimation of Population Total $Y_T$
Notice that the total is obtained from the mean by multiplying by $N$: $$\bar Y \overset{\times N}{\longrightarrow} Y_T$$
In stra ratio.r:

```r
separate.ratio.total = function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha)
combined.ratio.total = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha)
```
In stra regression.r:

```r
seperate.regression.total = function(Nh, y.sample, x.sample, stra.index, Xbarh, alpha, method = "Min", beta0 = NULL)
combined.regression.total = function(Nh, y.sample, x.sample, stra.index, Xbar, alpha, method = "Min", beta0 = NULL)
```
Determining Sample Size
$$\text{Deff} = \frac{\text{Var}(\bar y_{\text{prop}})}{\text{Var}(\bar y_{\text{st}})} \implies n_{\text{prop}} = \text{Deff} \cdot n_{\bar y_{st}}$$
where, given a bound $(V,C,d,r)$ for $\bar y_{st}$, we first calculate the sample size $n_{\bar y_{st}}$.
Here $\bar y_{\text{prop}}$ denotes the estimator obtained by one of the methods (RS, RC, lrS, lrC).
Double Sampling
Double Sampling (Two-phase Sampling) is a method with two phases.
First, draw a large sample from the population to collect auxiliary information. In this course, the first-phase sampling is always SRS.
Second, draw a small sample. In this course, the second-phase sample is always drawn from the first-phase sample.
Process
Population:
$$Y_1, \dots, Y_N$$
Step 1:
Sampling Method: SRS (Simple Random Sampling)
Sample Drawn:
$$y_1', \dots, y_{n'}' \quad \text{(First Sample)}$$
Step 2:
$$y_1, \dots, y_n \quad \text{(Second Sample)}$$
Estimation:
$$\hat{\theta} = \hat{\theta}(y_1, \dots, y_n)$$
Expectation and Variance Decomposition
Expectation of the Estimator
$$\begin{aligned}
E(\hat{\theta}) &= E\left( \hat{\theta}(y_1, \dots, y_n) \right)\\
&= E_1 \left[ E_2 \left( \hat{\theta}(y_1,\ldots,y_n) \,\big|\, y_1', \dots, y_{n'}' \right) \right]\\
&= E_1 \left[ E_2 (\hat\theta) \right]
\end{aligned}$$
Variance of the Estimator
The variance of the estimator $ \hat{\theta} $ can be decomposed as:
$$\text{Var}(\hat{\theta}) = \text{Var}_1 \left( E_2 \left( \hat{\theta} \mid y_1', \dots, y_{n'}' \right) \right) + E_1 \left( \text{Var}_2 \left( \hat{\theta} \mid y_1', \dots, y_{n'}' \right) \right)$$
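The decomposition can be verified exactly by enumerating a toy two-phase design in Python; all 12 sampling paths below are equally likely, so plain averages give the exact expectations:

```python
# Sketch: check Var = Var1(E2) + E1(Var2) by exhaustive enumeration.
# Phase 1: SRS of size 3 from a population of 4; phase 2: SRS of size 2
# from the phase-1 sample; theta_hat is the phase-2 sample mean.
from itertools import combinations
from statistics import mean

pop = [1.0, 2.0, 3.0, 4.0]
outcomes, cond_means, cond_vars = [], [], []
for s1 in combinations(pop, 3):
    second = [mean(s2) for s2 in combinations(s1, 2)]   # all phase-2 estimates
    outcomes.extend(second)
    m = mean(second)
    cond_means.append(m)                                # E2(theta_hat | s1)
    cond_vars.append(mean([(t - m) ** 2 for t in second]))  # Var2(theta_hat | s1)

def pvar(xs):
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

total = pvar(outcomes)                       # Var(theta_hat) over all 12 paths
decomp = pvar(cond_means) + mean(cond_vars)  # Var1(E2) + E1(Var2)
print(total, decomp)
```

Both sides agree exactly, and the overall mean of the outcomes equals the population mean 2.5, illustrating unbiasedness.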
Double Stratified Sampling
Sampling Process
Step 1
Draw an SRS from the population to obtain the first-phase sample. For known $N$ and given $n'$:
$$(Y_1,\ldots,Y_N) \overset{SRS}{\longrightarrow} (y_1',\ldots,y_{n'}')$$
Step 2
Stratify the first-phase sample $(y_1',\ldots,y_{n'}')$ into $L$ strata. The unit count in stratum $h$ is $n_h'$.
The stratum samples are $(y_{h1}',\ldots, y_{hn_h'}')$ for $h = 1,\ldots, L$.
Step 3
Estimate the stratum weight of stratum $h$, since $W_h = \frac{N_h}{N}$ is unknown.
Using the first-phase sample, we have:
$$w_h' = \frac{n_h'}{n'},\quad h = 1,\ldots,L$$
Step 4
Perform stratified sampling from the first-phase sample $(y_1',\ldots,y_{n'}')$ to obtain the second-phase sample:
$$(y_{h1}',\ldots, y_{hn_h'}') \longrightarrow (y_{h1},\ldots, y_{hn_h}),\quad h = 1,\ldots,L$$
1. Second-phase sampling proportion:
$$v_h = \frac{n_h}{n_h'}$$
$n_h$: size of the second-phase sample in stratum $h$.
$n_h'$: size of the first-phase sample in stratum $h$.
2. Second-phase Sample Mean for the $h$-th stratum ($\bar{y}_h$):
$$\bar{y}_h = \frac{1}{n_h} \sum_{j=1}^{n_h} y_{hj}$$
$ y_{hj} $: Value of the target variable $ y $ for the $ j $-th unit in the second-phase sample of stratum $ h $.
3. Second-phase Sample Variance for the $h$-th stratum ($s_h^2$):
$$s_h^2 = \frac{1}{n_h - 1} \sum_{j=1}^{n_h} (y_{hj} - \bar{y}_h)^2$$
$ \bar{y}_h $: Sample mean of the target variable $ y $ for the second-phase sample of stratum $ h $.
Double Stratified Sampling Estimation of Population Mean $ \bar{Y}$
Estimator for the Population Mean :
$$\bar{y}_{stD} = \sum_{h=1}^{L} w_h' \cdot \bar{y}_h$$
Unbiased Estimation :
$$E(\bar{y}_{stD}) = \bar{Y} \quad (\text{UE})$$
Variance of the Estimator :
$$\text{Var}(\bar{y}_{stD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S^2 + \sum_{h=1}^{L} \frac{1}{n_h'} \left( \frac{1}{v_h} - 1 \right) w_h' S_h^2$$
Estimated Variance of the Estimator :
$$\widehat{\text{Var}}(\bar{y}_{stD}) = \sum_{h=1}^{L} \left( \frac{1}{n_h} - \frac{1}{n_h'} \right) w_h' s_h^2 + \left( \frac{1}{n'} - \frac{1}{N} \right) \sum_{h=1}^{L} w_h' \left( \bar{y}_h - \bar{y}_{stD} \right)^2$$
In two phase stra.r:

```r
twophase.stra.mean1 = function(N = NULL, nh.1st, nh.2nd, ybarh, s2h, alpha)
twophase.stra.total1 = function(N, nh.1st, nh.2nd, ybarh, s2h, alpha)
```
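A Python sketch of the estimator and its estimated variance, plugging hypothetical stratum summaries into the formulas above (`twophase_stra_mean` is an invented analogue of twophase.stra.mean1, not its actual code):

```python
# Sketch of the double stratified estimator ybar_stD and its estimated
# variance; nh1/nh2 are phase-1/phase-2 stratum counts.
def twophase_stra_mean(N, nh1, nh2, ybarh, s2h):
    n1 = sum(nh1)
    wh = [a / n1 for a in nh1]                      # w'_h from phase-1 counts
    est = sum(w * yb for w, yb in zip(wh, ybarh))
    v = sum((1 / b - 1 / a) * w * s2
            for a, b, w, s2 in zip(nh1, nh2, wh, s2h))
    v += (1 / n1 - 1 / N) * sum(w * (yb - est) ** 2 for w, yb in zip(wh, ybarh))
    return est, v

est, v = twophase_stra_mean(N=1000, nh1=[60, 40], nh2=[12, 8],
                            ybarh=[10.0, 20.0], s2h=[4.0, 9.0])
print(est, v)
```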
Double Ratio Estimation and Double Regression Estimation
Sampling Process
$Y$ is the target variable and $X$ is the auxiliary variable.
Step 1
Draw an SRS from the population to obtain the first-phase sample. For known $N$ and given $n'$:
$$\begin{pmatrix}Y_1 &\cdots &Y_N\\X_1 & \cdots &X_N \end{pmatrix} \overset{SRS}{\longrightarrow} \begin{pmatrix}y_1' &\cdots &y_{n'}'\\x_1' & \cdots &x_{n'}' \end{pmatrix}$$
Step 2
Since the auxiliary mean $\bar X$ is unknown, use the first-phase sample to estimate it:
$$\bar x' = \frac{1}{n'} \sum_{j=1}^{n'} x_j'$$
Step 3
SRS from the first-phase samples to obtain the second-phase samples:
$$\begin{pmatrix}y_1' &\cdots &y_{n'}'\\x_1' & \cdots &x_{n'}' \end{pmatrix} \overset{SRS}{\longrightarrow} \begin{pmatrix}y_1 &\cdots &y_{n}\\x_1 & \cdots &x_{n} \end{pmatrix}$$
Notations for second-phase samples:
$$\bar y, \ \bar x, \ s_y^2, \ s_x^2, \ s_{yx}$$
Double Ratio Estimation of Population Mean $ \bar{Y}$
Estimator for the Population Mean :
$$\bar{y}_{RD} = \hat{R} \cdot \bar{x}' = \frac{\bar{y}}{\bar{x}} \cdot \bar{x}'$$
Asymptotically Unbiased Estimation :
$$E(\bar{y}_{RD}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator :
$$\text{Var}(\bar{y}_{RD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) (S_y^2 + R^2 S_x^2 - 2RS_{yx})$$
Estimated Variance of the Estimator :
$$\widehat{\text{Var}}(\bar{y}_{RD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( s_y^2 + \hat{R}^2 s_x^2 - 2\hat{R}s_{yx} \right)$$
In twophase ratio.r:

```r
twophase.ratio.mean = function(N = NULL, n.1st, xbar.1st, y.sample, x.sample, alpha)
twophase.ratio.total = function(N, n.1st, xbar.1st, y.sample, x.sample, alpha)
```
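A minimal Python sketch of the point estimate with invented data: the phase-2 ratio $\bar y/\bar x$ is scaled by the phase-1 mean $\bar x'$:

```python
# Sketch of the double ratio estimator: ybar_RD = (ybar / xbar) * xbar_1st,
# where xbar_1st comes from the larger first-phase sample.
from statistics import mean

def twophase_ratio_mean(xbar_1st, y2, x2):
    return mean(y2) / mean(x2) * xbar_1st

y2 = [10.0, 14.0, 12.0]      # hypothetical second-phase y values
x2 = [4.0, 6.0, 5.0]         # hypothetical second-phase x values
est = twophase_ratio_mean(xbar_1st=5.2, y2=y2, x2=x2)
print(est)
```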
Double Regression Estimation of Population Mean $ \bar{Y}$
Case 1: When $\beta$ is a constant ($\beta = \beta_0$, e.g., $\beta_0 = 1$)
Estimator for the Population Mean :
$$\bar{y}_{lrD} = \bar{y} + \beta (\bar{x}' - \bar{x})$$
Unbiasedness :
$$E(\bar{y}_{lrD}(\beta_0)) = \bar{Y} \quad (\text{UE})$$
Variance of the Estimator :
$$\text{Var}(\bar{y}_{lrD}(\beta_0)) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( S_y^2 + \beta_0^2 S_x^2 - 2\beta_0 S_{yx} \right)$$
Estimated Variance of the Estimator :
$$\widehat{\text{Var}}(\bar{y}_{lrD}(\beta_0)) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) \left( s_y^2 + \beta_0^2 s_x^2 - 2\beta_0 s_{yx} \right)$$
Case II: When $\beta$ is the regression coefficient of the second-phase sample
$$\beta = \hat{b} = \frac{s_{yx}}{s_x^2}$$
Estimator for the Population Mean
$$\bar{y}_{lrD} = \bar{y} + \hat{b} (\bar{x}' - \bar{x})$$
Asymptotically Unbiased Estimation
$$E(\bar{y}_{lrD}) \approx \bar{Y} \quad (\text{AUE})$$
Variance of the Estimator
$$\text{Var}(\bar{y}_{lrD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) S_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) S_y^2 (1 - \rho^2)$$
Estimated Variance of the Estimator
$$\widehat{\text{Var}}(\bar{y}_{lrD}) = \left( \frac{1}{n'} - \frac{1}{N} \right) s_y^2 + \left( \frac{1}{n} - \frac{1}{n'} \right) s_e^2$$
where:
$$s_e^2 = \frac{n-1}{n-2} \left( s_y^2 - \frac{s_{yx}^2}{s_x^2} \right)$$
In twophase regression.r:

```r
twophase.regression.mean = function(N = NULL, n.1st, xbar.1st, y.sample, x.sample, alpha, beta0 = NULL)
twophase.regression.total = function(N = NULL, n.1st, xbar.1st, y.sample, x.sample, alpha, beta0 = NULL)
```
Cluster Sampling
The population is formed of clusters. Cluster sampling draws whole clusters and examines every smaller unit within the sampled clusters.
Cluster Sampling Estimation of Population Mean $\bar{Y}$
Sampling Process
The population is formed by $N$ clusters:
$$\boxed{Y_{11},\ldots,Y_{1M_1}}_1 \quad \cdots \quad \boxed{Y_{i1},\ldots,Y_{iM_i}}_i \quad \cdots \quad \boxed{Y_{N1},\ldots,Y_{NM_N}}_N$$
SRS from the cluster indices:
$$(1,\ldots,N) \overset{SRS}{\longrightarrow} (1,\ldots,n)$$
We obtain the samples:
$$\boxed{y_{11},\ldots,y_{1m_1}}_1 \quad \cdots \quad \boxed{y_{i1},\ldots,y_{im_i}}_i \quad \cdots \quad \boxed{y_{n1},\ldots,y_{nm_n}}_n$$
For a given $n$, the sampling fraction is $$f = \frac{n}{N}$$
Clusters of equal size ($M_i = M = m_i$)
Notations
UPPER CASE: population; lower case: sample.
$$\begin{aligned}
\overline{Y}_{i} &= \frac{1}{M} \sum_{j=1}^{M} Y_{ij} & \text{(cluster mean)} \\
\overline{\overline{Y}} &= \frac{1}{MN} \sum_{i=1}^{N} \sum_{j=1}^{M} Y_{ij} & \text{(unit mean)} \\
\overline{Y} &= \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} Y_{ij} & \text{(mean by cluster)} \\
S_{w}^{2} &= \frac{1}{N(M-1)} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( Y_{ij} - \overline{Y}_{i} \right)^{2} & \text{(within-cluster variance)} \\
S^{2} &= \frac{1}{NM-1} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( Y_{ij} - \overline{Y} \right)^{2} & \text{(total variance)} \\
S_{b}^{2} &= \frac{M}{N-1} \sum_{i=1}^{N} \left( \overline{Y}_{i} - \overline{Y} \right)^{2} & \text{(between-cluster variance)}
\end{aligned}$$
The sample quantities $\bar y_i, \ \bar{\bar{y}}, \ \bar y, \ s_w^2, \ s^2, \ s_b^2$ are defined similarly.
Proposition:
Decomposition of population variance S 2 S^2 S 2 :
$$\begin{aligned}S^2 &= \frac{1}{NM-1} \left[(N-1)S_b^2 + N(M-1) S_w^2\right]\\&=\frac{N-1}{NM-1}S_b^2 + \frac{N(M-1)}{NM-1}S_w^2\end{aligned}$$
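This ANOVA-type identity holds exactly for any population, and can be checked numerically. A minimal Python sketch (the population below is toy data chosen purely for illustration):

```python
# Toy population: N = 4 clusters of M = 3 units each (illustrative values).
Y = [[1, 2, 3],
     [2, 4, 6],
     [1, 1, 4],
     [5, 5, 8]]
N, M = len(Y), len(Y[0])

Ybar_i = [sum(row) / M for row in Y]                # cluster means
Ybarbar = sum(sum(row) for row in Y) / (N * M)      # unit (grand) mean

# Within-cluster variance S_w^2
S_w2 = sum((y - Ybar_i[i]) ** 2 for i, row in enumerate(Y) for y in row) / (N * (M - 1))
# Between-cluster variance S_b^2
S_b2 = M * sum((yb - Ybarbar) ** 2 for yb in Ybar_i) / (N - 1)
# Total variance S^2
S2 = sum((y - Ybarbar) ** 2 for row in Y for y in row) / (N * M - 1)

# Check the decomposition S^2 = (N-1)/(NM-1) S_b^2 + N(M-1)/(NM-1) S_w^2
lhs = S2
rhs = (N - 1) / (N * M - 1) * S_b2 + N * (M - 1) / (N * M - 1) * S_w2
print(S_w2, S_b2, S2)   # 2.75 11.0 5.0
```

The two sides agree up to floating-point rounding, as the decomposition predicts.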
Decomposition of the sample variance $s^2$:
$$s^2 =\frac{n-1}{nM-1}s_b^2 + \frac{n(M-1)}{nM-1}s_w^2$$
Estimation of the unit mean $\overline{\overline{Y}}$
Estimation
$$\overline{\overline{y}} = \frac{1}{nM} \sum_{i=1}^n \sum_{j=1}^M y_{ij} = \left( \frac{1}{M} \right) \bar{y}$$
$$\left( = \frac{1}{n} \sum_{i=1}^n \bar{y}_i = \text{mean}(\bar{y}_1, \ldots, \bar{y}_n) \right)$$
Unbiased
$$E(\overline{\overline{y}}) = \overline{\overline{Y}} \quad (\text{UE})$$
Variance of the Estimator
$$\text{Var}(\overline{\overline{y}}) = \frac{1-f}{nM} S_b^2$$
where $f = n/N$ is the fraction of clusters sampled.
Estimation of Variance
$$\widehat{\text{Var}}(\overline{\overline{y}}) = \frac{1-f}{nM} s_b^2$$
In `cluster srs.r`:

```r
cluster.srs.mean = function(N, M.ith, ybar.ith, s2.ith, alpha)
```
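The body of `cluster.srs.mean` is not shown beyond its signature, so the following is a hypothetical Python sketch of what such a helper might compute: the point estimate, its estimated variance, and a normal-approximation confidence interval, under the assumption of roughly equal cluster sizes (proxied by the mean of `M.ith`). All names and behavior are assumptions, not the notes' actual R code.

```python
from statistics import NormalDist

def cluster_srs_mean(N, M_ith, ybar_ith, s2_ith, alpha):
    """Hypothetical analogue of cluster.srs.mean (assumed: R body not shown).

    Assumes roughly equal cluster sizes, proxied by mean(M_ith); note that
    s2_ith (within-cluster variances) is not needed for the mean and its CI
    under this design -- it enters only when estimating S^2.
    """
    n = len(ybar_ith)                    # number of sampled clusters
    M = sum(M_ith) / n                   # mean cluster size as a proxy for M
    f = n / N                            # cluster sampling fraction
    ybarbar = sum(ybar_ith) / n          # estimate of the unit mean
    # Between-cluster sample variance: s_b^2 = M/(n-1) * sum_i (ybar_i - ybarbar)^2
    s_b2 = M * sum((yb - ybarbar) ** 2 for yb in ybar_ith) / (n - 1)
    var_hat = (1 - f) / (n * M) * s_b2   # estimated Var(ybarbar)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * var_hat ** 0.5
    return ybarbar, var_hat, (ybarbar - half, ybarbar + half)

est, v, ci = cluster_srs_mean(N=10, M_ith=[3, 3, 3, 3],
                              ybar_ith=[2, 4, 2, 6], s2_ith=[1, 4, 3, 3],
                              alpha=0.05)
print(est, round(v, 4))   # 3.5 0.55
```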
Estimation of Population Variance $S^2$
Proposition
Recall:
$$S^2 =\frac{N-1}{NM-1}S_b^2 + \frac{N(M-1)}{NM-1}S_w^2$$
$$s^2 =\frac{n-1}{nM-1}s_b^2 + \frac{n(M-1)}{nM-1}s_w^2$$
We have:
$s_b^2$ is an Unbiased Estimator of $S_b^2$
$s_w^2$ is an Unbiased Estimator of $S_w^2$
$s^2$ is NOT an Unbiased Estimator of $S^2$
The Unbiased Estimator of $S^2$ is:
$$\begin{aligned}\hat S^2 &= \frac{N-1}{NM-1}s_b^2 + \frac{N(M-1)}{NM-1}s_w^2 \quad &(N\text{ is given})\\\hat S^2 &\approx \frac{1}M s_b^2 + \frac{M-1}M s_w^2 \quad &(N = +\infty)\end{aligned}$$
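As a quick numeric illustration of the two cases, here is a small Python sketch; the inputs $s_b^2 = 11$, $s_w^2 = 2.75$, $M = 3$, $N = 4$ are assumed toy values, not data from the notes:

```python
def S2_hat(s_b2, s_w2, M, N=None):
    """Unbiased estimator of S^2; N=None stands for the N = +infinity case."""
    if N is None:
        return s_b2 / M + (M - 1) / M * s_w2
    return (N - 1) / (N * M - 1) * s_b2 + N * (M - 1) / (N * M - 1) * s_w2

print(round(S2_hat(11.0, 2.75, M=3, N=4), 4))   # 5.0  (N given)
print(round(S2_hat(11.0, 2.75, M=3), 4))        # 5.5  (N = +infinity)
```

Note how the infinite-population form is just the limit of the finite-$N$ coefficients as $N \to \infty$.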
Design Efficiency
Definition
Within-Cluster Correlation Coefficient:
$$\begin{aligned}\rho_c &\overset{\text{def}}{=} \frac{2\sum_{i=1}^N \sum_{j<k}^M (Y_{ij} - \overline{\overline{Y}})(Y_{ik} - \overline{\overline{Y}})}{(M-1)(NM-1) S^2}\\&= 1 - \frac{NM S_w^2}{(NM-1) S^2}\end{aligned}$$
Note that:
$$\rho_c \in \left[ -\frac{1}{M-1}, 1 \right]$$
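The second form of the definition makes $\rho_c$ easy to compute; a sketch using assumed toy values ($S_w^2 = 2.75$, $S^2 = 5$ for $N = 4$, $M = 3$):

```python
# rho_c via the identity rho_c = 1 - NM * S_w^2 / ((NM - 1) * S^2)
N, M, S_w2, S2 = 4, 3, 2.75, 5.0        # illustrative values (assumed)
rho_c = 1 - (N * M * S_w2) / ((N * M - 1) * S2)
print(rho_c)                             # 0.4
print(-1 / (M - 1) <= rho_c <= 1)        # True: inside the admissible range
```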
To calculate the design efficiency, we compare the variance of the cluster estimator with the variance under SRS.
Variance of the Cluster Estimator
$$\begin{aligned}\text{Var}(\overline{\overline{y}}) &= \frac{1-f}{nM} S_b^2\\
&=\frac{1-f}{nM} \cdot \frac{NM-1}{M(N-1)} S^2 \left(1 + (M-1)\rho_c\right) \quad &(N \text{ is given})\\
&=\frac{1-f}{nM} S^2 \left(1 + (M-1)\rho_c\right) \quad & (N = +\infty)\end{aligned}$$
Now let us estimate $\rho_c$. Its estimator is:
$$\begin{aligned}\hat\rho_c &= 1 - \frac{NM s_w^2}{(NM-1)\hat S^2} \quad & (N \text{ is given})\\ &= \frac{s_b^2 - s_w^2}{s_b^2 + (M-1)s_w^2} \quad & (N = +\infty)\end{aligned}$$
Variance for SRS from a population of $NM$ units with sample size $nM$
$$\text{Var}(\overline{y}_{SRS}) = \frac{1 - f}{nM} S^2$$
Hence the Design Efficiency can be derived as:
$$\widehat{\text{Deff}} = \frac{\text{Var}(\overline{\overline{y}})}{\text{Var}(\overline{y}_{SRS})} = \begin{cases}
\frac{NM-1}{M(N-1)} \left(1 + (M-1)\hat\rho_c\right) & N \text{ is finite} \\
1 + (M-1)\hat\rho_c & N = +\infty
\end{cases}$$
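Plugging in illustrative numbers (assumed values $N = 4$, $M = 3$, $\hat\rho_c = 0.4$) shows how positive within-cluster correlation inflates the variance relative to SRS:

```python
N, M, rho_hat = 4, 3, 0.4               # illustrative values (assumed)
deff_finite = (N * M - 1) / (M * (N - 1)) * (1 + (M - 1) * rho_hat)
deff_inf = 1 + (M - 1) * rho_hat        # N -> infinity case
print(round(deff_finite, 4), round(deff_inf, 4))   # 2.2 1.8
```

A $\widehat{\text{Deff}}$ above 1 means cluster sampling needs proportionally more units than SRS for the same precision.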
Determining the Sample Size
Given $(V, C, d, r)$ for $\bar y_{SRS}$, we can determine the sample size $n_{SRS}$; therefore:
$$n_{\min} = \widehat{\text{Deff}} \cdot n_{SRS}$$
The minimum number of clusters is:
$$n_{\text{cluster}} = \frac{n_{\min}}{M}$$
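For the variance-bound case $V$, one way to sketch the whole chain in Python is below; the SRS size formula $n_0 = S^2/V$, $n_{SRS} = n_0/(1 + n_0/(NM))$ is the standard finite-population result recalled from simple random sampling, and all numeric inputs are illustrative assumptions. Rounding the cluster count up is a practical convention:

```python
import math

S2, V, N, M, deff = 5.0, 0.25, 400, 3, 1.8   # illustrative values (assumed)
n0 = S2 / V                          # first approximation, ignoring the fpc
n_srs = n0 / (1 + n0 / (N * M))      # SRS sample size in units
n_min = deff * n_srs                 # inflate by the estimated design effect
n_cluster = math.ceil(n_min / M)     # minimum number of clusters (round up)
print(round(n_srs, 2), round(n_min, 2), n_cluster)   # 19.67 35.41 12
```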
Clusters with different sizes
If the $M_i$ are close enough, use the mean $\bar M$ as a proxy for $M$:
$$\bar M = \frac{1}{N} \sum_{i=1}^N M_i$$
When the $M_i$ differ widely, stratify the clusters by size so that the clusters within each stratum have similar sizes, then use the stratum mean as a proxy within each stratum.