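Pseudo Likelihood Confidence Intervals for the Mean of a Population Containing Many Zero Values under Varying Probability Sampling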


Please use this identifier to cite or link to this item: https://tkuir.lib.tku.edu.tw/dspace/handle/987654321/76939

Title:	不等機率抽樣下多零值資料的擬概度信賴區間
Other Titles:	Pseudo Likelihood Confidence Intervals for the Mean of a Population Containing Many Zero Values under Varying Probability Sampling
Authors:	陳順益
Contributors:	Department of Mathematics, Tamkang University
Keywords:	Accounting;inclusion probability;mixture models;pseudo likelihood;stratified sampling;survey sampling
Date:	2011
Issue Date:	2012-05-22 22:15:53 (UTC+8)
Abstract:	This project considers the many-zero-observation problem in survey sampling under complex probability sampling designs, in the context of confidence interval estimation for the population mean. The traditional approach based on the central limit theorem (CLT) performs poorly because the population is severely skewed by the mass at zero. The maximum likelihood (ML) method does not work well in survey sampling applications either, because sampling designs are often so complex in practice that the likelihood function is difficult to pin down and write explicitly. The nonparametric approach suggested by Chen, Chen and Rao (2003) and Chen and Qin (2003) is completely free from the risk of model misspecification. When a suitable parametric model is available, however, parametric analysis has potential advantages in efficiency and simplicity. In this spirit, Chen, Chen and Chen (2010) consider the mixture model proposed by Kvanli, Shen and Deng (1998) and propose a pseudo likelihood method to attack the problem. The pseudo likelihood function is unbiased when the weights are chosen to be the reciprocals of the inclusion probabilities. Simulation results show that the pseudo likelihood method substantially improves the coverage probability when the inclusion probabilities are related to the unit values, and that it outperforms the CLT and ML methods in coverage probability, in the balance of non-coverage rates on the lower and upper sides, and in interval length. The pseudo likelihood method is intended for complex survey sampling problems. The simulation results of Chen, Chen and Chen (2010) indicate that the method is quite robust against misspecification of the superpopulation model, although their discussion covers only the normal and gamma distributions.
However, the reason for this robustness is unclear. We will investigate why the pseudo likelihood method is robust against misspecification of superpopulation models. Furthermore, this project will discuss several other distributions that are widely used in mixture models, namely the exponential, Weibull, and generalized gamma distributions, and derive their applications. Regarding the choice of weights in the pseudo likelihood method, since the auxiliary information ($X_j$) in complex surveys is used and the correlation coefficient of ($Y_j$, $X_j$) is known, we will consider different weighting systems that can utilize the auxiliary information, such as, for unit $i$ in stratum $P_j$, $w_i^{-1} = \frac{x(i)}{\sum_{l \in P_j} x(l)}\, p_j$. A further question is whether it is reasonable for the inclusion probabilities above to be proportional to $w_i^{-1}$ for unit $i$; some modifications may be made in future study. An unequal probability sampling plan is generally easier said than done, so we have to provide more details regarding the weights and the inclusion probabilities. Finally, this research project will also apply the pseudo likelihood approach to data sets containing many zero values under several sampling schemes, such as probability-proportional-to-size sampling and biased sampling. We will develop the related theory and perform extensive simulations. We will also look into the possibility of employing the new method under different sampling designs, e.g., simple random sampling and stratified random sampling.
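The auxiliary-information weights mentioned above can be illustrated with a small sketch. It assumes one reading of the abstract's formula, namely that the inverse weight of unit i in stratum P_j is its share of the stratum's auxiliary total, x(i) / sum of x(l) over l in P_j, multiplied by a stratum-level factor p_j. The abstract does not define p_j, so it is passed in as a plain number per stratum, and the function name `inverse_weights` and its arguments are hypothetical.

```python
def inverse_weights(x, strata, p):
    """Return w_i^{-1} = x_i / (sum of x over unit i's stratum) * p_j.

    x      : auxiliary values, one per unit (assumed positive)
    strata : stratum label of each unit
    p      : dict mapping stratum label to the stratum-level factor p_j
             (its meaning is an assumption; the abstract leaves it open)
    """
    # Total of the auxiliary variable within each stratum.
    totals = {}
    for xi, j in zip(x, strata):
        totals[j] = totals.get(j, 0.0) + xi
    # Size share within the stratum, scaled by the stratum factor.
    return [xi / totals[j] * p[j] for xi, j in zip(x, strata)]


if __name__ == "__main__":
    x = [1.0, 3.0, 2.0, 2.0]
    strata = ["a", "a", "b", "b"]
    p = {"a": 0.5, "b": 1.0}
    print(inverse_weights(x, strata, p))
```

Under this reading the inverse weights within each stratum sum to p_j, so units with larger auxiliary values receive proportionally larger inverse weights, which is the size-proportional behaviour the abstract asks about.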
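This project studies survey data containing a large number of zero values under complex probability sampling, with the aim of constructing confidence intervals for the population mean. Data with many zeros are quite common. For example, at a clinic most patients pay only the registration fee of NT$150; only a few incur charges for treatment items or medication beyond the registration fee and must pay extra according to their circumstances. Such data show that most patients pay only the NT$150 registration fee out of pocket and only a few pay more. Likewise, when inspecting for defective items in quality control, the number of defectives is usually small; if the many good items are recorded as 0 and defective items as nonzero values, the resulting data contain a large number of zeros. Survey data are usually analyzed with the traditional central limit approximation, using the observed sample to estimate a confidence interval for the unknown population mean. When the sample contains many zeros, however, the results of the traditional method become unreliable. If the maximum likelihood ratio method proposed by Kvanli, Shen and Deng (1998) is used instead, the exact likelihood function cannot be obtained because of the complex probability sampling. A natural remedy is to adopt a nonparametric method. Chen, Chen and Rao (2003) and Chen and Qin (2003) developed the empirical likelihood ratio method to construct confidence intervals for the mean of a population with many zeros. When, for many-zero data (Y), auxiliary information from a correlated variable X is available for every data value, Chen and Sitter (1999) extended the empirical likelihood method, combined with auxiliary information, to the pseudo empirical likelihood method and applied it to complex survey sampling designs; under such sampling, more informative units have a higher probability of being selected. In tax auditing, for example, high-income taxpayers are more likely than low-income taxpayers to be selected for audit, and those who were previously audited and found in error are quite likely to be selected again. In this example, the auxiliary information of high income and of errors found in earlier audits raises the selection probability. The resulting pseudo empirical likelihood confidence intervals are more precise than intervals that do not use auxiliary information. When a suitable parametric distribution model is available, however, it is simple and efficient to use; Chen, Chen and Chen (2010) therefore proposed the pseudo likelihood method, which combines auxiliary information with unequal probability sampling to solve such problems. The resulting confidence intervals are more accurate and reliable than those from the traditional method and the maximum likelihood method, and are less affected by the proportion of nonzero values. The pseudo likelihood method can be applied to complex sampling designs. The simulation results of Chen, Chen and Chen (2010) show that the method is robust to misspecification of the superpopulation distribution, but it is not clear why. This project will examine whether this holds and analyze the reasons, and will study applications to several other commonly used parametric distribution models. In addition, unequal probability sampling requires choosing corresponding weights; this project will study other choices of weights, especially weights related to the auxiliary information. For data with few nonzero values and a small total sample size, the project will use auxiliary information to explore the feasibility of various other sampling methods, such as biased sampling and probability-proportional-to-size (PPS) sampling, among other unequal probability sampling methods. It will also study the application of the new method to different sampling designs, such as simple random sampling and stratified sampling.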
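The pseudo likelihood idea summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Chen, Chen and Chen (2010): it assumes a two-part mixture in which a unit is nonzero with probability p, takes the nonzero values to be exponential with mean mu (one of the distributions the project proposes to study), and weights each unit's log-likelihood contribution by the reciprocal of its inclusion probability. The function names `pseudo_loglik` and `pseudo_mle` are hypothetical, and the confidence interval construction itself is not reproduced here.

```python
import math


def pseudo_loglik(p, mu, y, pi):
    """Weighted log-likelihood of a zero/exponential mixture.

    p  : probability that a unit is nonzero (0 < p < 1)
    mu : mean of the exponential distribution for nonzero values (mu > 0)
    y  : observed values (zeros and positive numbers)
    pi : inclusion probability of each sampled unit
    """
    total = 0.0
    for yi, pii in zip(y, pi):
        w = 1.0 / pii  # weight = reciprocal of the inclusion probability
        if yi == 0:
            total += w * math.log(1.0 - p)
        else:
            total += w * (math.log(p) - math.log(mu) - yi / mu)
    return total


def pseudo_mle(y, pi):
    """Closed-form maximizers of pseudo_loglik and the implied mean p * mu."""
    w = [1.0 / pii for pii in pi]
    w_all = sum(w)
    w_pos = sum(wi for wi, yi in zip(w, y) if yi > 0)
    wy_pos = sum(wi * yi for wi, yi in zip(w, y) if yi > 0)
    p_hat = w_pos / w_all      # weighted share of nonzero units
    mu_hat = wy_pos / w_pos    # weighted mean of the nonzero values
    return p_hat, mu_hat, p_hat * mu_hat


if __name__ == "__main__":
    y = [0, 0, 0, 4.0, 10.0]
    pi = [0.1, 0.1, 0.1, 0.2, 0.5]
    print(pseudo_mle(y, pi))
```

In this exponential special case the estimated mean p_hat * mu_hat reduces to the weighted ratio of the sum of w_i y_i to the sum of w_i, so units sampled with high probability because of large values are down-weighted accordingly.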
Appears in Collections:	[Department of Applied Mathematics and Data Science] Research Reports

Files in This Item:

There are no files associated with this item.
