Sunday, 10 April 2016

Identification of possible determinants of Life Expectancy using Classification and Regression Tree (CART) analysis


Methods

Sample

1.1 Study population
The world currently has 195 independent sovereign states (including the disputed but de facto independent Taiwan), plus about 60 dependent areas and five disputed territories, such as Kosovo (http://www.nationsonline.org/oneworld/countries_of_the_world.htm).

1.2 Study sample
The current study includes data from a sample of 248 of the world's countries.

1.3 Description of the sample
The sample represents 95% (248/260) of the whole population of sovereign states, dependent areas and disputed territories. Data are from the World Bank database of country indicators and refer to the year 2013.

Measures

2.1 Description of the variables to be included in the analysis

Life expectancy at birth, the response variable, indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. This variable refers to the year 2013 and is derived from male and female life expectancy at birth from sources such as: (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Report (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
The explanatory variables included in the analysis are:

GDP per capita
GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. The World Bank is the source of the data, and values refer to the year 2013.

Access to electricity
Access to electricity is the percentage of population with access to electricity. Electrification data are collected from industry, national surveys and international sources. Values refer to year 2013.

Health expenditure per capita
Health expenditure per capita is the sum of public and private health expenditures divided by the total population. It covers the provision of health services (preventive and curative), family planning activities, nutrition activities, and emergency aid designated for health, but does not include provision of water and sanitation. Data are in current U.S. dollars and refer to the year 2013.

Improved sanitation facilities
Access to improved sanitation facilities refers to the percentage of the population using improved sanitation facilities. Improved sanitation facilities are likely to ensure hygienic separation of human excreta from human contact. They include flush/pour flush (to piped sewer system, septic tank, pit latrine), ventilated improved pit (VIP) latrine, pit latrine with slab, and composting toilet. Data are from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation and refer to the year 2013.

Improved water source
Access to an improved water source refers to the percentage of the population using an improved drinking water source. Improved drinking water sources include piped water on premises (a piped household water connection located inside the user’s dwelling, plot or yard) and other improved drinking water sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection). Data are from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation and refer to the year 2013.

Labor force, female
Female labor force as a percentage of the total shows the extent to which women are active in the labor force. The labor force comprises people ages 15 and older who meet the International Labour Organization's definition of the economically active population. Data are from the International Labour Organization and refer to the year 2013.

PM2.5 air pollution
This indicator is the percentage of a country’s population exposed to ambient concentrations of PM2.5 that exceed the WHO guideline value, that is, the portion living in places where mean annual concentrations of PM2.5 are greater than 10 micrograms per cubic meter. The World Health Organization recommends this guideline value as the lower end of the range of concentrations over which adverse health effects due to PM2.5 exposure have been observed. Data are from the Institute for Health Metrics and Evaluation, University of Washington, Seattle, and refer to the year 2013.

2.2 How variables are managed.
Life expectancy at birth, the response variable, was dichotomized: countries with a value less than or equal to 65.65 years were classified as having low life expectancy at birth, and all other countries as having higher life expectancy.
All predictor variables were kept in their original continuous format.
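The dichotomization described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the 1/0 coding, the function name `dichotomize` and the country values are assumptions; only the 65.65-year cutoff comes from the text.

```python
# Cutoff taken from the text; everything else here is a hypothetical illustration.
CUTOFF_YEARS = 65.65

def dichotomize(life_expectancy, cutoff=CUTOFF_YEARS):
    """Return 1 for low life expectancy (<= cutoff), 0 for higher, None if missing."""
    if life_expectancy is None:
        return None
    return 1 if life_expectancy <= cutoff else 0

# Invented values, not World Bank data.
sample = {"Country A": 58.2, "Country B": 65.65, "Country C": 81.4, "Country D": None}
labels = {name: dichotomize(value) for name, value in sample.items()}
print(labels)
```

A country exactly at the cutoff falls in the low group, matching the "less than or equal to" rule.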

Analyses

1) Description of the statistical methods
The distributions for the predictors and life expectancy at birth, the response variable, were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables.
Scatter plots and box plots were also examined, and Pearson correlation and Analysis of Variance (ANOVA) were used to test bivariate associations between individual predictors and life expectancy at birth, the response variable.
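As a rough illustration of the bivariate step, the Pearson correlation between one predictor and life expectancy could be computed as below. This is a minimal sketch in plain Python rather than the SAS procedure used in the study; the function name `pearson_r` and the data values are invented, and pairs with a missing value are simply skipped, which is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Invented values: percent of population with improved sanitation, life expectancy in years.
sanitation = [20.0, 45.0, 60.0, 85.0, 99.0]
life_exp = [54.0, 61.0, 66.0, 74.0, 80.0]
r = pearson_r(sanitation, life_exp)
print(round(r, 3))
```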
Classification and Regression Tree (CART) analysis was used to identify possible determinants of life expectancy. The analysis was performed using PROC HPSPLIT in SAS version 9.4.
The entropy criterion was selected in the GROW statement to split the observations during recursive partitioning, which produces a large initial tree. Cost-complexity pruning was selected in the PRUNE statement to choose a smaller subtree that avoids overfitting the data.
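To make the entropy criterion concrete, the sketch below scores candidate cut points on a single continuous predictor by the reduction in entropy they achieve. It is a simplified binary-split illustration, not a reproduction of how PROC HPSPLIT searches for splits; the function names and the data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value <= threshold and value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# Invented data: percent with improved sanitation, and 1 = low life expectancy.
sanitation = [15, 22, 35, 48, 70, 88, 95, 99]
low_life_expectancy = [1, 1, 1, 1, 0, 0, 0, 0]
cut = best_threshold(sanitation, low_life_expectancy)
print(cut, information_gain(sanitation, low_life_expectancy, cut))
```

Recursive partitioning repeats this search inside each resulting child node until a stopping rule is met, which is what yields the large initial tree that pruning then cuts back.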
The cost-complexity plot was also displayed with estimates of the average square error (ASE) for a series of progressively smaller subtrees of the large tree.
The confusion matrix, the misclassification rate, the sensitivity and the specificity were also calculated to evaluate the accuracy of the fitted tree. Missing values were assigned using ASSIGNMISSING=POPULAR.
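The accuracy measures named above all follow from the four cells of the confusion matrix. The sketch below shows the arithmetic, assuming that "low life expectancy" (coded 1) is treated as the positive class; that coding, the function names and the example labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def summarize(actual, predicted, positive=1):
    """Misclassification rate, sensitivity and specificity from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    return {
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual positives that were detected
        "specificity": tn / (tn + fp),  # share of actual negatives that were detected
    }

# Invented labels for ten countries (1 = low life expectancy).
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = summarize(actual, predicted)
print(metrics)
```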

2) Training and test data sets
CART divides the data into learning and test subsamples. The learning sample is used to grow an overly large tree, while the test sample is then used to estimate the rate at which cases are misclassified. The misclassification rate is calculated for a subtree of every size, and the selected subtree is the one with the lowest probability of misclassification.
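The selection rule described above amounts to choosing the minimum of the test-sample misclassification rates across subtree sizes. A minimal sketch follows; the candidate values are invented, and breaking ties in favour of the smaller tree is an assumption rather than something stated in the text.

```python
# Each entry is (number of leaves, misclassification rate on the test sample).
# All values are invented for illustration.
candidate_subtrees = [(1, 0.31), (2, 0.18), (4, 0.12), (7, 0.12), (12, 0.15)]

def select_subtree(candidates):
    """Pick the lowest test misclassification rate; on ties prefer fewer leaves."""
    return min(candidates, key=lambda c: (c[1], c[0]))

best_leaves, best_rate = select_subtree(candidate_subtrees)
print(best_leaves, best_rate)
```

In this invented example the largest tree does worse on the test sample than the mid-sized ones, which is the overfitting pattern that pruning is meant to catch.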

3) Type of cross validation to be used
A cross validation of the final model parameters is performed, and a table describing the cross validation error measures of the parameters is produced.
The cross validation method assigns each training observation randomly to one of 10 folds (with a probability of 1/10 for any given fold).
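The fold assignment described above can be sketched as follows. Each observation independently draws one of 10 folds with equal probability, so the fold sizes are only approximately equal. The seed, the function name and the number of training observations are hypothetical.

```python
import random

def assign_folds(n_observations, n_folds=10, seed=2016):
    """Give each observation a fold index drawn uniformly from 0..n_folds-1."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    return [rng.randrange(n_folds) for _ in range(n_observations)]

n_training = 248  # hypothetical number of training observations
folds = assign_folds(n_training)

for k in range(10):
    held_out = [i for i, f in enumerate(folds) if f == k]
    fitting = [i for i, f in enumerate(folds) if f != k]
    # A tree would be grown on `fitting` and its error measured on `held_out`.
    print(k, len(held_out), len(fitting))
```

Because assignment is random rather than balanced, a fold can end up noticeably smaller or larger than one tenth of the data.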


