Saturday, 16 April 2016

Identification of possible determinants of Life Expectancy using Classification and Regression Tree (CART) analysis

Results

Among the countries in this study sample, 149 (75%) had a life expectancy at birth greater than 65 years, while 50 (25%) had a life expectancy at birth less than or equal to 65 years.
Table 1 presents the bivariate associations between country characteristics and life expectancy at birth. With Bonferroni adjustment for these seven comparisons, the threshold for statistical significance is p < .007.
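
As a quick check of the arithmetic, the Bonferroni threshold is the overall significance level divided by the number of comparisons. The sketch below assumes an overall level of 0.05, which the text does not state explicitly, and the seven predictors listed in Table 1; the function name is only illustrative.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni adjustment."""
    return alpha / n_comparisons


# Assumption: overall alpha of 0.05; seven predictors are compared in Table 1.
threshold = bonferroni_threshold(0.05, 7)
print(round(threshold, 4))
```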

Table 1 Country characteristics by Life expectancy at birth

| Country characteristics | Life expectancy at birth greater than 65 years (N=149): Mean | Standard deviation | Life expectancy at birth less/equal to 65 years (N=50): Mean | Standard deviation | F value | p value |
|---|---|---|---|---|---|---|
| GDP per capita (current US$) | 19258.6 | 23322.0 | 2243.7 | 3600.3 | 25.7 | <.0001 |
| Access to electricity (% of population) | 91.7 | 16.3 | 35.5 | 22.5 | 361.4 | <.0001 |
| Health expenditure per capita (current US$) | 1397.8 | 2040.8 | 115.2 | 152.7 | 18.8 | <.0001 |
| Improved sanitation facilities (% of population with access) | 86.4 | 16.9 | 31.6 | 18.2 | 351.2 | <.0001 |
| Improved water source (% of population with access) | 94.6 | 7.7 | 69.6 | 14.7 | 217.8 | <.0001 |
| Labor force, female (% of total labor force) | 40.0 | 9.2 | 44.0 | 8.0 | 7.4 | 0.0070 |
| PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total) | 70.6 | 37.3 | 78.0 | 32.8 | 1.5 | 0.2134 |




Analysis of Variance for the continuous variables yielded the following differences. Compared with countries in the low Life expectancy at birth group, countries in the high Life expectancy at birth group:
  • had a higher Gross domestic product (GDP) per capita, F=25.7, p < .0001
  • had higher health expenditure per capita, F=18.8, p < .0001
  • had a higher % of population with access to electricity, F=361.4, p < .0001
  • had a higher % of population with access to improved sanitation facilities, F=351.2, p < .0001
  • had a higher % of population with access to an improved water source, F=217.8, p < .0001
  • had a lower % of female labor force, F=7.4, p < .007
  • did not differ significantly in the % of population exposed to PM2.5 levels of air pollution exceeding the World Health Organization guideline value, F=1.5, p = .2134
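
For two groups, a one-way ANOVA F value can be approximated from the group means, standard deviations and sizes alone. The sketch below is an illustration, not the SAS procedure used in the study: it assumes that every country in each group has a value for the variable (N=149 and N=50), so where values are missing the result can differ somewhat from the F value reported in Table 1. The function name is hypothetical.

```python
def f_from_summary(groups):
    """One-way ANOVA F statistic from a list of (mean, sd, n) tuples."""
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(mean * n for mean, _, n in groups) / total_n
    # Between-group and within-group sums of squares.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# GDP per capita row of Table 1.
# Assumption: no missing values, so the group sizes are 149 and 50.
f_gdp = f_from_summary([(19258.6, 23322.0, 149), (2243.7, 3600.3, 50)])
print(round(f_gdp, 1))
```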

Next, the country variables were included as possible contributors to a CART model evaluating Life expectancy at birth (Figure 1).


Figure 1 Classification tree.




Each pentagon represents a decision point, for which the predictor variable and cut point are shown. Final groups with high and low outcome probability are represented by rectangles and include outcome frequencies and percentages. Shaded rectangles represent subgroups with relatively high rates of countries with high Life expectancy at birth; unshaded rectangles represent subgroups with relatively low rates.

The % of population with access to electricity was the first variable to separate the sample into two subgroups. Countries in which at least 56.3% of the population had access to electricity were more likely to have high Life expectancy at birth than countries not meeting this cutoff (95.3% vs. 12.2%).

Among the countries in which at least 56.3% of the population had access to electricity, a further subdivision was made on the same variable. Countries in which at least 89.6% of the population had access to electricity were more likely to have high Life expectancy at birth than countries not meeting this cutoff (99.2% vs. 78.6%).

After these two splits on the % of population with access to electricity, none of the other variables entered the tree as a further predictor of Life expectancy at birth.

The model classified 92% of the sample correctly: 95% of the countries with high Life expectancy at birth (sensitivity) and 82% of the countries with low Life expectancy at birth (specificity).
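
Accuracy, sensitivity and specificity all follow from the four cells of the confusion matrix. The sketch below shows the calculation; the counts are hypothetical, chosen only so that the group sizes match Table 1 (149 and 50) and the rates come out close to those reported. They are not taken from the SAS output.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # high-LEB countries classified as high
        "specificity": tn / (tn + fp),  # low-LEB countries classified as low
    }


# Hypothetical counts for illustration only (not from the study output).
metrics = classification_metrics(tp=142, fn=7, tn=41, fp=9)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```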

Sunday, 10 April 2016

Identification of possible determinants of Life Expectancy using Classification and Regression Tree (CART) analysis


Methods

Sample

1.1 Study population
There are currently 195 independent sovereign states in the world (including the disputed but de facto independent Taiwan), plus about 60 dependent areas and five disputed territories, such as Kosovo (http://www.nationsonline.org/oneworld/countries_of_the_world.htm).

1.2 Study sample
The current study includes data from a sample of 248 countries.

1.3 Description of the sample
The sample represents 95% (248/260) of the whole population of sovereign states, dependent areas and disputed territories. Data are from the World Bank database of country indicators and refer to the year 2013.

Measures

2.1 Description of the variables to be included in the analysis

Life expectancy at birth, the response variable, indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. This variable refers to the year 2013 and is derived from male and female life expectancy at birth from sources such as: (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Report (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
The explanatory variables included in the analysis are:

GDP per capita
GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. The World Bank is the source of data and values refer to year 2013.
Access to electricity
Access to electricity is the percentage of population with access to electricity. Electrification data are collected from industry, national surveys and international sources. Values refer to year 2013.

Health expenditure per capita
Health expenditure per capita is the sum of public and private health expenditures divided by total population. It covers the provision of health services (preventive and curative), family planning activities, nutrition activities, and emergency aid designated for health, but does not include provision of water and sanitation. Data are in current U.S. dollars and refer to the year 2013.

Improved sanitation facilities
Access to improved sanitation facilities refers to the percentage of the population using improved sanitation facilities. Improved sanitation facilities are likely to ensure hygienic separation of human excreta from human contact. They include flush/pour flush (to piped sewer system, septic tank, pit latrine), ventilated improved pit (VIP) latrine, pit latrine with slab, and composting toilet. Data are from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation and refer to the year 2013.

Improved water source
Access to an improved water source refers to the percentage of the population using an improved drinking water source. Improved drinking water sources include piped water on premises (piped household water connection located inside the user’s dwelling, plot or yard) and other improved drinking water sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection). Data are from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation and refer to the year 2013.

Labor force, female
Female labor force as a percentage of the total shows the extent to which women are active in the labor force. The labor force comprises people ages 15 and older who meet the International Labour Organization's definition of the economically active population. Data are from the International Labour Organization and refer to the year 2013.

PM2.5 air pollution
The percentage of the population exposed to ambient concentrations of PM2.5 that exceed the WHO guideline value is defined as the portion of a country’s population living in places where mean annual concentrations of PM2.5 are greater than 10 micrograms per cubic meter. This is the guideline value recommended by the World Health Organization as the lower end of the range of concentrations over which adverse health effects due to PM2.5 exposure have been observed. Data are from the Institute for Health Metrics and Evaluation, University of Washington in Seattle, and refer to the year 2013.

2.2 How variables are managed
Life expectancy at birth, the response variable, has been dichotomized to split countries depending on whether they have a value less than or equal to 65.65 years (countries with low life expectancy at birth) or higher.
All predictor variables are kept in their original continuous format.
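
The dichotomization can be sketched as a single comparison against the cutoff. The function name and the example values below are illustrative; only the 65.65-year cutoff comes from the text.

```python
CUTOFF_YEARS = 65.65  # cutoff stated in the text


def life_expectancy_group(years):
    """Return 'low' for values at or below the cutoff, 'high' otherwise."""
    return "low" if years <= CUTOFF_YEARS else "high"


# Illustrative values only, not rows from the data set.
for value in (49.0, 65.65, 83.0):
    print(value, life_expectancy_group(value))
```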

Analyses

1) Description of the statistical methods
The distributions for the predictors and life expectancy at birth, the response variable, were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables.
Scatter plots and box plots were also examined, and Pearson correlation and Analysis of Variance (ANOVA) were used to test bivariate associations between individual predictors and life expectancy at birth, the response variable.
Classification and Regression Tree (CART) analysis was used to identify possible determinants of life expectancy; CART analysis was performed using PROC HPSPLIT in SAS version 9.14.
The entropy criterion was selected in the GROW statement to split the observations during recursive partitioning, which results in a large initial tree. Cost-complexity was selected in the PRUNE statement to prune that tree and select a smaller subtree that avoids overfitting the data.
The cost-complexity plot was also displayed, with estimates of the average square error (ASE) for a series of progressively smaller subtrees of the large tree.
The confusion matrix, used to evaluate the accuracy of the fitted tree, the misclassification rate, the specificity and the sensitivity were also calculated. Missing values were assigned using ASSIGNMISSING=POPULAR.
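
To make the entropy criterion concrete, the sketch below picks the cut point on one continuous predictor that gives the largest reduction in entropy. It is a textbook-style illustration in Python, not a reproduction of how PROC HPSPLIT searches for splits internally, and the toy data are made up.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result


def best_split(values, labels):
    """Return (cut point, entropy reduction) for the best split x >= cut."""
    parent = entropy(labels)
    best_cut, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Candidate cut points: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < cut]
        right = [y for x, y in zip(values, labels) if x >= cut]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Toy data (made up): access to electricity (%) and life expectancy group.
electricity = [20.0, 35.0, 50.0, 60.0, 90.0, 99.0]
group = ["low", "low", "low", "high", "high", "high"]
cut, gain = best_split(electricity, group)
print(cut, gain)
```

Growing a tree repeats this search over all predictors within each resulting subgroup, which is how the same variable can be selected for more than one split.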

2) Training and test data sets
CART divides the data into learning and test subsamples. The learning sample is used to grow an overly large tree, while the test sample is used to estimate the rate at which cases are misclassified. The misclassification rate is calculated for trees of every size, and the selected subtree is the one with the lowest probability of misclassification.

3) Type of cross validation to be used
The final model parameters are cross-validated, and a table describing the cross-validation error measures of the parameters is produced.
The cross-validation method assigns each training observation randomly to one of 10 folds (with a probability of 1/10 for any given fold).
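
The fold assignment described above can be sketched as follows. This is a plain Python illustration rather than the SAS implementation; the seed, the function name and the sample size of 199 (the number of countries in Table 1, used here only as an example) are assumptions.

```python
import random


def assign_folds(n_observations, n_folds=10, seed=0):
    """Assign each observation independently to a random fold 0..n_folds-1."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_observations)]


folds = assign_folds(199)
# Fold sizes vary because assignment is random rather than balanced.
sizes = [folds.count(k) for k in range(10)]
print(sizes)
```

Because each observation is assigned independently, the folds are only approximately equal in size.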



Saturday, 2 April 2016

Identification of possible determinants of Life Expectancy using Classification and Regression Tree (CART) analysis


Title

Identification of possible determinants of Life Expectancy using Classification and Regression Tree (CART) analysis

Research question

The aim of this study is to identify possible determinants of life expectancy (the dependent variable) among a number of predictor variables, such as gross domestic product, health expenditure, education expenditure, air pollution, and access to clean water, sanitation and electricity.

Motivation

The motivation for this research is grounded in the importance of health in accelerating countries' development.

 “Better health is central to human happiness and well-being. It also makes an important contribution to economic progress, as healthy populations live longer, are more productive, and save more.” (www.who.int/hdp/en/)

Life Expectancy at Birth is one of the most widely used summary measures of the population health status at system level.

Life Expectancy at Birth (LEB) is the average number of years a newborn infant would be expected to live if health and living conditions at the time of its birth remained the same throughout its life. It reflects the health of a country's people and the quality of care they receive when they are sick.

According to Wikipedia, LEB in the Bronze and Iron Ages was 26 years, and the 2010 world LEB was 67.2 years. Life expectancy differs dramatically between countries, however: in Swaziland LEB is about 49 years, while in Japan it is about 83 years.
Identifying determinants of life expectancy using a machine learning technique such as Classification and Regression Tree (CART) analysis can help focus interventions in countries with low life expectancy.