### Analyse TB data using network analysis

Guest blog post by Tim Groot

Introduction

In a very interesting publication on tuberculosis (TB) cases per country, Jose A. Dianes showed how dimension reduction can be achieved using Principal Component Analysis (PCA) and cluster analysis (http://www.datasciencecentral.com/profiles/blogs/data-science-with-python-r-dimensionality-reduction-and). By showing that the first principal component corresponded mainly to the mean number of TB cases and the second mainly to the change over the time span used, it became clear that the first two principal components have a real physical meaning. This is not necessarily the case, because PCA constructs an orthogonal basis from linear combinations of the original measurements, with the eigenvectors ordered by decreasing eigenvalue. The method may also not work well on data with variables of different types. The scripts in this article are written in R.
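To make the remark about the orthogonal basis concrete, here is a minimal sketch on made-up numbers (not the TB dataset). It assumes base R's prcomp() and only illustrates two properties: the components come out ordered by decreasing standard deviation, and the loading vectors are mutually orthogonal. The object name toy is invented for this example.

```r
# Toy data, invented for illustration: 6 "countries" x 4 "years".
set.seed(1)
toy <- matrix(rnorm(24, mean = 100, sd = 20), nrow = 6,
              dimnames = list(paste0("country", 1:6), paste0("y", 1990:1993)))

pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Standard deviations of the components, largest first.
pca$sdev

# The loading vectors (eigenvectors) are orthonormal:
# their cross-product is the identity matrix, up to rounding.
round(crossprod(pca$rotation), 10)
```

Nothing in this construction guarantees that a component maps onto something interpretable such as "mean level" or "change over time"; that has to be checked per dataset.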

Method

Looking for correlations between the time trends is a more direct way to compare countries. Correlation shows similarities in the trend between countries and is sensitive to deviations from the main trend. Grouping countries by similarity can give insight into the mechanism behind the trend and may help to find effective measures against the disease. It may also point to a measure that has a causal effect but has not been identified yet.
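As a rough illustration of why correlation captures the shape of a trend rather than its level, consider the following sketch. The three series are invented for this example and are not taken from the TB data.

```r
# Toy yearly series, invented for illustration (not the TB data).
years <- 1990:2007
a      <- 200 - 5 * (years - 1990)                  # steady decline, high level
b      <- 80  - 2 * (years - 1990)                  # steady decline, low level
c_peak <- 50 + 30 * exp(-((years - 1998)^2) / 8)    # rises, peaks, falls again

trends <- rbind(a = a, b = b, c_peak = c_peak)

# cor() correlates columns, so transpose to put the countries in columns.
round(cor(t(trends)), 2)
```

The two declining series correlate perfectly even though their levels differ, while the series with a peak correlates only weakly with either of them.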

The required libraries are:

library(RCurl) # reading data

library(igraph) # network plot

Results

Load the required data from datasciencecentral.com and process the existing-cases file in the same way as Jose A. Dianes.

existing_cases_file <-

existing_df <- read.csv(text = existing_cases_file, row.names = 1, stringsAsFactors = FALSE)

# Some columns contain thousands separators: strip the commas and convert to integer
existing_df[c(1,2,3,4,5,6,15,16,17,18)] <-
  lapply(existing_df[c(1,2,3,4,5,6,15,16,17,18)],
         function(x) { as.integer(gsub(',', '', x)) })

countries <- rownames(existing_df)

meantb <- rowMeans(existing_df)
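As a small illustration of the cleaning step above, the sketch below applies the same gsub() and as.integer() combination to made-up strings. The values in raw are hypothetical and are not taken from the actual file.

```r
# Hypothetical raw values, as they might look with thousands separators.
raw <- c("1,234", "56", "7,890")

# Remove the commas, then convert the remaining digits to integers.
clean <- as.integer(gsub(",", "", raw))
clean
```

Without the gsub() step, as.integer("1,234") would return NA with a warning, which is why the affected columns are converted this way.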

Create the link table from the correlation matrix, removing the duplicate pairs and the 1's on the diagonal. The lower.tri() function is used for this.

cortb <- cor(t(existing_df))

cortb <- cortb*lower.tri(cortb)

links <- data.frame()

for(i in 1:length(countries)){

meantb[i]))

}
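One way to build such a link table is sketched below on toy data. This is not the loop used in this post: it swaps in which(lower.tri(...), arr.ind = TRUE), the column names from, to and cor are assumptions made for illustration, and it leaves out the mean-TB columns that the loop above appears to attach.

```r
# Toy data, invented for illustration: 4 "countries" x 5 "years".
toy <- rbind(A = c(10, 9, 8, 7, 6),
             B = c(50, 46, 41, 37, 30),
             C = c(5, 7, 9, 8, 6),
             D = c(20, 22, 25, 27, 31))

cortoy <- cor(t(toy))                  # country-by-country correlations

# Row/column indices of the strict lower triangle:
# every pair appears once and the diagonal is excluded.
idx <- which(lower.tri(cortoy), arr.ind = TRUE)

# Hypothetical column names; one row per country pair.
links_toy <- data.frame(from = rownames(cortoy)[idx[, "row"]],
                        to   = colnames(cortoy)[idx[, "col"]],
                        cor  = cortoy[idx],
                        stringsAsFactors = FALSE)
links_toy
```

With four countries this yields six rows, one for each unordered pair.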

A network graph of this link table will result in one uniform group, because each country is still linked to all others.

g <- graph_from_data_frame(links, directed=FALSE)

plot(g)

The trends cover a period of only 18 years, so correlation may not separate the countries' trends very strongly. Over a longer span of years, correlation would perform better as a separator. The trends in this data are generally the same, namely decreasing. Therefore a high threshold for the correlation is used (0.90).

The link table is filtered for correlations larger than 0.9 and a network graph is created.

set.seed(5)

g <- graph_from_data_frame(links, directed=FALSE)

plot(g)

fgc <- cluster_fast_greedy(g)

length(fgc)

## [1] 5

The countries now appear to split into 5 groups: three large clusters and two small ones.
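The filter-then-cluster step can be sketched on a toy link table as follows. The table, its column names (from, to, cor) and the object names are invented for illustration; only graph_from_data_frame(), cluster_fast_greedy(), length() and membership() are igraph functions.

```r
library(igraph)

# Toy link table, invented for illustration; column names are assumptions.
links_toy <- data.frame(
  from = c("A", "A", "B", "D", "D", "E", "A"),
  to   = c("B", "C", "C", "E", "F", "F", "D"),
  cor  = c(0.98, 0.95, 0.97, 0.96, 0.93, 0.99, 0.40)
)

# Keep only strongly correlated pairs; the weak A-D link is dropped.
strong <- links_toy[links_toy$cor > 0.9, ]

# Build an undirected graph from the remaining links and detect communities.
g_toy   <- graph_from_data_frame(strong, directed = FALSE)
fgc_toy <- cluster_fast_greedy(g_toy)

length(fgc_toy)       # number of communities found
membership(fgc_toy)   # community id per vertex
```

Once the weak link is removed, the toy graph falls apart into two separate triangles, and the community detection returns them as two groups. With the unfiltered table everything stays connected, which mirrors the single uniform group seen before filtering.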

Discussion

Plotting the time trends per group makes the grouping of the trends visible.

trendtb <- as.data.frame(t(existing_df))   # years in rows, countries in columns
for(group in 1:length(fgc)){
  sel <- trendtb[,as.character(unlist(fgc[group]))]
  plot(x=NA, y=NA, xlim=c(1990,2007), ylim=range(sel), main=paste("Group", group),
       xlab='Year', ylab='TB cases per 100K distribution')
  for(i in names(sel)){
    points(1990:2007, sel[,i], type='o', cex=.1, lty="solid",
           col=which(i == names(sel)))
  }
  # only the two small groups get a legend
  if(group %in% c(4,5)) legend('topright', legend=names(sel), lwd=2,
                               col=1:length(names(sel)))
}

Groups 4 and 5 pick out quite distinctive trends. Group 4 consists of countries with a maximum number of TB cases in the period 1996 to 2003, and the two countries in group 5 show a dramatic drop in TB cases in 1996, followed by a large increase. This latter trend should be explained by metadata about the dataset.

fgc[4:5]

## $`4`

## [1] "Lithuania" "Zambia"    "Belarus"   "Bulgaria"  "Estonia"

##

## $`5`

## [1] "Namibia"  "Djibouti"

Group 4 consists mostly of former Soviet republics (Lithuania, Belarus, Estonia) plus Bulgaria, a former Eastern Bloc country; Zambia is the exception. The trend could be explained by social problems during the collapse of the USSR, and for Zambia it might likewise be related to political changes.

The range of TB cases in the first three graphs is too large to see the similarities within the groups. Dividing each country's series by its mean gives a better view of the trend.
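The effect of dividing by the mean can be sketched on two made-up series, one with a high level and one with a low level. The numbers and object names are invented for illustration; sweep() is used here as one convenient way to divide each column by its own mean.

```r
# Toy trends, invented for illustration: rows are years, columns are countries.
trend_toy <- cbind(high = c(400, 380, 350, 300),
                   low  = c(12, 11, 10, 9))

# Divide every column by its own mean so that each series is centred around 1.
scaled <- sweep(trend_toy, 2, colMeans(trend_toy), "/")

range(trend_toy)   # dominated by the high-level country
range(scaled)      # comparable across countries
colMeans(scaled)   # each column now has mean 1
```

After scaling, both series share a common y-axis range, so their shapes can be compared directly in one plot.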

for(group in 1:3){
  sel <- trendtb[,as.character(unlist(fgc[group]))]
  selavg <- meantb[as.character(unlist(fgc[group]))]   # mean per country
  # t(sel) has one row per country, so each row is divided by that country's mean
  plot(x=NA, y=NA, xlim=c(1990,2007), ylim=range(t(sel)/selavg),
       main=paste("Group", group), xlab='Year',
       ylab='TB cases per 100K, relative to country mean')
  for(i in names(sel)){
    points(1990:2007, sel[,i]/selavg[i], type='o', cex=.1, lty="solid",
           col=which(i == names(sel)))
  }
}

Now the difference between groups 1 and 3 is more visible: group 1 contains countries tending towards sigmoid trends, while group 3 consists of countries with a steadier decline in TB cases. Countries in group 2 show an increasing, sigmoid-like trend.

Western and developing countries are mixed within the groups. For Western countries the number of TB cases is low, and one TB case more or less may flip the trend; again, a better separation would be expected for a longer time span.

print(fgc[1:3])

## $`1`

##  [1] "Albania"                          "Argentina"

##  [3] "Bahrain"                          "Bangladesh"

##  [5] "Bosnia and Herzegovina"           "China"

##  [7] "Costa Rica"                       "Croatia"

##  [9] "Czech Republic"                   "Korea, Dem. Rep."

## [11] "Egypt"                            "Finland"

## [13] "Germany"                          "Guinea-Bissau"

## [15] "Honduras"                         "Hungary"

## [17] "India"                            "Indonesia"

## [19] "Iran"                             "Japan"

## [21] "Kiribati"                         "Kuwait"

## [23] "Lebanon"                          "Libyan Arab Jamahiriya"

## [25] "Myanmar"                          "Nepal"

## [27] "Pakistan"                         "Panama"

## [29] "Papua New Guinea"                 "Philippines"

## [31] "Poland"                           "Portugal"

## [33] "Puerto Rico"                      "Singapore"

## [35] "Slovakia"                         "Syrian Arab Republic"

## [37] "Thailand"                         "Macedonia, FYR"

## [39] "Turkey"                           "Tuvalu"

## [41] "United States of America"         "Vanuatu"

## [43] "West Bank and Gaza"               "Yemen"

## [45] "Micronesia, Fed. Sts."            "Saint Vincent and the Grenadines"

## [47] "Viet Nam"                         "Eritrea"

## [49] "Jordan"                           "Tunisia"

## [51] "Monaco"                           "Niger"

## [53] "New Caledonia"                    "Guam"

## [55] "Timor-Leste"                      "Iraq"

## [57] "Mauritius"                        "Afghanistan"

## [59] "Australia"                        "Cape Verde"

## [61] "French Polynesia"                 "Malaysia"

##

## $`2`

##  [1] "Botswana"                 "Burundi"

##  [3] "Cote d'Ivoire"            "Ethiopia"

##  [5] "Guinea"                   "Rwanda"

##  [7] "Senegal"                  "Sierra Leone"

##  [9] "Suriname"                 "Swaziland"

## [11] "Tajikistan"               "Zimbabwe"

## [13] "Azerbaijan"               "Georgia"

## [15] "Kenya"                    "Kyrgyzstan"

## [17] "Russian Federation"       "Ukraine"

## [19] "Tanzania"                 "Moldova"

## [21] "Burkina Faso"             "Congo, Dem. Rep."

## [23] "Guyana"                   "Nigeria"

## [25] "Chad"                     "Equatorial Guinea"

## [27] "Mozambique"               "Uzbekistan"

## [29] "Kazakhstan"               "Algeria"

## [31] "Armenia"                  "Central African Republic"

##

## $`3`

##  [1] "Barbados"                 "Belgium"

##  [3] "Bermuda"                  "Bhutan"

##  [5] "Bolivia"                  "Brazil"

##  [7] "Cambodia"                 "Cayman Islands"

##  [9] "Chile"                    "Colombia"

## [11] "Comoros"                  "Cuba"

## [13] "Dominican Republic"       "Ecuador"

## [15] "El Salvador"              "Fiji"

## [17] "France"                   "Greece"

## [19] "Haiti"                    "Israel"

## [21] "Laos"                     "Luxembourg"

## [23] "Malta"                    "Mexico"

## [25] "Morocco"                  "Netherlands"

## [27] "Netherlands Antilles"     "Nicaragua"

## [29] "Norway"                   "Peru"

## [31] "San Marino"               "Sao Tome and Principe"

## [33] "Slovenia"                 "Solomon Islands"

## [35] "Somalia"                  "Spain"

## [37] "Switzerland"              "United Arab Emirates"

## [39] "Uruguay"                  "Antigua and Barbuda"

## [41] "Austria"                  "British Virgin Islands"

## [43] "Canada"                   "Cyprus"

## [45] "Denmark"                  "Ghana"

## [47] "Guatemala"                "Ireland"

## [49] "Italy"                    "Jamaica"

## [51] "Mongolia"                 "Oman"

## [53] "Saint Lucia"              "Seychelles"

## [55] "Turks and Caicos Islands" "Virgin Islands (U.S.)"

## [57] "Venezuela"                "Maldives"

## [59] "Trinidad and Tobago"      "Korea, Rep."

## [61] "Andorra"                  "Anguilla"

## [63] "Belize"                   "Mali"