Tmap and Assorted Work
Linear Relationship: GNP and LEXPF - country_demographics.csv
Correlation coefficient between GNP and LEXPF
# set working directory
setwd("C:/Users/data")
country<-read.csv("country_demographics.csv",sep=",",header=T)
country2<-country[complete.cases(country), ]
countryDF<-as.data.frame(country2)
LEXPF<-countryDF$LEXPF
GNP<-countryDF$GNP
str(LEXPF)
## num [1:91] 75.5 74.7 77.7 73.8 75.7 72.4 74 75.9 74.8 72.7 ...
str(GNP)
## int [1:91] 600 2250 2980 2780 1690 1640 2242 1880 1320 2370 ...
cor(GNP,LEXPF)
## [1] 0.6500464
Correlation coefficient, rather than covariance, is the first step in determining the linear relationship between gross national product and the life expectancy of females in the country demographics data set due to impact that variable scales have on covariance values. The correlation coefficient evidences the magnitude of the association between variables on a scale of -1 to 1. The correlation coefficient value is generated using the covariance of the variables and dividing that value by the value of each variables’ standard deviation multiplied by one another. The correlation coefficient for the country_demographics data set is 0.6500464. This value is positive and closer to 1 than it is to 0 suggesting a strong positive relationship. The regression model and t* test-statistic for significance will indicate more about the relationship.
Set hypothesis statement as:
H_0: p = 0
H_A: p != 0
lm<-lm(LEXPF~GNP)
lm
##
## Call:
## lm(formula = LEXPF ~ GNP)
##
## Coefficients:
## (Intercept) GNP
## 6.09e+01 8.94e-04
summary(lm)
##
## Call:
## lm(formula = LEXPF ~ GNP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8771 -5.8096 0.6023 6.5539 14.1377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.090e+01 1.095e+00 55.60 < 2e-16 ***
## GNP 8.940e-04 1.108e-04 8.07 3.12e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.506 on 89 degrees of freedom
## Multiple R-squared: 0.4226, Adjusted R-squared: 0.4161
## F-statistic: 65.13 on 1 and 89 DF, p-value: 3.118e-12
The p-value of 3.118e-12, or 0.000000000003118, is less than the significance level of 0.05. Therefore, the null hypothesis is rejected. Additionally, the coefficients tell us that for every one unit increase in gross national product, there is an 8.940e-04 (or 0.0008940) increase in female life expectancy. Interpretation of these are results using y=mx+b formula is y=0.0008940(x)+60.90. This tells us that with a GNP of $0 the female life expectancy is 60.90 years (variable b in the equation). For each additional unit of GNP, female life expectancy increases by 0.0008940 years (variable m in the equation, also known as the slope).
Plotting the Relationship: GNP and LEXPF
plot(GNP,LEXPF)
abline(lm(LEXPF~GNP))
The plot is positively skewed. Let’s look at the plot for this relationship and the accompanying regression line while using log(). Log() will reduce the skew produced by the GNP variable by pulling in outlier values that cause the skew seen in the plot above.
Plotting the Relationship: GNP and LEXPF with log()
logGNP<-log(GNP)
plot(logGNP,LEXPF)
abline(lm(LEXPF~logGNP))
The log() function generated a distribution that is closer to a normal distribution and appears less impacted by skew than the previous plot. From the new plot we can see that the value of b in y=mx+b has decreased significantly, suggestion that the analysis done prior to the log() function was impacted by the skew present in the GNP data.
Recalculating the Correlation Coefficient using the Logged GNP
cor(logGNP,LEXPF)
## [1] 0.8233949
The correlation coefficient value of the relationship, using the logged GNP data, is 0.08233949. This value suggests that the data is more strongly positively correlated than the previous calculation indicated.
Working with Shapefiles
Data Sources: U.S. Census 5-Digit ZIP Code Tabulation Area (ZCTA 5) 2000 - 2019 (Social Explorer). / Colleges and Universities (HIFLD Open Data).
Load Libraries and Shapefiles
library(tmap)
library(tmaptools)
library(raster)
library(sf)
setwd("C:/Users/data")
counties<-st_read("tl_2022_us_county.shp")
## Reading layer `tl_2022_us_county' from data source
## `C:\Users\data\tl_2022_us_county.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 3235 features and 17 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: NAD83
colleges<-st_read("CollegesUniversities.shp")
## Reading layer `CollegesUniversities' from data source
## `C:\Users\data\CollegesUniversities.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 6559 features and 43 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -19007040 ymin: -1611319 xmax: 19077760 ymax: 11514220
## Projected CRS: WGS 84 / Pseudo-Mercator
Filter Data Sets for California
The CA Counties shapefile mapped by land area in square meters allows the viewer to visualize counties with the largest area. This visualization shows that California counties with the largest area tend to be clustered in the inland desert portion of the state with two large area counties residing in the northern region of the state.
countiesCA <- counties[counties$STATEFP == '06', ]
collegesCA <- colleges[colleges$STATE == 'CA', ]
Map CA Counties using ALAND Variable
tmap_options(check.and.fix = TRUE)
tm_shape(countiesCA)+tm_fill("ALAND", title="Sq. M Area", style = "quantile", n=8, palette = "Blues")+tm_borders()+tm_compass(position = c("LEFT", "BOTTOM"))+tm_layout(main.title = "CA Counties & Area", legend.position = c("RIGHT", "TOP"), title.position = c("center", "top"), inner.margins = .01)+tm_scale_bar(position = c("LEFT", "BOTTOM"))
Map Colleges and CA Counties using ALAND Variable
tm_shape(countiesCA)+tm_fill("ALAND", title="Sq. M Area", style = "quantile", n=8, palette = "Blues")+tm_borders()+tm_compass(position = c("LEFT", "BOTTOM"))+tm_layout(main.title = "CA Counties & Colleges", main.title.size = 1, legend.outside = TRUE, legend.outside.position = "bottom", main.title.position = "LEFT")+tm_scale_bar(position = c("LEFT", "BOTTOM")) + tm_shape(collegesCA) + tm_dots(col = "POPULATION", palette = "Greens", style = "quantile", title = "College Population Size", size = .05)
Point Pattern Analysis
# mean center
xy<-cbind(collegesCA$LONGITUDE, collegesCA$LATITUDE)
mc<-apply(xy, 2, mean)
mcCA<-st_sfc(st_point(c(mc[1],mc[2])))
mcCA
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -119.2307 ymin: 35.28371 xmax: -119.2307 ymax: 35.28371
## CRS: NA
## POINT (-119.2307 35.28371)
tm_shape(countiesCA)+tm_fill("ALAND", title="Sq. M Area", style = "quantile", n=8, palette = "Blues")+tm_borders()+tm_compass(position = c("LEFT", "BOTTOM"))+tm_layout(main.title = "CA Counties & Colleges", main.title.size = 1, legend.outside = TRUE, legend.outside.position = "bottom", main.title.position = "LEFT")+tm_scale_bar(position = c("LEFT", "BOTTOM")) + tm_shape(collegesCA) + tm_dots(col = "POPULATION", palette = "Greens", style = "quantile", title = "College Population Size", size = .05) + tm_shape(mcCA) + tm_dots(col='red',size=.35)
# mean circle
sd<-sqrt(sum((xy[,1] - mc[1])^2 + (xy[,2] - mc[2])^2) / nrow(xy))
bearing<-1:360 * pi/180
cx<-mc[1]+sd*cos(bearing)
cy<-mc[2]+sd*sin(bearing)
circle<-cbind(cx, cy)
mean_circle<-st_sfc(st_multipoint(circle))
tm_shape(countiesCA)+tm_fill("ALAND", title="Sq. M Area", style = "quantile", n=8, palette = "Blues")+tm_borders()+tm_compass(position = c("LEFT", "BOTTOM"))+tm_layout(main.title = "CA Counties & Colleges", main.title.size = 1, legend.outside = TRUE, legend.outside.position = "bottom", main.title.position = "LEFT")+tm_scale_bar(position = c("LEFT", "BOTTOM")) + tm_shape(collegesCA) + tm_dots(col = "POPULATION", palette = "Greens", style = "quantile", title = "College Population Size", size = .05) + tm_shape(mcCA) + tm_dots(col='red',size=.35) + tm_shape(mean_circle) + tm_dots(col='red',shape=21)
The spatial mean center is calculated by adding all x values together and dividing that value by the total number of observations in the calculation; the same calculation is done with the y values. The result is an average, or mean, coordinate that represents the center of the data. The mean center of the CA Counties and Colleges data is -119.2307, 35.28371 and tells viewers that the average of the coordinate positions resides in Central California, just above the South Coast region.The location suggests that a large number of the California Colleges reside in the Southern coastal region of the state with the next largest density of colleges occurring in the Bay Area region. These high densities result in the mean center residing between the Bay Area and Southern Coast regions of California while remaining closer to the southern portion due to the higher density of colleges in that region.
State Population and Centroids Data Sets
Data Sources: U.S. States and Territories (National Weather Service & NOAA).
Load the Data Sets
setwd("C:/Users/data")
pop<-read.csv("us_state_popln_history.csv",sep=",",header=T)
centroids<-read.csv("us_state_centroids.csv",sep=",",header=T)
states<-st_read("tl_2017_us_state.shp")
## Reading layer `tl_2017_us_state' from data source
## `C:\Users\data\tl_2017_us_state.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## Geodetic CRS: NAD83
Filter Data Sets
pop2 <- pop[which(pop$state !="Alaska" & pop$state !="Hawaii"),]
states2 <- states[which(states$NAME !="Alaska" & states$NAME !="Hawaii" & states$NAME !="Commonwealth of the Northern Mariana Islands" & states$NAME !="American Samoa" & states$NAME !="Puerto Rico" & states$NAME !="United States Virgin Islands" & states$NAME !="Guam"), ]
Dealing with NA Values
In order to deal with the two NA values present in the Oklahoma data for 1870 and 1880, we are going to set NA values to 0 for the calculation of the weighted mean center as missing values will produce an error.
Merging Population Data and the State Shapefile
stpop <- merge(states2, pop2, by.x="NAME", by.y="state")
## NAME REGION DIVISION STATEFP
## Length:49 Length:49 Length:49 Length:49
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## STATENS GEOID STUSPS LSAD
## Length:49 Length:49 Length:49 Length:49
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## MTFCC FUNCSTAT ALAND AWATER
## Length:49 Length:49 Min. :1.584e+08 Min. :1.868e+07
## Class :character Class :character 1st Qu.:9.279e+10 1st Qu.:1.540e+09
## Mode :character Mode :character Median :1.389e+11 Median :3.543e+09
## Mean :1.562e+11 Mean :8.744e+09
## 3rd Qu.:2.062e+11 3rd Qu.:8.532e+09
## Max. :6.766e+11 Max. :1.040e+11
## INTPTLAT INTPTLON pop1870 pop1880
## Length:49 Length:49 Min. : 0 Min. : 0
## Class :character Class :character 1st Qu.: 91874 1st Qu.: 174768
## Mode :character Mode :character Median : 537454 Median : 802525
## Mean : 786915 Mean :1023587
## 3rd Qu.:1184059 3rd Qu.:1542180
## Max. :4382759 Max. :5082871
## pop1890 pop1900 pop1910 pop1930
## Min. : 47355 Min. : 42335 Min. : 81875 Min. : 91058
## 1st Qu.: 332422 1st Qu.: 411588 1st Qu.: 577056 1st Qu.: 687497
## Median :1118588 Median :1311564 Median :1574449 Median : 1854482
## Mean :1284647 Mean :1550910 Mean :1876985 Mean : 2505613
## 3rd Qu.:1767518 3rd Qu.:2069042 3rd Qu.:2333860 3rd Qu.: 2939006
## Max. :6003174 Max. :7268894 Max. :9113614 Max. :12588066
## pop1940 pop1950 pop1960 pop1970
## Min. : 110247 Min. : 160083 Min. : 285278 Min. : 332416
## 1st Qu.: 663091 1st Qu.: 791896 1st Qu.: 951023 1st Qu.: 1016000
## Median : 1901974 Median : 2233351 Median : 2535234 Median : 2824376
## Mean : 2687128 Mean : 3075456 Mean : 3642127 Mean : 4125367
## 3rd Qu.: 3137587 3rd Qu.: 3444578 3rd Qu.: 4319813 3rd Qu.: 4676501
## Max. :13479142 Max. :14830192 Max. :16782304 Max. :19953134
## pop1980 pop1990 pop2000 pop2010
## Min. : 469557 Min. : 453588 Min. : 493782 Min. : 563626
## 1st Qu.: 1302894 1st Qu.: 1515069 1st Qu.: 1808344 1st Qu.: 1852994
## Median : 3107576 Median : 3486703 Median : 4041769 Median : 4533372
## Mean : 4595495 Mean : 5041869 Mean : 5705784 Mean : 6258674
## 3rd Qu.: 5463105 3rd Qu.: 6016425 3rd Qu.: 6349097 3rd Qu.: 6724540
## Max. :23667902 Max. :29760021 Max. :33871648 Max. :37253956
## pop2020 lon lat geometry
## Min. : 576851 Min. :-120.56 Min. :28.63 MULTIPOLYGON :49
## 1st Qu.: 1961504 1st Qu.:-100.23 1st Qu.:35.86 epsg:4269 : 0
## Median : 4657757 Median : -89.20 Median :39.33 +proj=long...: 0
## Mean : 6719604 Mean : -90.99 Mean :39.50
## 3rd Qu.: 7705281 3rd Qu.: -78.85 3rd Qu.:43.00
## Max. :39538223 Max. : -69.24 Max. :47.45
Weighted Mean Center
x<-stpop2$lon
y<-stpop2$lat
wmx_1870<-weighted.mean(x,stpop2$pop1870)
wmy_1870<-weighted.mean(y,stpop2$pop1870)
wmx_1880<-weighted.mean(x,stpop2$pop1880)
wmy_1880<-weighted.mean(y,stpop2$pop1880)
wmx_1890<-weighted.mean(x,stpop2$pop1890)
wmy_1890<-weighted.mean(y,stpop2$pop1890)
wmx_1900<-weighted.mean(x,stpop2$pop1900)
wmy_1900<-weighted.mean(y,stpop2$pop1900)
wmx_1910<-weighted.mean(x,stpop2$pop1910)
wmy_1910<-weighted.mean(y,stpop2$pop1910)
wmx_1930<-weighted.mean(x,stpop2$pop1930)
wmy_1930<-weighted.mean(y,stpop2$pop1930)
wmx_1940<-weighted.mean(x,stpop2$pop1940)
wmy_1940<-weighted.mean(y,stpop2$pop1940)
wmx_1950<-weighted.mean(x,stpop2$pop1950)
wmy_1950<-weighted.mean(y,stpop2$pop1950)
wmx_1960<-weighted.mean(x,stpop2$pop1960)
wmy_1960<-weighted.mean(y,stpop2$pop1960)
wmx_1970<-weighted.mean(x,stpop2$pop1970)
wmy_1970<-weighted.mean(y,stpop2$pop1970)
wmx_1980<-weighted.mean(x,stpop2$pop1980)
wmy_1980<-weighted.mean(y,stpop2$pop1980)
wmx_1990<-weighted.mean(x,stpop2$pop1990)
wmy_1990<-weighted.mean(y,stpop2$pop1990)
wmx_2000<-weighted.mean(x,stpop2$pop2000)
wmy_2000<-weighted.mean(y,stpop2$pop2000)
wmx_2010<-weighted.mean(x,stpop2$pop2010)
wmy_2010<-weighted.mean(y,stpop2$pop2010)
wmx_2020<-weighted.mean(x,stpop2$pop2020)
wmy_2020<-weighted.mean(y,stpop2$pop2020)
wm_x <- c(wmx_1870, wmx_1880, wmx_1890, wmx_1900, wmx_1910, wmx_1930,wmx_1940, wmx_1950, wmx_1960, wmx_1970, wmx_1980, wmx_1990, wmx_2000, wmx_2010, wmx_2020)
wm_y<-c(wmx_1870, wmx_1880, wmx_1890, wmx_1900, wmx_1910, wmx_1930,wmx_1940, wmx_1950, wmx_1960, wmx_1970, wmx_1980, wmx_1990, wmx_2000, wmx_2010, wmx_2020)
wm_xy<-cbind(wm_x,wm_y)
df_wmxy<-data.frame(wm_xy)
uspoints<-st_as_sf(x=df_wmxy,coords=c("wm_x","wm_y"),crs=4326)
# uspoints2<- st_transform(uspoints, crs = "+proj=longlat +datum=NAD83") # tried this but it did not help
years<-c("1870", "1880", "1890", "1900", "1910", "1930", "1940", "1950", "1960", "1970", "1980", "1990", "2000", "2010", "2020")
uspointsYears<-cbind(uspoints,years)
Mapping Weighted Means
tm_shape(stpop2)+tm_polygons()+
tm_shape(uspointsYears)+tm_dots()
stpop2 <- merge(stpop, centroids, by.x="NAME", by.y="state")
stpop2[is.na(stpop2)] = 0
summary(stpop2)