| Title: | Tools for Statistical Inference with Geo-Coded Data |
|---|---|
| Description: | Fast computation of Conley (1999) <doi:10.1016/S0304-4076(98)00084-0> spatial heteroskedasticity and autocorrelation consistent (HAC) standard errors for linear regression models with geo-coded data, with a fast C++ implementation by Christensen, Hartman, and Samii (2021) <doi:10.1017/S0020818321000187>. Performance-critical distance calculations, kernel weighting, and variance component accumulation are implemented in C++ via 'Rcpp' and 'RcppArmadillo'. Includes tools for estimating the spatial correlation range from covariograms and correlograms following the bandwidth selection method proposed in Lehner (2026) <doi:10.48550/arXiv.2603.03997>, and diagnostic visualizations for bandwidth selection. |
| Authors: | Alexander Lehner [aut, cre] (ORCID: <https://orcid.org/0000-0001-5885-5966>) |
| Maintainer: | Alexander Lehner <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.0 |
| Built: | 2026-05-25 07:32:19 UTC |
| Source: | https://github.com/axlehner/spatialinference |
Spatial Variance Component for Balanced Panel
Bal_XeeXhC(dmat, X, e, n1, k)Bal_XeeXhC(dmat, X, e, n1, k)
dmat |
pre-computed distance matrix (n x n) |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
A k x k matrix representing the spatial X'ee'X component.
Convenience wrapper around conley_SE() that returns only the spatial
Conley standard error for the first regressor. Useful for quick extraction
of Conley SEs in scripting contexts.
compute_conley_lfe(lfeobj, cutoff, kernel_choice = "bartlett", ...)compute_conley_lfe(lfeobj, cutoff, kernel_choice = "bartlett", ...)
lfeobj |
A regression object of class |
cutoff |
Numeric. The spatial bandwidth (cutoff distance in km). |
kernel_choice |
Character string specifying the kernel function.
Default is |
... |
Additional arguments passed to |
A single numeric value: the spatial Conley standard error for the first (or only) regressor.
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE)) { reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon, data = US_counties_centroids, keepCX = TRUE) compute_conley_lfe(reg, cutoff = 500) }data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE)) { reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon, data = US_counties_centroids, keepCX = TRUE) compute_conley_lfe(reg, cutoff = 500) }
Computes Conley (1999) spatial HAC (Heteroskedasticity and Autocorrelation
Consistent) variance-covariance matrices for regression models estimated
with lfe::felm(). Supports cross-sectional spatial correlation,
serial (temporal) correlation, and the combined spatial HAC estimator.
Multiple kernel functions and distance metrics are available.
conley_SE( reg, unit, time, lat, lon, kernel = "bartlett", dist_fn = "Haversine", dist_cutoff = 500, lag_cutoff = 5, lat_scale = 111, verbose = FALSE, cores = 1, balanced_pnl = FALSE )conley_SE( reg, unit, time, lat, lon, kernel = "bartlett", dist_fn = "Haversine", dist_cutoff = 500, lag_cutoff = 5, lat_scale = 111, verbose = FALSE, cores = 1, balanced_pnl = FALSE )
reg |
A regression object of class |
unit |
Character string naming the cross-sectional unit identifier
variable in the |
time |
Character string naming the time period identifier variable
in the |
lat |
Character string naming the latitude variable in the |
lon |
Character string naming the longitude variable in the |
kernel |
Character string specifying the kernel function for spatial
weighting. One of |
dist_fn |
Character string specifying the distance function.
|
dist_cutoff |
Numeric. The spatial bandwidth (cutoff distance in km)
beyond which observations receive zero weight. Default is |
lag_cutoff |
Numeric. The temporal bandwidth (number of time periods)
for serial correlation correction. Default is |
lat_scale |
Numeric. Scaling factor for latitude (km per degree).
Default is |
verbose |
Logical. If |
cores |
Integer. Number of CPU cores for parallel computation via
|
balanced_pnl |
Logical. If |
A named list of three variance-covariance matrices, each of dimension k x k where k is the number of regressors:
The standard OLS variance-covariance matrix from the felm object.
The spatial-only Conley VCV, correcting for cross-sectional spatial correlation.
The full spatial HAC VCV, correcting for both spatial and serial correlation.
Christensen, D., Hartman, A. C. and Samii, C. (2021). Legibility and external investment: An institutional natural experiment in Liberia. International Organization, 75(4), 1087–1108. doi:10.1017/S0020818321000187
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708. doi:10.2307/1913610
data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE)) { reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon, data = US_counties_centroids, keepCX = TRUE) vcvs <- conley_SE(reg, unit = "unit", time = "year", lat = "lat", lon = "lon", kernel = "bartlett", dist_cutoff = 500) # Spatial standard errors: sqrt(diag(vcvs$Spatial)) }data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE)) { reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon, data = US_counties_centroids, keepCX = TRUE) vcvs <- conley_SE(reg, unit = "unit", time = "year", lat = "lat", lon = "lon", kernel = "bartlett", dist_cutoff = 500) # Spatial standard errors: sqrt(diag(vcvs$Spatial)) }
Extracts point coordinates from an sf or sfc object and returns
them as a tibble. This is a lightweight alternative to
sf::st_coordinates() that returns a tibble directly.
coords_as_columns(x, names = c("x", "y"))coords_as_columns(x, names = c("x", "y"))
x |
An |
names |
Character vector of length 2 specifying the column names
for the x and y coordinates. Default is |
A tibble::tibble() with two columns named according to the
names argument, containing the x and y coordinates.
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. doi:10.32614/RJ-2018-009
data(US_counties_centroids) coords <- coords_as_columns(US_counties_centroids) head(coords)data(US_counties_centroids) coords <- coords_as_columns(US_counties_centroids) head(coords)
Estimates a covariogram from an sf data frame (either from a single
variable or regression residuals) using gstat::variogram() with
covariogram = TRUE, extracts the zero-crossing as the estimated
correlation range via extract_corr_range(), and produces a diagnostic
plot.
covgm_range( df.input, depvar = "noise1", indepvar = "noise2", maxdist = NA, spacing = NA, single.variable = FALSE )covgm_range( df.input, depvar = "noise1", indepvar = "noise2", maxdist = NA, spacing = NA, single.variable = FALSE )
df.input |
An |
depvar |
Character string naming the dependent variable. Default
is |
indepvar |
Character string naming the independent variable (used
when |
maxdist |
Numeric. Maximum distance for the covariogram (in metres).
Default is |
spacing |
Numeric. Bin width for the covariogram (in metres). Default
is |
single.variable |
Logical. If |
A ggplot2::ggplot() object showing the covariogram with the
estimated correlation range marked by a vertical red line and annotated
text.
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7), 683–691. doi:10.1016/j.cageo.2004.03.012
data(US_counties_centroids) if (requireNamespace("fixest", quietly = TRUE) && requireNamespace("gstat", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) { covgm_range(US_counties_centroids) }data(US_counties_centroids) if (requireNamespace("fixest", quietly = TRUE) && requireNamespace("gstat", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) { covgm_range(US_counties_centroids) }
Create Distance Matrix
DistMat(M, cutoff, kernel = "bartlett", dist_fn = "Haversine")DistMat(M, cutoff, kernel = "bartlett", dist_fn = "Haversine")
M |
a matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
A symmetric n x n matrix of kernel weights with 1s on the diagonal.
Identifies the distance at which spatial autocorrelation first crosses
zero, providing an estimate of the spatial correlation range. Works with
correlograms from ncf::correlog() and covariograms from
gstat::variogram() (with covariogram = TRUE).
extract_corr_range(input, returnzeroifNA = FALSE)extract_corr_range(input, returnzeroifNA = FALSE)
input |
A correlogram object from |
returnzeroifNA |
Logical. If |
For correlograms, the function detects the first sign change in the rounded and floored correlation values. For covariograms, it finds the first index where gamma transitions from positive to non-positive. In both cases, the estimated range is the midpoint between the last positive and first non-positive distance bins.
A numeric value representing the estimated correlation range. For covariograms, the unit is km (distance in metres divided by 1000). For correlograms, the unit matches the input distance unit.
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7), 683–691. doi:10.1016/j.cageo.2004.03.012
# With a mock gstatVariogram: mock_vgm <- data.frame( np = rep(100, 10), dist = seq(50000, 500000, by = 50000), gamma = c(5, 3, 2, 1, 0.5, -0.2, -0.5, -0.3, -0.1, -0.05), dir.hor = 0, dir.ver = 0, id = "var1" ) class(mock_vgm) <- c("gstatVariogram", "data.frame") extract_corr_range(mock_vgm)# With a mock gstatVariogram: mock_vgm <- data.frame( np = rep(100, 10), dist = seq(50000, 500000, by = 50000), gamma = c(5, 3, 2, 1, 0.5, -0.2, -0.5, -0.3, -0.1, -0.05), dir.hor = 0, dir.ver = 0, id = "var1" ) class(mock_vgm) <- c("gstatVariogram", "data.frame") extract_corr_range(mock_vgm)
Computes the mean centre of an sf data frame, optionally weighted by
an attribute variable. The function first extracts polygon or point
centroids using sf::st_centroid(), then calculates the (weighted)
arithmetic mean of the x and y coordinates. This is the two-dimensional
analogue of a weighted mean and corresponds to the "centre of gravity"
or "mean centre" in spatial statistics. A common use case is computing
the population-weighted centroid of a set of administrative units.
gravity_centroid(df.sf, weight = NA)gravity_centroid(df.sf, weight = NA)
df.sf |
An |
weight |
Numeric vector of weights with length equal to |
An sfc_POINT object (CRS 4326 / WGS84) representing the
(weighted) mean centre as a single point.
Arlinghaus, S. L. (1994). Practical Handbook of Spatial Statistics. CRC Press.
data(US_counties_centroids) # Unweighted mean centre (geographic centroid of all county centroids) gravity_centroid(US_counties_centroids) # Weighted mean centre (shifted toward areas with higher noise1 values) gravity_centroid(US_counties_centroids, weight = abs(US_counties_centroids$noise1) + 1)data(US_counties_centroids) # Unweighted mean centre (geographic centroid of all county centroids) gravity_centroid(US_counties_centroids) # Weighted mean centre (shifted toward areas with higher noise1 values) gravity_centroid(US_counties_centroids, weight = abs(US_counties_centroids$noise1) + 1)
Overlays a regular rectangular grid on an sf data frame, intersects each
observation with the grid, and returns a factor variable identifying the
grid cell to which each observation belongs. This is useful for constructing
spatial fixed effects in regression models: by including grid_FE() as a
factor variable, the regression absorbs location-specific variation at the
resolution of the chosen grid. Finer grids absorb more spatial variation
but consume more degrees of freedom.
grid_FE(df.sf, size, distance = FALSE)grid_FE(df.sf, size, distance = FALSE)
df.sf |
An |
size |
Numeric. When |
distance |
Logical. If |
The grid is constructed via sf::st_make_grid() and observations are
assigned to cells via sf::st_intersection(). Observations that fall
outside the grid (e.g., in coastal regions where cells do not cover the
full bounding box) are dropped during intersection.
A factor vector with one element per observation that survived the intersection (see Details), where each level is a grid cell ID.
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. doi:10.32614/RJ-2018-009
data(US_counties_centroids) # Create a 5 x 5 grid of spatial fixed effects grid_ids <- grid_FE(US_counties_centroids, size = 5) table(grid_ids) # Finer grid (10 x 10) grid_ids_fine <- grid_FE(US_counties_centroids, size = 10) length(levels(grid_ids_fine))data(US_counties_centroids) # Create a 5 x 5 grid of spatial fixed effects grid_ids <- grid_FE(US_counties_centroids, size = 5) table(grid_ids) # Finer grid (10 x 10) grid_ids_fine <- grid_FE(US_counties_centroids, size = 10) length(levels(grid_ids_fine))
Visualizes the relationship between the spatial bandwidth (cutoff distance) and the resulting Conley standard error. This typically reveals an inverse-U shaped relationship, helping to identify the appropriate bandwidth for spatial HAC estimation.
inverseu_plot_conleyrange( df.input, cutoffrange = NA, kernel_choice_conley = "epanechnikov", depvar = "noise1", indepvar = "noise2", range_add = FALSE, ... )inverseu_plot_conleyrange( df.input, cutoffrange = NA, kernel_choice_conley = "epanechnikov", depvar = "noise1", indepvar = "noise2", range_add = FALSE, ... )
df.input |
An |
cutoffrange |
Numeric vector of cutoff distances (in km) to evaluate. |
kernel_choice_conley |
Character string specifying the kernel function.
Default is |
depvar |
Character string naming the dependent variable column.
Default is |
indepvar |
Character string naming the independent variable column.
Default is |
range_add |
Logical. If |
... |
Additional arguments (currently unused). |
A ggplot2::ggplot() object showing Conley SE (y-axis) against
cutoff distance (x-axis), with the HC1 standard error as a grey
dashed horizontal reference line.
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE) && requireNamespace("fixest", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) { inverseu_plot_conleyrange(US_counties_centroids, cutoffrange = seq(100, 1000, by = 200)) }data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE) && requireNamespace("fixest", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) { inverseu_plot_conleyrange(US_counties_centroids, cutoffrange = seq(100, 1000, by = 200)) }
Estimates a linear regression via lfe::felm() and augments the output
with Moran's I tests for spatial autocorrelation, optional correlograms
for range estimation, and Conley spatial HAC standard errors. The returned
object has class "custom" prepended, enabling display via
modelsummary::modelsummary() with custom tidy and glance methods.
lm_sac( formula.chr, data.sf, knn_number = 20, conley_cutoff = 5, conley_kernel = "bartlett", correlograms = FALSE, ... )lm_sac( formula.chr, data.sf, knn_number = 20, conley_cutoff = 5, conley_kernel = "bartlett", correlograms = FALSE, ... )
formula.chr |
Character string specifying the regression formula in
|
data.sf |
An |
knn_number |
Integer. Number of nearest neighbours for the spatial
weights matrix used in Moran's I tests. Default is |
conley_cutoff |
Numeric. Spatial bandwidth (cutoff distance in km)
for the Conley standard error. Default is |
conley_kernel |
Character string specifying the kernel function.
Default is |
correlograms |
Logical. If |
... |
Additional arguments passed to |
An object of class c("custom", "lm") with additional components:
Character, the spatial fixed effect variable name.
Moran's I test statistic on the OLS residuals,
or NA if the test failed.
Moran's I test statistic on the response
variable, or NA if the test failed.
Estimated correlation range from the
residual correlogram (km), or NA if correlograms = FALSE.
Estimated correlation range from the
response correlogram (km), or NA if correlograms = FALSE.
Numeric vector of Conley spatial standard errors (with 0 for intercept and higher-order FE coefficients).
Conley SEs using the correlogram-based cutoff,
or NA if correlograms = FALSE.
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17–23. doi:10.2307/2332142
Bivand, R. S., Pebesma, E. and Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R. 2nd ed. Springer.
data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE) && requireNamespace("spdep", quietly = TRUE) && requireNamespace("stringr", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE) && requireNamespace("sandwich", quietly = TRUE)) { out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon", US_counties_centroids, conley_cutoff = 500) out$conley_SE }data(US_counties_centroids) if (requireNamespace("lfe", quietly = TRUE) && requireNamespace("spdep", quietly = TRUE) && requireNamespace("stringr", quietly = TRUE) && requireNamespace("dplyr", quietly = TRUE) && requireNamespace("sandwich", quietly = TRUE)) { out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon", US_counties_centroids, conley_cutoff = 500) out$conley_SE }
Temporal Variance Component (Time Dimension)
TimeDist(times, cutoff, X, e, n1, k)TimeDist(times, cutoff, X, e, n1, k)
times |
numeric vector of time period identifiers |
cutoff |
the lag cutoff (number of time periods) |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
A k x k matrix representing the temporal X'ee'X component.
An sf data frame containing the centroids of all 3,108 counties of the
contiguous United States (2017 geographies), along with synthetic
spatially-correlated noise variables for use in examples and vignettes.
US_counties_centroidsUS_counties_centroids
An sf data frame with 3,108 rows and the following columns:
Numeric state FIPS code.
County name.
County name with legal/statistical area description.
Unique ID for joining with IPUMS data (2017 geographies).
Latitude of the county centroid (WGS84).
Longitude of the county centroid (WGS84).
Cross-sectional unit identifier (constant 1 for
cross-sectional use).
Time period identifier (constant 1 for cross-sectional
use).
Synthetic spatially-correlated variable (noise 1).
Synthetic spatially-correlated variable (noise 2).
Distance variable (spatially non-stationary example).
Point geometry column (NAD83 / EPSG:4269).
IPUMS NHGIS, University of Minnesota, https://www.nhgis.org.
Spatial Variance Component (X'ee'X)
XeeXhC(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")XeeXhC(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")
M |
matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
A k x k matrix representing the spatial X'ee'X component.
Memory-efficient variant that avoids constructing the full n x n distance matrix.
XeeXhC_Lg(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")XeeXhC_Lg(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")
M |
matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
A k x k matrix representing the spatial X'ee'X component.