Sadly, there are some errors in the data that the measuring stations submit to SINAICA. Internally, the stations and SINAICA are supposed to flag data that is mistaken, but this is not always the case; in addition, not all stations send their data to SINAICA in a reliable way.

First, let's load the packages necessary for the analysis.

## Auto-install required R packages
packs <- c("dplyr", "ggplot2", "lubridate", "rsinaica")
success <- suppressWarnings(sapply(packs, require, character.only = TRUE))
if (length(names(success)[!success])) {
install.packages(names(success)[!success])
sapply(names(success)[!success], require, character.only = TRUE)
}

## Incomplete data

The data is incomplete in the sense that some networks don't report the values of all the pollutants they measure in a timely manner. For example, if you visit the Mexicali air quality website you'll see that they have a section reporting recent PM10 pollution values, but when you query SINAICA for the dates supported by all the Mexicali stations, none support downloading recent PM10 data.

mexicali <- stations_sinaica[stations_sinaica$network_name %in% "Mexicali", 1:5] ## All PM10 station in Mexicali are manual ll <- lapply(mexicali$station_id, function(x) {sinaica_station_dates(x, "Manual")})
names(ll) <- mexicali$station_id ll ##$369
## [1] "1997-01-10" "2010-08-30"
##
## $37 ## [1] NA NA ## ##$38
## [1] NA NA
##
## $371 ## [1] "1997-02-03" "2014-03-14" ## ##$403
## [1] "1997-02-15" "2001-12-27"
##
## $39 ## [1] "1997-05-28" "2015-12-26" ## ##$40
## [1] "2015-01-06" "2015-12-26"
##
## $370 ## [1] "1997-03-23" "2014-08-27" ## ##$402
## [1] "2004-04-15" "2011-03-10"
##
## $41 ## [1] "2011-06-20" "2015-12-26" ## ##$366
## [1] NA NA
##
## $42 ## [1] NA NA ## ##$372
## [1] NA NA

Furthermore SINAICA only has manually collected data available, while the data at the Mexicali air quality website seems to be automatically collected, since it's available hourly. Guess they just haven't gotten around to connecting this new air quality information source to SINAICA.

## Errors in validation

The data reported to SINAICA is supposed to be checked for extreme values and errors, but this is quite often not possible. In addition, when you query data from the SINAICA website it automatically removes all O3 values above 0.2, PM10 above 600, PM2.5 above 175, NO2 above 0.21, SO2 above 0.2, and for CO above 15. But we know that, particularly in Mexico City and Guadalajara, there have been days where the O3 values have been above 0.2 ppm. The functions in this package have a remove_extremes option to mimic this behavior, but by default it is not enabled.

The station CBTIS in Aguascalientes reported pure garbage for ozone data in 2017, even with the option remove_extremes set to TRUE, as can be seen in the chart below. A value higher than 0.155 ppm would be a phase I contingency in Mexico City, and given that ozone production depends on chemical reactions between oxides of nitrogen and volatile organic compounds in the presence of sunlight, it's extremely unlikely to be present in large quantities at night or in the early morning

## Download data for all Aguascalientes stations for a single month
get_month <- function(start_date, end_date, net){
bind_rows(
lapply(stations_sinaica$station_id[stations_sinaica$network_name %in% net],
sinaica_station_data, "O3", start_date, end_date, "Crude",
remove_extremes = TRUE)
)
}
df <- bind_rows(
mapply(get_month,
seq(as.Date("2017-01-01"), as.Date("2017-12-01"), by = "month"),
seq(as.Date("2017-02-01"), as.Date("2018-01-01"), by = "month") - 1,
"Aguascalientes", SIMPLIFY = FALSE)
)

df$datetime <- with_tz(as.POSIXct(paste0(df$date, " ", df\$hour, ":00"),
tz = "Etc/GMT+6"),
tz = "America/Mexico_City")
ggplot(df, aes(datetime, value, group = station_name, color = station_name)) +
geom_line(alpha = .8, size = .3) +
ggtitle("Ozone data reported by the Aguascalientes stations") +
xlab("date") +
ylab("hourly ozone concentration in ppm") +
theme_bw()