Data Critique

What information is included in our dataset?

Our dataset contains counts of patients by demographics and diagnoses from hospital facilities across California. We combined these counts with annual county-level air quality data drawn from the United States Environmental Protection Agency's pre-generated data files. For each hospital, the dataset gives summary counts of patients by age group, race, insurance type, and chief complaint, along with the hospital's address. The air quality variables include each county's annual median air quality index (AQI) and its average PM2.5, ozone, and NO2 levels.
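As a rough illustration of how hospital counts can be combined with county-level air quality data, here is a minimal pandas sketch. The column names (facility, county, year, respiratory_patients, median_aqi) and all values are made up for the example and are not taken from our actual spreadsheet; the real files may be keyed differently.

```python
import pandas as pd

# Hypothetical hospital-level summary counts (names and values are assumptions).
hospitals = pd.DataFrame({
    "facility": ["Hospital A", "Hospital B", "Hospital C"],
    "county": ["Alameda", "Fresno", "Fresno"],
    "year": [2020, 2020, 2021],
    "respiratory_patients": [120, 340, 310],
})

# Hypothetical county-level annual air quality values.
air_quality = pd.DataFrame({
    "county": ["Alameda", "Fresno"],
    "year": [2020, 2020],
    "median_aqi": [45, 78],
})

# A left join keeps every hospital row; indicator=True adds a "_merge"
# column that flags hospital rows with no matching county-year AQI record.
merged = hospitals.merge(
    air_quality, on=["county", "year"], how="left", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking the unmatched rows is one way to see which hospitals fall in county-years with no air quality record, rather than silently dropping them.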

What can our dataset reveal?

The key element of our dataset is the relationship between patient demographics, chief complaint, and air pollution. Because the data is recorded year by year, we can spot long-term trends among these three factors that become much more apparent when they are compiled together. Moreover, the dataset is population-based, which adds stability to our measures and the relationships between them. The holistic nature of our dataset allows us to approach the question of air quality and health equity from a humanistic perspective, providing insight into the disproportionate impacts of air pollution across California counties.

Although our dataset conveys a relationship between patient hospital visit data and air quality, there are gaps in the foundational information.

Unfortunately, the dataset offers no clear explanation of the environmental or societal causes behind the AQI levels. Confounding variables that affect air quality are likely missing from the dataset and should be acknowledged. The dataset also cannot establish what causes respiratory or cardiovascular issues: it is unclear whether these illnesses are caused directly by air pollution or by factors unrelated to AQI. Additionally, values representing the proportion of respiratory intakes within each demographic group would benefit the overall narrative of the project. Our current dataset includes only the number of patients per hospital by demographic and the number of respiratory patients. Overall, our dataset lacks columns that speak directly to causation and could more accurately support the narrative of our project.
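To make this gap concrete, the sketch below shows the proportions that can be computed from hospital-level summary counts of the kind described above. The column names and numbers are hypothetical. Note that only the marginal shares are recoverable: each age group's share of all patients, and the respiratory share of all patients. The proportion of respiratory intakes within each demographic group cannot be derived from these counts alone.

```python
import pandas as pd

# Hypothetical summary counts per hospital (names and values are assumptions).
counts = pd.DataFrame({
    "facility": ["Hospital A", "Hospital B"],
    "age_under_18": [200, 100],
    "age_18_64": [500, 300],
    "age_65_plus": [300, 100],
    "respiratory_patients": [150, 50],
})

age_cols = ["age_under_18", "age_18_64", "age_65_plus"]

# Total patients per hospital, assuming the age groups cover every patient.
total = counts[age_cols].sum(axis=1)

# Share of each age group among all patients at the hospital.
age_share = counts[age_cols].div(total, axis=0)

# Share of all patients who were respiratory intakes.
counts["respiratory_share"] = counts["respiratory_patients"] / total
```

With these made-up numbers, Hospital A has 1,000 patients, 20% of whom are under 18 and 15% of whom are respiratory intakes, but nothing in the table says how many of the respiratory intakes were under 18.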

How was the data generated?

The hospital intake data was drawn from two databases maintained by California's Department of Health Care Access and Information: the ED Treat-and-Release Database and the Inpatient Database. It covers 2005 to 2023 and comes from all hospitals across California, which reported their data to these databases. The air quality data, by contrast, was generated by monitors across California that record various air quality values, including PM2.5, PM10, and AQI, among many others. These monitors send their readings to the EPA, which compiles the data in one location.

The original source for the patient and hospital data is the Department of Health Care Access and Information (formerly known as the Office of Statewide Health Planning and Development). More specifically, the data comes from its Healthcare Analytics Branch. For the air quality data, the original source is the US Environmental Protection Agency (EPA) Air Data.

The healthcare dataset was funded by the Department of Health Care Access and Information, which itself receives both federal funding from Congress and state funding. Funding for the air quality dataset also comes from Congress, through bills that allocate money to the EPA.

What information is left out of the spreadsheet?

Although this database includes significant information supporting our overall mission of investigating the cause-and-effect relationship between air pollution and health, many pieces of information were left out. Much of it would help explain patient background, which is important for validating the observed outcomes against confounding variables.

Socioeconomic class (SEC) was not recorded for each subject, which introduces an inherent bias tied to location and to county attributes such as population, density, and urbanization. Those of higher SEC may be able to afford higher-quality patient care or access a greater scope of care. Second, a complete patient history for each subject was unavailable, which introduces further bias in the apparent severity of results because prior conditions and behaviors such as smoking, COPD, or asthma are not included. As a result, conclusions drawn from the dataset may not be entirely accurate.

Conclusion

In conclusion, the ideological effects of the data’s division allow for the creation of a humanist narrative. The data paints a picture of how air quality has changed over time by showing yearly PM2.5-based AQI ratings and other air quality measurements for each California county, alongside yearly patient data and hospitalizations for cardiovascular and respiratory intakes in the same areas. Although these are seemingly objective measurements of their respective categories, they can inherently marginalize some groups. For example, the majority of the AQI data in the northeast corner of California was modeled, meaning that there was not proper funding or facilities to record air quality on a yearly timescale. This area is predominantly rural and less wealthy, and the dataset can unjustly affect how we view its situation. Moreover, if this dataset were the only source of information, there would be no data about the causes of poor air quality, which are essential to understanding how it can hurt people’s health. Ultimately, this humanistic search for a connection between air quality and people’s health is shaped by the dataset’s own ontology, which fundamentally affects the way we analyze its metadata and information.

The Air Divide