
Publicly Available Sources of Data for Health & Social Determinants of Health

Points to consider when looking for data

In using this guide, keep in mind that there are at least seven (and most likely more) issues affecting data access:

  1. Very few of our data sets are linked with other data sets, so it is difficult to establish potential associations, or even test hypotheses of associations, without gathering your own data or smushing data sets together yourself (see the sketch after this list).
  2. Incidence and prevalence data for chronic diseases has not been routinely collected, especially at the county level.
    • Surveillance systems track hospital admissions; longitudinal risk-behavior surveys using a sample of the population also provide some data, especially at the state level. Mortality data by cause is available at the county level.
  3. Speaking of mortality data by cause: deaths from chronic conditions are often not recorded with the chronic condition as the underlying cause of death. Instead, heart failure might be listed when the death actually resulted from diabetes, hypertension, or even advanced dementia. Consider looking at multiple cause of death data to get a better grasp of true mortality from certain chronic conditions.
  4. Government agencies use different metrics when gathering data.
  5. Lack of geographical granularity.
  6. Privacy/HIPAA.
  7. The interface (i.e. look and feel) of each site that links to publicly available data is different. The variables offered depend on the dataset being searched, the views are different, and the feel is often different. You may have to resort to trial and error to get what you want. Look for ways to download the data if that is helpful, then open it in Excel or load it programmatically.
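If Excel becomes cumbersome, a few lines of Python can do the same job, including the "smushing" from point 1. Below is a minimal sketch of joining two downloaded extracts on a shared county identifier; the file names and the county_fips column are hypothetical placeholders, not fields from any particular dataset.

```python
import pandas as pd

# Hypothetical extracts -- substitute whatever files you downloaded.
mortality = pd.read_csv("county_mortality.csv")
income = pd.read_csv("county_median_income.csv")

# Join on a shared geographic identifier. FIPS codes are a more
# reliable key than county names, which agencies spell inconsistently.
merged = mortality.merge(income, on="county_fips", how="inner")

# Counties that failed to match are worth investigating, not ignoring.
unmatched = mortality[~mortality["county_fips"].isin(merged["county_fips"])]
print(f"{len(unmatched)} counties did not match")
```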

Common Data Problems to Watch For

As you view these links to various sources of health data, keep in mind that data does not tell the whole story. There may be a story behind the data. As you will see, data and data gathering are not perfect-- if you see an anomaly or large deviation in the data, find out why. Don't assume.

However, once you look over some of the caveats detailed on this page, realize that there is still comparability amongst the data once you understand the anomalies! And whenever possible, take a look at the technical notes for more insight into the data.

Here are some examples of how data can mislead.

1. Do you know what data is really gathered?

Be certain you understand the rules of data gathering before you try to interpret the data! Below are three examples of how data gathering can mislead. Two come from Texas; the third is from Pennsylvania.

A. Births: Changes to the Texas Birth Certificate

Births for Harris County, by Marital Status

        Married            Not Married        Unknown            All Mothers
Year    Number   Percent   Number   Percent   Number   Percent   Number   Percent
1992    45,851   78.2      12,801   21.8      0        0.0       58,652   100.0
1993    45,873   78.8      12,332   21.2      0        0.0       58,205   100.0
1994    39,130   68.6      17,691   31.0      205      0.4       57,026   100.0
1995    38,208   67.0      18,783   32.9      66       0.1       57,057   100.0

(From Texas Health Data Births to Texas Residents)

In this example, there is a fairly substantial jump in the number of babies born to unmarried mothers in 1994-- over 5,300 more-- and a corresponding decrease-- over 6,700-- in those born to married mothers. What happened? It turns out that the shift was not in behavior per se, but in what data was gathered by the state of Texas. Prior to 1994, birth certificates did not indicate marital status; when compiling the statistics, the mother was assumed to be married if the father's name was listed. Beginning in 1994, when marital status was specifically asked about, a truer picture emerged. (Thanks go to Dr. Bill Spears for ferreting that out and sharing it with me.)
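A quick way to catch a break like this before interpreting it is to tabulate year-over-year changes and flag unusually large ones. Here is a minimal sketch using the percentages hard-coded from the table above:

```python
import pandas as pd

births = pd.DataFrame({
    "year": [1992, 1993, 1994, 1995],
    "pct_not_married": [21.8, 21.2, 31.0, 32.9],
})

# A year-over-year difference makes the 1994 discontinuity obvious.
births["yoy_change"] = births["pct_not_married"].diff()
print(births)
# The ~10-point jump in 1994 flags something worth investigating --
# here, a change in data collection, not a change in behavior.
```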

B. Deaths: Changes to the Texas Death Certificate

Death Rate (per 100,000) in Texas
Cause of Death: Diabetes mellitus (ICD 250)

Year    Rate    Deaths
1987    12.8    2,127
1988    12.3    2,056
1989    17.1    2,872
1990    20.4    3,458
1991    20.7    3,593

(From Texas VitalWeb)

In this example, we see a fairly substantial jump in the numbers from 1988 to 1989, then another jump from 1989 to 1990. Were there really that many more deaths from diabetes? Most likely not. If anything, deaths from diabetes were probably underreported prior to 1989. However, the Texas death certificate was changed in 1989 to include an example on the back, with diabetes used in the example. Immediately, the reported death rate for diabetes increased. Unfortunately for Texas, deaths due to diabetes continued to rise. Nearly one third of the deaths in Texas in 2002 were attributed to Diabetes Mellitus (Texas Health Data Deaths of Texas Residents), although some of the increase could have been a result of the change from ICD-9 to ICD-10. In the case of diabetes, deaths attributed to diabetes rose only slightly under the new coding (less than 1%, or a comparability ratio of 1.0082). (See below for more discussion of ICD-9 vs. ICD-10.) (Thanks go to Daniel Goldman for ferreting out the death certificate change and sharing it with me.)
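A comparability ratio is applied by multiplying a death count coded under the old revision by the ratio, which estimates what the count would have been under the new revision. A quick sketch using the 1.0082 ratio mentioned above and the 1991 count from the table, purely to illustrate the arithmetic:

```python
# Comparability ratio for diabetes, ICD-9 -> ICD-10 (Anderson et al., 2001)
comparability_ratio = 1.0082

icd9_deaths = 3_593  # the 1991 count from the table above
icd10_equivalent = icd9_deaths * comparability_ratio
print(f"ICD-10-equivalent count: {icd10_equivalent:.0f}")  # ~3,622

# A ratio this close to 1.0 means almost none of an observed increase
# can be blamed on the coding change; ratios far from 1.0 demand caution.
```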

C. Infectious Diseases: Changes to the list of notifiable diseases

In 1994 there were 59 infectious diseases notifiable at the national level. In 2010, there were 100. Not knowing when an infectious disease becomes notifiable can lead to a misinterpretation of the data. Looking at chlamydia, for example, we see the following data for all of Pennsylvania.

Chlamydia in Pennsylvania

                    1991     1995      2000      2010
Count               4,275    22,961    26,475    47,518
Rate per 100,000    35.68    188.23    215.49    374.09

(From: US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for HIV, STD and TB Prevention (NCHSTP), Division of STD/HIV Prevention, Sexually Transmitted Disease Morbidity 1984-2014, CDC WONDER Online Database. Accessed at http://wonder.cdc.gov/std-sex.html on Jul 15, 2019 12:57:27 PM )

There is what appears to be a chlamydia epidemic in Pennsylvania. The number of chlamydia cases jumped roughly 6-fold between 1991 and 2000. But wait-- when did chlamydia become a notifiable disease? Based on the data, we might guess that it was 1995, as that is when we see a large increase in both count and rate. Indeed, prior to 1995 chlamydia was only voluntarily reported; it became a notifiable disease in 1995. Learn more about data reporting for chlamydia.
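A rate per 100,000 is just the count divided by the population at risk, times 100,000, so counts and rates can be cross-checked against each other. As a sanity check, the 2010 Pennsylvania figures from the table above imply a plausible state population:

```python
count_2010 = 47_518
rate_2010 = 374.09  # cases per 100,000

# rate = count / population * 100_000, so solving for population:
implied_population = count_2010 / rate_2010 * 100_000
print(f"Implied PA population: {implied_population:,.0f}")  # ~12.7 million
```

If a count and rate do not reconcile this way, the denominator (or the footnotes about it) deserves a closer look.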

The CDC lists other concerns when interpreting data:

"Incidence data in the Summary are presented by the date of report to CDC as determined by the MMWR week and year assigned by the state or territorial health department.....Thus, surveillance data reported by other CDC programs may vary from data reported in the Summary because of differences in 1) the date used to aggregate data (e.g., date of report, date of disease occurrence), 2) the timing of reports, 3) the source of the data, 4) surveillance case definitions, and 5) policies regarding case jurisdiction (i.e., which state should report the case to CDC).

The data reported in the Summary are useful for analyzing disease trends and determining relative disease burdens. However, these data must be interpreted in light of reporting practices. Some diseases that cause severe clinical illness (e.g., plague and rabies) are most likely reported accurately if they were diagnosed by a clinician. However, persons who have diseases that are clinically mild and infrequently associated with serious consequences (e.g., salmonellosis) might not seek medical care from a health-care provider. Even if these less severe diseases are diagnosed, they are less likely to be reported.

The degree of completeness of data reporting also is influenced by the diagnostic facilities available; the control measures in effect; public awareness of a specific disease; and interests, resources, and priorities of state and local officials responsible for disease control and public health surveillance. Finally, factors such as changes in the case definitions for public health surveillance, introduction of new diagnostic tests, or discovery of new disease entities can cause changes in disease reporting that are independent of the true incidence of disease."

Take a look at a list of notifiable diseases, as well as statistics from the MMWR Summary of Notifiable Diseases for 1993 through 2015 and the NNDSS Data and Statistics system from 2016 forward. The most current list is available through the National Notifiable Diseases Surveillance System.

2. How have standards changed in the reporting or collection of data?

A. Has an age-adjustment been made on the data? If so, which standard was used?

Age adjustments are used to compare two populations during the same time period, or the same population during different time periods; they eliminate observed differences in the populations that are age-related. There are four common standards, the most current being the 2000 standard. Other standards include the 1980 standard (not as common), the 1970 standard (common), and the 1940 standard (common). In order to get a valid comparison, you must use the same standard. The TX Dept of State Health Services has a nice description and example of age adjustment.
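Mechanically, a directly age-adjusted rate is just a weighted average of age-specific rates, with the weights taken from the standard population. The sketch below shows the arithmetic; the rates and weights are made-up illustrations, not official 2000-standard values.

```python
# Direct age adjustment: weighted average of age-specific rates,
# with weights from a standard population. All numbers are illustrative.
age_specific_rates = [2.1, 15.4, 120.7]  # deaths per 100,000, by age band
standard_weights = [0.30, 0.50, 0.20]    # standard-population shares; sum to 1

age_adjusted = sum(r * w for r, w in zip(age_specific_rates, standard_weights))
print(f"Age-adjusted rate: {age_adjusted:.1f} per 100,000")  # 32.5

# Two rates adjusted to the SAME standard are comparable; a rate adjusted
# to the 1940 standard cannot be compared with one adjusted to the 2000 standard.
```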

B. Which international classification of disease (ICD) revision was used to report the data?

ICD-10 has been used to classify U.S. mortality data since 1999; ICD-9 was in use from 1979 through 1998. This international classification provides a means of comparison between the US and other countries. ICD-10 is more detailed than ICD-9 and utilizes an alphanumeric system; ICD-9 was a numeric-only system. For the purpose of comparison, see Anderson, RN, et al. (2001). Comparability of Cause of Death Between ICD-9 and ICD-10: Preliminary Estimates. National Vital Statistics Reports, 49(2). The World Health Organization has posted ICD-10 codes online (2019 edition).

C. Is the mortality data measuring underlying cause of death or multiple causes of death?

Some Healthy People data (specifically for diabetes) are reported using multiple causes of death; mortality data from many of the sources on these pages show the underlying cause only. Be certain you know which you are looking at so you aren't misled by apparently conflicting data.

D. Infectious diseases: Changes in case definitions

"Surveillance case definition" is defined by the CDC as "a set of uniform criteria used to define a disease for public health surveillance." Case definitions guide the numbers reported to the National Notifiable Diseases Surveillance System at the CDC. But case definitions can change which is what happened with Lyme Disease in 2022. According to a 2024 MMWR Report, the cases of Lyme Disease greatly increased in 2022, most likely as a result of changes to surveillance methods and not from an actual increase in cases. The report points out the surveillance has improved but we can no longer make those historical comparisons. How big was the increase? The table below shows only the subtotals by jurisdiction and the total at the national level. For high-incidence jurisdictions, the change is fairly significant-- a whopping 73% increase in identified cases!

Number of reported Lyme disease cases and Lyme disease incidence, by jurisdiction and incidence category* -- United States, 2017–2019 and 2022

                                          Reported cases                 Incidence (per 100,000)
Jurisdiction                              2017–2019   2022     % change  2017–2019   2022    Difference
High-incidence jurisdictions (subtotal)   34,557      59,734   72.9      43.3        68.3    25.0
  Pennsylvania                            10,369      8,413    −18.9     79.7        64.7    −15.0
Low-incidence jurisdictions (subtotal)    2,561       2,817    10.0      0.7         0.5     −0.2
U.S. total                                37,118      62,551   68.5      11.2        18.9    7.7

* High-incidence: 10 or more confirmed cases per 100,000 population for 3 years; low-incidence: all others.
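The percent changes in the table are easy to verify from the counts, a good habit whenever a figure looks startling:

```python
cases_2017_2019 = 34_557  # high-incidence subtotal, 2017-2019
cases_2022 = 59_734

pct_change = (cases_2022 - cases_2017_2019) / cases_2017_2019 * 100
print(f"Percent change: {pct_change:.1f}%")  # 72.9%
```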

3. What is the unit of measure of the data?

As you look at the data, be sure you understand what the actual unit of measure is. Are you looking at a count or a rate? Is the rate age-adjusted? Which standard was used? If you aren't certain what that means, take a look at the Rates and Formulas page for additional information.

Be sure you understand the unit of measure so that you can compare apples to apples. You cannot compare a non-adjusted rate with an age-adjusted rate. And quite honestly, you probably do not want to compare crude rates if there are several years separating them (e.g., a decade), especially in areas that are rapidly changing. It is possible to calculate an age-adjusted rate fairly easily; a general epidemiology book will explain how, and the sketch under section 2A above shows the arithmetic.

4. What has changed in medicine to affect the data?

Medicine has made great strides in keeping people alive who, twenty years ago or even ten years ago, would have died. Just think about AIDS, heart attacks, strokes, and cancer. Mortality data is not always the most accurate reflection of the health of a population.

Another example is infant mortality. The rate increased in the United States from 6.8 deaths per 1,000 births in 2001 to 7.0 deaths in 2002. What happened? The causes are not fully known, but the CDC has some thoughts on the reasons why. Take a look at the "Supplemental Analyses of Recent Trends in Infant Mortality." Again, medical technology could have influenced infant mortality: what would once have been a miscarriage is now a preterm delivery.

5. How has the population changed?

Has the population aged? If so, we may see a sharp increase in cancer and cardiovascular diseases. Is it a younger population? Then there may be an increase in the number of pregnancies and STDs. Be sure to look at the demographics of the population when examining the number of occurrences of an event.

6. Is the data reporting self-reported behaviors?

Data on self-reported behaviors cannot be expected to be accurate. After all, a survey participant may be asked about behaviors that are embarrassing or even illegal. Consequently, when questioned, participants may under-report certain behaviors (drinking while pregnant) and over-report others (exercise).

7. Is the frequency of events or the population (or both) a very small number?

Be careful when working with diseases for which there are not a large number of deaths, or where the population is fairly small. While the mortality rate may be expressed as the number per 100,000 (i.e. a point estimate), you should also take into consideration the confidence interval, which indicates how large or small the spread of uncertainty is. What is a confidence interval? "A confidence interval is a range around a measurement that conveys how precise the measurement is. For most chronic disease and injury programs, the measurement in question is a proportion or a rate (the percent of New Yorkers who exercise regularly or the lung cancer incidence rate).

The confidence interval tells you more than just the possible range around the estimate. It also tells you about how stable the estimate is. A stable estimate is one that would be close to the same value if the survey were repeated. An unstable estimate is one that would vary from one sample to another. Wider confidence intervals in relation to the estimate itself indicate instability. For example, if 5 percent of voters are undecided, but the margin of error of your survey is plus or minus 3.5 percent, then the estimate is relatively unstable. In one sample of voters, you might have 2 percent say they are undecided, and in the next sample, 8 percent are undecided. This is four times more undecided voters, but both values are still within the margin of error of the initial survey sample.

On the other hand, narrow confidence intervals in relation to the point estimate tell you that the estimated value is relatively stable; that repeated polls would give approximately the same results." (From the NY State Department of Health: Confidence Intervals - Statistics Teaching Tools)

For example, the CDC WONDER Underlying Cause of Death dataset was queried, with the table set up to show the crude rate of mortality from brain cancer by gender for each county in Pennsylvania, along with a 95% confidence interval; three years of data were aggregated. A point-estimate rate was calculated for most of the counties, but 24 counties in Pennsylvania were flagged because the numerator was less than 20: the crude rate was not calculated, although the confidence interval was. In one case, the confidence interval ran from 9.0 to 34.5-- the "true" crude rate would lie somewhere within that interval.
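When the count is small, an exact Poisson interval (similar in spirit to the method CDC WONDER applies to small counts) shows just how wide the uncertainty is. A sketch, using a hypothetical county with 12 deaths over three aggregated years and 100,000 person-years at risk; all numbers are illustrative:

```python
from scipy.stats import gamma

deaths = 12             # hypothetical small-county count (under the 20 threshold)
person_years = 100_000  # hypothetical population aggregated over 3 years

point = deaths / person_years * 100_000  # crude rate per 100,000

# Exact Poisson 95% CI for the count (chi-square/gamma method),
# scaled to a rate per 100,000.
lower = gamma.ppf(0.025, a=deaths) / person_years * 100_000
upper = gamma.ppf(0.975, a=deaths + 1) / person_years * 100_000

print(f"Rate: {point:.1f} (95% CI {lower:.1f}-{upper:.1f}) per 100,000")
# With only 12 events the interval runs from roughly 6 to 21 per 100,000 --
# wide enough to explain why rates built on fewer than 20 deaths get flagged.
```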