ICD, CCS, Elixhauser Data Set

Sven Halvorson, 2020-04-15

The purpose of this project was to create a library of ICD codes and attach some classifications to them that are useful for research. Unfortunately, the resulting data set is too large to upload to GitHub. Instead, this repository has all of the source data sets, the code to compile them, and a description of the fields. If you have any questions or spot an error, please email me at svenpubmail@gmail.com.

One of the first concepts introduced to me when I started working as a statistical programmer was procedure and diagnosis codes. Medical providers need ways of describing the state of a patient's health and what was done about it that are not just free text. These codes are used for many purposes including record keeping, chart reviews, billing, and research. As a researcher, I frequently use them to identify exposures and outcomes and to control for comorbidities. We're often confronted with research questions that involve identifying patients with a particular condition, such as hypertension, but there are many variations on how this can be described in a database. Do we mean primary or secondary? Relating to a particular organ such as a kidney? Neonatal? Exhaustively identifying these permutations with Google alone is not easy.

I was very surprised to learn when I started that my team did not have a single centralized data set containing all of the codes. Various people had disjointed and incomplete lists on different drives. We googled a lot of diseases and took the first hits from whatever site or paper we found. I thought it would be worthwhile to aggregate several coding systems into a single data set. This document describes the process of creating that data set as well as some instructions on how to replicate it. I have used this data set in many applications. One of the most common uses is creating lists of possible codes for investigators so they can narrow down their definitions.

As a side note, if you are a SAS or Stata user, there are programs on the HCUP site that can do some nice transformations related to these codes. I don't really like to use either of those programs, which is part of why I created this data set.

Types of codes

There are a variety of schemes used to categorize medical procedures and diagnoses. Here are some that I am aware of and descriptions of what they are:

  • International Classification of Diseases (ICD): This is a system created and maintained by the World Health Organization (WHO). These codes are separated into procedure and diagnosis codes and come in a series of versions. At my current place of employment (and I suspect most hospitals) we are using version 10 (ICD10). This conversion was relatively recent and many databases have a lot of ICD9 codes as well. Both versions, and ICD10 in particular, strive for precision and thus are fine-grained. This makes them more challenging to use if you want to characterize a more general concept of a disease or procedure. The data set created here uses the ICD codes as its base.
  • Clinical Classifications Software (CCS): Researchers at the Healthcare Cost and Utilization Project (HCUP) created a set of larger bins for the ICD codes. I find these particularly useful for finding sets of codes because each ICD code is assigned a series of levels that get progressively more specific. For example, a code might be classified at level 1 as a neoplasm, at level 2 as a benign neoplasm, and as a benign neoplasm of colon at level 4. Not every ICD code falls within these categories but most do. The versions for diagnoses were discontinued in favor of the CCSR (below) but I have continued to use a beta version since it's backwards compatible with the ICD9 codes.
  • Clinical Classifications Software Refined (CCSR): This is a modification of the CCS system that was created in early 2020. It only applies to ICD10 diagnoses and can classify the same diagnosis into multiple categories. The total number of categories is much larger. I don't have much experience using this at the moment.
  • Elixhauser Comorbidities: This schema is also provided by HCUP but focuses on a set of 30 chronic conditions (comorbidities) that have been demonstrated to have a strong relationship with mortality, length of hospital stay, and hospital charges. These categories are useful for confounder adjustment when conducting statistical analyses. They're not too numerous but capture lots of reasons why patients might have poor outcomes.
  • Current Procedural Terminology (CPT): CPT codes are created by a panel of members drawn from large hospitals, insurers, and government agencies. They only cover procedures and are more focused on billing. There is no reliable crosswalk (that I know of) to map these to ICD procedure codes. Because of this, I rarely use them and they are not included in the data set.
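To make the idea of nested CCS levels concrete, here is a minimal Python sketch of how a leveled classification can be used to pull a set of codes. Everything in it is an assumption for illustration: the codes (`X001`, etc.) are placeholders rather than real ICD codes, the labels are toy values, the levels are simplified to three, and `codes_in_category` is a hypothetical helper, not a function from this repository.

```python
# Toy rows: (icd_code, level_1, level_2, level_3).
# All codes and labels are made-up placeholders, not real CCS assignments.
ROWS = [
    ("X001", "Neoplasms", "Benign neoplasm", "Benign neoplasm of colon"),
    ("X002", "Neoplasms", "Benign neoplasm", "Benign neoplasm of skin"),
    ("X003", "Neoplasms", "Malignant neoplasm", "Malignant neoplasm of colon"),
    ("X004", "Circulatory", "Hypertension", "Essential hypertension"),
]


def codes_in_category(rows, *levels):
    """Return the codes whose leading category levels match `levels`.

    Passing fewer levels selects a broader set; passing more narrows it.
    """
    n = len(levels)
    return [code for code, *cats in rows if tuple(cats[:n]) == levels]


# Broad definition: every code under the level 1 category.
broad = codes_in_category(ROWS, "Neoplasms")

# Narrower definition: restrict to one level 2 category.
narrow = codes_in_category(ROWS, "Neoplasms", "Benign neoplasm")
```

Starting broad and then adding levels is roughly how I would expect a code list to be narrowed down with an investigator, though the real data set has more levels and far more codes.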

Creating the data set

The data set was created with these goals in mind:

  1. Create a close to exhaustive list of all the ICD codes
  2. Link CCS and CCSR categories to codes when applicable
  3. Link Elixhauser comorbidity categories to ICD codes
  4. Create a standardized format for ICD codes that can be used as a merge key, as well as R and Python functions that put codes into that format
  5. Preserve original representations of ICD codes as given by sources
  6. Have clean labels and documentation
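Goals 4 and 5 can be sketched together in Python. This is only an illustration under stated assumptions: the normalization rule (trim, uppercase, drop the decimal point) is a guess at what a merge-friendly format might look like, not necessarily the format used in this data set, and `normalize_icd`, `attach_category`, and the labels `"A"` and `"B"` are hypothetical.

```python
def normalize_icd(code):
    """Trim, uppercase, and drop the decimal point so variants compare equal.

    Assumption: codes arrive as strings. Numeric input could already have
    lost leading zeros, which this function cannot restore.
    """
    return str(code).strip().upper().replace(".", "")


def attach_category(records, lookup):
    """Left-join-style merge of a category onto each record.

    The original 'icd' value is kept untouched; the normalized key and the
    matched category (None when there is no match) are added alongside it.
    """
    keyed = {normalize_icd(k): v for k, v in lookup.items()}
    merged = []
    for rec in records:
        key = normalize_icd(rec["icd"])
        merged.append({**rec, "icd_key": key, "category": keyed.get(key)})
    return merged


# Codes as they might appear in two differently formatted sources.
records = [{"icd": " i10 "}, {"icd": "E119"}, {"icd": "Z99"}]
lookup = {"I10": "A", "E11.9": "B"}  # "A" and "B" are placeholder labels
merged = attach_category(records, lookup)
```

Keeping the source representation in one field and the normalized key in another means the merge does not destroy information about how each source wrote the code.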

The data sets needed to create the merged version are listed below. I also note which version I used at the time of writing, since ICD10 in particular is likely to be updated in the future.

The backbone of the data set is from the Centers for Medicare & Medicaid Services (CMS):

Next get the CCS codes from HCUP here:

Elixhauser comorbidities:

  • My teammate at the Cleveland Clinic created some SAS files for converting ICD codes to Elixhauser categories, which I have uploaded to this GitHub repository.