Interpreting the “fraction of reads in cells” values reported with the MiSeq results in single-cell RNA-seq

In short: the “fraction of reads in cells” value is a measure of the proportion of sequencing reads assigned to true cells versus reads attributed to “background” such as RNA from dying/lysed cells captured in otherwise empty droplets.

We use a relatively shallow MiSeq run as part of our library preparation and sequencing services to capture two things: the number of cell barcodes present in the sample and the number of UMIs associated with each cell barcode. These data are commonly visualized (e.g. in the summary files produced by Cell Ranger) as a plot in which the cell barcodes are ranked along the X axis, from most to fewest UMIs, and the number of UMIs associated with each cell barcode is shown on the Y axis. The result is a “knee plot” that looks something like this:

Blue dashed lines represent the analysis pipeline’s primary cell number call; pink dashed lines represent secondary estimates.
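As a rough illustration of how the points on a knee plot are derived, the sketch below ranks barcodes by UMI count. The function name and the toy barcode counts are hypothetical and are not taken from our pipeline or from Cell Ranger.

```python
def knee_plot_points(umi_counts):
    """Return (rank, umi_count) pairs, with the highest-count barcode at rank 1.

    umi_counts: dict mapping cell barcode -> number of UMIs observed.
    """
    ranked = sorted(umi_counts.values(), reverse=True)
    return [(rank, count) for rank, count in enumerate(ranked, start=1)]


# Hypothetical toy data: three cell-like barcodes and three background-like ones.
toy_counts = {
    "AAAC": 5200,
    "AAAG": 4800,
    "AACT": 12,
    "AAGT": 4100,
    "ACGT": 7,
    "AGGT": 3,
}

points = knee_plot_points(toy_counts)
for rank, count in points:
    print(rank, count)
```

Plotting rank against count (typically with both axes on a log scale) gives the knee-shaped curve described above.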

For a library prepared from a high quality, high viability initial sample, you will often see a very sharp inflection in the knee plot as you transition suddenly from cell barcodes with relatively high UMI counts, which are more likely to be true cells, to cell barcodes with very low UMI counts. The latter are more likely to be cases where one of the 10x Genomics gel beads was encapsulated in an oil droplet with “background” or “ambient” RNA from dead/dying/lysed cells in the suspension.
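One generic way to locate such an inflection is to find the ranked barcode that sits furthest above the straight line joining the first and last points in log-log space. The sketch below shows that heuristic only; it is an assumption made for illustration and is not the method used by our pipeline or by Cell Ranger. The function name and the toy counts are hypothetical.

```python
import math


def estimate_knee_rank(ranked_counts):
    """Estimate how many barcodes lie before the knee.

    ranked_counts: UMI counts sorted from highest to lowest.
    Returns the rank (1-based) of the point furthest above the chord that
    joins the first and last points on log10-log10 axes.
    """
    counts = [c for c in ranked_counts if c > 0]  # log10 needs positive values
    if len(counts) < 3:
        return len(counts)

    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(count) for count in counts]
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])

    best_rank, best_gap = 1, 0.0
    for rank, (x, y) in enumerate(zip(xs, ys), start=1):
        chord_y = ys[0] + slope * (x - xs[0])
        gap = y - chord_y  # vertical distance above the chord
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    return best_rank


# Hypothetical toy data with a sharp drop after the third barcode.
knee = estimate_knee_rank([5200, 4800, 4100, 12, 7, 3])
print(knee)
```

On real data with a soft or multi-step knee, a single heuristic like this can easily pick the wrong point, which is why a call of this kind usually needs review.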

The custom analysis pipeline we use to count these barcodes in the MiSeq data is designed to detect and set a threshold at this inflection point. Everything on the plot to the left of this threshold is considered a cell, and everything to the right is considered background. This is where we get the cell number estimate that we will report to you. The other metric we report, the “fraction of reads in cells”, is directly related to this threshold. It is essentially the area under the curve for everything to the left of the threshold: the proportion of sequencing reads that can be assigned with confidence to live cells as a percent of the total reads, including everything to the right of this threshold.
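The arithmetic behind this metric can be sketched as follows, assuming a per-barcode read count is available and the number of called cells is already known. The function name and the toy numbers are hypothetical.

```python
def fraction_of_reads_in_cells(reads_per_barcode, n_cells):
    """Fraction of all reads that belong to the top n_cells barcodes.

    reads_per_barcode: iterable of read counts, one per barcode (any order).
    n_cells: number of barcodes called as cells (the threshold rank).
    """
    ranked = sorted(reads_per_barcode, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n_cells]) / total


# Hypothetical toy data: two called cells and three background barcodes.
reads = [900, 800, 100, 100, 100]
fraction = fraction_of_reads_in_cells(reads, n_cells=2)
print(f"{fraction:.0%}")
```

Here the two called cells hold 1,700 of the 2,000 reads, so the fraction of reads in cells is 85%. Moving the threshold changes the result, which is why this value depends directly on the called cell number.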

Ideally, this fraction of reads in cells value should be 90% or above, but we are generally satisfied to see this number above 70%. For libraries below 70%, we begin to caution against proceeding with the NovaSeq run (especially for cases where this number is ~60% or less), as a large portion of the sequencing reads will be used up by background in the library and it can become increasingly difficult to obtain usable data from the cells that are present. Because of the high amount of background, libraries with a poor fraction of reads in cells also typically display much softer or broader knee plots with less firm delineation between cells and background, which can make it much more difficult to accurately gauge the number of captured cells from the MiSeq data.
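These bands can be summarized in a small helper. This is only a sketch: the function name and labels are hypothetical, and the 60% cutoff is approximate in practice, so treating it as a hard boundary is an assumption made for illustration.

```python
def sequencing_guidance(fraction):
    """Map a fraction of reads in cells (0.0 to 1.0) to a rough guidance label."""
    if fraction >= 0.90:
        return "ideal"
    if fraction >= 0.70:
        return "acceptable"
    if fraction > 0.60:  # approximate cutoff, treated as exact here
        return "caution"
    return "strong caution"


for value in (0.95, 0.85, 0.65, 0.55):
    print(value, sequencing_guidance(value))
```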

Please note that there are certain sample types – particularly samples with large sub-populations of cell types with very different levels of RNA content – where this measurement can be somewhat less reliable with the read depth provided by the MiSeq. This is a common problem in neutrophil-rich samples, for example, where additional knees can emerge that represent the transition from the cell population with more RNA to the cell population with less RNA. Since the “fraction of reads in cells” value is directly downstream of the called cell number, getting a more accurate estimate of this value occasionally requires us to force a called cell estimate. We will usually take this number from one of the analysis pipeline’s secondary estimates (like the pink dashed line in the above example), but this is ultimately a somewhat subjective call based on a variety of factors, including our expectations for that sample (e.g. roughly how many cells we expect to have captured based on measurements like the sample concentration and viability).