Frequently asked questions
Why didn't the sample/project I specified when running the data download script download?
First, we recommend using the --dryrun
flag when running the download-data.py
script to check which files would be downloaded.
This will confirm that there is nothing wrong with your internet connection and that you are properly logged into your AWS profile.
If running the script with --dryrun
states that only the DATA_USAGE.md
file is being downloaded, this means the data files you are attempting to download do not exist.
There are two main reasons why this might occur:
- Not all samples are available in
AnnData
format. If you are attempting to downloadAnnData
format, please consult the ScPCA documentation to learn more about which types of projects do not haveAnnData
files. - The sample or project ID(s) that you specified during download do(es) not exist.
Therefore, you may wish to list some of the files in the data release S3 bucket to confirm the sample/project files you are trying to download actually exist.
Data files in each release are organized on S3 as:
{Release}
├── {Project ID}
│ └── {Sample ID}
│ └── {Library files}
├── bulk_metadata.tsv (if applicable)
├── bulk_quant.tsv (if applicable)
└── single_cell_metadata.tsv
Follow these instructions to determine the contents of a release:
-
First, find the names of all the data releases, which are named based on their release date in ISO 8601 format. You can use the
download-data.py
script with the--list-releases
flag to do this. Don't forget to log into your AWSopenscpca
profile first. We also strongly suggest storing your AWS profile name to help handle credentials. -
Identify the most recent date from the output. In the example output below, the most recent release is
2024-05-01
. -
All data is stored on S3 in
s3://openscpca-data-release/{release name}/
. To list all available files, you'll need to use the AWS CLI in a terminal and runaws s3 ls
to list the contents of a given S3 bucket. For example, you can list files in a given release as:This has the (abbreviated) output:
Output from listing all projects in the 2024-05-01 releasePRE SCPCP000001/ PRE SCPCP000002/ PRE SCPCP000003/ PRE SCPCP000004/
Note that the
PRE
prefix appears for directories that are inside the overall data release bucket,s3://openscpca-data-release
.Don't forget the trailing slash!
The trailing slash at the end of
s3://openscpca-data-release/2024-05-01/
is necessary for contents to be listed. If you omit the slash and just runaws s3 ls s3://openscpca-data-release/2024-05-01
, the output will only list the directory itself, and not its contents. -
From there, you can continue to list the next level of nested files and directories (prefixes). For example, you can list all files in the
SCPCS000001
sample from theSCPCP000001
project with:# List all SCPCS000001 files, again specifying the AWS profile aws s3 ls s3://openscpca-data-release/2024-05-01/SCPCP000001/SCPCS000001/
This has the output:
Output from listing all SCPCS000001 files in the 2024-05-01 release2024-04-30 09:38:01 2583665 SCPCL000001_celltype-report.html 2024-04-30 09:38:01 47346465 SCPCL000001_filtered.rds 2024-04-30 09:40:45 48974004 SCPCL000001_filtered_rna.h5ad 2024-04-30 09:38:01 45388563 SCPCL000001_processed.rds 2024-04-30 09:40:45 214177811 SCPCL000001_processed_rna.h5ad 2024-04-30 09:38:01 3057684 SCPCL000001_qc.html 2024-04-30 09:38:03 47019247 SCPCL000001_unfiltered.rds 2024-04-30 09:40:45 94000784 SCPCL000001_unfiltered_rna.h5ad
-
If you attempt to list files with a prefix that does not exist, you will not get any output. For example, this will not return anything since there is no project ID
SCPCP000099
:
How can I use results from existing modules in my analysis module?
Once modules reach maturity, the Data Lab adds them to a Nextflow workflow to automatically generate the module's results from the latest data release.
If you want to use these module results (e.g., cell type annotations conducted in a different module) as input for your analysis, you can obtain them with the download-results.py
script.
Results downloaded with this script will be stored in data/current/results/{module name}
.
However, there may be circumstances when you want to use results from a module which has not yet been added to the Nextflow workflow.
In such cases, you will need to run the module yourself to generate the results.
Instructions for running the module, including its software and compute requirements, should be available in the module's main README.md
file.
After running the module, results will generally be stored in analysis/{module name}/results
, and the module's documentation should describe the contents of result files.
What if I want to use Seurat?
While data downloads are only available in SingleCellExperiment
and AnnData
format, Seurat
versions of processed objects (in v5 assay format) are also available for use.
These files are part of the OpenScPCA results, associated with the module seurat-conversion
which we wrote to convert the processed SingleCellExperiment
objects to Seurat
format.
For more information on obtaining result files, please refer to the documentation for the download-results.py
script.
When working with these Seurat
objects, please bear in mind the following:
- These
Seurat
objects include the same content as theSingleCellExperiment
objects that they are derived from. This includes raw and normalized counts, annotations of highly variable genes, PCA and UMAP transformations, as well as cell and feature metadata. - Note that all calculations were performed using
Bioconductor
packages, so values will differ from the results obtained usingSeurat
functions from the same raw data. - If your analysis requires fields created from
Seurat
processing pipelines, you will need to repeat those processing steps. - To be more consistent with
Seurat
analysis pipelines, these objects use gene symbols rather than Ensembl ids as the row names and primary feature id.
The ScPCA data objects contain Ensembl ids, but I need gene symbols for my analysis. How should I perform this conversion?
In an effort to keep this consistent across the OpenScPCA project, we provide functions to convert Ensembl ids to gene symbols in an R package we maintain called rOpenScPCA
.
Installation instructions are provided in the rOpenScPCA
GitHub repository.
This package has two particular functions to support this task:
rOpenScPCA::ensembl_to_symbol()
- This function converts a vector of Ensembl ids to a vector of gene symbols
rOpenScPCA::sce_to_symbols()
- This function converts row names in a
SingleCellExperiment
object from Ensembl ids to gene symbols
Please refer to these functions' help pages (e.g., ?rOpenScPCA::sce_to_symbols
) for additional information on their use, including options for handling duplicate or missing gene symbols.
I noticed there are cluster assignments in the processed data files. Should I use those or re-cluster the data?
All ScPCA data objects contain cluster assignments which were calculated using an automated pipeline. Because the clustering parameters used in this automated pipeline were not tailored to any given dataset, we do not recommend relying on these clusters for downstream analysis. Instead, we strongly recommend re-clustering the data and evaluating your cluster assignments before using them.
To support clustering analysis and evaluation, we provide several functions in an R package we maintain called rOpenScPCA
to accomplish the following tasks:
- Perform graph-based clustering
- Evaluate clustering results with several quality control metrics
- Calculate different sets of clustering results across parameter space in order to identify an optimal clustering scheme
We also provide an OpenScPCA analysis module hello-clusters
with example notebooks demonstrating how to use the clustering functionality in rOpenScPCA
.
This module also provides instructions on how to install rOpenScPCA
.