Showing posts with label data resources.

Friday, January 10, 2025

Profiling the transcriptomic age of single-cells in humans

Data for "Profiling the transcriptomic age of single-cells in humans"

Friday, June 17, 2022

aging methylation data, single cell, transcription factor

Inference of age-associated transcription factor regulatory activity changes in single cells

https://www.nature.com/articles/s43587-022-00233-9#data-availability

Here we present and validate a TF activity estimation method for single cells from the hematopoietic system that is based on TF regulons, and apply it to a mouse single-cell RNA-sequencing atlas, to infer age-associated differentiation activity changes in the immune cells of different organs.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56046

Series GSE56046

Status: Public on Nov 24, 2014

Title: Transcriptomics and methylomics of human monocytes [methylome]

Organism: Homo sapiens

Experiment type: Methylation profiling by genome tiling array

Summary:

The MESA Epigenomics and Transcriptomics Study has been launched to investigate potential gene expression regulatory methylation sites in humans by examining the association between CpG methylation and gene expression in purified human monocytes from a large study population (community-dwelling participants in the Multi-Ethnic Study of Atherosclerosis (MESA)).
The MESA Epigenomics and Transcriptomics Study was funded by a National Heart, Lung and Blood Institute grant (R01HL101250) through the NIH Roadmap Epigenomics Program in 2009.

Wednesday, June 15, 2022

genome-wide Perturb-seq

CRISPRi + scRNA-seq

https://gwps.wi.mit.edu/

matrix file in H5AD format

https://doi.org/10.25452/figshare.plus.20029387

Genome wide screen targeted n=9867 genes.

Essential-wide screen targeted n=2285 essential genes.

Growth phenotypes were measured as log2 guide enrichment per cell doubling (gamma).
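
A minimal sketch of what "log2 guide enrichment per cell doubling" means for a single guide. This is not the authors' pipeline: the function name, the pseudocount, and the read counts are hypothetical, and a real screen analysis would typically also normalize against non-targeting control guides, which is omitted here.

```python
import math


def growth_phenotype_gamma(count_start, count_end, total_start, total_end,
                           doublings, pseudocount=1.0):
    """Return log2 enrichment of one guide between two time points, per doubling.

    count_*  : reads for this guide at the start / end of the screen
    total_*  : total reads in the start / end sample (library-size normalization)
    doublings: number of cell doublings between the two samples
    """
    if doublings <= 0:
        raise ValueError("doublings must be positive")
    # Convert raw counts to fractions of the library so sequencing depth cancels.
    frac_start = (count_start + pseudocount) / total_start
    frac_end = (count_end + pseudocount) / total_end
    return math.log2(frac_end / frac_start) / doublings


# Hypothetical guide that drops 4-fold over 8 doublings (pseudocount off for clarity).
gamma = growth_phenotype_gamma(count_start=400, count_end=100,
                               total_start=1_000_000, total_end=1_000_000,
                               doublings=8, pseudocount=0.0)
print(round(gamma, 3))  # -0.25
```

A negative gamma means the guide is depleted (the knockdown slows growth); a value near zero means no growth effect under this simplified definition.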

"The relative homogeneity of CRISPRi reduces selection for unperturbed cells, especially when studying essential genes. Unlike CRISPR knockout, CRISPRi does not lead to activation of the DNA damage response which can alter transcriptional signatures (Haapaniemi et al., 2018)."

"We use a compact, multiplexed CRISPR interference (CRISPRi) library to assay thousands of loss-of-function genetic perturbations with single-cell RNA sequencing (scRNA-seq) in chronic myeloid leukemia (CML) (K562) and retinal pigment epithelial (RPE1) cell lines."

There are four datasets on SRA:

K562 day 8 Perturb-seq (KD8): targeting all expressed genes at day 8 after transduction
RPE1 day 7 Perturb-seq (RD7): targeting DepMap essential genes at day 7 after transduction
K562 day 6 Perturb-seq (KD6): targeting DepMap essential genes at day 6 after transduction
K562 day 8 Perturb-seq (KD8_ultima): scRNA-seq libraries from the KD8 experiment sequenced on the Ultima sequencing platform rather than the Illumina sequencing platform

Potential problem: essential genes cannot be deleted?

This is a good data set for graph controllability and graph neural network analysis.

Sunday, May 22, 2022

MIMIC-III

https://physionet.org/content/mimiciii/1.4/

MIMIC-III is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).

MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.

Wednesday, May 11, 2022

WILDS

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

https://github.com/p-lambda/wilds/tree/main/examples

Monday, September 20, 2021

single-cell multiplexed imaging and proteomic data

Automated assignment of cell identity from single-cell multiplexed imaging and proteomic data

Geuenich Michael; Hou Jinyu; Lee Sunyun; Ayub Shanza; Jackson Hartland; Campbell Kieran

https://zenodo.org/record/5156049#.YUjywGZKjzc

Tuesday, May 11, 2021

GTEx data portals

DBGAP

https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=pi_requests&filter=wlid&wlid=29078

ANVIL

https://anvil.terra.bio/#profile

https://app.terra.bio/#workspaces

Thursday, April 15, 2021

human protein atlas

https://www.proteinatlas.org/humanproteome/cell

69 human cell lines,
deep RNA-seq,
protein localization by antibody profiling with immunofluorescence and confocal microscopy -> 35 locations

Wednesday, April 14, 2021

multi-view or multi-task genomics data sets

yeast:

scmd version morphology

fitness

lifespan, RLS, CLS

Monday, April 5, 2021

Skeletal muscle transcriptome in healthy aging

https://www.nature.com/articles/s41467-021-22168-2

RNA was extracted and sequenced from muscle biopsies collected from 53 healthy individuals (22–83 years old) of the GESTALT study of the National Institute on Aging–NIH

Saturday, January 2, 2021

Federal tax revenue by state

https://en.wikipedia.org/wiki/Federal_tax_revenue_by_state#Fiscal_Year_2019

Monday, December 28, 2020

USA 2020 election results by county

MIT election night 2020

https://github.com/MEDSL/election_night2020

2020 results by counties

https://github.com/tonmcg/US_County_Level_Election_Results_08-20/blob/master/2020_US_County_Level_Presidential_Results.csv

https://github.com/openelections

2004, 2008, 2012

https://github.com/helloworlddata/us-presidential-election-county-results

Tuesday, November 24, 2020

jackson20 elife Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

11 different TF deletion strains

A digital expression matrix is provided in the supporting documents.

To go from scRNA-seq counts per gene to a gene regulatory network, the authors used "inferelator", a regression-based method.

A gold-standard prior TF network is taken from Tchourine, 2018; it includes 1403 signed (-1, 0, 1) interactions in a regulatory matrix of 998 genes by 98 transcription factors.

Another, unsigned gene network is taken from Teixeira 2018; it includes 11486 interactions in a matrix of 3912 genes by 152 TFs.

Multi-task fitting was used to infer a gene network for each of the 11 conditions.
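
A rough sketch of how a signed [genes x TFs] prior such as the ones above could be summarized. The file layout (TF names in the header row, gene names in the first column, tab-separated) is an assumption based on the "[genes x TFs]" description, the function names are hypothetical, and the toy matrix is made up. The density helper just reuses the counts quoted in the notes.

```python
import csv
import io


def summarize_signed_network(handle):
    """Count genes, TFs and signed edges in a tab-separated [genes x TFs] matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    tfs = header[1:]  # assumption: the first column holds gene names
    n_genes = n_pos = n_neg = 0
    for row in reader:
        if not row:
            continue
        n_genes += 1
        for value in row[1:]:
            sign = int(float(value))
            if sign > 0:
                n_pos += 1
            elif sign < 0:
                n_neg += 1
    return {"genes": n_genes, "tfs": len(tfs), "activating": n_pos,
            "repressing": n_neg, "interactions": n_pos + n_neg}


def density(n_interactions, n_genes, n_tfs):
    """Fraction of possible gene-TF pairs that carry an interaction."""
    return n_interactions / (n_genes * n_tfs)


# Toy matrix (made up) in the assumed layout.
toy = "gene\tTF_A\tTF_B\nG1\t1\t0\nG2\t-1\t1\nG3\t0\t0\n"
summary = summarize_signed_network(io.StringIO(toy))
print(summary)

# Densities implied by the numbers quoted in the notes.
gold_density = density(1403, 998, 98)       # about 1.4% of possible pairs
unsigned_density = density(11486, 3912, 152)  # about 1.9% of possible pairs
print(round(gold_density, 4), round(unsigned_density, 4))
```

Both priors are sparse: fewer than 2% of the possible gene-TF pairs carry an interaction.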

Single-Cell processing pipeline
The cellranger pipeline is available from 10x Genomics under the MIT license (https://github.com/10XGenomics/cellranger). The fastqToMat0 pipeline is available from GitHub (https://github.com/flatironinstitute/fastqToMat0; Jackson, 2020; copy archived at https://github.com/elifesciences-publications/fastqToMat0) and is released under the MIT license. Genome sequence and annotations are included as Source code 4.

## Supplemental Data 1: Jackson et al 2019 ##
# Included in this archive are the following data files:
# data/
#   103118_SS_Data.tsv.gz (TSV count matrix of single-cell yeast reads [cells x genes])
#   110518_SS_NEG_Data (TSV count matrix of simulated single-cell reads [cells x genes])
#   TRIZOL_BULK.tsv (TSV count matrix of data prepared with TRIZOL [samples x genes])
#   yeast_gene_names.tsv (TSV with Systematic Names and Common Names for yeast genes)
#   STable5.tsv (TSV copy of Supplemental Table 5 with cross-validation results)
#   STable6.tsv (TSV copy of Supplemental Table 6 with genes grouped into categories)
#   go_slim_mapping.tab (GO Slim Terms: https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab)
#   go_slim_labels.tsv (TSV with shorter figure names for GO slim terms)
#   GASCH_2017_COUNTS.tsv (TSV from GSE102475; Gasch 2017 BY4741 single-cell TPM data in mid-log YPD [cells x genes])
#   LEWIS_ALL.tsv (TSV from GSE135430; Scholes 2019 BY4741 bulk TPM data in mid-log YPD [samples x genes])
#   LARS_2019_COUNTS.tsv (TSV from GSE122392; Nadal-Ribelles 2019 BY4741 single-cell count data in mid-log YPD [cells x genes])
# inferelator/
#   jackson_2019_figureXX.py (Python script to run the network inference with the inferelator v0.3.0 for the associated figure)
# network/
#   signed_network.tsv (TSV signed [-1, 0, 1] network of regulatory relationships [genes x TFs])
#   COND_signed_network.tsv (TSV signed [-1, 0, 1] network of regulatory relationships [genes x TFs] for each of 11 conditions)
# priors/
#   Tchourine_gold_standard.tsv.gz (TSV with gold standard from Tchourine et al 2018)
#   ATAC-motif_priors.tsv.gz (TSV with atac-motif priors from Castro et al 2019)
#   YEASTRACT_priors_20181118.tsv.gz (TSV with YEASTRACT priors downloaded from YEASTRACT 11/18/2018)
#   YEASTRACT_20190713_BOTH.tsv (TSV with YEASTRACT priors downloaded from YEASTRACT 07/13/2019)
#   YEASTRACT_20190713_DNABINDING.tsv (TSV with YEASTRACT DNA-binding interaction data downloaded from YEASTRACT 07/13/2019)
#   YEASTRACT_20190713_EXPRESSION.tsv (TSV with YEASTRACT expression change interaction data downloaded from YEASTRACT 07/13/2019)
#   BUSSEMAKER_priors_2008.tsv.gz (TSV with priors from Ward & Bussemaker 2008)

Source code 1 A ‘tar.gz’ archive containing R scripts used to generate Figures 2–7 and accompanying supplementary figures with a README detailing the necessary R environment to run them locally. It also contains a data folder with the raw count matrix as a TSV file (103118_SS_Data.tsv.gz), the simulated negative data count matrix as a TSV file (110518_SS_NEG_Data.tsv.gz), a gene name metadata TSV file (yeast_gene_names.tsv), supplemental tables 5 (STable5.tsv) and 6 (STable6.tsv) as TSV files, and the yeast gene ontology slim mapping as a TAB file (go_slim_mapping.tab). Source code 1 also contains a priors folder with the Gold Standard, the three sets of priors data tested in this work, and the YEASTRACT comparison data, all as TSV files. Source code 1 also contains a network folder with the network learned in this paper (signed_network.tsv) as a TSV file, and the networks for each experimental condition (COND_signed_network.tsv) as 11 separate TSV files. Source code 1 also contains an inferelator folder with the python scripts used to generate the networks for Figures 5, 6, 7.: https://cdn.elifesciences.org/articles/51254/elife-51254-code1-v3.tar.gz
Source code 2 The raw count matrix as a gzipped TSV file. This file contains 38,225 observations (cells). Doublets and low-count cells have already been removed; gene expression values are unmodified transcript counts after deartifacting using UMIs (these values are directly produced by the cellranger count pipeline): https://cdn.elifesciences.org/articles/51254/elife-51254-code2-v3.tsv.gz
Source code 3 The network learned in this paper as a TSV file.: https://cdn.elifesciences.org/articles/51254/elife-51254-code3-v3.tsv
Source code 4 A ‘.tar.gz’ archive containing the sequences used for mapping reads. It also contains a FASTA file containing the genotype-specific barcodes (bcdel_1_barcodes.fasta), a FASTA file containing the yeast S288C genome modified with markers (Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.Marker.fa), and a GTF file containing the yeast gene annotations modified to include untranslated regions at the 5’ and 3’ end, and with markers (Saccharomyces_cerevisiae.R64-1-1.Marker.UTR.notRNA.gtf).: https://cdn.elifesciences.org/articles/51254/elife-51254-code4-v3.tar.gz
Source code 5 A zipped HTML document containing the raw R output figures for Figures 2–7 and accompanying supplementary Figures. The R markdown file to create this document is contained in Source code 1.: https://cdn.elifesciences.org/articles/51254/elife-51254-code5-v3.zip
Supplementary file 1 An excel file containing Supplemental Tables 1-6. Supplemental Table 1 contains all primer sequences used in this work. Supplemental Table 2 contains all Saccharomyces cerevisiae strains used in this work. Supplemental Table 3 contains all plasmids used in this work. Supplemental Table 4 contains all media formulations used in this work. Supplemental Table 5 contains the source data for modeling performance (as AUPR) that is reported graphically in Figure 5. Supplemental Table 6 contains the gene categorizations (cell cycle stage, RP, RiBi, etc) used in Figure 3.: https://cdn.elifesciences.org/articles/51254/elife-51254-supp1-v3.xlsx
Transparent reporting form: https://cdn.elifesciences.org/articles/51254/elife-51254-transrepform-v3.pdf
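
The gzipped TSV count matrix (Source code 2) can be streamed without extra libraries. Below is a minimal sketch, assuming the first column is a row label and that any non-numeric cells are metadata to skip; the real column layout of the file has not been checked here, and the function name and the toy file are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def count_matrix_totals(path, max_rows=None):
    """Stream a gzipped [cells x genes] TSV and return (n_cells, per-cell totals)."""
    totals = []
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row (gene / column names)
        for i, row in enumerate(reader):
            if max_rows is not None and i >= max_rows:
                break
            total = 0.0
            for value in row[1:]:  # assumption: column 0 is a cell label
                try:
                    total += float(value)
                except ValueError:
                    pass  # non-numeric cell (assumed metadata), ignored
            totals.append(total)
    return len(totals), totals


# Toy file (made up) so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_counts.tsv.gz")
    with gzip.open(toy_path, "wt", newline="") as out:
        out.write("cell\tYAL001C\tYAL002W\tCondition\n")
        out.write("c1\t3\t0\tYPD\n")
        out.write("c2\t1\t5\tYPD\n")
    n_cells, totals = count_matrix_totals(toy_path)

print(n_cells, totals)  # 2 [3.0, 6.0]
```

Streaming row by row keeps memory use flat, which matters for a matrix with 38,225 cells; `max_rows` allows a quick look at the first few cells.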

Download links

Tuesday, November 10, 2020

PA 2020 vote records

PA open data

https://data.pa.gov/Government-Efficiency-Citizen-Engagement/2020-General-Election-Mail-Ballot-Requests-Departm/mcba-yywm

Saturday, November 7, 2020

voter registration record data sets

https://data.pa.gov/Government-Efficiency-Citizen-Engagement/2020-General-Election-Mail-Ballot-Requests-Departm/mcba-yywm/data

data.gov

https://catalog.data.gov/organization/allegheny-county-city-of-pittsburgh-western-pa-regional-data-center

Wednesday, November 4, 2020

Broad TERRA workspace

https://terra.bio/covid19

https://support.terra.bio/hc/en-us/articles/360041068771--COVID-19-workspaces-data-and-tools-in-Terra

Some Broad sequences are in SRA

Monday, November 2, 2020

medical image MNIST data set

https://arxiv.org/abs/2010.14925

"We present MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28 * 28 images, which requires no background knowledge. Covering the primary data modalities in medical image analysis, it is diverse on data scale (from 100 to 100,000) and tasks (binary/multi-class, ordinal regression and multi-label). MedMNIST could be used for educational purpose, rapid prototyping, multi-modal machine learning or AutoML in medical image analysis. Moreover, MedMNIST Classification Decathlon is designed to benchmark AutoML algorithms on all 10 datasets."

https://github.com/MedMNIST/MedMNIST

Wednesday, September 16, 2020

*** government response stringency index, COVID19

University of Oxford

COVID19

https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker

https://github.com/OxCGRT/covid-policy-tracker/raw/master/data/timeseries/OxCGRT_timeseries_all.xlsx

https://github.com/OxCGRT/covid-policy-tracker

this is related to Twitter sentiment analysis

thecellvision.org

yeast data resource

https://thecellvision.org/

Tuesday, September 15, 2020

mass spec database

MassIVE.quant: a community resource of quantitative mass spectrometry–based proteomics datasets

https://www.nature.com/articles/s41592-020-0955-0