https://zenodo.org/records/10405106
This site is to serve as my note-book and to effectively communicate with my students and collaborators. Every now and then, a blog may be of interest to other researchers or teachers. Views in this blog are my own. All rights of research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Friday, January 10, 2025
Friday, June 17, 2022
aging methylation data, single cell, Transcription factor
Inference of age-associated transcription factor regulatory activity changes in single cells
https://www.nature.com/articles/s43587-022-00233-9#data-availability
"Here we present and validate a TF activity estimation method for single cells from the hematopoietic system that is based on TF regulons, and apply it to a mouse single-cell RNA-sequencing atlas, to infer age-associated differentiation activity changes in the immune cells of different organs."
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE56046
Status | Public on Nov 24, 2014
Title | Transcriptomics and methylomics of human monocytes [methylome]
Organism | Homo sapiens
Experiment type | Methylation profiling by genome tiling array
Summary | The MESA Epigenomics and Transcriptomics Study has been launched to investigate potential gene expression regulatory methylation sites in humans by examining the association between CpG methylation and gene expression in purified human monocytes from a large study population (community-dwelling participants in the Multi-Ethnic Study of Atherosclerosis (MESA)). The MESA Epigenomics and Transcriptomics Study was funded by a National Heart, Lung and Blood Institute grant (R01HL101250) through the NIH Roadmap Epigenomics Program in 2009.
Wednesday, June 15, 2022
genome-wide Perturb-seq
CRISPRi + scRNA-seq
https://gwps.wi.mit.edu/
matrix file in H5AD format
https://doi.org/10.25452/figshare.plus.20029387
Genome wide screen targeted n=9867 genes.
Essential-wide screen targeted n=2285 essential genes.
Growth phenotypes were measured as log2 guide enrichment per cell doubling (gamma).
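A minimal sketch of what "log2 guide enrichment per cell doubling" could look like in code. The normalization against non-targeting control guides, the function name, and the toy counts are all assumptions for illustration; the published pipeline may use pseudocounts or a different normalization.

```python
import math


def growth_phenotype_gamma(guide_start, guide_end, control_start, control_end, doublings):
    """Hypothetical helper: log2 enrichment of a guide relative to
    control guides, divided by the number of cell doublings."""
    if doublings <= 0:
        raise ValueError("doublings must be positive")
    guide_ratio = guide_end / guide_start        # fold change of the targeting guide
    control_ratio = control_end / control_start  # fold change of the control guides
    return math.log2(guide_ratio / control_ratio) / doublings


# Toy numbers (made up): the guide drops 4-fold over 8 doublings
# while the controls stay flat.
gamma = growth_phenotype_gamma(1000, 250, 1000, 1000, 8)
print(gamma)  # -0.25
```

A negative gamma means the guide is depleted relative to controls, i.e. the knockdown slows growth.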
"The relative homogeneity of CRISPRi reduces selection for unperturbed cells, especially when studying essential genes. Unlike CRISPR knockout, CRISPRi does not lead to activation of the DNA damage response which can alter transcriptional signatures (Haapaniemi et al., 2018)."
"We use a compact, multiplexed CRISPR interference (CRISPRi) library to assay thousands of loss-of-function genetic perturbations with single-cell RNA sequencing (scRNA-seq) in chronic myeloid leukemia (CML) (K562) and retinal pigment epithelial (RPE1) cell lines."
There are four datasets on SRA:
- K562 day 8 Perturb-seq (KD8): targeting all expressed genes at day 8 after transduction
- RPE1 day 7 Perturb-seq (RD7): targeting DepMap essential genes at day 7 after transduction
- K562 day 6 Perturb-seq (KD6): targeting DepMap essential genes at day 6 after transduction
- K562 day 8 Perturb-seq (KD8_ultima): scRNA-seq libraries from the KD8 experiment sequenced on the Ultima sequencing platform rather than the Illumina sequencing platform
Sunday, May 22, 2022
MIMIC-III
https://physionet.org/content/mimiciii/1.4/
MIMIC-III is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).
MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.
Wednesday, May 11, 2022
WILDS
WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.
https://github.com/p-lambda/wilds/tree/main/examples
Monday, September 20, 2021
single cell multiplexed and image and proteomic data
Tuesday, May 11, 2021
GTEx data portals
DBGAP
https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=pi_requests&filter=wlid&wlid=29078
ANVIL
https://anvil.terra.bio/#profile
https://app.terra.bio/#workspaces
Thursday, April 15, 2021
human protein atlas
https://www.proteinatlas.org/humanproteome/cell
deep RNA seq.
protein localization by antibody profiling with immunofluorescence and confocal microscopy -> 35 locations
Wednesday, April 14, 2021
Monday, April 5, 2021
Skeletal muscle transcriptome in healthy aging
https://www.nature.com/articles/s41467-021-22168-2
RNA was extracted and sequenced from muscle biopsies collected from 53 healthy individuals (22–83 years old) of the GESTALT study of the National Institute on Aging–NIH
Saturday, January 2, 2021
Monday, December 28, 2020
USA 2020 election results by counties
MIT election night 2020
https://github.com/MEDSL/election_night2020
2020 results by counties
https://github.com/openelections
2004, 2008, 2012
https://github.com/helloworlddata/us-presidential-election-county-results
Tuesday, November 24, 2020
jackson20 elife Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments
11 different TF deletion strains
A digital expression matrix was provided in the supporting documents.
To go from scRNA-seq counts per gene to a gene regulatory network, the authors used the "inferelator", a regression-based method.
A gold-standard prior TF network was taken from Tchourine, 2018, which includes 1403 signed (-1, 0, 1) interactions in a regulatory matrix of 998 genes by 98 transcription factors.
Another, unsigned gene network was taken from Teixeira 2018; it includes 11486 interactions in a matrix of 3912 genes by 152 TFs.
Multi-task fitting was used to infer gene networks across the 11 conditions.
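A quick check on the network sizes quoted above: the fraction of possible gene-TF pairs that carry an interaction. The function name is made up for this note; the numbers are the ones listed for the two prior networks.

```python
def network_density(n_interactions, n_genes, n_tfs):
    """Fraction of all gene x TF entries that are nonzero."""
    return n_interactions / (n_genes * n_tfs)


# Gold standard (Tchourine 2018): 1403 interactions, 998 genes x 98 TFs
print(f"{network_density(1403, 998, 98):.4f}")     # 0.0143
# Unsigned prior (Teixeira 2018): 11486 interactions, 3912 genes x 152 TFs
print(f"{network_density(11486, 3912, 152):.4f}")  # 0.0193
```

Both priors are sparse: fewer than 2% of the possible gene-TF pairs are annotated.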
The cellranger pipeline is available from 10x Genomics under the MIT license (https://github.com/10XGenomics/cellranger). The fastqToMat0 pipeline is available from GitHub (https://github.com/flatironinstitute/fastqToMat0; Jackson, 2020; copy archived at https://github.com/elifesciences-publications/fastqToMat0) and is released under the MIT license. Genome sequence and annotations are included as Source code 4.
# Included in this archive are the following data files:
# data/
# 103118_SS_Data.tsv.gz (TSV count matrix of single-cell yeast reads [cells x genes])
# 110518_SS_NEG_Data.tsv.gz (TSV count matrix of simulated single-cell reads [cells x genes])
# TRIZOL_BULK.tsv (TSV count matrix of data prepared with TRIZOL [samples x genes])
# yeast_gene_names.tsv (TSV with Systematic Names and Common Names for yeast genes)
# STable5.tsv (TSV copy of Supplemental Table 5 with cross-validation results)
# STable6.tsv (TSV copy of Supplemental Table 6 with genes grouped into categories)
# go_slim_mapping.tab (GO Slim Terms: https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab)
# go_slim_labels.tsv (TSV with shorter figure names for GO slim terms)
# GASCH_2017_COUNTS.tsv (TSV from GSE102475; Gasch 2017 BY4741 single-cell TPM data in mid-log YPD [cells x genes])
# LEWIS_ALL.tsv (TSV from GSE135430; Scholes 2019 BY4741 bulk TPM data in mid-log YPD[samples x genes])
# LARS_2019_COUNTS.tsv (TSV from GSE122392; Nadal-Ribelles 2019 BY4741 single-cell count data in mid-log YPD [cells x genes])
# inferelator/
# jackson_2019_figureXX.py (Python script to run the network inference with the inferelator v0.3.0 for the associated figure)
# network/
# signed_network.tsv (TSV signed [-1, 0, 1] network of regulatory relationships [genes x TFs])
# COND_signed_network.tsv (TSV signed [-1, 0, 1] network of regulatory relationships [genes x TFs] for each of 11 conditions)
# priors/
# Tchourine_gold_standard.tsv.gz (TSV with gold standard from Tchourine et al 2018)
# ATAC-motif_priors.tsv.gz (TSV with atac-motif priors from Castro et al 2019)
# YEASTRACT_priors_20181118.tsv.gz (TSV with YEASTRACT priors downloaded from YEASTRACT 11/18/2018)
# YEASTRACT_20190713_BOTH.tsv (TSV with YEASTRACT priors downloaded from YEASTRACT 07/13/2019)
# YEASTRACT_20190713_DNABINDING.tsv (TSV with YEASTRACT DNA-binding interaction data downloaded from YEASTRACT 07/13/2019)
# YEASTRACT_20190713_EXPRESSION.tsv (TSV with YEASTRACT expression change interaction data downloaded from YEASTRACT 07/13/2019)
# BUSSEMAKER_priors_2008.tsv.gz (TSV with priors from Ward & Bussemaker 2008)
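The signed network files in the listing above are described as [genes x TFs] matrices with entries in -1, 0, 1. A minimal sketch for turning such a wide matrix into an edge list follows. The exact file layout is an assumption here (a header row of TF names, then one row per gene with the gene name in the first column), and the toy matrix is made up, so check the real file before relying on this.

```python
import csv
import io


def signed_matrix_to_edges(handle):
    """Read a tab-separated [genes x TFs] matrix and return
    (tf, gene, sign) tuples for every nonzero entry.
    Assumed layout: first row = TF names, first column = gene names."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    tfs = header[1:]  # skip the corner cell above the gene names
    edges = []
    for row in reader:
        if not row:
            continue
        gene = row[0]
        for tf, value in zip(tfs, row[1:]):
            sign = int(float(value))
            if sign != 0:
                edges.append((tf, gene, sign))
    return edges


# Toy matrix with made-up values, standing in for signed_network.tsv
toy = "\tGCN4\tGLN3\nYAL001C\t1\t0\nYAL002W\t-1\t1\n"
edges = signed_matrix_to_edges(io.StringIO(toy))
print(edges)
```

For a real file, pass `open("signed_network.tsv", newline="")` instead of the `io.StringIO` object.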
Source code 1
A ‘tar.gz’ archive containing R scripts used to generate Figures 2–7 and accompanying supplementary figures with a README detailing the necessary R environment to run them locally.
It also contains a data folder with the raw count matrix as a TSV file (103118_SS_Data.tsv.gz), the simulated negative data count matrix as a TSV file (110518_SS_NEG_Data.tsv.gz), a gene name metadata TSV file (yeast_gene_names.tsv), supplemental tables 5 (STable5.tsv) and 6 (STable6.tsv) as TSV files, and the yeast gene ontology slim mapping as a TAB file (go_slim_mapping.tab). Source code 1 also contains a priors folder with the Gold Standard, the three sets of priors data tested in this work, and the YEASTRACT comparison data, all as TSV files. Source code 1 also contains a network folder with the network learned in this paper (signed_network.tsv) as a TSV file, and the networks for each experimental condition (COND_signed_network.tsv) as 11 separate TSV files. Source code 1 also contains an inferelator folder with the python scripts used to generate the networks for Figures 5, 6, 7.
- https://cdn.elifesciences.org/articles/51254/elife-51254-code1-v3.tar.gz
Source code 2
The raw count matrix as a gzipped TSV file.
This file contains 38,225 observations (cells). Doublets and low-count cells have already been removed; gene expression values are unmodified transcript counts after deartifacting using UMIs (these values are directly produced by the cellranger count pipeline)
- https://cdn.elifesciences.org/articles/51254/elife-51254-code2-v3.tsv.gz
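A minimal sketch for a first look at a gzipped [cells x genes] count matrix like this one: total transcript counts per cell, using only the standard library. The layout is an assumption (a header row of column names, one row per cell), and the `metadata_columns` parameter is a hypothetical way to skip any non-count columns the real file may carry. The toy data is made up.

```python
import csv
import gzip
import io


def per_cell_totals(path_or_file, metadata_columns=()):
    """Sum counts across gene columns for each cell (row).
    Columns named in metadata_columns are skipped (assumed non-numeric)."""
    totals = []
    with gzip.open(path_or_file, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            totals.append(
                sum(
                    int(float(value))
                    for name, value in row.items()
                    if name not in metadata_columns
                )
            )
    return totals


# Toy gzipped matrix built in memory: 2 cells, 2 genes, 1 metadata column
raw = "YAL001C\tYAL002W\tCondition\n3\t0\tYPD\n1\t5\tYPD\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
totals = per_cell_totals(buffer, metadata_columns={"Condition"})
print(totals)  # [3, 6]
```

For the downloaded file, pass its path instead of the in-memory buffer. Reading row by row keeps memory use low for a matrix with tens of thousands of cells.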
Source code 3
The network learned in this paper as a TSV file.
- https://cdn.elifesciences.org/articles/51254/elife-51254-code3-v3.tsv
Source code 4
A ‘.tar.gz’ archive containing the sequences used for mapping reads.
It also contains a FASTA file containing the genotype-specific barcodes (bcdel_1_barcodes.fasta), a FASTA file containing the yeast S288C genome modified with markers (Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.Marker.fa), and a GTF file containing the yeast gene annotations modified to include untranslated regions at the 5’ and 3’ end, and with markers (Saccharomyces_cerevisiae.R64-1-1.Marker.UTR.notRNA.gtf).
- https://cdn.elifesciences.org/articles/51254/elife-51254-code4-v3.tar.gz
Source code 5
A zipped HTML document containing the raw R output figures for Figures 2–7 and accompanying supplementary Figures.
The R markdown file to create this document is contained in Source code 1.
- https://cdn.elifesciences.org/articles/51254/elife-51254-code5-v3.zip
Supplementary file 1
An excel file containing Supplemental Tables 1-6.
Supplemental Table 1 contains all primer sequences used in this work. Supplemental Table 2 contains all Saccharomyces cerevisiae strains used in this work. Supplemental Table 3 contains all plasmids used in this work. Supplemental Table 4 contains all media formulations used in this work. Supplemental Table 5 contains the source data for modeling performance (as AUPR) that is reported graphically in Figure 5. Supplemental Table 6 contains the gene categorizations (cell cycle stage, RP, RiBi, etc) used in Figure 3.
- https://cdn.elifesciences.org/articles/51254/elife-51254-supp1-v3.xlsx
Transparent reporting form
- https://cdn.elifesciences.org/articles/51254/elife-51254-transrepform-v3.pdf
Tuesday, November 10, 2020
Saturday, November 7, 2020
Wednesday, November 4, 2020
Broad TERRA workspace
https://terra.bio/covid19
https://support.terra.bio/hc/en-us/articles/360041068771--COVID-19-workspaces-data-and-tools-in-Terra
Some Broad sequences are in SRA.
Monday, November 2, 2020
medical image MNIST data set
https://arxiv.org/abs/2010.14925
"We present MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28 * 28 images, which requires no background knowledge. Covering the primary data modalities in medical image analysis, it is diverse on data scale (from 100 to 100,000) and tasks (binary/multi-class, ordinal regression and multi-label). MedMNIST could be used for educational purpose, rapid prototyping, multi-modal machine learning or AutoML in medical image analysis. Moreover, MedMNIST Classification Decathlon is designed to benchmark AutoML algorithms on all 10 datasets."
https://github.com/MedMNIST/MedMNIST
Wednesday, September 16, 2020
*** government response stringency index, COVID19
university of oxford
COVID19
https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker
https://github.com/OxCGRT/covid-policy-tracker/raw/master/data/timeseries/OxCGRT_timeseries_all.xlsx
https://github.com/OxCGRT/covid-policy-tracker
this is related to Twitter sentiment analysis
Tuesday, September 15, 2020
mass spec database
MassIVE.quant: a community resource of quantitative mass spectrometry–based proteomics datasets
https://www.nature.com/articles/s41592-020-0955-0