Showing posts with label cancer. Show all posts

Thursday, July 22, 2021

Wednesday, January 18, 2017

Wisconsin breast cancer diagnostic data set, machine learning analysis

This must be an old data set, judging by the papers that cite it:

http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29


Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.

Krzysztof Grabczewski and Włodzisław Duch. Heterogeneous Forests of Decision Trees. ICANN. 2002.

András Antos and Balázs Kégl and Tamás Linder and Gábor Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3. 2002.

Kristin P. Bennett and Ayhan Demiriz and Richard Maclin. Exploiting unlabeled data in ensemble methods. KDD. 2002.

Hussein A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25. 2002.

Baback Moghaddam and Gregory Shakhnarovich. Boosted Dyadic Kernel Discriminants. NIPS. 2002.

Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.

Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.

Yuh-Jeng Lee. Smooth Support Vector Machines. Preliminary Thesis Proposal Computer Sciences Department University of Wisconsin. 2000.

P. S. Bradley and K. P. Bennett and A. Demiriz. Constrained K-Means Clustering. Microsoft Research Dept. of Mathematical Sciences One Microsoft Way Dept. of Decision Sciences and Eng. Sys. 2000.

Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Improved Generalization Through Explicit Optimization of Margins. Machine Learning, 38. 2000.

Chun-Nan Hsu and Hilmar Schuschel and Ya-Ting Yang. The ANNIGMA-Wrapper Approach to Neural Nets Feature Selection for Knowledge Discovery and Data Mining. Institute of Information Science. 1999.

Huan Liu and Hiroshi Motoda and Manoranjan Dash. A Monotonic Measure for Optimal Feature Selection. ECML. 1998.

Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Direct Optimization of Margins Improves Generalization in Combined Classifiers. NIPS. 1998.

W. Nick Street. A Neural Network Model for Prognostic Prediction. ICML. 1998.

Ykä Huhtala and Juha Kärkkäinen and Pasi Porkka and Hannu Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE. 1998.

Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

Rudy Setiono and Huan Liu. NeuroLinear: From neural networks to oblique decision rules. Neurocomputing, 17. 1997.

Erin J. Bredensteiner and Kristin P. Bennett. Feature Minimization within Decision Trees. National Science Foundation. 1996.

Ismail Taha and Joydeep Ghosh. Characterization of the Wisconsin Breast cancer Database Using a Hybrid Symbolic-Connectionist System. Proceedings of ANNIE. 1996.

Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

Geoffrey I. Webb. OPUS: An Efficient Admissible Algorithm for Unordered Search. J. Artif. Intell. Res. (JAIR), 3. 1995.

Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.

Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Based System for Data Mining: Applications to Medical Data. CEFET-PR, CPGEI Av. Sete de Setembro, 3165.

Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.

Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Algorithm for Classification Rule Discovery. CEFET-PR, Curitiba.

Adam H. Cannon and Lenore J. Cowen and Carey E. Priebe. Approximate Distance Classification. Department of Mathematical Sciences The Johns Hopkins University.

Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.

Bart Baesens and Stijn Viaene and Tony Van Gestel and J. A. K Suykens and Guido Dedene and Bart De Moor and Jan Vanthienen and Katholieke Universiteit Leuven. An Empirical Assessment of Kernel Type Performance for Least Squares Support Vector Machine Classifiers. Dept. Applied Economic Sciences.

Adil M. Bagirov and Alex Rubinov and A. N. Soukhojak and John Yearwood. Unsupervised and supervised data classification via nonsmooth and global optimization. School of Information Technology and Mathematical Sciences, The University of Ballarat.

Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

Huan Liu. A Family of Efficient Rule Generators. Department of Information Systems and Computer Science National University of Singapore.

Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore.

Jarkko Salojärvi and Samuel Kaski and Janne Sinkkonen. Discriminative clustering in Fisher metrics. Neural Networks Research Centre Helsinki University of Technology.

Włodzisław Duch and Rafał Adamczak and Krzysztof Grabczewski and Grzegorz Żal. A hybrid method for extraction of logical rules from data. Department of Computer Methods, Nicholas Copernicus University.

Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.

breast cancer, R, random forest

From:
https://shiring.github.io/machine_learning/2017/01/15/rfe_ga_post


The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset:
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792–796, 1995.
The data was downloaded from the UC Irvine Machine Learning Repository. The features in these datasets characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses.
Included are three datasets. The first is small, with only 9 features; the other two have 30 and 33 features and differ in how strongly the two predictor classes cluster in PCA. I want to explore the effect of different feature selection methods on datasets with these different properties.

Tuesday, January 3, 2017

cancer reading notes

Driver and passenger mutations

Solid tumors typically carry 33-66 somatic nonsynonymous mutations.
About 140 genes can act as drivers; they affect cell fate, cell survival and genome maintenance.
Heterogeneity in tumor cells


KEGG: a collection of manually curated pathway models
www.genome.jp/kegg/pathway.html

BIOCARTA: a collection of pathways for more than 300 species
http://www.biocarta.com

PID: a collection of curated pathways for human signaling and regulatory processes
http://pid.nci.nih.gov/

Pathguide: a meta-database providing an overview of all web-accessible biological pathway and network databases
http://www.pathguide.org/

REACTOME: a resource of curated human pathways
http://www.reactome.org/

MSigDB: a collection of disease-related gene sets
http://www.broadinstitute.org/gsea/msigdb/collections.jsp

Saturday, September 24, 2016

cancer network analysis, 2014 Leiserson et al, Nature genetics



Nature Genetics, 2014
Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes

Mark D M Leiserson, Fabio Vandin, Hsin-Ta Wu, Jason R Dobson, Jonathan V Eldridge, Jacob L Thomas, Alexandra Papoutsaki, Younhun Kim, Beifang Niu, Michael McLellan, Michael S Lawrence, Abel Gonzalez-Perez, David Tamborero, Yuwei Cheng, Gregory A Ryslik, Nuria Lopez-Bigas, Gad Getz, Li Ding & Benjamin J Raphael



"URLs. HI2012 interactome, http://interactome.dfci.harvard.edu/; HotNet2 pan-cancer analysis website, http://compbio.cs.brown.edu/pancancer/hotnet2/; RNA expression data used for the TCGA pan-cancer data set, https://www.synapse.org/#!Synapse:syn1734155; pan-cancer mutations with additional germline variant filtering, https://www.synapse.org/#!Synapse:syn1729383; HotNet2 software release, http://compbio.cs.brown.edu/software." 

Wednesday, June 22, 2016

install CCLE RPackage, version problems, fixed using binary files

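# Package names must be passed as one character vector;
# a second positional argument would be taken as the library path.
install.packages(c("ggplot2", "scales"))

install.packages(c("gplots", "RColorBrewer", "vioplot", "dplyr", "reshape2", "plyr"))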

'glmnet' not available for R 3.0.3
Fix this problem by installing the binary tar.gz file r-release: glmnet_2.0-5.tgz from 

install.packages("~/github/JZ_compound/CCLE/RPackage/DRANOVA_1.0.tar.gz", repos = NULL, type = "source")


install.packages("~/github/JZ_compound/CCLE/RPackage/CCLE.GDSC.compare_1.0.4.tar.gz", repos = NULL, type = "source")

OR

install.packages("~/github/JZ_compound/CCLE/RPackage/CCLE.GDSC.compare_1.0.4.tar.gz", repos = NULL, type = "source", dependencies = TRUE)




Tuesday, June 21, 2016

CCLE, cancer cell line encyclopedia

Cancer cell line encyclopedia

http://www.broadinstitute.org/ccle/data/browseData?conversationPropagation=begin

After login, there are "published" and "in process" data available for download. These data can also be sent to GenomeSpace (though it did not work for me).

CCLE R code
http://www.broadinstitute.org/ccle/Rpackage/

In the 2015 Nature publications, elastic net and ridge regression were used.

GenomeSpace:
https://gsui.genomespace.org/jsui/


http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse36139

Monday, April 18, 2016

***, compound profiling, initial thoughts

These compounds are 'newly designed and synthesized', and their targets need to be identified and verified.

PCA, then Euclidean distance, then MCL
PCA, then K-means
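
A rough sketch of the K-means half of "PCA, then K-means", in plain Python. It assumes the PCA step has already been run and each compound is represented by its first two component scores; the scores, the function name kmeans and the choice of starting centroids are all made up for illustration, not taken from any real profiling run.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd's algorithm; `centers` are the starting centroids."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster's centroid in place
        if new_centers == centers:
            break  # converged: centroids no longer move
        centers = new_centers
    return labels, centers

# Hypothetical (PC1, PC2) scores for six compounds.
scores = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(scores, centers=[scores[0], scores[3]])
print(labels)
```

With real data the starting centroids would normally be chosen at random over several restarts; fixed starts are used here only to keep the sketch deterministic.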

For comparison: Elastic net

Scoring criterion: Jaccard index against the original annotation.

https://en.wikipedia.org/wiki/Jaccard_index
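
A minimal sketch of the Jaccard index as a score, assuming each predicted cluster and each annotation group is simply a set of compound IDs. The IDs and the function name jaccard are hypothetical, and treating two empty sets as identical is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical compound IDs, for illustration only.
cluster = {"cpd1", "cpd2", "cpd3", "cpd4"}     # one predicted cluster
annotated = {"cpd2", "cpd3", "cpd4", "cpd5"}   # compounds sharing an annotated target

score = jaccard(cluster, annotated)  # 3 shared out of 5 distinct compounds
print(score)
```

One way to score a whole clustering would be to take, for each annotation group, the best Jaccard index over all clusters and average those values.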

Tuesday, December 22, 2015

COSMIC, Catalogue of Somatic Mutations in Cancer

It seems that the tumor is the positive sample, and that tumors come with matched negative controls. Comparing the positive and negative samples should reveal the mutations specific to the tumor.
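
The comparison can be sketched as a set difference, assuming variants have already been called in both samples and are written as (chromosome, position, reference allele, alternate allele) tuples. All values and the function name tumor_specific are invented for illustration; this is not how COSMIC itself is built.

```python
# Each variant is a (chromosome, position, reference allele, alternate allele) tuple.
# All values below are made up for illustration.
tumor_variants = {
    ("chr1", 1000, "A", "T"),
    ("chr7", 2500, "G", "A"),
    ("chr17", 4200, "C", "T"),
}
normal_variants = {
    ("chr1", 1000, "A", "T"),  # also present in the control, so not tumor-specific
}

def tumor_specific(tumor, normal):
    """Return variants seen in the tumor sample but not in the matched control."""
    return set(tumor) - set(normal)

somatic_candidates = tumor_specific(tumor_variants, normal_variants)
print(sorted(somatic_candidates))
```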



Sunday, December 20, 2015

Cancer genomics resource


McFarland, Sunyaev, Mirny, PNAS, 2013: Impact of deleterious passenger mutations on cancer progression.
Catalogue of Somatic Mutations in Cancer (COSMIC) and The Cancer Genome Atlas (TCGA). We classified them as driver and passenger mutation groups and then characterized their effects using PolyPhen, a tool widely used in population and medical genetics to predict the damaging effect of missense mutations (15).

Ref 15: Boyko AR, et al. (2008) Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4(5):e1000083.





Sunday, December 28, 2014

toread, interaction based discovery of cancer genes

Nucleic Acids Res. 2014 Feb;42(3):e18. doi: 10.1093/nar/gkt1305. Epub 2013 Dec 19.

Interaction-based discovery of functionally important genes in cancers.


http://www.ncbi.nlm.nih.gov/pubmed/24362839

Wednesday, December 17, 2014

toread, pan-cancer network, somatic mutations

Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes, Nature Genetics, 2014