Thursday, September 30, 2021

Wednesday, September 29, 2021

cpsc4180 midterm Q&A

 In Zoom breakout rooms, I went over student projects one on one. 



Monday, September 27, 2021

cpsc4180 Q&A midterm projects

 == pre-class to do: 

calendar email invitation: including guests;  done. 

socrative questions (midterm exam, questions on contents from last lecture ). 

update Canvas course materials, update learning objectives. assignments as needed.  done. 

Test-run code: Rmd -> HTML report with content.  not today

learning objectives:  not today

== In-class to do: 

clean up desktop space, calendars, 

announce midterm surveys. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

go over student problems, go over final projects, add sample student project presentation videos. 

need to generate final project sign up sheets. 



Sunday, September 26, 2021

validation for deep learning


It seems that "validation data sets" may be used in different ways in practice.  

https://stackoverflow.com/questions/46308374/what-is-validation-data-used-for-in-a-keras-sequential-model


Qin: 

See 

https://www.tensorflow.org/guide/keras/train_and_evaluate#using_a_validation_dataset

model.fit(train_dataset, epochs=1, validation_data=val_dataset)

Thanks,

 

From TP: 

"After the meeting I wasn't 100% satisfied with our explanation of what the validation set is used for. I realized if we train using the training set, then applying the loss of the validation set to the training set is useless.

 

I found two articles to this question which sum up the answer very well:

 

 

To summarize,

 

You use the validation set to determine how well your model is learning during training. It is mostly used for hyperparameter training as you can retrain the model with different parameters and see how it compares. The idea is that it is also trained on so you can see how fast the model picks it up.

 

Overall though, we would use the Test set at the very end to gauge the accuracy of the model on completely new data it's never seen before.

 

To me, this seems like it can be done with the training set alone, however I understand the concept to just check a small subset of the training data to see how quickly the model will learn it. Since it isn't too difficult, I will incorporate this into the models and try to add some graphs to chart the training. This way, I can do some hyperparameter tuning once the transfer learning is set up and working.

 "

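A minimal sketch of the three-way split discussed above, using only the Python standard library. The function names, the 70/15/15 proportions, and the loss values are hypothetical and chosen for illustration. Note that in Keras, data passed as validation_data to model.fit is used only to report loss and metrics at the end of each epoch; the model is not trained on it. That is why the validation set can guide hyperparameter choices, while the test set is held back for a final, one-time estimate.

```python
import random

def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle 0..n-1 and split into disjoint train, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def pick_best(val_losses):
    """Return the hyperparameter setting with the lowest validation loss."""
    return min(val_losses, key=val_losses.get)

train, val, test = split_indices(100)
print(len(train), len(val), len(test))

# Hypothetical validation losses from retraining with different learning rates.
# The test set is not consulted here; it is used once, after this choice is made.
print(pick_best({"lr=0.1": 0.52, "lr=0.01": 0.41, "lr=0.001": 0.47}))
```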
Wednesday, September 22, 2021

cpsc4180 9/22 Colab, permutation-Zscore, midterm exam

== pre-class to do: 

calendar email invitation: including guests;  done. 

socrative questions (midterm exam, questions on contents from last lecture ).  done. 

update Canvas course materials, update learning objectives. assignments as needed.  done. 

Test-run code: Rmd -> HTML report with content.  done. 

learning objectives:  done. 

== In-class to do: 

clean up desktop space, calendars, 

announce midterm surveys. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

Redo CoLab (without GoogleDrive) with R

Graph permutation, part 2 on Z-score calculation. 

Midterm exams, sharing tips



Monday, September 20, 2021

ms02 randomness verification

 

use network with known theoretical random permutation


This direction might be too theoretical and have very little practical importance. 


single cell multiplexed and image and proteomic data

Automated assignment of cell identity from single-cell multiplexed imaging and proteomic data

Geuenich Michael; Hou Jinyu; Lee Sunyun; Ayub Shanza;  Jackson Hartland;  Campbell Kieran 


https://zenodo.org/record/5156049#.YUjywGZKjzc


cpsc4180 9/20 data science, CoLab, graph permutation

== pre-class to do: 

calendar email invitation: including guests;  done. 

socrative questions (midterm exam, questions on contents from last lecture ).  done. 

update Canvas course materials, update learning objectives. assignments as needed.  done. 

Test-run code: Rmd -> HTML report with content.  done. 

learning objectives:  done. 

== In-class to do: 

clean up desktop space, calendars, 

announce midterm surveys. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

In the 48 continental states example, I found that California and Florida are always neighbors in the permuted networks. So the power-law degree distribution seems to confine the nodes to a limited search space. 
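One way to probe the observation above is to count how often a given pair of nodes stays adjacent across many permuted graphs, alongside the permutation Z-score from class. The sketch below assumes the permutation is a degree-preserving double-edge swap; the notes do not say which scheme was used. The toy graph, the labels, and all function names are hypothetical.

```python
import random
import statistics
from collections import Counter

def degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)

def permute_graph(edges, n_swaps, rng):
    """Rewire a simple undirected graph by double-edge swaps, keeping all degrees."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or change nothing
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def same_label_edges(edges, label):
    """Statistic of interest: number of edges joining two nodes with the same label."""
    return sum(1 for u, v in edges if label[u] == label[v])

def permutation_zscore(edges, label, n_perm=200, seed=1):
    """Z-score of the observed statistic against the permuted-graph null distribution."""
    rng = random.Random(seed)
    observed = same_label_edges(edges, label)
    null = [same_label_edges(permute_graph(edges, 10 * len(edges), rng), label)
            for _ in range(n_perm)]
    sd = statistics.pstdev(null)
    z = (observed - statistics.mean(null)) / sd if sd > 0 else float("nan")
    return observed, z

def adjacency_rate(edges, u, v, n_perm=200, seed=1):
    """Fraction of permuted graphs in which nodes u and v are neighbors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        permuted = permute_graph(edges, 10 * len(edges), rng)
        hits += any({a, b} == {u, v} for a, b in permuted)
    return hits / n_perm

# Toy graph: two triangles joined by one bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
label = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
observed, z = permutation_zscore(edges, label)
print("observed:", observed, "z:", z)
print("rate that 2 and 3 stay neighbors:", adjacency_rate(edges, 2, 3))
```

An adjacency rate close to 1 for a pair of nodes would suggest that the degree constraints leave few valid rewirings for that pair, which is one possible reading of the California and Florida result.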



Sunday, September 19, 2021

ssh -x ecs323gpustation

 ssh -x user@ecs323gpustation

gs *pdf

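An X window popped up on the local computer, displaying the remote gs output. Note: uppercase -X enables X11 forwarding; lowercase -x disables it. 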

Thursday, September 16, 2021

Wednesday, September 15, 2021

cpsc4180 9/15 simple statistic with us election results

== pre-class to do: 

calendar email invitation: including guests; done

socrative questions (midterm exam, questions on contents from last lecture ). done

update Canvas course materials, update learning objectives. assignments as needed. done. 

Test-run code: Rmd -> HTML report with content. done

learning objectives: done

== In-class to do: 

clean up desktop space, calendars, 

announce midterm surveys. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 


  CoLab and Google Drive for midterm projects. 

UTC MS thesis

Dickerson, Jessica <Jessica-Dickerson@utc.edu>

The requirements to finish your thesis are to have six credits of satisfactory progress and to defend your thesis. If you are at the six-credit limit, then you need to continue registering for thesis credit hours until you defend your thesis. Note that you need to register for at least two credit hours of thesis in the semester in which you defend. 

As for evaluating your thesis credits, the evaluation criteria are decided by your thesis advisor. So I would encourage you to discuss this with your advisor and see what the final deliverables are for a satisfactory progress grade. 


Monday, September 13, 2021

COVID19 accounts

 

https://twitter.com/FenixAmmunition



CPSC 4180 R, input output

   == pre-class to do: 

calendar email invitation: including faculty peer evaluators.  done

socrative questions (midterm exam, questions on contents from last lecture ).  done. 

update Canvas course materials, update learning objectives. assignments as needed.  done

Test-run code: Rmd -> HTML report with content.  done

== In-class to do: 

clean up desktop space, calendars, 

announce midterm surveys. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

 input, output. 

  CoLab and Google Drive for midterm projects. 

Saturday, September 11, 2021

Thursday, September 9, 2021

mummer, mummerplot ecs323gpu



mummer -maxmatch -n -l 100 ratg13.fasta prC31.fasta > ratg13-prc31.mumm

mummer -maxmatch -n -l 50 cov-fasta/ncbi-ref.fasta cov-fasta/ratg13.fasta > output/ncbiref-ratg13-09091457.mumm


mummerplot output/ncbiref-ratg13-09091457.mumm

mummerplot -x "[0,32000]" -y "[0,32000]" --png   output/ncbiref-ratg13-09091457.mumm



JC Question: Do mutation hotspots cover spike and Mpro regions? 

Based on NCBI NC045512: 
S gene is 21563 - 25384, which contains a large gap. 
3CL nsp5A is 10055 - 10972.  The maxmatch results are: 
 ratg   wuhan  length
 9227    9339      57
10487   10615      77
10661   10791      63
So,  the beginning and middle sections of 3CL are mutation hotspots too. 
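The reasoning above can be sketched as an interval calculation: stretches of a gene that no exact match covers are where the two genomes differ within every window of the minimum match length. This is a rough sketch under stated assumptions. It takes the second column as the NC045512 coordinate, as the column header in the note suggests, and the third column as the match length. It uses only the three rows recorded here, so an uncovered stretch at the end of the gene may simply reflect matches that were not copied into the note. The function name is hypothetical.

```python
def uncovered(region, matches):
    """Return the inclusive sub-intervals of region not covered by any exact match.

    region:  (start, end), inclusive coordinates
    matches: iterable of (start, length) in the same coordinate system
    """
    start, end = region
    gaps, cursor = [], start
    for m_start, m_len in sorted(matches):
        m_end = m_start + m_len - 1
        if m_end < cursor or m_start > end:
            continue  # match lies entirely outside the part still to be scanned
        if m_start > cursor:
            gaps.append((cursor, m_start - 1))
        cursor = max(cursor, m_end + 1)
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps

# 3CL coordinates and match rows as recorded in the note (assumed Wuhan column, length).
three_cl = (10055, 10972)
matches = [(9339, 57), (10615, 77), (10791, 63)]
for gap in uncovered(three_cl, matches):
    print(gap)
```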

MUMMER installation ecs323gpu workstation

 under root

sudo apt install autoconf automake libtool

install yaggo. 

then download release tarball. 

./configure --prefix=/opt/mummer

make

sudo make install

Wednesday, September 8, 2021

Student project,


RW
I uploaded the notebook I'm working with to my own github. I can also add it to the one you shared with me in our initial meeting.

https://github.com/rwedell/covid/blob/main/COVID-19%20Variant%20Distribution.ipynb

CPSC 4180 Sep 8, R coding

  == pre-class to do: 

calendar email invitation: including faculty peer evaluators. done. 

socrative questions (rbind, cbind, merge, questions on contents from last lecture ). done. 

update Canvas course materials, update learning objectives. assignments as needed. done

Test-run code: Rmd -> HTML report with content. done

== In-class to do: 

clean up desktop space, calendars, 

announce midterm projects. 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

 R coding

slides, 

make solutions

take even

Monday, September 6, 2021

GISAID tracking variant

 

https://www.gisaid.org/hcov19-variants/

Alpha peaked and then dropped. 

Delta increasing and accelerating. 



Thursday, September 2, 2021

Landon REU 2021 GISAID resampling

 

https://rpubs.com/landon2000/790207


GISAID lineage information

From: 

https://www.gisaid.org/references/statements-clarifications/clade-and-lineage-nomenclature-aids-in-genomic-epidemiology-of-active-hcov-19-viruses/

Clade and lineage nomenclature aids in genomic epidemiology studies of active hCoV-19 viruses

Due to the naturally expanding genetic diversity of hCoV-19 viruses, GISAID introduced a nomenclature system for major clades, developed by Sebastian Maurer-Stroh et al, based on marker mutations within 8 high-level phylogenetic groupings from the early split of S and L, to the further evolution of L into V and G, and later of G into GH, GR and GV, and more recently GR into GRY.

GISAID clades are augmented with more detailed lineages assigned by the Phylogenetic Assignment of Named Global Outbreak LINeages (Pango lineage) tool, aiding in the understanding of patterns and determinants of the global spread of the pandemic strain causing COVID-19. A third effort uses a Year-Letter nomenclature to facilitate discussion of large-scale diversity patterns of hCoV-19 and label clades that persist for at least several months and have significant geographic spread. 



The list of the marker variants is as follows:

   S: C8782T,T28144C includes NS8-L84S
   L: C241,C3037,A23403,C8782,G11083,G26144,T28144 (early clade markers in WIV04-reference sequence)
   V: G11083T,G26144T NSP6-L37F + NS3-G251V
   G: C241T,C3037T,A23403G includes S-D614G
   GK: C241T,C3037T,A23403G,C22995A S-D614G + S-T478K
   GH: C241T,C3037T,A23403G,G25563T includes S-D614G + NS3-Q57H
   GR: C241T,C3037T,A23403G,G28882A includes S-D614G + N-G204R
   GV: C241T,C3037T,A23403G,C22227T includes S-D614G + S-A222V
   GRY: C241T,C3037T,21765-21770del,21991-21993del,A23063T,A23403G,G28882A includes S-H69del, S-V70del, S-Y144del, S-N501Y + S-D614G + N-G204R
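The marker list above can be turned into a simple lookup. The sketch below is illustrative only and is not GISAID's actual assignment procedure. It assumes that a sample's mutations are given as strings in the same notation as the list, and that the most specific clade (the one with the most markers, all present) wins. Clade L is left out because its markers are the reference alleles rather than mutations. The function name and the example mutation sets are hypothetical.

```python
# Marker mutations per clade, copied from the list above (clade L omitted).
G_CORE = {"C241T", "C3037T", "A23403G"}
CLADE_MARKERS = {
    "S": {"C8782T", "T28144C"},
    "V": {"G11083T", "G26144T"},
    "G": set(G_CORE),
    "GK": G_CORE | {"C22995A"},
    "GH": G_CORE | {"G25563T"},
    "GR": G_CORE | {"G28882A"},
    "GV": G_CORE | {"C22227T"},
    "GRY": G_CORE | {"21765-21770del", "21991-21993del", "A23063T", "G28882A"},
}

def assign_clade(mutations):
    """Return the clade with the most markers that are all present, or None."""
    mutations = set(mutations)
    hits = [clade for clade, markers in CLADE_MARKERS.items() if markers <= mutations]
    if not hits:
        return None
    # Ties (e.g. a sample carrying both GH and GR markers) resolve to the
    # first clade in dictionary order; a real tool would need an explicit rule.
    return max(hits, key=lambda clade: len(CLADE_MARKERS[clade]))

print(assign_clade({"C241T", "C3037T", "A23403G", "G28882A"}))
print(assign_clade({"C8782T", "T28144C", "A100G"}))
print(assign_clade({"A100G"}))
```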




SARS-COV-1 info

SARS coronavirus Tor2, complete genome

https://www.ncbi.nlm.nih.gov/nuccore/NC_004718.3 

https://www.ncbi.nlm.nih.gov/assembly/GCF_000864885.1/?&utm_source=None

Tor2 is the Toronto strain. Urbani strain is the Asian strain. 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1125963/

"The differences between the two strains turn out to be minor. Both comprise about 30000 nucleotides, making the genome of SARS-CoV the largest of any RNA virus. It is possible but unlikely that the differences are a result of sequencing errors."

"The structural differences from other coronaviruses, and the lack of evidence of recombination, suggest that the SARS virus is not a result of other viruses swapping DNA with a previously benign coronavirus that already lived unnoticed in humans."

"Rather, the researchers say, the evidence indicates that SARS is genuinely new in humans and until recently inhabited an unknown animal species, probably in Guangdong province, China."

Wednesday, September 1, 2021

CPSC 4180 data science Sep 1, weather + COVID19

 == pre-class to do: 

calendar email invitation: including faculty peer evaluators. done

socrative questions (rbind, cbind, merge, questions on contents from last lecture ): done. 

update Canvas course materials, update learning objectives. assignments as needed: done

Test-run code: Rmd -> HTML report with content. done

== In-class to do: 

clean up desktop space, calendars, 

ZOOM, live transcript (start video recording).  Turn computer speaker on. 

Socrative sign in 

   Review Chapter 3

R-COVID19  Chapter 4.  weather.  + socrative questions. 



yeast quantitative genetics cross study

 


---

title: "yeast power study"

author: "H Qin"

date: "8/31/2021"

output:

  pdf_document: default

  html_document: default

---


```{r simulate-genotypes}

rm(list=ls())

N = 150

nuc_means = rpois(10, lambda=10)

mit_means = rpois(15, lambda=10)

summary(nuc_means)

summary(mit_means)


b0= 0

b1= 1 # mito influence on phenotype

b2= 1 # nuclear influence on phenotype

#b3 = 0.2  # mit X nuc interaction influence on phenotype, p << 0.001 (alternative setting, not active)

b3 =0.1  # mit X nuc interaction influence on phenotype, p=0.049

#b3 = 0.05 # p = 0.3

```


```{r simulate-phenotype}

debug = 0


phenotype_mit_nuc = function(b0, b1, b2, b3, mit_single_mean, nuc_single_mean, debug){

  y = b0 + b1*mit_single_mean + b2*nuc_single_mean + b3*mit_single_mean * nuc_single_mean

  if (debug > 0) {

    print( paste("pmn:: mit_single_mean =", mit_single_mean, "nuc_single_mean", nuc_single_mean) )

  }

  return (y)

}


nuc_genotypes = sample(1:10, N, replace=TRUE)

mit_genotypes = sample(1:15, N, replace=TRUE)

y = numeric(N)  # preallocate the phenotype vector; filled in by the loop below

for ( i in 1:N ){

  #print(paste("i:", i, "mit_genotypes[i]",mit_genotypes[i] ))

  y[i] = phenotype_mit_nuc(b0, b1, b2, b3, mit_means[mit_genotypes[i]], nuc_means[nuc_genotypes[i]], debug=0) + rnorm(1)

}  


tb = data.frame( cbind( y, mit_genotypes, nuc_genotypes)) 

tb$mit_genotypes = factor( tb$mit_genotypes)

tb$nuc_genotypes = factor( tb$nuc_genotypes)

summary(tb)

```


```{r}

library(nlme);


m1a = glm(y ~ mit_genotypes  , tb, family='gaussian');


m2 = glm(y ~ mit_genotypes + nuc_genotypes , tb, family='gaussian');


m3 =  glm( y ~ mit_genotypes + nuc_genotypes + mit_genotypes:nuc_genotypes, data=tb)

```


```{r}

#summary(m1a)

```


```{r}

anova( m1a, m2, test='F')

```


```{r}

summary(m2)

summary(m3)

anova(m2, m3, test='F')

```