Thursday, January 31, 2013

codeml on polymorphism data

  Question: codeml on polymorphism data for selection?? treating polymorphism as divergence? Using codeml on multiple genome data is common. Is this an abuse of codeml? 

Tuesday, January 29, 2013

*** Journals related to quantitative biology, mathematic biology, computational biology, aging

Journals related to quantitative biology, mathematic biology, computational biology, aging

-> Springer, Bulletin of Mathematical Biology impact factor 2.0

-> Mathematical Biosciences
E Voit (editor)

->eLife, computational biology, Chris Ponting

-> Elsiver Jounrnal of computational and applied mathematics, impact factor 1.1

-> Springer, Journal of Mathematical biology, impact factor 2-3.

> Springer, Quantitative biology (not included in PubMed)


>BMC systems biology, impact factor 3.1

>BMC Biology

->Aging cell, Impact factor 6
Open access, publication charge

-> Peer J

-> Journal of Aging Research. Gavirolv and Gavrilova's journal.

-> Journal of theoretic biology, impact factor 2.5 

Biology Open, BMC Biology, the four EMBO scientific publications, and the PLOS journals.


Royal society transactions

Aging, New York

FEMS Yeast
impact factor 2.4

Mechanism of aging and development, IF 3.2,
Open access is a choice

Experimental gerontology, IF 3.9

Eukaryotic cell, IF 3.6

HHMI journal

Thursday, January 24, 2013

Cilantro and genetics

Materials on cilantro and human genetics


Failed attempt to fix a mountain lion laptop that failed to start

Hold "alt-option" key when the laptop was starting, choose "the recovery disk".

Then choose Disk Utility to verify HDD. In this case, the HDD needs to be repaired.

Select "Macintosh HD" (750G),  Click "repair Disk":
... ... (this took about 1 mintues)
... inncorrect number of thread records ... 
.... (another 1 mintue)
repairing volume

"Disk Unitliy stopped repairing Macintosh HD". Backup as many files as possible, reformat the disk, and restore your backup files.

Choose my 'backup disk" (it is a backup disk from a different Mountain Lion laptop), drag "Macintosh HD" to "Destination", ..., click "Erase".

... ... 'copying blocks', estimated time 2 hours... ...

After 1.5 hours, restart computer, chose 'restore system from backup'.  "Searching for disks" is gray and seem to be stuck.

18:43, Unpluged the external HD and restart the laptop, choose reinstall Mountain Lion osX. It shows that the harddisk is locked.  I tried again, it said that installation information is not available.

Wednesday, January 23, 2013

Notes on Yang & dos Reis 2000 MBE branch-site model

Statistical Properties of the Branch-Site Test of Positive Selection, Ziheng Yang and Mario dos Rei, 2000, MBE

what is foreground w2 and background?
w2 in the foreground branch are often infinite and unreliable.

It is sometimesattemptedtousetheestimateofx2asameasure
of the strength of selection in the gene on the foreground
branch.As discussed by Bakewell et al. (2007) and Nozawaet al.
(2009a), point estimates of x2 produced by the codeml program
are often infinite and unreliable.


Nozawa M, Suzuki Y, Nei M. 2009a. Reliabilities of identifying
positive selection by the branch-site and the site-prediction
methods. Proc Natl Acad Sci U S A. 106:6700–6705.

Bakewell MA, Shi P, Zhang J. 2007. More genes underwent positive
selection in chimpanzee evolution than in human evolution.
Proc Natl Acad Sci U S A. 104:7489–7494.

Spelman Crossregistration site for AUC students

Tuesday, January 22, 2013

For Spelman Faculty, How to access research resources & title III form

Based on the instructions that I received, the steps are

 1. Go to Lotus Notes Dashboard

 2. Click "Category": Document Libraries, and "Applications": Research Resources.

3. Click "Open selected app

4. Click "By category" on the left column

5. Select Research Resources Forms or Title III Forms

6. Click on the appropriate form

Wednesday, January 16, 2013

Use unison to synchronize directories between two apple computers

I have a mac server (Mountain Lion). I installed unison in /opt/local/bin/unison by download the mac GUI Universal Binary 2.27.72, tested on Leopard.  (Later, I reinstalled to 2.32).

I then install the unison with GUI on my Snow Leopard laptop.  I set up a small test directory /Users/hongqin/demo/unison/. I 'touched' a file 'hello.ace.txt' in this directory. On my snow leopard, I double-clicked the "U" icon, set up a profile:
Profile name: test ace->tower
First root: File: /Users/hongqin/demo/unison/
Second root: Remote, User: YYY, Host:

After save this profile, I "open" it. Password was asked for the remote server.
10:14, looking for changes, waiting for changes from server. This seems to be stuck.

$ unison -version # 2.32
$ ssh unison -version #2.27
OK, unison have different version on the two computers.

Reinstall 2.32 on the server side, but the /opt/local/bin/unison was not updated automatically. I have to manually cp a 2.32 unison to the server directory.

On 'ace'
$ unison demo/unison/  ssh://
This worked.

I then tried
$ unison projects  ssh://

During the run of unison, I tried to run a GUI version for another job, but the server did not respond. It seems there is only one instance of unison running on the server side. Not sure whether this is my configuration problem or just the way it is.

Updated on 2013 Feb 12:
The ssh method often drops when working on large number of files. Later, I mounted the remote disks locally, and run unison locally. This method seems more reliable than ssh.

The unison log file 'unison.log' is often located in user's home directory.

Note: 'unison' developer has stopped working on it. It is time for a new software tool.

Updated on June 24, 2015

Some common bugs in bioinformatics programming and lessons in dealing with them

Many ad hoc scripts are written for bioinformatics projects.  I intend to list here some common bugs that I often made. This is going to be a working list. Hopefully, this can leads to better way of coding and managing.
  • Hard-coded links that were not updated after projects have evolved. This often lead to wrong files, or results are output to unintended directories. 
  • A hard-coded debugging variable that was not turned back during production runs. 
  • Mixing of variables names, file names, etc. 
  • Compatibility problems. This can occur after software upgrades. For example, after upgrade to perl 5.10.0, I have re-install bioperl to get previous codes working. 
  •  File format problems. With myriads of data format, this problem is going to keep bugging us. 
  • Typos 
  • logical mistakes, often occur in ifelse statement.
Some experiences and lessons on dealing with these problems:
  • Correct the blind spot. 
    I spent 3 hours in fixing a directory problem in a perl script for batch run. I noticed the job was not running in the right directory even when I chdir $homedir every step. The problem is that I copy-paste the directory twice in variable $homedir, so perl always choose the current directory by default. I was so sure that $homedir was correct because I copy-pasted it, and did not check it. I found this out when I copy-pasted the long directory again and found the length did not match. I spent 2 hours on this before 1am. I then decided to go sleep and look it fresh again in the morning. The fresh morning working energy helped me spotted this error.

  • Switch between different syntax
In perl, qw and string quotes use different syntax. In qw, no comma is needed.
 case 1: qw(results.H0.txt results.H1C.Gblocks.model1.txt results.H2C1S1.txt);
 case 2: qw(results.H0.txt, results.H1C.Gblocks.model1.txt, results.H2C1S1.txt);
In case 2, the file name will actually be treated as "results.H0.txt,". This extra comma is a obvious mistake.

Find out funding budget/department on Lotus Notes

At Lotus Note dash board, choose "Business & Financial Affairs", then choose "budget transfer", and "open selected App". 

A new page will appear with all budgets. I found it convenient to choose "By Department", and then look for the specific fund/department.

Change email signature on Lotus Notes

Under email, choose "More" -> "preference" . Then under "Mail", choose "Signature".

Fix bioperl linking problem on a Apple Snow Leopard laptop

Bioperl related library cannot be linked. The problem seems to be that perl 5.10.0 is installed on this laptop, but bioperl is still tied to 5.8.9.

Add the following line to perl script:
 BEGIN { unshift(@INC,"/opt/local/lib/perl5/site_perl/5.8.9"); }
This fixes the problem, although I need to add this line to every perl script that calls bioperl.

I also rebuilt bioperl and reinstalled it, which went under site_perl/5.10.0 directory. Hopefully, this will fix the bioperl links for all scripts. I tested,, and it indeed worked.

Gblocks command line usage for codon alignments

# For codon alignment
 Gblocks _tmp.aln -t=c

The filename does not have a switch.
-t for data type, 'c' for codon
-e for file extension, and '.gb' is probably for GBlocks.


Monday, January 14, 2013

Sharing internet connection through wireless on a Mountain Lion laptop

 On the Mountain Lion laptop (biolaptopmac2), I went to system preference, opened sharing, chose internet sharing through WiFi.  This laptop was connected to internet through an Ethernet cable.

On my iPad, I select the biolaptopmac2 wireless network, and the internet connection works.

Later, I figure to set up a password protected wireless network, I should do this: Open "sharing", click "Wi-Fi Options", specify a wireless network name, and choose WPA, input password twice. The passwords have to be longer enough to make "OK" highlighted. After "OK", choose "Internet Sharing".  This way, I can created a password-protected wireless network using the Mountain Lion laptop.


Saturday, January 12, 2013

Yeast MSH2 project in an undergraduate course

I will teach a section of BIO125 molecular biology and genomics in Spring semester 2013. This course is based on Gammie and Erdeniz "Characterization of pathogenic human MSH2 missense mutations using yeast as a model system: a laboratory in molecular biology", Cell Biology Education, 2004.  BIO125 is a project based course. We expect  21 students,  working in 7 groups, in each section. Each group will work on one msh2 mutant. Each yeast msh2 mutant is designed based on equivalent mutations on human hMSH2.

Growth media with selection are designed to be -HIS-TRP-LEU-THR-URA, in which  -HIS for the pRS413 plasmid that can carry MSH2 genes and their mutants, -TRP for pSH44 reporter pladmids with dinucleotide-URA3, -LEU-THR to induce expression of the GT16-URE3 chimera, and -URA to "ensure no instabilities of dinucleotide repeats occur before the assays are inititated" (a statement that I do not grasp at the first reading).

To evaluate the dinucleotide instability, GE04 used synthetic media -HIS-TRP-LEU-THR+5FOA. 5FOA is purchased from Toronto Research Chemicals, Inc. To evaluate single bp mutations, the media is -HIS-TRP-ARG+CAN, canavanine (Sigma).

GE04 said that yeast strains were grown in liquid synthetic medium -HIS-TRP overnight at 30C. For the reporter strain AGY75 which does not have pRS413, I think it must have been grown in +HIS media, because -HIS is used to select for the HIS3 marker on pRS413 , a yeast centromere vector. The genotype of AGY75 is MATa ade2-1 trp1-1 ura3-1, leu2-3,112 his3-11,15, msh2::LEU2 RAD5.   GE04 indicated that AGY75 is derived from W303 (see footnote of Table 2).  Due to the ade2 mutation, most colonies should be pink. Interestingly, yeast strains different msh2 mutations have different frequencies for white colonies, based on the previous experience in Bio125.

I want to connect mismatch repair stress with cell cycle, and oxidative stress response. So, comparison of cell cycle status, DHE and DHR staining maybe an interesting questions. I should go back to review Burhans's paper on replication stress and (chronological?) life span.

GE illustra puReTag Ready-to-go PCR beads is used in this class.

Some major items that I think will be needed include 8 set of replica blocks and velveteen squares (7 for students groups and 1 for instructor) available from;

msh2 mutant grow very slow in SD-Trp-His. It can take 2.5 day to reach OD=0.6 for FOA assays.

Yeast media from Sunrise:

Qty Item # Description Total Price
1709-300 SD-Trp Powder, 10 x 0.5 liter pouches
Price Each: $99.00
$99.00 Remove
1709-010 SD-Trp Powder, 10 grams
Price Each: $45.00
$45.00 Remove
1705-100 SD-His Powder, 100 Grams
Price Each: $76.00
$76.00 Remove
1715-500 SD-His-Trp Powder, 500 Grams
Price Each: $275.00
$275.00 Remove
1715-300 SD-His-Trp Powder, 10 x 0.5 liter pouches
Price Each: $99.00
$99.00 Remove

See also

MSH2 resources
Wikipedia entry

Critical thinking, is there a formal way to do this?

I will teach a new course this spring semester on critical thinking in biology.

According to The Foundation for Critical Thinking (FCT), "critical thinking is the art of analyzing and evaluating thinking with a view to improving it". "A well cultivated critical thinker 1) raises vital questions and problems, formulating them clearly and precisely; 2) gathers and assesses relevant information, using abstract ideas to interpret it effectively; 3) comes to well-reasoned conclusions and solutions, testing them against relevant criteria and standards; 4) thinks open mindedly within alternative systems of thought, recognizing and assessing, as need be, their assumptions, implications, and practical consequences; and 5) communicates effectively with others in figuring out solutions to complex problems."  In FCT's pamphlet, it states: "Critical thinking is, in short, self-directed, self-disciplined, self-monitored, and self-corrective thinking. It requires rigorous standards of excellence and mindful command of their use. It entails effective communication and problem solving abilities and a commitment to overcome our native egocentrism and sociocentrism. "

FCT generalizes elements of thought as: point of view, purpose, question at issue, information, interpretation and inference, concepts, assumptions, and implications. It describes intellectual standards as clarity, accuracy, relevance, depth, breadth, significance, and fairness.

FCT advocates that after training in critical thinking, students should achieve intellectual humility, intellectual autonomy, intellectual integrity, intellectual courage, intellectual perseverance, confidence in reason, intellectual empathy, and fairmindness.

As a trained scientist, I found the FCT materials quite useful, and gave me a new appreciation of critical thinking. FCT's  Critical thinking in everyday life: 9 strategies  really provide some practical solutions on this subject.

By Googling "critical thinking strategies" (which is really an irony here), I found teaching strategies to help promote critical thinking.  This site promotes a definition: Critical thinking is thinking that assesses itself. This definition can be actually linked back to a FTC's webpage, "structures for student self-assessment".  (It is funny that FTC teaching PDF materials did not mention this). I like this definition, because it emphasizes on the self-critique nature of critical thinking: people should be willing be admit and correct mistakes and adjust their own views. 

In the end, though, critical thinking is a skill that has to be learned by practice.  People have to learn swimming by actually doing it in the water, not through textbooks, social networking,  or clickering it, if a joke could be made here.

I need a list of potential projects with actual data for students to work on. One project could be the analysis of Olympics medals ~ GDP, population, etc. The 2012 London Olympics metal data is at
I can ask students for find out the GDP data.

Another data source is the World Bank Public Data:,d.eWU
This sites also provides plot for descriptive statistics.

The Google Public Data directory can be a place for students for look for data sets for their own research interests.

One of my college suggests to use the case of arsenic in apple juice,
FDA's test results are at

To facilitate with student projects, I would ask all of them to set up Dropbox folders and share with me. 

  1. The Foundation for Critical Thinking

A note on the learning behavior of college students in USA

A note on the learning behavior of Generation Y students

Many faculty at Colleges in USA may found students do not expect themselves to spent too much effort on studying.  Mark Taylor thinks the center-of-the-world rearing environment, child centralism in his words, plays major in shaping the behavior of NeXt students, if I understood his points correctly. Dr. Taylor just gave a wonderful talk at Spelman College on how to educate the NeXt generation. His views were well received, and he was applauded by hundreds of faculty. I have a different view on this. Most of the Asian kids are treated as center of the world in their family. In fact they are called "little emperors" in China. However Asian students are mostly better at compliance and meeting teacher's expectations at schools. So, the fact that American kids can get all the As without too much effort have to do with many other factors.  Dr. Taylor also acknowledges the cultural difference on performance: American culture's emphasis on talents versus Asian culture's emphasis on effort.  But the societal and cultural factors are probably more important than Dr. Taylor acknowledged. During my first year of teaching at Spelman College, I remembered a student with exemplary academic record was asked to share her experiences to first year students. She bragged about how little effort she spent on studying. At that time, I thought that was just due to a individual student's personality.  After a few years of seeing this again and again, I now have a whole different perspective on this issue.  Although many students of mine work really hard and do not appear to share this view of 'effortless success', I do see this in many students. If American kids are continually to be raised to think that they can compete with the rest of the world by talents but not efforts, big problems are yet to come.

Success is one percent aspiration and ninety-nine percent perspiration.  Students need to focus on the effort part.

It is wise for me to make it clear that all views are my own, here.


Updated on 2013 March 2.
I read a book comment by David Brook, titled "East vs. West: The Chinese superstudent vs. the American slacker".  The difference is that Westerners learn to understand and master the external world. Asians learn in order to cultivate virtues inside the self. The comment is based on a book by Jin Li. The article concludes,
I’d just note that cultures that do fuse the academic and the moral, like Confucianism or Jewish Torah study, produce these awesome motivation explosions. It might be possible to champion other moral/academic codes to boost motivation in places where it is absent.


Fix bluetooth mouse connection problem for a Apple Snow Leopard laptop

Bluetooth Setup  "Pairing Attempt Was Unsuccessful" for a wireless mouse to a Snow Leopard laptop. 

$ cd /Library/Preferences

Remove the old Bluetooth list.  
$ sudo mv
Now, it shows "No mouse found"

Shutdown and restart

Wonderful, the Bluetooth mouse now works even at log-in. 

2015 Jan 13. After update this laptop to Yosemite, the bluetooth mouse connection problem re-ocurred. I tried the same procedure. However, bluetooth mouse was still not connect at the start. It seems that bluetooth mouse does not even show up on the ddetected devices.



Friday, January 11, 2013

Seven steps to improve teaching, Mark Taylor

Mark Taylor, Seven steps to improve teaching

step 1. Improve student's future orientation.
      Helping students see themselves in the future improve their learning.
step 2: Identify class goals / link to student goals
step 3: Improve student understanding of class expectations
step 4: Move content learning out of class
step 5: Create the necessity of preparing for and attending class
     50% of the grade should be preparation, advocated by Taylor.
     Assignment should be appropriate for the learners.
step 6: Increase classroom learning activity and engagement
step 7: Improve assessments and accountability

For class management, choose a bell, a trumpet, or a gong to start and stop students.
Taylor used Pink-up (or power-up) get people engaged.  Taylor used paper, scissor, and rock to decide roles among teams.

Taylor emphasized Responsibility, Compliance, Activity, and Construction. To change student attitude, getting people do it can make people really care, suggested Taylor.

 Taylor's articles on teaching

gammie AGY and AG shipping list

Thursday, January 10, 2013

Fix Dropbox sync problem on a Leopard laptop

Dropbox is having sync problem on my Snow Leopad laptop. There is a triangle but the uploading is stuck.

I deleted ./dropboxcache and the problem persists.

Quit and restart Dropbox by 'open': Starting at 19:52:25 -> uploading up-traingle 19:54:10 -> still stuck.

19:57:23, $ mv

Download  Dropbox.1.6.13.dmg

20:00:00, Reinstall Dropbox
Icon appears at up-panel with a spinning double-arrow circle, "indexing, connecting, uploading 291 files .."
20:02:  it shows "Uploading 291 files ... "
By 20:10, it is still showing uploading 291 files

Maybe it is the 2G encrypted drive that makes the sync looks stuck. But the previous icon did not show a spining cirlc, it was flashing square). In any case, maybe it is working now. I decided to remove this 2G encryptic disk.

20:28, Uploading 479 files, indexing 53 files.

20:31, uploading 523 files, connecting ...

20:35, "Unlink" the laptop on  Dropbox website. (This seems to solves the problem)
20:37,  open
 Requesting link .. ...
 20:39 chose previous setting
 20:39, indexing 591 files, downloading file list
 20:42, Uploading 2694 files, indexing 14,771 files, downloading file list
 20:44, Uploading 4156 files, indexing 11,582 files, ...
 20:56, Uploading 241 files, indexing 8,590 files ... 

 Around 22:07, all of the files are finally synced.

21:13, I found file '.dropbox' in the $Home directory. Maybe I should examine this directory next time.

I realizing the slow Spelman network also plays a role here. The uploading speed is 1,288KB/sec. Because the encrypted disk is a 2G file, updating it takes very long time.

I spent 1.5 hours on fixing Dropbox sync problem.

Tuesday, January 8, 2013

A look at identical twins from the systems biology perspective

After I  listened to a BBC podcast about Poincaré and bifurcation in dynamic systems, I had an interesting question.  A key point of dynamic systems is the so-called butterfly effect - small changes in initial conditions can lead to large changes in the future. I thought about the identical twins in biology. They always look physically similar, presumably because structural development has a lot of constraints, its layout was set during the early stages of development, and there was not enough time for changes (noises) to propagate. The personality differences between the identical twins could be understood by the butterfly effect because neurological and emotional development took decades and allow environmental input signals to be amplified in the systems.  I do not have expert understanding of both developmental biology and dynamics systems, and I wonder what other factors and principles are behind this. This could be an interesting story for discussion in an introductory course on systems biology or developmental biology.

After I posted the initial paragraph, a friend reminded me of Stuart Kauffman's argument that complex systems ought to adapt to the border between chaos and order during evolution. Kauffman's argument indeed makes sense. In fact, if I can speak freely, bio-diversity might have root in the butterfly effect.  Although many species have similar sets of orthologous genes,  their phenotypic diversity could be explained by the different trajectories caused by the changes in the initial conditions such as the divergent genotypes. A good example maybe chimpanzees and humans: Small genotypic differences (initial conditions) are amplified to drastically different phenotypes in non-structural traits.



Friday, January 4, 2013

Compare tree topologies using the ape package in R

The following R code read a newick file with multiple trees, and calculate the topological distance between all pairs.


trees = read.tree("treesall.nwk") #list of 10
dist.topo( trees[[1]], trees[[10]])

d.tree = matrix( nrow=10, ncol=10 )
for ( i in 1:10 ) {
  for (j in i:10) {
    print ( c(i, j))
    d.tree[i,j] = dist.topo( trees[[i]], trees[[j]]  )


An output from the 10 tree for Bacillus essential genes are: 
> d.tree
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    2    2    4    6    2    4    4    4     4
 [2,]   NA    0    4    6    6    4    2    6    6     6
 [3,]   NA   NA    0    2    4    4    6    6    6     6
 [4,]   NA   NA   NA    0    2    4    6    6    6     8
 [5,]   NA   NA   NA   NA    0    6    6    8    8    10
 [6,]   NA   NA   NA   NA   NA    0    2    4    4     6
 [7,]   NA   NA   NA   NA   NA   NA    0    6    6     8
 [8,]   NA   NA   NA   NA   NA   NA   NA    0    2     8
 [9,]   NA   NA   NA   NA   NA   NA   NA   NA    0     8
[10,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     0

CONSEL result interpretation and reformat using perl

From "program.pdf" in CONSEL package,

2.4 catpv
The p-values calculated by consel are stored in the pv ¯le, and its contents are
shown by the command catpv. For example,

catpv foo

reads foo.pv and shows the contents like below.

# reading foo.pv
# rank item obs au np | bp kh sh wkh wsh |
# 1 1 -2.7 0.789 0.575 | 0.579 0.639 0.944 0.639 0.948 |

# 2 3 2.7 0.516 0.318 | 0.312 0.361 0.799 0.361 0.791 |
# 3 2 7.4 0.114 0.037 | 0.036 0.122 0.575 0.122 0.422 |
# 4 5 17.6 0.076 0.014 | 0.013 0.044 0.178 0.044 0.210 |

'au' The p-value of the approximately unbiased test. This is the main result of
consel, while rest of the p-values in the table are supplementary. It is derived
from the theory of the signed distance and the curvature of the boundary
as explained later. Let us denote it AUi for the item i. We may reject the
possibility that item i has the largest expected value of Yi when AUi < 0.05
at the significance level 0.05. 

 In its foo example, the trees are ranked based on Y_i test statistic. Difference of Y_i, \delta Y_i is shown in the report. I noticed that au p-value are often the highest for the top ranked tree. When a tree's au pvalue is less than 0.05, this tree can be considered as rejected.  Y_i statistic and p-value orders can give different orders, and this can be see in the foo example.

The following perl code can convert CONSEL report to a tab-delimited file for R.

use strict; use warnings;
use lib '/Users/hongqin/lib/perl'; use Util;

my $debug= 2 ;

my $inclusterfile = '_test.csv';
my $sourcedir = '~coat/ortholog.analysis/essetial.mcl.rbh.090208/bacillus.phylip.I3.1/wkdir';

open (IN,"<$inclusterfile"); my @lines = <IN>; close (IN); shift @lines;
my @clusterdirs = ();
foreach my $line (@lines) {
  chomp $line;
  my ($num, $bg, @res) = split( /\t/, $line);
  print "[$num][$bg]------";
  push (@clusterdirs, $bg);

open ( OUT, ">" );
print OUT "BG\trank\titem\tobs\tau\tnp\tbp\tpp\tkh\tsh\twkh\twsh\n";

my $count = 0;
foreach my $mydir ( @clusterdirs ) {
  chomp $mydir;
  if ( !( -e "wkdir/$mydir/$" ) ) {
    print "Error: $mydir has no consel report";
  } else {
    $count ++; print "Cluster:$count I am now working on [$mydir]::\n";
      open (IN, "<wkdir/$mydir/$"); @lines = <IN>; close (IN);
      shift @lines; shift @lines; shift @lines;  #remove the first three lines
      pop @lines; #remove the last empty line
      foreach my $line (@lines) {
         $line =~ s/^#\s+//o;
         $line =~ s/\|//g;
         chomp $line;
         my @els = split( /\s+/, $line);
         print OUT $mydir;
         foreach my $el (@els) { print OUT "\t$el"; }
         print OUT "\n";    
      }  }

close (OUT);


Wednesday, January 2, 2013

First try to enumerate truncated trees for bacilli

import warnings
import re
from Bio import Phylo

stree = "((Bha,Bcl), (Bpu,(Bli,(Bam,(Bmo,Bsu)))),  (Bwe,(Bce,(Ban,Bth))))"
stree2 = re.sub( r'[\(\)\s+]', '', stree)
taxa = re.split( ',', stree2)
outgrouptaxa = taxa[0:2]
bsutaxa = taxa[2:7]
bcetaxa = taxa[7:12]
bacillustaxa = bsutaxa + bcetaxa

#tmpfl = "/tmp/_tmpfile2013Jan2.nwk"
#tmpFH = open( tmpfl, 'w')
#tmpFH.write( stree + ";\n" )
#mastertree =, 'newick')

def lookup_by_names(tree):
    names = {}
    for clade in tree.get_terminals():
            if in names:
                raise ValueError("Duplicate key: %s" %
            names[] = clade
    return names

from StringIO import StringIO
#mastertree =, 'newick')
#names = lookup_by_names(mastertree)

#single leaf-node deletion
for node in bacillustaxa:
    tree =, 'newick')
    names = lookup_by_names(tree)
    Phylo.write( tree, 'sandbox/-' + node + '.nwk', 'newick')

#double leaf node deletion
for i in range(len(bacillustaxa)):
    for j in range( (i+1), len(bacillustaxa)):
        print i, j
        tree =, 'newick')
        names = lookup_by_names(tree)
        Phylo.write( tree, 'sandbox/_' + bacillustaxa[i] + bacillustaxa[j]+'.nwk', 'newick')

$ pwd

This is not bad for a one-hour coding on a first try. I should try to write multiple trees in a single file. I should try a generic loop or more likely, a recursive call.