Saturday, January 3, 2015

Parse single-mutant rows from rls.csv in Python

I wrote a Python 2.7 script that picks single-gene mutants based on set_genotype and writes them to a new csv file for analysis in R.

I first used 'SceORF_name.csv' to build a dictionary that maps both ORFs and gene names to ORFs.
(Python picked up two rows, MF(ALPHA)1 and MF(ALPHA)2, that lacked a comma. I fixed them by hand in nano.)
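
Rows like these can be caught before they break the lookup. Below is a minimal sketch, assuming Python 3 and a two-column ORF,NAME layout. `build_name_map` is a hypothetical helper, not part of the script in this post, and the sample ORFs and names are made-up placeholders.

```python
import csv
import io


def build_name_map(lines):
    """Return (name_to_orf, bad_rows) from an iterable of ORF,NAME lines.

    bad_rows collects (line_number, row) for rows that do not have
    exactly two non-empty fields, e.g. a row with a missing comma.
    """
    name_to_orf = {}
    bad_rows = []
    for line_no, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            bad_rows.append((line_no, row))
            continue
        orf, name = row[0].strip(), row[1].strip()
        if orf not in name_to_orf:
            name_to_orf[orf] = orf
            name_to_orf[name] = orf
    return name_to_orf, bad_rows


# Placeholder data: the second line has no comma.
sample = io.StringIO("ORF0001,GENEA\nORF0002 MF(ALPHA)1\nORF0003,GENEC\n")
name_to_orf, bad_rows = build_name_map(sample)
print(bad_rows)
```

Reporting the line number makes it easy to jump to the offending row in an editor.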

I then pick rows whose set_genotype has a single element that exists in SceORF_name. For the R analysis, I output ORF and NAME as the first two columns.
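
The selection rule can be sketched as a small function. This assumes Python 3, and `single_gene_orf` and the placeholder dictionary are hypothetical names used only for illustration. Unlike a split on the regex `\s+`, `str.split()` also ignores leading and trailing whitespace, so a padded genotype still counts as one element.

```python
def single_gene_orf(genotype, name_to_orf):
    """Return the ORF if genotype is exactly one known gene token, else None."""
    tokens = genotype.split()  # split on any run of whitespace
    if len(tokens) != 1:
        return None  # empty, or a multi-gene genotype
    # Lookup is done in upper case, so lower-case gene names still match.
    return name_to_orf.get(tokens[0].upper())


# Placeholder lookup table: both the ORF and the gene name map to the ORF.
lookup = {"ORF0001": "ORF0001", "GENEA": "ORF0001"}

print(single_gene_orf("genea", lookup))
print(single_gene_orf("genea geneb", lookup))
print(single_gene_orf("unknown", lookup))
```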

Trouble: 'rls.csv' stores the raw lifespans as quoted, comma-separated values. The quotes disappear after Python parsing, which corrupts the csv format on output.
Option 1: output in tab-delimited format. ==> Still has problems.
Option 2: add the quotes back to set_lifespans. ==> Seems to work fine in Excel, but not in R.
(There are 29839 rows in Excel, but R read 35K rows.)
Option 3: convert the csv to xlsx and use read.xlsx(). ==> Not enough memory to run read.xlsx().
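
The quoting problem is easy to reproduce on a tiny example. The sketch below assumes Python 3 and uses made-up sample values; only the column names come from rls.csv. It shows that `csv.reader` drops the quotes, that a plain `','.join` then yields too many fields, and that `csv.writer` puts the quotes back where they are needed.

```python
import csv
import io

# One header line and one data line; the lifespans field contains commas.
raw = 'id,set_lifespans,set_media\n1,"21,25,30",YPD\n'

rows = list(csv.reader(io.StringIO(raw)))
# The reader returns the field without its quotes: '21,25,30'

# Joining by hand loses the field boundary: 5 fields instead of 3.
naive = ",".join(rows[1])
print(naive)

# csv.writer quotes any field that contains the delimiter.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerows(rows)
round_trip = buf.getvalue()
print(round_trip)
```

Letting `csv.writer` decide what to quote covers every column, not only the two lifespan columns.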

11:43pm. While converting the xlsx to csv, I found many extra empty columns in the file. I copy-pasted the 34 columns into a new csv file. This time, loading it into R shows 29838 observations, which is correct.
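
Extra empty columns like these can be spotted without Excel by counting fields per row. A minimal sketch, assuming Python 3; `field_count_histogram` and `trim_to_width` are hypothetical helpers and the sample data is made up.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Count how many rows have each number of fields."""
    return Counter(len(row) for row in csv.reader(lines))


def trim_to_width(row, width):
    """Drop fields beyond `width`, but only if all of them are empty."""
    extra = row[width:]
    if any(field != "" for field in extra):
        raise ValueError("non-empty data beyond column %d: %r" % (width, extra))
    return row[:width]


# Placeholder data: the second data row carries two extra empty columns.
sample = "a,b,c\n1,2,3,,\n4,5,6\n"
hist = field_count_histogram(io.StringIO(sample))
print(hist)

cleaned = [trim_to_width(r, 3) for r in csv.reader(io.StringIO(sample))]
print(cleaned)
```

For the file in this post, the expected width would be 34 and every row should land in a single histogram bucket.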
File 'single_gene_mutants_frm_filerlscsv_20150103.csv' is the output file. (wrong format!)


The Python script is parse_rlscsv_20150103.py.

20150104: checking in R, I found wrong values in many columns. Perhaps the best way is to parse 'rls.csv' in R directly.

Option 4: remove the set_lifespans column.

#####################parse_rlscsv_20150103.py
import csv
import re

# Build a lookup that maps both ORF and gene NAME to the ORF.
FL1 = open('SceORF_name.csv', 'rb')
Dic = {}  # ORF -> ORF and NAME -> ORF
reader1 = csv.reader(FL1, dialect='excel', delimiter=',')
for row in reader1:
    ORF = row[0]
    NAME = row[1]
    if ORF not in Dic:
        Dic[ORF] = ORF
        Dic[NAME] = ORF
FL1.close()

outfile = open('single_gene_mutants_frm_filerlscsv_20150103.csv', 'w')
header = "ORF,NAME,id,experiments,set_name,set_strain,set_background,set_mating_type,set_locus_tag,set_genotype,set_media,set_temperature,set_lifespan_start_count,set_lifespan_count,set_lifespan_mean,set_lifespan_stdev,set_lifespans,ref_name,ref_strain,ref_background,ref_mating_type,ref_locus_tag,ref_genotype,ref_media,ref_temperature,ref_lifespan_start_count,ref_lifespan_count,ref_lifespan_mean,ref_lifespan_stdev,ref_lifespans,percent_change,ranksum_u,ranksum_p,pooled_by\n"
#header = header.replace(',', '\t')
outfile.write(header)

csvfile = open('rls.csv', 'rb')
reader = csv.reader(csvfile, dialect='excel', delimiter=',')

for row in reader:
    # row[7] is set_genotype; keep rows with exactly one element that is a known gene.
    # The header row of rls.csv is skipped because 'SET_GENOTYPE' is not in Dic.
    elements = re.split(r'\s+', row[7])
    current_name = elements[0].upper()
    if len(elements) == 1 and current_name in Dic:
        # csv.reader stripped the quotes; add them back to
        # set_lifespans (row[14]) and ref_lifespans (row[27]).
        # Note: any other field that contains a comma is still written unquoted.
        row[14] = "\"" + row[14] + "\""
        row[27] = "\"" + row[27] + "\""
        #outfile.write( Dic[current_name]+'\t'+current_name + '\t'+'\t'.join(row)+ '\n')
        outfile.write(Dic[current_name] + ',' + current_name + ',' + ','.join(row) + '\n')

csvfile.close()
outfile.close()

#####################end of parse_rlscsv_20150103.py


