Monday, April 18, 2022

Parse yeast PPIs from BioGRID 4.4.208

Code at https://github.com/QinLab/Biogrid-Qin2022


Download BIOGRID-ALL-4.4.208.tab3.zip and unzip it.

Write Python code to parse out the yeast entries

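import pandas as pd
# Load the unzipped BioGRID tab3 file (tab-separated, header in the first row)
myfile = 'data-large-unsynced/BIOGRID-ALL-4.4.208.tab3.txt'
df = pd.read_csv(myfile, sep='\t', header=0)
# Keep only rows where both interactors are yeast
df = df[df['Organism Name Interactor A'].str.contains('Saccharomyces cerevisiae')]
df = df[df['Organism Name Interactor B'].str.contains('Saccharomyces cerevisiae')]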
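
The two-step filter can be checked on a tiny table before running it on the full download. This is a minimal sketch, assuming a hypothetical toy table (`toy`) that has only the two organism columns; the rows are made up for illustration. Passing na=False is an extra safeguard (not in the code above) so that missing organism names count as non-matches.

```python
import pandas as pd

# Hypothetical toy rows; only the two organism column names mimic the tab3 header.
toy = pd.DataFrame({
    'Organism Name Interactor A': [
        'Saccharomyces cerevisiae (S288c)',
        'Homo sapiens',
        'Saccharomyces cerevisiae (S288c)',
    ],
    'Organism Name Interactor B': [
        'Saccharomyces cerevisiae (S288c)',
        'Homo sapiens',
        'Homo sapiens',
    ],
})

# Build one boolean mask per column; na=False treats missing values as non-matches.
is_yeast_a = toy['Organism Name Interactor A'].str.contains('Saccharomyces cerevisiae', na=False)
is_yeast_b = toy['Organism Name Interactor B'].str.contains('Saccharomyces cerevisiae', na=False)

# Keep rows where both interactors are yeast (same result as filtering twice).
yeast_only = toy[is_yeast_a & is_yeast_b]
print(len(yeast_only))
```

Only the first toy row survives, because the third row pairs a yeast protein with a human one.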

Remove duplicated interactions

def alphabetic_ordered_tag(in_tag1, in_tag2):
    # Sort the two names so that A_B and B_A give the same tag
    tmp = sorted([str(in_tag1), str(in_tag2)])
    return tmp[0] + "_" + tmp[1]
df['alphabetic_ordered_tag'] = df.apply(lambda x: alphabetic_ordered_tag(x['Systematic Name Interactor A'], x['Systematic Name Interactor B']), axis=1)
# Keep the first row for each unordered pair
df2 = df.drop_duplicates(subset=['alphabetic_ordered_tag'])
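
To see what an order-independent tag does, here is a minimal sketch on a toy table. The helper name `ordered_pair_tag`, the `tag` column, and the three rows are assumptions for illustration only; the idea is that A_B and B_A must collapse to one tag before drop_duplicates can treat them as the same interaction.

```python
import pandas as pd

def ordered_pair_tag(name_a, name_b):
    """Return a tag that is identical for (a, b) and (b, a)."""
    return "_".join(sorted([str(name_a), str(name_b)]))

# Hypothetical toy rows: the same pair in both orders, plus one self-interaction.
toy = pd.DataFrame({
    'Systematic Name Interactor A': ['YAL001C', 'YBR002W', 'YAL001C'],
    'Systematic Name Interactor B': ['YBR002W', 'YAL001C', 'YAL001C'],
})

toy['tag'] = [
    ordered_pair_tag(a, b)
    for a, b in zip(toy['Systematic Name Interactor A'],
                    toy['Systematic Name Interactor B'])
]

# drop_duplicates keeps the first row seen for each tag.
dedup = toy.drop_duplicates(subset=['tag'])
print(dedup['tag'].tolist())
```

The first two toy rows share one tag, so only one of them is kept; the self-interaction has its own tag and stays.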

Output a lean version for a small file size

df3 = df2[['Systematic Name Interactor A', 'Systematic Name Interactor B', 'Official Symbol Interactor A', 'Official Symbol Interactor B', 'alphabetic_ordered_tag' ]]
df3.to_csv("biogrid_s288c_4.4.208.lean.csv")
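
One detail when reading the lean file back: to_csv writes the row index as an unnamed first column by default, so read_csv needs index_col=0 to avoid an extra "Unnamed: 0" column. A minimal sketch of the round trip, assuming a hypothetical one-row table and an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Hypothetical one-row table with a non-trivial index label, as left over after filtering.
toy = pd.DataFrame(
    {
        'Systematic Name Interactor A': ['YAL001C'],
        'Systematic Name Interactor B': ['YBR002W'],
    },
    index=[7],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an extra column.
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```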

Output a dictionary from systematic names to symbols. 

dicA = df3[['Systematic Name Interactor A', 'Official Symbol Interactor A']]
dicB = df3[['Systematic Name Interactor B', 'Official Symbol Interactor B']]
dicA.columns = ['Name', 'Symbol']
dicB.columns = ['Name', 'Symbol']
dic = pd.concat([dicA, dicB])
dic2 = dic.drop_duplicates(subset=['Name', 'Symbol'])
dic2.to_csv("Sce_Name2Symbol.csv")
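
For lookups inside Python, the Name/Symbol table can be turned into a plain dict. A minimal sketch, assuming a hypothetical toy table with the same two column names; the rows are illustrative only. If one name ever maps to more than one symbol, dict(zip(...)) keeps the last one seen.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated Name/Symbol table.
toy = pd.DataFrame({
    'Name': ['YAL001C', 'YBR002W', 'YAL001C'],
    'Symbol': ['TFC3', 'RER2', 'TFC3'],
})

# Remove repeated (Name, Symbol) pairs, then build the lookup.
unique_pairs = toy.drop_duplicates(subset=['Name', 'Symbol'])
name2symbol = dict(zip(unique_pairs['Name'], unique_pairs['Symbol']))

print(name2symbol.get('YAL001C'))
```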

A total of 627732 interactions and 6155 unique names/symbols were found in the S288C BioGRID data set.

Note: Self-interactions were included. 
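
If self-interactions need to be counted or excluded later, comparing the two name columns is enough. A minimal sketch on a hypothetical toy table (the rows and the variable names `is_self` and `no_self` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy rows: one ordinary pair and one self-interaction.
toy = pd.DataFrame({
    'Systematic Name Interactor A': ['YAL001C', 'YAL001C'],
    'Systematic Name Interactor B': ['YBR002W', 'YAL001C'],
})

# A row is a self-interaction when both interactors have the same systematic name.
is_self = toy['Systematic Name Interactor A'] == toy['Systematic Name Interactor B']
print(int(is_self.sum()))

# Optional: a copy of the table without self-interactions.
no_self = toy[~is_self]
```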








