Code is at https://github.com/QinLab/Biogrid-Qin2022
Download BIOGRID-ALL-4.4.208.tab3.zip and unzip it.
Python code to parse out the yeast entries:
import pandas as pd

# Path to the unzipped BioGRID tab3 file
myfile = 'data-large-unsynced/BIOGRID-ALL-4.4.208.tab3.txt'
df = pd.read_csv(myfile, sep='\t', header=0)
# Keep only rows where both interactors are yeast
df = df[df['Organism Name Interactor A'].str.contains('Saccharomyces cerevisiae')]
df = df[df['Organism Name Interactor B'].str.contains('Saccharomyces cerevisiae')]
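The full BioGRID table is large, so here is a rough sketch of the same filter done row by row with the standard csv module instead of loading everything into pandas. The column names are the ones used above; the three sample rows and the function name yeast_rows are made up for illustration and are not from the repository.

```python
import csv
import io

# Made-up sample standing in for the real tab3 file (illustration only)
sample = (
    "Systematic Name Interactor A\tSystematic Name Interactor B\t"
    "Organism Name Interactor A\tOrganism Name Interactor B\n"
    "YAL001C\tYBR002W\tSaccharomyces cerevisiae (S288c)\tSaccharomyces cerevisiae (S288c)\n"
    "YAL001C\t-\tSaccharomyces cerevisiae (S288c)\tHomo sapiens\n"
    "-\t-\tHomo sapiens\tHomo sapiens\n"
)

def yeast_rows(handle, organism="Saccharomyces cerevisiae"):
    """Yield rows in which both interactors belong to the given organism."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Substring test, mirroring str.contains in the pandas version
        if (organism in row["Organism Name Interactor A"]
                and organism in row["Organism Name Interactor B"]):
            yield row

rows = list(yeast_rows(io.StringIO(sample)))
print(len(rows))
```

For the real file, the StringIO object would be replaced by open(myfile, newline=''), and only the yeast rows would ever be held in memory.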
Remove duplicated interactions
def alphabetic_ordered_tag(in_tag1, in_tag2):
    # Sort the two names so that A_B and B_A get the same tag
    tmp = [str(in_tag1), str(in_tag2)]
    tmp.sort()
    return tmp[0] + "_" + tmp[1]

df['alphabetic_ordered_tag'] = df.apply(lambda x: alphabetic_ordered_tag(x['Systematic Name Interactor A'], x['Systematic Name Interactor B']), axis=1)
df2 = df.drop_duplicates(subset=['alphabetic_ordered_tag'])
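To see why the tag has to be built from the sorted names, here is a small stand-alone sketch without pandas. The pair list and the helper name ordered_tag are hypothetical; the point is only that A-B and B-A end up with the same key, while a self-interaction keeps its own key.

```python
def ordered_tag(name_a, name_b):
    """Build an order-independent key for a pair of interactor names."""
    first, second = sorted([str(name_a), str(name_b)])
    return first + "_" + second

# Hypothetical pairs: the second is the first with A and B swapped
pairs = [
    ("YAL001C", "YBR002W"),
    ("YBR002W", "YAL001C"),
    ("YAL001C", "YAL001C"),  # self-interaction
]

seen = set()
unique_pairs = []
for a, b in pairs:
    tag = ordered_tag(a, b)
    if tag not in seen:  # keep the first occurrence, as drop_duplicates does by default
        seen.add(tag)
        unique_pairs.append((a, b))

print(unique_pairs)
```

If the tag were built from the names in their original order, the first two pairs would get different tags and both would survive the duplicate removal.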
Output a lean version for small file size
df3 = df2[['Systematic Name Interactor A', 'Systematic Name Interactor B', 'Official Symbol Interactor A', 'Official Symbol Interactor B', 'alphabetic_ordered_tag' ]]
df3.to_csv("biogrid_s288c_4.4.208.lean.csv")
Output a dictionary from systematic names to symbols.
dicA = df3[['Systematic Name Interactor A', 'Official Symbol Interactor A']]
dicB = df3[['Systematic Name Interactor B', 'Official Symbol Interactor B']]
dicA.columns = ['Name', 'Symbol']
dicB.columns = ['Name', 'Symbol']
dic = pd.concat([dicA, dicB])
dic2 = dic.drop_duplicates(subset=['Name', 'Symbol'])
dic2.to_csv("Sce_Name2Symbol.csv")
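A possible way to use the name-to-symbol file afterwards is to read it back into a plain Python dict. This is only a sketch: the two sample rows are stand-ins for the real file contents, the leading unnamed column is the index that to_csv writes by default, and load_name2symbol is a hypothetical helper.

```python
import csv
import io

# Stand-in for the contents of Sce_Name2Symbol.csv (illustration only)
sample_csv = (
    ",Name,Symbol\n"
    "0,YAL001C,TFC3\n"
    "1,YFL039C,ACT1\n"
)

def load_name2symbol(handle):
    """Map each systematic name to its symbol; the index column is ignored."""
    reader = csv.DictReader(handle)
    return {row["Name"]: row["Symbol"] for row in reader}

name2symbol = load_name2symbol(io.StringIO(sample_csv))
print(name2symbol.get("YFL039C"))  # ACT1
```

Because duplicates were dropped on the (Name, Symbol) pair, a name could in principle still appear with more than one symbol; in this sketch the last row read would win.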
A total of 627732 interactions and 6155 unique names/symbols were found for the S288C BioGRID data set.
Note: Self-interactions were included.
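If the self-interactions need to be counted or left out later, one way is to split the pairs by whether both names are equal. A minimal sketch on made-up pairs (split_self_interactions is a hypothetical helper, not part of the repository):

```python
def split_self_interactions(pairs):
    """Separate pairs whose two names are identical from all other pairs."""
    self_pairs = [p for p in pairs if p[0] == p[1]]
    other_pairs = [p for p in pairs if p[0] != p[1]]
    return self_pairs, other_pairs

# Made-up example pairs
pairs = [
    ("YAL001C", "YBR002W"),
    ("YFL039C", "YFL039C"),  # self-interaction
]

self_pairs, other_pairs = split_self_interactions(pairs)
print(len(self_pairs), len(other_pairs))
```

On the pandas side the same idea would be a boolean mask comparing 'Systematic Name Interactor A' with 'Systematic Name Interactor B' on df2.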