I am trying to make a data frame in R that includes FASTA headers and sequences. I used the code below to do this; however, I would now like to make columns in my df using information from the FASTA headers.
Here is an example of a header I would like to use to make columns in my df. Ideally each piece of information between square brackets ([]) would become its own column. The main thing I need is the location as a column.
lcl|FR839628.1_cds_CCA36173.1_1 [locus_tag=PP7435_CHR1-0001] [db_xref=EnsemblGenomes-Gn:PP7435_Chr1-0001,EnsemblGenomes-Tr:CCA36173,UniProtKB/TrEMBL:F2QL95] [protein=Hypothetical_protein] [protein_id=CCA36173.1] [location=5023..6504] [gbkey=CDS]
Thanks for your help!
I tried this and it worked for making a df, but now I want to make columns from df$seq_name:
library("Biostrings")
fastaFile <- readDNAStringSet("my.fasta")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name, sequence)
I tried to use this string split command, but I am not sure how to do it in a way that saves the outputs into columns of the df:
string = df$seq_name
strsplit(string,split='[', fixed=TRUE)
You could try the tidyverse. You might need to modify it depending on which pieces of info you're trying to extract, but I think it should look something like this:
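A minimal sketch of that idea, assuming dplyr, stringr and tidyr are installed. The one-row df is rebuilt here from the header in the question so the example runs on its own, and the "ATGC" sequence is only a placeholder. The name df_parsed and the choice of fields to pull out are illustrative.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Stand-in for the df built with Biostrings in the question.
# The sequence value is a placeholder, not real data.
df <- data.frame(
  seq_name = "lcl|FR839628.1_cds_CCA36173.1_1 [locus_tag=PP7435_CHR1-0001] [db_xref=EnsemblGenomes-Gn:PP7435_Chr1-0001,EnsemblGenomes-Tr:CCA36173,UniProtKB/TrEMBL:F2QL95] [protein=Hypothetical_protein] [protein_id=CCA36173.1] [location=5023..6504] [gbkey=CDS]",
  sequence = "ATGC",
  stringsAsFactors = FALSE
)

df_parsed <- df %>%
  mutate(
    # str_match() returns a matrix; column 2 holds the first capture group,
    # i.e. the text between "key=" and the closing bracket
    locus_tag  = str_match(seq_name, "\\[locus_tag=([^\\]]*)\\]")[, 2],
    protein    = str_match(seq_name, "\\[protein=([^\\]]*)\\]")[, 2],
    protein_id = str_match(seq_name, "\\[protein_id=([^\\]]*)\\]")[, 2],
    location   = str_match(seq_name, "\\[location=([^\\]]*)\\]")[, 2]
  ) %>%
  # Assumes simple "start..end" locations; values such as
  # complement(...) or join(...) would need extra handling
  separate(location, into = c("start", "end"), sep = "\\.\\.",
           remove = FALSE, convert = TRUE)

df_parsed[, c("locus_tag", "protein_id", "location", "start", "end")]
```

Headers that lack a given field get NA in that column, so the same call can be run over the whole df.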
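If you would rather parse the file by hand, the steps are: start with some empty lists to store the extracted information, loop through each line in the fasta, update the current header and sequence as you go, store the last sequence's info once the loop ends, and then create a df.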
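Those steps could be sketched in base R roughly as follows. This is an illustration under assumptions, not the original code: the two-record file is a toy example written to a temporary path (the second record is made up), and fasta_path, df_manual and get_field are hypothetical names.

```r
# Toy FASTA file so the sketch is self-contained; the second record is invented
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(
  ">lcl|FR839628.1_cds_CCA36173.1_1 [locus_tag=PP7435_CHR1-0001] [protein_id=CCA36173.1] [location=5023..6504] [gbkey=CDS]",
  "ATGGCTAAA",
  "TTTGGGTAA",
  ">seq2 [locus_tag=EXAMPLE-0002] [location=7000..7500] [gbkey=CDS]",
  "ATGCCCTGA"
), fasta_path)

# Some empty lists to store extracted information
headers <- list()
sequences <- list()
current_header <- NULL
current_seq <- character(0)

# Loop through each line in the fasta
for (line in readLines(fasta_path)) {
  if (startsWith(line, ">")) {
    # A new header closes the previous record, if there was one
    if (!is.null(current_header)) {
      headers[[length(headers) + 1]] <- current_header
      sequences[[length(sequences) + 1]] <- paste(current_seq, collapse = "")
    }
    # Update current header and reset the sequence
    current_header <- sub("^>", "", line)
    current_seq <- character(0)
  } else if (nzchar(line)) {
    # Sequence lines may wrap, so collect them until the next header
    current_seq <- c(current_seq, line)
  }
}

# Store the last sequence info
if (!is.null(current_header)) {
  headers[[length(headers) + 1]] <- current_header
  sequences[[length(sequences) + 1]] <- paste(current_seq, collapse = "")
}

# Create a df
df_manual <- data.frame(
  seq_name = unlist(headers),
  sequence = unlist(sequences),
  stringsAsFactors = FALSE
)

# Hypothetical helper: value of one "[key=value]" field, NA when absent
get_field <- function(header, key) {
  pattern <- paste0("\\[", key, "=([^]]*)\\]")
  m <- regmatches(header, regexec(pattern, header))
  vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))
}

df_manual$location <- get_field(df_manual$seq_name, "location")
```

The same helper can be called with other keys, for example get_field(df_manual$seq_name, "locus_tag"), to add further columns.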