Open source tools, libraries, and datasets related to the Rumble Network Discovery product and associated research
QUESTION
search for regex match between two files using python
Asked 2022-Apr-09 at 00:49
I'm working with two text files that look like this. File 1:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCA_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1 na
GCF_001263815.1 PRJNA224116 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCA_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1 na
GCF_001297745.1 PRJNA224116 SAMD00040429 BCBV00000000.1 na 837 837 Porphyromonas gingivalis strain=Ando latest Scaffold Major Full 2015/09/17 ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1 an
...
File 2:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 na
GCA_001263815.1 PRJNA276132 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCF_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1 na
So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:
GCF_000739415.1
and file 2 this one:
GCA_000739415.1
The difference is the third character: F versus A. However, sometimes the numbers differ as well. Here, the files differ in the third row of data. The two files contain many accession pairs like the one above, but some accessions have no counterpart. My goal is to find the patterns that exist in only one file and not in the other. For example, GCF_001297745.1 is in the third row of file 1, but file 2 has no matching entry, which would be GCA_001297745.1.
I'm working on some Python code:
import re

# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()
# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")
Part 3 is meant to be a for loop that goes through the matches from both files and prints the patterns that exist in only one file and not in the other. Any idea how to write this for loop?
ANSWER
Answered 2022-Apr-09 at 00:49
Perhaps you are after this?
import re
given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"
# GC[A or F]_[number; digit >= 1].[number; digit >= 1]
# The dot is escaped so it only matches a literal "."
regex = r"GC[AF]_\d+\.\d+"
matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
Prints
GCA_000739415.1 is in both files
GCA_000739415.1 is in both files
But I would recommend:
# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
Which saves as:
['GCA_000739415.1']
Note that the regex matches strings of the form GC[A or F]_[number; digit >= 1].[number; digit >= 1]. Let me know if this is not what you are after.
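One detail worth checking in a pattern like this is the dot: an unescaped `.` matches any character, while `\.` matches only a literal dot. A minimal sketch of the difference, using a made-up sample string in which the second token is deliberately malformed:

```python
import re

loose = r"GC[AF]_\d+.\d+"    # '.' matches any single character
strict = r"GC[AF]_\d+\.\d+"  # '\.' matches only a literal dot

# Hypothetical sample: the second token uses "x" instead of "."
sample = "GCA_000739415.1 GCF_000739415x1"

loose_hits = re.findall(loose, sample)
strict_hits = re.findall(strict, sample)

print(loose_hits)   # includes the malformed token
print(strict_hits)  # only the well-formed accession
```

On real assembly summary files the two patterns will likely return the same matches, but the escaped form is the safer default.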
I believe you are after the symmetric difference of the sets for files 1 and 2, which is a fancy way of saying "things in A or B, but not in both".
This can be done with iteration:
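# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
# Also collect matches that appear only in the second list
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)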
>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']
I think your mistake was twofold: not using a set (you shouldn't have any duplicates), and comparing the whole lists with matches_1 == matches_2. The lists won't be the same. Instead, check whether each match is absent from the other set.
Or use this set notation, which is the preferred method:
>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
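Because the two files use different prefixes (GCF_ in file 1, GCA_ in file 2), the raw accession strings from one file never equal those from the other. One possible way to pair them up is to compare only the numeric part after the prefix. The sketch below assumes the accession is the first field of each data line and that paired entries share the same number and version; the helper name `accession_keys` and the shortened sample lines are hypothetical.

```python
import re

def accession_keys(lines, prefix):
    """Return the numeric part (e.g. '000739415.1') of each accession
    that starts a data line with the given prefix ('GCF' or 'GCA')."""
    pattern = re.compile(rf"{prefix}_(\d+\.\d+)")
    keys = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        found = pattern.match(line)  # anchored at the start of the line
        if found:
            keys.add(found.group(1))
    return keys

# Shortened stand-ins for the two files shown in the question
file_1_lines = [
    "# assembly_accession bioproject biosample",
    "GCF_000739415.1 PRJNA224116 SAMN02732406",
    "GCF_001263815.1 PRJNA224116 SAMN03366764",
    "GCF_001297745.1 PRJNA224116 SAMD00040429",
]
file_2_lines = [
    "# assembly_accession bioproject biosample",
    "GCA_000739415.1 PRJNA245225 SAMN02732406",
    "GCA_001263815.1 PRJNA276132 SAMN03366764",
]

keys_1 = accession_keys(file_1_lines, "GCF")
keys_2 = accession_keys(file_2_lines, "GCA")

only_in_file_1 = keys_1 - keys_2  # GCF entries with no GCA counterpart
only_in_file_2 = keys_2 - keys_1  # GCA entries with no GCF counterpart

for key in sorted(only_in_file_1):
    print(f"GCF_{key} is only in file 1 (no GCA_{key} in file 2)")
for key in sorted(only_in_file_2):
    print(f"GCA_{key} is only in file 2 (no GCF_{key} in file 1)")
```

With real files, `file_1_lines` and `file_2_lines` could be replaced by the open file objects. If paired entries can differ in version, comparing only the digits before the dot may be a better key.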
Community Discussions, Code Snippets contain sources that include Stack Exchange Network