this is related to Genetic Computation/ Evolutionary Computation

Popular New Releases in Genomics

deepvariant - DeepVariant 1.3.0
pandarallel - Fix rolling_groupby and expanding_groupby for pandas 1.3.0
OpenWorm - 0.9.1 Dockerfile release
STAR - Alpha release: bug fix
gatk - 4.2.6.1

Top Authors in Genomics

1     66 Libraries    480
2     56 Libraries    3776
3     31 Libraries    129
4     28 Libraries    972
5     28 Libraries    5317
6     27 Libraries    1543
7     25 Libraries    2057
8     25 Libraries    448
9     22 Libraries    1057
10    22 Libraries    1161

Trending Kits in Genomics

javascript-genomics

11 best JavaScript Genomics

JavaScript has many modern libraries for building web applications. It is a high-level programming language that is relatively easy to learn and is supported by all major browsers. Genomics is a branch of molecular biology concerned with the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes. Genomics aims at the collective characterization and quantification of genes, which direct the production of proteins with the aid of enzymes and messenger molecules. Genomics also involves the sequencing and analysis of genomes using high-throughput DNA sequencing and bioinformatics. Widely used open source JavaScript genomics libraries include: igv.js - Embeddable genomic visualization component based; jbrowse - A modern genome browser built with JavaScript and HTML5; dna2json - Formats your genome file as JSON.

python-genomics

14 best Python Genomics

Python has become a primary language in bioinformatics and computational biology. It is widely used for scientific computing, data analysis, and analytics, including by mathematicians and statisticians who build data-driven applications. The advent of next-generation sequencing technologies has enabled a revolution in genomic research. Python is popular in the genomics and bioinformatics community because it offers a high level of abstraction, a large number of available packages, and good visualization tools. Genomics is a rapidly growing field, with many new tools and techniques developed every year, and Python is used in it for data analysis, statistical analysis, simulation, and visualization. A few of the most popular open source Python genomics libraries are: deepvariant - analysis pipeline that uses a deep neural network; hail - Scalable genomic data analysis; pyGenomeTracks - python module to plot beautiful.

cpp-genomics

5 best C++ Genomics

C++ is a fast, efficient programming language with object-oriented features, and it is one of the most popular languages for implementing and distributing bioinformatics software. Genomics is the study of genes and their functions. It includes the sequencing and analysis of genomes, which are the complete sets of DNA within a single cell of an organism. With the proliferation of next-generation sequencing (NGS), the amount of DNA sequence data generated has increased exponentially. This has led to the development of new tools and algorithms to handle these volumes of data. Several open source libraries let developers build genomic analysis tools without starting from scratch. Popular open source C++ genomics libraries include: nucleus - Python and C++ code for reading and writing genomics data; abyss - Assemble large genomes using short reads; vcftools - A set of tools written in Perl and C++ for working with VCF files, such as those generated by the 1000 Genomes Project.

go-genomics

6 best Go Genomics

Go genomics libraries are used for writing and executing bioinformatics tools and distributed pipelines. Their goal is to make analyzing genomic data as easy as working with data in web development, through consistent interfaces for biological sequence data that focus on performance, interoperability, and clean abstractions. Genomics is the field of molecular biology that focuses on the study of genomes. A genome is an organism's complete set of DNA, including all of its genes. A genome can be mapped through various means, and a map simplifies the identification and isolation of specific genes for further analysis. Knowledge about a genome can also be used to identify genetic diseases and genetic predispositions to disease within a population. Popular open source Go genomics libraries include: arvados - open source platform; goleft - bioinformatics tools distributed under MIT license; lollipops - Lollipop-style mutation diagrams for annotating genetic variations.

ruby-genomics

6 best Ruby Genomics

Ruby is a flexible, dynamic language, which can make it a good fit for bioinformatics and genomics projects, and bioinformaticians have used it for many years. It is easy to use, highly expressive, and supports both object-oriented and functional programming styles. It is also a convenient choice for quick scripting tasks. While the last decade has seen the growth of high-performance computing in bioinformatics, the processing power available to most researchers is still limited. Ruby libraries designed for working with biological data support a variety of applications, including machine learning, epidemiology, and systems biology. Popular open source Ruby genomics libraries include: sequenceserver - Intuitive local web frontend for the BLAST bioinformatics tool; dgidb - Rails frontend to The Genome Institute; nimbus - Ruby gem to implement Random Forest algorithms.

Trending Discussions on Genomics

    search for regex match between two files using python
    Is there a way to permute inside using to variables in bash?
    BigQuery Regex to extract string between two substrings
    how to stop letter repeating itself python
    Split multiallelic to biallelic in vcf by plink 1.9 and its variant name
    Delete specific letter in a FASTA sequence
    How to get the words within the first single quote in r using regex?
    Does Apache Spark 3 support GPU usage for Spark RDDs?
    Aggregating and summing columns across 1500 files by matching IDs in R (or bash)
    Usage of compression IO functions in apache arrow

QUESTION

search for regex match between two files using python

Asked 2022-Apr-09 at 00:49

I'm working with two text files that look like this. File 1:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCA_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1         na
GCF_001263815.1 PRJNA224116 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCA_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1            na
GCF_001297745.1 PRJNA224116 SAMD00040429    BCBV00000000.1  na  837 837 Porphyromonas gingivalis    strain=Ando     latest  Scaffold    Major   Full    2015/09/17  ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1            an
...

File 2:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         na
GCA_001263815.1 PRJNA276132 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCF_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1            na

So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:

GCF_000739415.1

and file 2 this one:

GCA_000739415.1

The difference is the third character: F versus A. Sometimes the numbers differ as well. In the example above, the files differ in the third row of data. Both files contain many IDs like these, but not every ID has a counterpart in the other file. My goal is to find the IDs that exist in only one file. For example, GCF_001297745.1 is in the third row of file 1, but file 2 has no corresponding GCA_001297745.1.

I'm working on this Python code:

import re

# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall("GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall("GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")

Part 3 is meant to be a for loop that checks each ID against both files and prints the IDs that exist in only one file and not in the other. Any idea how to write this loop?

ANSWER

Answered 2022-Apr-09 at 00:49

Perhaps you are after this?

import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
# The dot is escaped so that it matches a literal "." rather than any character.
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)

# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")

Prints

copy icondownload icon

1#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
2# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
3GCF_000739415.1 PRJNA224116 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCA_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1         na
4GCF_001263815.1 PRJNA224116 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCA_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1            na
import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
# The dot is escaped so it only matches a literal "."
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")

GCA_000739415.1 is in both files
GCA_000739415.1 is in both files

But I would recommend:

# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))

Which saves as:

['GCA_000739415.1']

Note that the regex matches IDs of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.
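As a rough sketch of a stricter version of that pattern: escaping the dot and adding word boundaries keeps it from matching look-alike strings. The sample string below is made up for illustration, and ACCESSION is just a name chosen here.

```python
import re

# Escaped dot plus word boundaries on both sides.
# ACCESSION and sample are illustrative names, not from the question.
ACCESSION = re.compile(r"\bGC[AF]_\d+\.\d+\b")

sample = "GCA_000739415.1 GCTEST_000739415.1 GCF_000739415x1 GCF_001297745.1"
found = ACCESSION.findall(sample)
print(found)  # ['GCA_000739415.1', 'GCF_001297745.1']
```

One side effect of the trailing word boundary: the copy of the ID embedded in the ftp path (for example GCA_000739415.1_ASM73941v1) is no longer matched, because an underscore follows it. If you want those copies as well, drop the final \b.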



Edit

I believe you are after the symmetric difference of the sets of IDs from files 1 and 2, which is a fancy way of saying "things in A or B, but not in both".

This can be done with iteration:

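# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
# Check the other direction too, otherwise this is only a one-sided difference
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)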

>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']

I think your mistakes were not using a set (you shouldn't have any duplicates) and comparing with matches_1 == matches_2. That compares the two whole lists, and they won't be equal. Instead, check whether each match is missing from the other set.
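To make that concrete, here is a small sketch with hand-written lists (the IDs come from your sample rows, the variable names other and only_in_1 are mine):

```python
matches_1 = ["GCF_000739415.1", "GCF_001263815.1", "GCF_001297745.1"]
matches_2 = ["GCF_000739415.1", "GCF_001263815.1"]

# Comparing the whole lists gives a single True/False for everything at once
print(matches_1 == matches_2)  # False

# Membership is checked per element; a set makes each lookup cheap
other = set(matches_2)
only_in_1 = [m for m in matches_1 if m not in other]
print(only_in_1)  # ['GCF_001297745.1']
```

The list comparison only tells you that the lists differ, not which entries are responsible, which is why the per-element check is what you need here.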

Or use set notation, which is the preferred method:

>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
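Since one file uses GCF_ IDs and the other GCA_ IDs, you may prefer to compare only the numeric part. Below is a minimal sketch of that idea, assuming the ID you care about is the first column of each data row and that paired assemblies share the same digits and version. The helper names (accession_keys, unpaired) are hypothetical, and the rows are shortened versions of your sample data.

```python
import re

# Leading accession of a data row; group 1 is the part after "GCA_"/"GCF_"
ACCESSION = re.compile(r"^GC[AF]_(\d+\.\d+)")

def accession_keys(lines):
    """Map the numeric part of each row's leading accession to the full ID."""
    keys = {}
    for line in lines:
        m = ACCESSION.match(line)  # comment lines start with '#', so they never match
        if m:
            keys[m.group(1)] = m.group(0)
    return keys

def unpaired(lines_1, lines_2):
    """Return (IDs only in lines_1, IDs only in lines_2), ignoring the A/F letter."""
    keys_1, keys_2 = accession_keys(lines_1), accession_keys(lines_2)
    only_1 = sorted(keys_1[k] for k in keys_1.keys() - keys_2.keys())
    only_2 = sorted(keys_2[k] for k in keys_2.keys() - keys_1.keys())
    return only_1, only_2

# Abbreviated rows (accession and bioproject only) taken from the question
refseq_rows = [
    "# assembly_accession\tbioproject",
    "GCF_000739415.1\tPRJNA224116",
    "GCF_001263815.1\tPRJNA224116",
    "GCF_001297745.1\tPRJNA224116",
]
genbank_rows = [
    "# assembly_accession\tbioproject",
    "GCA_000739415.1\tPRJNA245225",
    "GCA_001263815.1\tPRJNA276132",
]

only_refseq, only_genbank = unpaired(refseq_rows, genbank_rows)
print(only_refseq)   # ['GCF_001297745.1']
print(only_genbank)  # []
```

With the real files you could pass the open file objects directly, since iterating over a file yields its lines. If the version numbers of paired assemblies can differ in your data, the gbrs_paired_asm column listed in the header is probably a more reliable way to pair rows than the digits alone.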

Source https://stackoverflow.com/questions/71789818

Community Discussions contain sources that include Stack Exchange Network

    search for regex match between two files using python
    Is there a way to permute inside using to variables in bash?
    BigQuery Regex to extract string between two substrings
    how to stop letter repeating itself python
    Split multiallelic to biallelic in vcf by plink 1.9 and its variant name
    Delete specific letter in a FASTA sequence
    How to get the words within the first single quote in r using regex?
    Does Apache Spark 3 support GPU usage for Spark RDDs?
    Aggregating and summing columns across 1500 files by matching IDs in R (or bash)
    Usage of compression IO functions in apache arrow

QUESTION

search for regex match between two files using python

Asked 2022-Apr-09 at 00:49

I'm working with two text files that look like this. File 1:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCA_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1         na
GCF_001263815.1 PRJNA224116 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCA_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1            na
GCF_001297745.1 PRJNA224116 SAMD00040429    BCBV00000000.1  na  837 837 Porphyromonas gingivalis    strain=Ando     latest  Scaffold    Major   Full    2015/09/17  ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1            an
...

File 2:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         na
GCA_001263815.1 PRJNA276132 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCF_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1            na

So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:

GCF_000739415.1

and file 2 this one:

12GCA_000739415.1
13

The difference is the third character: F versus A. However, sometimes numbers differ. Difference between files is the third row of data. These two files have a lot of patterns like the previous one, however, there are some differences. My goal is to search for the pattern that only exists in one file and not in the other file. For example, "GCF_001297745.1 in the third row in the file 1 but not in the file 2. This should be a GCA_001297745.1"

I'm working on this Python code:

import re

# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")

Part 3 is meant to be a for loop that goes through the IDs found in both files and prints the ones that exist in only one file and not in the other. Any idea how to write this loop?

ANSWER

Answered 2022-Apr-09 at 00:49

Perhaps you are after this?

import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[number; digit >= 1].[number; digit >= 1]
# The dot is escaped so that it matches only a literal "."
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)

# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")

Prints

GCA_000739415.1 is in both files
GCA_000739415.1 is in both files

But I would recommend:

# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))

Which saves as:

['GCA_000739415.1']
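If the order of the matches does matter, one possible variation is to keep a set only for the membership test and let a dict remove the duplicates. This is a minimal sketch; the two lists are hypothetical stand-ins shaped like the `re.findall` results above, and `lookup` and `ordered_common` are names introduced here for illustration.

```python
# Hypothetical match lists, shaped like the ones produced by re.findall above.
matches_1 = ["GCA_000739415.1", "GCF_000739415.1", "GCA_000739415.1"]
matches_2 = ["GCA_000739415.1"]

# Build the lookup set once so each membership test is a fast hash lookup.
lookup = set(matches_2)

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
ordered_common = list(dict.fromkeys(m for m in matches_1 if m in lookup))

print(ordered_common)
```

The set-based version stays the simpler choice when order is irrelevant; this variant only matters if the IDs need to come out in the order they appear in file 1.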

Note that the regex matches IDs of the form GC[A or F]_[number; digit >= 1].[number; digit >= 1]. Let me know if this is not what you are after.
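One detail worth keeping in mind with a pattern of this form: in a regex, an unescaped `.` matches any character, while `\.` matches only a literal dot. The sketch below shows the difference on a made-up string; `sample` and its malformed last token are assumptions for illustration, not data from the NCBI files.

```python
import re

# Hypothetical sample text; the last token is deliberately malformed ("x" instead of ".").
sample = "GCA_000739415.1 GCF_000739415.1 GCA_000739415x1"

# Unescaped dot: "." accepts any single character between the two numbers.
loose = re.findall(r"GC[AF]_\d+.\d+", sample)

# Escaped dot: "\." accepts only a literal "." between the two numbers.
strict = re.findall(r"GC[AF]_\d+\.\d+", sample)

print(loose)
print(strict)
```

On well-formed accession IDs both patterns find the same matches, so the escaped form is mainly a safeguard against unexpected input.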

Regex demo here


Edit

I believe you are after the symmetric difference of the sets for files 1 and 2, which is a fancy way of saying "things in A or B, but not in both".

This can be done with iteration:

# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
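Python sets also provide the symmetric difference directly through the `^` operator, which covers both directions in one step. The sketch below applies it to the question's goal under one loud assumption: a GCF_ and a GCA_ accession count as the same entry when their numeric part and version are identical. The question notes that the numbers sometimes differ, so this pairing rule may not hold for every record. The two strings and the helper `accession_keys` are hypothetical stand-ins for the real file contents.

```python
import re

# Hypothetical stand-ins for the text read from the two files.
file_1_text = "GCF_000739415.1 GCF_001263815.1 GCF_001297745.1"
file_2_text = "GCA_000739415.1 GCA_001263815.1"


def accession_keys(text):
    """Return the 'digits.version' part of every GCA_/GCF_ accession in text."""
    # The capture group drops the GCA_/GCF_ prefix so both files become comparable.
    return set(re.findall(r"GC[AF]_(\d+\.\d+)", text))


keys_1 = accession_keys(file_1_text)
keys_2 = accession_keys(file_2_text)

only_in_one = keys_1 ^ keys_2  # symmetric difference: in exactly one file
only_in_1 = keys_1 - keys_2    # present in file 1, missing from file 2
only_in_2 = keys_2 - keys_1    # present in file 2, missing from file 1

print(sorted(only_in_one))
```

With real data, `file_1_text` and `file_2_text` would come from reading each file, for example with `f.read()`. Restricting the search to the first column of each row would avoid also picking up the paired accession and the FTP path that appear later on the same line.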

copy icondownload icon

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCA_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1         na
GCF_001263815.1 PRJNA224116 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCA_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1            na
GCF_001297745.1 PRJNA224116 SAMD00040429    BCBV00000000.1  na  837 837 Porphyromonas gingivalis    strain=Ando     latest  Scaffold    Major   Full    2015/09/17  ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1            an
...

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         na
GCA_001263815.1 PRJNA276132 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCF_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1            na

GCF_000739415.1
GCA_000739415.1

# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs (raw strings keep the backslash in the pattern intact)
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode (the asker's attempt)
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")

import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[number; digits >= 1].[number; digits >= 1]
# The dot is escaped so that it matches only a literal "."
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)

# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
# Output:
# GCA_000739415.1 is in both files
# GCA_000739415.1 is in both files

# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
# ['GCA_000739415.1']

# Iteration
# A set has no duplicates and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
# A symmetric difference also needs the IDs found only in matches_2
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)

# Result reported in the answer; it is consistent with the two sample files
# above rather than with the one-line examples
>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']

I think your mistakes were not using a set (you shouldn't have any duplicates) and comparing the lists with matches_1 == matches_2. The two lists won't be equal as a whole, so check instead whether each ID is missing from the other set.

Or use set notation, which is the preferred method:


>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
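As a rough sketch of how the pieces above could fit together (this is not part of the original answer): extract the IDs with the escaped pattern, turn them into sets, then use `&` for IDs found in both inputs and `^` for IDs found in only one. The names `ACCESSION`, `extract_ids` and `compare_ids` are made up for illustration, and the two sample strings are shortened stand-ins for rows of the summary files, not real file contents.

```python
import re

# Hypothetical helper names; the dot is escaped so it matches only a literal "."
ACCESSION = re.compile(r"GC[AF]_\d+\.\d+")


def extract_ids(text):
    """Return the unique accession IDs found anywhere in the text."""
    return set(ACCESSION.findall(text))


def compare_ids(text_1, text_2):
    """Return (IDs present in both texts, IDs present in only one of them)."""
    ids_1, ids_2 = extract_ids(text_1), extract_ids(text_2)
    return ids_1 & ids_2, ids_1 ^ ids_2


# Shortened stand-ins for rows of the two summary files (assumed sample data)
refseq_sample = (
    "GCF_000739415.1\tPRJNA224116\tGCA_000739415.1\n"
    "GCF_001297745.1\tPRJNA224116\tGCA_001297745.1\n"
)
genbank_sample = "GCA_000739415.1\tPRJNA245225\tGCF_000739415.1\n"

in_both, in_one_only = compare_ids(refseq_sample, genbank_sample)
print(sorted(in_both))      # ['GCA_000739415.1', 'GCF_000739415.1']
print(sorted(in_one_only))  # ['GCA_001297745.1', 'GCF_001297745.1']
```

To try it on the real files, the text could come from something like `open("assembly_summary_refseq.txt").read()` in place of the sample strings.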

Source https://stackoverflow.com/questions/71789818

Community Discussions contain sources that include Stack Exchange Network
