Popular New Releases in Genomics
deepvariant - DeepVariant 1.3.0
pandarallel - Fix rolling_groupby and expanding_groupby for pandas 1.3.0
OpenWorm - 0.9.1 Dockerfile release
STAR - Alpha release: bug fix
gatk - 4.2.6.1
Popular Libraries in Genomics
by jeroenjanssens html
3036 NOASSERTION
Data Science at the Command Line
by biopython python
2836 NOASSERTION
Official git repository for Biopython (originally converted from CVS)
by google python
2390 BSD-3-Clause
DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
by nalepae python
2064 BSD-3-Clause
A simple and efficient tool to parallelize Pandas operations on all available CPUs
by openworm python
1603 MIT
Repository for the main Dockerfile with the Openworm software stack and project-wide issues
by bioconda shell
1383 MIT
Conda recipes for the bioconda channel.
by alexdobin c
1227 MIT
RNA-seq aligner
by broadinstitute java
1216 NOASSERTION
Official code repository for GATK versions 4 and up
by samtools c
1186 NOASSERTION
Tools (written in C using htslib) for manipulating next-generation sequencing data
Trending New libraries in Genomics
by cov-lineages python
340 GPL-3.0
Software package for assigning SARS-CoV-2 genome sequences to global lineages.
by TimothyStiles go
257 MIT
A Go package for engineering organisms.
by rrwick python
176 GPL-3.0
A tool for generating consensus long-read assemblies for bacterial genomes
by sqjin r
176 GPL-3.0
R toolkit for inference, visualization and analysis of cell-cell communication from single-cell data
by malonge python
171 MIT
Tools for fast and flexible genome assembly scaffolding and improvement
by theislab python
169 BSD-3-Clause
CellRank for directed single-cell fate mapping
by smarco c
165 NOASSERTION
Wavefront alignment algorithm (WFA): Fast and exact gap-affine pairwise alignment
by marbl shell
150 NOASSERTION
k-mer based assembly evaluation
by whatshap c++
145 MIT
Read-based phasing of genomic variants, also called haplotype assembly
Top Authors in Genomics
1
66 Libraries
480
2
56 Libraries
3776
3
31 Libraries
129
4
28 Libraries
972
5
28 Libraries
5317
6
27 Libraries
1543
7
25 Libraries
2057
8
25 Libraries
448
9
22 Libraries
1057
10
22 Libraries
1161
Trending Kits in Genomics
Java is a programming language created by Sun Microsystems. Its platform independence and versatility make it a popular choice for many bioinformatics projects, and it remains one of the most widely used programming languages in the world. As the field of genomics grows rapidly, Java genomics libraries provide algorithms and data structures for working with biological data, and they help with tasks ranging from writing more efficient code to distributing that code over the internet. Some of the most popular Java Genomics open source libraries among developers are: igv - Integrative Genomics Viewer; cbioportal - cBioPortal for Cancer Genomics; gridss - GRIDSS: the Genomic Rearrangement IDentification Software Suite.
Genomics is the study of genes and their functions, and it involves intensive data and information processing. The ability of computers to analyze and interpret DNA sequences is a crucial necessity in the field. With the huge increase in the amount of genetic data, genomics has become one of the most important fields of study in modern medicine. Open-source libraries have made programming with genomic data easier and more accessible. They provide a wide range of capabilities, from nucleotide sequence manipulation to reading and writing a variety of file formats. Some of the C# Genomics open source libraries that developers tend to use are: sharpneat - SharpNEAT Evolution of Neural Networks; Nirvana - The nimble & robust variant annotator; CromwellOnAzure - Microsoft Genomics supported implementation; BLSS - unique bioinformatics tools for the brave explorer.
Ruby is a good language for building bioinformatics applications. Its flexible and dynamic nature makes it a good fit for bioinformatics and genomics projects, and bioinformaticians have long used it. It is easy to use, highly expressive, and supports both object-oriented and functional programming styles. It is also a good choice for quick scripting tasks. While the last decade has seen the growth of high-performance computing in bioinformatics, the processing power available to most researchers is still limited. Ruby genomics libraries are designed for working with biological data, which makes them suitable for a wide variety of applications including machine learning, epidemiology, and systems biology. Popular Ruby Genomics open source libraries for developers include: sequenceserver - Intuitive local web frontend for the BLAST bioinformatics tool; dgidb - Rails frontend to The Genome Institute; nimbus - Ruby gem to implement Random Forest algorithms.
Python has become a primary language in the field of bioinformatics and computational biology. It is well suited to scientific computing, data analysis, and analytics, and it is widely used by mathematicians and statisticians to create data-driven applications. The advent of next-generation sequencing technologies has enabled a revolution in genomic research, and genomics remains a rapidly growing field with many new tools and techniques developed every year. Python is popular in the genomics and bioinformatics community because it provides a high level of abstraction, a large number of available packages, and strong visualization tools. It can be used for applications such as data analysis, statistical analysis, simulation, and visualization. A few of the most popular Python Genomics open source libraries for developers are: deepvariant - analysis pipeline that uses a deep neural network; hail - Scalable genomic data analysis; pyGenomeTracks - python module to plot beautiful.
C++ is an object-oriented programming language that is fast, efficient, and powerful. It is one of the most popular languages for implementing and distributing bioinformatics software. Genomics is the study of genes and their functions. It includes the sequencing and analysis of genomes, which are complete sets of DNA within a single cell of an organism. With the proliferation of next-generation sequencing (NGS), the amount of DNA sequence data generated has increased exponentially. This has led to the development of new tools and algorithms to handle these enormous volumes of data. Several open source libraries allow developers to build genomic analysis tools quickly without having to start from scratch. Popular C++ Genomics open source libraries available for developers include: nucleus - Python and C++ code for reading and writing genomics data; abyss - Assemble large genomes using short reads; vcftools - A set of tools written in Perl and C++ for working with VCF files, such as those generated by the 1000 Genomes Project.
Go Genomics is a new framework for writing and executing distributed bioinformatics pipelines. Its goal is to make analyzing genomic data as easy as working with data in web development. It provides a consistent interface for working with biological sequence data, focused on performance, interoperability, and clean abstractions. Genomics is the field of molecular biology that focuses on the study of genomes. A genome is an organism's complete set of DNA, including all of its genes. A genome can be mapped through various means, and a map in turn simplifies the identification and isolation of specific genes for further analysis. Knowledge about a genome can also be used to identify genetic diseases and genetic predispositions to various diseases within a population. Popular Go Genomics open source libraries include: arvados - open source platform; goleft - bioinformatics tools distributed under MIT license; lollipops - Lollipop-style mutation diagrams for annotating genetic variations.
JavaScript has many modern libraries that can be used to develop polished web applications. Its main advantage is that it is easy to learn and use. It is a high-level programming language supported by all major browsers. Genomics is a branch of molecular biology concerned with the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes. Genomics aims at the collective characterization and quantification of genes, which direct the production of proteins with the aid of enzymes and messenger molecules. Genomics also involves the sequencing and analysis of genomes through the use of high-throughput DNA sequencing and bioinformatics. Some of the most widely used open source libraries for JavaScript Genomics among developers include: igv.js - Embeddable genomic visualization component based; jbrowse - A modern genome browser built with JavaScript and HTML5; dna2json - Formats your genome file as JSON.
Trending Discussions on Genomics
search for regex match between two files using python
Is there a way to permute inside using to variables in bash?
BigQuery Regex to extract string between two substrings
how to stop letter repeating itself python
Split multiallelic to biallelic in vcf by plink 1.9 and its variant name
Delete specific letter in a FASTA sequence
How to get the words within the first single quote in r using regex?
Does Apache Spark 3 support GPU usage for Spark RDDs?
Aggregating and summing columns across 1500 files by matching IDs in R (or bash)
Usage of compression IO functions in apache arrow
QUESTION
search for regex match between two files using python
Asked 2022-Apr-09 at 00:49
I'm working with two text files that look like this. File 1:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCA_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1 na
GCF_001263815.1 PRJNA224116 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCA_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1 na
GCF_001297745.1 PRJNA224116 SAMD00040429 BCBV00000000.1 na 837 837 Porphyromonas gingivalis strain=Ando latest Scaffold Major Full 2015/09/17 ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1 an
...
File 2:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 na
GCA_001263815.1 PRJNA276132 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCF_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1 na
So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:
GCF_000739415.1
and file 2 this one:
GCA_000739415.1
The difference is the third character: F versus A. However, sometimes the numbers differ as well. The two files differ in the third row of data. Both files contain many patterns like the one above, but there are some differences between them. My goal is to find the patterns that exist in only one file and not in the other. For example, GCF_001297745.1 is in the third row of file 1 but has no counterpart in file 2, where it should appear as GCA_001297745.1.
I'm working on this Python code:
# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall("GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall("GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")
Part 3 is meant to be a for loop that goes through the matches from both files and prints the patterns that exist in only one file and not in the other. Any idea how to write this for loop?
ANSWER
Answered 2022-Apr-09 at 00:49
Perhaps you are after this?
import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
# The dot is escaped so that it matches a literal "." only.
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)

# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
Prints
1# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCA_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1 na
GCF_001263815.1 PRJNA224116 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCA_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1 na
GCF_001297745.1 PRJNA224116 SAMD00040429 BCBV00000000.1 na 837 837 Porphyromonas gingivalis strain=Ando latest Scaffold Major Full 2015/09/17 ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1 an
...
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 na
GCA_001263815.1 PRJNA276132 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCF_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1 na
GCF_000739415.1
GCA_000739415.1
# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall("GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall("GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")
import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
# The dot is escaped so that it only matches a literal "."
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
GCA_000739415.1 is in both files
GCA_000739415.1 is in both files
But I would recommend:
# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
Which saves as:
['GCA_000739415.1']
Note that the regex matches IDs of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.
Edit
I believe you are after the symmetric difference of the sets for files 1 and 2, which is a fancy way of saying "things in A or B, but not in both".
This can be done with iteration:
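# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
# Also collect the matches that appear only in the second list
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)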
>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']
I think your mistakes were not using a set (you shouldn't have any duplicates) and comparing with matches_1 == matches_2. The two lists won't be the same, so instead you should check whether each match is missing from the other set.
Or use set notation, which is the preferred method:
>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
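To tie the pieces above together, here is a minimal, self-contained sketch of the same idea. It assumes the two summary files have already been read into strings; `refseq_text` and `genbank_text` are abbreviated, hypothetical stand-ins for the real files, and the helper names `accession_ids` and `only_in_one` are illustrative, not part of the original answer.

```python
import re

# Abbreviated, hypothetical stand-ins for the two assembly summary files.
# Only the accession columns matter for this sketch.
refseq_text = """\
GCF_000739415.1 PRJNA224116 GCA_000739415.1 identical
GCF_001263815.1 PRJNA224116 GCA_001263815.1 identical
GCF_001297745.1 PRJNA224116 GCA_001297745.1 identical
"""
genbank_text = """\
GCA_000739415.1 PRJNA245225 GCF_000739415.1 identical
GCA_001263815.1 PRJNA276132 GCF_001263815.1 identical
"""

# GCA_ or GCF_, one or more digits, a literal dot, one or more digits
ACCESSION = re.compile(r"GC[AF]_\d+\.\d+")


def accession_ids(text):
    """Return the set of accession IDs found anywhere in the text."""
    return set(ACCESSION.findall(text))


def only_in_one(text_a, text_b):
    """Return the sorted IDs that occur in exactly one of the two texts."""
    return sorted(accession_ids(text_a) ^ accession_ids(text_b))


print(only_in_one(refseq_text, genbank_text))
```

The `^` operator on sets is equivalent to `symmetric_difference`; sorting the result only makes the output order predictable, since sets are unordered.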
QUESTION
Is there a way to permute using two variables in bash?
Asked 2021-Dec-09 at 23:50
I'm using the software plink2 (https://www.cog-genomics.org/plink/2.0/) and I'm trying to iterate over 3 variables.
This software accepts an input file with a .ped extension and an exclude file with a .txt extension, which contains a list of names to be excluded from the input file.
The idea is to iterate over the input files and then over the exclude files to generate individual output files.
- Input files: Highland.ped - Midland.ped - Lowland.ped
- Exclude-map files: HighlandMidland.txt - HighlandLowland.txt - MidlandLowland.txt
- Output files: HighlandMidland - HighlandLowland - MidlandHighland - MidlandLowland - LowlandHighland - LowlandMidland
The general code is:
plink2 --file Highland --exclude HighlandMidland.txt --out HighlandMidland
plink2 --file Highland --exclude HighlandLowland.txt --out HighlandLowland
plink2 --file Midland --exclude HighlandMidland.txt --out MidlandHighland
plink2 --file Midland --exclude MidlandLowland.txt --out MidlandLowland
plink2 --file Lowland --exclude HighlandLowland.txt --out LowlandHighland
plink2 --file Lowland --exclude MidlandLowland.txt --out LowlandMidland
To avoid repeating this code six times, I would like to use the three variables listed above (input, exclude-map, and output files) to create the individual output files. The output file names are ordered pairs of distinct input file names.
ANSWER
Answered 2021-Dec-09 at 23:50
Honestly, I think your current code is quite clear; but if you really want to write this as a loop, here's one possibility:
lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
  for (( j = i + 1 ; j < ${#lands[@]} ; ++j )) ; do
    plink2 --file "${lands[i]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[i]}${lands[j]}"
    plink2 --file "${lands[j]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[j]}${lands[i]}"
  done
done
and here's another:
lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
  for (( j = 0 ; j < ${#lands[@]} ; ++j )) ; do
    if [[ "$i" != "$j" ]] ; then
      # The exclude file is always named with the lower index first
      plink2 \
        --file "${lands[i]}" \
        --exclude "${lands[i < j ? i : j]}${lands[i < j ? j : i]}.txt" \
        --out "${lands[i]}${lands[j]}"
    fi
  done
done
. . . but one common factor between both of the above is that they're much less clear than your current code!
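For comparison, the same pairing logic can be sketched outside the shell. This is only an illustration under stated assumptions: it builds the command strings rather than running plink2, the function name `plink_commands` is hypothetical, and the flags are copied verbatim from the commands in the question.

```python
from itertools import permutations


def plink_commands(lands):
    """Build one plink2 command per ordered pair of distinct names."""
    commands = []
    for i, j in permutations(range(len(lands)), 2):
        # The exclude file is named with the earlier list entry first,
        # e.g. HighlandMidland.txt for both Highland and Midland inputs.
        first, second = sorted((i, j))
        exclude = f"{lands[first]}{lands[second]}.txt"
        commands.append(
            f"plink2 --file {lands[i]} --exclude {exclude} "
            f"--out {lands[i]}{lands[j]}"
        )
    return commands


for command in plink_commands(["Highland", "Midland", "Lowland"]):
    print(command)
```

`itertools.permutations` yields every ordered pair without repeating an element, which matches the six output names listed in the question.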
QUESTION
BigQuery Regex to extract string between two substrings
Asked 2021-Dec-09 at 01:11
From this example string:
{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}
I'm trying to extract the Stockcode only.
REGEXP_REPLACE(col, r".*,&q;Stockcode&q;:/([^/$]*)\,&q;.*", r"\1")
So the result should be
1234
however my Regex still returns the entire contents.
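One likely cause: the pattern requires a literal `/` right after `Stockcode&q;:`, which the sample string does not contain, so nothing matches and the replace leaves the input unchanged. The sketch below uses Python's `re` module only to illustrate a capture-group pattern on the sample string; the names `STOCKCODE` and `extract_stockcode` are hypothetical. In BigQuery, a similar pattern could probably be passed to `REGEXP_EXTRACT` instead of `REGEXP_REPLACE`, but that is an assumption to verify against your data.

```python
import re

# The sample string from the question, copied as-is
sample = (
    '{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,'
    '&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}'
)

# Capture the run of digits that follows the Stockcode key
STOCKCODE = re.compile(r"&q;Stockcode&q;:(\d+)")


def extract_stockcode(text):
    """Return the Stockcode digits as a string, or None if absent."""
    match = STOCKCODE.search(text)
    return match.group(1) if match else None


print(extract_stockcode(sample))
```

Searching for the key and capturing only the digits avoids having to match the rest of the string with leading and trailing `.*`.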
ANSWER
Answered 2021-Dec-09 at 01:11
QUESTION
how to stop letter repeating itself python
Asked 2021-Nov-25 at 18:33
I am writing code that takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word one by one, check whether it contains all the characters of the input word, and then check whether the lengths are the same. The problem is that when I enter a word such as helol, the l is checked twice, which gives me some other outputs in addition to the correct one (hello). I know why this happens, but I can't find a fix for it.
import json

val = open("data.json")
val1 = json.load(val)  # loads the list


a = input("Enter a Jumbled word ")  # takes a word from user
a = list(a)  # changes into list to iterate


for x in val1:  # iterates words from list
    for somethin in a:  # iterates letters from list
        if somethin in list(x):  # checks if the letter is in the iterated word
            continue
        else:
            break
    else:  # checks if the loop ended correctly (that means word has same letters)
        if len(a) != len(list(x)):  # checks if it has same number of letters
            continue  # returns
        else:
            print(x)  # continues the loop to see if there are more like that
EDIT: many people wanted the json file so here it is
22['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 
'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety 
pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 
'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
This is just an example; the dictionary is full of items.
ANSWER
Answered 2021-Nov-25 at 18:33
As I understand it, you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters of the jumbled word and compare the result against the sorted letters of each word in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
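The dictionary idea mentioned above could be sketched roughly as follows. This is only an illustration: the helper names (`letter_key`, `build_index`, `unscramble`) and the short word list are made up for the example and stand in for the list loaded from data.json. Spaces are dropped before sorting, as suggested above.

```python
from collections import defaultdict


def letter_key(text):
    """Return the sorted letters of text, ignoring spaces, as a hashable key."""
    return "".join(sorted(text.replace(" ", "")))


def build_index(words):
    """Map each sorted-letter key to every word that shares those letters."""
    index = defaultdict(list)
    for word in words:
        index[letter_key(word)].append(word)
    return index


# Hypothetical stand-in for the list loaded from data.json
words = ["reap", "pure", "sea bass", "spark", "rasp"]
index = build_index(words)


def unscramble(jumbled):
    """Look up all words whose letters match the jumbled input."""
    return index.get(letter_key(jumbled), [])


print(unscramble("pear"))
print(unscramble("bassase"))
```

The index is built once, so each later lookup sorts only the jumbled input instead of every word in the list.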
QUESTION
Split multiallelic to biallelic in vcf by plink 1.9 and its variant name
Asked 2021-Nov-17 at 13:56
I am trying to use PLINK 1.9 to split multiallelic variants into biallelic ones. The input is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
What it produces is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
What I expect is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A 0 930952 A G
1 chr1:930952:G:C 0 930952 A G
Please help me make a VCF, PED, or MAP file like the one I expect. Thank you.
ANSWER
Answered 2021-Nov-17 at 09:45
I used bcftools to complete the task.
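The answer does not give the command that was used, so the following is not the bcftools approach. It is only a small sketch of the row layout the question asks for: each variant ID that joins several IDs with ";" becomes one row per ID, with the other columns copied unchanged. The function name `split_joined_ids` is made up for the example, and the sketch rewrites the variant table only; it does not touch genotype data, which a real split would also need to handle.

```python
def split_joined_ids(bim_lines, sep=";"):
    """Yield one row per variant ID, copying the other columns unchanged."""
    for line in bim_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        chrom, var_id, cm, pos, a1, a2 = fields
        for single_id in var_id.split(sep):
            yield " ".join([chrom, single_id, cm, pos, a1, a2])


# The three input rows from the question
rows = [
    "1 chr1:930939:G:A 0 930939 G A",
    "1 chr1:930947:G:A 0 930947 A G",
    "1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G",
]
result = list(split_joined_ids(rows))
for row in result:
    print(row)
```

Note that the allele columns are copied as they are, which matches the expected listing in the question but may not reflect the true alleles of each split variant.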
QUESTION
Delete specific letter in a FASTA sequence
Asked 2021-Oct-12 at 21:00
I have a FASTA file that has about 300000 sequences, but some of the sequences look like these:
1>Spike|hCoV-19/Wuhan/WH02/2019|2019-12-31|EPI_ISL_406799|Original|hCoV-19^^Wuhan|Human|General Hospital of Central Theater Command of People's Liberation Army of China|BGI & Institute of Microbiology|Hunter|China
2MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVITEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
3
4>Spike|hCoV-19/England/PORT-2DE4EF/2020|2020-00-00|EPI_ISL_1310367|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
5MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLDILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
6
7>Spike|hCoV-19/England/PORT-2DE616/2020|2020-00-00|EPI_ISL_1310384|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
8MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGSAAYYVGYLQLRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYYLLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
9
10
I want to delete all the sequences that contain the letter X. How can I do that?
ANSWER
Answered 2021-Oct-12 at 20:28
You can match the FASTA entries that contain no X with the regex >.+\n[^X>]+\n. This looks for a substring that starts with >, continues with a first line of anything (the FASTA header), and is followed by characters that are neither X nor > until a line break is reached. Excluding > as well as X keeps a match from running past the end of one record into the header of the next one. This assumes each sequence sits on a single line, as in your example.
For example:
import re

# text holds the full contents of the FASTA file
no_X_FASTA = "".join(re.findall(r">.+\n[^X>]+\n", text))
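If sequences may be wrapped over several lines, a record-by-record filter is an alternative to the regex. The sketch below is one possible way to do it with the standard library only; the function names `fasta_records` and `drop_records_with` and the tiny example records are made up for illustration. Only the sequence is checked, so an X in a header line does not cause a record to be dropped.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs; sequence lines of a record are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_records_with(lines, letter="X"):
    """Keep only the records whose sequence does not contain the letter."""
    return [(h, s) for h, s in fasta_records(lines) if letter not in s.upper()]


# Made-up records: seq1 and seq3 contain X, seq2 is wrapped over two lines
example = """>seq1
MFVXXXXA
>seq2
MFVF
LVLL*
>seq3
MKT
AXA
"""
kept = drop_records_with(example.splitlines())
for header, seq in kept:
    print(header)
    print(seq)
```

Because `fasta_records` only iterates over lines, passing an open file object instead of `example.splitlines()` should work the same way without loading the whole file as one string.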
QUESTION
How to get the words within the first single quote in r using regex?
Asked 2021-Oct-04 at 22:27
For example, I have two strings:
stringA = "'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', 'Ion Torrent PGM', 'Illumina HiSeq X Ten', 'Illumina HiSeq 4000', 'Illumina NextSeq', 'Complete Genomics', 'Illumina Genome Analyzer II']"
I am not familiar with regex and am stuck trying to extract the words within the first pair of single quotes.
Expected
## do regex here
gsub("'(.*)'", "\\1", stringA) # not working

> "contentX"
ANSWER
Answered 2021-Oct-04 at 22:27
For your example your pattern would be:
gsub("^'(.*?)'.*", "\\1", stringA)
https://regex101.com/r/bs3lwJ/1
First we assert we're at the beginning of the string and that the following character is a single quote with ^'. Then we capture everything up until the next single quote in group 1, using (.*?)'. The trailing .* matches the rest of the string, so the whole string is replaced by group 1.
Note that we need the ? in .*? otherwise .* will be "greedy" and match all the way through to the last occurrence of a single quote, rather than the next single quote.
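The difference between the greedy and the lazy quantifier can also be seen in Python, whose `re.sub` plays the same role here as `gsub` in R. This is only a side-by-side illustration on a shortened version of the string from the question; the variable names are made up for the example.

```python
import re

# Shortened version of the string from the question
string_a = ("'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', "
            "'Ion Torrent PGM']")

# Greedy: (.*) runs on to the last single quote in the string
greedy = re.sub(r"^'(.*)'.*", r"\1", string_a)

# Lazy: (.*?) stops at the first closing single quote
lazy = re.sub(r"^'(.*?)'.*", r"\1", string_a)

print(greedy)
print(lazy)  # contentX
```

The greedy version keeps everything up to the final quoted item, which is why the original attempt in the question did not return just the first quoted word.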
QUESTION
Does Apache Spark 3 support GPU usage for Spark RDDs?
Asked 2021-Sep-23 at 05:53
I am currently trying to run genomic analysis pipelines using Hail (a library for genomic analyses written in Python and Scala). Recently, Apache Spark 3 was released with support for GPU usage.
I tried using the spark-rapids library to start an on-premise Slurm cluster with GPU nodes. I was able to initialise the cluster. However, when I tried running Hail tasks, the executors kept getting killed.
When I asked on the Hail forum, I got this response:
That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any Spark-SQL interfaces, only the RDD interfaces.
So, does Spark 3 not support GPU usage for RDD interfaces?
ANSWER
Answered 2021-Sep-23 at 05:53
As of now, spark-rapids doesn't support GPU usage for RDD interfaces.
Source: Link
Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. This requires no API changes from the user. The plugin will replace SQL operations it supports with GPU accelerated versions. If an operation is not supported it will fall back to using the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs directly.
Here is an answer from the spark-rapids team:
Source: Link
We do not support running the RDD API on GPUs at this time. We only support the SQL/Dataframe API, and even then only a subset of the operators. This is because we are translating individual Catalyst operators into GPU enabled equivalent operators. I would love to be able to support the RDD API, but that would require us to be able to take arbitrary java, scala, and python code and run it on the GPU. We are investigating ways to try to accomplish some of this, but right now it is very difficult to do. That is especially true for libraries like Hail, which use python as an API, but the data analysis is done in C/C++.
QUESTION
Aggregating and summing columns across 1500 files by matching IDs in R (or bash)
Asked 2021-Sep-07 at 13:09
I have 1500 files with the same format (the .scount file format from PLINK2, https://www.cog-genomics.org/plink/2.0/formats#scount). An example is below:
#IID HOM_REF_CT HOM_ALT_SNP_CT HET_SNP_CT DIPLOID_TRANSITION_CT DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT DIPLOID_SINGLETON_CT HAP_REF_INCL_FEMALE_Y_CT HAP_ALT_INCL_FEMALE_Y_CT MISSING_INCL_FEMALE_Y_CT
LP5987245 10 0 6 53 0 52 0 67 70 32
LP098324 34 51 10 37 100 12 59 11 49 0
LP908325 0 45 39 54 68 48 51 58 31 2
LP0932325 7 72 0 2 92 64 13 52 0 100
LP08324 92 93 95 39 23 0 27 75 49 14
LP034252 85 46 10 69 20 8 80 81 94 23
In reality each file has 80000 IIDs and is roughly 1-10MB in size. Each IID is unique and found once per file.
I would like to create a single file matched by IID with each column value summed. The column names are the same across files.
I have tried:
fnames <- list.files(pattern = "\\.scount")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(IID ~ , data = df_all, sum)
But this is really slow for the number of files and the # at the start of the #IID column is a real pain to work around.
Any help would be greatly appreciated
ANSWER
Answered 2021-Sep-07 at 11:10
A tidyverse solution:
# df is the example table above; df2 and df3 stand in for the other files
df2 <- df
df3 <- df

df_list <- list(df, df2, df3)

df_all <- do.call(rbind, df_list)

library(dplyr)

df_all %>%
  group_by(IID) %>%
  summarise_all(sum)
A solution with data.table:
df_list <- list(df, df2, df3)

df_all <- do.call(rbind, df_list)

library(data.table)

setDT(df_all)
df_all[, lapply(.SD, sum), by = IID]
By default, read.table treats # as a comment character (comment.char = "#"). To get around the leading '#', see "Cannot read file with "#" and space using read.table or read.csv in R".
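Since the question also allows non-R approaches, here is a rough standard-library Python sketch of the same aggregation. It keeps one running total per IID instead of binding all rows together first. The function name `sum_counts`, the temporary files, and their reduced three-column contents are made up for the example; real .scount files would have all eleven columns.

```python
import os
import tempfile


def sum_counts(paths):
    """Return (column_names, {IID: [summed counts]}) across all files."""
    columns = None
    totals = {}
    for path in paths:
        with open(path) as handle:
            # The header starts with '#'; strip it so the first column is 'IID'
            header = handle.readline().lstrip("#").split()
            if columns is None:
                columns = header
            elif header != columns:
                raise ValueError(f"unexpected header in {path}")
            for line in handle:
                fields = line.split()
                if not fields:
                    continue
                iid, counts = fields[0], [int(v) for v in fields[1:]]
                if iid in totals:
                    totals[iid] = [a + b for a, b in zip(totals[iid], counts)]
                else:
                    totals[iid] = counts
    return columns, totals


# Two tiny made-up files stand in for the 1500 real ones
tmpdir = tempfile.mkdtemp()
contents = [
    "#IID HOM_REF_CT HET_SNP_CT\nLP5987245 10 6\nLP098324 34 10\n",
    "#IID HOM_REF_CT HET_SNP_CT\nLP098324 1 2\nLP5987245 5 5\n",
]
paths = []
for i, text in enumerate(contents):
    path = os.path.join(tmpdir, f"part{i}.scount")
    with open(path, "w") as out:
        out.write(text)
    paths.append(path)

columns, totals = sum_counts(paths)
print(columns)
for iid, counts in totals.items():
    print(iid, *counts)
```

Rows are matched by IID, so the files do not need to list the IIDs in the same order.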
QUESTION
Usage of compression IO functions in apache arrow
Asked 2021-Jun-02 at 18:58
I have been implementing a suite of RecordBatchReaders for a genomics toolset. The standard unit of work is a RecordBatch. I ended up implementing a lot of my own compression and IO tools instead of using the existing utilities in the Arrow C++ platform because I was confused about them. Are there any clear examples of using the existing compression and file IO utilities to simply get a file stream that inflates standard zlib data? Also, an object diagram for the C++ platform would be helpful for ramping up.
ANSWER
Answered 2021-Jun-02 at 18:58
Here is an example program that inflates a compressed zlib file and reads it as CSV.
#include <iostream>

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/util/compression.h>
#include <arrow/util/logging.h>

arrow::Status RunMain(int argc, char **argv) {

  if (argc < 2) {
    return arrow::Status::Invalid(
        "You must specify a gzipped CSV file to read");
  }

  std::string file_to_read = argv[1];
  ARROW_ASSIGN_OR_RAISE(auto in_file,
                        arrow::io::ReadableFile::Open(file_to_read));
  ARROW_ASSIGN_OR_RAISE(auto codec,
                        arrow::util::Codec::Create(arrow::Compression::GZIP));
  ARROW_ASSIGN_OR_RAISE(
      auto compressed_in,
      arrow::io::CompressedInputStream::Make(codec.get(), in_file));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  ARROW_ASSIGN_OR_RAISE(
      auto table_reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(),
                                    std::move(compressed_in), read_options,
                                    parse_options, convert_options));

  ARROW_ASSIGN_OR_RAISE(auto table, table_reader->Read());
  std::cout << "The table had " << table->num_rows() << " rows and "
            << table->num_columns() << " columns." << std::endl;

  return arrow::Status::OK();
}

int main(int argc, char **argv) {
  arrow::Status st = RunMain(argc, argv);
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
Compression is handled in different ways in different parts of Arrow. The file readers typically accept an arrow::io::InputStream. You should be able to use arrow::io::CompressedInputStream to wrap an arrow::io::InputStream with decompression. This gives you whole-file compression. This is fine for something like CSV.
For Parquet, this approach does not work (ParquetFileReader::Open expects arrow::io::RandomAccessFile). For IPC, this approach is inefficient (unless you are reading the entire file). Effective reading of these formats involves seekable reads, which are not possible with whole-file compression. Both formats support their own format-specific compression options. You only need to specify these options on write. On read, the compression will be detected from the metadata of the file itself (the metadata is stored uncompressed). If you are writing data, you can find the information in parquet::ArrowWriterProperties and arrow::ipc::WriteOptions.
Since whole-file compression is still a thing for CSV, the datasets API has recently (as of 4.0.0) added support for detecting compression from file extensions for CSV datasets. More details can be found here.
As for documentation and an object diagram, those are excellent topics for the user mailing list, or you are welcome to provide a pull request.
Community Discussions contain sources that include Stack Exchange Network