Genomics

Explore all libraries in Genomics

This topic is related to Genetic Computation / Evolutionary Computation.

Popular New Releases in Genomics

deepvariant

DeepVariant 1.3.0

pandarallel

Fix rolling_groupby and expanding_groupby for pandas 1.3.0

OpenWorm

0.9.1 Dockerfile release

STAR

Alpha release: bug fix

gatk

4.2.6.1

Popular Libraries in Genomics

data-science-at-the-command-line

by jeroenjanssens html

3036 NOASSERTION

Data Science at the Command Line

biopython

by biopython python

2836 NOASSERTION

Official git repository for Biopython (originally converted from CVS)

deepvariant

by google python

2390 BSD-3-Clause

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

pandarallel

by nalepae python

2064 BSD-3-Clause

A simple and efficient tool to parallelize Pandas operations on all available CPUs

OpenWorm

by openworm python

1603 MIT

Repository for the main Dockerfile with the Openworm software stack and project-wide issues

bioconda-recipes

by bioconda shell

1383 MIT

Conda recipes for the bioconda channel.

STAR

by alexdobin c

1227 MIT

RNA-seq aligner

gatk

by broadinstitute java

1216 NOASSERTION

Official code repository for GATK versions 4 and up

samtools

by samtools c

1186 NOASSERTION

Tools (written in C using htslib) for manipulating next-generation sequencing data

Trending New libraries in Genomics

pangolin

by cov-lineages python

340 GPL-3.0

Software package for assigning SARS-CoV-2 genome sequences to global lineages.

poly

by TimothyStiles go

257 MIT

A Go package for engineering organisms.

Trycycler

by rrwick python

176 GPL-3.0

A tool for generating consensus long-read assemblies for bacterial genomes

CellChat

by sqjin r

176 GPL-3.0

R toolkit for inference, visualization and analysis of cell-cell communication from single-cell data

RagTag

by malonge python

171 MIT

Tools for fast and flexible genome assembly scaffolding and improvement

cellrank

by theislab python

169 BSD-3-Clause

CellRank for directed single-cell fate mapping

WFA

by smarco c

165 NOASSERTION

Wavefront alignment algorithm (WFA): Fast and exact gap-affine pairwise alignment

merqury

by marbl shell

150 NOASSERTION

k-mer based assembly evaluation

whatshap

by whatshap c++

145 MIT

Read-based phasing of genomic variants, also called haplotype assembly

Top Authors in Genomics

NCBI-Hackathons

66 Libraries

480

broadinstitute

56 Libraries

3776

cran

31 Libraries

129

bcgsc

28 Libraries

972

lh3

28 Libraries

5317

tseemann

27 Libraries

1543

theislab

25 Libraries

2057

ekg

25 Libraries

448

nanoporetech

22 Libraries

1057

sanger-pathogens

22 Libraries

1161

Trending Kits in Genomics

8 best Java Genomics libraries

Java, a programming language created by Sun Microsystems, is a popular choice for many bioinformatics projects because of its platform independence and versatility. It is one of the most widely used programming languages in the world, and many open source Java genomics libraries are available. The field of genomics is growing rapidly, and these libraries provide algorithms and data structures for working with biological data in Java. Some of the most popular Java Genomics open source libraries among developers are: igv - Integrative Genomics Viewer; cbioportal - cBioPortal for Cancer Genomics; gridss - GRIDSS: the Genomic Rearrangement IDentification Software Suite.

6 best C# Genomics libraries

Genomics is the study of genes and their functions, and it involves intensive data and information processing. The ability of computers to analyze and interpret DNA sequences is a crucial necessity in the field. With the huge increase in the amount of genetic data, genomics has become one of the most important fields of study in modern medicine. Open source libraries have made programming with genomic data easier and more accessible. They provide a wide range of capabilities, from nucleotide sequence manipulation to reading and writing a variety of file formats. Some of the C# Genomics open source libraries that developers tend to use are: sharpneat - SharpNEAT Evolution of Neural Networks; Nirvana - The nimble & robust variant annotator; CromwellOnAzure - Microsoft Genomics supported implementation; BLSS - unique bioinformatics tools for the brave explorer.

6 best Ruby Genomics libraries

Ruby's flexible and dynamic nature makes it a good fit for bioinformatics and genomics projects. It is easy to use, highly expressive, and supports both object-oriented and functional programming styles. It is also a good choice for quick scripting tasks. While the last decade has seen the growth of high-performance computing in bioinformatics, the processing power available to most researchers is still limited. Ruby genomics libraries are designed for working with biological data, in applications including machine learning, epidemiology, and systems biology. Popular Ruby Genomics open source libraries for developers include: sequenceserver - Intuitive local web frontend for the BLAST bioinformatics tool; dgidb - Rails frontend to The Genome Institute; nimbus - Ruby gem to implement Random Forest algorithms.

14 best Python Genomics libraries

Python has become a primary language in the field of bioinformatics and computational biology. It is widely used for scientific computing, data analysis, and analytics, including by mathematicians and statisticians who build data-driven applications. The advent of next-generation sequencing technologies has enabled a revolution in genomic research, and genomics is a rapidly growing field with many new tools and techniques being developed every year. Python is popular in the genomics and bioinformatics community because it provides a high level of abstraction, a large number of available packages, and good visualization tools. It can be used for applications such as data analysis, statistical analysis, simulation, and visualization. A few of the most popular Python Genomics open source libraries for developers are: deepvariant - analysis pipeline that uses a deep neural network; hail - Scalable genomic data analysis; pyGenomeTracks - python module to plot beautiful.

5 best C++ Genomics libraries

C++ is an object-oriented programming language that is fast, efficient, and powerful. It is one of the most popular languages for implementing and distributing bioinformatics software. Genomics is the study of genes and their functions. It includes the sequencing and analysis of genomes, which are complete sets of DNA within a single cell of an organism. With the proliferation of next-generation sequencing (NGS), the amount of DNA sequence data generated has increased exponentially. This has led to the development of new tools and algorithms to handle these enormous volumes of data. Several open source libraries allow developers to build genomic analysis tools quickly without having to start from scratch. Popular C++ Genomics open source libraries available for developers include: nucleus - Python and C++ code for reading and writing genomics data; abyss - Assemble large genomes using short reads; vcftools - A set of tools written in Perl and C++ for working with VCF files, such as those generated by the 1000 Genomes Project.

6 best Go Genomics libraries

Go genomics libraries support writing and running bioinformatics pipelines and working with biological sequence data, with a focus on performance, interoperability and clean abstractions. Genomics is the field of molecular biology that focuses on the study of genomes. A genome is an organism's complete set of DNA, including all of its genes. A genome can be mapped through various means, and a map in turn simplifies the identification and isolation of specific genes for further analysis. Knowledge about a genome can also be used to identify genetic diseases and genetic predispositions to various diseases within a population. Popular Go Genomics open source libraries include: arvados - open source platform; goleft - bioinformatics tools distributed under MIT license; lollipops - Lollipop-style mutation diagrams for annotating genetic variations.

11 best JavaScript Genomics libraries

JavaScript has many modern libraries that can be used to develop web applications. It is a high-level programming language that is easy to learn and use and is supported by all major browsers. Genomics is a branch of molecular biology concerned with the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes. Genomics aims at the collective characterization and quantification of genes, which direct the production of proteins with the aid of enzymes and messenger molecules. Genomics also involves the sequencing and analysis of genomes through the use of high-throughput DNA sequencing and bioinformatics. Some of the most widely used open source libraries for JavaScript Genomics among developers include: igv.js - Embeddable genomic visualization component based; jbrowse - A modern genome browser built with JavaScript and HTML5; dna2json - Formats your genome file as JSON.

Trending Discussions on Genomics

search for regex match between two files using python

Is there a way to permute inside using to variables in bash?

BigQuery Regex to extract string between two substrings

how to stop letter repeating itself python

Split multiallelic to biallelic in vcf by plink 1.9 and its variant name

Delete specific letter in a FASTA sequence

How to get the words within the first single quote in r using regex?

Does Apache Spark 3 support GPU usage for Spark RDDs?

Aggregating and summing columns across 1500 files by matching IDs in R (or bash)

Usage of compression IO functions in apache arrow

QUESTION

search for regex match between two files using python

Asked 2022-Apr-09 at 00:49

I'm working with two text files that look like this. File 1:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCA_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1         na
GCF_001263815.1 PRJNA224116 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCA_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1            na
GCF_001297745.1 PRJNA224116 SAMD00040429    BCBV00000000.1  na  837 837 Porphyromonas gingivalis    strain=Ando     latest  Scaffold    Major   Full    2015/09/17  ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1            an
...

File 2:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         na
GCA_001263815.1 PRJNA276132 SAMN03366764        na  837 837 Porphyromonas gingivalis    strain=A7436        latest  Complete Genome Major   Full    2015/08/11  ASM126381v1 University of Florida   GCF_001263815.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1            na

So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:

GCF_000739415.1

and file 2 this one:

GCA_000739415.1

The difference is the third character: F versus A, although sometimes the numbers differ as well. Both files contain many IDs like these, but they do not all pair up. My goal is to find the IDs that exist in only one file and not in the other. For example, GCF_001297745.1 appears in the third data row of file 1, but file 2 has no corresponding GCA_001297745.1.

I'm working on some Python code:

import re

# PART 1: Open and read text file
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()

# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))

# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")

Part 3 is meant to be a for loop that goes through the IDs from both files and prints those that exist in only one file and not in the other. Any ideas on how to write this loop?

ANSWER

Answered 2022-Apr-09 at 00:49

Perhaps you are after this?

import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
# The dot is escaped so that it matches only a literal "."
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
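Building on the extraction step above, the question's actual goal (IDs present in only one file) could be sketched with set differences. This is a minimal sketch, not the answerer's method: the helper names `accession_numbers` and `unmatched` are hypothetical, the sample strings are shortened stand-ins for the two files, and it assumes that a GCF_ ID and a GCA_ ID belong together only when the digits and the version after the underscore are identical. The pattern is anchored to the start of each line so that the paired accession column and the FTP path are not counted.

```python
import re

# Hypothetical helpers for illustration; only accessions that start a line are considered.
ACCESSION = re.compile(r"^(GC[AF])_(\d+)\.(\d+)", re.MULTILINE)

def accession_numbers(text, prefix):
    """Return the numeric parts (e.g. '000739415.1') of line-initial
    accessions in text that carry the given prefix ('GCF' or 'GCA')."""
    return {
        f"{number}.{version}"
        for found_prefix, number, version in ACCESSION.findall(text)
        if found_prefix == prefix
    }

def unmatched(gcf_text, gca_text):
    """Return (GCF IDs with no GCA partner, GCA IDs with no GCF partner)."""
    gcf = accession_numbers(gcf_text, "GCF")
    gca = accession_numbers(gca_text, "GCA")
    only_gcf = sorted(f"GCF_{n}" for n in gcf - gca)
    only_gca = sorted(f"GCA_{n}" for n in gca - gcf)
    return only_gcf, only_gca

# Shortened stand-ins for file 1 and file 2 from the question.
file_1 = """# assembly_accession\tbioproject
GCF_000739415.1\tPRJNA224116
GCF_001263815.1\tPRJNA224116
GCF_001297745.1\tPRJNA224116
"""
file_2 = """# assembly_accession\tbioproject
GCA_000739415.1\tPRJNA245225
GCA_001263815.1\tPRJNA276132
"""

only_gcf, only_gca = unmatched(file_1, file_2)
print(only_gcf)  # ['GCF_001297745.1']
print(only_gca)  # []
```

With real files, the two strings would come from reading each file's full text. If the version suffix is allowed to differ between the paired IDs, the comparison key would need to drop it.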

1#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
2# assembly_accession    bioproject  biosample   wgs_master  refseq_category taxid   species_taxid   organism_name   infraspecific_name  isolate version_status  assembly_level  release_type    genome_rep  seq_rel_date    asm_name    submitter   gbrs_paired_asm paired_asm_comp ftp_path    excluded_from_refseq    relation_to_type_material   asm_not_live_date
import re

given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406        na  837 837 Porphyromonas gingivalis    strain=HG66     latest  Chromosome  Major   Full    2014/08/14  ASM73941v1  University of Louisville    GCF_000739415.1 identical   https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1         an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"

# GC[A or F]_[one or more digits].[one or more digits]
regex = r"GC[AF]_\d+\.\d+"

matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")

Prints

GCA_000739415.1 is in both files
GCA_000739415.1 is in both files

But I would recommend:

# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))

Which saves as:

['GCA_000739415.1']

Note that the regex matches IDs of the form GC[A or F]_[one or more digits].[one or more digits]. Let me know if this is not what you are after.

Regex demo here

Edit

I believe you are after the symmetric difference of the sets for files 1 and 2, which is a fancy way of saying "things in A or B, but not in both".

This can be done with iteration:

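# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
# IDs found in file 1 but not in file 2
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
# IDs found in file 2 but not in file 1
for match in matches_2:
    if match not in matches_1:
        sym_dif.add(match)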

>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']

I think your mistakes were not using a set (you shouldn't have any duplicates) and comparing with matches_1 == matches_2. That compares the two whole lists, which won't be equal. Instead, check whether each match is absent from the other set.

Or use set notation, which is the preferred method:

>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
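To see the set operations side by side, here is a minimal sketch. The two input strings are made-up stand-ins for the contents of the two files, and the helper name `accessions` is hypothetical; only `re.findall` and the built-in set operators are relied on.

```python
import re

# Same pattern as in the answer: GCA_/GCF_ prefix, digits, a literal dot, a version number
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")


def accessions(text):
    """Return the set of accession IDs found anywhere in text (hypothetical helper)."""
    return set(ACCESSION_RE.findall(text))


# Made-up file contents, for illustration only
file_1 = "GCA_000000001.1 GCA_000000002.1 GCF_000000003.1"
file_2 = "GCA_000000002.1 GCF_000000004.2"

ids_1 = accessions(file_1)
ids_2 = accessions(file_2)

both = ids_1 & ids_2             # intersection: in both files
only_1 = ids_1 - ids_2           # difference: only in file 1
only_2 = ids_2 - ids_1           # difference: only in file 2
either_not_both = ids_1 ^ ids_2  # symmetric difference: in exactly one file

print(sorted(both))              # ['GCA_000000002.1']
print(sorted(either_not_both))   # ['GCA_000000001.1', 'GCF_000000003.1', 'GCF_000000004.2']
```

The symmetric difference is the union of the two one-sided differences, which is why a single loop over matches_1 alone would miss the IDs that appear only in the second file.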

Source https://stackoverflow.com/questions/71789818

QUESTION

Is there a way to permute inside using two variables in bash?

Asked 2021-Dec-09 at 23:50

I'm using the software plink2 (https://www.cog-genomics.org/plink/2.0/) and I'm trying to iterate over 3 variables.

This software accepts an input file with a .ped extension and an exclude file with a .txt extension, which contains a list of names to be excluded from the input file.

The idea is to iterate over the input files and then over the exclude files to generate individual output files.

Input files: Highland.ped - Midland.ped - Lowland.ped
Exclude-map files: HighlandMidland.txt - HighlandLowland.txt - MidlandLowland.txt
Output files: HighlandMidland - HighlandLowland - MidlandHighland - MidlandLowland - LowlandHighland - LowlandMidland

The general code is:

plink2 --file Highland --exclude HighlandMidland.txt --out HighlandMidland
plink2 --file Highland --exclude HighlandLowland.txt --out HighlandLowland
plink2 --file Midland --exclude HighlandMidland.txt --out MidlandHighland
plink2 --file Midland --exclude MidlandLowland.txt --out MidlandLowland
plink2 --file Lowland --exclude HighlandLowland.txt --out LowlandHighland
plink2 --file Lowland --exclude MidlandLowland.txt --out LowlandMidland

To avoid repeating this code 6 different times, I would like to use the variables listed above (1, 2 and 3) to create the individual output files. The output file names are a permutation with replacements of the input file names.

ANSWER

Answered 2021-Dec-09 at 23:50

Honestly, I think your current code is quite clear; but if you really want to write this as a loop, here's one possibility:

lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
  for (( j = i + 1 ; j < ${#lands[@]} ; ++j )) ; do
    plink2 --file "${lands[i]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[i]}${lands[j]}"
    plink2 --file "${lands[j]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[j]}${lands[i]}"
  done
done

and here's another:

lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
  for (( j = 0 ; j < ${#lands[@]} ; ++j )) ; do
    if [[ "$i" != "$j" ]] ; then
      # The exclude file is always named with the earlier land first
      plink2 \
        --file "${lands[i]}" \
        --exclude "${lands[i < j ? i : j]}${lands[i < j ? j : i]}.txt" \
        --out "${lands[i]}${lands[j]}"
    fi
  done
done

...but one thing both of the above have in common is that they're much less clear than your current code!
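
For comparison, here is a minimal Python sketch of the same pairing logic. It only builds the six argument lists and does not run plink2. The function name `plink_commands` is hypothetical, and the sketch assumes the exclude files are named with the two populations in the order they appear in `lands`, as in the commands above.

```python
from itertools import permutations

lands = ["Highland", "Midland", "Lowland"]

def plink_commands(names):
    """Build one plink2 argument list per ordered pair of distinct names."""
    order = {name: k for k, name in enumerate(names)}
    commands = []
    for a, b in permutations(names, 2):
        # The exclude file always lists the pair in the order used in `names`.
        first, second = sorted((a, b), key=order.get)
        commands.append([
            "plink2",
            "--file", a,
            "--exclude", f"{first}{second}.txt",
            "--out", f"{a}{b}",
        ])
    return commands

for command in plink_commands(lands):
    print(" ".join(command))
```

Each list could be passed to `subprocess.run` if you wanted to execute the commands from Python, but whether that is clearer than six explicit lines is a matter of taste.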

Source https://stackoverflow.com/questions/70298074

QUESTION

BigQuery Regex to extract string between two substrings

Asked 2021-Dec-09 at 01:11

From this example string:

{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}

I'm trying to extract the Stockcode only.

REGEXP_REPLACE(col, r".*,&q;Stockcode&q;:/([^/$]*)\,&q;.*", r"\1")

The result should be

1234

but my regex still returns the entire string.

ANSWER

Answered 2021-Dec-09 at 01:11

Use regexp_extract(col, r"&q;Stockcode&q;:([^/$]*?),&q;.*")

Applied to the sample data in your question, the output is 1234.
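
The same pattern can be tried outside BigQuery. Below is a minimal sketch using Python's `re` module, which is a different regex engine from the one BigQuery uses but treats this particular pattern the same way. The helper name `extract_stockcode` is hypothetical. As a side note, the REGEXP_REPLACE attempt in the question appears to fail because its pattern requires a literal `/` after the colon, which the sample string does not contain, so nothing matches and the input is returned unchanged.

```python
import re

# Sample string copied from the question.
sample = (
    '{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,'
    '&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}'
)

# Same pattern as in the answer: lazily capture everything after
# "&q;Stockcode&q;:" up to the next ",&q;".
pattern = re.compile(r"&q;Stockcode&q;:([^/$]*?),&q;.*")

def extract_stockcode(text):
    """Return the captured Stockcode value, or None when there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(extract_stockcode(sample))
```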

Source https://stackoverflow.com/questions/70283253

QUESTION

how to stop letter repeating itself python

Asked 2021-Nov-25 at 18:33

I am writing code that takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word in turn, check whether it contains all the characters of the input, and then check whether the lengths are the same. The problem is that when I enter a word such as helol, the l is checked twice, so I get some other outputs in addition to the correct one (hello). I know why this happens, but I can't find a fix for it.

import json

val = open("data.json")
val1 = json.load(val)  # loads the list


a = input("Enter a Jumbled word ")  # takes a word from user
a = list(a)  # changes into list to iterate


for x in val1:  # iterates words from list
    for somethin in a:  # iterates letters from list
        if somethin in list(x):  # checks if the letter is in the iterated word
            continue
        else:
            break
    else:  # checks if the loop ended correctly (that means word has same letters)
        if len(a) != len(list(x)):  # checks if it has same number of letters
            continue  # returns
        else:
            print(x)  # continues the loop to see if there are more like that

EDIT: many people asked for the JSON file, so here it is:

['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 
'written account', 'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern 
Quechua', 'safety pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', &quot;couldn't-care-less attitude&quot;, 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 
'entity–relationship model', 'lifecycle', 'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']

This is just an example; the real list contains many more items.

ANSWER

Answered 2021-Nov-25 at 18:33

As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.

# a and val1 are defined as in the question's code
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)

Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.

There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
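
The two suggestions above (pre-sorting the word list once, and ignoring spaces) could be combined roughly as follows. This is only a sketch: the names `signature`, `build_index` and `unjumble` are hypothetical, the lowercasing step is an extra assumption not mentioned in the answer, and the short word list is a stand-in for the contents of data.json ("pear" is added purely to show a key with two matches).

```python
from collections import defaultdict

def signature(term):
    """Sorted letters of the term, ignoring spaces and case (case is an assumption)."""
    return "".join(sorted(term.replace(" ", "").lower()))

def build_index(words):
    """Map each letter signature to every word in the list that shares it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index

def unjumble(jumbled, index):
    """Return all words whose letters match the jumbled input exactly."""
    return index.get(signature(jumbled), [])

# Stand-in for json.load(open("data.json")).
words = ["genomics", "plasmid", "ad hoc", "reap", "pear"]
index = build_index(words)

print(unjumble("scimoneg", index))
print(unjumble("aper", index))
print(unjumble("hocad", index))
```

Building the index costs one sort per word, after which each lookup is a single dictionary access, so repeated queries no longer re-sort the whole list.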

Source https://stackoverflow.com/questions/70112201

QUESTION

Split multiallelic to biallelic in vcf by plink 1.9 and its variant name

Asked 2021-Nov-17 at 13:56

I am trying to use PLINK 1.9 to split multiallelic variants into biallelic ones. The input is:

1       chr1:930939:G:A 0       930939  G       A
1       chr1:930947:G:A 0       930947  A       G
1       chr1:930952:G:A;chr1:930952:G:C 0       930952  A       G

What it produces is:

1       chr1:930939:G:A 0       930939  G       A
1       chr1:930947:G:A 0       930947  A       G
1       chr1:930952:G:A;chr1:930952:G:C 0       930952  A       G
1       chr1:930952:G:A;chr1:930952:G:C 0       930952  A       G

What I expect is:

1       chr1:930939:G:A 0       930939  G       A
1       chr1:930947:G:A 0       930947  A       G
1       chr1:930952:G:A 0       930952  A       G
1       chr1:930952:G:C 0       930952  A       G

Please help me produce a VCF, PED or MAP file like the one I expect. Thank you.

ANSWER

Answered 2021-Nov-17 at 09:45

I used bcftools to complete the task.

https://github.com/samtools/bcftools/issues/1193
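
The answer does not say which bcftools command was used (bcftools norm with its -m option is the usual way to split multiallelic records). Purely to illustrate the row layout the question asks for, here is a Python sketch that splits variant-table rows whose ID column joins several IDs with a semicolon. It is text manipulation only: the function name `split_multiallelic_rows` is hypothetical, the allele columns are copied unchanged as in the expected output above, and genotype data are not touched, so this is not a substitute for a proper tool.

```python
def split_multiallelic_rows(rows):
    """Emit one tab-separated row per ';'-joined variant ID in column 2."""
    result = []
    for row in rows:
        fields = row.split()
        chrom, variant_ids, rest = fields[0], fields[1], fields[2:]
        for variant_id in variant_ids.split(";"):
            result.append("\t".join([chrom, variant_id, *rest]))
    return result

# Rows taken from the question's input.
rows = [
    "1 chr1:930939:G:A 0 930939 G A",
    "1 chr1:930947:G:A 0 930947 A G",
    "1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G",
]

for line in split_multiallelic_rows(rows):
    print(line)
```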

Source https://stackoverflow.com/questions/70001737

QUESTION

Delete specific letter in a FASTA sequence

Asked 2021-Oct-12 at 21:00

I have a FASTA file with about 300,000 sequences, but some of the sequences look like this:

1&gt;Spike|hCoV-19/Wuhan/WH02/2019|2019-12-31|EPI_ISL_406799|Original|hCoV-19^^Wuhan|Human|General Hospital of Central Theater Command of People's Liberation Army of China|BGI &amp; Institute of Microbiology|Hunter|China
2MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVITEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
3
4&gt;Spike|hCoV-19/England/PORT-2DE4EF/2020|2020-00-00|EPI_ISL_1310367|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
5MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLDILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
6
7&gt;Spike|hCoV-19/England/PORT-2DE616/2020|2020-00-00|EPI_ISL_1310384|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
8MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGSAAYYVGYLQLRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYYLLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
9
10

I want to delete all the sequences that contain the letter X. How can I do that?

ANSWER

Answered 2021-Oct-12 at 20:28

You can match your non-X containing FASTA entries with the regex >.+\n[^X]+\n. This checks for a substring starting with > having a first line of anything (the FASTA header), which is followed by characters not containing an X until you reach a line break.

For example:

import re

# text is assumed to hold the full contents of the FASTA file
no_X_FASTA = "".join(re.findall(r">.+\n[^X]+\n", text))
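
One caveat with the regex: `[^X]+` also matches newlines and `>`, so when a clean record is immediately followed by one that contains X, the match can run on into the next record's header. A record-based filter avoids this. The sketch below is one possible way to do it with plain Python; the function names `read_fasta` and `drop_records_with_x` and the tiny example records are hypothetical, and treating lowercase x the same as X is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def drop_records_with_x(lines):
    """Keep only records whose sequence contains no X (case-insensitive)."""
    return [(h, s) for h, s in read_fasta(lines) if "X" not in s.upper()]

# Made-up records, much shorter than real spike sequences.
example = """>clean_record
MFVFLVLL

>masked_record
MFVXXXXLL

>another_clean
PLVSSQ*
""".splitlines()

for header, sequence in drop_records_with_x(example):
    print(header)
    print(sequence)
```

For a real file you would pass an open file object instead of `example`, which keeps memory use low because records are processed one at a time.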

Source https://stackoverflow.com/questions/69545912

QUESTION

How to get the words within the first single quote in r using regex?

Asked 2021-Oct-04 at 22:27

For example, I have two strings:

stringA = "'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', 'Ion Torrent PGM', 'Illumina HiSeq X Ten', 'Illumina HiSeq 4000', 'Illumina NextSeq', 'Complete Genomics', 'Illumina Genome Analyzer II']"

I am not familiar with regex and am stuck trying to extract the words within the first pair of single quotes.

Expected

## do regex here
gsub("'(.*)'", "\\1", stringA) # not working

> "contentX"

ANSWER

Answered 2021-Oct-04 at 22:27

For your example your pattern would be:

gsub("^'(.*?)'.*", "\\1", stringA)

https://regex101.com/r/bs3lwJ/1

First we assert we're at the beginning of the string and that the following character is a single quote with ^'. Then we capture everything up until the next single quote in group 1, using (.*?)'.

Note that we need the ? in .*?; otherwise .* will be "greedy" and match all the way through to the last occurrence of a single quote, rather than the next single quote.
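
The greedy versus lazy difference is easy to see side by side. The sketch below uses Python's `re` module rather than R, with a shortened version of the string from the question; the variable names are hypothetical.

```python
import re

# Shortened version of stringA from the question.
string_a = "'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', 'Ion Torrent PGM']"

# Lazy: stop at the first closing quote, then discard the rest of the string.
lazy_result = re.sub(r"^'(.*?)'.*", r"\1", string_a)

# Greedy: the capture runs on to the last single quote in the string.
greedy_capture = re.search(r"'(.*)'", string_a).group(1)

print(lazy_result)
print(greedy_capture)
```

The lazy version yields only contentX, while the greedy capture runs from contentX through to Ion Torrent PGM.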

Source https://stackoverflow.com/questions/69442717

QUESTION

Does Apache Spark 3 support GPU usage for Spark RDDs?

Asked 2021-Sep-23 at 05:53

I am currently trying to run genomic analyses pipelines using Hail(library for genomics analyses written in python and Scala). Recently, Apache Spark 3 was released and it supported GPU usage.

I tried using the spark-rapids library to start an on-premise Slurm cluster with GPU nodes. I was able to initialise the cluster. However, when I tried running Hail tasks, the executors kept getting killed.

When I asked on the Hail forum, I got this response:

That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any Spark-SQL interfaces, only the RDD interfaces.

So, does Spark 3 not support GPU usage for the RDD interfaces?

ANSWER

Answered 2021-Sep-23 at 05:53

As of now, spark-rapids doesn't support GPU usage for RDD interfaces.

Source: Link

Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. This requires no API changes from the user. The plugin will replace SQL operations it supports with GPU accelerated versions. If an operation is not supported it will fall back to using the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs directly.

Here is an answer from the spark-rapids team:

Source: Link

We do not support running the RDD API on GPUs at this time. We only support the SQL/Dataframe API, and even then only a subset of the operators. This is because we are translating individual Catalyst operators into GPU enabled equivalent operators. I would love to be able to support the RDD API, but that would require us to be able to take arbitrary java, scala, and python code and run it on the GPU. We are investigating ways to try to accomplish some of this, but right now it is very difficult to do. That is especially true for libraries like Hail, which use python as an API, but the data analysis is done in C/C++.

Source https://stackoverflow.com/questions/69273205

QUESTION

Aggregating and summing columns across 1500 files by matching IDs in R (or bash)

Asked 2021-Sep-07 at 13:09

I have 1500 files with the same format (the .scount file format from PLINK2 https://www.cog-genomics.org/plink/2.0/formats#scount), an example is below:

#IID    HOM_REF_CT  HOM_ALT_SNP_CT  HET_SNP_CT  DIPLOID_TRANSITION_CT   DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT   DIPLOID_SINGLETON_CT    HAP_REF_INCL_FEMALE_Y_CT    HAP_ALT_INCL_FEMALE_Y_CT    MISSING_INCL_FEMALE_Y_CT
LP5987245   10  0   6   53  0   52  0   67  70  32
LP098324    34  51  10  37  100 12  59  11  49  0
LP908325    0   45  39  54  68  48  51  58  31  2
LP0932325   7   72  0   2   92  64  13  52  0   100
LP08324     92  93  95  39  23  0   27  75  49  14
LP034252    85  46  10  69  20  8   80  81  94  23

In reality each file has 80000 IIDs and is roughly 1-10MB in size. Each IID is unique and found once per file.

I would like to create a single file matched by IID with each column value summed. The column names are the same across files.

I have tried:

fnames <- list.files(pattern = "\\.scount")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(IID ~ , data = df_all, sum)

But this is really slow for the number of files and the # at the start of the #IID column is a real pain to work around.

Any help would be greatly appreciated

ANSWER

Answered 2021-Sep-07 at 11:10

A tidyverse solution:

# df is one input table (the example from the question);
# df2 and df3 stand in for the other files
df2 <- df
df3 <- df

df_list <- list(df, df2, df3)

df_all <- do.call(rbind, df_list)

library(dplyr)

# sum every count column within each IID
df_all %>%
  group_by(IID) %>%
  summarise_all(sum)

A solution with data.table:

df_list <- list(df, df2, df3)

df_all <- do.call(rbind, df_list)

library(data.table)

# convert in place, then sum every count column within each IID
setDT(df_all)
df_all[, lapply(.SD, sum), by = IID]

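To deal with the leading '#', see "Cannot read file with "#" and space using read.table or read.csv in R". By default read.table treats # as a comment character; passing comment.char = "" disables this.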
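Both answers above bind all files into one table before grouping. As an alternative, here is a minimal Python sketch that keeps only one running total per IID while streaming through the files. It is not taken from the answers; it assumes the files are tab-delimited with integer counts and identical column order, and the function name sum_scount_files is hypothetical.

```python
import csv


def sum_scount_files(paths):
    """Sum every count column per IID across .scount-style files.

    Assumes tab-delimited files with one header row, integer counts,
    and the same column order in every file.
    """
    totals = {}     # IID -> list of running column sums
    columns = None  # count column names, taken from the first file
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            header = next(reader)
            # header[0] is "#IID"; the csv module treats "#" as ordinary
            # text, so the header needs no special handling here.
            if columns is None:
                columns = header[1:]
            elif header[1:] != columns:
                raise ValueError(f"unexpected columns in {path}")
            for row in reader:
                if not row:
                    continue  # skip blank lines
                iid = row[0]
                counts = [int(value) for value in row[1:]]
                if iid in totals:
                    running = totals[iid]
                    for i, count in enumerate(counts):
                        running[i] += count
                else:
                    totals[iid] = counts
    return columns, totals
```

Calling sum_scount_files(list_of_paths) returns the column names and a dict of per-IID sums. Memory use stays at one row per IID instead of one row per IID per file.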

Source https://stackoverflow.com/questions/69086946

QUESTION

Usage of compression IO functions in apache arrow

Asked 2021-Jun-02 at 18:58

I have been implementing a suite of RecordBatchReaders for a genomics toolset. The standard unit of work is a RecordBatch. I ended up implementing a lot of my own compression and IO tools instead of using the existing utilities in the arrow cpp platform because I was confused about them. Are there any clear examples of using the existing compression and file IO utilities to simply get a file stream that inflates standard zlib data? Also, an object diagram for the cpp platform would be helpful in ramping up.

ANSWER

Answered 2021-Jun-02 at 18:58

Here is an example program that inflates a compressed zlib file and reads it as CSV.

#include <iostream>
#include <string>   // std::string
#include <utility>  // std::move

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/util/compression.h>
#include <arrow/util/logging.h>

arrow::Status RunMain(int argc, char **argv) {

  if (argc < 2) {
    return arrow::Status::Invalid(
        "You must specify a gzipped CSV file to read");
  }

  // Open the raw file, then wrap it in a decompressing input stream.
  std::string file_to_read = argv[1];
  ARROW_ASSIGN_OR_RAISE(auto in_file,
                        arrow::io::ReadableFile::Open(file_to_read));
  ARROW_ASSIGN_OR_RAISE(auto codec,
                        arrow::util::Codec::Create(arrow::Compression::GZIP));
  ARROW_ASSIGN_OR_RAISE(
      auto compressed_in,
      arrow::io::CompressedInputStream::Make(codec.get(), in_file));

  // The CSV reader consumes the inflated stream like any other InputStream.
  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  ARROW_ASSIGN_OR_RAISE(
      auto table_reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(),
                                    std::move(compressed_in), read_options,
                                    parse_options, convert_options));

  ARROW_ASSIGN_OR_RAISE(auto table, table_reader->Read());
  std::cout << "The table had " << table->num_rows() << " rows and "
            << table->num_columns() << " columns." << std::endl;

  return arrow::Status::OK();
}

int main(int argc, char **argv) {
  arrow::Status st = RunMain(argc, argv);
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

Compression is handled in different ways in different parts of Arrow. The file readers typically accept an arrow::io::InputStream. You should be able to use arrow::io::CompressedInputStream to wrap an arrow::io::InputStream with decompression. This gives you whole-file compression. This is fine for something like CSV.
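The same wrapping pattern can be sketched with Python's standard library alone. This is an analogy for illustration, not the Arrow API: gzip.open stands in for the decompressing wrapper, csv.reader for the file reader, and the function name read_gzipped_csv is hypothetical.

```python
import csv
import gzip
import io


def read_gzipped_csv(raw_stream):
    """Wrap a binary stream with gzip decompression and parse it as CSV."""
    # gzip.open accepts an existing file object; the reader on top of it
    # only ever sees the inflated text, read sequentially.
    with gzip.open(raw_stream, mode="rt", encoding="utf-8", newline="") as text:
        return list(csv.reader(text))


# Usage: compress a small CSV in memory and read it back through the wrapper.
payload = gzip.compress(b"id,count\nA,1\nB,2\n")
rows = read_gzipped_csv(io.BytesIO(payload))
print(len(rows) - 1, "data rows and", len(rows[0]), "columns")
```

The reader never needs to know the input was compressed, which is why whole-file compression suits sequentially read formats such as CSV.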

For Parquet, this approach does not work (ParquetFileReader::Open expects an arrow::io::RandomAccessFile). For IPC, this approach is inefficient (unless you are reading the entire file). Reading these formats efficiently requires seekable reads, which are not possible with whole-file compression. Both formats support their own format-specific compression options. You only need to specify these options on write; on read, the compression is detected from the metadata of the file itself (the metadata is stored uncompressed). If you are writing data, you can find the information in parquet::ArrowWriterProperties and arrow::ipc::WriteOptions.

Since whole-file compression is still common for CSV, the datasets API has recently (as of 4.0.0) added support for detecting compression from file extensions for CSV datasets. More details can be found here.

As for documentation and an object diagram, those are excellent topics for the user mailing list, or you are welcome to provide a pull request.

Source https://stackoverflow.com/questions/67799265

Community Discussions contain sources that include Stack Exchange Network