smoove | structural variant | Genomics library
Trending Discussions on Genomics
QUESTION
I'm working with two text files that look like this: File 1
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCF_000739415.1 PRJNA224116 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCA_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/739/415/GCF_000739415.1_ASM73941v1 na
GCF_001263815.1 PRJNA224116 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCA_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/815/GCF_001263815.1_ASM126381v1 na
GCF_001297745.1 PRJNA224116 SAMD00040429 BCBV00000000.1 na 837 837 Porphyromonas gingivalis strain=Ando latest Scaffold Major Full 2015/09/17 ASM129774v1 Lab. of Plant Genomics and Genetics, Department of Plant Genome Research, Kazusa DNA Research Institute GCA_001297745.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/297/745/GCF_001297745.1_ASM129774v1 an
...
File 2:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 na
GCA_001263815.1 PRJNA276132 SAMN03366764 na 837 837 Porphyromonas gingivalis strain=A7436 latest Complete Genome Major Full 2015/08/11 ASM126381v1 University of Florida GCF_001263815.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/263/815/GCA_001263815.1_ASM126381v1 na
So, I want to search for a specific pattern using regex. For example, file 1 has this pattern:
GCF_000739415.1
and file 2 this one:
GCA_000739415.1
The difference is the third character: F versus A. However, sometimes the numbers differ too, and the files differ in their data rows. These two files share many patterns like the one above, but there are some differences. My goal is to find the patterns that exist in only one of the two files. For example, GCF_001297745.1 is in the third data row of file 1 but not in file 2, where it would appear as GCA_001297745.1.
I'm working on this Python code:
import re
# PART 1: Open and read text files
with open("assembly_summary_genbank.txt", 'r') as f_1:
    contents_1 = f_1.readlines()
with open("assembly_summary_refseq.txt", 'r') as f_2:
    contents_2 = f_2.readlines()
# PART 2: Search for IDs
matches_1 = re.findall(r"GCF_[0-9]*\.[0-9]", str(contents_1))
matches_2 = re.findall(r"GCA_[0-9]*\.[0-9]", str(contents_2))
# PART 3: Match between files
# Pseudocode
for line in matches_1:
    if matches_1 == matches_2:
        print("PATTERN THAT ONLY EXIST IN ONE FILE")
Part 3 refers to doing a for loop that searches for each line in both files and prints the patterns that only exist in one file and not in the other one. Any idea for doing this for loop?
ANSWER
Answered 2022-Apr-09 at 00:49
Perhaps you are after this?
import re
given_example = "GCA_000739415.1 PRJNA245225 SAMN02732406 na 837 837 Porphyromonas gingivalis strain=HG66 latest Chromosome Major Full 2014/08/14 ASM73941v1 University of Louisville GCF_000739415.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/739/415/GCA_000739415.1_ASM73941v1 an"
altered_example = "GCA_000739415.1 GCTEST_000739415.1"
# GC[A or F]_[number; one or more digits].[number; one or more digits]
regex = r"GC[AF]_\d+\.\d+"
matches_1 = re.findall(regex, given_example)
matches_2 = re.findall(regex, altered_example)
# Iteration for intersection
for match in matches_1:
    if match in matches_2:
        print(f"{match} is in both files")
Prints
GCA_000739415.1 is in both files
GCA_000739415.1 is in both files
But I would recommend:
# The preferred method for intersection, where order is not important
matches = list(set(matches_1) & set(matches_2))
Which gives:
['GCA_000739415.1']
Note the regex matches have the form GC[A or F]_[number; one or more digits].[number; one or more digits]. Let me know if this is not what you are after.
Edit
I believe you are after the symmetric difference of sets for files 1 and 2, which is a fancy way of saying "things in A or B that are not in both".
This can be done with iteration:
# Iteration
# A set has no duplicates, and is unordered
sym_dif = set()
for match in matches_1:
    if match not in matches_2:
        sym_dif.add(match)
>>> list(sym_dif)
['GCF_001297745.1', 'GCA_001297745.1']
I think your mistake was not using a set (you shouldn't have any duplicates) and using matches_1 == matches_2: the lists won't be the same. You should instead check whether each match is not in the other set.
Or using this set notation which is the preferred method:
>>> list(set(matches_1).symmetric_difference(set(matches_2)))
['GCF_001297745.1', 'GCA_001297745.1']
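Putting the pieces together, here is a minimal end-to-end sketch; it assumes the two files from the question (assembly_summary_genbank.txt and assembly_summary_refseq.txt) are in the working directory and collects both GCA_ and GCF_ accessions from each:
import re

pattern = re.compile(r"GC[AF]_\d+\.\d+")

def accessions(path):
    # Collect every GCA_/GCF_ accession in the file into a set (no duplicates)
    with open(path) as fh:
        return set(pattern.findall(fh.read()))

ids_1 = accessions("assembly_summary_genbank.txt")
ids_2 = accessions("assembly_summary_refseq.txt")

# Accessions that appear in exactly one of the two files
print(sorted(ids_1.symmetric_difference(ids_2)))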
QUESTION
I'm using the software plink2 (https://www.cog-genomics.org/plink/2.0/) and I'm trying to iterate over 3 variables.
This software accepts an input file with a .ped extension and an exclude file with a .txt extension, which contains a list of names to be excluded from the input file.
The idea is to iterate over the input files and then over the exclude files to generate single output files.
- Input files: Highland.ped, Midland.ped, Lowland.ped
- Exclude-map files: HighlandMidland.txt, HighlandLowland.txt, MidlandLowland.txt
- Output files: HighlandMidland, HighlandLowland, MidlandHighland, MidlandLowland, LowlandHighland, LowlandMidland
The general code is:
plink2 --file Highland --exclude HighlandMidland.txt --out HighlandMidland
plink2 --file Highland --exclude HighlandLowland.txt --out HighlandLowland
plink2 --file Midland --exclude HighlandMidland.txt --out MidlandHighland
plink2 --file Midland --exclude MidlandLowland.txt --out MidlandLowland
plink2 --file Lowland --exclude HighlandLowland.txt --out LowlandHighland
plink2 --file Lowland --exclude MidlandLowland.txt --out LowlandMidland
To avoid repeating this code six different times I would like to use the variables listed above (1, 2 and 3) to create the single output files. The output file names are the ordered pairs of distinct input file names.
ANSWER
Answered 2021-Dec-09 at 23:50
Honestly, I think your current code is quite clear; but if you really want to write this as a loop, here's one possibility:
lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
    for (( j = i + 1 ; j < ${#lands[@]} ; ++j )) ; do
        plink2 --file "${lands[i]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[i]}${lands[j]}"
        plink2 --file "${lands[j]}" --exclude "${lands[i]}${lands[j]}.txt" --out "${lands[j]}${lands[i]}"
    done
done
and here's another:
lands=(Highland Midland Lowland)
for (( i = 0 ; i < ${#lands[@]} ; ++i )) ; do
    for (( j = 0 ; j < ${#lands[@]} ; ++j )) ; do
        if [[ "$i" != "$j" ]] ; then
            plink2 \
                --file "${lands[i]}" \
                --exclude "${lands[i < j ? i : j]}${lands[i < j ? j : i]}.txt" \
                --out "${lands[i]}${lands[j]}"
        fi
    done
done
. . . but one common factor between both of the above is that they're much less clear than your current code!
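If you would rather drive the six plink2 runs from outside the shell, the same pairing logic can be sketched in Python with itertools.permutations; this assumes plink2 is on the PATH and the file names from the question:
import subprocess
from itertools import permutations

lands = ["Highland", "Midland", "Lowland"]

for a, b in permutations(lands, 2):
    # The shared exclude file is named with the pair in list order,
    # e.g. HighlandMidland.txt serves both Highland->Midland and Midland->Highland
    first, second = sorted((a, b), key=lands.index)
    exclude = f"{first}{second}.txt"
    subprocess.run(["plink2", "--file", a, "--exclude", exclude, "--out", f"{a}{b}"], check=True)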
QUESTION
From this example string:
{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}
I'm trying to extract the Stockcode only.
REGEXP_REPLACE(col, r".*,&q;Stockcode&q;:/([^/$]*)\,&q;.*", r"\1")
So the result should be
1234
however my Regex still returns the entire contents.
ANSWER
Answered 2021-Dec-09 at 01:11
Use regexp_extract(col, r"&q;Stockcode&q;:([^/$]*?),&q;.*")
If applied to the sample data in your question, the output is
1234
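For illustration only, the same lazy capture can be verified with Python's re module (the string below is copied from the question):
import re

s = '{&q;somerandomtext&q;:{&q;Product&q;:{&q;TileID&q;:0,&q;Stockcode&q;:1234,&q;variant&q;:&q;genomics&q;,&q;available&q;:0"}'

# Lazy group: capture everything after &q;Stockcode&q;: up to the next ,&q;
m = re.search(r"&q;Stockcode&q;:([^/$]*?),&q;", s)
print(m.group(1))  # 1234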
QUESTION
I am making a program which takes in a jumbled word and returns the unjumbled word. data.json contains a word list; I take a word one by one and check whether it contains all the characters of the input word, later checking whether the length is the same. The problem is that when I enter a word such as helol, the l is checked twice, giving me other outputs besides the intended one (hello). I know why this happens but I can't find a fix for it.
import json
val = open("data.json")
val1 = json.load(val)  # loads the list
a = input("Enter a Jumbled word ")  # takes a word from user
a = list(a)  # changes into list to iterate
for x in val1:  # iterates words from list
    for somethin in a:  # iterates letters from list
        if somethin in list(x):  # checks if the letter is in the iterated word
            continue
        else:
            break
    else:  # checks if the loop ended correctly (that means word has same letters)
        if len(a) != len(list(x)):  # checks if it has same number of letters
            continue  # returns
        else:
            print(x)  # continues the loop to see if there are more like that
EDIT: many people wanted the JSON file, so here it is
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of 
mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
Just an example; the dictionary is full of items.
ANSWER
Answered 2021-Nov-25 at 18:33
As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
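A minimal sketch of that precomputation, reusing val1 and the input prompt from the question:
from collections import defaultdict

# Build the index once: the key is the word's letters in sorted order,
# so all mutual anagrams share one key
anagram_index = defaultdict(list)
for word in val1:
    anagram_index["".join(sorted(word))].append(word)

# Each subsequent lookup is then a single dictionary access
jumbled = input("Enter a Jumbled word ")
for match in anagram_index["".join(sorted(jumbled))]:
    print(match)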
QUESTION
I am trying to use plink1.9 to split multiallelic variants into biallelic ones. The input is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
What it actually produced is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
1 chr1:930952:G:A;chr1:930952:G:C 0 930952 A G
What I expect is:
1 chr1:930939:G:A 0 930939 G A
1 chr1:930947:G:A 0 930947 A G
1 chr1:930952:G:A 0 930952 A G
1 chr1:930952:G:C 0 930952 A G
Please help me to make a vcf or ped or map file like what I expect. Thank you.
ANSWER
Answered 2021-Nov-17 at 09:45
I used bcftools to complete the task (bcftools norm with the -m- option splits multiallelic records into separate biallelic lines).
QUESTION
I have a FASTA file that has about 300000 sequences but some of the sequences are like these
>Spike|hCoV-19/Wuhan/WH02/2019|2019-12-31|EPI_ISL_406799|Original|hCoV-19^^Wuhan|Human|General Hospital of Central Theater Command of People's Liberation Army of China|BGI & Institute of Microbiology|Hunter|China
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVITEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/England/PORT-2DE4EF/2020|2020-00-00|EPI_ISL_1310367|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLDILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/England/PORT-2DE616/2020|2020-00-00|EPI_ISL_1310384|Original|hCoV-19^^England|Human|Centre for Enzyme Innovation|COVID-19 Genomics UK (COG-UK) Consortium|Robson|United Kingdom
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGSAAYYVGYLQLRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYYLLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
I want to delete all the sequences that contain the letter X in them. How can I do that?
ANSWER
Answered 2021-Oct-12 at 20:28
You can match your non-X-containing FASTA entries with the regex >.+\n[^X]+\n. This checks for a substring starting with > whose first line can be anything (the FASTA header), followed by characters not containing an X, until you reach a line break.
For example:
no_X_FASTA = "".join(re.findall(r">.+\n[^X]+\n",text))
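Applied to a whole file, that might look like the sketch below; the file names are placeholders, and the character class is tightened to [^X\n] so the sequence part of each match stays on a single line (as in the records shown above):
import re

with open("sequences.fasta") as fh:
    text = fh.read()

# Keep only entries whose single sequence line contains no X
no_X_fasta = "".join(re.findall(r">.+\n[^X\n]+\n", text))

with open("filtered.fasta", "w") as out:
    out.write(no_X_fasta)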
QUESTION
For example, I have two strings:
stringA = "'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', 'Ion Torrent PGM', 'Illumina HiSeq X Ten', 'Illumina HiSeq 4000', 'Illumina NextSeq', 'Complete Genomics', 'Illumina Genome Analyzer II']"
I am not familiar with regex and am stuck trying to extract the words within the first pair of single quotes.
Expected
## do regex here
gsub("'(.*)'", "\\1", stringA) # not working
> "contentX"
ANSWER
Answered 2021-Oct-04 at 22:27
For your example your pattern would be:
gsub("^'(.*?)'.*", "\\1", stringA)
https://regex101.com/r/bs3lwJ/1
First we assert that we're at the beginning of the string and that the following character is a single quote, with ^'. Then we capture everything up until the next single quote in group 1, using (.*?)'.
Note that we need the ? in .*?, otherwise .* will be "greedy" and match all the way through to the last occurrence of a single quote, rather than stopping at the next one.
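The greedy/lazy distinction is easy to demonstrate; here is a quick sketch in Python, whose regex engine behaves the same way on this pattern:
import re

stringA = ("'contentX' is not one of ['Illumina NovaSeq 6000', 'Other', "
           "'Ion Torrent PGM', 'Illumina HiSeq X Ten']")

# Lazy: stops at the first closing quote
print(re.match(r"^'(.*?)'", stringA).group(1))  # contentX

# Greedy: runs through to the last closing quote in the string
print(re.match(r"^'(.*)'", stringA).group(1))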
QUESTION
I am currently trying to run genomic analysis pipelines using Hail (a library for genomic analyses written in Python and Scala). Recently, Apache Spark 3 was released and it supports GPU usage.
I tried using the spark-rapids library to start an on-premise Slurm cluster with GPU nodes. I was able to initialise the cluster; however, when I tried running Hail tasks, the executors kept getting killed.
When I asked on the Hail forum, I got the response:
That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any Spark-SQL interfaces, only the RDD interfaces.
So, does Spark3 not support GPU usage for RDD interfaces?
ANSWER
Answered 2021-Sep-23 at 05:53
As of now, spark-rapids doesn't support GPU usage for RDD interfaces.
Source: Link
Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. This requires no API changes from the user. The plugin will replace SQL operations it supports with GPU accelerated versions. If an operation is not supported it will fall back to using the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs directly.
Here is an answer from the spark-rapids team:
Source: Link
We do not support running the RDD API on GPUs at this time. We only support the SQL/Dataframe API, and even then only a subset of the operators. This is because we are translating individual Catalyst operators into GPU enabled equivalent operators. I would love to be able to support the RDD API, but that would require us to be able to take arbitrary java, scala, and python code and run it on the GPU. We are investigating ways to try to accomplish some of this, but right now it is very difficult to do. That is especially true for libraries like Hail, which use python as an API, but the data analysis is done in C/C++.
QUESTION
I have 1500 files with the same format (the .scount file format from PLINK2 https://www.cog-genomics.org/plink/2.0/formats#scount), an example is below:
#IID HOM_REF_CT HOM_ALT_SNP_CT HET_SNP_CT DIPLOID_TRANSITION_CT DIPLOID_TRANSVERSION_CT DIPLOID_NONSNP_NONSYMBOLIC_CT DIPLOID_SINGLETON_CT HAP_REF_INCL_FEMALE_Y_CT HAP_ALT_INCL_FEMALE_Y_CT MISSING_INCL_FEMALE_Y_CT
LP5987245 10 0 6 53 0 52 0 67 70 32
LP098324 34 51 10 37 100 12 59 11 49 0
LP908325 0 45 39 54 68 48 51 58 31 2
LP0932325 7 72 0 2 92 64 13 52 0 100
LP08324 92 93 95 39 23 0 27 75 49 14
LP034252 85 46 10 69 20 8 80 81 94 23
In reality each file has 80000 IIDs and is roughly 1-10MB in size. Each IID is unique and found once per file.
I would like to create a single file matched by IID with each column value summed. The column names are the same across files.
I have tried:
fnames <- list.files(pattern = "\\.scount")
df_list <- lapply(fnames, read.table, header = TRUE)
df_all <- do.call(rbind, df_list)
x <- aggregate(IID ~ , data = df_all, sum)
But this is really slow for the number of files and the # at the start of the #IID column is a real pain to work around.
Any help would be greatly appreciated
ANSWER
Answered 2021-Sep-07 at 11:10
A tidyverse solution (df, df2 and df3 stand in for the data frames read from your files, as in your df_list):
df2 <- df
df3 <- df
df_list <- list(df, df2, df3)
df_all <- do.call(rbind, df_list)
library(dplyr)
df_all %>%
  group_by(IID) %>%
  summarise_all(sum)
A solution with data.table:
df_list <- list(df, df2, df3)
df_all <- do.call(rbind, df_list)
library(data.table)
setDT(df_all)
df_all[, lapply(.SD, sum), by = IID]
To ignore the '#' at the start of the header, see: Cannot read file with "#" and space using read.table or read.csv in R (in short, pass comment.char = "" to read.table).
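For comparison (this is not part of the original answer), the same merge-and-sum can be sketched in Python with pandas, which sidesteps the '#' problem by renaming the first column; the file names are assumed to match the question's pattern:
import glob
import pandas as pd

frames = []
for path in glob.glob("*.scount"):
    df = pd.read_csv(path, sep=r"\s+")
    # The leading '#' of the header lands on the first column name; strip it
    df = df.rename(columns={"#IID": "IID"})
    frames.append(df)

# Stack all files and sum every count column per IID
total = pd.concat(frames).groupby("IID", as_index=False).sum()
total.to_csv("combined.scount", sep="\t", index=False)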
QUESTION
I have been implementing a suite of RecordBatchReaders for a genomics toolset. The standard unit of work is a RecordBatch. I ended up implementing a lot of my own compression and IO tools instead of using the existing utilities in the arrow cpp platform because I was confused about them. Are there any clear examples of using the existing compression and file IO utilities to simply get a file stream that inflates standard zlib data? Also, an object diagram for the cpp platform would be helpful in ramping up.
ANSWER
Answered 2021-Jun-02 at 18:58
Here is an example program that inflates a compressed zlib file and reads it as CSV.
// NOTE: the include targets below are inferred from the calls in this program
#include <iostream>
#include <string>
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/util/compression.h>

arrow::Status RunMain(int argc, char **argv) {
  if (argc < 2) {
    return arrow::Status::Invalid(
        "You must specify a gzipped CSV file to read");
  }
  std::string file_to_read = argv[1];
  ARROW_ASSIGN_OR_RAISE(auto in_file,
                        arrow::io::ReadableFile::Open(file_to_read));
  ARROW_ASSIGN_OR_RAISE(auto codec,
                        arrow::util::Codec::Create(arrow::Compression::GZIP));
  ARROW_ASSIGN_OR_RAISE(
      auto compressed_in,
      arrow::io::CompressedInputStream::Make(codec.get(), in_file));
  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  ARROW_ASSIGN_OR_RAISE(
      auto table_reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(),
                                    std::move(compressed_in), read_options,
                                    parse_options, convert_options));
  ARROW_ASSIGN_OR_RAISE(auto table, table_reader->Read());
  std::cout << "The table had " << table->num_rows() << " rows and "
            << table->num_columns() << " columns." << std::endl;
  return arrow::Status::OK();
}

int main(int argc, char **argv) {
  arrow::Status st = RunMain(argc, argv);
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
Compression is handled in different ways in different parts of Arrow. The file readers typically accept an arrow::io::InputStream. You should be able to use arrow::io::CompressedInputStream to wrap an arrow::io::InputStream with decompression. This gives you whole-file compression, which is fine for something like CSV.
For Parquet, this approach does not work (ParquetFileReader::Open expects an arrow::io::RandomAccessFile). For IPC, this approach is inefficient (unless you are reading the entire file). Effective reading of these formats involves seekable reads, which is not possible with whole-file compression. Both formats support their own format-specific compression options. You only need to specify these options on write; on read, the compression will be detected from the metadata of the file itself (the metadata is stored uncompressed). If you are writing data you can find the information in parquet::ArrowWriterProperties and arrow::ipc::WriteOptions.
Since whole-file compression is still a thing for CSV the datasets API has recently (as of 4.0.0) added support for detecting compression from file extensions for CSV datasets. More details can be found here.
As for documentation and an object diagram, those are excellent topics for the user mailing list, or you are welcome to provide a pull request.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.