F4 | Assessing confidence in introgression among four populations | Genomics library
kandi X-RAY | F4 Summary
kandi X-RAY | F4 Summary
Assessing confidence in introgression among four populations. F4 calculates the f4-statistic from allele frequencies of four populations and uses coalescent simulations to test whether this value could be the result of incomplete lineage sorting. The f4-statistic was introduced by Reich et al. (2009) and is a powerful measure to distinguish introgression from incomplete lineage sorting, based on allele frequencies of four populations. With populations A, B, C, and D, and the assumed population topology (A,B),(C,D), the f4-statistic is calculated as the product of the difference of allele frequencies between A and B, and between C and D. Thus, if at a particular SNP, the frequency of the base "G" was 0.4 in population A, 1.0 in population B, 0.2 in population C, and 0.0 in population D, the f4-statistic for this SNP would be (0.4-1.0)x(0.2-0.0) = -0.6x0.2 = -0.12. With more than one SNP, the f4-statistic of the whole data set is simply the mean of the f4 values of all individual SNPs. What's interesting about this measure is that under incomplete lineage sorting alone, the allele frequency differences between A and B should be independent of those between C and D, and the f4-statistic should thus be zero. If there is introgression however between the two pairs of populations (e.g. A introgressed into C), this would lead to non-zero f4 values. The tricky part however is not to calculate the f4-statistic, but to assess support for it being different from zero, and thus for introgression. As implemented in the program fourpop which comes as part of the Treemix package (Pickrell & Pritchard 2012), block jackknifing is one way to do this, which is commonly used. This means that the data set is chopped into blocks of a particular size, and the f4-statistic is calculated individually for each of these blocks. By then comparing the overall f4 value to the standard error of f4 values taken from jackknife blocks, a z-score is calculated and serves as a measure of support. There are a few problems to this approach. One is that SNP data sets, e.g. those obtained by RAD sequencing often contain linked groups of SNPs that can confound the standard error of jackknife block f4 values. Further, the use of z-scores as support assumes that the underlying data is normally distributed, but often, f4 values of jackknife blocks are not. This is due to a large proportion of these values being exactly zero if the block from which they are taken does not show any evidence of either introgression or incomplete lineage sorting (or worse, if monomorphic SNPs were not removed from the data set, which they should be for this analysis). Another issue is that the chosen value of the jackknife block size can (and often does) influence the standard error and thus the z-score supporting introgression. The approach chosen by F4 is therefore to run simulations to assess support for introgression. After all, we're not so much interested in whether the observed f4-statistic is different from zero, but in whether or not it could be produced by incomplete lineage sorting alone, without any introgression. F4 uses the coalescent software fastsimcoal2 (Excoffier et al. 2013) to produce SNP data sets that resemble the actual SNP data set, with migration rates set to zero and therefore strictly without introgression. All simulated data sets have the same number of individuals and SNPs, and are masked to include the same amount of missing data per population and SNP as the original data set. Further, a burn-in phase is used to automatically adjust simulation parameters (effective population size and relative divergence times) so that the resulting SNP variation matches the observed, both between all populations and between the two pairs of populations. Thus, the simulated data sets should be equivalent to the original data set, but have been produced completely without introgression. Finally, F4 calculates the f4-statistic for each simulated data set and reports the proportion of these that is more extreme than the observed. If the number of simulations is large enough, this proportion of more extreme values can be taken as the probability of obtaining this f4 value (or a more extreme one) without introgression. Thus if this probability is small enough (e.g. < 0.05), it supports introgression between the two pairs of populations. It does not, however, indicate the directionality of introgression. F4 is written in python3, therefore this version of python must be installed. It also uses a number of python packages, most of these however are likely to be installed on your machine anyway, if you have python. Only the two packages numpy and scipy may require additional installations. Importantly, F4 uses fastsimcoal2 for coalescent simulations, thus this program needs to be installed, and needs to be executable with fsc252, for F4 to run.
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of F4
F4 Key Features
F4 Examples and Code Snippets
Community Discussions
Trending Discussions on F4
QUESTION
The objective is calculate subset column average based on the multi condition in a multiindex dataframe.
The first condition is to get the average group by the first level of the multiindex.
The second condition is to get the average based on the dict_ref
value below
ANSWER
Answered 2022-Apr-16 at 12:16You can flip dict_ref
so each item in the value array becomes the key, perform a replacement, and group by the new ch
:
QUESTION
I have a dataframe df1 with columns id1 and id2:
...ANSWER
Answered 2022-Apr-12 at 09:00You could stack df2 and then perform a double merge on df1 to get the features for the relevant id combinations:
QUESTION
Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6
(n^3
for matrix-matrix multiplication, performed on n
slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.
Code snippet in main function:
...ANSWER
Answered 2022-Mar-25 at 19:331 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so theoretical max = 16 double-precision FLOPs per core clock, so 24 * 16 = 384
GFLOP/S. (Using vectors of 4 double
s, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
There is a a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb>
(as well as the __kmpc_end_serialized_parallel
stuff).
There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp
should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.
As far as benchmark validity, a good litmus test is whether it scales reasonably with problem size. Unless you exceed L3 cache size or not with a smaller or larger problem, the time should change in some reasonable way. If not, then you'd worry about it optimizing away, or clock speed warm-up effects (Idiomatic way of performance evaluation? for that and more, like page-faults.)
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Once you're already in an if
block, you unconditionally know the else
block should not run, so you jmp
over it instead of jcc
(even if FLAGS
were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc
to it, then it jmp
s back to after the other side. That allows the fast path to be contiguous with no taken branches.
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Duplicate ret
comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret
instead of jumping to a ret
. (And copies of any cleanup necessary, like restoring regs and stack pointer.)
QUESTION
My Dataframe looks like this:
...ANSWER
Answered 2022-Mar-23 at 09:28There may be several approaches, create a data frame for the annotation, group by column value and list the indexes. Set annotations in the created data frame. In this data example, more strings overlap, so we change the offset values only for the indices we do not want to overlap.
QUESTION
I have src_list
and dst_list
, two lists of the same length.
src_list
contains paths to existing files.
dst_list
contains paths to maybe existing files to maybe overwrite (not folders!).
src_list[i]
should correspond to dst_list[i]
.
I want to copy every src_list[i]
to dst_list[i]
, in (multiprocessing, not threading) parallel.
I want the call to be blocking, meaning all processes should be joined before moving on.
Examples of lists:
...ANSWER
Answered 2022-Feb-21 at 08:56from concurrent.futures import ProcessPoolExecutor
from shutil import copyfile
with ProcessPoolExecutor() as executor:
executor.map(copyfile, src_list, dst_list)
QUESTION
Is there any efficient way to concatenate Pandas column name, and don't use loop.
My current method is very slow.
input :
...ANSWER
Answered 2022-Feb-18 at 10:29You could rework your dictionary to form groups and use groupby
+agg(list)
:
QUESTION
I created var_nt
dataframe by subsetting tx_df
columns based on row variant
- "J3", "J10", "J11", "J13".
Then, I converted the var_nt
dataframe to a GRanges
object (varnt_grange
) using the makeGRangesFromDataFrame
function.
Now, I want to write a for loop to collapse the varnt_grange
into a single vector.
ANSWER
Answered 2022-Feb-19 at 23:15On each iteration, you are inadvertently rewriting over the same object, repeatedly. Instead, you can iterate over the values in gene.list$entrez using lapply
QUESTION
I seem to be really bad at writing readable titles :-)
Here's the situation:
I have a list of names and values, and I would like to calculate the sum of the values, corresponding to every name:
ANSWER
Answered 2022-Feb-01 at 14:00QUESTION
I have df1
:
ANSWER
Answered 2022-Jan-30 at 21:02If the values are "NULL"
, then we can select
the columns of interest, convert to long format with pivot_longer
and filter
out the "NULL"
elements
QUESTION
The following code compiles without warnings in Visual Studio 2019 msvc x64:
...ANSWER
Answered 2022-Jan-18 at 08:13Does this mean that I should have written:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install F4
You can use F4 like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page