@danielle-pinto
Collaborator

@danielle-pinto danielle-pinto commented Jan 30, 2026

Making a draft PR here. There are multiple ways to solve the problem, and I added a first approach. I'm thinking that the second would be a more statistical/simulation approach. Basically, based on the values of k, m, and n, we can make a vector containing all of the possible organisms (e.g. [HH, Hh, hh, HH, ...]). Then, we can calculate the fraction of dominant individuals out of the total.

Wanted to run this by you first and see if you had any suggestions on packages to use.
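
For reference, here is a minimal sketch of the "vector of organisms" idea, written as an exhaustive enumeration rather than a random simulation. The function name `dominant_by_enumeration` and the genotype coding (1 = HH, 2 = Hh, 3 = hh) are assumptions made for illustration and are not taken from the PR diff.

```julia
# Sketch only: genotype codes are an assumption (1 = HH, 2 = Hh, 3 = hh).
function dominant_by_enumeration(k, m, n)
    # offspring_prob[i, j] = P(dominant phenotype | parents with genotypes i and j)
    offspring_prob = [1.0 1.0  1.0;
                      1.0 0.75 0.5;
                      1.0 0.5  0.0]
    population = [fill(1, k); fill(2, m); fill(3, n)]
    total = length(population)
    acc = 0.0
    # Visit every ordered pair of distinct individuals (mating without replacement)
    for a in 1:total, b in 1:total
        a == b && continue
        acc += offspring_prob[population[a], population[b]]
    end
    return acc / (total * (total - 1))
end

dominant_by_enumeration(2, 2, 2)
```

The loop is quadratic in the population size, so this is only meant for checking small inputs by hand.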

@github-actions

Once the build has completed, you can preview your PR at this URL: https://biojulia.dev/BiojuliaDocs/previews/PR16/

@kescobo
Member

kescobo commented Feb 2, 2026

Once the build has completed, you can preview your PR at this URL: https://biojulia.dev/BiojuliaDocs/previews/PR16/

Just noting that the comment is being made, but the link doesn't actually work.

Probably unrelated to the above, your pull request is for some reason requesting to merge into another branch, rather than into main

Member

@kescobo kescobo left a comment

Another solution would be to use StatsBase.jl and do a weighted probability.

One other thing that would be nice to include here is a bit more didactic discussion about how we often write algorithms that are narrowly tailored, but then either repeat ourselves or add complexity as additional requirements get tacked on. E.g., for this problem, your solution works for the specific question, but we'd have to derive a new equation if the question were something like "What's the probability of a heterozygous offspring?" It also doesn't scale up if we add another trait, etc.

The nice thing about the StatsBase.jl solution, and even a simulation, is that they can be made generic and then used to ask more types of questions. I'm not necessarily demanding we add this to a first draft, but maybe open an issue as a potential enhancement.
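
To illustrate what "generic" could look like, here is a rough sketch in plain Julia (no packages) that weights a per-pair table by the probability of each parent pair. The names `pair_probability`, `expected_over_pairs`, and both tables are hypothetical, and the genotype order HH, Hh, hh is an assumption of the sketch.

```julia
# Probability that an ordered pair of parents has genotypes (i, j) when two
# individuals are drawn without replacement from a population with these counts.
function pair_probability(counts, i, j)
    total = sum(counts)
    second = i == j ? counts[j] - 1 : counts[j]
    return (counts[i] / total) * (second / (total - 1))
end

# Weighted average of table[i, j] over all ordered parent pairs.
function expected_over_pairs(counts, table)
    return sum(pair_probability(counts, i, j) * table[i, j]
               for i in eachindex(counts), j in eachindex(counts))
end

# Genotype order assumed here: HH, Hh, hh
dominant_table     = [1.0 1.0 1.0; 1.0 0.75 0.5; 1.0 0.5 0.0]
heterozygous_table = [0.0 0.5 1.0; 0.5 0.5 0.5; 1.0 0.5 0.0]

expected_over_pairs((2, 2, 2), dominant_table)      # P(dominant phenotype)
expected_over_pairs((2, 2, 2), heterozygous_table)  # P(Hh offspring)
```

Swapping the table is enough to answer the heterozygous question, so no new equation has to be derived.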

@kescobo
Member

kescobo commented Feb 2, 2026

I like the idea of a simulation, though it will generally not give an exact answer for Rosalind. I think that's fine as long as that's explained.
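
One way the explanation could be made concrete is the bound below. It is a general Monte Carlo fact, not something stated in the PR. It assumes each of the $N$ simulated trials contributes a value in $[0, 1]$, so the per-trial standard deviation $\sigma$ is at most $1/2$.

```latex
\[
\operatorname{SE}(\hat{p}) = \frac{\sigma}{\sqrt{N}} \le \frac{1}{2\sqrt{N}},
\qquad
N = 100{,}000 \;\Rightarrow\; \operatorname{SE}(\hat{p}) \le 0.0016
\]
```

Each extra decimal digit of accuracy therefore costs roughly 100 times more iterations.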

@danielle-pinto
Collaborator Author

Probably unrelated to the above, your pull request is for some reason requesting to merge into another branch, rather than into main

I did this just so it wasn't showing changes for the Hamming Distance problem as well. I branched off of the hamming distance branch, but in hindsight, should have branched off main. Will keep in mind for the future.

@danielle-pinto danielle-pinto marked this pull request as ready for review February 6, 2026 17:14
@danielle-pinto
Collaborator Author

@kescobo Ready for a final review! I think you've reviewed most of the first part (algorithm piece), so the main thing to focus on here is the statistical/sampling method.

Base automatically changed from 2026-01-27-hamming-distance to main February 8, 2026 01:41

For instance, we can use a simulation that can broadly calculate the likelihood of a given offspring based on a set of given probabilities.

This solution is generic and can be used to ask more types of questions.
Member

The generic solution I was thinking of was actually not to simulate, but rather to be generic with the exact statistics. I like the simulation too, but e.g. outputting the probability matrix you generated would then allow you to count other outputs.

Collaborator Author

Ah, maybe I can make this function more general by having the probability matrix as an input as well. Is that what you meant here?

Member

Sort of. If you're strictly in Mendelian land, you can think of things in terms of allele frequencies and multiplication of probabilities. I also wonder if it would be worth introducing something about Julia types here... but we can save that for later.
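
As a rough sketch of the "multiplication of probabilities" idea: each parent passes on H with a probability fixed by its genotype, and the offspring genotype probabilities are products of those. The function name `offspring_distribution` and the genotype order HH, Hh, hh are assumptions for illustration.

```julia
# Offspring genotype probabilities for parents with genotype indices i and j.
# Index order assumed in this sketch: 1 = HH, 2 = Hh, 3 = hh.
function offspring_distribution(i, j)
    pass_H = (1.0, 0.5, 0.0)          # P(parent passes on the H allele)
    a, b = pass_H[i], pass_H[j]
    return (HH = a * b,
            Hh = a * (1 - b) + (1 - a) * b,
            hh = (1 - a) * (1 - b))
end

# The dominant-phenotype matrix falls out of the allele probabilities
derived_dominant_table = [1 - offspring_distribution(i, j).hh for i in 1:3, j in 1:3]
```

The same function also yields a heterozygous table (`.Hh`) or a recessive one (`.hh`) without writing any matrix by hand.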


function mendel_sim(k, m, n; iterations=100000)
    # Genotypes: 1=HH, 2=Hh, 3=hh
    population = [fill(1, k); fill(2, m); fill(3, n)]
Member

I think using a weight vector here makes more sense - if you have millions, you're gonna allocate a giant array. Instead you can do something like

using StatsBase

total_pop = k + m + n
wts = [k/total_pop, m/total_pop, n/total_pop]

# Draws two values from [1, 2, 3] with probability weights wts (with replacement by default)
sample([1, 2, 3], weights(wts), 2)
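
One caveat worth checking before switching: weighted draws from fixed weights behave like sampling with replacement, whereas the original `population` code used `replace=false`. The sketch below compares the two for a single case, both parents being hh. The helper names are hypothetical.

```julia
# P(both parents are hh) when n of the `total` individuals are hh
with_replacement(n, total)    = (n / total)^2
without_replacement(n, total) = (n / total) * ((n - 1) / (total - 1))

small = (with_replacement(2, 6), without_replacement(2, 6))            # 1/9 versus 1/15
large = (with_replacement(2_000, 6_000), without_replacement(2_000, 6_000))
```

For millions of individuals the gap is negligible, but for small inputs such as k = m = n = 2 it shifts the answer noticeably.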

Comment on lines 190 to 193
dominant_count = sum(
    offspring_prob[sample(population, 2; replace=false)...]
    for i in 1:iterations
)
Member

I think this is going to allocate a lot. I think the canonical way to do this is something like

sum(1:iterations) do _
    (i,j) = sample([1,2,3], weights(wts), 2)
    return offspring_prob[i,j]
end
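
Putting the pieces together, a possible complete version might look like the sketch below. It is not the code from the PR. It uses only Base and the `Random` standard library instead of StatsBase, draws the second parent after removing the first (so it matches `replace=false`), and keeps the counts in a tuple so no population array is built. The names `draw_genotype` and `mendel_sim_sketch` are hypothetical.

```julia
using Random

# Draw one genotype index with probability proportional to `counts`.
function draw_genotype(rng, counts)
    r = rand(rng, 1:sum(counts))
    for (g, c) in enumerate(counts)
        r <= c && return g
        r -= c
    end
    error("counts must contain at least one individual")
end

function mendel_sim_sketch(k, m, n; iterations=100_000, rng=Random.default_rng())
    # Genotypes (assumed coding): 1 = HH, 2 = Hh, 3 = hh
    offspring_prob = [1.0 1.0 1.0; 1.0 0.75 0.5; 1.0 0.5 0.0]
    counts = (k, m, n)
    total = sum(1:iterations) do _
        i = draw_genotype(rng, counts)
        # Remove the first parent from the pool before drawing the second
        remaining = ntuple(g -> g == i ? counts[g] - 1 : counts[g], 3)
        j = draw_genotype(rng, remaining)
        return offspring_prob[i, j]
    end
    return total / iterations
end

mendel_sim_sketch(2, 2, 2; rng=MersenneTwister(1))
```

The result is an estimate, so it will vary with the seed and with `iterations`.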

@danielle-pinto
Collaborator Author

Made some edits based on your last comments! @kescobo I think we are close to being able to merge in?
