Child |

__The Model__

$P(S = i) = \sum_{j \in D}\sum_{k \in D}P(S=i|C=j)P(C=j|B=k)P(B=k)$

We use this formula to compute the probabilities because our data sources let us estimate the conditional probabilities and the birth distribution (see the detailed writeup for information about these data sources). These data sources provide empirical distributions of ovulation length, conception length, and birth dates.

__The Methodology__

In order to obtain the full distribution of successful sexual activity:

$\sum_{j \in D} \sum_{k \in D} P(S = i | C = j) P(C = j | B = k) P(B = k)$,

we specify distributions for each of the three components and simulate the combined result. Each of these components is a probability distribution over a finite number of days: the day of successful sexual activity, the day of conception, and the birthday.

Assuming that the time differences between sexual activity and conception, and between conception and birthday, are independent of the day itself (e.g. pregnancies don't last longer if they start in the winter), we can rewrite each conditional probability as the probability of a time difference. We express date of sexual activity given conception date as an ovulation timespan, $\textit{Ovul}$:

$P(S=i|C=j)=P(\textit{Ovul}=j-i)$

We similarly express date of conception given birthday as a pregnancy timespan, $\textit{Preg}$:

$P(C = j | B = k) = P(\textit{Preg} = k-j)$

Substituting $m = j - i$ and $n = k - j$, so that $k = i + m + n$, our marginal probability becomes:

$P(S = i) = \sum_{m \in O} \sum_{n \in P} P(\textit{Ovul} = m) P(\textit{Preg} = n)P(B = i + m + n)$,
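The double sum over ovulation and pregnancy lengths can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from our repo: the function name `marginal_sex_day`, the toy probabilities, and the choice to wrap day indices around the year (modulo 365) are all hypothetical.

```python
def marginal_sex_day(p_ovul, p_preg, p_birth):
    """Compute P(S = i) = sum_m sum_n P(Ovul = m) P(Preg = n) P(B = i + m + n).

    p_ovul and p_preg map a length in days to its probability.
    p_birth is a list of birthday probabilities, one per day of the year.
    Assumption: day indices wrap around the year (modulo len(p_birth)).
    """
    n_days = len(p_birth)
    p_sex = [0.0] * n_days
    for i in range(n_days):
        total = 0.0
        for m, pm in p_ovul.items():
            for n, pn in p_preg.items():
                total += pm * pn * p_birth[(i + m + n) % n_days]
        p_sex[i] = total
    return p_sex


# Toy inputs, made up for illustration (not taken from the data sources).
p_ovul = {0: 0.5, 1: 0.3, 2: 0.2}
p_preg = {266: 0.25, 267: 0.5, 268: 0.25}
p_birth = [1.0 / 365] * 365

p_sex = marginal_sex_day(p_ovul, p_preg, p_birth)
```

With a uniform birthday distribution, every day comes out equally likely, which is a quick sanity check that the sum preserves total probability.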

where $O$ and $P$ are the sets of possible ovulation lengths and pregnancy lengths in our data (the set $P$ is not to be confused with the probability $P(\cdot)$).

We model each of these three distributions as a multinomial distribution, specifying a probability for each of a finite number of categories (e.g. birthdays falling in $\{1/1, 1/2,...\}$). Rather than taking the observed proportions across categories as fact, we use Bayesian updating to obtain a distribution over these multinomial probabilities. Our prior is a Dirichlet distribution, which is a distribution over vectors of $k$ category probabilities, parameterized by concentration parameters $\{\alpha_1,...,\alpha_k\}$, where $k$ is the number of categories. It is a conjugate prior of the multinomial distribution, so the posterior is also a Dirichlet distribution. We use the uninformative prior of uniform initial probabilities across all $k$ categories. Updating this prior with data from the ovulation, conception, and birthday data sets, we obtain posterior distributions over the three sets of multinomial probabilities.
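Because the Dirichlet is conjugate to the multinomial, the update itself is just addition: each posterior concentration parameter is the prior parameter plus the observed count for that category. A minimal sketch follows, assuming the uniform prior means $\alpha_i = 1$ for every category; the function names and the counts are hypothetical.

```python
def dirichlet_posterior(counts, prior_alpha=1.0):
    """Posterior concentration parameters: prior alpha plus observed count.

    Assumption: the uniform prior is Dirichlet(1, ..., 1).
    """
    return [prior_alpha + c for c in counts]


def posterior_mean(alpha):
    """Mean of Dirichlet(alpha): each alpha_i divided by the sum of all alphas."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical counts for three categories (for example, three ovulation lengths).
counts = [3, 5, 2]
alpha_post = dirichlet_posterior(counts)
mean_probs = posterior_mean(alpha_post)
```

With many observations the posterior mean approaches the observed proportions, while categories with zero counts still keep a small nonzero probability.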

Then, over 10,000 simulations, we sample multinomial probabilities from each of the three Dirichlet posteriors and plug them into the double sum above to obtain $P(S = i)$ for each day.
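The simulation loop can be sketched as follows. This is a simplified illustration rather than our actual implementation: it uses a toy 7-day "year" to keep it fast, draws Dirichlet samples by normalizing gamma variates from Python's standard library, and the names `sample_dirichlet`, `simulate_sex_day`, and `min_preg`, the length encoding, and all parameter values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_sex_day(alpha_ovul, alpha_preg, alpha_birth, min_preg, n_sims=1000, seed=0):
    """Average P(S = i) over n_sims draws from the three Dirichlet posteriors.

    Hypothetical encoding: ovulation lengths are 0..len(alpha_ovul)-1 days and
    pregnancy lengths are min_preg..min_preg+len(alpha_preg)-1 days.
    Assumption: day indices wrap around the year.
    """
    rng = random.Random(seed)
    n_days = len(alpha_birth)
    mean = [0.0] * n_days
    for _ in range(n_sims):
        p_ovul = sample_dirichlet(alpha_ovul, rng)
        p_preg = sample_dirichlet(alpha_preg, rng)
        p_birth = sample_dirichlet(alpha_birth, rng)
        for i in range(n_days):
            p_i = sum(
                pm * pn * p_birth[(i + m + min_preg + n) % n_days]
                for m, pm in enumerate(p_ovul)
                for n, pn in enumerate(p_preg)
            )
            mean[i] += p_i / n_sims
    return mean


# Toy posterior parameters on a 7-day "year" (made up for illustration).
alpha_ovul = [4.0, 6.0, 3.0]
alpha_preg = [2.0, 8.0, 2.0]
alpha_birth = [1.0 + c for c in [10, 12, 9, 11, 30, 10, 8]]

p_sex = simulate_sex_day(alpha_ovul, alpha_preg, alpha_birth, min_preg=3, n_sims=200)
```

Each simulated draw is itself a valid probability distribution over days, so the average across draws also sums to 1 up to floating-point error.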

__Results__
Averaging each day's probability across the 10,000 simulations yields reasonable probabilities for each day of the year that sum to approximately 1.

There appears to be a clear seasonal trend of increased successful sexual activity around the winter months, starting in September and dropping in December, and there is also a decline around the beginning of summer break. Notice the small bump around the middle of February -- it is only a touch more dramatic than the bump in May, but perhaps this means the Valentine's Day hypothesis is true after all :)

For more details, see our GitHub repo.