Saturday, December 3, 2016

Modeling Successful Sexual Activity by Day of the Year

Over the years, we have noticed that many of our friends' birthdays (and our own) fall either in late summer to early fall (August to September) or in mid-to-late November.


The folk explanation is that increased sexual activity due to colder weather and Valentine's Day causes these clusters, but we wanted to tackle the question with a statistically rigorous approach. Perhaps dispersion in conception and ovulation lengths is large enough to render our hypothesis about increased sexual activity on Valentine's Day moot. Using data on birthday distributions, length of pregnancy, and conception odds at various stages of the ovulation process, we used Bayesian methods to obtain a distribution of "successful" sexual activity days.


The Model

We label $S$ as the day on which sexual intercourse leading to an eventual birth ("successful" intercourse) occurs, $C$ as the day conception occurs, and $B$ as the day the child is born. The goal of this project is to obtain $P(S = i)$ for all days $i$ in a year, the set of all days being $D$. To reach that point, we decompose this probability into conditional probabilities by marginalizing over the conception day and the birthday (the law of total probability), assuming that $S$ depends on $B$ only through $C$:

$P(S = i) = \sum_{j \in D}\sum_{k \in D}P(S=i|C=j)P(C=j|B=k)P(B=k)$

This is the formula we will use to compute the probabilities, as we have data sources from which we can estimate the conditional probabilities and the birth distribution (see the detailed writeup for information about these data sources). These data sources contain empirical distributions of ovulation length, conception length, and birth dates.

The Methodology
In order to obtain the full distribution of successful sexual activity:

$\sum_{j \in D} \sum_{k \in D} P(S = i | C = j) P(C = j | B = k) P(B = k)$, 

we specify distributions for each of the three components and simulate the combined result. Each of these components is a probability distribution over a finite number of days: days of successful sexual activity, conception days, and birthdays.

Assuming that the time differences between sexual activity and conception, and between conception and birth, are independent of the day itself (i.e. pregnancies don't last longer if they start in the winter), we can rewrite each conditional probability as the probability of a time lag. We express the date of sexual activity given the conception date as an ovulation timespan, $\textit{Ovul}$:

$P(S=i|C=j)=P(\textit{Ovul}=j-i)$

We similarly express the date of conception given the birthday as a pregnancy timespan, $\textit{Preg}$:

$P(C = j | B = k) = P(\textit{Preg} = k-j)$

Substituting $m = j - i$ and $n = k - j$, so that $k = i + m + n$, the probability of successful sexual activity on day $i$ becomes:

$P(S = i) = \sum_{m \in O} \sum_{n \in P} P(\textit{Ovul} = m) P(\textit{Preg} = n)P(B = i + m + n)$, 

where $O$ and $P$ are the sets of possible ovulation lengths and pregnancy lengths in our data.

We model each of these three distributions as a multinomial distribution, specifying a probability for each of a finite number of categories (i.e. birthdays falling in $\{1/1, 1/2,...\}$). Rather than taking the observed frequencies across categories as fact, we use Bayesian updating to obtain a distribution over these multinomial probabilities. We use a Dirichlet distribution as our prior. It is a distribution over probability vectors $(p_1,...,p_k)$, governed by concentration parameters $\{\alpha_1,...,\alpha_k\}$, where $k$ is the number of categories, and it is a conjugate prior of the multinomial distribution, so the posterior is again a Dirichlet distribution. We use the uninformative prior of uniform initial probabilities across all $k$ categories. Updating this prior with data from the ovulation, conception, and birthday data sets, we obtain posterior distributions over the three sets of multinomial probabilities.
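To make the conjugate update concrete, here is a minimal sketch in plain Python. It assumes the uniform prior corresponds to a symmetric Dirichlet with every concentration parameter equal to 1, which is our reading of "uninformative" and not something the writeup confirms. The function names and the toy counts are hypothetical.

```python
def dirichlet_posterior(counts, prior_alpha=1.0):
    """Posterior concentration parameters after observing category counts.

    Assumption: a symmetric Dirichlet prior with concentration `prior_alpha`
    in every category (prior_alpha=1.0 is the flat prior).
    """
    return [prior_alpha + c for c in counts]


def posterior_mean(alpha):
    """Expected multinomial probabilities under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical counts for a toy variable with 4 categories.
counts = [3, 0, 5, 2]
alpha_post = dirichlet_posterior(counts)
mean = posterior_mean(alpha_post)
```

Because the prior adds one pseudo-count per category, a category with zero observations (the second one here) still receives a small positive posterior probability.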








Then, we sample multinomial probabilities from these Dirichlet posteriors to obtain the three distributions we marginalize over above, repeating the procedure for 10,000 simulations.
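The sampling and marginalization steps might look roughly like the sketch below. It is not the code from our repo: the function names, the 12-"day" toy year, and the posterior parameters are all made up for illustration, and it assumes day indices wrap around the end of the year. Dirichlet draws are generated by normalizing independent Gamma draws, a standard construction.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def marginalize(p_birth, lags_ovul, p_ovul, lags_preg, p_preg):
    """P(S = i) = sum_m sum_n P(Ovul = m) P(Preg = n) P(B = i + m + n)."""
    n_days = len(p_birth)
    p_sex = []
    for i in range(n_days):
        total = 0.0
        for m, pm in zip(lags_ovul, p_ovul):
            for n, pn in zip(lags_preg, p_preg):
                # Assumption: day indices wrap around the end of the year.
                total += pm * pn * p_birth[(i + m + n) % n_days]
        p_sex.append(total)
    return p_sex


def simulate(alpha_birth, lags_ovul, alpha_ovul, lags_preg, alpha_preg,
             n_sims, seed=0):
    """Return one P(S = i) vector per simulation."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_sims):
        p_birth = sample_dirichlet(alpha_birth, rng)
        p_ovul = sample_dirichlet(alpha_ovul, rng)
        p_preg = sample_dirichlet(alpha_preg, rng)
        trials.append(marginalize(p_birth, lags_ovul, p_ovul, lags_preg, p_preg))
    return trials


# Toy inputs (hypothetical): a 12-"day" year and made-up posterior parameters.
alpha_birth = [11, 9, 10, 12, 10, 9, 13, 14, 15, 10, 12, 9]
lags_ovul, alpha_ovul = [0, 1, 2], [6, 3, 2]
lags_preg, alpha_preg = [8, 9, 10], [4, 12, 5]
trials = simulate(alpha_birth, lags_ovul, alpha_ovul, lags_preg, alpha_preg,
                  n_sims=200)
```

With a real 365-day year and 10,000 simulations, these nested pure-Python loops would be slow, so a vectorized array library would likely be a better fit.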

Results
Averaging each day's probability of successful sexual activity across the 10,000 trials yields reasonable probabilities for each day of the year that sum to approximately 1.
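The averaging step is simple enough to sketch directly. The helper name and the three toy trials below are hypothetical. Each trial is a probability vector that sums to 1, so the day-by-day average also sums to 1 up to floating-point error.

```python
import math


def average_over_trials(trials):
    """Average each day's probability across all simulated trials."""
    n_trials = len(trials)
    n_days = len(trials[0])
    return [math.fsum(t[d] for t in trials) / n_trials for d in range(n_days)]


# Three hypothetical trials over a toy 3-day "year".
trials = [
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
    [0.3, 0.2, 0.5],
]
avg = average_over_trials(trials)
```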




There appears to be a clear seasonal trend: successful sexual activity increases toward the winter months, rising from September and dropping off in December, and there is also a decline around the beginning of summer break. Notice the small bump around the middle of February. It is only a touch more dramatic than the bump in May, but perhaps it means the Valentine's Day hypothesis is true after all :)

For more details, see our GitHub repo.