In the following, I give you my basic understanding of stochastic differential equations (SDEs) that I gathered from chapter 4 of Applied Stochastic Differential Equations by Särkkä & Solin. This text follows their structure and flow, but I write out some parts in more detail to aid my own understanding and skip several things not directly related to the understanding of SDEs I am after. The book is a great practical introduction to the numerical solution of SDEs and the theory behind the continuous-time Kalman filter.
If you have looked into the theory linking diffusion models to SDEs, you have surely seen equations like $$\d\vx = f(\vx, t)\,\d t + g(\vx, t)\,\d\vbeta$$ and you understand that, on some level, this means that $\vx$ follows the vector field $f$ through time, disturbed by some noise represented by $\vbeta$. But what is $\vbeta$ really, and what in particular is $\d\vbeta$?
Informally, we can understand the above SDE as an ODE $$\frac{\d\vx}{\d t} = f(\vx, t) + g(\vx, t)\,\vw(t)$$ driven by a vector field $f$ and a random component $\vw$. Here we take $\vw$ to be a white noise or standard Gaussian process, i.e. a random function such that $\vw(s)$ and $\vw(t)$ are independent if $s \ne t$ and $\vw(t) \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$ for all $t$. This captures the intuitive notion that $\vx$ is mostly controlled by $f$ but also disturbed randomly by $\vw$. Since $g$ only scales the noise, we will ignore it and focus on $\vw$.
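To make this noisy-ODE picture concrete, here is a minimal simulation sketch in Python. The drift $f(x, t) = -x$ and the constant noise scale $g = 0.5$ are arbitrary choices of my own for illustration, and the $\sqrt{\Delta t}$ scaling of the noise increment is the standard Euler–Maruyama discretization, stated here without derivation; it anticipates the Brownian-increment view developed below.

```python
import numpy as np

# Minimal Euler-Maruyama-style simulation of dx/dt = f(x, t) + g(x, t) w(t).
# f and g are arbitrary illustrative choices, not anything from the text.
def f(x, t):
    return -x      # a simple drift pulling x back towards 0

def g(x, t):
    return 0.5     # constant noise scale

rng = np.random.default_rng(0)
T, n = 1.0, 1000
dt = T / n

x = np.empty(n + 1)
x[0] = 1.0
for i in range(n):
    t = i * dt
    # The noise enters as a Gaussian increment with variance dt; this sqrt(dt)
    # scaling corresponds to the Brownian motion increments introduced below.
    dbeta = np.sqrt(dt) * rng.standard_normal()
    x[i + 1] = x[i] + f(x[i], t) * dt + g(x[i], t) * dbeta
```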
While the noisy ODE perspective works heuristically, it breaks down when we try to understand SDEs on a deeper level. Let's integrate the ODE to convert it to the integral equation $$\vx(T) - \vx(0) = \int_0^T f(\vx, t)\,\d t + \int_0^T g(\vx, t)\,\vw(t)\,\d t.$$ Assuming that $f$ and $g$ are sufficiently nice functions, the first integral is perfectly ordinary and can be understood, for example, as a Riemann integral. However, the second one falls under neither the Riemann, nor the Stieltjes, nor the Lebesgue category of integrals. To see this, recall that these integrals are defined via converging sequences of simple functions that approximate $g(\vx, t)\,\vw(t)$ on partitions of $[0, T]$. However, because $\vw$ is a white noise process, $\vw(t)$ is unbounded and cannot be approximated by simple functions on any partition.
To make this more specific, let's think about Riemann integration. There, we have a partition $t_0 = 0 < t_1 < \ldots < t_n = T$ and search for step functions with steps at the partition boundaries $t_i$ that bound the integrand $g(\vx, t)\,\vw(t)$ from above and below. The integral's value is then defined as the common limit of the integrals of these step functions as we refine the partition, provided the lower and upper bounds converge. For SDEs, this does not happen. In fact, we cannot even find lower and upper bounds, as $\vw(t)$ takes arbitrarily large values on any non-empty sub-interval of $[0, T]$.
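As a quick heuristic illustration of this unboundedness (my own sketch, not from the book), we can sample discretized white noise on ever finer grids of a fixed interval and watch the largest absolute sample grow without bound, so no bounding step function can exist:

```python
import numpy as np

# Heuristic illustration: the largest absolute value of sampled white noise keeps
# growing as the grid of a fixed interval is refined (roughly like sqrt(2 log n)),
# so no step function can bound it from above or below.
rng = np.random.default_rng(0)
for n in [10**2, 10**4, 10**6]:
    w = rng.standard_normal(n)    # w(t_i) ~ N(0, 1), independent across grid points
    print(n, np.abs(w).max())
```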
Working out how we can understand the second integral requires us to consider another concept closely related to white noise: Brownian motion, also called the Wiener process. It is a stochastic process $\vbeta$ on the non-negative reals $t \ge 0$, i.e. a probability distribution over functions defined for $t \ge 0$. Furthermore, it satisfies
- $\vbeta(0) = 0$,
- its increments $\vbeta(t) - \vbeta(s)$ are Gaussian distributed with mean $\boldsymbol{0}$ and covariance $(t - s)\,\boldsymbol{I}$ and independent of earlier values $\vbeta(r)$ where $r \le s < t$,
- and its sample paths are continuous functions.
In physics, $\vbeta$ is important as a model for the random motion of particles. For us, it is important as an antiderivative of the white noise process $\vw(t)$.
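To get a feeling for $\vbeta$, here is a minimal simulation sketch (my own illustration, in the scalar case) that builds sample paths by accumulating independent Gaussian increments and empirically checks the variance property of the increments:

```python
import numpy as np

# Simulate sample paths of scalar Brownian motion on [0, T] by accumulating
# independent Gaussian increments with variance dt, then check empirically that
# beta(t) - beta(s) has variance t - s.
rng = np.random.default_rng(0)
T, n, paths = 1.0, 1000, 5000
dt = T / n

increments = np.sqrt(dt) * rng.standard_normal((paths, n))
beta = np.concatenate([np.zeros((paths, 1)), np.cumsum(increments, axis=1)], axis=1)

s, t = 0.3, 0.8                       # two arbitrary times with s < t
i, j = int(s / dt), int(t / dt)
print(np.var(beta[:, j] - beta[:, i]), t - s)   # both should be close to 0.5
```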
Equipped with $\vbeta$, we can try again to understand the white noise integral above. Since $\vw$ is the derivative of $\vbeta$, we can write the differential of $\vbeta$ as $$\d\vbeta = \frac{\d\vbeta}{\d t}\d t = \vw\,\d t.$$ This lets us move the unmanageable $\vw$ out of the integrand and into the differential, $$\int_0^T g(\vx, t)\,\vw(t)\,\d t = \int_0^T g(\vx, t)\,\d\vbeta.$$ On a formal level, we can treat this integral as a Stieltjes integral with the integrand $g(\vx, t)$ and the integrator $\vbeta(t)$. So we can take a partition of $[0, T]$ as above and define the approximating sum $$\sum_{i = 0}^{n-1} g(\vx(t_i^*), t_i^*)\,[\vbeta(t_{i+1}) - \vbeta(t_{i})]$$ where $t_i^* \in [t_{i}, t_{i+1}]$. If this sum converges to a unique value, independent of the specific choice of $t_i^*$, as $n \to \infty$, this value equals the integral, and in a sense we can understand the integral as this limit of sums. However, this is not the case if $\vbeta$ is Brownian motion, so we are not done yet. We will soon see a proof of this by counterexample.
Curiously, the sum does converge to a unique value if we fix the choice $t_i^* = t_i$. Therefore, we can define $$\int_0^T g(\vx, t)\,\d\vbeta = \lim_{n \to \infty} \sum_{i = 0}^{n-1} g(\vx(t_i), t_i)\,[\vbeta(t_{i+1}) - \vbeta(t_{i})],$$ which is called the *Itô integral*. Even more curiously, the sums also converge for $t_i^* = \frac{t_{i+1} + t_i}{2}$. This gives us the *Stratonovich integral* $$\int_0^T g(\vx, t)\circ\d\vbeta = \lim_{n \to \infty} \sum_{i = 0}^{n-1} g\Big(\vx\Big(\frac{t_{i+1} + t_i}{2}\Big), \frac{t_{i+1} + t_i}{2}\Big)\,[\vbeta(t_{i+1}) - \vbeta(t_{i})],$$ which is commonly distinguished from the Itô integral by a circle in front of $\d\vbeta$. And you will surely be surprised to learn that this distinction is important, because the two interpretations of the same stochastic integral give different values, unlike the Riemann, Stieltjes, and Lebesgue integrals of non-stochastic calculus, which agree on the value of an integral whenever they are all defined.
The simple integral problem $\int_0^T \beta(t)\,\d\beta(t)$ will help us understand why the two variants differ. The Itô integral asserts its value to be $\frac{1}{2}\beta^2(T) - \frac{1}{2}T$ while the Stratonovich integral assigns it a value of $\frac{1}{2}\beta^2(T)$. How is that possible? As an Itô integral, our problem is the limit of $$\sum_{i} \beta(t_i)\,[\beta(t_{i+1}) - \beta(t_{i})].$$ If you stare at this for a while (or know the result and work backwards), you can see that it can be rewritten as $$\sum_i \Big[\frac{1}{2}\big({\beta(t_{i+1})}^2 - {\beta(t_i)}^2\big) - \frac{1}{2} {\big(\beta(t_{i+1}) - \beta(t_i)\big)}^2\Big].$$ When we break this into two sums, we see that the first part is a telescoping sum and, since $\beta(0) = 0$, simplifies to $\frac{1}{2}\beta^2(T)$. For the second part, we have to work a bit harder. $\beta(t_{i+1}) - \beta(t_i)$ is a random variable with distribution $\mathcal{N}(0, \Delta t_i)$ as per the basic properties of Brownian motion, where we define $\Delta t_i = t_{i+1} - t_i$. We can rearrange $\Var[X] = \E[X^2] - {\E[X]}^2$ to see that the expected value of each ${\big(\beta(t_{i+1}) - \beta(t_i)\big)}^2$ term is $\Delta t_i$ and use it again to compute $$\Var\Big[{\big(\beta(t_{i+1}) - \beta(t_i)\big)}^2\Big] = \E\Big[{\big(\beta(t_{i+1}) - \beta(t_i)\big)}^4\Big] - {\E\Big[{\big(\beta(t_{i+1}) - \beta(t_i)\big)}^2\Big]}^2.$$ The first term is the fourth moment of a zero-mean Gaussian, $3(\Delta t_i)^2$, and the second is just the squared variance, $(\Delta t_i)^2$, so overall the variance is $2(\Delta t_i)^2$. Using the fact that non-overlapping increments of Brownian motion are independent and parametrizing $t_i$ so that $\Delta t_i = \frac{T}{n}$, we see that $\sum_i {\big(\beta(t_{i+1}) - \beta(t_i)\big)}^2$ is a random variable with mean $T$ and variance $\frac{2T^2}{n}$. As $n \to \infty$, the variance vanishes, so this sum converges (in mean square) to $T$, and together with the factor $-\frac{1}{2}$ and the telescoping part we get the claimed result $\frac{1}{2}\beta^2(T) - \frac{1}{2}T$.
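We can also check both claimed values numerically. The following sketch (my own, not from the book) simulates a single Brownian path on a fine uniform grid and uses every other grid point as a partition boundary, so that the midpoints $t_i^*$ needed for the Stratonovich sum are themselves grid points:

```python
import numpy as np

# Numerical check of int_0^T beta dbeta under the Ito (left-point) and the
# Stratonovich (midpoint) sum definitions, on a single simulated path.
rng = np.random.default_rng(0)
T, n = 1.0, 200_000                  # n even: partition boundaries are every other grid point
dt = T / n

dbeta = np.sqrt(dt) * rng.standard_normal(n)
beta = np.concatenate([[0.0], np.cumsum(dbeta)])

left = beta[0:-1:2]                  # beta(t_i) at the left partition boundaries
mid = beta[1::2]                     # beta(t_i^*) at the midpoints
incr = beta[2::2] - beta[0:-2:2]     # beta(t_{i+1}) - beta(t_i)

print(np.sum(left * incr), 0.5 * beta[-1]**2 - 0.5 * T)   # Ito sum vs analytic value
print(np.sum(mid * incr), 0.5 * beta[-1]**2)              # Stratonovich sum vs analytic value
```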
What changes for the Stratonovich integral? We again have a similar sum representation $$\sum_{i} \beta(t_i^*)\,[\beta(t_{i+1}) - \beta(t_{i})],$$ but now we evaluate $\beta$ at the midpoint between $t_i$ and $t_{i+1}$, which we denote as $t_i^*$. To analyze the terms of this sum, we split $\beta(t_i^*)$ into a linear approximation $\frac{\beta(t_{i+1}) + \beta(t_i)}{2}$ and an error term $\beta(t_i^*) - \frac{\beta(t_{i+1}) + \beta(t_i)}{2}$. Plugging this in splits the sum above into two sums. The first is $$\sum_i \frac{\beta(t_{i+1}) + \beta(t_i)}{2}\,[\beta(t_{i+1}) - \beta(t_{i})],$$ whose terms simplify to $\frac{1}{2}\big({\beta(t_{i+1})}^2 - {\beta(t_i)}^2\big)$, so it is again a telescoping sum and evaluates to $\frac{1}{2}\beta^2(T)$. The second sum is $$\sum_i \bigg(\beta(t_i^*) - \frac{\beta(t_{i+1}) + \beta(t_i)}{2}\bigg)\,\big[\beta(t_{i+1}) - \beta(t_{i})\big].$$ If we introduce $A_i = \beta(t_i^*) - \beta(t_i)$ and $B_i = \beta(t_{i+1}) - \beta(t_i^*)$, we can rewrite this as $$\frac{1}{2} \sum_i \big( A_i - B_i \big) \big[A_i + B_i\big] = \frac{1}{2} \sum_i \big( A_i^2 - B_i^2 \big).$$ Note that $A_i$ and $B_i$ are non-overlapping increments and therefore independent random variables with mean $0$ and variance $\frac{\Delta t_i}{2}$. Following the same argument as in the Itô case, we conclude that this sum has mean $0$ and a variance that goes to $0$ in the $n \to \infty$ limit, so the second sum vanishes and only $\frac{1}{2}\beta^2(T)$ remains. Note that this is the same result that we would get in non-stochastic calculus if $\beta$ were an ordinary function.
So the Stratonovich integral gives us the result that we are used to from ordinary calculus, and it also carries over our beloved chain rule into the land of stochastic calculus. Then why do we still see Itô integrals everywhere (meaning the few places in ML where I have seen SDEs)? As far as I understand, it comes down to the fact that the Stratonovich integral requires you to, figuratively, "know the future". While the Itô integral evaluates the integrand at the left endpoint of each partition interval, i.e. at the current value, the Stratonovich integral evaluates it at the midpoint, i.e. partly in the future from the perspective of someone integrating forward along the time axis. Additionally, Itô integrals come with their own chain rule variant in the form of Itô's lemma.
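Itô's lemma is covered in the book; here I only note its simplest scalar form, stated without derivation, for a twice-differentiable function $\phi$ of Brownian motion: $$\d\phi(\beta) = \phi'(\beta)\,\d\beta + \frac{1}{2}\phi''(\beta)\,\d t.$$ Plugging in $\phi(\beta) = \frac{1}{2}\beta^2$ gives $\d\big(\frac{1}{2}\beta^2\big) = \beta\,\d\beta + \frac{1}{2}\,\d t$, and integrating from $0$ to $T$ recovers exactly the extra $-\frac{1}{2}T$ in the Itô value of $\int_0^T \beta\,\d\beta$.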
Finally, I want to return to my opening question: what is $\d\beta$? In contrast to standard calculus, I am not able to grasp $\d\beta$ as a thing in itself. In discussions and mathematical prototyping, it is probably fine to think of it as an infinitesimal piece of white noise. However, as we have seen, its actual meaning is only determined when we integrate against it. So I will, as Särkkä and Solin suggest, view stochastic differential equations mostly as a shorthand for stochastic integral equations.