
Symbolic Statistics with SymPy

  • Fashion Metric, Inc


Replacing symbols with random variables makes it possible to naturally add statistical operations to complex physical models. Three examples of symbolic statistical modeling are considered here, using new features from the popular SymPy project.
Uncertainty Modeling with SymPy Stats
Matthew Rocklin
Abstract—We add a random variable type to a mathematical modeling lan-
guage. We demonstrate through examples how this is a highly separable way
to introduce uncertainty and produce and query stochastic models. We motivate
the use of symbolics and thin compilers in scientific computing.
Index Terms—Symbolics, mathematical modeling, uncertainty, SymPy
Scientific computing is becoming more challenging. On the
computational machinery side, heterogeneity and increased
parallelism are increasing the effort required to produce
high-performance codes. On the scientific side, computation is
used for problems of increasing complexity by an increasingly
broad and untrained audience. The scientific community
is attempting to fill this widening need-to-ability gulf with
various solutions. This paper discusses symbolic mathematical
modeling.
Symbolic mathematical modeling provides an important
interface layer between the description of a problem by domain
scientists and description of methods of solution by
computational scientists. This allows each community to develop
asynchronously and facilitates code reuse.
In this paper we will discuss how a particular problem
domain, uncertainty propagation, can be expressed symbolically.
We do this by adding a random variable type to a popular
mathematical modeling language, SymPy [Sym, Joy11]. This
allows us to describe stochastic systems in a highly separable
and minimally complex way.
Mathematical models are often flawed. The model itself
may be overly simplified or the inputs may not be completely
known. It is important to understand the extent to which the
results of a model can be believed. Uncertainty propagation
is the act of determining the effects of uncertain inputs on
outputs. To address these concerns it is important that we
characterize the uncertainty in our inputs and understand how
this causes uncertainty in our results.
Motivating Example - Mathematical Modeling
We motivate this discussion with a familiar example from
Matthew Rocklin is with University of Chicago, Computer Science. E-mail:
2010 Matthew Rocklin. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Consider an artilleryman firing a cannon down into a valley.
He knows the initial position (x0, y0) and orientation, θ, of the
cannon as well as the muzzle velocity, v, and the altitude of
the target, yf.
# Inputs
>>> from sympy import pi, sin, cos
>>> x0 = 0
>>> y0 = 0
>>> yf = -30  # target is 30 meters below
>>> g = -10  # gravitational acceleration (m/s^2)
>>> v = 30  # m/s
>>> theta = pi/4
If this artilleryman has a computer nearby he may write some
code to evolve forward the state of the cannonball to see where
it lands.
>>> t, dt = 0, 0.01  # start time and time step (s)
>>> y = y0
>>> while y > yf:  # evolve time forward until y hits the ground
...     t += dt
...     y = y0 + v*sin(theta)*t + g*t**2/2
...
>>> x = x0 + v*cos(theta)*t
Notice that in this solution the mathematical description of
the problem, $y = y_0 + v \sin(\theta) t + \frac{g t^2}{2}$, lies
within the while loop.
The problem and method are woven together. This makes it
difficult both to reason about the problem and to easily swap
out new methods of solution.
If the artilleryman also has a computer algebra system he
may choose to model this problem and solve it separately.
>>> from sympy import Symbol, cos, sin, solve, plot_parametric
>>> t = Symbol('t')  # SymPy variable for time
>>> x = x0 + v*cos(theta)*t
>>> y = y0 + v*sin(theta)*t + g*t**2/2
>>> impact_time = max(solve(y - yf, t))  # solve returns both roots; keep the positive one
>>> xf = x0 + v*cos(theta)*impact_time
>>> xf.evalf()  # evaluate xf numerically
# Plot x vs. y for t in (0, impact_time)
>>> plot_parametric(x, y, (t, 0, impact_time))
In this case the solve operation is nicely separated. SymPy
defaults to an analytic solver but this can be easily swapped
out if analytic solutions do not exist. For example we can
easily drop in a numerical binary search method if we prefer.
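As a rough sketch of what swapping in a numerical method could look like, the snippet below replaces solve with a hand-written bisection search on the same symbolic expression y - yf. The helper name bisect_root, the bracket [0, 10] and the tolerance are assumptions made for this illustration; they are not part of SymPy. The model itself is left untouched and is only compiled to a numeric function with lambdify.

```python
from sympy import Symbol, lambdify, pi, sin

# Same numeric inputs as in the cannon example
x0, y0, yf = 0, 0, -30
g, v, theta = -10, 30, pi / 4

t = Symbol('t')
y = y0 + v * sin(theta) * t + g * t**2 / 2  # symbolic model, unchanged


def bisect_root(f, lo, hi, tol=1e-9):
    """Hypothetical helper: bisection on a bracket where f changes sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid          # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return (lo + hi) / 2


# Turn the symbolic expression into a plain numeric function of t
height_above_target = lambdify(t, y - yf, modules="math")
impact_time = bisect_root(height_above_target, 0.0, 10.0)
print(impact_time)
```

Only the last three lines know about the solution method; the modeling lines above them are the same ones used with the analytic solver.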
If he wishes to use the full power of SymPy the artilleryman
may choose to solve this problem generally. He can do this
simply by changing the numeric inputs to SymPy symbolic
variables.
>>> x0 = Symbol('x_0')
>>> y0 = Symbol('y_0')
>>> yf = Symbol('y_f')
>>> g = Symbol('g')
>>> v = Symbol('v')
>>> theta = Symbol('theta')
He can then run the same modeling code as above to obtain
full solutions for impact_time and the final x position.
Fig. 1: The trajectory of a cannon shot
Fig. 2: A graph of all the variables in our system. Variables on top
depend on variables connected below them. The leaves are inputs to
our system.
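>>> impact_time  # a root of y0 + v*sin(theta)*t + g*t**2/2 - yf
$\frac{-v \sin(\theta) + \sqrt{-2 g y_0 + 2 g y_f + v^2 \sin^2(\theta)}}{g}$
>>> xf
$x_0 + \frac{v \cos(\theta) \left(-v \sin(\theta) + \sqrt{-2 g y_0 + 2 g y_f + v^2 \sin^2(\theta)}\right)}{g}$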
Rather than produce a numeric result, SymPy produces an
abstract syntax tree. This form of result is easy to reason about
for both humans and computers. This allows for the manipulations
which provide the above expressions and others. For
example if the artilleryman later decides he needs derivatives
he can very easily perform this operation on his graph.
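For instance, a derivative of the horizontal position can be taken directly on the expression tree. The sketch below uses SymPy's diff on the model for x; the variable names dx_dtheta and dx_dt are chosen here for illustration only.

```python
from sympy import symbols, sin, cos, diff

x0, v, theta, t = symbols('x_0 v theta t')
x = x0 + v * cos(theta) * t   # horizontal position from the model

dx_dtheta = diff(x, theta)    # sensitivity of x to the firing angle
dx_dt = diff(x, t)            # horizontal speed

print(dx_dtheta)
print(dx_dt)
```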
Motivating Example - Uncertainty Modeling
To control the velocity of the cannon ball the artilleryman
introduces a certain quantity of gunpowder to the cannon. He
is unable to pour exactly the desired quantity of gunpowder
however and so his estimate of the velocity will be uncertain.
He models this uncertain quantity as a random variable that
can take on a range of values, each with a certain probability.
Fig. 3: The distribution of possible velocity values
In this case he believes that the velocity is normally distributed
with mean 30 and standard deviation 1.
>>> from sympy.stats import *
>>> v = Normal('v', 30, 1)
>>> pdf = density(v)
>>> z = Symbol('z')
>>> plot(pdf(z), (z, 27, 33))
v is now a random variable. We can query it with the
following operators
P        -- # Probability
E        -- # Expectation
variance -- # Variance
density  -- # Probability density function
sample   -- # A random sample
These convert stochastic expressions into computational ones.
For example we can ask the probability that the muzzle
velocity is greater than 31.
>>> P(v >31)
This converts a random/stochastic expression v > 31 into
a deterministic computation. The expression P(v > 31)
actually produces an intermediate integral expression which
is solved with SymPy’s integration routines.
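A minimal sketch of this query is given below, assuming a recent SymPy release in which P accepts the evaluate keyword as in the text. The evaluated form is a closed-form expression; the unevaluated form is printed only to show the intermediate object, whose exact printed shape may differ between versions.

```python
from sympy import N
from sympy.stats import Normal, P

v = Normal('v', 30, 1)

prob = P(v > 31)                         # evaluated symbolically
unevaluated = P(v > 31, evaluate=False)  # left as an intermediate expression

print(prob)
print(unevaluated)
print(N(prob))  # roughly 0.1587, one standard deviation above the mean
```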
>>> P(v >31, evaluate=False)
Every expression in our graph that depends on v is now a
random expression.
We can ask similar questions about these expressions.
For example we can compute the probability density of the
position of the ball as a function of time.
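>>> a, b = symbols('a,b')
>>> density(x)(a) * density(y)(b)
$\frac{1}{\pi t^2} \, e^{-\frac{1}{2}\left(\frac{\sqrt{2}\, a}{t} - 30\right)^2} \, e^{-\frac{1}{2}\left(\frac{\sqrt{2}\,(b + 5 t^2)}{t} - 30\right)^2}$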
Fig. 4: A graph of all the variables in our system. Red variables are
stochastic. Every variable that depends on the uncertain input, v, is
red due to its dependence.
Or we can plot the probability that the ball is still in the air
at time t.
>>> plot(P(y > yf), (t, 4.5, 6.5))
Note that to obtain these expressions the only novel work the
modeler needed to do was to describe the uncertainty of the
inputs. The modeling code was not touched.
We can attempt to compute more complex quantities such
as the expectation and variance of impact_time, the total
time of flight.
>>> E(impact_time)
In this case the necessary integral proved too challenging for
the SymPy integration algorithms and we are left with a correct
though unresolved result.
This is an unfortunate though very common result. Mathematical
models are usually far too complex to yield simple
analytic solutions; that is, this unresolved result is the common
case. Fortunately, computing integral expressions is a problem
of very broad interest with many mature techniques. SymPy
stats has successfully transformed a specialized and novel
problem (uncertainty propagation) into a general and well
studied one (computing integrals) to which we can apply
general techniques.

RV Type                          Computational Type
Continuous                       SymPy Integral
Discrete - Finite (dice)         Python iterators / generators
Discrete - Infinite (Poisson)    SymPy Summation
Multivariate Normal              SymPy Matrix Expression

TABLE 1: Different types of random expressions reduce to different
computational expressions (Note: Infinite discrete and multivariate
normal are in development and not yet in the main SymPy distribution).
One method to approximate difficult integrals is through
sampling. SymPy.stats contains a basic Monte Carlo backend which
can be easily accessed with an additional keyword argument.
>>> E(impact_time, numsamples=10000)
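To make the idea behind sampling concrete, the following sketch estimates the expected time of flight by hand, using only Python's random module and lambdify. It is not the sympy.stats Monte Carlo backend; the helper monte_carlo_mean, the fixed seed and the explicit formula for the positive root are assumptions introduced for this illustration.

```python
import random
from sympy import Symbol, lambdify, pi, sin, sqrt

# Numeric inputs from the cannon example; v is left symbolic
y0, yf, g, theta = 0, -30, -10, pi / 4
v = Symbol('v', positive=True)

# Positive root of y0 + v*sin(theta)*t + g*t**2/2 - yf = 0 (g is negative)
impact_time = (-v * sin(theta)
               - sqrt(v**2 * sin(theta)**2 - 2 * g * (y0 - yf))) / g

# Compile the symbolic expression into a plain Python function of v
impact_time_of = lambdify(v, impact_time, modules="math")


def monte_carlo_mean(f, draw, numsamples):
    """Hypothetical helper: average f over numsamples draws from draw()."""
    return sum(f(draw()) for _ in range(numsamples)) / numsamples


rng = random.Random(0)  # fixed seed so repeated runs agree
estimate = monte_carlo_mean(impact_time_of,
                            lambda: rng.gauss(30, 1),  # v ~ Normal(30, 1)
                            numsamples=10000)
print(estimate)
```

The estimate should land close to the deterministic flight time for v = 30, which is about 5.36 seconds, since the spread in v is small.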
A RandomSymbol class/type and the functions P,
E, density, sample are the outward-facing core of
sympy.stats, and the PSpace class is the internal core,
representing the mathematical concept of a probability space.
A RandomSymbol object behaves in every way like a
standard SymPy Symbol object. Because of this one can
replace a standard SymPy variable declaration (a Symbol for
the velocity v, say) with a random variable declaration (such
as the Normal used for v earlier)
and continue to use standard SymPy without modification.
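The sketch below illustrates this substitution on the horizontal position from the cannon example. The function name horizontal_position is hypothetical and only serves to show that the same modeling code accepts either kind of variable.

```python
from sympy import Symbol, cos, pi, simplify, sqrt
from sympy.stats import Normal, E


def horizontal_position(v, t, theta=pi / 4, x0=0):
    """Modeling code: identical for a symbolic and a random v."""
    return x0 + v * cos(theta) * t


t = Symbol('t', positive=True)

x_symbolic = horizontal_position(Symbol('v'), t)        # ordinary Symbol
x_random = horizontal_position(Normal('v', 30, 1), t)   # random variable

print(x_symbolic)
print(E(x_random))  # expectation, still a function of t
```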
After final expressions are formed the user can query
them using the functions P, E, density, sample.
These functions inspect the expression tree, draw out the
RandomSymbols and ask these random symbols to construct
a probability space or PSpace object.
The PSpace object contains all of the logic to turn random
expressions into computational ones. There are several types
of probability spaces for discrete, continuous, and multivariate
distributions. Each of these generates different computational
expressions.
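The inspection step can be observed from the outside. The sketch below assumes a recent SymPy release, where random_symbols lives in sympy.stats.rv and pspace is exported from sympy.stats; these locations may differ in the version described in this paper.

```python
from sympy import Symbol, cos, pi
from sympy.stats import Normal, pspace
from sympy.stats.rv import random_symbols

t = Symbol('t')
v = Normal('v', 30, 1)
x = v * cos(pi / 4) * t      # a random expression built with ordinary SymPy

rvs = random_symbols(x)      # RandomSymbols found in the expression tree
space = pspace(x)            # probability space that backs them

print(rvs)
print(type(space).__name__)
```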
Implementation - Bayesian Conditional Probability
SymPy.stats can also handle conditioned variables. In this
section we describe how the continuous implementation of
sympy.stats forms integrals using an example from data
assimilation.
We measure the temperature and guess that it is about 30C
with a standard deviation of 3C.
>>> from sympy.stats import *
>>> T = Normal('T', 30, 3)  # Prior distribution
We then make an observation of the temperature with a
thermometer. This thermometer states that it has an uncertainty
of 1.5C.
Fig. 5: The prior, data, and posterior distributions of the temperature.
>>> noise = Normal('eta', 0, 1.5)
>>> observation = T + noise
With this thermometer we observe a temperature of 26C. We
compute the posterior distribution that cleanly assimilates this
new data into our prior understanding, and plot the three
distributions.
>>> data = 26 + noise
>>> T_posterior = Given(T, Eq(observation, 26))
We now describe how SymPy.stats obtained this result. The
expression T_posterior contains two random variables,
T and noise, each of which can independently take on
different values. We plot the joint distribution below in figure
6. We represent the observation that T + noise == 26 as
a diagonal line over the domain for which this statement is
true. We project the probability density on this line to the left
to obtain the posterior density of the temperature.
These geometric operations correspond exactly to Bayesian
probability. All of the operations, such as restricting to the
condition and projecting to the temperature axis, are managed
using core SymPy functionality.
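The restriction and projection can be worked through by hand for this example. The sketch below is not how sympy.stats performs the computation internally; it substitutes noise = 26 - T into the joint density, which is the restriction to the diagonal line, and reads the posterior mean and variance off the resulting quadratic exponent. Normalizing constants are dropped because they do not affect the mean or the variance.

```python
from sympy import Poly, Rational, symbols

t = symbols('t', real=True)

# Exponents of the two Gaussian densities (normalizing constants dropped)
prior_exponent = -(t - 30)**2 / (2 * 3**2)                # T ~ Normal(30, 3)
noise_exponent = -(26 - t)**2 / (2 * Rational(3, 2)**2)   # noise = 26 - T

# Restricting the joint density to T + noise == 26 adds the exponents
a, b, _ = Poly(prior_exponent + noise_exponent, t).all_coeffs()

# exp(a*t**2 + b*t + c) is a Gaussian with the mean and variance below
posterior_mean = -b / (2 * a)
posterior_variance = -1 / (2 * a)
print(posterior_mean, posterior_variance)  # 134/5 9/5
```

The posterior mean of 26.8C sits between the prior guess and the measurement, closer to the measurement because the thermometer is the more precise of the two.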
Scientific computing is a demanding field. Solutions frequently
encompass concepts in a domain discipline (such as fluid
dynamics), mathematics (such as PDEs), linear algebra, sparse
matrix algorithms, parallelization/scheduling, and local low
level code (C/FORTRAN/CUDA). Recently uncertainty layers
are being added to this stack.
Often these solutions are implemented as single monolithic
codes. This approach is challenging to accomplish, difficult to
reason about after-the-fact and rarely allows for code reuse. As
hardware becomes more demanding and scientific computing
expands into new and less well trained fields this challenging
approach fails to scale. This approach is not accessible to the
average scientist.
Various solutions exist for this problem.
Low-level Languages like C provide a standard interface for
a range of conventional CPUs effectively abstracting low-level
architecture details away from the common programmer.
Libraries such as BLAS and LAPACK provide an interface
between linear algebra and optimized low-level code. These
libraries provide an interface layer for a broad range of
architecture (i.e. CPU-BLAS or GPU-cuBLAS both exist).
Fig. 6: The joint prior distribution of the temperature and
measurement noise. The constraint T + noise == 26 (diagonal line) and
the resultant posterior distribution of temperature on the left.
High quality implementations of vertical slices of the stack
are available through higher level libraries such as PETSc and
Trilinos or through code generation solutions such as FENICS.
These projects provide end to end solutions but do not provide
intermediate interface layers. They also struggle to generalize
well to novel hardware.
Symbolic mathematical modeling attempts to serve as a thin
horizontal interface layer near the top of this stack, a relatively
empty space at present.
SymPy stats is designed to be as vertically thin as possible.
For example it transforms continuous random expressions into
integral expressions and then stops. It does not attempt to
generate an end-to-end code. Because its backend interface
layer (SymPy integrals) is simple and well defined it can be
used in a plug-and-play manner with a variety of other back-
end solutions.
Multivariate Normals produce Matrix Expressions
Other sympy.stats implementations generate similarly structured
outputs. For example multivariate normal random variables
found in sympy.stats.mvnrv generate matrix expressions.
In the following example we describe a standard
data assimilation task and view the resulting matrix expression.
mu = MatrixSymbol('mu', n, 1)  # n by 1 mean vector
Sigma = MatrixSymbol('Sigma', n, n)  # covariance matrix
X = MVNormal('X', mu, Sigma)
H = MatrixSymbol('H', k, n)  # An observation operator
data = MatrixSymbol('data', k, 1)
Fig. 7: The scientific computing software stack. Various projects are
displayed showing the range that they abstract. We pose that scientific
computing needs more horizontal and thin layers in this image.
R = MatrixSymbol('R', k, k)  # covariance matrix for noise
noise = MVNormal('eta', ZeroMatrix(k, 1), R)
# Conditional density of X given HX+noise==data
density(X, Eq(H*X + noise, data))
µ= [I0]Σ0
Σ= [I0]IΣ0
Those familiar with data assimilation will recognize the
Kalman Filter. This expression can now be passed as an
input to other symbolic/numeric projects. Symbolic/numerical
linear algebra is a vibrant and rapidly changing field. Because
sympy.stats offers a clean interface layer it is able to
easily engage with these developments. Matrix expressions
form a clean interface layer in which uncertainty problems
can be expressed and transferred to computational systems.
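For readers who want to see such a matrix expression with the main SymPy distribution, the sketch below writes the textbook Kalman update by hand using MatrixSymbol. It does not use MVNormal and is not the output of sympy.stats; the function name kalman_update is hypothetical. The same function is then applied to explicit 1 by 1 matrices taken from the thermometer example as a consistency check.

```python
from sympy import Matrix, MatrixSymbol, Rational, symbols


def kalman_update(mu, Sigma, H, R, data):
    """Posterior mean and covariance of X given H*X + noise == data."""
    gain = Sigma * H.T * (H * Sigma * H.T + R)**-1
    return mu + gain * (data - H * mu), Sigma - gain * H * Sigma


# Symbolic form: the result stays a matrix expression
n, k = symbols('n k', positive=True, integer=True)
mu = MatrixSymbol('mu', n, 1)
Sigma = MatrixSymbol('Sigma', n, n)
H = MatrixSymbol('H', k, n)
R = MatrixSymbol('R', k, k)
data = MatrixSymbol('data', k, 1)
mu_post, Sigma_post = kalman_update(mu, Sigma, H, R, data)
print(mu_post)

# Scalar check with the thermometer numbers (variances 3**2 and 1.5**2)
mean_1d, var_1d = kalman_update(Matrix([[30]]), Matrix([[9]]),
                                Matrix([[1]]), Matrix([[Rational(9, 4)]]),
                                Matrix([[26]]))
print(mean_1d, var_1d)
```

Because the symbolic result is an ordinary matrix expression, it can be handed to any backend that understands such expressions.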
We generally support the idea of approaching the scientific
computing conceptual stack
(Physics/PDEs/Linear-algebra/MPI/C-FORTRAN-CUDA) with a sequence of simple
and atomic compilers. The idea of using interface layers to
break up a complex problem is not new but is oddly infrequent
in scientific computing and thus warrants mention. It should be
noted that for heroic computations this approach falls short -
maximal speedup often requires optimizing the whole problem
at once.
We have foremost demonstrated the use of sympy.stats, a
module that enhances SymPy with a random variable type. We
have shown how this module allows mathematical modellers
to describe the uncertainty of their inputs and compute the
uncertainty of their outputs with simple and non-intrusive
changes to their symbolic code.
Secondarily we have motivated the use of symbolics in
computation and argued for a more separable computational
stack within the scientific computing domain.
[Sym] SymPy Development Team (2012). SymPy: Python library for
symbolic mathematics. URL
[Roc12] M. Rocklin, A. Terrel, Symbolic Statistics with SymPy, Computing
in Science & Engineering, June 2012.
[Joy11] D. Joyner, O. Čertík, A. Meurer, B. Granger, Open source
computer algebra systems: SymPy, ACM Communications in Computer
Algebra, Vol 45, December 2011.