*let's focus on good definitions...*

Today, I'll explore a seemingly simple question: *what is a rotation?* I'm asking for the **definition** of a rotation, which is important because before we can say anything useful about rotations (it should go without saying), we need to know what rotations are. Definitions are an integral part of math, and whenever I get stuck while reading a paper, I can usually trace the issue to a misunderstanding of some definition. You've also probably heard someone say that math deals with abstractions. Well, definitions are where these abstractions take place. Definitions are mathematicians' attempts to boil ideas/questions/systems down to their essence; the fewer moving parts there are to think about, the easier it is to understand what is going on. There is still another advantage: the fewer things involved in a definition, the more applicable the definition becomes. So when I ask for the definition of a rotation, try to think of the most general description that can be applied to other settings, e.g., in three/four/five dimensions.

So let's start with a naive definition. In the plane (2D), rotation involves fixing a point, called the **fixed point** or **centre of rotation**, and then rotating the rest of the points by some specified angle. We already have a problem! This is a circular definition since I'm using “rotating” to define “rotation”, and don't you hate it when dictionaries do that? So how do we fix this? Notice that every point (except the fixed point) lies on a unique circle centred at the fixed point. “Rotating a point” means moving counterclockwise along this circle a distance proportional to the radius of the circle; the proportionality constant is also known as the **angle of rotation**. This definition works, but it's kinda wordy, and we'll soon find that it's not the simplest way to define a rotation.

What happens in 3-space? Naively, rotation involves fixing a straight line, the **axis of rotation**, and then rotating the rest of the points. Okay, blah blah circular definition blah blah. But every point not on the axis of rotation lies on a unique plane perpendicular to the axis of rotation. We know how to rotate planes (previous paragraph), and so we rotate these parallel planes by the same angle about (their intersections with) the axis. We need to make sure we have a consistent way of deciding what it means to move counterclockwise on these planes; this amounts to choosing which side represents the “top” of the planes. The definition is no longer circular and is instead *recursive*, since rotating planes had to be defined first. Just as before, this definition works but can be greatly simplified.

In both cases, 2D and 3D, you can directly verify that rotating *preserves distances* between points: if two points are 3 units apart, they remain 3 units apart after rotating. Actions that preserve distances are known as **isometries**. Thus rotations are isometries, and it is easy to verify that the composition of isometries is itself an isometry. A second property of rotations that is hard to verify with the given definition but is still very intuitive is **orientation-preservation**: a rotated left hand will always look like a left hand. In contrast, reflections are isometries but are not orientation-preserving: a reflected left hand looks like a right hand. Lastly, when you compose two rotations with intersecting axes, this intersection, which is fixed by each rotation, is a fixed point of the composition. So we have that the composition of two rotations with intersecting axes is an orientation-preserving isometry with a fixed point. In 1775, Euler used spherical geometry to prove his rotation theorem:

**Euler.** Any orientation-preserving isometry of 3-space with a fixed point has a fixed axis.

You can directly verify that: orientation-preserving isometries of the plane with fixed points are rotations. This combined with Euler's rotation theorem allows us to extend to 3D: orientation-preserving isometries of 3-space with fixed points are rotations. In both 2D and 3D, being a rotation is the same as (equivalent to) being an orientation-preserving isometry with a fixed point. This characterization is a lot simpler than talking about concentric circles and parallel planes and will be our new definition.

**Definition 1.** A rotation is an orientation-preserving isometry with a fixed point.

This definition says nothing about the plane or 3-space and can thus be used to define rotation in higher dimensions. Just imagine how difficult it would be to generalize the previous definitions to higher dimensions. In fact, you would have been in a little trouble if you had defined 4D-rotation as having a *fixed plane* (plane of rotation?) about which “perpendicular” planes are rotated (analogous to the original definition for 3D-rotations): in this case, we lose the property that the composition of rotations with intersecting planes of rotation is itself a 4D-rotation. Under our new definition, a rotation will not necessarily have a fixed plane of rotation. If it has one, it will be called a **simple rotation**. The definition applies to higher dimensions with one caveat: we need a proper definition of what it means to be orientation-preserving in these spaces; I will return to this in a bit when I discuss the connection to matrices.

Fig. 3 is my last example of why the new definition is better. This whole time I've been talking about rotating the plane/3-space without actually specifying what geometry I am using! You most probably assumed I was talking about the Euclidean plane/3-space, and you were right. But what if I want to deal with non-Euclidean spaces like the hyperbolic plane/3-space? In this case, the original definition breaks one of the nice properties. If you “rotate” the hyperbolic plane by moving points along concentric circles distances proportional to the radii, then you no longer have an isometry: some points are stretched apart while others are squeezed closer together. This is because an angle in the hyperbolic plane is the ratio of a circular arc's length to the hyperbolic sine of the radius. On the other hand, an angle in the Euclidean plane is the ratio of a circular arc's length to just the radius. Fortunately, the new definition of rotation doesn't actually care if we're talking about hyperbolic or Euclidean space -- it's *space-agnostic*, if you will.

The downside of this definition is that it doesn't help in describing a particular rotation. Based on the definition, one would think that a fixed point is the only thing you need to differentiate rotations; but what if two rotations share a fixed point? And what can we say about their composition? Since I want to focus on these questions, all rotations mentioned from here on out will share a fixed point, call it the origin. For 2D-rotations, we can differentiate rotations (that fix the origin) by their angles of rotation, and the sum of these angles describes their composition. For 3D-rotations, we can differentiate rotations by their axes and angles of rotation, but it is difficult to determine the axis/angle of rotation of their composition. So what other description can we use?

Fortunately for us, any isometry of Euclidean space (thought of as a vector space) that fixes the origin is a linear map (can you show why?) and linear maps can be described by matrices. Fix a Cartesian coordinate system for your space; then the images of the standard unit vectors under the isometry give you the columns of the matrix. Since isometries preserve distances (and angles), the columns of the matrix are perpendicular/orthogonal unit vectors and you have an orthogonal matrix. This also allows us to give a general definition of orientation-preserving:

**Definition 2.** An isometry of Euclidean space (that fixes the origin) is **orientation-preserving** (hence a rotation) if the determinant of the corresponding orthogonal matrix is \( +1 \). Such a matrix is called a **special orthogonal matrix**.

This correspondence goes the other way as well. Any (special) orthogonal matrix defines a unique linear map that is an (orientation-preserving) isometry of Euclidean space fixing the origin (a rotation). Thus, we have a bijection between the group of rotations about the origin and the special orthogonal group, \( SO(n) \). Even better, composing rotations corresponds to matrix multiplication, so we in fact have a group isomorphism! Special orthogonal matrices are a nice description of rotations since: 1) we know how to multiply matrices to find the description of the composition of rotations; 2) they generalize to any number of dimensions. In fact, this description is good enough to be an alternative definition for rotations (of Euclidean space that fix the origin), but I won't use it as one (due to the previous parenthetical limitation).

How does the matrix description compare to the angle of rotation description? In 2D, a rotation with an angle of rotation \( \theta \) sends the point \( (1,0) \mapsto (\cos \theta, \sin \theta) \) and \( (0,1) \mapsto (-\sin \theta, \cos \theta) \). So the rotation corresponds to the matrix: \( \begin{pmatrix}\cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix} \). Previously, I said the composition of two 2D-rotations can be described by just adding the angles of rotation. This manifests itself in matrices through the following identity: \[ \begin{pmatrix}\cos \alpha & -\sin \alpha \\ \sin \alpha & \cos \alpha \end{pmatrix}\begin{pmatrix}\cos \beta & -\sin \beta \\ \sin \beta & \cos \beta \end{pmatrix} = \begin{pmatrix}\cos(\alpha + \beta) & -\sin(\alpha + \beta) \\ \sin(\alpha + \beta) & \cos(\alpha + \beta) \end{pmatrix} \] and this is just the angle-sum formulas in disguise.
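
Here's a quick way to check this identity numerically (a Python/NumPy sketch of my own; the helper name `rot2` is just for illustration):

```python
import numpy as np

def rot2(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

alpha, beta = 0.7, 1.9

# Composing the rotations multiplies the matrices...
composed = rot2(alpha) @ rot2(beta)
# ...which matches rotating once by the summed angle.
summed = rot2(alpha + beta)

print(np.allclose(composed, summed))   # → True
```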

In 3D, it's not as straightforward to switch from the axis/angle of rotation description to a matrix or vice versa. First, I'll do this for the rotation depicted in Fig. 2 and then I'll outline what to do in general. The rotation in Fig. 2 sends the \( x \)-axis to the \( y \)-axis, the \( y \)-axis to the \( z \)-axis, and the \( z \)-axis to the \( x \)-axis. This means it corresponds to the matrix \( M = \begin{pmatrix}0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \). For the axis/angle description:

- the picture shows that the axis of rotation is the line \( (x = y = z) \);
- applying the rotation three times gives the identity, so the angle of rotation must be \( 120^\circ \).

But can I find the axis/angle of rotation using linear algebra? Since axes of rotation are fixed lines, they correspond to eigenvectors with eigenvalue \( 1 \). The matrix \( M \) has eigenvector \( \vec n = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \) and, indeed, this vector spans the line \( (x=y=z) \). Furthermore, the angle of rotation can be found by studying what the matrix does to vectors orthogonal to \( \vec n \). For example, the vector \( \vec u = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \) is mapped to \( \vec v = M \vec u = \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix} \). Using the dot product, it follows that the angle between \( \vec u \) and \( \vec v \) is \( 120^\circ \). Alternatively, I can find the angle of rotation by applying a change of basis, the most fundamental tool of linear algebra. Basically, I'm introducing a new Cartesian coordinate system \( x'y'z' \), where \( z' \) is parallel to \( \vec n \), \( x' \) to \( \vec u \), and \( y' \) to \( \vec w = \vec n \times \vec u \). The normalized vectors \( \vec u, \vec w, \vec n \) give the columns of the change of basis matrix: \[ C = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \\ 0 & -\frac{2}{\sqrt{6}} & \frac{1}{\sqrt{3}} \end{pmatrix} \] To change basis, conjugate \( M \) with \( C \): \[ M' = C^{-1} M C = \begin{pmatrix}-\frac{1}{2} & -\frac{\sqrt{3}}{2} & 0 \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} & 0 \\ 0 & 0 & 1\end{pmatrix} =\begin{pmatrix}\cos 120^\circ & -\sin 120^\circ & 0 \\ \sin 120^\circ & \cos 120^\circ & 0 \\ 0 & 0 & 1\end{pmatrix} \] The matrix \( M' \) tells us that the \( z' \)-axis is fixed and the \( x'y' \)-plane is rotated \( 120^\circ \).
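
The eigenvector and dot-product computations can be replayed numerically; here's a small sketch (my own illustration in NumPy, with variable names mirroring \( M \), \( \vec u \), \( \vec v \) from the text):

```python
import numpy as np

M = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

# The axis of rotation is an eigenvector of M with eigenvalue 1.
vals, vecs = np.linalg.eig(M)
axis = np.real(vecs[:, np.isclose(vals, 1)][:, 0])
axis = axis / axis[0]                 # rescale: spans the line x = y = z
# axis is now (approximately) [1, 1, 1]

# The angle comes from what M does to a vector orthogonal to the axis.
u = np.array([1.0, -1.0, 0.0])
v = M @ u                             # = (0, 1, -1)
angle = np.degrees(np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
print(angle)                          # ≈ 120 degrees, up to floating point
```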

In general, here's how to switch from axis/angle of rotation description to a matrix. Use the angle of rotation, \( \theta \), to get the matrix \( M' = \begin{pmatrix}\cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1\end{pmatrix} \). Then let the axis of rotation be the \( z' \)-axis of a new \( x'y'z' \)-cartesian system, which gives you a change of basis matrix \( C \). Finally, the matrix corresponding to the rotation is \( M = C M' C^{-1} \). \( M' \) is called the **canonical form** of the matrix \( M \) (after change of basis).
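
This recipe can be sketched in code. The function name `rotation_matrix` and the way the orthonormal frame is chosen are my own illustrative choices; any right-handed frame with the axis as its third vector gives the same result:

```python
import numpy as np

def rotation_matrix(axis, theta):
    """3D rotation about `axis` by angle theta, via the change of basis M = C M' C^T."""
    n = np.asarray(axis, float)
    n = n / np.linalg.norm(n)
    # Pick any unit vector u orthogonal to n, then w = n x u for a right-handed frame.
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u = u / np.linalg.norm(u)
    w = np.cross(n, u)
    C = np.column_stack([u, w, n])     # orthogonal, so C^{-1} = C^T
    c, s = np.cos(theta), np.sin(theta)
    M_canon = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return C @ M_canon @ C.T

M = rotation_matrix([1, 1, 1], np.radians(120))
print(np.round(M))   # ≈ the x→y→z→x permutation matrix from Fig. 2
```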

In 4D, it can be shown using linear algebra that any special orthogonal matrix \( M \) will have a canonical form \[ M' = C^{-1} M C =\begin{pmatrix}\cos \theta_1 & -\sin \theta_1 & 0 & 0 \\ \sin \theta_1 & \cos \theta_1 & 0 & 0 \\ 0 & 0 & \cos \theta_2 & -\sin \theta_2 \\ 0 & 0 & \sin\theta_2 & \cos\theta_2 \end{pmatrix} \] What this means is that a 4D-rotation can be described by introducing a new Cartesian system \( x'y'z'w' \), and the rotation rotates the \( x'y' \)-plane and \( z'w' \)-plane by \( \theta_1 \) and \( \theta_2 \), respectively. So unlike 2D/3D-rotations that only require one angle of rotation, 4D-rotations are described by two. Also, 4D-rotations don't generally have axes of rotation. In fact, if neither angle of rotation is \( 0 \) then the rotation has a unique fixed point; if only one of the angles is \( 0 \) then the rotation has a fixed plane and is a simple rotation; therefore, the set of fixed points for a (nontrivial) 4D-rotation is either a point (0D) or a plane (2D). Compare this with 2D and 3D rotations, where the set of fixed points is a point (0D) and a line (1D), respectively.
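
One way to see these fixed-point dimensions is to build the canonical block matrix and compute the dimension of the solution space of \( (M' - I)v = 0 \). A sketch (the helper `rot4` is my own):

```python
import numpy as np

def rot4(theta1, theta2):
    """Block-diagonal 4D rotation: x'y'-plane by theta1, z'w'-plane by theta2."""
    def block(t):
        return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    M = np.zeros((4, 4))
    M[:2, :2] = block(theta1)
    M[2:, 2:] = block(theta2)
    return M

# Fixed points solve (M - I)v = 0; their dimension is 4 minus the rank of M - I.
for t1, t2 in [(1.0, 2.0), (1.0, 0.0), (0.0, 0.0)]:
    M = rot4(t1, t2)
    fixed_dim = 4 - np.linalg.matrix_rank(M - np.eye(4))
    print(t1, t2, fixed_dim)   # fixed-point dimensions: 0, 2, and 4 respectively
```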

In 5D, the canonical form of special orthogonal matrices is: \[ \begin{pmatrix}\cos \theta_1 & -\sin \theta_1 & 0 & 0 & 0 \\ \sin \theta_1 & \cos \theta_1 & 0 & 0 & 0\\ 0 & 0 & \cos \theta_2 & -\sin \theta_2 & 0 \\ 0 & 0 & \sin\theta_2 & \cos\theta_2 & 0 \\ 0 & 0 & 0 & 0 & 1\end{pmatrix} \] So general 5D-rotations have two angles of rotation and (at least) an axis of rotation. The set of fixed points is either a line (1D) or a 3-space (3D). This can be generalized to higher dimensions to show that a rotation in \( 2n \) or \( 2n+1 \) dimensions will have \( n \) angles of rotation, and if you're in even/odd dimensions then the set of fixed points will have even/odd dimension, respectively. This also allows us to generalize the definition of **simple rotations** to rotations where all but one angle of rotation is \( 0 \). Therefore, by the canonical form, \( (2n)D \)- or \( (2n+1)D \)-rotations are compositions of \( n \) simple rotations.

Before ending the post, I would like to briefly mention another elegant description. Treating the plane as the complex plane, any 2D-rotation about the origin can be represented by multiplication with a complex number on the unit circle. Composing two rotations then corresponds to multiplying two complex numbers on the unit circle, which in turn corresponds to adding their arguments/angles as described earlier. Other than a change in perspective, there's not much gained from this compared to the angle or matrix descriptions. However, one might wonder if there's a way to represent higher-dimensional rotations as multiplications in more complicated number systems.
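
A small sketch of this correspondence, using Python's built-in complex numbers (my own illustration):

```python
import cmath

theta = cmath.pi / 3
rotor = cmath.exp(1j * theta)        # a unit-modulus complex number

z = 1 + 2j                           # the point (1, 2) in the plane
rotated = rotor * z                  # rotate (1, 2) by 60 degrees about the origin

# Composition = multiplication on the unit circle = adding the arguments.
a, b = cmath.exp(0.4j), cmath.exp(1.1j)
print(abs(a * b), cmath.phase(a * b))   # ≈ 1.0 and 1.5 (the summed angle)
```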

In 3D, we can extend the complex number system to the quaternions by adding two more imaginary numbers. A “quaternion number” consists of 4 real numbers (like how complex numbers consist of 2 real numbers) which can be used to represent the axis and angle of rotation. Compare this to the 9 real numbers needed to represent a rotation as a matrix. Composing rotations then corresponds to multiplying the quaternion numbers, which is more efficient than matrix multiplication. There are more efficiency considerations that make quaternions the preferred description of 3D-rotations in real-world applications. Quaternions can also model 4D-rotations but we can't extend the number system to generalize this idea to higher dimensions.
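
Here's a minimal sketch of rotating a vector with quaternions, using the standard \( q v q^{-1} \) recipe (the helper functions are my own, not from any particular library). It reproduces the Fig. 2 rotation: \( 120^\circ \) about \( (1,1,1) \) sends the \( x \)-axis to the \( y \)-axis:

```python
import math

def qmul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def rotate(v, axis, theta):
    """Rotate 3-vector v by theta about a unit axis, via q v q*."""
    w, s = math.cos(theta / 2), math.sin(theta / 2)
    q = (w, s*axis[0], s*axis[1], s*axis[2])
    q_conj = (w, -s*axis[0], -s*axis[1], -s*axis[2])
    _, x, y, z = qmul(qmul(q, (0.0, *v)), q_conj)
    return (x, y, z)

n = 1 / math.sqrt(3)
v = rotate((1, 0, 0), (n, n, n), math.radians(120))
print(v)   # ≈ (0, 1, 0): the x-axis goes to the y-axis, as in Fig. 2
```

Note that only 4 numbers (one quaternion) encode the whole rotation, versus 9 matrix entries.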

If you found this post interesting, here's an exercise for you to repeat most of it but with translations instead.

**Exercise 1.** What is a (Euclidean) translation? What properties do translations have? Does your definition easily generalize to more dimensions? How can you best describe translations? Are there any other types of orientation-preserving isometries of Euclidean space?

Or if you're familiar with hyperbolic space,

**Exercise 2.** Can you come up with descriptions of rotations of hyperbolic space (with a shared fixed point) that apply to any number of dimensions?

*I'd like to thank Kristen for her feedback. The first draft was a mess, all over the place.*

*Is there a situation where we know that random objects have a given property but we still have no example of an object with said property?*

Like I said before, for properties like being transcendental, it is easy to use a counting argument to show that there are uncountably many transcendental numbers, but proving that a particular number is transcendental is rather involved. It has been done, however. So I was asking for a property shared by almost all numbers but for which we have no explicit example.

About a week after asking the question I got an answer from an unexpected place. PBS Infinite Series did a video on exchanging the digits of \( \pi \) and \( e \). Around the 8:30 mark, she discusses normal numbers and being normal is precisely the kind of property I was looking for!

Let's focus on the set \( (0,1) \). Pick a number \( r \) in this set and write its decimal representation. There's some ambiguity here since some rational numbers have two representations: \( 1 = 0.999\ldots \) You should choose the terminating one in this case, but in the end you will notice it won't matter either way. If the representation terminates, we can ask: what is the proportion of a given digit in the representation? For example, if \( r = .123 \) then one third of the digits are \( 1 \), one third are \( 2 \), one third are \( 3 \), and the other seven digits do not appear. Let \( d \in \{ 0, \ldots, 9\} \) denote a digit and \( P_r(d) \) the proportion of \( d \) in the (finite) decimal representation of \( r \). So in our example,

\[ P_{.123}(d) =\begin{cases} \frac{1}{3}, & d = 1,2,3 \\ 0, & \text{otherwise} \end{cases} \]

For another example, let \( r = .0123456789 \); then \( P_r(d) = \frac{1}{10} \), i.e., \( r \) has an equal proportion of all ten digits.

In general, \( r \) will not have a terminating representation and \( P_r(d) \) will be the limit of the proportion of \( d \) as you look at more and more digits of \( r \). So, if \( r = \frac{12}{99} = .121212... \) then

\[ P_{r}(d) =\begin{cases} \frac{1}{2}, & d = 1,2 \\ 0, & \text{otherwise} \end{cases} \]

But \( P_r(d) \) is defined in terms of limits and limits don't always exist! So you shouldn't expect \( P_r(d) \) to be a well-defined function for all \( r \in (0,1) \).
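
These finite-prefix proportions are easy to estimate by generating digits one at a time; here's a sketch using exact rational arithmetic (the function `digit_proportions` is my own illustration):

```python
from fractions import Fraction

def digit_proportions(r, n_digits, base=10):
    """Proportion of each digit among the first n_digits of r in the given base."""
    counts = [0] * base
    x = Fraction(r)
    for _ in range(n_digits):
        x *= base
        d = int(x)          # next digit of the expansion
        counts[d] += 1
        x -= d
    return [c / n_digits for c in counts]

# r = 12/99 = 0.121212...: half the digits are 1, half are 2.
props = digit_proportions(Fraction(12, 99), 1000)
print(props[1], props[2])   # → 0.5 0.5
```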

**Definition 1.** However, suppose that for some \( r \), \( P_r(d) \) is well-defined and in fact \( P_r(d) = \frac{1}{10} \) for all digits \( d \). Then we say that \( r \) is **simply normal in base 10**. Repeat the whole process in a different base representation; if \( P_r(d) = \frac{1}{b} \) for all digits \( d \) in base \( b \), then \( r \) is **simply normal in base** \( b \). \( r \) is **(simply) normal** if it is simply normal in all bases \( b \ge 2 \).

So \( r=.0123456789 \) is simply normal in base 10. If a number is simply normal in one base, there is no reason to expect it to be simply normal in another. Rational numbers are numbers that have a terminating representation in some base \( b \). Even though \( \frac{1}{3} \) doesn't terminate in base 10, it is equal to \( .1_3 \) in base 3. In fact, a rational number (in \( (0,1) \)) will have a terminating representation with only one digit in some base (its denominator) and so will not be simply normal in this base. The rational number will actually have two representations in this base, but neither one will be simply normal. So the choice we made earlier doesn't actually matter, and rational numbers will fail to be simply normal in some base. To summarize:

**Remark 1.** If \( r \in (0,1) \) is rational, then \( r \) is not normal.

It is really easy to construct simply normal numbers in a specific base, but I don't know if there are examples of numbers that are simply normal in two bases. This might be easy to do if one base is a power of another, but I haven't given it much thought. What I want to focus on is that being normal is a very strong condition: you're saying something about the representation of a number in **all** bases. My intuition would say that no such numbers exist, but Émile Borel proved almost the opposite.

**Borel (1909).** Almost all numbers are normal.

The proof I've seen uses ergodic theory and I can't get into it in this post. It is a proof by contradiction: at the end, one shows that it is impossible for the statement to be false. Unfortunately, the proof doesn't give even a single example of a normal number. Wikipedia references a paper by Becher and Figueira that gives an example of a computable normal number. But no digit is actually explicitly computed in the paper. So can we really say that we have an example? Wikipedia also mentions Chaitin's constant as being normal, but it is (provably) uncomputable.

There's another issue going on here as well: I'm not entirely sure what would satisfy me as an example of a normal number. Being normal is an asymptotic property that we would never be able to verify in a lifetime. If there's an algorithm that outputs the digits of a normal number (in a requested base) but the algorithm is so slow that no digit will actually be output before the heat death of the universe, do we then have an example of a normal number? Even if the algorithm outputs a digit every day, will this mean that we have an explicit example? This becomes a philosophical question, and it seems to me that, for some inexplicable reason, I will not be satisfied with a construction of a number that was made specifically to exhibit the property of normality. Something about such a construction doesn't feel concrete. On the other hand, if \( \pi \) is proven to be normal then I would accept it as an explicit example, maybe because \( \pi \) came from an unrelated area.

Anyway, let me leave it at that with normal numbers. Jeremy Kun mentioned Kolmogorov randomness to me a few weeks ago in response to the blog. He blogs about the idea himself and you can read it here. Roughly speaking, Kolmogorov randomness says that a string of letters/digits is random if there is no complete description of it shorter than the string itself. Some strings can be described in few letters, like “a million a's in a row”, but others, like “kasndfkjwrefusdlsdfasdfskiowreutwevbs”, can't be described in fewer letters than themselves. This is a completely different characterization of randomness that seems independent of probability but captures our intuitive notion of what “random” letters should look like. There's an extension of this idea of randomness to infinite strings and, according to Wikipedia, random infinite strings will be normal in the given alphabet. If the alphabet consists of the digits of a given base, it is not clear to me whether the number described by such an infinite string is a normal number, but by Borel's theorem, it almost surely will be!
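
You can get a rough feel for this with an off-the-shelf compressor: a patterned string has a much shorter description than “random” bytes. A sketch (zlib is only a loose stand-in for “shortest description”, so this is an analogy for Kolmogorov complexity, not a computation of it):

```python
import random
import zlib

# A highly patterned string compresses down to almost nothing...
patterned = b"a" * 1_000_000

# ...while pseudorandom bytes are essentially incompressible.
random.seed(0)
scrambled = bytes(random.getrandbits(8) for _ in range(1_000_000))

print(len(zlib.compress(patterned)))   # on the order of a thousand bytes
print(len(zlib.compress(scrambled)))   # roughly a million bytes
```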

Finally, there's one last cool idea that I have been reading about lately: amenability of groups. Last time, I wrote about how it is impossible to put a uniform probability measure on \( {\mathbb Z} \). I didn't really say what uniform meant and I only used it to claim that all singletons have the same probability; I also didn't define a measure, but a property of measures, countable additivity, allowed me to add up all the (countably many) probabilities and get a contradiction. After that, I described a way we could define a probability \( P \) on \( {\mathbb Z} \) by looking at the limit of uniform probabilities on larger and larger finite sets \( S_n \). In what sense would this new probability be uniform? Is the probability defined for all sets?

To answer the first question, you could start by requiring that singletons have equal probability. Even go one step further and require that finite sets of the same size have equal probability. But you quickly run into trouble with infinite subsets, since they're all countable and hence the “same size” as one another, and yet they can't all have the same probability. So uniformity should at least use part of the structure of \( {\mathbb Z} \). In the last post, I said that the sets \( S_n \) I used to construct the probability were preferred since they used the linear order of the integers. But integers have more structure than just a linear order: they form a group! We will say that the probability is uniform if it isn't affected by addition. So the sets \( A = \{ 0, 3, 10\} \) and \( B = \{ 2, 5, 12 \} \) should have the same probability since \( B = A + 2 \), i.e., \( B \) is obtained by shifting \( A \) to the right by two units. This idea already gives us that singletons all have the same probability. The set of odd numbers and the set of even numbers will have the same probability, since shifting one by one unit gives you the other; so \( P(even) = P(odd) \). So how can we know that the limit of probabilities I described above will indeed be compatible with the group structure of the integers?

Another useful property of probability is that we can add probabilities of mutually exclusive events; this is known as additivity. It gives us \( P(even) + P(odd) = P(even or odd) = P(integer) = 1 \). By uniformity, \( 1 = P(even) + P(odd) = 2\cdot P(even) \), so \( P(even) = P(odd) = 0.5 \). So before we've even defined an explicit probability on \( {\mathbb Z} \), we know that for it to be uniform, there must be a 50-50 chance that an integer is even or odd. However, due to my argument from last time, we won't be able to add countably many probabilities of mutually exclusive events. So we want \( P \) to be only finitely additive instead of countably additive like measures usually are. One last thing we would like is to be able to find the probability of any event/subset of \( {\mathbb Z} \). Can \( P \), defined as a limit of probabilities, compute the probability of any set? This is the really hard part!
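
The 50-50 claim matches the limit construction from the last post; here's a quick sketch, taking \( S_n = \{-n, \ldots, n\} \) (a natural choice of the \( S_n \), consistent with the sizes \( 2n+1 \)):

```python
# Estimate P(even) as |evens in S_n| / |S_n| for growing n.
for n in [10, 1000, 100000]:
    S = range(-n, n + 1)
    ratio = sum(1 for k in S if k % 2 == 0) / len(S)
    print(n, ratio)   # the ratio tends to 0.5 as n grows
```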

Like I said before with normal numbers, limits do not always exist. It is easy to construct sets \( X \) so that \( \lim_{n \to \infty}\frac{|X \cap S_n|}{|S_n|} \) does not exist, and then we can't ask what the probability \( P(X) \) is. This doesn't fail because of the way I chose the \( S_n \)'s; it was bound to fail because subsets of \( {\mathbb Z} \) can be arbitrarily complicated, while limits require a lot of order. To fix this, we choose a (nontrivial) ultrafilter and take the limit with respect to this ultrafilter. Limits along an ultrafilter generalize ordinary limits and were created to solve problems just like this: along an ultrafilter, all bounded sequences have limits! Unfortunately, we need to invoke the axiom of choice to use them, and thus this definition of \( P \) is nonconstructive. This means that even though a probability of every subset of \( {\mathbb Z} \) now exists, some of these probabilities can't be computed. We have now defined a finitely additive probability `measure' on \( {\mathbb Z} \).
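
To see how the limit can fail, here's a sketch with a made-up set \( X \) whose relative size in \( S_n = \{-n,\ldots,n\} \) oscillates forever (the set and helper names are my own):

```python
# X = nonzero integers whose absolute value has an odd number of binary digits,
# i.e. |m| in [1,2) ∪ [4,8) ∪ [16,32) ∪ ... (an illustrative example set)
def in_X(m):
    return m != 0 and abs(m).bit_length() % 2 == 1

def density(n):
    """Relative size of X inside S_n = {-n, ..., n}."""
    S = range(-n, n + 1)
    return sum(1 for k in S if in_X(k)) / len(S)

# Just before a block of X starts, the ratio is near 1/3;
# just after a block of X ends, it is near 2/3 -- so no limit exists.
print(density(4**6 - 1), density(2 * 4**6 - 1))
```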

**Definition 2.** A (finitely generated) group is **amenable** if you can give it a uniform (i.e., translation-invariant), finitely additive probability `measure.'

**Remark 2.** \( {\mathbb Z} \) is amenable.

It's really remarkable that when I wrote the post last month, I was talking about the amenability of \( {\mathbb Z} \) without knowing it. Of course there was more to it, because we needed an ultrafilter to fix the issue with limits not existing. It turns out that the \( S_n \)'s I described are an example of a Følner sequence; such sequences are generally not even required to be nested or neatly ordered. The sets in these sequences are characterized by having boundaries with vanishing relative sizes. For example, my \( S_n \)'s boundaries always have two elements (the maximum and minimum), while the \( S_n \) themselves grow in size: \( |S_n| = 2n+1 \). So the relative sizes of the boundaries, \( \frac{2}{2n+1} \), tend to \( 0 \).

Even more remarkable is how I found out about amenability. My department hosted this year's spring Redbud topology conference, and one speaker briefly mentioned that counting lattice points (square grids) inside a circle, the Gauss circle problem, is related to the amenability of \( {\mathbb Z}^2 \). I then googled amenability and was really surprised to see it was the rigorous version of what I had written about. And how on earth are probability measures related to lattice points inside a circle? Roughly speaking, the lattice points in bigger and bigger circles form a Følner sequence, since the relative sizes of the boundaries are approximated by the ratio of a circle's circumference to its area, \( \approx \frac{2\pi r}{\pi r^2} \), which tends to \( 0 \) as \( r \) grows. This gives a link between amenability and the isoperimetric inequality, which I briefly mentioned in my maths & crafts post.
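
Here's a rough numerical sketch of that boundary-to-area ratio for lattice points in a disk, counting points within distance 1 of the circle as the “boundary” (a choice of mine for illustration):

```python
import numpy as np

# Ratio of boundary lattice points to all lattice points in a disk of radius r.
ratios = []
for r in [10, 50, 250]:
    xs = np.arange(-r - 1, r + 2)
    X, Y = np.meshgrid(xs, xs)
    dist = np.sqrt(X**2 + Y**2)
    inside = np.count_nonzero(dist <= r)                  # ~ area, pi r^2
    boundary = np.count_nonzero(np.abs(dist - r) <= 1)    # ~ a thin annulus, ~ 4 pi r
    ratios.append(boundary / inside)
print(ratios)   # shrinks roughly like 4/r
```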

Amenable groups were first studied by von Neumann as he tried to understand the Banach-Tarski paradox. The paradox says that one can take a ball, “split” it up into finitely many pieces, and rearrange the pieces to form two identical copies of the original ball. Mathematically, this only leads to a contradiction if you attempt to assign a size to every subset of 3-space, \( {\mathbb R}^3 \). If you assume every subset has a volume, then it seems as if we can create volume out of nothing. So to resolve this paradox, one needs to prove that the notion of volume cannot be extended to every set, which leads to the question of amenability of the group of isometries. Terence Tao has two posts on amenability and the Banach-Tarski paradox that go into more detail.

I love it when math takes me to such unexpected areas.

Random objects are weird... randomness itself is weird... math in general is weird.

— Wire Tapper (@genepeer) February 10, 2017

I've had this thought stuck in my head since December. I was at a conference talk about groups acting on certain spaces. These groups have certain “special” properties and, at the end, the speaker listed several of these properties. Amongst them was the fact that these “special” properties weren't special after all: 'most' groups have them. He went one step further and gave us an example like this:

\[ \left\langle q,w,e,r,t,y,u,i,o,p ~|~ \begin{matrix}etreuyuwreui, tuwtwreyyieiqr, pqiyerwrutiyoowitqr,\\ tqrqwwerytiuop, iotuetettuou \end{matrix} \right\rangle \]

and said, “Here's a random group I made on the way here. I was on the bus and I laid my fingers lightly on the keyboard. Whenever the bus bumped, it would make me type random letters. Isn't it incredible that I can claim this group has these 'special' properties?!” To be fair, this was only meant as a joke; the description of how he came up with the group got a laugh out of everyone (myself included) in the room.

Despite knowing it was a joke, something about it bothered me. I talked to my advisor afterwards about it, and he pointed out to me that 'randomness' doesn't make any sense without a sample space. Basically, my issue was: “What if he wasn't just joking? Do we really know if that group he showed has these properties?” And my advisor explained, “The joke aside, his statement doesn't make sense. He can't meaningfully say here's a random group without describing the **sample space** (and probability distribution).” Ok, so he didn't explicitly say what the sample space was, but he was talking about *finitely presented (f.p.) groups*. Could that be the implied sample space? If it were, then we would like to say that when choosing a random f.p. group, every such group is equally likely. We don't want our method of choosing to favor some groups over others. This would give us a *uniform distribution*.

So what if he said, “here's a random group from the sample space of f.p. groups chosen with uniform distribution”? Well, it would be nonsense because *it is impossible* to put a uniform distribution on a countably infinite sample space! To explain why this is true I would have to define what a probability distribution on a sample space (probability space) is and what it means for it to be uniform. This is where measure theory comes into play and completely derails the topic. But I'll try to illustrate this with the integers, which are also countably infinite:

Theorem 1. \( \mathbb Z \) cannot carry a uniform probability distribution.

*Sketch of proof by contradiction:* Suppose such a distribution existed. Then let \( p(0) \) be the probability of choosing 0. We only know that \( p(0) \in [0,1] \). Since the distribution is uniform, \( p(0)=p(1) = p(-1) = p(2) = \cdots \). Let \( x = p(0) \). Since \( \mathbb Z \) is countably infinite, we can add up all the probabilities to get the total probability, \( 1 \): \[ 1 = \sum_{i \in \mathbb Z} p(i) = \sum_{i \in \mathbb Z} x \] But no real number \( x \) works: if \( x = 0 \) the sum is \( 0 \), and if \( x > 0 \) the sum diverges to infinity; either way it can't equal \( 1 \). So \( \mathbb Z \) can't carry a uniform distribution.\(~\Box\)

The above argument holds if \( \mathbb Z \) is replaced with any countably infinite set, e.g., set of f.p. groups. Furthermore, when the sample space is uncountable, e.g. \( [0,1] \), then a similar argument can show that not all subsets of the sample space have a well-defined probability (measure).

Corollary. It doesn't make sense to ask for the probability that a random integer (chosen uniformly) is even.

On the other hand, if the sample space was \( \{ 1,2,3, 4 \} \) then we would need to find \( x \) such that \( 1 = \sum_{i=1}^4 x = 4x \). So \( x = 0.25 \). So a sample space of \( 4 \) items can have a uniform distribution where each item has a probability of a quarter of being chosen.

And that corollary is what my advisor was telling me in the context of f.p. groups. I understood him, but the speaker had been talking about random f.p. groups just a few minutes earlier. People have written papers upon papers about properties that random groups have. What is going on? Well, *they're redefining what it means to be random* when dealing with countably infinite sets.

We're not allowed to choose uniformly randomly from a countably infinite set \( S \) but we can do it for finite sets. So one thing we can do is arrange our set \( S \) into nested hierarchies of finite set \( S_n \), i.e. \[ S_n \subset S \text{ is finite } \forall n \in \mathbb N \] \[ S_1 \subset S_2 \subset S_3 \subset S_4 \subset \cdots \] \[ \bigcup_{n=1}^\infty S_n = S \] Suppose, \( S = \mathbb Z \) and we are interested in whether a 'random' number is even. Then let \( S_n = \{ i \in \mathbb Z~:~ |i| \le n \} \). Now each \( S_n \) is finite (size \( = 2n+1 \)), each one is contained in the next one, and their union is the full set \( \mathbb Z \). Since we can choose uniformly from each \( S_n \), we can ask the probability that a random integer in \( S_n \) is even. Let's call this probability, \( P(\text{even}|S_n) \). Then we compute that: \[ \begin{aligned} P(\text{even}|S_1) &= \frac{1}{3} & & & P(\text{even}|S_2) &= \frac{3}{5}\\ P(\text{even}|S_3) &= \frac{3}{7} & & & P(\text{even}|S_4) &= \frac{5}{9}\\ P(\text{even}|S_5) &= \frac{5}{11} & & & P(\text{even}|S_6) &= \frac{7}{13} \end{aligned} \]
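The table above can be reproduced with a few lines of Python (a throwaway sketch; the function name is mine):

```python
from fractions import Fraction

def p_even(n):
    """Probability that a uniformly chosen element of S_n = {-n, ..., n} is even."""
    s_n = range(-n, n + 1)                      # S_n has 2n + 1 elements
    evens = sum(1 for i in s_n if i % 2 == 0)
    return Fraction(evens, 2 * n + 1)

# Reproduces the table: 1/3, 3/5, 3/7, 5/9, 5/11, 7/13.
print([p_even(n) for n in range(1, 7)])
```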

As you can see, the probability gets closer and closer to \( \frac{1}{2} \) as \( n \) grows. So we (informally) say the probability that a random integer is even is \( \frac{1}{2} \). But we have to be *very* careful. We only got the half because of how we defined each \( S_n \). I could have defined them differently so that the probability got closer and closer to any number in \( [0,1] \). Yes, I could have defined the \( S_n \)'s so that the probabilities converged to \( 0 \) and we'd have to say the probability that a random integer is even is \( 0 \). But the \( S_n \)'s I defined above are the preferred ones because they make use of the natural order of the integers. However, with the same \( S_n \)'s, the prime number theorem implies that:

The probability that a random positive integer is prime is 0.

Wait... What? But prime numbers exist! What do you mean a random number can't be prime? This is another place where probability gets confusing. We all start to learn about (uniform) probability in situations where the sample space is finite. We are taught that there is an equivalence between probabilistic statements and existential statements. For example, saying “The probability that a random card is a heart is 1/4” is the same thing as saying “13 of the 52 cards are hearts.” Or saying “The probability that a random student is 8 years old is 0” means “None of the students are 8 years old.” In particular, saying “No random student is 8 years old” is the exact same thing as saying “No student is 8 years old.” In the finite case, the adjective *random* can be dropped without changing the meaning.

But once we expand the idea of probability to an infinite sample space, this equivalence is broken. It is tempting to rephrase the above statement about prime numbers as “No positive integer is prime” or “Every positive integer is composite,” but that would be false. So instead, we rephrase the statement as “*Almost* no positive integer is prime” or “*Almost* every positive integer is composite,” or, keeping the *random* adjective, “*Random* positive integers are composite/not prime.”

That means there's a difference between “positive integers are not prime” and “*random* positive integers are not prime.” The adjective *random* means something very important when your sample space (e.g., the positive integers) is infinite; it can't just be dropped from the sentence without changing the meaning.

When the sample space is infinite, statements about random objects probably don't mean what you think they mean.

So there are theorems stated as “random (f.p.) groups have [property x]” or “almost every (f.p.) group has [property x]” but it would be wrong (yet funny) to type a bunch of 'random' letters for a group presentation and conclude that that group has [property x] because it was 'random.'

Here is a list of interesting properties about random objects that I've come across.

- G. Cantor, 1874: random real numbers in \( [0,1] \) are irrational/transcendental.
- B. L. van der Waerden, 1936: random integer polynomials are not solvable in radicals.
- M. Gromov, 1993: random groups are word-hyperbolic.
- P. Dani, 2008: random elements in (nonelementary) word-hyperbolic groups have infinite order.
- Calegari-Walker, 2014: random groups have many surface subgroups. (This was the statement that inspired this post)
- J. Maher, 2010: random walks in mapping class groups end on pseudo-Anosov elements.
- Cumplido-Wiest, 2017: random elements in mapping class groups *can be* pseudo-Anosov.

What's more interesting is the question of constructing examples. For example, I can give you the easy diagonal argument that real numbers are uncountable but rational/algebraic numbers are countable, so irrational/transcendental numbers must exist. But I haven't given you an example of an irrational/transcendental number. Essentially, it is difficult to tell whether a given number is transcendental even though we know *almost all* numbers are. Similarly, it takes some work to know if a given integer polynomial is not solvable in radicals even though we know *almost all* integer polynomials aren't. The same goes for being a word-hyperbolic group.

I found the fourth property while I was looking for a proof that all infinite word-hyperbolic groups had at least one infinite order element. So not only do they have one infinite-order element, *almost all* elements are infinite-order!

The last two properties listed illustrate how randomness can mean very different things. You would expect random walks to end on random elements but that is not necessarily the case. So we can know that random walks end on special elements without knowing whether random elements are special! At least, we can now say (via an unrelated argument) that random elements can be special, i.e., the probability of being special is not zero. As far as we know, it is possible for random walks to always end on special elements even though special elements are only say 'half' of all the elements.

I'd like to leave you with this question related to random objects and constructivism.

*Is there a situation where we know that random objects have a given property but we still have no example of an object with said property?*

I first started making these as a kid when I failed to build functional kites. After a few failed kite-making attempts, I'd have a bunch of sticks and yarn and I'd just start playing with those instead. Start wrapping the yarn around the sticks moving from one segment to the next and before you know it, you have a kite! Just like the previous failed attempts, this kite doesn't fly but at least it looks pretty. Turns out this is known as an Ojo de Dios (God's Eye). It is a spiritual object for the Huichol and Tepehuan in Mexico; their designs are larger and use multiple colors to weave really beautiful patterns. I don't know much about the ojo de dios and its history but you should definitely google it if you're curious. Out of respect, I'll refer to my rather bland pieces as kites.

So back to what I was building. Even if you're just weaving around two crossed sticks to form a diamond shape, there are a lot of choices you end up making that determine what the final kite will look like. For example, assuming you're looking down at the sticks, each time you wrap the yarn around the sticks, you have a choice of whether to go over or under the sticks. Also, since one stick is over the other, it matters which of these two sticks you choose to go over or under. This all sounds complicated but the following images should help you understand what I mean. In these images, I pretend the yarn is an annulus (a disk with a hole in it, like a CD) that is being glued to two crossed sticks.

Exercise 1. Did my images above cover all the different ways the annulus can be glued to the cross? If not, how many are missing?

Naively, you could say there are 4 segments of the cross and, at each segment, you have 2 choices: to either glue above or below the segment. So there should be \( 2^4 = 16 \) different configurations. You need to be careful because some of these configurations may “look different” but rotating or flipping^{1} one gives you the other. But I don't want to count rotations/flips as different designs. So to answer my exercise, you'd have to see if you can rotate or flip my examples to produce any others that seem to be missing. Here are the kites represented by two of the models.
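As a sanity check on the naive count, here's a sketch that groups the \( 2^4 = 16 \) configurations into classes that differ only by a rotation of the whole kite. This is a simplified model of my own: each configuration is a 4-bit over/under string, a quarter-turn is a cyclic shift, and flips are ignored (they also swap which stick is on top, which this model doesn't capture):

```python
from itertools import product

def rotations(config):
    """All four cyclic shifts of a 4-bit over/under configuration."""
    return {config[i:] + config[:i] for i in range(4)}

# Group the 16 raw configurations into rotation classes.
seen, classes = set(), []
for config in product((0, 1), repeat=4):
    if config not in seen:
        orbit = rotations(config)
        classes.append(min(orbit))  # canonical representative of the class
        seen |= orbit

print(len(classes))  # 6 classes up to rotation
```

Six classes up to rotation only; whether flips merge any of these further depends on how you model a flip, which is part of the exercise.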

On the surface, the exercise is a counting problem, but when you dig deeper you realize it is asking if you can figure out when two images that look different are actually different perspectives of the same thing. This comes up a lot in mathematics and is known as the **isomorphism problem**. It is usually a difficult problem to solve since the two “perspectives” can be so different that no connection between them can be found. It shouldn't come as a surprise that a whole field of math, knot theory, was developed just to answer general versions of my exercise. For another example of how difficult isomorphism problems can be, the *graph isomorphism problem* asks whether two networks have the same structure. The following image illustrates how the same network can be visualized differently. It is from a *Quanta* article published a year ago, when an algorithm that solves the problem was announced. You can read more about it here.
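For tiny graphs, you can brute-force the graph isomorphism problem by trying every relabeling of the vertices; the point is that this approach blows up factorially, which is why the general problem is so hard. The example graphs below are my own, not the ones from the article:

```python
from itertools import permutations

def isomorphic(edges_a, edges_b, n):
    """Brute-force check: is some relabeling of vertices 0..n-1 turning
    the first edge set into the second? Only feasible for tiny graphs."""
    target = {frozenset(e) for e in edges_b}
    return any({frozenset((p[u], p[v])) for u, v in edges_a} == target
               for p in permutations(range(n)))

# The same 4-cycle drawn two different ways: as a square and with crossing edges.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
crossed = [(0, 2), (2, 1), (1, 3), (3, 0)]
print(isomorphic(square, crossed, 4))  # True: they are the same network
```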

Closely related to the last exercise is the idea of symmetry. Pick one of the kite models; its symmetries are the things you can do to it such that it still *looks the same* afterwards. “Doing nothing” is also a symmetry and I'll call it the “trivial symmetry.” Looking at my first kite, all I can do to it is rotate it clockwise or counterclockwise. It has 4-fold rotational symmetry (rotate \( 90^\circ \)) and that's it. I can not flip it because it wouldn't look the same: before, the sticks were hidden by the yarn; after flipping, the sticks are exposed. Since you can combine symmetries, e.g., rotate then flip then rotate, and even “undo” them, e.g., rotate clockwise then counterclockwise, the collection of symmetries, \( S \), forms what is called a **group**. Other examples of groups are the integers, \( \mathbb Z \), or the integers modulo 4, \( \mathbb Z_4 \). The latter are what you get when you restrict yourself to the numbers \( \{ 0, 1, 2, 3 \} \); you add the numbers as usual, but when the sum is greater than or equal to 4, you subtract 4, e.g., \( 2+2=4 \) then subtract \( 4 \) to get \( 4-4=0 \). So, in \( \mathbb Z_4 \), \( 2+2=0 \).^{2}

This brings us to another isomorphism problem: we can show that \( S \cong \mathbb Z_4 \). This means that the symmetries of my kite have the same *group structure* as the integers modulo 4. To truly appreciate this, let \( 1 \) represent rotating \( 90^\circ \) clockwise. Then \( 2 \) would be rotating \( 90^\circ \) clockwise twice, i.e., rotating \( 180^\circ \) clockwise. What does \( 2+2 \) mean? It means rotate \( 180^\circ \) then again rotate \( 180^\circ \). If you try it out with anything, you'll notice this brings you back where you started, i.e., \( 2 + 2 = 0 \). Being able to show that \( S \cong \mathbb Z_4 \) means you (sort of) fully understand what the symmetries of the kite are, since the integers modulo 4 are well understood.
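The isomorphism can be made concrete in a couple of lines (a sketch of my own): a symmetry of the first kite is a count of quarter-turns, and composing symmetries is addition modulo 4.

```python
def compose(a, b):
    """Compose two rotations, each given as a count of quarter-turns (0-3).
    This is exactly addition in Z_4."""
    return (a + b) % 4

print(compose(2, 2))  # 0: a half-turn followed by a half-turn is the trivial symmetry
print(compose(1, 3))  # 0: a quarter-turn undone by three more quarter-turns
```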

Exercise 2. What are the symmetries of the other designs from Exercise 1? Are there designs with no symmetries at all (apart from the trivial one)? Are there different designs that have the same symmetries? Symmetries with the same group structure?

Of course, the annulus model I gave earlier doesn't really cover every kite design possible. For starters, it assumes once you've chosen to wrap over a segment of the cross, then you will always wrap over it. But you can also mix it up by alternating between going over and under on the same segment to produce new kite designs. It does, however, take a bit more care to keep the results as neat as before. Here is a kite where the sticks are sandwiched between a self-intersecting sheet of yarn.

For kites, you never have to worry about how much yarn you need for a particular design. You can just unwind the yarn as you go and only cut it when you're done.

Exercise 3. Suppose you must cut the yarn before you start, you want to minimize how much yarn is wasted, and you want to make your kite out of 5 “rings” of yarn. How much yarn would you need? What if you wanted to have 10 rings of yarn?

You should notice that when you double the number of rings, you double the perimeter of your kite, but the amount of yarn will more than double because all the new rings being added are longer than the previous ones. You can think of the length of the yarn needed as being the area of your kite. In two dimensions, doubling the size of an object will double its perimeter but quadruple its area. In general, this relationship is described by an isoperimetric inequality. The inequality answers the question: *if you knew how long the outer ring had to be, how much yarn in total would you need?* I didn't bring this up just as a curiosity; the relationship between perimeter and area is actually very important in my area of study, **geometric group theory**, which is (more or less) the study of symmetries.
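A back-of-the-envelope model (my own, not a measurement) makes the perimeter-vs-area point concrete: if the \( i \)-th ring contributes a length proportional to \( i \), the total yarn for \( n \) rings grows quadratically in \( n \).

```python
def total_yarn(rings):
    """Total yarn length under the toy model where ring i contributes
    length i units. The sum 1 + 2 + ... + n grows like n^2 / 2."""
    return sum(range(1, rings + 1))

five, ten = total_yarn(5), total_yarn(10)
print(five, ten)  # 15 55: doubling the rings takes more than double the yarn
```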

The exercise is also relevant to this project because I will need to cut the yarn before weaving around the sticks for the designs in the next post. This means I need to estimate how much yarn to cut, but not overestimate, because (setting aside the issue of wasting yarn) the longer the yarn, the harder it is to weave with it. I didn't actually do any calculations to figure out how much yarn to cut. In practice, it meant that when I was doubling the size of my design, I knew I'd need at least quadruple the amount of yarn.

After making a few kites, I got bored of the monochromatic square designs. The straight lines were getting dull and I didn't have other colors of yarn to make better looking designs. Meanwhile, the crossed-sticks kept reminding me of astroids, specifically astroids as envelopes of straight line segments:

*The next step was to see if I could weave a curve out of straight lines!* Could I make an astroid? Spoiler alert:

The next post will discuss the mixed answer to this question: yes, I could weave a curve, but no, I can not make an astroid. To see that the kite above is not even an approximation of an astroid, look at the line segments formed by the yarn. In the GIF above it, the line segments are concentrated towards the ends of the axes and more spaced out at the center. However, in the kite, the line segments seem to be uniformly spaced out. This is enough to show that they are different curves and I'll leave that as an exercise.

Exercise 4. Show that the two envelopes (colored curves) below are different.

PS: This is yet another type of isomorphism problem. You're given two different constructions of curves and asked to determine if they describe the same curve.

A flip is actually a rotation with the axis of rotation lying along the plane. ↩

This is modular arithmetic, and internet security is built on top of it. ↩

Fermat struggled to find other mathematicians interested in number theory. In fact, number theory was not a thing in Europe until the 18th century. So, in an attempt to spark interest in what he saw as a beautiful but neglected area of mathematics, Fermat would challenge other European mathematicians with proving remarkable properties of numbers, which he had supposedly proven himself. *Disclaimer: the following sentence and a lot of others are just my opinion and should be taken with a grain of salt.* Unfortunately for Fermat, mathematics was more of a tool for science at the time; contemporary mathematicians weren't interested in studying problems that had no real-world applications (the classic pure vs. applied math divide). Fortunately for future number theorists, a century later Euler found a collection of some of Fermat's letters which contained these unproven claims and proceeded to work on proving many of them; Euler's work managed to do what Fermat's letters had failed to do and, in the 18th century, modern number theory was born. Fermat never published anything in his lifetime, but after his death his son published a new edition of Diophantus' *Arithmetica* which contained the margin notes Fermat had made in his copy of the book. There were also posthumous publications of his letters, some of which found their way to Euler.

Before diving into Fermat's theorems, here is some context. Two weeks ago, I watched *The Man Who Knew Infinity*, the biographical movie based on the life of Ramanujan. In it, Hardy was really tough on Ramanujan for not providing proofs for his statements/theorems. Trying to get Ramanujan to appreciate the importance of proofs, Hardy took him to the Cambridge University Library and showed him the display of Newton's *Principia*; Hardy explained that Ramanujan was capable of reaching Newton's level of recognition but first he had to learn how to write proofs. This scene captures how fundamental proofs are (or seem to be) to mathematics.

Around the same time, I coincidentally read something about there being only one known proof by Fermat. How could this be?? When I took my introduction to number theory as an undergraduate, it felt like everything we did was linked to Fermat. I knew about the time and math developments it took to prove his famous Last Theorem but I assumed the simpler theorems we proved in class had definitely been proven by him. Turns out: he simply never wrote the proofs for all (except one of) his theorems. And did you find it odd that I called them “his theorems” even though we don't have his proofs? Well here's something even more surprising: no one seems to doubt that he actually had all these proofs (with the one exception of course). *Fermat is the only mathematician I know of who got a pass on not giving proofs.* Even before its proof was published in 1994, Fermat's Last Theorem was still called a theorem.

This was so fascinating that I had to dig deeper into it. I found out that he never published anything, at least in number theory, in his lifetime; everything was published posthumously. The only record of Fermat's work, at least in number theory, is in his personal letters and his margin notes in *Arithmetica.* Fortunately for me, everything was collected together and translated to French in the volumes of *Oeuvres de Fermat.* So I started to track:

- When he stated his theorems and/or their proofs;
- Who published the first proofs.

The latter was easy to do since it was Euler for most of them and *The Euler Archive* has all his published works. Without further ado, here are some of Fermat's theorems.

Little Theorem.

(v1) if \( p \) is prime, then \( a^p \equiv a \pmod p \) for all \( a \);

(v2) if \( p \) is prime, then \( a^{p-1} \equiv 1 \pmod p \) for all \( a \not \equiv 0 \pmod p \).

Example 1. Let \( a=3 \) and \( p=5 \).

For v1, we get \( 3^5 \equiv 3 \pmod 5 \), that is, the remainder after dividing \( 3^5~(=243) \) by \( 5 \) is \( 3 \).

For v2, we get \( 3^4 \equiv 1 \pmod 5 \), that is, the remainder after dividing \( 3^4~(=81) \) by \( 5 \) is \( 1 \).

You should try some other examples on your own: \( a = 2, p = 7 \) or \( a=14, p=3 \).
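Python's three-argument pow does modular exponentiation directly, so both versions can be checked in a couple of lines (the function name is mine):

```python
def little_theorem_holds(a, p):
    """Check both versions of Fermat's little theorem for a given a and prime p."""
    v1 = pow(a, p, p) == a % p                    # a^p ≡ a (mod p)
    v2 = a % p == 0 or pow(a, p - 1, p) == 1      # a^(p-1) ≡ 1 (mod p) when p ∤ a
    return v1 and v2

# The worked example and the two suggested ones:
print(all(little_theorem_holds(a, p) for a, p in [(3, 5), (2, 7), (14, 3)]))  # True
```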

I've given two equivalent versions of the theorem, but I must say Fermat did not state it like that; in fact, even the notation I used was created by Gauss over 150 years later. The earliest mention of this theorem that I could find was in Letter 44 to Frenicle (18 Oct. 1640). In that letter, Fermat's Little Theorem is stated along these lines:

Theorem 2.Any (fixed) prime number \( p \) divides one of the listed numbers \( a-1 \), \( a^2 -1 \), \( a^3 -1 \), \( a^4-1 \), \( \ldots \) for any positive integer \( a \). If \( p \) divides \( a^n - 1 \) for some \( n \), then \( n \) divides \( p-1 \). If \( a^{n} - 1 \) is the first element in the list divisible by \( p \), then \( p \) will divide \( a^{nk}-1 \) for all positive integers \( k \).

Example 2. Let \( a=2, p=7 \). \( 7 \) divides \( 2^3 - 1 \) and \( 3 \) divides \( 7-1 \).

Let \( a=3, p=11 \). \( 11 \) divides \( 3^5-1 \) and \( 5 \) divides \( 11-1 \).
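Fermat's phrasing can also be checked directly. A short sketch (function name mine): find the first \( n \) with \( p \mid a^n - 1 \), then confirm that \( n \) divides \( p-1 \) and that \( p \) divides \( a^{nk}-1 \).

```python
def first_exponent(a, p):
    """Smallest n >= 1 with a^n ≡ 1 (mod p), i.e. the order of a mod p.
    Assumes p is prime and a is not a multiple of p."""
    n, power = 1, a % p
    while power != 1:
        n, power = n + 1, (power * a) % p
    return n

n = first_exponent(2, 7)
print(n, (7 - 1) % n == 0, pow(2, 3 * n, 7) == 1)  # 3 True True
```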

What a mouthful! So putting aside the fact that my two versions of the theorem use notation not available to Fermat, their contents (mathematical meaning) are different from his and, technically, his is false: if \( a=p \) then \( p \) never divides \( a^n-1 \) for any positive integer \( n \). So Fermat was implicitly assuming that \( a \) is not a multiple of \( p \). Furthermore, the last two sentences sort of contradict each other. The third sentence implies we can get arbitrarily large \( n \) for which \( p \) divides \( a^n-1 \). That, together with the second sentence, implies arbitrarily large integers can divide the fixed number \( p-1 \), which is absurd. So in the second sentence as well, Fermat was assuming that \( a^n-1 \) was the first element divisible by \( p \). As this was a letter to a friend, Fermat was sloppy with his stated assumptions because Frenicle would probably still understand what he meant. Once you've fixed these tiny “mistakes” and added the necessary assumptions, the statement is still different from mine. He was basically saying what I said and then more; in math jargon, we say his version is stronger than mine. And what does Fermat say about proving the theorem? That *he would have sent a proof were it not too long for the letter.* Doesn't that sound familiar?

**v2** is the closest to Fermat's statement since I had to make the assumption “\( a \not \equiv 0 \pmod p \)”, i.e., \( a \) is not a multiple of \( p \). But I didn't have to make the other assumptions because my statement says nothing about dividing \( p-1 \). The second and third sentences in Fermat's statement, which never appear when people write Fermat's little theorem, will be important in a moment as I discuss the theorem's two fundamentally different proofs.

Euler first discovered the theorem (**v2**) in 1732 while working on Fermat numbers (more on this later), unaware that Fermat too had stated it; it appears as a conjecture at the end of Euler's paper E026. Euler gave a proof (of **v1**) in his paper E054 (1735); **v1** was more natural since his proof relied on induction and the binomial expansion of \( (a+1)^p \). It was a simple proof and most probably the one that you will get in a number theory class. Euler still does not attribute the theorem to Fermat in this second paper. 19 years later, Euler revisited the topic and gave a second proof (of **v2** this time) in E262. This proof uses group-theoretic ideas that make **v2** the natural form of the theorem. Even though groups were yet to be defined, the proof is similar to the standard one given in a modern algebra class. It is finally in this paper that Euler attributes the theorem to Fermat and actually proves Fermat's version of it; the proof is qualitatively better as it gives better intuition of what's going on (*Rule of Thumb: anything that uses group theory is always better*). Given the way Fermat stated the theorem, Weil (pg. 1147) argues that the second proof is probably what Fermat had in mind and I can't help but agree. It's hard to see why one would come up with such a wordy statement if it weren't dictated by the proof itself. In contrast, Burton (pg. 514) says that the first proof is simple and what Fermat had in mind, but how could he have proven the second sentence using induction or the binomial theorem? In the end, Euler generalized these group-theoretic ideas further when he proved Euler's theorem in E271 (1757).
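Euler's generalization says that \( a^{\varphi(n)} \equiv 1 \pmod n \) whenever \( \gcd(a,n)=1 \), where \( \varphi(n) \) counts the integers in \( 1, \ldots, n \) coprime to \( n \); taking \( n = p \) prime gives \( \varphi(p) = p-1 \), i.e., **v2**. A quick brute-force check (my own sketch):

```python
from math import gcd

def phi(n):
    """Euler's totient: the number of integers in 1..n coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

# Euler's theorem holds for every modulus n < 50 and every coprime base a.
print(all(pow(a, phi(n), n) == 1
          for n in range(2, 50)
          for a in range(1, n) if gcd(a, n) == 1))  # True
```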

Note: The letters of Fermat are all in the *Oeuvres de Fermat, Tome 2* and, if they are written in Latin, you can find French translations in *Tome 3*. I'm using this collection's numbering scheme when I say “Letter 44” or “Obs. 3”; the latter refers to the third margin note in Fermat's copy of *Arithmetica*. The letters are usually dated but the notes aren't; I read somewhere (I forget where) that we can assume they were written sometime in 1630-1640. Also, all of Euler's papers can be found at The Euler Archive, where they are catalogued using labels like “E262.” The archive has English translations for most of his papers referenced in this post.

Now on to the second claim on my list:

Christmas Theorem. Let \( p \) be an odd prime number. \( p \) is a sum of two (integer) squares if and only if \( p \) is of the form \( 4n+1 \).

Or using modern notation: \( p = x^2 + y^2 \iff p \equiv 1 \pmod 4 \)

Example 3. \( 5 = 4\cdot 1 + 1 \) is a prime number and \( 5 = 1^2 + 2^2 \).

\( 13 = 4\cdot 3 + 1 \) is a prime number and \( 13 = 2^2 + 3^2 \).

However, \( 3 = 4\cdot 0 + 3 \) and \( 7 = 4 \cdot 1 + 3 \) are primes which can never be written as a sum of two integer squares.
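The theorem is easy to verify by brute force for small primes; here's a sketch (the function names are mine) that checks it for every odd prime below 200:

```python
from math import isqrt

def is_prime(n):
    """Trial-division primality check, fine for small n."""
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

def is_sum_of_two_squares(p):
    """Is p = x^2 + y^2 for some integers x, y >= 0?"""
    return any(isqrt(p - x * x) ** 2 == p - x * x for x in range(isqrt(p) + 1))

# An odd prime is a sum of two squares exactly when it is 1 mod 4.
odd_primes = [p for p in range(3, 200) if is_prime(p)]
print(all(is_sum_of_two_squares(p) == (p % 4 == 1) for p in odd_primes))  # True
```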

Honestly, I saw that name mentioned once on Wikipedia and I couldn't stop myself from using it. Fermat stated it on Christmas 1640 in Letter 45 to Mersenne and again as a margin note, Obs. 7. The theorem is an if and only if statement, and one half is easy to show using modular arithmetic: if \( p \) is a sum of two squares, then \( p = 4n+1 \) (a square is \( 0 \) or \( 1 \bmod 4 \), so a sum of two squares is never \( 3 \bmod 4 \)). The really hard part is showing the second half: if \( p = 4n+1 \) then there are integers \( x \) and \( y \) such that \( p= x^2 + y^2 \). Where do you even begin with this? Fermat himself admitted that it took him a while to get a proof and he gave us a clue on how he did it in Letter 101 to Carcavi (Aug. 1659). This is what he wrote:

*Proof outline:* Suppose some prime \( p_1 \equiv 1 \pmod 4 \) were not a sum of two squares. Then he can find a smaller prime \( p_2 \equiv 1 \pmod 4 \) which is also not a sum of two squares. Applying the exact same argument, he gets an even smaller prime \( p_3 \equiv 1 \pmod 4 \) that's not a sum of two squares. Keep going and the primes that aren't sums of two squares get smaller and smaller until you reach \( 5 \). But this is a contradiction since \( 5 = 1^2 + 2^2 \).\(~\Box\)

What you just read is a **particular type of proof by contradiction** that Fermat developed and used extensively. A proof by contradiction in math is a proof that starts by assuming the opposite of what is being proven and then shows that that is impossible. The most popular example is proving \( \sqrt{2} \) is irrational. You first assume that \( \sqrt{2} = \frac{p}{q} \), i.e., \( \sqrt 2 \) is rational, and then derive a contradiction. The contradiction means \( \sqrt{2} \) can't be rational, so it must be irrational. Fermat does the same thing: he assumes the opposite, i.e., that there is some prime \( p_1 = 4n+1 \) that is not a sum of two squares, and then arrives at a contradiction: the descent would force \( 5 \) not to be a sum of two squares, yet \( 5= 1^2 + 2^2 \). So the original assumption is false and all primes of the form \( 4n+1 \) are sums of two squares. The crux of the argument is in how to find a smaller prime of the same form that is still not a sum of two squares. Since Fermat does not explain how to do this, this is not a proof.

Fermat proved a lot of properties about natural numbers using the same idea: first assume some natural number didn't have the properties, and by some really clever argument, construct a smaller natural number that doesn't have the properties. By that same clever argument, construct smaller and smaller and smaller (... indefinitely) natural numbers that don't have the properties. But this is impossible since natural numbers can not get smaller and smaller forever; we don't have negative numbers! So this contradiction allows Fermat to conclude that all natural numbers have the given properties. Of course you can substitute “natural numbers” in this explanation with any subset of natural numbers. For example, in the outline given above, “natural numbers” has been substituted by “primes of the form \( 4n+1 \).” This type of proof by contradiction functions like an inverse of induction and Fermat called it proof by infinite descent. Fermat discusses it in detail in Letter 101, where the outline given above came from, and he lists various theorems he was able to prove with it. When I decided to write this post, I had about five Fermat theorems that I wanted to put in it. Imagine my shock when I saw that each one, except Fermat's little theorem, was listed in this letter as having been proven by infinite descent!

Now that we've seen the outline from Fermat, how was this theorem actually proven? Euler proved it in the course of two papers, E228 (1749) and E241 (1750), and his proof did use infinite descent. It doesn't, however, follow Fermat's outline, and I am not sure if there's a proof that does; you can read a summary of Euler's proof here. The following is a point I will return to: Fermat mentions in a letter that he needed the Christmas theorem to prove another theorem, the polygonal number theorem, and in the papers in which Euler gave this proof, he was actually trying to prove one case of that latter theorem. He never managed to prove it, but at least we got a complete proof of the Christmas theorem.

Let's talk about Fermat's polygonal number theorem then. One type of polygonal numbers has already shown up in my blog before: Euler's pentagonal number theorem, which is a completely unrelated theorem although it was proven the same year (see E244). Before I get into Fermat's theorem, let me describe polygonal numbers.

Triangular numbers are numbers that arise from building triangles out of dots layer by layer. Square numbers arise from building squares layer by layer. The same goes for pentagonal numbers and pentagons, hexagonal numbers and hexagons, etc. See the following image for an illustration. In general, when the polygons have \( k \) sides, I'll say “\( k \)-gonal numbers.”

**Polygonal Number Theorem.** All natural numbers \( n \) can be written as a sum of at most 3 triangular numbers, 4 squares, 5 pentagonal numbers, 6 hexagonal numbers, etc. In general, every natural number can be written as a sum of at most \( k \) \( k \)-gonal numbers.

**Example 4.** Let \( n = 14 \). Triangular numbers \( (k=3) \): \( 14 = 1 + 3 + 10 \).

Square numbers \( (k=4) \): \( 14 = 1 + 4 + 9 \); the theorem says we need at most 4 and we used only 3.

Pentagonal numbers \( (k=5) \): \( 14 = 1 + 1 + 12 \); note that the theorem allows repeated summands.
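To make Example 4 concrete, here's a small Python sketch of my own (not from any of the sources above) that computes \( k \)-gonal numbers with the standard formula \( P(k,n) = \frac{(k-2)n^2 - (k-4)n}{2} \) and brute-forces decompositions into at most \( k \) \( k \)-gonal numbers:

```python
from itertools import combinations_with_replacement

def polygonal(k, n):
    """n-th k-gonal number: ((k-2)n^2 - (k-4)n) / 2."""
    return ((k - 2) * n * n - (k - 4) * n) // 2

def decompose(target, k):
    """Find a representation of `target` as a sum of at most k
    k-gonal numbers by brute force (repeats allowed)."""
    gons = [polygonal(k, n) for n in range(1, target + 1)
            if polygonal(k, n) <= target]
    for count in range(1, k + 1):
        for combo in combinations_with_replacement(gons, count):
            if sum(combo) == target:
                return list(combo)
    return None  # the theorem says this shouldn't happen

print(decompose(14, 3))  # [1, 3, 10]
print(decompose(14, 4))  # [1, 4, 9]
print(decompose(14, 5))  # [1, 1, 12]
```

The brute force is exponential in \( k \), so this is only a toy for small inputs, but it reproduces all three decompositions from the example.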

Note: Most of Fermat's claims appear in several of his letters. In this post, I've focused only on the very first letter in which a claim appeared; later letters will only be mentioned if he discussed its proof in them. I also cite the margin notes since I can't tell which of the letters and notes is older.

The earliest mention by Fermat of this theorem that I can find is in Letter 12 to Mersenne (Sept. 1636). It is also a margin note in his *Diophantus*, Obs. 18, which could have been written even earlier. The theorem is mentioned again in Letter 74 (25 Sep. 1654), and this is where he said that he needed the Christmas theorem to prove it. Later, he mentioned the \( k=4 \) case of the theorem in Letter 101 while listing theorems he'd proven using infinite descent. Euler never proved this case, but he proved the following partial result in E242 (1750): any number is a sum of at most four squares if one allows the squares to be fractions. Twenty years later, Lagrange would use Euler's ideas to complete the proof of the polygonal number theorem when \( k=4 \), and this case is now called Lagrange's four-square theorem. I've said a few times that Euler never proved this theorem but, to be fair, two years after Lagrange's result, Euler published E445 (1772) and greatly simplified Lagrange's original argument. Does that count as having proven the theorem as well? That's a question for another day.

In 1796, Gauss proved the case \( k=3 \), i.e., that every natural number is a sum of at most 3 triangular numbers. According to Wikipedia, this is sometimes called the Eureka Theorem because Gauss wrote “EUREKA! \( n = \Delta + \Delta + \Delta \)” in his diary the day he proved it. It is also an immediate consequence of another theorem that was published a year later: Legendre's three-square theorem, which is incidentally equivalent to Fermat's claim in Obs. 27. Even though Legendre was the first to publish a proof, the case is credited to Gauss since the diary entry is dated earlier. Finally, Cauchy proved the full polygonal number theorem in 1813 by showing it is a consequence of the Eureka theorem.

Anyway, back to Fermat! So Fermat claimed to have proven, amongst other things, the Christmas theorem and Lagrange's four-square theorem using infinite descent. The eventual proofs did indeed use that same technique. Fermat also said that the Christmas theorem was necessary for the polygonal number theorem, and Euler did have to prove the Christmas theorem before Lagrange could prove the four-square theorem. I'm tempted to believe he had these or similar proofs even though it took other mathematicians nearly two centuries to produce them. On the other hand, Legendre's theorem seems to use deep ideas from the theory of quadratic forms, which was only developed in the 18th century. The thing that works in Fermat's favor is that he did recognize the difficulty of the problem and the [potential] importance of his ideas. Here's what he wrote after stating the theorem in Obs. 18:

**My translation from French.** I can't give the proof here as it depends on various abstruse mysteries of the science of numbers. I intend to dedicate an entire book to the subject and thereby accomplish surprising and unprecedented progress in this area of arithmetic.

It is a shame that no such book was ever written. If Fermat thought he needed an entire book to express the ideas in his proofs, then the two centuries it took to reproduce them don't inspire as much doubt. But it is rather surprising that he never once mentioned these ideas in his letters. So do we now take Fermat's word for it? Maybe not, since there's one claim he made that turned out to be false. Not false in the earlier sense of resting on some implicit assumption, but just flat out false.

**False Conjecture.** \( 2^{2^n}+1 \) is a prime number for all \( n \).

**Examples & Euler's Counterexample.** For \( n = 0, 1, 2, 3, 4 \), the numbers \( 2^{2^0}+1 = 3 \), \( 2^{2^1}+1 = 5 \), \( 2^{2^2}+1 = 17 \), \( 2^{2^3}+1 = 257 \), and \( 2^{2^4}+1 = 65537 \) are all primes.

However, \( 2^{2^5}+1 = 4294967297 = 641 \cdot 6700417 \) is not a prime number. Euler gave this counterexample in E026, the same paper I cited earlier in which Euler independently conjectures Fermat's little theorem.
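These computations are easy to replay. Here's a quick Python check of my own (trial division is plenty for numbers this small) confirming the five prime cases and Euler's factorization:

```python
def is_prime(m):
    """Trial division; fine for the sizes involved here."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

fermat = [2 ** (2 ** n) + 1 for n in range(6)]
print(fermat)                          # [3, 5, 17, 257, 65537, 4294967297]
print([is_prime(F) for F in fermat[:5]])  # the first five are prime
print(fermat[5] % 641)                 # 0 -- Euler's factor divides F_5
print(641 * 6700417)                   # 4294967297
```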

The earliest mention of the conjecture that I found is in Letter 43 (Aug. 1640) to Frenicle and the latest is in Letter 96 (June 1658) to Digby; on the other hand, it never appears in the margins of *Arithmetica*. He had verified it for the given examples, but \( 2^{2^5} + 1 \) might have been too big for him to factor. However, this conjecture is not worth mentioning just because it was *(the only example I could find of him being)* wrong. It is the way he stated it in his letters. In Letter 43, Fermat said he believed it to be true but that he had no proof of it. Again in Letter 96, written 18 years later, he admitted to having no proof of the statement. In this same letter, Fermat claims to have proofs of the Christmas theorem, the polygonal number theorem, and the next theorem I will be discussing. So for 18 years, Fermat differentiated this conjecture from his other statements. A year later, in Letter 101, Fermat seems to claim that he used infinite descent to prove the conjecture. But the conjecture is stated in a rather odd way, and it can be argued that he was overstating his technique's importance to spark the interest of other mathematicians. So is it a coincidence that the one statement he continually admitted not being able to prove for 18 years was the only statement that was eventually disproven? Still, it would be a fallacy to conclude from this that he had proofs for his other theorems. Maybe he just had very good insight into the workings of numbers and came up with the right statements... On to the next Fermat theorem.

**Theorem 6.** No triangular number, except \( 1 \), is a fourth power.

This theorem is in Letter 96 and also at the end of Obs. 45 — the marginal note that contains his only known proof. After giving the proof by infinite descent of his right-triangle theorem, Fermat says that he used the same process to prove this theorem. Euler proves it in E98 (1738), and it's worth mentioning that Euler first proves a statement equivalent to the right-triangle theorem; he was basically following in Fermat's footsteps. Now what is this theorem that we absolutely know Fermat proved?

**Right Triangle Theorem.** There is no right-angled triangle with integer sides whose area is a square number.

Fermat never wrote the proof in his letters even though he was challenging others to prove it (and other equivalent statements) as early as Letter 12. In Letter 101, he gives this as the first example of a theorem he was able to prove using infinite descent, and this is his outline: suppose you have a right-angled triangle with integer sides whose area is a square number. By Fermat-ian magic, construct a smaller right-angled triangle with integer sides whose area is a square number, etc. This leads to a contradiction, so no such triangle existed in the first place. Fortunately for us, we do not have to guess what the Fermat-ian magic is. In Obs. 45, he writes out the proof and it does indeed match his outline. The Wikipedia article on this theorem has a similar proof if you want to see the magic.

While we're on the topic of infinite descent, I want to highlight a distinction between the two outlines of proofs by infinite descent. The right triangle theorem and the theorem before it are negative statements since they're about the non-existence of objects with a given property. That is why they weren't followed by examples like all the other theorems before them. The method of infinite descent is well suited for negative statements: assume that said objects exist, construct smaller objects, etc., and get a contradiction. But the Christmas theorem is an affirmative statement: it is about the existence of the property *“is a sum of two squares.”* Fermat explains in Letter 101 that it took him a while to figure out how to prove affirmative statements using infinite descent. Looking at the dates: the right triangle theorem appeared in Letter 12 (1636) and the Christmas theorem in Letter 45 (1640). So assuming these dates are close to when the theorems were proven, it did take a while for Fermat to adapt the technique to affirmative statements. Seeing Fermat explain the subtle differences in how to apply the technique is evidence to me that he did have proofs for his theorems — he wasn't just spouting correct statements but was actually thinking about the subtleties in their proofs. I'll come back to infinite descent again, but for now let's talk about the one theorem that we all agree Fermat had no proof for.

**Last Theorem.** For \( n \ge 3 \), \( x^n + y^n \neq z^n \) when \( x,y,z \) are positive integers.

The theorem is found in the infamous margin note, Obs. 2, where Fermat stated it and claimed to *have a truly ingenious proof of it but not enough room to write it.* I could be wrong, but I don't think the theorem appears in its full generality in the letters. I did find the cases \( n=3, 4 \) used as challenges to other mathematicians. It can be shown that the right triangle theorem implies the \( n=4 \) case, and so the proof of this case is attributed to Fermat. Fermat lists the \( n=3 \) case in the infinite descent letter, and it shouldn't come as a surprise that Euler's proof of it in E387 (1770) used infinite descent. Fermat's last theorem has a very long history, and its eventual proof 300+ years later required deep results from outside of Number Theory. There's no way Fermat had these ideas and, since no simpler (“elementary”) proof is believed possible, we think Fermat's claimed proof had an error; maybe he noticed the flaw himself, and that's why he used the cases \( n=3,4 \) and not the general statement in his letters. So while I believe he definitely didn't have a proof of Fermat's last theorem, are there other theorems on this list? The next one is partially on it.

**Pell's Equation.** Let \( D \) be a nonsquare integer, i.e., \( D \neq n^2 \). There is an algorithm which always finds (all?) infinitely many integer solutions to \[ x^2 - D y^2 = 1 \]

**Example 6.** Let \( D = 2 \), which is not a square. \[ 1 = 3^2 - 2 (2^2) = 17^2 - 2 (12^2) = \ldots = x_n^2 - 2 y_n^2 \] Let \( x_1 = 3 \) and \( y_1 = 2 \). In general, if you have a solution \( (x_{n}, y_{n}) \), then you can construct another one \( (x_{n+1}, y_{n+1}) \) with this recursive formula: \[ \begin{aligned} x_{n+1} &= 3x_{n} + 4y_{n} \\ y_{n+1} &= 2x_{n} + 3y_{n}\end{aligned} \] Can you find one solution for \( D=3? ~ D= 61?~ D = 109? \)
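If you want to play with the example, here's a naive Python sketch of my own. Brute force stands in for a real algorithm (like Bhaskara's or the continued-fraction method discussed below) and is hopeless for the hard cases Fermat posed, but it handles small \( D \), and the recursion from the example is checked against the defining equation:

```python
from math import isqrt

def pell_fundamental(D, limit=10**7):
    """Smallest positive solution of x^2 - D*y^2 = 1, by brute force.
    Fine for small D; useless for hard cases like D = 109."""
    for y in range(1, limit):
        x2 = D * y * y + 1
        x = isqrt(x2)
        if x * x == x2:
            return x, y
    return None

print(pell_fundamental(2))  # (3, 2)
print(pell_fundamental(3))  # (2, 1)

# From one solution, the recursion in the example generates infinitely
# many more; each step is checked against x^2 - 2y^2 = 1.
x, y = 3, 2
for _ in range(3):
    x, y = 3 * x + 4 * y, 2 * x + 3 * y
    assert x * x - 2 * y * y == 1
print((x, y))  # (577, 408)
```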

Fermat was not, by any means, the first to pose this problem. In fact, unbeknownst to Fermat and his contemporaries, a 12th-century Indian mathematician called Bhaskara II had solved it. Bhaskara had given an algorithm that produced a solution for any given nonsquare \( D \). It was already known that a single solution leads to infinitely many more via recursive formulas, thanks to Brahmagupta's identity — Brahmagupta was a 7th-century Indian mathematician. However, Bhaskara didn't prove that his algorithm always worked. So there was an algorithm which had worked for every input it had been given, but no one had proven that it would never fail.

Centuries later, Fermat was challenging other mathematicians to find such an algorithm, implying he had one himself. In case they failed to find it, he also asked for solutions when \( D=61, 109 \), which would appear to be an easier problem. But there's a catch: the cases \( D=61, 109 \) have very large solutions, solutions that would be impossible to find by trial and error, solutions that essentially require an algorithm to generate them. The fact that these were the examples he gave in his letters is strong evidence that Fermat indeed had an algorithm; he must have used it to pick the hardest cases to suggest. When you look at the table of initial solutions for \( D=2, \ldots, 128 \) on Wikipedia, you find that \( D=61, 109 \) have the two largest initial solutions!

Euler published an algorithm in E029 (1733) and E323 (1759) that was based on continued fractions. Although the presentation is different, this appears to be essentially the same algorithm as Bhaskara's. And just like Bhaskara, Euler didn't prove that the algorithm always worked. It was Lagrange who eventually proved in 1768 that this algorithm always produced a solution.

Let's return to the question I asked just before bringing up Pell's equation. Do I believe that he had a solution to this problem? It's hard to answer. For starters, he must have had an algorithm to help him pick those two hard cases of Pell's equation. And if I were to guess, I'd say he had the same algorithm as Bhaskara and Euler. But, this is where things get interesting, Fermat said in Letter 101 that he could prove the equation had infinitely many solutions using infinite descent. This could be interpreted multiple ways:

- He used infinite descent to prove that Bhaskara's algorithm worked. I just can't fathom how infinite descent could be used to prove this. And since I haven't even read Lagrange's proof, I can't make a good judgement on what can and cannot prove that the algorithm works.
- He used infinite descent to prove that a different algorithm worked. My first concern still applies. It's easy to see how to use infinite descent to show something doesn't exist; it's trickier but still possible to show something does exist; but where do you even begin to show an algorithm works?
- He used infinite descent to prove there's always a solution. This is a little more manageable. It's an affirmative statement, so it won't be straightforward to prove, but it's at least possible. My guess is you start by assuming \( x^2 - Dy^2 = 1 \) has no integer solution and then create a smaller \( d \) so that \( x^2 - dy^2=1 \) has no integer solutions either, etc. Infinite descent then says this is impossible, so \( x^2 - Dy^2 = 1 \) has a solution for all nonsquare \( D \). The issue here is that proofs by contradiction are nonconstructive. Such a proof can tell you a solution exists but doesn't tell you how to find it. Yet Fermat insisted that others find an algorithm that produces a solution.

Searching around the web, I haven't been able to find anyone using infinite descent on Pell's equation. But in Fermat's defence, when he talked about Pell's equation in the letter, he said the problem was as difficult for him to prove as Lagrange's four square theorem. Of both theorems, he said:

**My translation from French.** There are some [questions] that demand new principles before using infinite descent, and finding these principles is so difficult that it can't be done without extreme pain.

So on one hand, 300 years have passed and still no one has a proof of the theorem using infinite descent; on the other, Fermat had an algorithm and he claims he found it using infinite descent. Who am I to say he didn't? It is possible we still don't appreciate the full potential of this technique like Fermat did.

So how versatile was Fermat's method of infinite descent? Well, Fermat claimed to have proven a lot of theorems, and he attributed the proofs of many of them to his method of infinite descent. In August 1659, Fermat wrote Letter 101 to Carcavi — he wanted to share the novel technique that he had found indispensable. After giving two outlines of how he used infinite descent to prove a negative statement (the right triangle theorem) and then a trickier affirmative statement (the Christmas theorem), he listed several difficult theorems that he was able to prove using infinite descent:

- Lagrange's four square theorem, the case \( k=4 \) of his polygonal number theorem
- Pell's equation (with algorithm)

Fermat highlights these two for being extremely difficult to prove. *Is it a coincidence that it's Lagrange who proved both of these theorems?*

- The case \( n=3 \) of his Last theorem
- \( (5,3) \) is the only integer solution to \( x^2 + 2 = y^3 \).
- \( (2, 2) \) and \( (11, 5) \) are the only integer solutions to \( x^2 + 4 = y^3 \).
- \( (1, 1, 1) \) and \( (7, 2, 5) \) are the only integer solutions to the simultaneous equations: \[ \begin{aligned}x+1 &= 2y^2 \\ x^2 + 1 &= 2z^2 \end{aligned} \]
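The last three claims are easy to test numerically. A finite search can't prove uniqueness, of course, but here's a quick Python sketch of my own (no infinite descent involved) confirming that the listed solutions are the only ones in a small search window:

```python
from math import isqrt

# x^2 + 2 = y^3 and x^2 + 4 = y^3, searched over a small window
sols_2 = [(x, y) for y in range(2, 100) for x in range(1, 1000)
          if x * x + 2 == y ** 3]
sols_4 = [(x, y) for y in range(2, 100) for x in range(1, 1000)
          if x * x + 4 == y ** 3]

# The simultaneous system x + 1 = 2y^2, x^2 + 1 = 2z^2:
# for each x, compute the candidate y and z directly and verify.
sols_sim = []
for x in range(1, 10000):
    y, z = isqrt((x + 1) // 2), isqrt((x * x + 1) // 2)
    if 2 * y * y == x + 1 and 2 * z * z == x * x + 1:
        sols_sim.append((x, y, z))

print(sols_2)    # [(5, 3)]
print(sols_4)    # [(2, 2), (11, 5)]
print(sols_sim)  # [(1, 1, 1), (7, 2, 5)]
```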

I think it would be an interesting project to track down how these were eventually proven. If the first published proof didn't employ infinite descent, is there a known infinite descent proof now?

- Looking at the proof of Lagrange's theorem on Wikipedia, it uses a proof by contradiction with the well-ordering property of the natural numbers. This is just a modern reformulation of infinite descent.
- I've already voiced my concerns about using infinite descent constructively. However, when I was reading about Euler's proof of the Christmas theorem, I came across a very interesting thing. Backtracking a little bit: Fermat asked Pascal in Letter 74 (25 Sep. 1654) whether he could show that primes of the form \( 4n+1 \) are sums of two squares and also give an algorithm that produces those two squares. For example, if you know that \( 53 = 4\cdot 13 + 1 \) is prime, can you split it into two squares, \( 53 = 2^2 + 7^2 \), without trial and error? We know Fermat proved the Christmas theorem using infinite descent, but we have the same issue: how did he use infinite descent to get an algorithm? Edwards (pg. 55) gives an excellent explanation of how Euler's proof by infinite descent can be modified to be constructive. The relevant pages are available on Google Books, so I encourage you to check it out if you're interested. So the problem doesn't seem as impossible anymore. Perhaps an infinite descent proof that Pell's equation has a solution can likewise be modified into an algorithm that produces that solution.
- Euler's proof of this used infinite descent. There's not much else to say that I haven't already.
- I don't know much about this problem and didn't really research it. When I googled “infinite descent and Pell's equation,” I found this related question on StackExchange (SE). According to it, the theorem is proven using unique factorization in \( {\mathbb Z}[\sqrt{-2}] \); I don't know the proof, but I'll take their word for it. The SE question asks specifically whether there's a known infinite descent proof, just like I am asking here. There doesn't seem to be one, but one user gave several ideas on how one could try to come up with one.
- Don't know anything about this either. According to that same SE user, there is no known infinite descent proof for this either.
- Don't know anything about this. Honestly, I only think it's true cause Fermat said so and, by now, I trust his judgement...

I think that's a good place to end this exposition.

It's really amazing how my attitude changed over the course of researching and writing this post. I've been writing it for a week and I think you can sense the change in the writing itself. At the beginning, I was very skeptical about why Fermat has so many things named after him in Number Theory when we only have one proof from him. I appreciate the contribution of someone asking the right questions and making the right conjectures, but is that enough to earn the recognition Fermat gets in Number Theory? Yet by the end of the post, I am writing things like, “I only think it's true cause Fermat said so.”

**Bibliography**

*Oeuvres de Fermat.* Web — This is a collection of 4 volumes. Vol. 1 is where you'll find his margin notes in their original Latin. Vol. 2 contains his letters, in a mix of French and Latin. Vol. 3 has the French translations of all the Latin documents in the first two volumes. I didn't really use Vol. 4, so I don't know what's in it. The webpage that I've linked to has all these volumes as PDFs. The page also has English translations of a few documents. If you'd rather read the books online, they're also hosted on archive.org.

Heath, Thomas Little. *Diophantus of Alexandria: A Study in the History of Greek Algebra.* Cambridge University Press, 1910. Web — Fermat had a copy of the Latin version of this book in which he wrote all his margin notes. This book is the English translation and it talks about Fermat's work in the introduction, notes, and supplement.

*The Euler Archive.* Web — This is the go-to website for anything Euler-related. It is an archive of everything Euler published. If English translations exist, they'll be here. I reference E387 at some point in the post; E387 is actually his book *Elements of Algebra.* There's an English translation of it available as a free eBook on Google Books.

Weil, André. “Review: MS Mahoney, The Mathematical Career of Pierre de Fermat.” *Bull. Amer. Math. Soc.* 79.6 (1973): 1138-1149. Web

Burton, David. *A History of Mathematics: An Introduction.* McGraw-Hill Companies, 2011.

Edwards, Harold M. *Fermat's Last Theorem: A Genetic Introduction to Algebraic Number Theory.* Vol. 50. Springer-Verlag, 1977.

**Definition 1.** Given \( \varphi: {\mathbb R}^n \to {\mathbb R}^n \), the **Jacobian matrix** \( J(\varphi) \) is given by \[ J(\varphi)_{i,j} = \partial_j \varphi_i \] Notation: \( \varphi_i \) means the projection of \( \varphi \) to the \( i \)-th coordinate and \( \partial_j \) means the partial derivative with respect to the \( j \)-th variable.
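As a sanity check on the indexing convention (row \( i \) is the component, column \( j \) is the variable), here's a small NumPy sketch of my own that approximates \( J(\varphi) \) by central differences; the map \( \varphi(x,y) = (x^2-y^2,\, 2xy) \) is just an example I picked, with analytic Jacobian \( \begin{bmatrix} 2x & -2y \\ 2y & 2x \end{bmatrix} \):

```python
import numpy as np

def jacobian(phi, p, h=1e-6):
    """Central-difference approximation of J(phi) at point p:
    J[i, j] = d(phi_i)/d(x_j)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    J = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (phi(p + e) - phi(p - e)) / (2 * h)
    return J

# Example map: phi(x, y) = (x^2 - y^2, 2xy)
phi = lambda v: np.array([v[0] ** 2 - v[1] ** 2, 2 * v[0] * v[1]])
print(jacobian(phi, [1.0, 2.0]))  # approximately [[2, -4], [4, 2]]
```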

**Theorem 1 (Change of Variables).** Suppose \( R, \tilde R \subset {\mathbb R}^n \) are “nice” regions, \( \varphi : R \to \tilde R \) is a homeomorphism with continuous second partial derivatives, and \( J = J(\varphi) \) is nonsingular on \( R \). Then for all continuous functions \( f: {\mathbb R}^n \to {\mathbb R} \), we have \[ \int_{\tilde R} f\, dV^n = \int_R (f\circ\varphi) |\det J| \, dV^n \] Note: For simplicity's sake, I'll use a single integral sign \( \int \) even when dealing with multiple integrals. The superscript on the differentials will denote the dimension of the integral: \( dV^3 \) for regular triple integrals and \( dS^2 \) for surface integrals.

To prove this, I will use the divergence theorem, which relates the integral over regions like \( R, \tilde R \) to the net outward flux of some vector field across their respective boundaries \( \Sigma, \tilde \Sigma \), which are homeomorphic to \( S^{n-1} \). To be clear, the theorem as stated and proven here only applies to very special cases: “nice” regions for which the divergence theorem applies, and a homeomorphism with continuous second partial derivatives (or at least, first partials exist and mixed partials are equal). In all the references to the theorem that I have seen, the regions are simply open in \( {\mathbb R}^n \) and the homeomorphism has continuous first-order partial derivatives. See [Shwartz], [Lax1], and [Lax2] in the references for more complete proofs with very different approaches.

**Divergence Theorem.** Given a simple closed \( (n-1) \)-manifold \( \Sigma \subset {\mathbb R}^{n} \) which encloses a region \( R \) and an \( n \)-dimensional vector field \( \vec F \) defined on \( R \), then \[ \oint_\Sigma \vec F \bullet \vec n \, dS^{n-1} = \int_R \left( \nabla \bullet \vec F\right) \, dV^{n} \]

When \( n=2 \), this is the flux form of Green's Theorem; \( n=3 \) is the divergence theorem taught at the end of Calc 3 (see this sketch proof). It isn't too hard to see how to generalize that proof to higher dimensions. The real issue is that all proofs of Green's Theorem or the Divergence Theorem rely on the fact that “nice” regions (no formal definition that I can find) can be decomposed into finitely many *simple regions*.

*Proof of Thm 1, \( n=2 \):* Let \( r: I \to {\mathbb R}^2 \) be a parametrization of \( \Sigma \), the boundary of \( R \). By our hypothesis, \( \tilde r = \varphi \circ r \) is a parametrization of \( \tilde \Sigma \). Reversing the orientation of \( r \) if necessary, suppose that \( \tilde r \) is oriented counter-clockwise. Also define \( f^1 \) as the partial antiderivative of \( f \) w.r.t. the first variable; \( f^2 \) is defined similarly. Now apply the divergence theorem to \( \tilde R \): \[ \begin{aligned} \int_{\tilde R} f \, dV^2 &= \frac{1}{2} \oint_{\tilde \Sigma} \left< f^1, f^2 \right> \bullet \vec n \, dS^1 \\ &= \frac{1}{2} \int_I \left< f^1, f^2 \right>_{\tilde r} \bullet \left< \tilde r_2', -\tilde r_1'\right>_t \, dt \\ & \qquad \text{apply chain-rule and use matrix notation} \\ &= \frac{1}{2} \int_I \begin{bmatrix}f^1 & f^2 \end{bmatrix}_{\varphi \circ r} \begin{bmatrix} \partial_1 \varphi_2 \cdot r_1' + \partial_2 \varphi_2 \cdot r_2' \\ -\partial_1 \varphi_1 \cdot r_1' - \partial_2 \varphi_1 \cdot r_2' \end{bmatrix}_t \, dt; \\ & \qquad \partial_i \varphi_j \text{ is evaluated at } r(t) \text{ in above matrix} \\ &= \frac{1}{2} \int_I \begin{bmatrix}f^1\circ \varphi & f^2\circ \varphi \end{bmatrix}_{r} \begin{bmatrix} \partial_2 \varphi_2 & -\partial_1 \varphi_2 \\ -\partial_2 \varphi_1 & \partial_1 \varphi_1\end{bmatrix}_r \begin{bmatrix} r_2' \\ - r_1' \end{bmatrix}_t \, dt \\ &= \pm \frac{1}{2} \int_\Sigma \left(\begin{bmatrix}f^1\circ \varphi & f^2\circ \varphi \end{bmatrix} \begin{bmatrix} \partial_2 \varphi_2 & -\partial_1 \varphi_2 \\ -\partial_2 \varphi_1 & \partial_1 \varphi_1\end{bmatrix}\right) \bullet \vec n \, dS^1 \\ &= \int_R (f\circ \varphi) | \det J | \, dV^2 \end{aligned} \]\(~\Box\)

The plus/minus sign accounts for the fact that \( \left< r_2', - r_1' \right> \) might be an inward normal vector for \( \Sigma \). We have a minus sign iff the change of orientation of \( r \) was necessary at the beginning of the proof, and the change of orientation is necessary iff \( \det J < 0 \) on \( R \). It is a tedious exercise to verify that the divergence of the last vector field is in fact \( 2(f \circ \varphi) \det J \); this is where the hypothesis on the second partial derivatives is needed. Since the sign on the outside matches the sign of the determinant, we simply get its absolute value. I am also using the fact that \( J \) is nonsingular; by continuity, this means \( \det J \) never changes sign on \( R \). Finally, notice that the \( 2 \times 2 \) matrix in the last flux integral is the cofactor matrix of the Jacobian!

**Definition 2.** For any \( n \times n \) matrix \( M \), the **cofactor matrix** \( C(M) \) is the matrix whose entries are the signed minors: \[ C(M)_{i,j} = (-1)^{i+j} M_{ij} \] Recall: the minor \( M_{ij} \) is the determinant of the matrix that you get by deleting the \( i \)-th row and \( j \)-th column.
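Here's a short NumPy sketch of my own computing \( C(M) \) straight from the definition; the sanity check uses the classical identity \( M \, C(M)^\top = \det(M)\, I \) (the transpose of the cofactor matrix is the adjugate), with a \( 3 \times 3 \) matrix I made up:

```python
import numpy as np

def cofactor(M):
    """Cofactor matrix: C[i, j] = (-1)^(i+j) * minor_ij(M)."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    C = np.empty_like(M)
    for i in range(n):
        for j in range(n):
            # delete row i and column j, then take the determinant
            minor = np.delete(np.delete(M, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

M = np.array([[1., 2., 0.], [3., 1., 4.], [0., 5., 1.]])
C = cofactor(M)
print(np.allclose(M @ C.T, np.linalg.det(M) * np.eye(3)))  # True
```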

*Proof of Thm 1, \( n=3 \):* Let \( C = C(J) \) be the cofactor matrix of the Jacobian, let \( r, \tilde r:U \to {\mathbb R}^3 \) be parametrizations of \( \Sigma, \tilde \Sigma \) as before, and assume that \( \partial_1 \tilde r \times \partial_2 \tilde r \) is the outward normal vector on \( \tilde \Sigma \). Again define \( f^1, f^2, f^3 \) as partial antiderivatives w.r.t. the corresponding variables. Then by the divergence theorem on \( \tilde R \): \[ \begin{aligned} \int_{\tilde R} f \, dV^3 &= \frac{1}{3} \oint_{\tilde \Sigma} \left< f^1, f^2, f^3 \right> \bullet \vec n \, dS^2 \\ &= \frac{1}{3} \int_U \left< f^1, f^2, f^3 \right>_{\tilde r} \bullet \left( \partial_1 \tilde r \times \partial_2 \tilde r \right)_{(t_1, t_2)} \, dt_1 \, dt_2 \\ &= \frac{1}{3} \int_U \begin{bmatrix}f^1 & f^2 & f^3 \end{bmatrix}_{\varphi \circ r} \left( \partial_1 \tilde r \times \partial_2 \tilde r \right)^\top _{(t_1, t_2)} \, dt_1 \, dt_2 \\ &= \frac{1}{3} \int_U \begin{bmatrix}f^1 & f^2 & f^3 \end{bmatrix}_{\varphi \circ r} \cdot \left. C \right|_r \cdot \left( \partial_1 r \times \partial_2 r \right)^\top _{(t_1, t_2)} \, dt_1 \, dt _2 \\ &= \pm \frac{1}{3} \int_\Sigma \left(\begin{bmatrix}f^1\circ \varphi & f^2\circ \varphi & f^3 \circ \varphi \end{bmatrix}C\right) \bullet \vec n \, dS^2 \\ &= \int_R (f\circ \varphi) | \det J | dV^3 \end{aligned} \] Just as before: the divergence of the last vector field \( = 3 (f\circ \varphi) \det J \), and the plus/minus sign will match the determinant to give an absolute value.\(~\Box\)

For an ordinary surface, the normal vector is the cross product of the two partial derivatives of its parametrization. That is, given a parametrization \( r: U \to {\mathbb R}^{3} \) for a surface \( \Sigma \), writing \( \partial_i r = \left< \partial_i r_1, \, \partial_i r_2, \, \partial_i r_3\right> \), we use \( \partial_1 r \times \partial_2 r \) or its negative as the outward normal vector. So for dimensions \( n \ge 3 \), I'll need a generalization of the cross product which takes \( n-1 \) vectors and produces a vector orthogonal to all of them.

**Definition 3.** If \( n \ge 3 \), then the **wedge product** is a multilinear \( (n-1) \)-ary operation \( \wedge \) on \( {\mathbb R}^{n} \) such that

- \( \wedge(e_{i+1}, e_{i+2}, \ldots, e_{i+n-1}) = e_{i} \)
- \( \wedge(v_1, v_2, \ldots, v_{n-1}) = \vec 0 \) if \( v_i = v_j \) for some \( i \neq j \); we say \( \wedge \) is **alternating**

Note: \( e_1, \ldots, e_{n} \) are the standard basis elements for \( {\mathbb R}^{n} \) and their subscripts are written \( \bmod n \). This is a special case of the **exterior product/algebra** that I mentioned at the end of the last post. It captures the essence of the cross product, dot product, and the determinant. In particular, it allows us to form orthogonal vectors: \[ v_i \bullet \wedge(v_1, \ldots, v_{n-1}) = 0 \quad \forall i \] The operation is essentially the \( n \times n \) determinant of a matrix whose first row is filled with the standard basis for \( {\mathbb R}^n \) and whose subsequent rows are the vectors \( v_1, \ldots, v_{n-1} \): just like the cross product when \( n=3 \).
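Since the wedge of \( n-1 \) vectors is described as a determinant with basis vectors in the first row, it's easy to compute numerically: component \( i \) is \( (-1)^i \) times the minor obtained by deleting column \( i \). Here's a NumPy sketch of my own (the overall sign convention may differ from the cyclic one in the definition, which doesn't matter here since the proof adjusts orientation with a \( \pm \) anyway):

```python
import numpy as np

def wedge(*vectors):
    """Wedge of n-1 vectors in R^n: Laplace expansion along the
    symbolic first row of basis vectors."""
    A = np.array(vectors, dtype=float)  # shape (n-1, n)
    n = A.shape[1]
    return np.array([(-1) ** i * np.linalg.det(np.delete(A, i, axis=1))
                     for i in range(n)])

# n = 3: the wedge reduces to the usual cross product
u, v = np.array([1., 0., 2.]), np.array([0., 1., 3.])
print(np.allclose(wedge(u, v), np.cross(u, v)))  # True

# n = 4: the result is orthogonal to all three inputs
w = wedge([1., 0., 0., 1.], [0., 1., 1., 0.], [2., 0., 1., 0.])
print(np.allclose(w @ np.array([1., 0., 0., 1.]), 0))  # True
```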

Disclaimer: I don't know the official definition of the flux integral across a higher-dimensional manifold. However, my intuition tells me it is something like this:

**Definition 4.** Suppose \( \Sigma \subset {\mathbb R}^n \) is an \( (n-1) \)-manifold with a parametrization \( r:U \to {\mathbb R}^n \), and let \( \vec F \) be an \( n \)-dimensional vector field; then the net outward flux across \( \Sigma \) is given by \[ \oint_\Sigma \vec F \bullet \vec n \, dS^{n-1} = \int_U \vec F \bullet \wedge\left(\partial_1 r, \ldots, \partial_{n-1} r\right) \, dt_1 \cdots dt_{n-1} \] where the order of the arguments in \( \wedge( \ldots ) \) can be changed to ensure the product is an outward normal vector.

In this general case, the argument will rely on the wedge relation \[ \left(\wedge\left(\partial_1 \tilde r, \ldots, \partial_{n-1}\tilde r \right)\right)^\top = C \left(\wedge\left(\partial_1 r, \ldots, \partial_{n-1} r\right)\right)^\top \] and the divergence relation: \[ \nabla \bullet \left(\begin{bmatrix}f^1\circ \varphi & \cdots & f^n \circ \varphi \end{bmatrix}C\right) = n (f \circ \varphi) \det J \] Since the orientation of the normal vector on \( \Sigma \) is chosen such that the corresponding normal vector on \( \tilde \Sigma \) is outward, we have the same interplay between a plus/minus sign on the integral and the sign of the determinant to get the absolute value. *The wedge and divergence relations are the crux of my argument*; interestingly, the divergence one relies on a relation similar to one highlighted in the proof by Lax (see [Lax1]). I believe the relations hold and I've sort of sketched proofs, but they're too tedious and I'm not really worried about their validity.

I'm only worried about what it takes to be a “nice” region. On the bright side, this is a different (hopefully simpler?) problem from the one posed by the naive approach in the last post. The issue in the naive approach was that \( \varphi \) was potentially warping our (rectangular) partition of \( R \) so badly that the approximate parallelotopes in \( R' \) were not good enough to approximate the integral. This second approach is no longer concerned with the local properties of the homeomorphism \( \varphi \) but with the global properties of the regions themselves. I think this is a good place to end my foray into the change of variables formula. I managed to appreciate the properties of the determinant and exterior algebra that I did not fully understand just a month ago. I also got a basic understanding of what differential forms are and how they trivialize this whole discussion.

**References**

[Lax1] Lax, Peter D. “Change of Variables in Multiple Integrals.” *Amer. Math. Monthly* 106.6 (1999): 497-501. Web.

[Lax2] Lax, Peter D. “Change of Variables in Multiple Integrals II.” *Amer. Math. Monthly* 108.2 (2001): 115-119. Web.

[Sch] Schwartz, J. “The Formula for Change in Variables in a Multiple Integral.” *Amer. Math. Monthly* 61.2 (1954): 81-85. Web.

Suppose you have a region \( R \) in the \( xy \)-plane and a map \( \varphi \) from the \( xy \)-plane to the \( uv \)-plane which is a homeomorphism when restricted to \( R \); let \( R' =\varphi(R) \). (See image.)

The map \( \varphi \) can be represented as a pair of real-valued functions:

\[ \varphi : \left\{ \begin{aligned} u &= u(x, y) \\ v &= v(x,y) \end{aligned}\right. \]

Let's focus on one rectangle in the partition of \( R \) with lower-left corner \( (a,b) \) and length (height resp.) of \( \Delta x \) (\( \Delta y \) resp.). If the rectangle is small enough, then its image under \( \varphi \) is approximately a parallelogram with vertices at \[ \begin{aligned} \varphi(a,b) &=\left( u(a,b), v(a,b) \right) \\ \varphi(a+\Delta x, b) &= \left( u(a+\Delta x, b), v(a+\Delta x, b) \right) \\ \varphi(a +\Delta x, b + \Delta y) &= \left( u(a+\Delta x, b+\Delta y), v(a+\Delta x, b+\Delta y) \right) \\ \varphi(a, b + \Delta y) &= \left( u(a, b+\Delta y), v(a, b+\Delta y)\right) \end{aligned} \] What is the area of this parallelogram? Using the linear approximations of \( u \) and \( v \), we have: \[ \begin{aligned} u(a+\Delta x, b) - u(a,b) &\approx \left.\frac{\partial u}{\partial x}\right|_{(a,b)}\Delta x \\ v(a+\Delta x, b) - v(a,b) &\approx \left.\frac{\partial v}{\partial x}\right|_{(a,b)}\Delta x \\ u(a, b+\Delta y) - u(a,b) &\approx \left.\frac{\partial u}{\partial y}\right|_{(a,b)}\Delta y \\ v(a, b+\Delta y) - v(a,b) &\approx \left.\frac{\partial v}{\partial y}\right|_{(a,b)}\Delta y \end{aligned} \]

So the “parallelogram” sides can be described by the vectors \[ \left< \frac{\partial u}{\partial x}, \frac{\partial v}{\partial x}\right> \Delta x, \qquad \left< \frac{\partial u}{\partial y}, \frac{\partial v}{\partial y}\right> \Delta y \] and its area is given by the determinant defined by the two vectors: \[ \Delta A_{uv} \approx \pm \begin{vmatrix} \frac{\partial u}{\partial x} & \frac{\partial v}{\partial x} \\ \frac{\partial u}{\partial y} & \frac{\partial v}{\partial y} \end{vmatrix}\Delta x \Delta y \] Thus \( dA_{uv} \approx |J|\, dA_{xy}. \quad \Box \)
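We can test this approximation numerically. A short Python sketch (the particular map \( \varphi(x,y) = (x + y^2,\, y + x^2) \), the base point, and the rectangle size are my own choices): push the four corners of a tiny rectangle through \( \varphi \), measure the image quadrilateral with the shoelace formula, and compare against \( |J|\,\Delta x \Delta y \).

```python
def phi(x, y):
    # sample nonlinear map: u = x + y^2, v = y + x^2
    return (x + y**2, y + x**2)

def shoelace(pts):
    # signed area of a polygon given its vertices in order
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2

a, b, dx, dy = 0.3, 0.2, 1e-3, 1e-3
corners = [phi(a, b), phi(a + dx, b), phi(a + dx, b + dy), phi(a, b + dy)]
image_area = abs(shoelace(corners))

# Jacobian of phi at (a, b): du/dx = 1, du/dy = 2y, dv/dx = 2x, dv/dy = 1
jacobian = abs(1 * 1 - (2 * b) * (2 * a))
predicted = jacobian * dx * dy
```

Shrinking `dx, dy` further makes the relative error smaller, which is exactly the "small enough rectangle" hypothesis.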

The first crucial idea is that the parallelogram is a nice approximation of the image of small enough rectangles. While we can always approximate the image by a parallelogram, the hope is that the approximation has a negligible error when we take the Riemann sum to compute \( \displaystyle \iint_{R'} f(u,v)\, dA_{uv}. \) This is where some hypotheses on \( \varphi \), aside from being a homeomorphism on \( R \), may be needed. This is an analysis (or differential topology?) question that I'm just not ready to tackle yet; hopefully, the Divergence Theorem will enlighten me.

The second key idea is that the determinant computes the area of my parallelogram. To be able to change variables in \( 3D \), I'd need the determinant to compute the volume of a parallelepiped, and similarly for higher dimensions. So to this effect, I need to convince myself that the \( (n \times n) \)-determinant computes the \( n \)-volume of an \( n \)-parallelotope.

Let \( V = {\mathbb R}^n \), viewed as a vector space over \( {\mathbb R} \). Essentially, I'm going to define a function \( d: \mathrm{End}_{\mathbb R}(V) \to {\mathbb R} \) which takes a linear transformation \( \phi : V \to V \) and assigns it a value \( d(\phi) \in {\mathbb R} \). \( d(\phi) \) will measure how much \( \phi \) scales volumes of parallelotopes, and it turns out that \( d(\phi) = |\det (\phi)|. \) I will use the rational canonical form (RCF) to establish the connection. We proved RCF in my Abstract Algebra class last year; it is a consequence of the structure theorem for finitely generated modules over a P.I.D.

For the rest of this post, I'm going to assume without proof that shearing and reflecting (swapping basis vectors) preserve volume. One may take this as part of the definition of volume in a vector space. See the images below for the motivation behind this assumption:

Let \( \{ \hat e_1, \ldots, \hat e_n \} \subset V \) be a basis and for nonzero elements \( s_1, \ldots, s_n \in {\mathbb R} \), define \[ vol(s_1 \hat e_1, \ldots, s_n \hat e_n) := |s_1 \cdots s_n| \blacksquare \] I am defining relative volume with respect to (w.r.t.) the basis, and \( (s\,\blacksquare) \) means: \( s \) times the volume of the parallelotope whose sides are the basis vectors \( \{ \hat e_1, \ldots, \hat e_n\}. \) Evidently, scaling the sides of an \( n \)-cube should scale its \( n \)-volume by the product of the scalars.

Given \( n \) linearly independent vectors \( \vec v_i \in V \), let \( S: V \to V \) be a composition of shear maps such that \( S \vec v_i \) is parallel to \( \hat e_i \) for all \( i \), and define \[ vol( \vec v_1, \ldots, \vec v_n ) := vol( S\vec v_1, \ldots, S \vec v_n) \] Since shearing does not affect volume, \( vol \) is well-defined. Extend \( vol \) to all \( n \)-tuples of vectors by setting \( vol( \vec v_1, \ldots, \vec v_n) = 0 \) if the vectors are linearly dependent.

For any linear transformation \( \phi: V \to V, \) define \( d(\phi) = vol(\phi \hat e_1, \ldots, \phi \hat e_n)/\blacksquare \). Note that as \( vol \) is a scalar multiple of \( \blacksquare \), \( d(\phi) \) is a real number.

Theorem 1. \( vol(\phi \vec v_1, \ldots, \phi \vec v_n) = d(\phi) \cdot vol(\vec v_1, \ldots, \vec v_n) \) for all \( n \)-tuples of vectors in \( V. \)

Corollary 2. \( d(\phi_1 \, \phi_2) = d(\phi_1) \, d(\phi_2) \), i.e., \( d \) is multiplicative.

Corollary 3. \( d(\phi) \) is basis-independent. Equivalently, \( d \) is preserved under conjugation.

*Proof of theorem:* *Case 1:* Since shear maps preserve volume, the statement holds for all compositions of shear maps.

*Case 2:* Suppose \( \phi \) simply scales each basis vector, i.e., \( \phi(\hat e_i) = s_i \hat e_i \); then there exists a pair of compositions of shears \( S_1, S_2 \) such that \( S_1 \, \phi = \phi \, S_2 \), and both \( (S_1 \, \phi) \vec v_i \) and \( S_2 \vec v_i \) are parallel to \( \hat e_i \), for all \( i. \) That such a pair exists is an argument that I'll omit. Then by definition \[ \begin{aligned}vol(\phi \vec v_1, \ldots, \phi \vec v_n) &= vol((S_1 \, \phi)\vec v_1, \ldots, (S_1 \, \phi)\vec v_n) \\ &= vol((\phi \, S_2) \vec v_1, \ldots, (\phi \, S_2)\vec v_n ) \\ &= vol(s_1 \, S_2 \vec v_1, \ldots, s_n \, S_2 \vec v_n) & \text{ since } S_2\vec v_i \text{ is parallel to } \hat e_i\\ &= (s_1 \cdots s_n) \, vol(S_2 \vec v_1, \ldots, S_2 \vec v_n ) \\ &= d(\phi) \cdot vol(\vec v_1, \ldots , \vec v_n) \end{aligned} \]

*Case 3:* In general, let \( S_1, S_2 \) be shears such that \( S_1 \vec v_i \) and \( (S_2 \, \phi) \vec v_i \) are both parallel to \( \hat e_i \) for all \( i. \) Then \( \phi' = S_2 \, \phi \, S_1^{-1} \) scales each basis vector as in case 2 above. Therefore, \( \phi = S_2^{-1} \, \phi' \, S_1 \) and

\[ \begin{aligned}vol(\phi \vec v_1, \ldots, \phi \vec v_n) &= vol( (S_2^{-1} \, \phi' \, S_1 ) \vec v_1, \ldots, ( S_2^{-1} \, \phi' \, S_1)\vec v_n) \\ &= d(S_2^{-1}) d(\phi') d(S_1) \cdot vol( \vec v_1, \ldots, \vec v_n ) \\ &=d(\phi) \cdot vol(\vec v_1, \ldots, \vec v_n) \end{aligned} \] with each step justified since the theorem already holds for \( \phi', S_1, \) and \( S_2^{-1}. \)\(~\Box\)
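With the theorem in hand, Corollary 2 (and hence Corollary 3, since \( d(P \phi P^{-1}) = d(P)\,d(\phi)\,d(P^{-1}) = d(\phi) \)) can be sanity-checked numerically using \( d = |\det| \). A quick Python sketch (the Leibniz-formula determinant helper and the random matrices are my own choices):

```python
import random
from itertools import permutations

def det(M):
    # Leibniz formula: sum over permutations of sign * product of entries
    n = len(M)
    total = 0.0
    for perm in permutations(range(n)):
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if perm[i] > perm[j])
        prod = 1.0
        for i in range(n):
            prod *= M[i][perm[i]]
        total += (-1) ** inversions * prod
    return total

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

random.seed(0)
A = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
B = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

# multiplicativity: det(AB) = det(A) * det(B)
assert abs(det(matmul(A, B)) - det(A) * det(B)) < 1e-9

# a shear has determinant 1, matching the assumption that shearing
# preserves volume
shear = [[1.0, 0.7, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assert abs(det(shear) - 1.0) < 1e-12
```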

The theorem implies that even though \( vol \) is volume relative to the chosen basis for \( V \), \( d(\phi) \in {\mathbb R} \), a ratio of relative volumes, will not depend on the basis. Thus \( d(\phi) \) is invariant to a change in basis (Cor. 3). In general, given vectors \( \vec v_1, \ldots, \vec v_n \), let \( \phi \) be the linear transformation that maps \( \hat e_i \mapsto \vec v_i \). As \( d(\phi) \) measures how much **any** parallelotope grows under the transformation, \( vol(\vec v_1, \ldots , \vec v_n) = d(\phi) \blacksquare. \) By the rational canonical form (RCF), there is a basis \( \mathcal B = \{ \hat b_1 , \ldots, \hat b_n\} \) for \( V \) such that \( \phi \) has the matrix form: \[ \phi = \begin{pmatrix}C_1&0&\cdots&0\\0&C_2&\cdots&0\\\vdots&\ddots&\ddots&\vdots\\0&\cdots&0&C_m \end{pmatrix} \] and for each \( 1 \le i \le m \), \( C_i \) is a companion matrix with the form:

\[ \begin{aligned} C_i = \begin{pmatrix}0&0&\cdots&-a_{0,i}\\1&0&\cdots&-a_{1,i}\\\vdots&\ddots&\ddots&\vdots\\0&\cdots&1&-a_{k_i, i} \end{pmatrix} &= S_i \begin{pmatrix}0&0&\cdots&1\\1&0&\cdots& 0 \\ \vdots&\ddots&\ddots&\vdots\\0&\cdots&1& 0 \end{pmatrix} \begin{pmatrix}1&0&\cdots&0\\0&1&\cdots& 0 \\ \vdots&\ddots&\ddots&\vdots\\0&0&\cdots& -a_{0,i} \end{pmatrix} \\ &= S_i \, T_i \, C'_i \end{aligned} \] for some (volume-preserving) shear matrix \( S_i \) and cyclic permutation \( T_i \) of the basis; the latter is a composition of basis swaps. Therefore, the image of the basis \( \{ \hat b_1, \ldots, \hat b_n \} \) under \( \phi \) has the same volume as the parallelotope \( C'_1 \times \cdots \times C'_m \), which has sides \[ \{ \hat b_1, \, \hat b_2, \ldots , \, (-a_{0,1}) \hat b_{k_1+1},\, \hat b_{k_1+2}, \, \ldots, \, (-a_{0,m-1}) \hat b_l, \, \hat b_{l+1},\, \ldots, \, (-a_{0,m}) \hat b_n \} \] The map \( \rho_b \) that simply scales some elements of the basis \( \{ \hat b_1, \ldots, \hat b_n \} \) is conjugate to a map \( \rho_e \) that simply scales some elements of the basis \( \{ \hat e_1, \ldots, \hat e_n\} \). The latter has \( d(\rho_e) = \prod |a_{0,i}|. \) By the corollaries, we get \[ d(\phi) = d(\rho_b) = d(\rho_e) = \prod_{i=1}^m |a_{0,i}| = | \det(\phi) | \] Note that I'm defining \( \det(\phi) \) as the alternating sum of (scaled) minors w.r.t. the basis \( \mathcal B. \)
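The product formula hinges on the fact that a companion matrix has determinant \( \pm a_0 \) (expand along the first row). A quick numeric check in Python (the determinant helper and the example polynomial are my own choices):

```python
from itertools import permutations

def det(M):
    # Leibniz formula: sum over permutations of sign * product of entries
    n = len(M)
    total = 0.0
    for perm in permutations(range(n)):
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if perm[i] > perm[j])
        prod = 1.0
        for i in range(n):
            prod *= M[i][perm[i]]
        total += (-1) ** inversions * prod
    return total

def companion(coeffs):
    # companion matrix of x^k + a_{k-1} x^{k-1} + ... + a_1 x + a_0,
    # with ones on the subdiagonal and -a_i down the last column
    k = len(coeffs)
    C = [[0.0] * k for _ in range(k)]
    for i in range(1, k):
        C[i][i - 1] = 1.0
    for i in range(k):
        C[i][k - 1] = -coeffs[i]
    return C

C = companion([6.0, -5.0, 1.0])       # x^3 + x^2 - 5x + 6, so a_0 = 6
assert abs(abs(det(C)) - 6.0) < 1e-9  # |det C| = |a_0|
```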

In summary, RCF implies that any linear transformation \( \phi \) is a composition of maps \( S \, \psi \, \rho \) where

- \( \psi \) acts on some basis \( \mathcal B \subset V \) with \( m \) orbits.
- for each orbit, \( \rho \) scales exactly one element by \( -a_{0,i} \)
- \( S \) is a shearing relative to the fixed set of \( (\psi \, \rho \, \psi^{-1}). \)

\( \psi \) and \( S \) preserve volume while \( \rho \) scales volumes by \( d(\phi) = |\det(\phi)| \). Alternatively, there is a nice way of defining \( \det(\phi) \) independent of a choice of basis using the exterior product; it is almost identical to my definition of \( d(\phi) \) but more rigorous. The connection between the determinant and volumes is self-evident in that definition.

Finally, in the same sense that the determinant of a linear transformation measures how much volumes are scaled globally, the Jacobian determinant at a point measures how much volumes are scaled locally.

Analogies in debates: If they're easier to understand than the subject in question, then they're probably fallacious.

— Jean Pierre M. (@genepeer) July 9, 2011

And bad thing about fallacious analogies is that once they're noticed (they most likely will be), the debate is derailed to their relevancy.

— Jean Pierre M. (@genepeer) July 9, 2011

The problem with analogies is that eventually they breakdown.

— Jean Pierre M. (@genepeer) January 11, 2012

Analogies should be avoided in arguments at all costs for being useless.

— Jean Pierre M. (@genepeer) December 17, 2012

Look at the dates on those tweets! I'm going to go full-on meta and use analogies to explain why I hate analogies. Hopefully, by the end of this post you will appreciate how not quite so ironic this is. Here we go:

Some people ask me why I love math so much; they wonder if math is just about calculating bigger numbers. I used to think the same way too until I realized math is more about logic than anything else I can think of. Math, to me, is all about making assumptions and finding out what the consequences of those assumptions are. The types of assumptions we make range from straightforward, e.g. \( C \) is a circle with radius \( 1 \), to somewhat abstract, e.g. \( E/F \) is a Galois extension, to very obscure, e.g. \( F(-) \) is a functor. The obvious thread that permeates all areas of math is the use of logic (proofs) to make conclusions (theorems). What I find very interesting, and would like to discuss here, is how mathematicians critique theorems. Suppose I make the following claim and give a proof for it too:

Claim 1. All unit circles have a perimeter of \( 2. \)

*Proof:* Suppose you have some unit circle: \( (x-a)^2 + (y-b)^2 = 1 \). Let \( \alpha(t) = \left< a + \cos t, b + \sin t \right> \) be its parametrization; then the perimeter is given by \[ \begin{aligned} \int_0^{2\pi} |\alpha'(t)| \, dt &= \int_0^{2\pi} 1 \, dt \\ &= 2 \end{aligned} \]\(~\Box\)

If you think my claim is bogus, which it is, then here is what you should NOT do. You shouldn't say:

Mathematicians deal with counterexamples all the time. But if you are giving me a counterexample, then it is your job to first prove that your counterexample satisfies my hypotheses. In this particular case, the hypothesis is that the shape is a unit circle. Before telling me that your counterexample does not have a perimeter of \( 2 \), tell me why you think it is a unit circle. The advantage of this is that it weeds out most baseless counterexamples and we're able to focus on the important ones. The first counterexample is then immediately seen for what it is: a square. The second counterexample is not a unit circle, since the radius is \( 2 \), not \( 1 \). The third has the equation \( x^{\frac{28}{15}} + y^{\frac{28}{15}} = 1 \), which is not quite the unit circle \( x^2 + y^2 = 1 \) but very close.

So the shapes are not direct counterexamples but you are onto something. To change my mind, you could argue that

- the unit circle can be inscribed in the square, and it is odd that there is such a huge difference in perimeter: \( 2 \) vs. \( 8 \).
- when you double the radius of a circle, you double the perimeter. But the larger circle has perimeter \( 4\pi \), not \( 4 \).
- the shape is very close to a unit circle, so their perimeters should also be very close. My shape has a perimeter a lot more than \( 2 \), in fact very close to \( 2\pi \).

Another perfectly reasonable response to my claim is critiquing the logic. There are a lot of deep ideas that I actually did not mention explicitly. These are moments where my logic possibly fails.

You said the unit circle is \( (x-a)^2 + (y-b)^2 = 1 \)? Yes, that comes from the definition of a circle and Pythagoras' theorem. I can explain it or give you a reference to it.

You gave the parametrization for the circle. What is that? That's some trigonometry. I can explain it or give you a reference.

Where does that integral come from? It's the arclength formula. I can explain it or give a reference.

Where does that \( 1 \) in the integral come from? It's the magnitude of a derivative. I can explain it or give a reference...

Where does the final \( 2 \) come from? It's the fundamental theorem of calculus (FTC); it was proven by Newton & Leibniz several hundred years ago. I can... Wait a minute, I made a mistake. By FTC, the final answer should be \( 2\pi \), not \( 2 \). Thanks!

I now fix the claim:

Claim 2. All unit circles have a perimeter of \( 2\pi. \)

At this point, you will mull over the proof for a while, read the references I gave, and then hopefully accept my claim as true. Sometimes, you will find a flaw in my proof or come up with a solid counterexample in which case I will retract my claim and attempt to fix it. Overall, this is what a conversation about a new theorem in math might look like.
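If you'd rather not take my references on faith, Claim 2 is also easy to check numerically. A rough Python sketch (polygonal approximation of the arclength integral; the centre \( (a,b) = (0.5, -1) \) is an arbitrary choice, since any centre gives the same perimeter):

```python
import math

def arclength(alpha, t0, t1, N=20000):
    # approximate the curve by N chords and sum their lengths
    total, dt = 0.0, (t1 - t0) / N
    prev = alpha(t0)
    for i in range(1, N + 1):
        cur = alpha(t0 + i * dt)
        total += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        prev = cur
    return total

a, b = 0.5, -1.0
unit_circle = lambda t: (a + math.cos(t), b + math.sin(t))
perimeter = arclength(unit_circle, 0.0, 2 * math.pi)
assert abs(perimeter - 2 * math.pi) < 1e-4  # 2*pi, not 2
```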

Now I'm going to try to apply it to how we debate in general. Specifically, to how we use analogies to try to show that someone's argument is ridiculous. In this case, analogies act sort of like counterexamples. Let's say we are talking about cultural appropriation and I make the following claim:

Claim 3. Iggy is an example of cultural appropriation.

*Proof:* She is white but she raps in AAVE. I do not think that is okay.\(~\Box\)

A few things to note:

- Whereas in math we have clear-cut definitions, in real life such definitions don't exist. I have not defined what cultural appropriation is because the topic is too broad. My “proof” is not a justification of why my claim is true but more of an explanation of what I *mean*.
- Conversations in real life are rarely in the claim-proof format, but I think it is good practice to use the format in your interpretation before you start critiquing. What I probably said was, “Iggy is appropriating black culture cause she is white but she raps in AAVE.”
- This is not about whether my claim is justified; it's about the proper way to respond to it with a counterexample.

You might think I'm being ridiculous but here is how you should NOT respond:

“Is Justin Timberlake appropriating as well?”

“Is French Montana appropriating as well?”

“Is Drake appropriating as well?”

So if you think my statement was bogus and you think Justin Timberlake (JT) is a good counterexample to it, then that's fine. But **it is your job to first tell me why he is relevant**. What does JT have to do with Iggy? Until you properly explain why you're bringing him up, there isn't much I can do to respond. First identify the premises in my claim and briefly explain why they apply to him as well. I do not know how JT can be made relevant but at least the second question can be rephrased as:

“French Montana is white and raps in AAVE. Are you not okay with him too?”

This is now a valid analogy: you have pointed out what you think my premises are: 1) Iggy is white. 2) Iggy raps in AAVE. Then you found an example of another artist who satisfies the same premises and for whom I will probably make a different conclusion. Now I'm forced to fix my reasoning:

She has an Australian accent when she talks normally but she uses AAVE in all her raps. I am not okay with how hard she tries to sound like that. On the other hand, French Montana raps the way he speaks and I don't mind it.

“Drake who is from Canada occasionally uses Jamaican Patois. Why is no one accusing him of cultural appropriation?”

Iggy built her whole career on another people's dialect and Drake did not. His occasional use of Jamaican Patois is okay with me. However, I will say people still make fun of him for it.

As you can see, my original claim was not very clear and it should have raised some concerns. It was your use of examples, and your explanation of why you think the examples are relevant, that helped me clarify my position. In the process, we were able to focus on what my premises were. This is how you keep a healthy discussion moving forward.

Now back to my meta-analogy. The issue at hand was: **Don't use analogies!** I could have given a logical reason for why I don't like analogies, but such abstractness may not have been properly appreciated. Instead, I gave an example with math theorems and an analogy in the context of a regular discussion; I also took care to point out some of the places where the two examples are actually different. Hopefully, you now understand what my tweets *meant*:

Claim 4. Analogies/examples are only useful if you first put in the effort to explain why they are relevant.

*Proof:* This post.\(~\Box\)

*If you disagree, show me the flaw(s) in my reasoning and/or give me one or more relevant counterexamples. Don't just say, “Jesus used parables all the time.”*

Definition 1. \( f \) has a **period-\( m \)** point, \( m \ge 1, \) if there exists \( x \in I \) such that \( f^m(x) = x \) but \( f^i(x) \ne x \) for \( 1 \le i < m. \) If \( m=1, \) then we say \( f \) has a **fixed point**.

Suppose \( m < n; \) then a natural question to ask is whether having a period-\( m \) point implies having a period-\( n \) point, or vice-versa. In the simple case where \( m=1 \) and \( n=2, \) it is obvious that having a fixed point does not guarantee having a period-\( 2 \) point: let \( f(x) = x; \) then every point is a fixed point. On the other hand, the intermediate value theorem (IVT) gives us the other implication:

IVT. Suppose \( g:[a,b] \to {\mathbb R} \) is continuous and, without loss of generality, \( g(a) \le g(b); \) then for every \( r \in [g(a), g(b)] \) there exists \( c \in [a,b] \) such that \( g(c) = r. \)

The topological form of this theorem states that continuous functions preserve connectedness.

Theorem 2. period-\( 2 \) point \( \Rightarrow \) fixed point.

*Proof:* Let \( a \) be a period-\( 2 \) point and \( b=f(a) \ne a. \) Then \( f(b) = f^2(a)=a \ne b \) and \( f^2(b) = f(a) = b. \) So \( b \) is a period-\( 2 \) point as well. In fact, \( f \) swaps \( a \) and \( b. \) Assume without loss of generality that \( a < b \) and let \( g(x) = x - f(x) \) be defined on \( [a, b] \subset I. \) Then \( g(a) = a - b < 0 \) and \( g(b) = b - a > 0. \) By IVT, there exists \( c \in [a,b] \) such that \( g(c) = 0. \) So \( f(c) = c \) and \( f \) has a fixed point.\(~\Box\)

And if you let \( f(x) = 1 - x, \) then all points except \( x = 0.5 \) are period-\( 2 \) points, while \( x = 0.5 \) is the fixed point guaranteed by the theorem. Evidently, the existence of period-\( 2 \) points does not imply the existence of period-\( m \) points for \( m > 2. \)
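The proof of Theorem 2 is effectively an algorithm: bisect on \( g(x) = x - f(x) \). A small Python sketch (the map \( f(x) = 1 - x^2 \), which swaps the period-\( 2 \) orbit \( \{0, 1\} \), is my own example):

```python
def bisect_fixed_point(f, a, b, tol=1e-12):
    # IVT applied to g(x) = x - f(x) on [a, b], exactly as in Theorem 2
    g = lambda x: x - f(x)
    assert g(a) < 0 < g(b)
    while b - a > tol:
        mid = (a + b) / 2
        if g(mid) < 0:
            a = mid
        else:
            b = mid
    return (a + b) / 2

f = lambda x: 1 - x**2          # f(0) = 1 and f(1) = 0: a period-2 orbit
c = bisect_fixed_point(f, 0.0, 1.0)
assert abs(f(c) - c) < 1e-9     # c is (approximately) a fixed point
```

Here the fixed point is the positive root of \( x^2 + x - 1 = 0, \) i.e. \( (\sqrt 5 - 1)/2. \)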

So in a similar vein, one might ask whether the existence of a period-\( 3 \) point implies

- the existence of a fixed point.
- the existence of a period-\( 2 \) point
- the existence of a period-\( m \) point for \( m > 3 \)

Tien-Yien Li and James Yorke published a paper in 1975 that surprisingly answered “yes” to all three implications.

Li-Yorke, '75. period-\( 3 \) point \( \Rightarrow \) period-\( m, \) for all \( m \ge 1. \)

To be fair, they proved much more than what I just stated, but I'll stick to this for now because I want to highlight their elegant use of IVT. So something very interesting happens at \( m=3 \), and you might wonder how to go about proving such a general theorem. The proof will still rely on IVT, but we'll have to be clever about the sets/functions we apply it to. We'll need two lemmas:

Lemma 4. Suppose \( g:[a,b] \to {\mathbb R} \) is continuous and \( [c,d] \subset g([a,b]); \) then there exists \( [a', b'] \subset [a,b] \) such that \( g([a',b']) = [c,d]. \)

*Proof:* Since \( [c,d] \subset g([a,b]), \) there exist \( p, q \in [a,b] \) with \( g(p) = c \) and \( g(q) = d; \) assume without loss of generality that \( p < q. \) Let \( a' = \sup\{ x \in [p,q] : g(x) = c \} \) and then \( b' = \inf\{ x \in [a',q] : g(x) = d \}. \) By continuity, \( g(a') = c \) and \( g(b') = d, \) while \( g \) takes neither value on \( (a',b'). \) By IVT, \( g([a',b']) \supseteq [c,d]; \) and if \( g \) left \( [c,d] \) somewhere on \( (a',b'), \) IVT would force it to take the value \( c \) or \( d \) there, a contradiction. Hence \( g([a',b']) = [c,d]. \)\(~\Box\)

Lemma 5. Suppose \( g: I \to I \) is continuous and \( I_0 = [a_0,b_0] \subset g(I_0). \) Then there exist closed intervals \( I_n \subset I_0 \) such that \( g(I_n) = I_{n-1} \) for all positive integers \( n. \) In particular, \( g^n(I_n) = I_0 \) for all positive integers \( n. \)

*Proof by induction:* When \( n=1, \) the statement holds by Lemma 4. Assume \( I_n \subset I_0 \) and \( g(I_n) = I_{n-1} \) for some \( n \ge 1. \) Since \( I_n \subset I_0 \subseteq g(I_0), \) Lemma 4 gives a closed interval \( I_{n+1} = [a_{n+1}, b_{n+1}] \subset I_0 \) such that \( g(I_{n+1}) = I_n. \)\(~\Box\)

*Proof of Li-Yorke:* Suppose \( a \in I \) is a period-\( 3 \) point. Let \( b = f(a) \neq a \) and \( c = f(b)\neq a,b, \) then \( f(c) = f^3(a) = a \) and \( a, b, c \) are all period-\( 3 \) points. Without loss of generality, we have either \( a < b < c \) or \( c < b < a. \) In both cases, by IVT, \( [b, c] \subset f([a,b]) \) and \( [a,c] \subset f([b,c]). \) Again by IVT, \( f \) has a fixed point in \( [b,c]. \)

Fix a positive integer \( m > 1 \) and let \( I_0 = [b,c]; \) then, by Lemma 5, there exist closed intervals \( I_n \subset [b,c] \) such that \( f(I_n) = I_{n-1} \) for all \( n \ge 1. \) Since \( I_{m-2} \subset [b,c] \subset f([a,b]), \) by Lemma 4, there exists \( [a',b'] \subset [a,b] \) such that \( f([a', b']) = I_{m-2}. \) So \[ \begin{aligned} ~[a', b'] \subset [a,b] &\subset [a,c] \\ &\subset f([b,c]) \\ &= f^{m-1}(I_{m-2}) \\ &= f^{m}([a',b']) \end{aligned} \] By IVT, \( f^m \) has a fixed point \( x_0 \in [a',b']. \) Since \( f^i([a',b']) = f^{i-1}(I_{m-2}) = I_{m-i-1} \subset [b,c] \) is disjoint from \( [a', b'] \) for \( i< m, \) then \( f^i(x_0) \neq x_0 \) for \( i < m. \) Therefore \( x_0 \) is a period-\( m \) point.\(~\Box\)

The crucial step in the proof was recognizing that \( f \) acts on the set \( \{ a,b,c \} \) by cyclic rotation. This simplicity is lost when we look at period-\( m \) points with \( m > 3. \) So it seems sort of hopeless to use the same argument to say which period-\( n \) points are guaranteed when period-\( m \) points exist, \( m > 3. \) Li & Yorke did not address this question in the paper.
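To see the Li-Yorke theorem in action, here is a Python sketch using the logistic map \( f(x) = rx(1-x) \) at \( r = 3.84 \), a standard example of a map with an attracting period-\( 3 \) orbit (the parameter choice is mine, not from the paper). Once a period-\( 3 \) point is found, the theorem guarantees points of every period.

```python
def f(x, r=3.84):
    # logistic map; r = 3.84 sits inside the "period-3 window"
    return r * x * (1 - x)

# iterate long enough to land (numerically) on the attracting 3-cycle
x = 0.5
for _ in range(10000):
    x = f(x)

orbit = [x, f(x), f(f(x))]
assert abs(f(f(f(x))) - x) < 1e-6              # f^3(x) = x ...
assert all(abs(p - q) > 0.05
           for i, p in enumerate(orbit)
           for q in orbit[i + 1:])             # ... with three distinct points
```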

Surprisingly, Oleksandr Sharkovsky, a Ukrainian mathematician, had answered it completely back in \( 1964, \) but the theorem had gone largely unnoticed outside of Eastern Europe. I will now state the full theorem but save the proof for another time:

Definition 2. The positive integers admit the following ordering: \[ 3 \prec 5 \prec 7 \prec \cdots \prec 2\cdot3 \prec 2\cdot 5 \prec 2 \cdot 7 \prec \cdots \prec 2^2 \cdot 3 \prec 2^2 \cdot 5 \prec \cdots \prec 2^3 \prec 2^2 \prec 2 \prec 1 \] This is known as **Sharkovsky's ordering**.
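The ordering is easy to compute with: write \( n = 2^a p \) with \( p \) odd; the odd-multiple group comes first (ordered by \( a \), then by the odd part \( p \)), and the pure powers of two come last, in decreasing order of exponent. A Python sketch (the tuple-key encoding is my own):

```python
def sharkovsky_key(n):
    # write n = 2**a * p with p odd and encode n's position in the ordering
    a = 0
    while n % 2 == 0:
        n //= 2
        a += 1
    if n > 1:
        return (0, a, n)   # 3 < 5 < ... < 2*3 < 2*5 < ... < 4*3 < ...
    return (1, -a)         # ... < 8 < 4 < 2 < 1

def precedes(m, n):
    # m comes strictly before n in Sharkovsky's ordering
    return sharkovsky_key(m) < sharkovsky_key(n)

assert precedes(3, 5) and precedes(7, 2 * 3) and precedes(2 * 5, 4 * 3)
assert precedes(4 * 3, 8) and precedes(8, 4) and precedes(2, 1)
assert all(precedes(3, n) for n in range(1, 1000) if n != 3)  # 3 comes first
```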

Sharkovsky, '64. If \( m \prec n \) in Sharkovsky's ordering, then \[ \textrm{period-}m \Rightarrow \textrm{period-}n \] and the converse is true.

To fully appreciate this theorem, let \( \succ \) be a (new) relation on the positive integers such that \( m \succ n \) if (and only if) period-\( m \) implies period-\( n. \) Then we have only proven that:

- \( 2 \succ 1 \)
- \( 1 \nsucc m \) for \( m > 1 \)
- \( 2 \nsucc m \) for \( m > 2 \)
- \( 3 \succ m \) for \( m \ge 1 \)

\( \succ \) is obviously transitive and reflexive, but we have yet to prove that the relation is antisymmetric and total. I will, however, leave one thing for you to prove:

Exercise 1. \( 5 \nsucc 3, \) that is, period-\( 5 \) \( \not\Rightarrow \) period-\( 3. \)

*Hint:* Let \( f:I \to I \) be defined by \( f(0)=\frac{1}{2}, \) \( f(\frac{1}{4})=1, \) \( f(\frac{1}{2}) = \frac{3}{4}, \) \( f(\frac{3}{4})=\frac{1}{4}, \) \( f(1)=0, \) with \( f \) linear between the points \( 0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1; \) then \( 0 \) is a period-\( 5 \) point. Prove that \( f \) has no period-\( 3 \) point.
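Here is a Python sketch of the hint's map (I take \( f(3/4) = 1/4 \), so that the orbit of \( 0 \) closes up into a \( 5 \)-cycle); verifying the absence of period-\( 3 \) points is left as the exercise:

```python
def piecewise_linear(x, xs, ys):
    # linear interpolation through the breakpoints (xs must be increasing)
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside [0, 1]")

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.5, 1.0, 0.75, 0.25, 0.0]   # f(0), f(1/4), f(1/2), f(3/4), f(1)
f = lambda x: piecewise_linear(x, xs, ys)

# the orbit of 0 is 0 -> 1/2 -> 3/4 -> 1/4 -> 1 -> 0: a period-5 point
orbit, x = [], 0.0
for _ in range(5):
    orbit.append(x)
    x = f(x)
assert x == 0.0 and len(set(orbit)) == 5
```

(All breakpoints are exact binary fractions, so the orbit closes up exactly in floating point.)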

**References**

Li, T. Y. and Yorke, J. A. “Period Three Implies Chaos” *The American Mathematical Monthly* 82.10 (1975): 985-992. Web.

A **monoid** is a set with an associative binary operation (like addition) and an identity element (like zero).

We loved the natural numbers! They represent progress! They always moved forward into the future and never turned back! But we are humans and humans are nostalgic beings. We missed the past and the simplicity of \( 0 \). In the midst of this midlife crisis, we as a people found the perfect solution. We invented the reversed passage of time. We invented the negative number that allowed us to go back to the beginning. We did this knowing that it also meant we could go past the beginning and see what came before the beginning.^{4} So the abelian group called **the integers** was born.

A **group** is a monoid where the binary operation can always be reversed (think subtraction). If the order in which the operation is applied (\( a + b \) vs \( b + a \)) doesn't matter, it is known as an **abelian group**.

The integers were amazing; they knew no limitations. Are you craving to see what the future has in store for you? Do you miss the good ol' days? Would you rather just stay exactly where you are? The integers are just for you! But soon we realized how tedious they could get. No one wanted to take a hundred steps just to get from \( 20 \) to \( 120 \). The world would be a better place if we just took fifty strides that were twice as long. Or twenty leaps that were equivalent to five steps each. Thus we started multiplying numbers and the integers became a ring.^{5}

A **ring** is an abelian group with a new binary operation (called multiplication) that satisfies the distributive law with the group operation (called addition).

This new addition, no pun intended, to our skillset came with a caveat: *"A negative number multiplied with another negative number should be a positive number, and asking why is punishable by \( (-10)(-10) \) push-ups. How many push-ups is that? We'll multiply by \( 2 \) for each wrong answer you give!"*^{6} ^{7}

We then started asking weird questions like, "If you were \( 55 \) and there was a happy memory at \( 20 \) that you wanted to revisit after \( 3 \) long strides, then how long should each stride be?" To be honest, these were post-modern rhetorical questions that were accidentally taken seriously.^{8} But the damage was done and we did the *rational* thing, i.e., answered the rhetorical question with fractions and division.^{9} The integers were upgraded from ring to field and renamed as **the rational numbers**.

A **field** is a commutative ring with a multiplicative identity where each non-zero element has a multiplicative inverse (basically, you can divide too).

Now remember the people who asked a rhetorical question that flew right over everyone else's head? They got back to the drawing board and, after a little critical analysis, concluded that "to really show how absurd numbers are, we need to remove all ties to reality and ask really abstract questions." They figured that purely abstract questions will obviously be interpreted as the rhetorical questions they really are. So they asked, "So you think the rational number system is the answer to everything? Then tell us this, what rational number when multiplied with itself gives the number two? What does that even mean?" To their chagrin, the first question was promptly answered, "Yes!" and the second one had them stumped.^{10} No one ever discussed or even thought about the third question.

Since answering questions involving numbers had basically become our civic duty^{11}, we eventually came up with a rather complicated solution.^{12} We imagined the integers as equally-spaced dots on a straight-line and the rational numbers as even smaller dots that seemed to fill in the line but in reality didn't! So to truly make a line, we *simply* filled in all gaps on the line.^{13} So \( \sqrt{2} \) was one such gap which was kinda close to the rational number \( 1.414 \). The new numbers that were used to fill the gap were called **the irrational numbers**.^{14}

And together, all numbers were **real numbers**. We were so proud of our achievement that we expected calling them "real" would put some finality to all the madness. But it was only the beginning. Everything we thought we knew about numbers was about to change forever, and ever, and ever, and ever, and ever, etc.

This is what I imagine the history of real numbers was after reading math textbooks.

1. #ZeroIsANaturalNumber #WhatAreWholeNumbersAnyway? ↩
2. If you had 2 things then I gave you 3 things, *obviously* you end up with 5. ↩
3. "If \( 2 = 1 + 1 \) and \( 3 = 1 + 1 + 1 \) then \( 2 + 3 = 1 + 1 + 1 + 1 + 1 = 5 \)." "But if \( 1 = \{ \emptyset \} \) then what is \( 2 \)?" "Well..." ↩
4. Little did we know this act was the first glimpse of our predisposition to solving problems by making up the numbers needed to calm us down. ↩
5. A commutative ring with identity, to be specific, but who's counting? ↩
6. The legal system unwittingly invented the **exponential function**. ↩
7. Seriously though, I can't think of a non-circular argument as to why \( - \times - = + \). I guess you can argue by symmetry: \( (+)(+) = + \), \( (-)(+) = - \), \( (+)(-) = - \). So to keep things fair, \( (-)(-) = + \)... Right?! ↩
8. The asker really wanted the askee to reflect on the futility of it all: adding, subtracting, multiplying. "Can't you see?!?! No matter what we do we're still stuck in the [number] system!" ↩
9. see footnote 4 ↩
10. For all the wrong reasons. ↩
11. again, see footnote 4 ↩
12. It was the lesser of two evils. Instead of the real line, they could have just invented **algebraic numbers**, which would have answered all rhetorical questions involving multiplication and addition. But these numbers also included imaginary numbers, which were bound to raise more rhetorical questions. ↩
13. How exactly? I'm glad you asked. ↩
14. In their hastiness, they accidentally invented **transcendental numbers** like \( \pi \), which were never really the point of the first question. ↩

- I have completed two years of grad school and chosen my area of interest, Geometric Group Theory.
- I attended my first conference (Redbud Topology Conference), which inspired me to start taking my math adventures seriously again.
- I was also a teaching assistant for Calc 3 for three semesters, taught one semester of Finite Math, and am currently scheduled to teach (not TA!) an 8-week course of Calc 3 this summer.

I will start blogging again but the purpose for the blog that I laid out (almost exactly) 3 years ago will have to change; this can no longer be a "blog aimed at undergraduates." I want to blog about the things I am currently studying and also my experience/thoughts on math education. This means that I need to

- come up with the best way to describe what I am studying to someone who's unfamiliar with the area while still keeping it interesting.
- reflect on my past two years as a TA & instructor and collect my feelings about how we teach mathematics. It's a lot to do just for a blog but I will not give up on it just yet.

Luckily, I just discovered a bunch of math articles on my old computer that can serve as ideas for several blog posts while I figure out the new direction for my blog. These are articles that I loved as an undergrad, so at least they will be "easier" to blog about than geometric group theory.

**Potential future blog posts:**

- Finish up on ordinals
- Sharkovsky's theorem and chaos
- Hyperbolic origami
- Constructive mathematics (philosophy of math)
- Yablo's paradox & non-circular paradoxes (philosophy, logic)
- Jordan curve theorem using nonstandard analysis
- The Kruskal count
- Playing cards with a demon
- Honeycomb conjecture
- Catalan's conjecture, this might be too ambitious...

Here's a toast to being productive again :)

$$\left(1+\frac{1}{3}-\frac{1}{5}-\frac{1}{7}+\frac{1}{9}+\frac{1}{11}-\cdots\right)^2 = 1+\frac{1}{9}+\frac{1}{25}+\frac{1}{49} + \cdots $$

I discovered this identity while attempting to calculate \( \int_0^\infty \frac{1}{1+x^4}\,dx \). Evaluating it in two ways gave me:^{1}

$$\sum_{n=0}^\infty (-1)^n\left(\frac{1}{4n+1} +\frac{1}{4n+3}\right) = \frac{\pi}{\sqrt8}$$

I instantly recognized that

$$\sum_{n=0}^\infty \frac{1}{(2n+1)^2} = \frac{\pi^2}{8}$$

from other proofs of Basel's problem. Thus squaring the former series would produce the latter. On the other hand, if I could prove the identity with a different and preferably simpler approach, then I would have a new and original proof of \( \zeta(2) = \frac{\pi^2}{6} \).
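Skeptical readers can sanity-check the identity numerically before reading on. Here is a quick sketch of my own (in Python); the truncation level `N` is an arbitrary choice, and both series converge slowly:

```python
import math

N = 100_000  # arbitrary truncation level

# Left side: 1 + 1/3 - 1/5 - 1/7 + 1/9 + 1/11 - ...
left = sum((-1) ** n * (1 / (4 * n + 1) + 1 / (4 * n + 3)) for n in range(N))

# Right side: 1 + 1/9 + 1/25 + 1/49 + ...
right = sum(1 / (2 * n + 1) ** 2 for n in range(N))

# left ≈ π/√8, so left² ≈ right ≈ π²/8
print(left ** 2, right, math.pi ** 2 / 8)
```

The partial sums agree to several decimal places, exactly as the identity predicts.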

I eventually sought help from Math Stack Exchange. The only answer, which arrived two weeks later, was not very satisfactory as it reverted back to integrals and looked very similar to previous solutions to Basel's problem.^{2}

The identity looked so simple that going back to integrals seemed like a step backwards. Surely there should be a proof that relies on symbol manipulation? Or a proof showing that the difference between the two series tends to 0 as the number of terms increases?

A comment on my question linked to a different proof that did something similar. The proof uses a heuristic argument to show that

$$ \sum_{k=-\infty}^{\infty} \dfrac1{(2k+1)^2} = \left(\sum_{k=-\infty}^{\infty} \dfrac{(-1)^k}{(2k+1)} \right)^2 $$
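The doubly infinite version can also be checked numerically with symmetric partial sums. A sketch of my own (note that Python's `(-1) ** k` handles negative integer `k` correctly, returning a float):

```python
import math

N = 100_000  # arbitrary symmetric truncation level

sq = sum(1 / (2 * k + 1) ** 2 for k in range(-N, N + 1))      # → π²/4
alt = sum((-1) ** k / (2 * k + 1) for k in range(-N, N + 1))  # → π/2

print(alt ** 2, sq)  # both ≈ π²/4
```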

I attempted to prove the identity along the same lines but I never could show that the difference between the two series vanished.

So eight months later, I still have no nice proof for this beautiful identity: my first mathematical misadventure.

1. $$ \begin{aligned}
\int_0^\infty \frac{1}{1+x^4}\,dx ~
&= \int_0^1 \frac{1+x^2}{1+x^4}\,dx \\
&= \int_0^1 1+x^2-x^4-x^6+x^8+x^{10} - \cdots \,dx \\
&= 1+\frac{1}{3}-\frac{1}{5}-\frac{1}{7}+\frac{1}{9}+\frac{1}{11}-\cdots\end{aligned}$$ And, $$ \begin{aligned}
\int_0^\infty \frac{1}{1+x^4}\,dx ~
&= \int_0^1 \frac{1+x^2}{1+x^4}\,dx \\
&= \frac{1}{2}\int_0^1 \frac{1}{x^2-\sqrt{2}x+1} + \frac{1}{x^2+\sqrt{2}x+1} \,dx \\
&= \frac{1}{\sqrt{2}} \left[ \arctan(\sqrt{2}-1) + \arctan(\sqrt{2}+1) \right] \\
&= \frac{\pi}{\sqrt{8}}\end{aligned}$$

2. Using techniques similar to previous solutions of Basel's problem meant that the new proof contained nothing original, as noted by the author of the post. Secondly, I was originally trying to use a single simple integral \( \displaystyle \int \frac{1}{1+x^4}\,dx.\) That solution eventually introduced the double integral I was trying to avoid: \( \displaystyle \int \int \frac{x}{(1+x^2)(1+x^2y^2)} \,dx\,dy.\)

Definition 1. For any set \({S \subset \mathbb R}\), let \({S^{(\infty + n)} = \left(S^{(\infty)}\right)^{(n)}}\) for \({n\in \mathbb N}\), and \[ S^{(\infty \cdot 2)} = S^{(\infty+\infty)} = \left(S^{(\infty)}\right)^{(\infty)} \]

I could easily define \({S^{(\infty \cdot n)}, ~ S^{(\infty \cdot \infty)}, ~ S^{(n + \infty)}, }\) and even \({S^{(n \cdot \infty)}}\). This suggests a sort of arithmetic with these “numbers”, including \({\infty}\). The arithmetic has some interesting properties; for instance, the operations are not commutative: \({n + \infty = \infty = n \cdot \infty}\) when \({n \in {\mathbb N}}\).

To avoid confusion and the repeated use of “infinite” in quotation marks, I will proceed by using the symbol \({\omega}\) instead of \({\infty}\). This highlights that: 1) \({\omega}\) is a number that can be treated with arithmetic like any other; 2) \({\omega}\) has a successor, namely \({\omega + 1}\). Let me illustrate further why the second point is important. Using \({\infty}\) in my first definition was not arbitrary. It was based on my (false) assumption that finite derived sets and the intersection of all finite derived sets were the only interesting derived sets. I did not expect to learn anything useful past \({S^{(\infty)}}\), hence my use of the symbol. There is nothing wrong with continuing to use it, but I (and Cantor over a century ago) chose to change it in order to separate the two notions.

Now this is where Cantor's genius truly shines. He recognized that the numbers we defined, \({1, 2, 3, \ldots,\omega, \ldots}\), represented something special: they represented order. Even the familiar \({1, 2, 3, \ldots}\) were not acting as natural numbers or signifiers of size (cardinal numbers) but as signifiers of order, specifically **well-orders of countable sets**. Let's look at \({\omega}\) for instance. I can see that it represents the natural order of the natural numbers since that's the order I used to define it. But that's not all of it. I could have started with \({S^{(3)} = S'}\) and, in this case, \({\omega}\) would represent the natural order of all integers greater than 2. But aren't these two orderings essentially the same? Moreover, I was never required to use just natural numbers. So \({\omega}\) could represent the well-order of any countable set where every element has a successor and only one element is not a successor of any element. What about \({\omega \cdot 2}\)? That would be the well-order where every element has a successor and only two elements are not successors of any element. Two such well-orders on the natural numbers would be: \({1 \prec 3 \prec 5 \prec \cdots \prec 2 \prec 4 \prec \cdots}\) or \({3 \prec 6 \prec 9 \prec \cdots \prec 1 \prec 2 \prec 4 \prec \cdots}\). Finally, \({\omega^\omega}\) could represent my counterexample set \({S}\) ordered by \({>}\).
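As a small illustration of my own (in Python), the first well-order of type \( \omega \cdot 2 \) above can be realized as a sort key on an initial segment of the naturals: all odds in their usual order, then all evens.

```python
# ω·2 as a sort key: (is_even, n) puts the odds (in order) before the evens.
omega_two = sorted(range(1, 11), key=lambda n: (n % 2 == 0, n))
print(omega_two)  # [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
```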

All in all, whenever I used a number in \({S^{(n)}}\), what really mattered was the order \({n}\) represents. In retrospect, my initial definition for \({S^{(\infty)}}\) was very naive for there are many more orders available. So what does it mean for orders “to be essentially the same” or “represented by the same number”? The following definition will help:

Definition 2. Two (linearly or partially) ordered sets \({A}\) and \({B}\) are said to be **order isomorphic** if there exists a bijection \({f:A \rightarrow B}\) such that for every \({x,y \in A}\), \[ x \prec y \iff f(x) \prec f(y) \] \({f}\) is also known as an **order-preserving** function.

Note: While this investigation started with a problem in topology, we are now about to momentarily leave that topic and delve into set theory (naive set theory, for now).

Naive Definition. An **ordinal number** is the unique mathematical object that represents order-isomorphic well-orderings of sets. A **countable/uncountable ordinal** is an ordinal number that represents order-isomorphic well-orderings of countable/uncountable sets, respectively.

This definition is anything but rigorous by current math standards, but it is basically what Cantor started with. It allowed him to define objects like “the set of all countable ordinal numbers.” However, since ordinal numbers are still in use in modern mathematics, there must be a rigorous definition that uses axiomatic set theory. I suppose one could define countable ordinal numbers in terms of equivalence classes of well-orderings of \({{\mathbb N}}\), but I don't know if that's enough to account for the existence of an uncountable ordinal. More on this in the next post; for now, the naive approach will do.

Let's return to \({\omega}\) and \({\omega \cdot 2}\) and the well-orders they represent: \[ \begin{aligned} \omega &: \qquad 1 \prec 2 \prec 3 \prec \cdots \\ \omega \cdot 2 &: \qquad 1 \prec 3 \prec 5 \prec \cdots \prec 2 \prec 4 \prec \cdots \end{aligned} \]

It seems like \({\omega \cdot 2}\) contains \({\omega}\) -- whatever “contains” means. Cantor gave the following definition:

Definition 4. A well-order on a set \({A}\) is **contained in** another well-order on a set \({B}\) if there exists an order-preserving injective function \({f:A \rightarrow B}\) but no such function \({g:B \rightarrow A}\). Furthermore, if \({\alpha}\) and \({\beta}\) represent the well-ordered sets \({A}\) and \({B}\), then we say \({\alpha \in \beta}\).

Note: Cantor had already developed the idea that any mathematical object can be a set and \({\in}\) is just a relation on sets. He thus defined the relation \({\in}\) on the newly defined ordinals. Axiomatic set theorists would later take his ideas and build a proper foundation: everything was made a set, and axioms dictated how the relation \({\in}\) could be properly defined on sets.

The following is an illustration using finite sets. If I have sets \({A = \{3,7,8\}}\) and \({B = \{a,b\}}\), then clearly the well-order \({a \prec b}\) is contained in \({7 \prec 3 \prec 8}\). Since \({2}\) and \({3}\), as ordinal numbers, represent the two well-orders, respectively, I can write \({2 \in 3}\). In fact, \({1 \in 3}\) and \({0 \in 3}\), where \({0}\) would represent the well-order on an empty set.

Cantor then proved two important facts: ordinal numbers are well-ordered by \({\in}\), and ordinals are uniquely determined by the ordinals contained in them. Thus one could say that \({3 = \{0,1,2\}}\) and, at the same time, this ordinal represents its own well-ordering: \({0 \in 1 \in 2}\). He went further and defined the set \({\omega_1 = \{\text{all countable ordinals}\}}\). This is a well-defined set since countable ordinals represent well-orderings of \({{\mathbb N}}\). Furthermore, since \({\omega_1}\) is well-ordered by \({\in}\), it must be represented by an ordinal number I'll, incidentally, call \({\omega_1}\).
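The finite part of this picture is easy to verify concretely. Here is a sketch of my own (in Python) of von Neumann's encoding, where each ordinal is literally the set of all smaller ordinals; frozensets stand in for sets:

```python
def ordinal(n):
    """Build the finite von Neumann ordinal n = {0, 1, ..., n-1}."""
    s = frozenset()
    for _ in range(n):
        s = s | {s}  # successor step: k + 1 = k ∪ {k}
    return s

three = ordinal(3)
print(ordinal(2) in three)                                       # True: 2 ∈ 3
print(three == frozenset({ordinal(0), ordinal(1), ordinal(2)}))  # True: 3 = {0, 1, 2}
```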

Definition 5. A nonzero ordinal number is called a **limit ordinal** if it is not a direct successor of any ordinal number. Otherwise, the ordinal is a **successor ordinal**.

Finally, I will briefly return to the original problem: can we make any set \({S}\) perfect if we derive it enough times? This is now evidently a misguided question since "enough times" assumes that magnitude matters, when what we should care about is order. The counterexample at the end of the last post showed that one could derive a set countably infinitely many times and get different results if the order is changed. Thus \({S^{(\omega)}}\) was a singleton but \({S^{(\omega+1)}}\) was the empty set, which is vacuously perfect. So the problem should be rephrased to "**can we always make \({S}\) perfect if we derive it in the right order?**"

The following definition of derived sets using ordinals will prepare you for the last step:

Definition 6. For any set \({S \subset \mathbb R}\) and ordinal \({\alpha}\), let \({S^{(0)}=S}\), \({S^{(\alpha+1)} = \left(S^{(\alpha)}\right)',}\) and \[ S^{(\alpha)} = \bigcap_{\beta \in \alpha} S^{(\beta)} \quad \text{ if } \alpha \text{ is a limit ordinal} \]

To be continued...

]]>I will assume some familiarity with topology and countability. Here are the important definitions:

Definition 1. Let \(S \subset \mathbb{R}\). A point \(x \in \mathbb R\) is called a **limit point of \(S\)** if for every interval \((x-\epsilon, x+\epsilon)\), $$ (x-\epsilon, x+\epsilon) \cap S $$ has infinitely many elements.

A point \(x \in S\) is called an **isolated point of \(S\)** if it is not a limit point of \(S\).

Definition 2. Let \(S\) be a set. \(S\) is said to be **countable** if there exists an injective function \(f\colon S\rightarrow\mathbb N\). Otherwise the set is called **uncountable**.

It follows from this definition that:

1. if \(T\) is countable and there is an injective map from \(S \rightarrow T\), then \(S\) is countable.
2. if \(S\) is countable, then all its elements can be ordered as: \( s_1\prec s_2 \prec s_3 \prec \cdots \prec s_n \prec \cdots\)
3. \(\mathbb Q\), the set of rationals, is countable.
4. if \(S\) is uncountable and \(T\) is countable, then \(S\setminus T\) is uncountable.
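Fact (3) deserves a concrete witness. Here is a sketch of my own (in Python) that enumerates the positive rationals by increasing numerator-plus-denominator, skipping repeats; interleaving signs and zero would extend it to all of \(\mathbb Q\):

```python
from fractions import Fraction

def first_rationals(n):
    """List the first n positive rationals, ordered by p + q and then by p."""
    seen, out, total = set(), [], 2
    while len(out) < n:
        for p in range(1, total):
            f = Fraction(p, total - p)
            if f not in seen:  # skip duplicates such as 2/2 = 1/1
                seen.add(f)
                out.append(f)
        total += 1
    return out[:n]

print(first_rationals(5))  # [Fraction(1, 1), Fraction(1, 2), Fraction(2, 1), Fraction(1, 3), Fraction(3, 1)]
```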

The second fact relies on the well-ordering principle of natural numbers. Now we can start proving the interesting stuff:

Theorem 3. Any set of isolated points is countable.

*Proof:* Suppose every point of \(S\) is an isolated point of \(S\). Then for every point \(x \in S\), there is an interval \( I_x = \left(x-\frac{\epsilon}{2}, x+\frac{\epsilon}{2}\right)\) such that these intervals are pairwise disjoint and $$ (x-\epsilon,x+\epsilon) \cap S = \{x\} $$
Since \(\mathbb Q\) is countable, we can order the set of rationals like \(q_1 \prec q_2 \prec q_3 \prec \cdots\). With this new order on the rationals, we say \(q_i\) is less than \(q_j\) if \(i < j\). Every interval \(I_x\) contains rational numbers; therefore, we can map each \(x \in S\) to the “least” rational number contained in its interval \(I_x\). This mapping is injective because the \(I_x\) are disjoint. By fact \((1)\) above, \(S\) is countable. \(\Box\)

Note: Notice how I claimed every interval \(I_x\) contains rational numbers? This is known as the density of the rational numbers in \(\mathbb R\). To Cantor, this fact was obvious since he defined the real numbers as the set of equivalence classes of Cauchy sequences (he called them *fundamental series*) of rationals. You can read more by searching for the Arithmetization of the Real Numbers.

Definition 4. For any set \(S \subset \mathbb R\), the set $$ S' = \{x \in \mathbb R : x \text{ is a limit point of } S \} $$ is known as the **derived set of \(S\)**.

Corollary 5. If \(S \subset \mathbb R\) is uncountable, then \(S'\) is uncountable and \(S\setminus S'\) is countable.

This follows from fact \((4)\) and the theorem. Note that \(S\setminus S'\) is the set of isolated points of \(S\).

Corollary 6. If \(T = S'\) for some set \(S\), then \(T' \subset T\).

To prove this, simply apply the definition of limit points twice. This sets us up for the next definition:

Definition 7. A set \(S \subset \mathbb R\) is called **perfect** if \(S' = S\).

Historical note: Nearly every definition I've given thus far is due to ideas by Cantor. Cantor defined what it meant to be countable and famously proved that the real numbers were uncountable. While Cantor was not the first to define a limit point (Weierstrass mentioned it in unpublished lectures), he was the first to study and publish them. He also defined derived and perfect sets. That is why he is considered one of the two fathers of point-set topology. The other father is of course Felix Hausdorff.

So if \(S'\) is uncountable, then \(\left(S'\right)' \subset S'\) is uncountable, and so on. Intuitively, we would expect that with “enough” applications of the derived set, we'd be left with a perfect set, since we remove a countable set of isolated points at each step.

Definition 8. For any set \(S \subset \mathbb R\), let \(S^{(1)} = S'\) and \(S^{(n+1)} = \left(S^{(n)}\right)'\) for \(n\in \mathbb N\). Lastly, $$ S^{(\infty)} = \bigcap_{n \in \mathbb N} S^{(n)} $$

So is it possible for \(S^{(\infty)}\) to have isolated points? My first naive guess was that the set would be perfect (and possibly empty). I tried to prove this for a whole day but eventually started to doubt myself. So I worked to find a good counterexample.

Ideally, I wanted a set \(S\) whose derived set is the image of \(S\) under the map \(x \mapsto \frac{x}{2}\). So if \(0 \in S\) and \(S\) is bounded, then \(S^{(\infty)} = \{0\}\) by the Archimedean property. This led to the following neat fractal.

Theorem 9. Let \[ S = \left\{ 2^{1-i} \sum_{m \in M_i} 2^{-m} : M_i \subset \mathbb N \text{ and } |M_i| = i \in \mathbb N \right\} \cup \{0\} \] Then \(S^{(\infty)} = \{0\}\).

This is easy to see by looking at the binary representations of the elements of this set. So, knowing that \(S^{(\infty)}\) is not enough to get rid of all isolated points, where do we go from here? Ordinal numbers.

To be continued...

**References**

Cantor, Georg. *Contributions to the Founding of the Theory of Transfinite Numbers.* Trans. Philip Jourdain. New York: Dover Publications, 1954. Web.

Theorem. \(\displaystyle \qquad \zeta(2) = \sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6} \)

*Proof:* The proof relies on evaluating the integral \(\displaystyle \int_0^\infty \int_0^\infty \frac{1}{(1+y)(1+x^2 y)} \, dA \) in two different ways, first with \(dA = dx\, dy\) and second with \(dA = dy\, dx\). The first is easiest since \(\dfrac{1}{1+y}\) can be treated as a constant and factored out of the inner integral to give:

\[ \begin{aligned} \int_0^\infty \int_0^\infty \frac{1}{(1+y)(1+x^2 y)} \, dx \, dy &= \int_0^\infty \frac{1}{1+y} \int_0^\infty \frac{1}{1+x^2 y} \, dx \, dy \\ &= \int_0^\infty \frac{1}{1+y} \left. \left[\frac{\tan^{-1}\left(x\sqrt{y}\right)}{\sqrt{y}} \right] \right|_0^\infty \, dy \\ &= \int_0^\infty \frac{\pi}{(1+y) \cdot 2\sqrt{y}}\, dy \\ &= \pi \left. \tan^{-1}\left(\sqrt{y}\right)\right|_0^\infty \\ &= \frac{\pi^2}{2} \end{aligned}\]

To evaluate the integral using the second method, partial fractions come to the rescue, treating \(x\) as a constant: \[ \frac{1}{(1+y)(1+x^2 y)} = \frac{1}{1-x^2}\cdot\frac{1}{1+y} + \frac{x^2}{x^2-1}\cdot\frac{1}{1+x^2 y} \qquad (x \neq 1) \]

Therefore,

\[ \begin{aligned} \int_0^\infty \int_0^\infty \frac{1}{(1+y)(1+x^2 y)} \, dy \, dx &= \int_0^\infty \frac{1}{1-x^2} \int_0^\infty \frac{1}{1+y} - \frac{x^2}{1+x^2y} \, dy \, dx \\ &= \int_0^\infty \frac{1}{1-x^2} \left. \left[ \ln\left| \frac{1+y}{1+x^2 y} \right| \right] \right|_0^\infty \, dx \\ &= \int_0^\infty \frac{-2\ln x}{1-x^2} \, dx \\ &= \int_0^1 \frac{-2\ln x}{1-x^2} \, dx + \int_1^\infty \frac{-2\ln x}{1-x^2} \, dx \\ &= \int_0^1 \frac{-2\ln x}{1-x^2} \, dx + \int_0^1 \frac{-2\ln u}{1-u^2} \, du \quad \text{substitution } u=x^{-1} \\ &= \int_0^1 \frac{-4\ln x}{1-x^2} \, dx \end{aligned} \]

Recalling the first half of the proof, we can now conclude: $$ \int_0^1 \frac{-\ln x}{1-x^2} \, dx = \frac{\pi^2}{8} $$
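This value can be double-checked numerically. A sketch of my own using the midpoint rule, which conveniently never samples the endpoints where \( \frac{-\ln x}{1-x^2} \) needs special care:

```python
import math

N = 200_000  # arbitrary number of midpoint cells
h = 1.0 / N

def f(x):
    # -ln(x)/(1 - x²): integrable log singularity at 0, removable singularity at 1
    return -math.log(x) / (1 - x * x)

approx = sum(f((i + 0.5) * h) for i in range(N)) * h
print(approx, math.pi ** 2 / 8)  # both ≈ 1.2337
```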

Using the geometric expansion \(\displaystyle \frac{1}{1-x^2} = \sum_{n=0}^\infty x^{2n}\) defined for \(x \in (-1,1)\), then:

\[ \begin{aligned} \frac{\pi^2}{8} &= \int_0^1 \frac{-\ln x}{1-x^2} \, dx \\ &= \int_0^1 \sum_{n=0}^\infty -x^{2n} \ln x \, dx \\ &= \sum_{n=0}^\infty \int_0^1-x^{2n} \ln x \, dx \\ &= \sum_{n=0}^\infty \left( \lim_{a\rightarrow 0^+} \left.\left[ \frac{-x^{2n+1} \ln x}{2n+1} \right] \right|_a^1 + \int_0^1\frac{x^{2n}}{2n+1} \, dx \right) \\ &= \sum_{n=0}^\infty \left. \left[ \frac{x^{2n+1}}{(2n+1)^2} \right] \right|_0^1 \\ &= \sum_{n=0}^\infty \frac{1}{(2n+1)^2} \\ &= \zeta(2) - \frac{1}{4}\zeta(2) \\ &= \frac{3}{4}\zeta(2) \\ \zeta(2) &= \frac{\pi^2}{6} \end{aligned} \] \(\Box\)

The rest of the post consists of various important steps and their justifications. This was a great opportunity to see (1) the subtlety involved in infinite processes and (2) the care that needs to be taken when writing proofs.

The very first line of the proof already requires a rather advanced theorem of Calculus. The fact that $$ \iint_R f(x,y) \, dA = \iint_R f(x,y) \, dx \, dy = \iint_R f(x,y) \, dy \, dx $$

should not be taken for granted. It is a result of our \(f(x,y)\) being positive and continuous on the region \(R = [0,\infty) \times [0, \infty)\). My Calculus textbook gives without proof the analogous theorem applied to bounded regions and refers to advanced calculus texts for a proof. A counterexample I saw for the bounded region version:

\[ \int_0^1 \int_0^1 \frac{x^2-y^2}{(x^2 + y^2)^2} \, dx \, dy \neq \int_0^1 \int_0^1 \frac{x^2-y^2}{(x^2 + y^2)^2} \, dy \, dx \] You can evaluate the two integrals to verify for yourself. These integrals fail to behave as I expected because \(f(x,y)\) is not continuous or even defined at the origin. Fortunately, if you let \(0 < b <1\) then: \[ \int_b^1 \int_b^1 \frac{x^2-y^2}{(x^2 + y^2)^2} \, dx \, dy = \int_b^1 \int_b^1 \frac{x^2-y^2}{(x^2 + y^2)^2} \, dy \, dx = 0 \]
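For reference, the two iterated integrals evaluate to \( \mp\pi/4 \). A sketch of my own that checks this numerically: the inner antiderivative is \( \int \frac{x^2-y^2}{(x^2+y^2)^2}\,dx = \frac{-x}{x^2+y^2} \) (my computation, not from the source), so only the outer integral needs quadrature:

```python
import math

# Iterated integrals of f(x, y) = (x² - y²)/(x² + y²)² over [0, 1]².
N = 100_000
h = 1.0 / N

def inner_dx(y):          # ∫₀¹ f(x, y) dx = [-x/(x² + y²)]₀¹ = -1/(1 + y²)
    return -1.0 / (1.0 + y * y)

def inner_dy(x):          # ∫₀¹ f(x, y) dy = +1/(1 + x²), since f(x, y) = -f(y, x)
    return 1.0 / (1.0 + x * x)

dx_dy = sum(inner_dx((i + 0.5) * h) for i in range(N)) * h  # → -π/4
dy_dx = sum(inner_dy((i + 0.5) * h) for i in range(N)) * h  # → +π/4
print(dx_dy, dy_dx)
```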

In my Probability and Stats class, we change the order of integration however we want and I had assumed this was given by the definition of double integrals. It's actually justified because the probability densities are nonnegative and continuous on the plane.

The second step I will justify is the partial fraction. As you may have noticed, the partial fraction as written requires \(x \neq 1\) but the partial fraction in the integral should be defined for \([0, \infty)\). The proper way to fix this would have been to split the original as I did later on: \[ \int_0^\infty \int_0^\infty f(x,y) \, dy \, dx = \int_0^1 \int_0^\infty f(x,y) \, dy \, dx + \int_1^\infty \int_0^\infty f(x,y) \, dy \, dx \]

But since the partial fraction consisted of two terms, I chose to save space and postpone the split for later. This was mostly an aesthetic decision.

I'll save the next big step for last and deal with a rather minor one: is the following claim justified? \[ \sum_{n=0}^\infty \frac{1}{(2n+1)^2} = \sum_{n=1}^\infty \frac{1}{n^2} - \frac{1}{4} \sum_{n=1}^\infty \frac{1}{n^2} \]

Well, it would definitely be true if I knew that the summations on the right-hand side converged. For this, one need only use the integral test. Since \(\int_1^\infty x^{-2} \, dx = 1\), \(\zeta(2)\) exists. In fact, it would be nice to check this first before starting the proof. Know a limit exists before attempting to find it.

Now for the most crucial of steps, which, if you have not guessed it by now, is the simple-looking: \[ \int_0^1 \frac{-\ln x}{1-x^2}\, dx = \sum_{n=0}^\infty \int_0^1-x^{2n} \ln x \, dx \]

I suspected the justification lay somewhere in the fact that the series was uniformly convergent, and somehow uniform convergence allows one to “extract” summation signs from integrals. In the referenced article, the author claims the step follows from the “monotone convergence theorem”. Thinking he was referring to the elementary version I learnt in class, I struggled for weeks trying to justify it with this in mind. Eventually, I found out that the theorem he mentioned was another advanced theorem involving Lebesgue integrals. Without the drive to try and learn measure theory from scratch, I returned to the uniform convergence idea. What follows is my attempt to justify the step without recourse to Lebesgue integrals. I would not be surprised if there is a gap in my proof.

I did a lot of research on what uniform convergence really means and read the theorems and proofs that explain how uniformly convergent function sequences/series behave as we'd expect. Eventually, I realized the functions I had did not uniformly converge on \([0,1]\); they were not even defined at the origin. Fixing this problem was simple: let \(\displaystyle \frac{-\ln x}{1- x^2} = -\ln x + \frac{-x^2 \ln x}{1-x^2}\).

Unfortunately, it is still not uniformly convergent since the series converges to the discontinuous function: $$ \sum_{n=1}^\infty -x^{2n} \ln x = \begin{cases} 0 & x=0\\ {~} \\ \dfrac{-x^2\ln x}{1-x^2} & 0<x<1 \\ {~} \\ 0 & x = 1 \end{cases} $$

However, it is indeed uniformly convergent on any interval \([0,k]\) when \(0<k<1\). So for any such \(k\): $$ \int_0^k \frac{-x^2\ln x}{1-x^2} \, dx = \sum_{n=1}^\infty \int_0^k -x^{2n} \ln x \, dx $$

By the boundedness theorem, on \([0,1]\): $$ \left| \frac{-x^2 \ln x}{1-x^2 } \right| < M \qquad \text{ for some } M $$

And since each term of the series is positive, for all \(m \in \mathbb{N}\) and all \(x \in [0,1]\): $$ 0 \le \sum_{n=1}^m -x^{2n} \ln x \le \frac{-x^2 \ln x}{1-x^2} < M $$

So: $$ 0 \le \sum_{n=1}^m \int_k^1 -x^{2n} \ln x \, dx \le \int_k^1 \frac{-x^2\ln x}{1-x^2} \, dx < M(1-k) $$

Since we can make the difference between the two integrals arbitrarily small by choosing an appropriate \(k \in (0,1)\), then it must be that: $$ \int_0^1 \frac{-x^2\ln x}{1-x^2} \, dx = \sum_{n=1}^\infty \int_0^1 -x^{2n} \ln x \, dx $$
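Each term integral in this exchange has the closed form \( \int_0^1 -x^{2n}\ln x \, dx = \frac{1}{(2n+1)^2} \), by the integration by parts in the main proof, and this is easy to spot-check numerically. A sketch of my own using the midpoint rule:

```python
import math

def term_integral(n, cells=200_000):
    """Midpoint-rule estimate of the term integral ∫₀¹ -x^(2n) ln(x) dx."""
    h = 1.0 / cells
    return sum(-((i + 0.5) * h) ** (2 * n) * math.log((i + 0.5) * h)
               for i in range(cells)) * h

for n in range(4):
    print(n, term_integral(n), 1 / (2 * n + 1) ** 2)  # estimate vs. 1/(2n+1)²
```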

This concludes the commentary. In my readings on these calculus topics, I've come to learn so much about the limitations of what one can do with Riemann integrals. I get why there was a need for the Lebesgue integral, and hopefully I will find some time to get acquainted with it. Lastly, the two big theorems I used to justify steps in this post are so rich they deserve standalone posts. I can guarantee this is not the last you'll see of uniform convergence on this blog.

**Bibliography**

Ritelli, Daniele. "Another Proof of \(\zeta(2)=\frac{\pi^2}{6}\) Using Double Integrals." *Amer. Math. Monthly* 120.7 (2013): 642-645.