Least Squares Curve Fitting
Now let's move on to least squares curve fits of experimental data. "Experimental data" refers to any type of data that is subject to imperfect measurement. For example, most experimental data is collected by sensors, and all sensors suffer from inaccuracy, nonlinearity, noise, aging, miscalibration, and so on, all of which contaminate the readings to some degree. If all the readings fall on the same side of the true value we say the sensor is biased. It is usually pretty obvious when a sensor shows bias, and there is usually a calibration procedure that can re-center it. But even without bias the sensor readings will still wander or jitter on both sides of the true value. Averaging can remove unbiased error from experimental data, since roughly as many flaky readings will land above the true value as below it. But some types of data cannot be averaged, and we must try other techniques to see past the noise in those readings.
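As a quick, made-up illustration of that last point about averaging (the numbers here are invented and are not part of this section's data): averaging many noisy but unbiased readings of a single quantity pulls the estimate toward the true value, because the positive and negative errors largely cancel.

```python
import random

# Invented example: 1000 noisy, unbiased readings of a true value of 10.0.
true_value = 10.0
readings = [true_value + random.gauss(0.0, 0.5) for _ in range(1000)]

average = sum(readings) / len(readings)
print(f"first reading: {readings[0]:.3f}   average of 1000 readings: {average:.3f}")
# The average lands much closer to 10.0 than a typical single reading,
# because the positive and negative errors mostly cancel out.
```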
As an example of experimental data that cannot be averaged, let's consider the following velocity versus distance measurements made by telescope observation of 8 galaxies:
These measurements were made with the Hubble telescope. Back in 1924 Edwin Hubble first discovered that not all the twinkling lights we see in our nighttime sky are stars. Instead, many (in fact, billions) of these lights are other galaxies just like our own Milky Way galaxy. Hubble (for whom the Hubble telescope was named) then determined that these other galaxies are all moving away from each other, indicating that our universe is expanding, as if it had been blown apart. This is where the "big bang" theory of the origin of the universe comes from. Hubble further observed that the most distant galaxies were moving with the greatest speeds, and he hypothesized that speed was linearly proportional to distance. Today, this proportionality factor is known as the Hubble constant. The measurements that Hubble was able to collect in 1929 to gauge the Hubble constant were lousy, but the table above shows measurements made in 2001 using the Hubble telescope. These 8 data points are plotted below.
Clearly these 8 data points don't fall on a straight line. But this could be because, even today, it is still very difficult to accurately measure these quantities. Hence the data is probably corrupted by measurement noise. We want to remove the noise and determine the straight line that we believe is buried in these measurements. We can't average the data points because they are not the same measurement (they are supposed to be spread out across a line). But we can fit a straight line to the measurements in a way that best matches the data. This process is called curve fitting and is very easy for humans to do. You just hold a transparent ruler up to the plot and adjust it until it centers on the data. One possible straight line fit is shown below:
Notice that 1 data point is significantly above the fitted line and 3 are significantly below. Yet our eye tells us that this straight line best fits the data. Our goal isn't to find a line that places the same number of data points on either side of the line. Instead we want the line that minimizes the error between the data points and the fitted line. This error is shown by the vertical bars in the plot below:
I claim that there is no other straight line you can draw that results in a smaller total error (i.e., less red ink in the previous plot). I can be confident of this claim because there is a mathematical procedure that finds the particular straight line minimizing the error bars seen in the previous plot. This mathematical procedure is known as a least squares curve fit, and the blue line is the result of applying this algorithm to the Hubble data.
The "least" portion of the name indicates that we are minimizing the error in the curve fit. The "squares" portion of the name indicates that we are squaring the error bars in order that all the errors become unsigned. That is, the error is the difference between the data point and the fitted line. When the error bar is above the fitted line the error is positive and when the error bar is below the fitted line the error is negative. We want to treat positive and negative errors identically so as to center the fitted line between all the data points. Therefore we square the error terms to remove their signs.
The mathematical equations that describe the slope and intercept of the best fit straight line are:
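Setting the partial derivatives of E with respect to m and b to zero and solving gives the standard least squares result (any correct statement of eq. [1] is algebraically equivalent to this):

$$
m \;=\; \frac{n \sum x_i y_i \;-\; \sum x_i \sum y_i}{\,n \sum x_i^2 \;-\; \left(\sum x_i\right)^2\,},
\qquad
b \;=\; \frac{\sum y_i \;-\; m \sum x_i}{n}
$$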
These equations may look forbidding, but they are easy to compute since the RPN calculator does all the legwork. The Greek letter Sigma in these equations indicates "summation" over all the experimental data. This type of summation is performed by the RPN calculator every time you employ the Sum+ keystroke. The terms needed for eq. [1] are automatically collected in the following registers of the calculator:
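The exact register numbers are a detail of this particular calculator, but conceptually the quantities being accumulated are the count n and the sums of x, y, x², and x·y. A minimal sketch of what each Sum+ press does (the class and field names below are placeholders for illustration, not the calculator's actual register layout):

```python
# Sketch of the running sums a Sum+ keystroke maintains.
# Field names are illustrative placeholders, not actual register numbers.
class SumRegisters:
    def __init__(self):
        self.n = 0         # count of accumulated data points
        self.sum_x = 0.0   # sum of x values
        self.sum_y = 0.0   # sum of y values
        self.sum_xx = 0.0  # sum of x*x values
        self.sum_xy = 0.0  # sum of x*y products

    def sum_plus(self, x, y):
        """One Sum+ keystroke, with x in the X stack location and y in Y."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y
```

Clearing the summation registers (ClearSum) corresponds to resetting all of these sums to zero before starting a new calculation.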
I am delaying teaching you about the calculator's registers (also known as "data memory") until we get to the section on data memory. For now, just realize that the Sum+ keystroke employs certain calculator resources which you must clear via the ClearSum (or ClearREG) keystroke before starting a new statistical calculation.
After entering the experimental data using the Sum+ key, you are ready to hit the LR key (LR stands for linear regression), which will cause the calculator to compute eq. [1]. The slope of the best fit straight line will be placed into the X stack location and the intercept will be placed into the Y stack location (just as with the Mean and StdDev keystrokes, these new numbers overwrite the X and Y stack locations and hence the stack does NOT lift).
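Continuing the sketch above, this is roughly the arithmetic the LR keystroke performs; it is just eq. [1] evaluated from the accumulated sums, not the calculator's actual internal code:

```python
def linear_regression(regs):
    """Evaluate eq. [1] from the accumulated sums, mirroring the LR keystroke."""
    n = regs.n
    denominator = n * regs.sum_xx - regs.sum_x ** 2
    slope = (n * regs.sum_xy - regs.sum_x * regs.sum_y) / denominator
    intercept = (regs.sum_y - slope * regs.sum_x) / n
    return slope, intercept  # slope -> X stack location, intercept -> Y
```

Feeding the 8 (distance, velocity) pairs through sum_plus and then calling linear_regression should reproduce the slope and intercept quoted below.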
So let's try this process on the 8 astronomical observations in the prior table. This time we will be entering both x and y values prior to each Sum+ keystroke, and we want to be careful to get the graph's y values into the Y stack location, or else we will end up solving for the slope and intercept of a flipped (mirror imaged) graph. Start by clearing the summation registers and then use the following keystroke sequence to enter the first (x,y) data point:
Continue in this manner until all 8 data points have been accumulated. Then click the LR keystroke and observe that the X stack location holds 65.95. This is the slope of the best fit straight line. The Y stack location holds the y axis intercept value of 70.00. Based upon this data, the Hubble constant is computed as 65.95. That is, the Hubble constant is simply the slope of a plot of velocity (in km/sec) versus distance (in megaparsecs).
Theoretically, this straight line should go through the origin (i.e., the x = 0, y = 0 point where the axes intersect). The line we obtain from a least squares curve fit of the 8 galaxy measurements obviously doesn't go through the origin. With more data points (more galaxies) we could get an even better estimate. The generally accepted value for Hubble's constant is 72 +/- 8, from which scientists compute the age of the universe as 13 billion years! The +/- 8 term is basically a "truth in advertising" clause stating that scientists believe the Hubble constant must lie somewhere between 64 and 80. Incidentally, the oldest human fossils are only a few million years old, so humans have been around for only 0.02% of the universe's lifetime.
I should also mention that there are other types of least squares curve fits. For example, if you have a collection of experimental data points that you believe should fall on a parabola, there is a way to compute the particular parabola that best fits your data points in a least squares sense. But the RPN calculator can only perform straight line fits, which is why the keystroke is labeled LR, for "linear regression".
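As an illustration of such a higher order fit (something this RPN calculator cannot do), here is a minimal sketch using NumPy's polyfit routine, which performs a least squares fit of a polynomial of whatever degree you request. The data points below are invented purely for the example:

```python
import numpy as np

# Invented (x, y) measurements assumed to lie near a parabola.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 9.2, 18.8, 33.1, 51.0])

# Degree 2 -> least squares fit of y = a*x^2 + b*x + c.
a, b, c = np.polyfit(x, y, 2)
print(f"best fit parabola: y = {a:.3f}x^2 + {b:.3f}x + {c:.3f}")

# Degree 1 would instead reproduce a straight line (LR style) fit.
```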