Saturday, 31 March 2012

Statistical Inference Using JMP


The first two questions below demonstrate how to use JMP to analyse statistical data. They are followed by lab exercises that put the material to work.

1. Confidence Interval 

This question demonstrates the concept of a confidence interval. Open the script "confidence.jsl" from the sample scripts folder and run it. The script draws 100 samples, each of size 20, from a Normal distribution. For each sample, it computes the mean and a 95% confidence interval. Each interval is graphed in gray if it captures the overall mean and in red if it doesn't.



Press Ctrl+D to generate another series of 100 samples. Each time, note the number of times the interval captures the theoretical mean. The ones that don’t capture the mean are due only to chance, since we are randomly drawing the samples. For a 95% confidence interval, we expect that around five will not capture the mean, so seeing a few is not remarkable.
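The experiment described above can also be sketched outside JMP. The following Python snippet is only an illustration, not the contents of confidence.jsl: the function name, the seed and the choice of a standard Normal population are assumptions, and the critical value 2.093 is the two-sided 95% value of Student's t with 19 degrees of freedom, so it is valid only for samples of size 20.

```python
import random
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom.
# It matches sample_size=20 only; other sizes need a different value.
T_CRIT_DF19 = 2.093


def simulate_coverage(n_samples=100, sample_size=20,
                      true_mean=0.0, true_sd=1.0, seed=1):
    """Count how many 95% intervals capture the true mean."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.mean(sample)
        # Half-width of the interval: t * s / sqrt(n)
        half_width = T_CRIT_DF19 * statistics.stdev(sample) / sample_size ** 0.5
        if mean - half_width <= true_mean <= mean + half_width:
            captured += 1
    return captured


if __name__ == "__main__":
    print(simulate_coverage(), "of 100 intervals captured the true mean")
```

Changing the seed plays the same role as pressing Ctrl+D: each run gives a different count, usually close to 95.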

Below are a few samples produced at random:







Change the confidence level by clicking on confidence level below the graph and typing 5. The following is produced as a result:







2. Case Study: The Earth’s Ecliptic

In 1738, the Paris observatory determined with high accuracy that the angle of the earth’s spin was 23.472 degrees. However, someone suggested that the angle changes over time. An examination of historical documents found five measurements dating from 1460 to 1570. These measurements were somewhat different from the Paris measurement, and they were made using much less precise methods.

The question is whether the differences can be attributed to measurement error in the earlier observations, or whether the angle of the earth’s rotation actually changed. We need to test the hypothesis that the earth’s angle has actually changed.

H0: Earth’s angle has not changed; the mean of the earlier values is not different from the one measured in Paris. Hypothesized mean = 23.472.
HA: Earth’s angle has changed and the means are different.

• Open Cassub.jmp
• Analyze > Distribution, set obliquity as Y variable

Test whether the mean of these values is different from the value from the Paris observatory. Our null hypothesis is that the mean is not different.
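A test of this kind compares the sample mean with the hypothesized mean in units of the standard error. The Python sketch below shows that calculation; it is an illustration rather than JMP's own code. The function name is hypothetical, and the five numbers are placeholders chosen for the example, NOT the historical measurements in Cassub.jmp.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return (mean - hypothesized_mean) / std_err, n - 1


# Placeholder values for illustration only; not the real obliquity data.
placeholder = [23.51, 23.49, 23.50, 23.48, 23.52]
t_stat, df = one_sample_t(placeholder, 23.472)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The t statistic is then compared with Student's t distribution on n - 1 degrees of freedom to obtain a p-value, which is what JMP reports.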









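The sample mean is 23.499, which differs from the hypothesized Paris value (23.472). A difference in the sample mean is not by itself evidence of change; the conclusion rests on the p-value of the test of the mean shown in the report. The conclusion drawn here is that the null hypothesis is rejected.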


LAB EXERCISES

1. The file movies.jmp contains a list of the top grossing movies of all time (as of June 2003). It contains data showing the name of a movie, the amount of money it made in the United States (Domestic) and in foreign markets (in millions of dollars), its year of release, and the type of movie.

Below is a histogram of the movie types, created by selecting Analyze --> Distribution and then dragging types into the "Y, Columns" box:














There are five levels in this variable: Action, Comedy, Drama, Family, Mystery-Suspense. Of those listed above, there are 56 Action movies, 69 Comedy, 77 Drama, 45 Family and 29 Mystery-Suspense.

Below is the histogram of the domestic gross for each movie which is produced in the same way as above:














The values of this variable range from 100 to 600.8 million dollars, and the average domestic gross of these movies is 157.48 million dollars, as shown in the Quantiles and Moments sections above.

There are outliers (shown on the right side of the box). Look for outliers in the datasheet; they should be extreme values that are not within the expected range. To check if your guess of outliers was right, place the pointer on one of the points in the outlier box and it should show which movie it was.


Create a subset of the data consisting of only drama movies, then produce a histogram of it. To create the subset, double click on Drama in the histogram produced for the types of movies. This produces a table of only the drama movies, which can be used to create the histogram shown below:









The average domestic and foreign grosses for the subset are found to be $166.10 million and $322.81 million respectively.
The plots have a few outliers, as seen in the pictures above.


2. Open the Analgesics file from the sample data. This file contains the results of a study of the effect of three different pain relievers on the amount of pain experienced by the patients. The only classification of patients in the study is by gender. Below are histograms of the variables gender, drug and pain, created by dragging all three variables into the "Y, Columns" box:


























Click on the histogram bars; it can be seen that females used much more of drug A than of B or C, and that their pain levels lie between 0 and 10, with most females having a pain level of around 7.5.

The distribution for females can be seen below:




It can be seen that males consumed almost equal amounts of each drug but experienced much higher levels of pain, with pain levels between 5 and 17.


The distribution for males can be seen below:



Males consumed almost the same amounts of drugs B & C as females, but much less of drug A. Males also had much higher levels of pain than females, according to the study.

Next, analyze the distribution of the amount of pain reported for each of the drugs A, B and C. This can be done by placing pain in the "Y, Columns" box and drug in the "By" box, which produces one report per drug.

Drug A























Drug B






















Drug C




3. Open the file Scores.jmp from the sample data. This file contains data from a study conducted in the United States to obtain the results of 5000 students on tests of Calculus and Physics. The results are separated into the four regions of the US. Some students took the Calculus test, some the Physics test and some both.

A histogram of the results is shown below:
























The mean score on the Calculus test was 452.06227 and on the Physics test was 417.11735.

Clicking on the histogram bars shows that most students who received high scores on the Calculus test also received high scores on the Physics test.

An example showing that students who received high scores on one test also received high scores on the other:
























Another example showing that students who received low scores on one test also received low scores on the other:
























Moreover, a graph of Physics scores versus Calculus scores shows a linear relationship:
























Below are the mean values for the Calculus scores for each of the four regions:

Region 1:























Region 2:























Region 3:






















Region 4:






















The mean scores on the Calculus tests for all four regions are almost equal.

Do the same for the Physics tests.

Region 1:




Region 2:



Region 3:


Region 4:























The mean Physics scores in all four regions of the US also seemed to be almost equal.

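On an equivalent earlier test, the mean score of United States Calculus students was 450. The current sample mean of 452.06 is slightly higher; whether this reflects a real increase should be judged from the confidence interval below rather than from the sample mean alone.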

Construct a 95% confidence interval for the mean calculus score by clicking on the red triangle to the left of "Calculus Score" --> Confidence Interval --> 0.95 to get the following:






Physics teachers say that the overall United States score on the Physics test should be higher than 420; however, the data does not support that since the mean Physics score is below 420.

Construct a 95% confidence interval for the mean Physics score to get the following:





4. Open hotdogs.jmp from the sample data. The results came from the investigation of taste and nutritional content of hot dogs. Below is the histogram of the results:

















The number of hot dogs of each type is roughly equal.

The $/oz variable in this file represents the cost in dollars per ounce of hot dog. An outlier plot is created which is shown below:



The two outlier points (top right) represent "General Kosher Beef" and "Wall's Kosher Beef Low Fat".


The caloric content of the three types of hot dogs is shown below, each with a 95% confidence interval.

Type = Beef:

























Type = Meat:
























Type = Poultry:


















On average, hot dogs made with poultry have the lowest caloric content compared with beef and meat.

To test the conjecture that the mean sodium content of all hot dogs is 410 milligrams, a histogram of the sodium content by type of hot dog (meat) was produced. The mean sodium content was found to be around 418.5 milligrams.

5. The z-test and the t-test are both used to test hypotheses about means, for example whether a sample mean differs from a hypothesized value or whether the means of two groups differ. The z-test applies when the population standard deviation is known, or as an approximation when the sample is large; the t-test is used when the standard deviation must be estimated from the sample. The degrees of freedom are a parameter of Student's t distribution; for a one-sample test they equal the sample size minus one. They determine the shape of the t distribution, not the number of means being compared. As the degrees of freedom grow, the t distribution approaches the standard Normal distribution, so the z-test and t-test give essentially the same result for large samples.
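The convergence of the t distribution to the Normal distribution can be checked numerically. The sketch below is an illustration using only the Python standard library; the helper names are hypothetical, and the t critical value is found by integrating the t density with Simpson's rule and bisecting, which is a simple numerical approach rather than the method any statistics package uses.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, via Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, level=0.95):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    z = NormalDist().inv_cdf(0.975)
    for df in (4, 19, 100, 1000):
        print(f"df = {df:4d}: t critical = {t_critical(df):.3f}")
    print(f"Normal:    z critical = {z:.3f}")
```

With few degrees of freedom the t critical value is noticeably larger than the z value, which is why small samples give wider confidence intervals.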









Thursday, 15 March 2012

JMP Right In
Part I:

1. JMP Help:
The simplest way to get help while using JMP is through the Help menu on the menu bar. The Help menu has a list of options from which you can choose, as shown below:


Three of the help options offered by JMP are the tutorial, the search and the tip of the day. The JMP Beginner's Tutorial is a step-by-step, user-friendly guide to some of JMP's commands; try it to learn the basics, such as how to create tables and produce plots. The search command, which is used very often, lets the user look up data or a command by name. The tip of the day gives the user a different tip about JMP each day.
A JMP data table has a spreadsheet-like view, and several tables can be open at a time. It supports data management operations such as sorting, concatenation and updating values. Finally, one of its impressive features is: "Design experiment: creates data tables that contain traditional screening, response surface, mixed level, mixture, Taguchi arrays and full factorial designs as well as custom and augmented designs" - http://www.jmp.com/support/abcguide/data_table_features.shtml

JMP uses JMP Scripting Language (JSL) which allows JMP users to program JMP to repeat analyses, write custom programs to manipulate data in complex ways (including full matrix algebra support), and even extend JMP's graphical and analytical capabilities. - http://www.jmp.com/support/downloads/jmp_scripting_library/index.shtml

The JMP journal is a way to record a data analysis process and its results. It offers a tool that can be useful for preparing presentations. The Journal can capture JMP files, graphs and scripts as well as launch dialogs, reports, and scripts. 


Moreover, the "jmp.com" website is a very helpful resource; it describes the features of different commands and teaches how they can be used.

JMP is statistical software that allows the user to explore and interact with data. It provides visual capabilities that allow users to present their data in the most effective way. Thus, JMP allows users to design experiments, analyze the results and present them visually.

2. JMP has many sample data tables available to the user. We are going to work with one of these, called "students.jmp".

Windows sample data is usually installed at: C:\ProgramFiles\SAS\JMP7\EnglishSupport Files\Sample Data
Next, open students.jmp from the above location and the following table will show up:




The above table shows 19 of the 233 student records. It has six columns: number, age, sex, height, weight and ID number. There are 233 rows in the table, one for each student.

3. From the menu bar, click on Analyze then Distribution, and the following dialog appears:



Click on weight, then on Y, Columns; then click on age and again on Y, Columns. You could also drag weight and age to the box to the right of Y, Columns. Now weight and age are assigned as the Y variables (the columns to analyze). Click "OK" and the following histograms show up:




The histogram displays how many students are of certain weights or certain ages. For example, it can be seen that most students are between 110-120 pounds and of age 12. 

The moments and quantiles show the details of the data quantitatively; they display statistical information such as the mean, median and standard deviation.
JMP produces different output for each variable because the variables differ in nature. Age is a whole number reported by the students, while weight is a measurement that carries uncertainty and can vary from day to day.

4. If one of the bars of a histogram is clicked, for example the bar between 80 and 90 pounds on the weight histogram, the corresponding students are highlighted in the age histogram. Thus, it shows the age distribution of the students weighing between 80 and 90 pounds.

5. Go to Help on the menu bar, select Beginner's Tutorial and go through it. Once you are done, change the layout of the histogram to horizontal as follows:

6. From left to right, each of the tools shown above does the following:

(i) The pointer is used for selection; it can be used to select rows, columns, bars of a histogram, etc.
(ii) The question mark is used to access help.
(iii) The crosshair tool is a set of axes with the empty space at its centre as the origin. It is used with plots to read the x and y values at the point where the mouse is clicked.
(iv) This tool is used to select a graph, by clicking on a corner and dragging across the whole graph, so that it can be pasted into a document, such as a Word document.
(v) The hand tool is used to drag graphs in all directions, simply by clicking the mouse and moving it.
(vi) The brush tool is used to select multiple points from a graph.
(vii) The lasso tool is used to select irregular shapes by going around them.
(viii) The magnifying glass zooms in with every click.

The following icons are used in most graphs and here is what each means:



Part II

1. The active areas of the data table are shown below:


2. In this part, we are going to create a table using JMP. The table gives the temperature in Abu Dhabi, New York, Shanghai and Florence during different months of the year. 

To create the table, click on File from the menu bar, then New and finally Data Table. Next, choose Cols from the menu bar, then Add Multiple Columns. Set the number of columns to 5. Type the column names in their corresponding positions in the table. Next, define the column characteristics. Since the "month" column holds text, its data type should be changed from the default, which is numeric, to character. The other four columns need not be changed because they are already numeric. To set the column characteristics, right click on the column title and select Column Info. The following window shows up:


After the columns are adjusted, add rows by clicking on Rows then Add Rows. Set the number of rows to 5. Save the file with the name Temp_NYU.jmp. Enter the data into the table as shown:


3. Now plot the data by selecting Graph then Chart. Assign month to X and New York, Abu Dhabi, Shanghai and Florence to Y. Click OK to get the following graph:



The graph can be changed to a different view, such as a line graph, by clicking on the red triangle to the left of the word "Chart" and selecting Y Options then Line Chart. The graph should now look like this:

4. Click on the help command from the menu bar and go through the pie chart tutorial. After completing it, try producing the above graph as a pie chart.

Part III:

In this part, we will learn how to use a formula editor, a tool used to build formulae and calculate values of cells in a column. Formulae can be built using other columns in the table, standard functions and constants.

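To calculate the standardized value of each value $x_i$ of a numeric variable, use the following formula:

$z_i = \dfrac{x_i - \bar{x}}{s_x}$

where $\bar{x}$ is the mean and $s_x$ the standard deviation of the variable.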


This is how the formula editor is used:

1. Open Students.jmp from the sample data tables.
2. You are going to find the standardized weight. Create a new column named Std. Weight.
3. Right click on the newly created column, select Column Info and click on Column Properties.

4. Choose Formula from the drop-down list; this opens the formula editor window, as shown in the figure.


5. With no part of the formula highlighted, click on weight in the columns list.
6. Click on the minus sign in the keypad.
7. With the new entry highlighted (in red), click on weight in the columns list.
8. With this section still highlighted, click on Statistical in the functions list and choose Col Mean from the menu.
9. Click anywhere in the white space and select the entire expression.
10. With the entire expression selected, click on the division symbol in the keypad.
11. Click on weight in the columns list, then choose Col Std Dev from the Statistical functions menu.
12. Click OK to see the Std. Weight column filled with the standardized weights.
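The formula built in the steps above subtracts the column mean from each value and divides by the column standard deviation. A minimal Python sketch of the same calculation follows. The function name is hypothetical, the weights are placeholder numbers rather than values from Students.jmp, and it is assumed here that the standard deviation is the sample standard deviation (dividing by n - 1).

```python
import statistics


def standardize(values):
    """Return (x - mean) / sd for every x, using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1), an assumption
    return [(x - mean) / sd for x in values]


# Placeholder weights for illustration only; not taken from Students.jmp.
weights = [84, 105, 99, 112, 128]
for w, z in zip(weights, standardize(weights)):
    print(f"{w:4d} -> {z:+.2f}")
```

After standardizing, the new column has mean 0 and standard deviation 1, which makes variables measured in different units comparable.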



Another example generates the Fibonacci Series as follows:

1. File > New > Data Table
2. Rows > Add Rows, and choose 10 rows.
3. Name the first column ‘Fib’. Right click on it and attach a formula to the column.
4. Choose If from the Conditional group of functions.
5. Select a<=b from the Comparison function list.
6. Then choose Row from the Row function list as the first argument.
7. Select the second argument of the comparison and type 2.
8. Select the then clause of the If statement and type ‘1’.
9. Select the else clause of the If statement.
10. Click on the ‘+’ key on the keypad and enter ‘Fib’ as the first term.
11. Select Row > Subscript, then Row > Row, -, 1, so that the first term reads Fib[Row()-1].
12. Repeat for the second term to add Fib[Row()-2].
13. Click ‘ok’ and look at the table.

The Row > Row() function returns the number of the row currently being evaluated, so it can be used inside the formula itself. The formula built above reads: if Row() <= 2, the value is 1; otherwise it is Fib[Row()-1] + Fib[Row()-2]. For each row, JMP checks the row-number condition. The first two rows are set to 1; for every later row, JMP takes the value of the previous row and of the row before that and adds them to give the current row. The Subscript function retrieves the value of a column in the row given by the subscript. Thus Fib[Row()-1] means that the formula editor is reading the previous row of the Fib column.
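The same row-by-row logic can be sketched in Python. This is only an illustration of the recurrence, not JSL; the function name is hypothetical, and rows are numbered from 1 to mirror the Row() function.

```python
def fib_column(n_rows):
    """Fill a column the way the formula does: 1 for rows 1-2, then the sum of the two rows above."""
    column = []
    for row in range(1, n_rows + 1):  # rows are numbered from 1, as in JMP
        if row <= 2:
            column.append(1)
        else:
            # Python lists are 0-based: row - 1 is index row - 2, row - 2 is index row - 3.
            column.append(column[row - 2] + column[row - 3])
    return column


print(fib_column(10))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```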



In another example, we will use one of the sample data tables, called Pendulum.jmp. This file contains the results of an experiment in a physics class comparing the length of a pendulum to its period. Calculations were made for a range of pendulums from short (2 cm) to long (20 m). We will use the calculator to determine a model to predict the period of a pendulum from its length.

1. Select Analyze then Fit Y By X, and make Period the Y variable and Length the X variable to produce a scatter plot:


2. Create a new column named Transformed Period that contains a formula to take the square root of the Period column. Produce a scatter plot of Transformed Period vs. Length. The following graph is produced:
It can be seen from the graph above that the relationship is NOT linear.

3. Try other transformations until the scatter plot looks linear. Try the square, reciprocal and natural log of the period; you will find that only the square gives a straight line:

4. Find the line of best fit for the linear transformed data by selecting Fit Line from the popup menu beside the title of the scatter plot. Here it is:

5. This line is not the fit of the original data, but of the transformed data. Substitute the transformation used to linearize the data and solve the equation for Period. Since the square of the period was fitted, the line is $\text{Period}^2 = 0.727 + 4.064L$, which gives $\text{Period} = \sqrt{0.727 + 4.064L}$.

6. Create a new column that uses this formula to calculate the theoretical values. Next, create another column that calculates the difference between the values observed by the students and the theoretical values.




7. Examine a histogram of these differences to check if there was a trend in the observations of the students. 

The histogram shows that the differences lie within the ±0.05 range.

8. The relationship between the period and length in physics is given by the formula: $T = 2\pi\sqrt{L/g}$
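Squaring the pendulum formula gives T² = (4π²/g)L, which is why the square of the period is linear in the length. The Python sketch below compares that theoretical slope with the fitted coefficients quoted in step 5. It is an illustration under stated assumptions: lengths in metres, periods in seconds, g = 9.81 m/s², the fitted line taken to describe the squared period, and hypothetical function names.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def theoretical_period(length_m, g=G):
    """Period of an ideal pendulum: T = 2*pi*sqrt(L/g)."""
    return 2 * math.pi * math.sqrt(length_m / g)


def fitted_period(length_m, intercept=0.727, slope=4.064):
    """Period from the fitted line, assuming it models the squared period."""
    return math.sqrt(intercept + slope * length_m)


# Slope of T^2 against L predicted by the physics formula.
theoretical_slope = 4 * math.pi ** 2 / G

if __name__ == "__main__":
    print(f"theoretical slope of T^2 vs L: {theoretical_slope:.3f}")
    for length in (0.5, 1.0, 5.0):
        print(f"L = {length:4.1f} m: theory {theoretical_period(length):.3f} s, "
              f"fit {fitted_period(length):.3f} s")
```

The theoretical slope is close to the fitted slope of 4.064, while the fitted intercept has no counterpart in the ideal formula.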