Statistics

Standardise

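Set the mean to zero by subtracting the mean from all points (this step alone is centring). Full standardisation also divides by the standard deviation, giving z-scores.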

Normalize

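Divide by the standard deviation to express each point as a number of standard deviations. Applied after subtracting the mean, this gives the z-score. "Normalize" is also widely used for rescaling to a fixed range such as 0 to 1.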
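
A minimal sketch of the two steps together (subtract the mean, then divide by the standard deviation). The function name `standardise` and the example data are made up for illustration, and using the population SD (`pstdev`) rather than the sample SD is an assumption.

```python
import statistics

def standardise(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD, not sample SD
    return [(x - mean) / sd for x in values]

# Hypothetical data chosen so the mean is 5 and the population SD is 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardise(data)

print(z)
print("mean of z:", statistics.fmean(z))
print("sd of z:", statistics.pstdev(z))
```

After the transformation the values have mean 0 and standard deviation 1, so each one reads directly as "how many standard deviations from the mean".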

Standard deviation vs Standard error

SD describes the spread of the values in a sample; SE measures the precision with which a sample statistic (e.g. the mean) approximates the true population value. SE arises because you are working with a sample rather than the full population. From one sample you get a mean and a standard deviation. If you had multiple samples, you would get multiple means and multiple standard deviations. The standard deviation of those sample means is the standard error of the mean (likewise, the SD of the sample SDs is the standard error of the SD). The odd thing is that you can estimate the SE from a single sample: for the mean, SE = SD / sqrt(n), where n is the sample size. Use SD to describe variability in the data; use SE to express uncertainty about an estimate of a population value. The standard error is not an estimate of any quantity in the population, but a measure of the uncertainty of a single sample value as an estimate of the population value.
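
A small simulation can make the distinction concrete. This is only a sketch: the population (normal, mean 100, SD 15), the sample size, and the number of repeats are all made-up values. It compares the spread of many sample means with the SD / sqrt(n) estimate from a single sample.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 100.0, 15.0  # hypothetical population parameters
N = 25                          # size of each sample
REPEATS = 2000                  # number of repeated samples

def draw_sample(n):
    """Draw n independent values from the hypothetical population."""
    return [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# Many samples: the SD of their means is the standard error of the mean.
sample_means = [statistics.fmean(draw_sample(N)) for _ in range(REPEATS)]
empirical_se = statistics.stdev(sample_means)

# One sample: estimate the same quantity as SD / sqrt(n).
one_sample = draw_sample(N)
estimated_se = statistics.stdev(one_sample) / math.sqrt(N)

# Known here only because the population is simulated.
theoretical_se = POP_SD / math.sqrt(N)

print("SD of sample means:      ", round(empirical_se, 2))
print("SE from a single sample: ", round(estimated_se, 2))
print("sigma / sqrt(n):         ", round(theoretical_se, 2))
```

The three numbers should be close to each other, while the SD of any one sample stays near 15 however many samples are drawn.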

Spiegelhalter Art of Statistics

Getting things in proportion

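Communication is important, e.g. positive vs negative framing changes how a number is received. Relative risks can exaggerate an effect; give absolute risks as well for clarity. Expected frequencies (e.g. "x out of 100 people") are good for conveying importance. Choose graphics with care, especially the scales.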
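
To see why a relative risk on its own can exaggerate, here is a minimal arithmetic sketch. The counts (6 vs 7 cases per 100 people) are illustrative numbers invented for this example, not data from any study.

```python
# Illustrative, made-up expected frequencies per 100 people.
PER = 100
baseline_cases = 6  # cases expected among 100 unexposed people
exposed_cases = 7   # cases expected among 100 exposed people

baseline_risk = baseline_cases / PER
exposed_risk = exposed_cases / PER

# Relative increase: change as a fraction of the baseline risk.
relative_increase = (exposed_risk - baseline_risk) / baseline_risk

# Absolute increase: change in the risk itself.
absolute_increase = exposed_risk - baseline_risk

print(f"relative increase: {relative_increase:.1%}")
print(f"absolute increase: {absolute_increase * 100:.1f} percentage points")
print(f"expected frequency: {exposed_cases - baseline_cases} extra case per {PER} people")
```

The same change reads as roughly a 17% increase in relative terms but only one extra case per 100 people in absolute terms.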

Summarizing and communicating numbers

Single sets of numbers

Strip chart or dot diagram: shows all data, with dots offset vertically by a random jitter. Box and whisker: shows median, interquartile range, and outliers. Histogram: counts observations in intervals, shows shape of distribution. Log scale is good to stop extreme values dominating visuals. If the average (mean) is far from the median, the distribution is skewed and the mean is not very informative. Wisdom of crowds: the average of a crowd's guesses can be closer to the truth than a typical individual's guess. Range includes extremes so not very useful. IQR (25th to 75th percentile) is better, unaffected by extremes. Standard deviation is only appropriate for well-behaved symmetric data, because it is heavily influenced by extreme values.
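
A short sketch of how one extreme value affects these summaries, using made-up data. `statistics.quantiles` with its default method is used for the quartiles; other quartile conventions give slightly different values.

```python
import statistics

def summarise(values):
    """Return mean, median, range, IQR, and sample SD for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),
    }

# Hypothetical data: identical except that the largest value becomes extreme.
well_behaved = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

clean = summarise(well_behaved)
skewed = summarise(with_outlier)

for name in clean:
    print(f"{name:>6}: {clean[name]:8.2f} -> {skewed[name]:8.2f}")
```

The median and IQR do not move, while the mean, range, and standard deviation are all dragged upward by the single extreme value.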

Pairs of numbers

Show two numeric variables as a scatter plot, time series as line graphs.

Correlation measures a steadily increasing or decreasing relationship between a pair of variables. Pearson correlation coefficient is most common: it ranges from -1 to +1 and measures how close the points are to a straight line. Spearman rank correlation is an alternative: it uses ranks, so it can be close to 1 for any steadily increasing relationship, e.g. an upward curve. Conventionally x is the independent variable and y the dependent one, but deciding which is which is difficult: the relationship could be non-causal or run the other way around.
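
A minimal sketch of the Pearson vs Spearman difference on an upward curve. The data (y = x cubed) are made up, both coefficients are written out by hand rather than taken from a library, and the simple ranking helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical curved but steadily increasing relationship.
x = list(range(1, 11))
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print("Pearson: ", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Because y always increases with x, the ranks line up perfectly and Spearman reaches 1, while Pearson falls short of 1 because the points do not lie on a straight line.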

Cairo rules for display.

Aims

Primary aim is to find factors which explain variation.

Historically tried to be completely objective - but not possible. Good to try and keep opinions to yourself. Good to listen to audience, understand their limitations, don't be too clever with too much detail, know what you want to achieve. Aim to inform and encourage debate. But data doesn't speak for itself, needs a story: presentation matters. Infographics good to tell story.

Populations and Measurement

Surveys

  1. raw data recorded from the survey participants
  2. then infer something about the true values for the people in the sample (participants could be lying)
  3. then infer something about the study population, i.e. everyone who could have been included in the survey (the people who take part may not be representative)
  4. then infer something about the target population

Can distort at all levels. Inductive inference gets from raw data to inference about population.

Going from 2 to 3: random sampling is best. Going from 3 to 4 can fail when, e.g., a drug is tested only on men and then given to women.

Normal distribution

Expected for phenomena driven by a large number of small influences, e.g. birth weights. Set the mean and sd and you get the shape of the normal curve. Other, less natural phenomena can be non-normal, e.g. incomes have a long tail on the right-hand side. The normal is good for summarizing a large population in 2 numbers. About 95% of the population lies within 1.96 sd either side of the mean (1.645 sd either side covers about 90%). Many other statistics follow easily from the two numbers.
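
A quick check of how much of a normal distribution lies within a given number of standard deviations of the mean, using `statistics.NormalDist` from the Python standard library. The helper name `coverage` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of a normal population within k sd either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1.0, 1.645, 1.96, 2.0, 3.0):
    print(f"within {k} sd: {coverage(k):.4f}")

# Going the other way: how many sd are needed to cover the central 95%?
k_95 = standard_normal.inv_cdf(0.975)
print("sd needed for 95%:", round(k_95, 3))
```

The central 95% needs 2.5% left in each tail, which is why the inverse lookup uses 0.975 rather than 0.95.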

Now more common to have all the data in a population.

What causes what

Correlation does not imply causation.