Topic 6: Statistical Concepts

Statistics is arguably an enormous and sprawling field, as it cuts across all kinds of disciplines. Data has similar attributes and creates synergies with statistics.

For this reason, we will examine five statistical concepts that are commonly used in data collection.

Descriptive statistics is used when a scientist wants to describe a dataset using the mean (or average), the median (“the point that divides the data in half”), the mode, the variance (“measures the spread of a dataset with respect to the mean”), and the standard deviation (“measures overall spread and is calculated by taking the square root of the variance”) (Radečić, 2020).
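As a minimal sketch, these five measures can be computed with Python's standard `statistics` module. The small dataset is invented for illustration, and the population variants of the variance and standard deviation (`pvariance`, `pstdev`) are an assumption, since the text does not say whether the population or sample formula is meant.

```python
import statistics

# Hypothetical dataset, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),           # arithmetic average
    "median": statistics.median(data),       # point that divides the data in half
    "mode": statistics.mode(data),           # most frequent value
    "variance": statistics.pvariance(data),  # spread with respect to the mean (population form)
    "std_dev": statistics.pstdev(data),      # square root of the variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For a dataset treated as a sample from a larger population, `statistics.variance` and `statistics.stdev` would be the corresponding calls.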

A probability distribution is a function that gives the probability of occurrence of each possible outcome of an experiment.
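To make this concrete, the sketch below builds such a function for a simple experiment: the sum of two fair six-sided dice. The experiment and the name `pmf` are assumptions chosen for illustration and do not come from the cited source.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate every equally likely outcome of rolling two fair dice
# and record the sum of each roll.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)

# Map each possible result to its probability of occurrence.
pmf = {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}

for total, probability in pmf.items():
    print(f"P(sum = {total}) = {probability}")
```

The probabilities over all possible results add up to 1, which is the defining property of a distribution.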

Dimensionality reduction refers to the projection of high-dimensional data into a lower-dimensional space (Radečić, 2020).
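One common way to perform such a projection is principal component analysis (PCA). The text does not name a specific method, so PCA is an assumption here. The sketch below reduces two-dimensional points to one dimension using the closed-form principal axis of a 2x2 covariance matrix. The function name `project_to_1d` and the sample points are hypothetical.

```python
import math

def project_to_1d(points):
    """Project 2-D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / n
    b = sum(dx * dy for dx, dy in centered) / n
    c = sum(dy * dy for _, dy in centered) / n

    # Closed-form angle of the principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))

    # Each point becomes a single coordinate along that axis.
    projected = [dx * direction[0] + dy * direction[1] for dx, dy in centered]
    return direction, projected

# Hypothetical points lying on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
direction, projected = project_to_1d(points)
print(direction)
print(projected)
```

Because these points are perfectly collinear, the one-dimensional projection keeps all of their variance. With real data, some information is lost in exchange for the lower dimension.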

The terms “sample” and “sampling” are used interchangeably here to refer to the process of gathering a group of observations. Over- or under-sampling can help in classification problems where the minority and majority classes need to be equally represented. An imbalanced dataset can be corrected either by over-sampling the minority class or by under-sampling the majority class. Random over-sampling duplicates randomly chosen observations from the minority class, while random under-sampling removes randomly chosen observations from the majority class (Radečić, 2020).
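Both strategies can be sketched in a few lines of standard-library Python. The function names `random_oversample` and `random_undersample`, the fixed seed, and the toy dataset are assumptions made for illustration. In practice a dedicated library would normally be used.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]  # sampled with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        out_rows.extend(rng.sample(pool, target))  # sampled without replacement
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Hypothetical imbalanced dataset: 8 majority rows and 2 minority rows.
rows = list(range(10))
labels = ["majority"] * 8 + ["minority"] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Over-sampling keeps every original observation but repeats minority rows, whereas under-sampling discards majority rows, so the choice trades duplicated information against lost information.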

The Bayesian approach allows for flexibility and adaptability as new data arrive. If the collected data do not adequately represent what a scientist expects to observe in the future, Bayesian statistics allows them to incorporate their own prior knowledge into the calculations rather than relying solely on the sample. It also allows the scientist’s beliefs about the future to be updated as new data come in (Rice, 2018).
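The idea of combining prior knowledge with incoming data can be illustrated with a Beta-Binomial model, which is one standard textbook case and an assumption here, since the text does not name a specific model. In the sketch below, the prior `Beta(2, 2)`, the observed counts, and the function names are all hypothetical.

```python
from fractions import Fraction

def update_beta(alpha, beta, successes, trials):
    """Return the posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + (trials - successes)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

# Prior knowledge: the success rate is believed to be around one half.
prior = (2, 2)

# First batch of data: 7 successes in 10 trials.
after_first = update_beta(*prior, successes=7, trials=10)

# Second batch of data: 3 successes in 10 trials.
after_second = update_beta(*after_first, successes=3, trials=10)

print("prior mean:", beta_mean(*prior))
print("after first batch:", after_first, beta_mean(*after_first))
print("after second batch:", after_second, beta_mean(*after_second))
```

Each posterior serves as the prior for the next batch, which is how the estimate adapts as new data are collected.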