Introduction to Mathematics and Statistics for Data Science and Machine Learning

Ekta Patel
Apr 22, 2024

This blog provides a basic introduction to the statistics and mathematics behind data science.

As everyone knows, data science and machine learning require an understanding of mathematics and statistics. Both fields are fundamentally built on mathematics, but the question of how much math or statistics you actually need often remains open.

This blog will assist you in understanding some of those data science ideas.

Four branches of mathematics matter most in data science:

1) Statistics (Descriptive and Inferential)

2) Linear Algebra

3) Probability

4) Calculus and Optimization

1) Statistics:

I cannot conceive of data science without statistics and its never-ending applications across sectors and academic domains. In essence, statistical methods let us summarize quantitative data and extract insights from it. Unless you are an absolute math pro, it is quite difficult to obtain any kind of insight from simply looking at raw numerical data!

Subjects related to Descriptive Statistics:

Mean, Median, Mode
IQR, percentiles
Standard deviation and Variance
Normal Distribution
Z-statistics and T-statistics
Correlation and Linear regression
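
To make a few of these ideas concrete, here is a minimal sketch using Python's built-in statistics module. The exam scores are made-up numbers chosen only for illustration, and the quartiles follow the module's default method, so other tools may report slightly different values.

```python
import statistics

# Hypothetical sample: exam scores for nine students (already sorted).
scores = [52, 61, 61, 68, 70, 74, 79, 85, 98]

# Measures of central tendency.
mean = statistics.mean(scores)      # 72
median = statistics.median(scores)  # 70
mode = statistics.mode(scores)      # 61

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

# Sample variance and standard deviation (both divide by n - 1).
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

# z-score: how many standard deviations the top score lies above the mean.
z_top = (max(scores) - mean) / std_dev

print(f"mean={mean} median={median} mode={mode}")
print(f"Q1={q1} Q3={q3} IQR={iqr}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f} z_top={z_top:.2f}")
```

Notice how the mean, median, and mode each give a slightly different picture of a "typical" score, while the IQR and standard deviation describe how spread out the scores are.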

Subjects related to Inferential Statistics:

Sampling distributions
Confidence intervals
Chi-square test
Advanced regression
ANOVA
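
As one example from this list, a confidence interval for a mean can be sketched in a few lines. The delivery times below are invented for illustration, and the sketch assumes a normal approximation; with a sample this small, a t-based interval would usually be somewhat wider.

```python
import math
import statistics

# Hypothetical sample: delivery times in minutes.
sample = [31, 28, 35, 30, 27, 33, 29, 32, 34, 26, 30, 31, 29, 33, 28, 32]

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error: the estimated standard deviation of the sample mean.
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Critical value for a two-sided 95% interval from the standard normal
# distribution (roughly 1.96).
z_crit = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z_crit * standard_error
upper = sample_mean + z_crit * standard_error

print(f"95% confidence interval for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is an estimate of where the population mean plausibly lies, based only on the sample.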

2) Linear algebra:

Linear algebra is the branch of mathematics that studies systems of linear equations, along with the vectors and matrices used to represent them. These systems can involve one, two, or many variables. By expressing the relationships between variables as equations, linear algebra helps us solve for unknown quantities in numerical data.

Linear algebra is used in many areas, including neural networks, graphs, descriptive statistics, linear regression, matrix calculations, and the conversion and representation of image data.

Machine-learning algorithms such as linear regression and logistic regression use linear algebra to estimate the target variable from the inputs, that is, the features or feature vectors in the data set.
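
Here is a minimal sketch of how linear regression reduces to solving a system of linear equations. It assumes a single feature and made-up data that follows y = 2x + 1 exactly; real libraries handle many features with matrix routines, while this version solves the resulting 2x2 system by hand.

```python
# Hypothetical data: one feature x and a target y with y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

# With a design matrix X whose rows are [1, x], least squares leads to
# the normal equations (X^T X) w = X^T y, where w = [intercept, slope].
n = len(xs)
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# X^T X = [[n, sum_x], [sum_x, sum_xx]] and X^T y = [sum_y, sum_xy].
# Solve the 2x2 system with Cramer's rule.
det = n * sum_xx - sum_x * sum_x
intercept = (sum_y * sum_xx - sum_x * sum_xy) / det
slope = (n * sum_xy - sum_x * sum_y) / det

print(f"fitted line: y = {slope} * x + {intercept}")  # slope 2.0, intercept 1.0
```

Because the data lies exactly on a line, the fitted slope and intercept recover 2 and 1. With noisy data, the same equations give the best-fitting line in the least-squares sense.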

3) Probability:

Let’s talk about probability: it’s all around us! We all think in terms of possibilities. For instance, how likely is a particular event to occur? Don’t we all ask that?

There are specific kinds of probability that we should pay attention to:

Probability of independent events
Probability of dependent events
Conditional probability

Based on these, we assess different events and their likelihood. We sometimes also need probability density functions, whose graphs are known as density curves and show which outcomes are more or less likely.
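
The difference between independent, dependent, and conditional probability is easier to see with a small example. This sketch assumes two fair six-sided dice and simply counts outcomes; the helper function prob is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a function of one outcome."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


def first_is_six(outcome):
    return outcome[0] == 6


def second_is_six(outcome):
    return outcome[1] == 6


def total_at_least_10(outcome):
    return sum(outcome) >= 10


# Independent events: P(A and B) equals P(A) * P(B).
p_both_six = prob(lambda o: first_is_six(o) and second_is_six(o))
independent = p_both_six == prob(first_is_six) * prob(second_is_six)

# Conditional probability: P(B | A) = P(A and B) / P(A).
p_joint = prob(lambda o: total_at_least_10(o) and first_is_six(o))
p_cond = p_joint / prob(first_is_six)

print(independent)              # True
print(prob(total_at_least_10))  # 1/6
print(p_cond)                   # 1/2
```

Knowing that the first die shows a six raises the chance of a total of at least 10 from 1/6 to 1/2, so these two events are dependent, while the two individual dice are independent of each other.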

An understanding of probability makes many tasks possible: estimating the expected value of a variable, interpreting confusion matrices in classification algorithms, computing information entropy, weighing the evidence for specific features in Naive Bayes classification, and even calculating the statistics used in hypothesis testing. The list of use cases is far longer than this one.
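
Two of these use cases, expected value and information entropy, fit in a few lines. This is only a sketch; the distributions are simple made-up examples, and entropy is measured in bits (log base 2).

```python
import math


def expected_value(values, probabilities):
    """Probability-weighted average of the possible values."""
    return sum(v * p for v, p in zip(values, probabilities))


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Expected value of one roll of a fair six-sided die.
die_faces = [1, 2, 3, 4, 5, 6]
die_mean = expected_value(die_faces, [1 / 6] * 6)

# Entropy is highest when outcomes are equally likely and drops as
# one outcome becomes more certain.
fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])

print(f"expected die roll: {die_mean:.2f}")
print(f"fair coin entropy: {fair_coin:.3f} bits")
print(f"biased coin entropy: {biased_coin:.3f} bits")
```

A fair coin carries exactly one bit of uncertainty, while a heavily biased coin carries less because its outcome is easier to guess.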

4) Calculus and Optimization:

Optimization is the branch of mathematics concerned with finding the inputs that minimize or maximize the output of a function. Every data set has many input variables. While a machine learning algorithm is being trained, the model may overestimate or underestimate the output variable, and its predictions may also be biased on the given data set. Training therefore iterates repeatedly over the training data, adjusting the model each time to reduce the error and fit the model to the data. Calculus supplies the derivatives that tell the algorithm in which direction to adjust.

Three components are involved in function optimization:

the input to the function (e.g. x),
the objective function itself (e.g. f()),
the output from the function (e.g. cost).

Input (x): The function’s input, such as a candidate solution to be assessed.
Function (f()): The target or objective function that assesses inputs.
Cost: The result of evaluating a candidate solution with the objective function, which is the value we try to minimize or maximize.
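
The three components above can be sketched with gradient descent, one common optimization method. The objective f(x) = (x - 3)^2, the starting point, the learning rate, and the number of steps are all assumptions picked for illustration, not values from any particular algorithm.

```python
def objective(x):
    """Objective function f(): lowest at x = 3, where the cost is 0."""
    return (x - 3.0) ** 2


def gradient(x):
    """Derivative of the objective, which points uphill."""
    return 2.0 * (x - 3.0)


x = 0.0              # input: the initial candidate solution
learning_rate = 0.1  # step size (assumed value)

for step in range(100):
    # Move against the gradient to reduce the cost.
    x -= learning_rate * gradient(x)

cost = objective(x)  # output: the cost of the final candidate

print(f"x = {x:.6f}, cost = {cost:.2e}")
```

Each iteration nudges the input toward the value that gives the lowest cost, which is the same idea a training loop applies to a model's parameters.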

This concludes our overview of mathematics and statistics for data science and machine learning.