Would you like to get started with machine learning?
I have created a simple and easy tutorial for complete beginners. Together, we will go over the basic steps of training a machine learning model.
While explaining the steps of training a model one by one, I will also walk through a very basic example of a machine learning problem. So, if you’d like to follow along, you can download the sample dataset from this link.
This is just a sample dataset to help you get started with machine learning.
We have 18 records of people of different ages and genders, each with their favorite music genre. Using the features “age” and “gender”, we will try to predict which genre of music is their favorite.
Note: In this dataset, gender is encoded numerically, with 1 for female and 0 for male.
That said, it is perfectly fine if you don’t want to follow the example. I will be explaining all of these steps in detail. So, let’s dive in!
First Things to Know
Before going into the steps of training a model, let’s clarify some points. Machine learning is an artificial intelligence discipline that focuses on developing algorithms that can learn from data.
To do this, machine learning models are trained on a dataset that teaches the model how to make correct predictions or classifications on fresh, previously unknown data.
So, what are these models? A machine learning model is similar to a recipe that a computer uses to generate data predictions or choices.
A model, like a recipe, follows a set of instructions to evaluate data and generate predictions or judgments based on patterns found in the data. The more data the model is trained on, the more accurate its predictions become.
What Kinds of Models Can We Train?
Let’s look at some of the basic machine learning models; a short code sketch of how they are created follows the list.
- Linear Regression: a model that predicts a continuous target variable from one or more input variables.
- Neural Networks: a network of linked nodes that can learn to detect complicated patterns in data.
- Decision Trees: a decision-making approach built on a chain of branching if-else statements.
- Clustering: a family of models that group similar data points together.
- Logistic Regression: a model for binary classification problems in which the target variable has two potential values.
- Random Forest: an ensemble model composed of many decision trees, frequently used for classification and regression tasks.
- K-Nearest Neighbors: a model that predicts the target variable using the k-nearest data points in the training set.
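If you’d like to see what these look like in practice, here is a minimal sketch of how each of the models above is created with scikit-learn, the library we will use later in this tutorial. Nothing is trained here; each line simply builds an untrained model object, and you would only use the one that fits your problem.
# A minimal sketch: creating each of the models above with scikit-learn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
models = {
    "Linear Regression": LinearRegression(),
    "Neural Network": MLPClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Clustering (K-Means)": KMeans(n_clusters=3),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
for name, model in models.items():
    print(name, "->", model)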
Depending on our problem and dataset, we decide which machine learning model fits our situation best. We will come back to this later. Now, let’s start training our model. I hope you have already downloaded the dataset if you would like to follow our example.
Also, I recommend having Jupyter Notebook installed on your local machine and using it for your machine learning projects.
1. Define the problem
The first stage in training a machine learning model is defining the issue to be solved. This entails selecting the variables that you wish to forecast (known as the target variable) and the variables that will be used to generate those predictions (known as features or predictors).
You should also decide what sort of machine-learning problem you are attempting to address (classification, regression, clustering, and so on) and what type of data you will need to gather or get to train your model.
The sort of model you employ will be determined by the type of machine learning problem you are aiming to solve. Classification, regression, and clustering are the three primary categories of machine learning problems. When you want to predict a categorical variable, such as whether an email is spam or not, you use classification.
When you wish to forecast a continuous variable, like the price of a house, you utilize regression. Clustering is used to put together comparable data items based on their commonalities.
If we look at our example, our challenge is to determine a person’s preferred musical style from their gender and age. For this, we’ll use a dataset of 18 people with information on their age, gender, and favorite music genre.
2. Prepare the data
After you’ve specified the problem, you’ll need to prepare the data for training the model. This entails cleaning and processing the data so that it is in a format the machine learning algorithm can use.
This might include tasks like removing missing values, converting categorical data to numerical data, and scaling or normalizing the data so that all features are on the same scale.
For example, this is how you delete missing values:
import pandas as pd
# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')
# Check for missing values
print(data.isnull().sum())
# Drop rows with missing values
data.dropna(inplace=True)
# Check that all missing values have been removed
print(data.isnull().sum())
Little note: In the line “import pandas as pd”, we import the Pandas library and assign it the alias “pd” to make it easier to reference its functions and objects later in the code.
Pandas is a well-known Python library for data manipulation and analysis, particularly when working with structured or tabular data.
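The other preparation steps mentioned above can be handled in a similar way. Here is a minimal sketch of converting a categorical column to numbers and scaling a numeric one. The column names 'gender' and 'age' come from our example, while the text values 'female'/'male' are only an assumption about how such a column might look; in our music dataset the gender column is already numeric, so you would not need this step there.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')
# Convert a text category to numbers (1 for female, 0 for male, as in our dataset)
data['gender'] = data['gender'].map({'female': 1, 'male': 0})
# Put the age column on a standard scale (mean 0, standard deviation 1)
scaler = StandardScaler()
data[['age']] = scaler.fit_transform(data[['age']])
print(data.head())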
In our example of predicting music genres, we’ll first import the dataset. I have named it music.csv, but you can name it whatever you like.
To prepare the data for training a machine learning model, we split it into features (age and gender) and the target (music genre).
We’ll also split the data into training and testing sets with an 80:20 ratio, so we can assess the performance of our model and avoid overfitting.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
# Load data from the CSV file
music_data = pd.read_csv('music.csv')
# Split data into features and target
X = music_data.drop(columns=['genre'])
y = music_data['genre']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
3. Choose a machine learning model
After you have prepared the data, you must choose a machine-learning model that is suited to your task.
There are several algorithms to pick from, such as decision trees, logistic regression, support vector machines, neural networks, and others. The algorithm you choose will be determined by the type of problem you are trying to solve, the type of data you have, and your performance needs.
We’ll use a decision tree classifier for this example because we’re working with a classification problem (predicting categorical data).
# Import necessary libraries
from sklearn.tree import DecisionTreeClassifier
4. Train the model
You can begin training the model once you’ve chosen a suitable machine learning algorithm. This entails using the data we prepared earlier to teach the algorithm how to make predictions on fresh, previously unseen data.
The algorithm will modify its internal parameters during training to minimize the difference between its predicted values and the actual values in the training data. The quantity of data utilized for training, as well as the algorithm’s specific parameters, can all have an effect on the accuracy of the resultant model.
In our specific example, now that we’ve decided on a method, we can train our model with the training data.
# Train the decision tree classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
5. Evaluate the model
After the model has been trained, it must be evaluated on new data to ensure that it is accurate and dependable. This entails testing the model with data that was not used during training and comparing its predicted values to the actual values in the test data.
This review can assist in identifying any model flaws, such as overfitting or underfitting, and can lead to any fine-tuning that may be required.
Using the testing data, we will assess the accuracy of our model.
# Import necessary libraries
from sklearn.metrics import accuracy_score
# Predict the music genre for the test data
predictions = model.predict(X_test)
# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: ", accuracy)
The accuracy score is not so bad for now. 🙂 Because the train/test split is random, your score may differ a little from run to run. To improve your accuracy score, you can always clean the data more or try different machine learning models to see which one gives the highest score.
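As mentioned above, one common flaw to look for is overfitting: the model performs much better on the data it was trained on than on new data. A minimal sketch of that check, reusing the objects from the earlier steps:
# Compare training accuracy with test accuracy; a large gap suggests overfitting
train_predictions = model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print("Training accuracy: ", train_accuracy)
print("Test accuracy: ", accuracy)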
6. Fine-tune the model
If the model’s performance is not sufficient, you can fine-tune it by adjusting various algorithm parameters or by experimenting with different algorithms entirely.
This procedure may include experimenting with alternative learning rates, modifying regularization settings, or altering the number or size of hidden layers in a neural network.
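For our decision tree, one such parameter is max_depth, which limits how deep the tree is allowed to grow. Here is a minimal sketch of re-training the model with an explicit max_depth and checking the accuracy again; the value 3 is only illustrative, not a recommendation.
# Re-train the decision tree with a limited depth and compare the accuracy
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
tuned_model = DecisionTreeClassifier(max_depth=3)
tuned_model.fit(X_train, y_train)
tuned_predictions = tuned_model.predict(X_test)
print("Tuned accuracy: ", accuracy_score(y_test, tuned_predictions))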
7. Use the model
Once you’re pleased with the model’s performance, you can start using it to generate predictions on new data.
This might entail feeding fresh data into the model and utilizing the model’s learned parameters to generate predictions on that data, or integrating the model into a broader application or system.
Once we’re pleased with its accuracy, we can use our model to generate predictions on new data. You can try different values of age and gender.
# Test the model with new data (each row is [age, gender], in the same order as the training features)
new_data = [[25, 1], [30, 0]]
predictions = model.predict(new_data)
print("Predictions: ", predictions)
Wrap Up
We have finished training our first machine learning model.
I hope you have found it useful. You can now try different machine learning models, such as Random Forest or K-Nearest Neighbors, and see how they compare.
There are many datasets and challenges on Kaggle if you’d like to improve your coding and understanding of machine learning.