CompSci Blogs

August 2023 to June 2024


14 April 2024

ML Model Tutorial


How to Make an ML Model With API Integration

Part 1 - Model Construction

Step 1 - Acquire Data

Grab a dataset related to whatever you want to predict from Kaggle, Seaborn, or wherever; I don't care. Here, we'll be using the World Happiness Report (2021) to try to predict how happy a fictional country is.

First, load the dataset csv into the notebook:

import pandas as pd

#Use the pandas read_csv function to load each csv file into a pandas dataframe
df_happiness = pd.read_csv('world-happiness-report.csv')
df_loc = pd.read_csv('world-happiness-report-2021.csv')

If you are using Kaggle, most datasets will have an option to download the CSVs you need directly onto your PC.

In this case, the data is split between two CSVs, which we'll have to fix.
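
Before joining two files, it helps to see which columns they share. The sketch below uses small stand-in dataframes rather than the real CSVs (the rows are copied from the tables in this post, and only a few columns are included), so treat it as an illustration of the check and not as the real file layout:

```python
import pandas as pd

# Stand-in frames: a handful of the real column names, two rows each
df_happiness = pd.DataFrame({
    "Country name": ["Afghanistan", "Zimbabwe"],
    "year": [2008, 2020],
    "Log GDP per capita": [7.370, 7.829],
})
df_loc = pd.DataFrame({
    "Country name": ["Afghanistan", "Zimbabwe"],
    "Regional indicator": ["South Asia", "Sub-Saharan Africa"],
    "Ladder score": [2.523, 3.145],
})

# Columns present in both files are candidates for joining on
shared = sorted(set(df_happiness.columns) & set(df_loc.columns))

# Columns that only the 2021 file provides
only_in_loc = sorted(set(df_loc.columns) - set(df_happiness.columns))

print("shared:", shared)
print("only in df_loc:", only_in_loc)
```

Here the only shared column is 'Country name', which is why the next step matches rows on it.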

Step 2 - Cleaning the Data

In order to feed the dataframe into a model, we'll need to adjust its contents a little. Currently, df_happiness looks like this:

df_happiness
(index) | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect
0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.80 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258
1 | Afghanistan | 2009 | 4.402 | 7.540 | 0.552 | 51.20 | 0.679 | 0.190 | 0.850 | 0.584 | 0.237
2 | Afghanistan | 2010 | 4.758 | 7.647 | 0.539 | 51.60 | 0.600 | 0.121 | 0.707 | 0.618 | 0.275
3 | Afghanistan | 2011 | 3.832 | 7.620 | 0.521 | 51.92 | 0.496 | 0.162 | 0.731 | 0.611 | 0.267
4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1944 | Zimbabwe | 2016 | 3.735 | 7.984 | 0.768 | 54.40 | 0.733 | -0.095 | 0.724 | 0.738 | 0.209
1945 | Zimbabwe | 2017 | 3.638 | 8.016 | 0.754 | 55.00 | 0.753 | -0.098 | 0.751 | 0.806 | 0.224
1946 | Zimbabwe | 2018 | 3.616 | 8.049 | 0.775 | 55.60 | 0.763 | -0.068 | 0.844 | 0.710 | 0.212
1947 | Zimbabwe | 2019 | 2.694 | 7.950 | 0.759 | 56.20 | 0.632 | -0.064 | 0.831 | 0.716 | 0.235
1948 | Zimbabwe | 2020 | 3.160 | 7.829 | 0.717 | 56.80 | 0.643 | -0.009 | 0.789 | 0.703 | 0.346

1949 rows × 11 columns

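This is largely fine. However, the score we want to predict (the 2021 'Ladder score') and each country's region ('Regional indicator') are in df_loc, so we need to copy them over. Note that this attaches a country's 2021 score to every one of its yearly rows.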

#Restrict to needed columns
#(.copy() makes df_final an independent dataframe, so adding columns to it doesn't raise a SettingWithCopyWarning)
df_final = df_happiness[["Country name", "year", "Log GDP per capita", "Social support","Healthy life expectancy at birth", "Freedom to make life choices"]].copy()

#Create dictionaries matching names to scores and location of all countries
add_scores = dict(zip(df_loc['Country name'], df_loc['Ladder score']))
add_locs = dict(zip(df_loc['Country name'], df_loc['Regional indicator']))

#Match names to add new columns
df_final['score'] = df_final['Country name'].map(add_scores)
df_final['location'] = df_final['Country name'].map(add_locs)

df_final now looks like this:

df_final
(index) | Country name | year | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | score | location
0 | Afghanistan | 2008 | 7.370 | 0.451 | 50.80 | 0.718 | 2.523 | South Asia
1 | Afghanistan | 2009 | 7.540 | 0.552 | 51.20 | 0.679 | 2.523 | South Asia
2 | Afghanistan | 2010 | 7.647 | 0.539 | 51.60 | 0.600 | 2.523 | South Asia
3 | Afghanistan | 2011 | 7.620 | 0.521 | 51.92 | 0.496 | 2.523 | South Asia
4 | Afghanistan | 2012 | 7.705 | 0.521 | 52.24 | 0.531 | 2.523 | South Asia
... | ... | ... | ... | ... | ... | ... | ... | ...
1944 | Zimbabwe | 2016 | 7.984 | 0.768 | 54.40 | 0.733 | 3.145 | Sub-Saharan Africa
1945 | Zimbabwe | 2017 | 8.016 | 0.754 | 55.00 | 0.753 | 3.145 | Sub-Saharan Africa
1946 | Zimbabwe | 2018 | 8.049 | 0.775 | 55.60 | 0.763 | 3.145 | Sub-Saharan Africa
1947 | Zimbabwe | 2019 | 7.950 | 0.759 | 56.20 | 0.632 | 3.145 | Sub-Saharan Africa
1948 | Zimbabwe | 2020 | 7.829 | 0.717 | 56.80 | 0.643 | 3.145 | Sub-Saharan Africa

1949 rows × 8 columns
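
If you'd rather not build lookup dictionaries, a left merge can do the same job and also tell you which countries had no match. This is only a sketch on stand-in data: 'Atlantis' is a made-up country added to show what an unmatched row looks like, and the `lookup` name is mine, not something from the notebook above.

```python
import pandas as pd

# Stand-in for the yearly data ('Atlantis' is hypothetical and has no 2021 entry)
df_happiness = pd.DataFrame({
    "Country name": ["Afghanistan", "Zimbabwe", "Atlantis"],
    "year": [2008, 2020, 2015],
})

# Stand-in for the 2021 file
df_loc = pd.DataFrame({
    "Country name": ["Afghanistan", "Zimbabwe"],
    "Ladder score": [2.523, 3.145],
    "Regional indicator": ["South Asia", "Sub-Saharan Africa"],
})

# Keep only the columns we want to bring across, renamed to match df_final
lookup = df_loc[["Country name", "Ladder score", "Regional indicator"]].rename(
    columns={"Ladder score": "score", "Regional indicator": "location"}
)

# A left join keeps every yearly row; indicator=True adds a '_merge' column
merged = df_happiness.merge(lookup, on="Country name", how="left", indicator=True)

# Countries that appear in the yearly data but not in the 2021 file
unmatched = merged.loc[merged["_merge"] == "left_only", "Country name"].unique().tolist()
print(unmatched)
```

Unmatched rows come back with NaN in 'score' and 'location', which is exactly what `.map()` does too, so it's worth looking at that list before deciding how to fill the gaps.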

The last thing to do is to encode the string data in 'location' as integers. (The Commonwealth of Independent States is a grouping of former Soviet republics.) The encoding goes as follows:

#Dictionary of name - key correspondences
mapping = {'Southeast Asia': 0, 'South Asia': 1, 'Western Europe': 2, 'North America and ANZ': 3, 'East Asia': 4,
           'Middle East and North Africa': 5, 'Central and Eastern Europe': 6, 'Latin America and Caribbean': 7,
           'Commonwealth of Independent States': 8, 'Sub-Saharan Africa': 9}

#Add new encoded location column
df_final['loc_encoded'] = df_final['location'].map(mapping)

#Replace NaN entries with 0 (any country missing from the 2021 file ends up with score 0 and region code 0)
df_final = df_final.fillna(0)

And this is our final dataframe:

df_final
(index) | Country name | year | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | score | location | loc_encoded
0 | Afghanistan | 2008 | 7.370 | 0.451 | 50.80 | 0.718 | 2.523 | South Asia | 1.0
1 | Afghanistan | 2009 | 7.540 | 0.552 | 51.20 | 0.679 | 2.523 | South Asia | 1.0
2 | Afghanistan | 2010 | 7.647 | 0.539 | 51.60 | 0.600 | 2.523 | South Asia | 1.0
3 | Afghanistan | 2011 | 7.620 | 0.521 | 51.92 | 0.496 | 2.523 | South Asia | 1.0
4 | Afghanistan | 2012 | 7.705 | 0.521 | 52.24 | 0.531 | 2.523 | South Asia | 1.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1944 | Zimbabwe | 2016 | 7.984 | 0.768 | 54.40 | 0.733 | 3.145 | Sub-Saharan Africa | 9.0
1945 | Zimbabwe | 2017 | 8.016 | 0.754 | 55.00 | 0.753 | 3.145 | Sub-Saharan Africa | 9.0
1946 | Zimbabwe | 2018 | 8.049 | 0.775 | 55.60 | 0.763 | 3.145 | Sub-Saharan Africa | 9.0
1947 | Zimbabwe | 2019 | 7.950 | 0.759 | 56.20 | 0.632 | 3.145 | Sub-Saharan Africa | 9.0
1948 | Zimbabwe | 2020 | 7.829 | 0.717 | 56.80 | 0.643 | 3.145 | Sub-Saharan Africa | 9.0

1949 rows × 9 columns
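
One caveat with integer codes: a linear model reads them as quantities, so region 9 counts as "nine more" than region 0 even though the numbering is arbitrary. A common alternative is one-hot encoding, sketched below with `pd.get_dummies` on a tiny stand-in dataframe (the numbers are illustrative). This is not what the rest of this post uses, and the API and frontend later on assume the single 'loc_encoded' column, so switching would mean changing those too.

```python
import pandas as pd

# Tiny stand-in for df_final (values are illustrative)
df = pd.DataFrame({
    "location": ["South Asia", "Western Europe", "South Asia", "Sub-Saharan Africa"],
    "Social support": [0.451, 0.900, 0.552, 0.717],
})

# One 0/1 column per region instead of a single integer code
encoded = pd.get_dummies(df, columns=["location"], prefix="loc", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row gets a 1 in exactly one of the 'loc_...' columns, so the model learns a separate offset per region instead of a single slope across made-up numbers.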

Step 3 - Training the Model

The model we use will need to take the numerical variables in df_final and output a float value of 'score'. scikit-learn includes the LinearRegression model, which can accomplish this.

Before training, the data needs to be split into x (predictor) and y (response) sets, and each of those into a training set and a testing set.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

import numpy as np

#Drop unneeded columns from x
x = df_final.drop(['Country name','location','score'],axis=1)

#Take score column as y
y = df_final['score']

#Split into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

#Report accuracy as mean absolute error (in points)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE):', mae)
Mean Absolute Error (MAE): 0.6854207563152949
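
An MAE on its own is hard to judge, so it can help to compare it against a model that always predicts the training mean. The sketch below does that with scikit-learn's `DummyRegressor` on synthetic data (the features and coefficients are invented for the example; swap in your own X_train, X_test, y_train and y_test to check the happiness model):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y depends linearly on two features, plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 5.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: ignore the features and always predict the mean of y_train
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"baseline MAE: {baseline_mae:.3f}, model MAE: {model_mae:.3f}")
```

If the model's MAE isn't clearly below the baseline's, the features aren't adding much.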

And this is how to make a prediction using the model. It takes a pandas dataframe or 2D numpy array and returns an array with one predicted score per input row (here, a single score).

data = {'year': 2020, 'Log GDP per capita': 7.303, 'Social support': 0.889, 'Healthy life expectancy at birth': 76.49,
        'Freedom to make life choices': 0.884, 'loc_encoded':3.0}

#Wrapping data in a list makes pandas treat the dictionary as a single row
model.predict(pd.DataFrame([data]))
array([5.33405577])

Step 4 - Saving & Using the Model

Though for such a small dataset the computational time needed to train the model is negligible, most ML tasks require larger or more complex models with many trainable parameters. In those cases, it is unreasonable to retrain the model every time it is used; loading a saved copy of the fitted model is far more efficient. This is quite simple to do in scikit-learn:

from joblib import dump, load

#Save file using joblib
dump(model, 'happiness_model.joblib') 
['happiness_model.joblib']

To load the saved model later (for example, in a different script), use this code:

model = load('happiness_model.joblib')
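
It's worth confirming that the saved file gives the same predictions as the model in memory. Here is a minimal round-trip sketch using a toy model and a temporary directory, so it doesn't touch your real happiness_model.joblib:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Toy model: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save to a throwaway folder, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "happiness_model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should predict exactly what the original does
same = np.allclose(model.predict(X), restored.predict(X))
print("predictions match:", same)
```

Since joblib files are pickle-based, only load ones you created or trust, and load them with the same scikit-learn version you saved them with where possible.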

Part 2 - API Construction

Step 1 - Define API Endpoint

We will need to import the libraries the model depends on, and load the saved model file, in the API's .py file before defining the endpoint itself:

import pandas as pd
import numpy as np

from flask_restful import Api, Resource
from flask import Blueprint, request
from joblib import load

#Load the ML model: replace the file name with whatever yours is
model = load('./api/happiness_model.joblib')

#Initialize Flask API endpoint blueprint
happiness_api = Blueprint('happiness_api', __name__, url_prefix='/api/happiness')
api = Api(happiness_api)

#Use a post request to take data from the frontend as a JSON, then return output value
class happinessAPI:
    class _Predict(Resource):
        def post(self):
            #Get data from frontend
            body = request.get_json()
            
            if body is not None:
                #Convert frontend JSON output to a pandas dataframe
                data = pd.DataFrame([body])
                data = data.rename(columns={"freedom":"Freedom to make life choices",
                                            "lifespan":"Healthy life expectancy at birth",
                                            "money":"Log GDP per capita",
                                            "social":"Social support",
                                            "location":"loc_encoded"})
                
                #Predict and return the happiness score (model.predict returns a 1-element array, so take its
                #first element and convert it to a plain float for the JSON response)
                score = float(model.predict(data)[0])
                return {'score': score}, 200
            else:
                return {'message': 'No data provided'}, 400
    
    #Add endpoint resource for this method
    api.add_resource(_Predict, '/predict')
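
The endpoint above trusts whatever JSON it gets, so a missing or non-numeric field will surface as a server error from inside pandas or scikit-learn. One possible guard is a small validation step before building the dataframe. The helper below is hypothetical (the name `validate_body` and its error messages are mine); the field names are the ones the endpoint renames, and 0 to 9 is the range of region codes from the mapping in Part 1.

```python
# Hypothetical helper, not part of the original API file.
# Field order matches the column order the model was trained on.
REQUIRED_FIELDS = ("year", "money", "social", "lifespan", "freedom", "location")


def validate_body(body):
    """Return (clean_dict, None) on success or (None, error_message) on failure."""
    if not isinstance(body, dict):
        return None, "Request body must be a JSON object"

    missing = [field for field in REQUIRED_FIELDS if field not in body]
    if missing:
        return None, "Missing fields: " + ", ".join(missing)

    clean = {}
    for field in REQUIRED_FIELDS:
        value = body[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None, f"Field '{field}' must be a number"
        clean[field] = float(value)

    if not 0 <= clean["location"] <= 9:
        return None, "Field 'location' must be a region code from 0 to 9"

    return clean, None


clean, error = validate_body(
    {"year": 2020, "money": 9.4, "social": 0.5, "lifespan": 50, "freedom": 0.5, "location": 3}
)
print(clean, error)
```

Inside `post`, you could call this first, return the error message with a 400 status if there is one, and otherwise pass `clean` to `pd.DataFrame([clean])`. Because `clean` is built in a fixed order, the columns also reach the model in the order it was trained on, whatever order the frontend sent them in.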

In main.py, we will need to register the endpoint:

from flask_cors import CORS

#Import the existing Flask app instead of creating a second one
from __init__ import app
from api.happy import happiness_api

#Enable CORS for everything: with no extra arguments, CORS(app) allows requests from any origin
cors = CORS(app)

#Register the happiness endpoint on the app
app.register_blueprint(happiness_api)

As a bonus, enabling CORS this way gets rid of most cross-origin errors, though some browsers (e.g. Chrome) will still raise problems with localhost anyway. Allowing every origin is a security risk and you shouldn't do it on a real site, but no one cares enough about what we're doing in here to cause security problems.

Step 2 - Design Frontend

I don't care what CSS styling you use, but you'll want to have a set of input boxes or selectors which correspond to the inputs your model takes, an event listener which passes all that stuff to the backend, and a place to put your output statistic. In our case, we need a region selector plus number inputs for per capita GDP, social support, life expectancy, freedom, and year.

In any case, this is my relatively simple frontend data display:

<h2>How Happy is Your Country?</h2>

<div class="container">
  <form id="predictionForm">
    <!--Location select: Note that the options pass encoded values.-->
    <label for="loc">Where Are You?</label>
    <select id="loc" name="loc" required>
      <option value="0">Southeast Asia</option>
      <option value="1">South Asia</option>    
      <option value="2">Western Europe</option>  
      <option value="3">North America, Australia & New Zealand</option>
      <option value="4">East Asia</option>    
      <option value="5">Middle East & North Africa</option>    
      <option value="6">Central & Eastern Europe</option>
      <option value="7">Latin America & Caribbean</option>
      <option value="8">Former USSR</option>    
      <option value="9">Sub-Saharan Africa</option>    
    </select>
    <!--Per Capita GDP input-->
    <label for="wealth">Per Capita Yearly GDP ($):</label>
    <input type="number" id="wealth" name="wealth" value = "12647" required>
    <!--Social Support input (step="any" allows decimals; the default step of 1 would reject 0.5)-->
    <label for="soc">Social Support (Scale 0 - 1)</label>
    <input type="number" id="soc" name="soc" min="0" max="1" step="any" value="0.5" required>
    <!--Life Expectancy input-->
    <label for="life">Life Expectancy, in years</label>
    <input type="number" id="life" name="life" step="any" value="50" required>
    <!--Freedom input-->
    <label for="freedom">Social Freedom (Scale 0 - 1)</label>
    <input type="number" id="freedom" name="freedom" min="0" max="1" step="any" value="0.5" required>
    <!--Year input-->
    <label for="year">What year is it?</label>
    <input type="number" id="year" name="year" value = "2020" required>
    <button type="submit">Predict Happiness</button>
  </form>
  <div id="result"></div>
</div>

As for the JS part of the frontend, we will need an event listener tied to the form's 'submit' event, a converter to turn 'wealth' into its natural logarithm, and a call that passes everything to the backend. (The dataset's 'Log GDP per capita' column is a natural log: Afghanistan's 7.37 corresponds to roughly $1,600 per person, so a base 10 log would put the input on the wrong scale.)

The keys need to be renamed to match the column names in the training set, but that's an easy transformation to do in the backend.

//Adds an event listener for when the form is submitted
document.getElementById("predictionForm").addEventListener("submit", function(event) {
    //Stop the browser from reloading the page
    event.preventDefault();
    predictHappiness();
});

//Get data from the HTML elements and send it to the backend
function predictHappiness() {
    //Define formData as an object containing all of the inputs
    const formData = {
        year: parseInt(document.getElementById("year").value, 10),
        //Take the natural log of "wealth" to match the dataset's Log GDP per capita
        money: Math.log(parseFloat(document.getElementById("wealth").value)),
        social: parseFloat(document.getElementById("soc").value),
        lifespan: parseFloat(document.getElementById("life").value),
        freedom: parseFloat(document.getElementById("freedom").value),
        location: parseInt(document.getElementById("loc").value, 10)
    };

    //Set API url
    const apiUrl = "http://127.0.0.1:8086/api/happiness/predict";

    //Send data to backend and report prediction
    //(CORS headers are set by the server on its response, so the request only needs Content-Type)
    fetch(apiUrl, {
        method: "POST",
        headers: {
            "Content-Type": "application/json"
        },
        body: JSON.stringify(formData)
    })
    .then(response => response.json())
    .then(data => {
        //Display prediction result
        document.getElementById("result").textContent = "Your Country's Happiness Score: " + data.score.toFixed(2);
    })
    .catch(error => {
        console.error("Error:", error);
        //Report failures in the same "result" element
        document.getElementById("result").textContent = "An error occurred. Please try again.";
    });
}
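
For testing the endpoint without a browser, the same request can be built in Python. This is a sketch using only the standard library: `build_payload` and `predict` are names I made up, the URL is the one the frontend uses, and `predict` only works while the Flask server is running, so it is defined but not called here. The payload uses the natural log of GDP, which is the scale of the dataset's 'Log GDP per capita' column (Afghanistan's 7.37 corresponds to roughly $1,600 per person).

```python
import json
import math
import urllib.request

# Same address the frontend posts to
API_URL = "http://127.0.0.1:8086/api/happiness/predict"


def build_payload(year, gdp_per_capita, social, lifespan, freedom, location):
    """Build the JSON body the endpoint expects, using the frontend's key names."""
    return {
        "year": int(year),
        # Natural log, to match the dataset's 'Log GDP per capita' column
        "money": math.log(gdp_per_capita),
        "social": float(social),
        "lifespan": float(lifespan),
        "freedom": float(freedom),
        "location": int(location),
    }


def predict(payload, url=API_URL):
    """POST the payload and return the predicted score (needs the server running)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["score"]


# Same default values as the HTML form
payload = build_payload(2020, 12647, 0.5, 50, 0.5, 3)
print(payload)
```

With the backend running, `predict(payload)` should return the same number the web page displays for those inputs.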

Congratulations! If you followed the instructions correctly and didn't catastrophically mess up somewhere, you should now have a working API-integrated machine learning model!
