Grab a dataset related to the property you'd like to predict from Kaggle, Seaborn, or wherever, I don't care. Here, we'll be using the World Happiness Report (2021) to try to predict how happy a fictional country is.
First, load the dataset CSVs into the notebook:
import pandas as pd
#Use the pandas read_csv function to load each CSV file into a dataframe
df_happiness = pd.read_csv('world-happiness-report.csv')
df_loc = pd.read_csv('world-happiness-report-2021.csv')
If you are using Kaggle, most datasets will have an option to download the CSVs you need directly onto your PC.
In this case, the data is split between two CSVs, which we'll have to combine.
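Before reshaping anything, it can help to peek at which columns each file actually provides (a quick optional check; the exact column names can vary between versions of the Kaggle dataset):
#Quick look at the columns each CSV provides
print(df_happiness.columns.tolist())
print(df_loc.columns.tolist())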
In order to feed the dataframe into a model, we'll need to adjust its contents a little. Currently, df_happiness looks like this:
df_happiness
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.80 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 |
1 | Afghanistan | 2009 | 4.402 | 7.540 | 0.552 | 51.20 | 0.679 | 0.190 | 0.850 | 0.584 | 0.237 |
2 | Afghanistan | 2010 | 4.758 | 7.647 | 0.539 | 51.60 | 0.600 | 0.121 | 0.707 | 0.618 | 0.275 |
3 | Afghanistan | 2011 | 3.832 | 7.620 | 0.521 | 51.92 | 0.496 | 0.162 | 0.731 | 0.611 | 0.267 |
4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1944 | Zimbabwe | 2016 | 3.735 | 7.984 | 0.768 | 54.40 | 0.733 | -0.095 | 0.724 | 0.738 | 0.209 |
1945 | Zimbabwe | 2017 | 3.638 | 8.016 | 0.754 | 55.00 | 0.753 | -0.098 | 0.751 | 0.806 | 0.224 |
1946 | Zimbabwe | 2018 | 3.616 | 8.049 | 0.775 | 55.60 | 0.763 | -0.068 | 0.844 | 0.710 | 0.212 |
1947 | Zimbabwe | 2019 | 2.694 | 7.950 | 0.759 | 56.20 | 0.632 | -0.064 | 0.831 | 0.716 | 0.235 |
1948 | Zimbabwe | 2020 | 3.160 | 7.829 | 0.717 | 56.80 | 0.643 | -0.009 | 0.789 | 0.703 | 0.346 |
1949 rows × 11 columns
This is largely fine. However, df_loc contains the final ladder score and the geographical region of each country, both of which we need.
#Restrict to needed columns
df_final = df_happiness[["Country name", "year", "Log GDP per capita", "Social support","Healthy life expectancy at birth", "Freedom to make life choices"]]
#Create dictionaries matching names to scores and location of all countries
add_scores = dict(zip(df_loc['Country name'], df_loc['Ladder score']))
add_locs = dict(zip(df_loc['Country name'], df_loc['Regional indicator']))
#Match names to add new columns
df_final['score'] = df_final['Country name'].map(add_scores)
df_final['location'] = df_final['Country name'].map(add_locs)
/var/folders/mh/dn_3qv0s6dndm54j9k1xw61h0000gn/T/ipykernel_80739/1964612128.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_final['score'] = df_final['Country name'].map(add_scores)
/var/folders/mh/dn_3qv0s6dndm54j9k1xw61h0000gn/T/ipykernel_80739/1964612128.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_final['location'] = df_final['Country name'].map(add_locs)
This isn't ideal syntax, but df_final now looks like this:
df_final
| | Country name | year | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | score | location |
|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2008 | 7.370 | 0.451 | 50.80 | 0.718 | 2.523 | South Asia |
1 | Afghanistan | 2009 | 7.540 | 0.552 | 51.20 | 0.679 | 2.523 | South Asia |
2 | Afghanistan | 2010 | 7.647 | 0.539 | 51.60 | 0.600 | 2.523 | South Asia |
3 | Afghanistan | 2011 | 7.620 | 0.521 | 51.92 | 0.496 | 2.523 | South Asia |
4 | Afghanistan | 2012 | 7.705 | 0.521 | 52.24 | 0.531 | 2.523 | South Asia |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1944 | Zimbabwe | 2016 | 7.984 | 0.768 | 54.40 | 0.733 | 3.145 | Sub-Saharan Africa |
1945 | Zimbabwe | 2017 | 8.016 | 0.754 | 55.00 | 0.753 | 3.145 | Sub-Saharan Africa |
1946 | Zimbabwe | 2018 | 8.049 | 0.775 | 55.60 | 0.763 | 3.145 | Sub-Saharan Africa |
1947 | Zimbabwe | 2019 | 7.950 | 0.759 | 56.20 | 0.632 | 3.145 | Sub-Saharan Africa |
1948 | Zimbabwe | 2020 | 7.829 | 0.717 | 56.80 | 0.643 | 3.145 | Sub-Saharan Africa |
1949 rows × 8 columns
The last thing to do is to encode the string 'location' column as integers, as follows:
The Commonwealth of Independent States refers to former subjects of the USSR.
#Dictionary of name - key correspondences
mapping = {'Southeast Asia': 0, 'South Asia': 1, 'Western Europe': 2, 'North America and ANZ': 3, 'East Asia': 4,
'Middle East and North Africa': 5, 'Central and Eastern Europe': 6, 'Latin America and Caribbean': 7,
'Commonwealth of Independent States': 8, 'Sub-Saharan Africa': 9}
#Add new encoded location column
df_final['loc_encoded'] = df_final['location'].map(mapping)
#Make sure there are no NaN entries
df_final = df_final.fillna(0)
/var/folders/mh/dn_3qv0s6dndm54j9k1xw61h0000gn/T/ipykernel_80739/3752169450.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_final['loc_encoded'] = df_final['location'].map(mapping)
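As an aside, the SettingWithCopyWarning messages above appear because we keep assigning new columns to a slice of df_happiness. If you want to silence them, one option is to take an explicit copy when restricting the columns; this is just a sketch of the same steps, and the resulting dataframe is identical:
#Taking a copy makes df_final an independent dataframe, so column assignment is unambiguous
df_final = df_happiness[["Country name", "year", "Log GDP per capita", "Social support",
                         "Healthy life expectancy at birth", "Freedom to make life choices"]].copy()
df_final['score'] = df_final['Country name'].map(add_scores)
df_final['location'] = df_final['Country name'].map(add_locs)
df_final['loc_encoded'] = df_final['location'].map(mapping)
df_final = df_final.fillna(0)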
And this is our final dataframe:
df_final
| | Country name | year | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | score | location | loc_encoded |
|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2008 | 7.370 | 0.451 | 50.80 | 0.718 | 2.523 | South Asia | 1.0 |
1 | Afghanistan | 2009 | 7.540 | 0.552 | 51.20 | 0.679 | 2.523 | South Asia | 1.0 |
2 | Afghanistan | 2010 | 7.647 | 0.539 | 51.60 | 0.600 | 2.523 | South Asia | 1.0 |
3 | Afghanistan | 2011 | 7.620 | 0.521 | 51.92 | 0.496 | 2.523 | South Asia | 1.0 |
4 | Afghanistan | 2012 | 7.705 | 0.521 | 52.24 | 0.531 | 2.523 | South Asia | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1944 | Zimbabwe | 2016 | 7.984 | 0.768 | 54.40 | 0.733 | 3.145 | Sub-Saharan Africa | 9.0 |
1945 | Zimbabwe | 2017 | 8.016 | 0.754 | 55.00 | 0.753 | 3.145 | Sub-Saharan Africa | 9.0 |
1946 | Zimbabwe | 2018 | 8.049 | 0.775 | 55.60 | 0.763 | 3.145 | Sub-Saharan Africa | 9.0 |
1947 | Zimbabwe | 2019 | 7.950 | 0.759 | 56.20 | 0.632 | 3.145 | Sub-Saharan Africa | 9.0 |
1948 | Zimbabwe | 2020 | 7.829 | 0.717 | 56.80 | 0.643 | 3.145 | Sub-Saharan Africa | 9.0 |
1949 rows × 9 columns
The model we use will need to take the numerical variables in df_final and output a float value of 'score'. scikit-learn includes the LinearRegression model, which can accomplish this.
Before training, the data needs to be split into an x (predictor) set and a y (response) set, and then into training and testing subsets.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
import numpy as np
#Drop unneeded columns from x
x = df_final.drop(['Country name','location','score'],axis=1)
#Take score column as y
y = df_final['score']
#Split into testing and training datasets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
#Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
#Report accuracy as mean absolute error (in points)
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error (MAE):', mae)
Mean Absolute Error (MAE): 0.6854207563152949
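An MAE of about 0.69 points is hard to judge in isolation. One optional sanity check (not part of the original pipeline) is to compare against a baseline that always predicts the mean training score; the linear model should beat it comfortably:
from sklearn.dummy import DummyRegressor

#Baseline that always predicts the mean of the training scores
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
print('Baseline MAE:', mean_absolute_error(y_test, baseline.predict(X_test)))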
And this is how to make a prediction using the model. It takes a pandas dataframe or a 2D numpy array and outputs an array containing a single predicted score.
data = {'year': 2020, 'Log GDP per capita': 7.303, 'Social support': 0.889, 'Healthy life expectancy at birth': 76.49,
'Freedom to make life choices': 0.884, 'loc_encoded':3.0}
#Wrapping the dictionary in a list tells pandas to build a one-row dataframe with the keys as columns
model.predict(pd.DataFrame([data]))
array([5.33405577])
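Since this is a plain linear regression, you can also peek at the learned weights to see roughly how each input nudges the predicted score (a quick optional inspection):
#Pair each feature name with its learned coefficient
print(dict(zip(X_train.columns, model.coef_)))
print('Intercept:', model.intercept_)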
Though for such a small dataset the computational time needed to train the model is negligible, most ML tasks involve larger or more complex models with huge numbers of trainable parameters. In those cases, it is unreasonable to retrain the model every time it is used. Instead, saving the trained model once and loading it later is usually far more efficient. This is quite simple to do with scikit-learn and joblib:
from joblib import dump, load
#Save file using joblib
dump(model, 'happiness_model.joblib')
['happiness_model.joblib']
To load the model back in later, use this code:
model = load('happiness_model.joblib')
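As a quick optional sanity check, the reloaded model should reproduce the prediction from earlier:
#The reloaded model should give the same output as the one trained above
loaded_model = load('happiness_model.joblib')
print(loaded_model.predict(pd.DataFrame([data])))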
We will need to import the necessary libraries and load the model file itself in the API .py file before building the endpoint:
import pandas as pd
import numpy as np
from flask_restful import Api, Resource
from flask import Blueprint, request
from joblib import load
#Load the ML model: replace the file name with whatever yours is
model = load('./api/happiness_model.joblib')
#Initialize Flask API endpoint blueprint
happiness_api = Blueprint('happiness_api', __name__, url_prefix='/api/happiness')
api = Api(happiness_api)
#Use a post request to take data from the frontend as a JSON, then return output value
class happinessAPI:
    class _Predict(Resource):
        def post(self):
            #Get data from frontend
            body = request.get_json()
            if body is not None:
                #Convert frontend JSON output to a pandas dataframe
                data = pd.DataFrame([body])
                data = data.rename(columns={"freedom": "Freedom to make life choices",
                                            "lifespan": "Healthy life expectancy at birth",
                                            "money": "Log GDP per capita",
                                            "social": "Social support",
                                            "location": "loc_encoded"})
                #Predict and return the happiness score (model.predict returns a 1-element array so we need to take a slice)
                score = model.predict(data)[0]
                return {'score': score}, 200
            else:
                return {'message': 'No data provided'}, 400

    #Add endpoint resource for this method
    api.add_resource(_Predict, '/predict')
In main.py, we will need to register the endpoint:
from __init__ import app
from api.happy import happiness_api
from flask import request
from flask_cors import CORS

#Enable CORS for everything and keep a handle on the CORS object
cors = CORS(app)
app.register_blueprint(happiness_api)

#Allow all CORS origins before requests
@app.before_request
def before_request():
    allowed_origin = request.headers.get('Origin')
    if allowed_origin:
        cors._origins = "*"
As a bonus, this code will get rid of most inherent CORS issues, though some browsers (e.g. Chrome) will still raise problems with localhost. Opening up every origin like this is a security risk and you shouldn't do it in any real deployment, but no one cares enough about what we're doing here to cause security problems.
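Once the backend is running, you can sanity-check the endpoint before touching the frontend. Here is a rough sketch using the requests library; it assumes the server is listening on port 8086 (the same port the frontend code below points at), so adjust to match your setup:
import requests

#Keys match what the frontend will send; 'money' is already log10(GDP per capita)
payload = {"year": 2020, "money": 7.303, "social": 0.889,
           "lifespan": 76.49, "freedom": 0.884, "location": 3}
response = requests.post('http://127.0.0.1:8086/api/happiness/predict', json=payload)
print(response.status_code, response.json())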
I don't care what CSS styling you use, but you'll want a set of input boxes or selectors corresponding to the inputs your model takes, an event listener which passes all that stuff to the backend, and a place to put your output statistic. In our case, that means a region selector plus number inputs for per capita GDP, social support, life expectancy, social freedom, and the year.
This is my relatively simple frontend data display:
<h2>How Happy is Your Country?</h2>
<div class="container">
<form id="predictionForm">
<!--Location select: Note that the options pass encoded values.-->
<label for="loc">Where Are You?</label>
<select id="loc" name="loc" required>
<option value="0">Southeast Asia</option>
<option value="1">South Asia</option>
<option value="2">Western Europe</option>
<option value="3">North America</option>
<option value="4">East Asia & Pacific Islands</option>
<option value="5">Middle East & North Africa</option>
<option value="6">Central & Eastern Europe</option>
<option value="7">Latin America & Caribbean</option>
<option value="8">Former USSR</option>
<option value="9">Sub-Saharan Africa</option>
</select>
<!--Per Capita GDP input-->
<label for="wealth">Per Capita Yearly GDP ($):</label>
<input type="number" id="wealth" name="wealth" value = "12647" required>
<!--Social Support input-->
<label for="soc">Social Support (Scale 0 - 1)</label>
<input type="number" id="soc" name="soc" min="0" max="1" value = "0.5" required>
<!--Life Expectancy input-->
<label for="life">Life Expectancy, in years</label>
<input type="number" id="life" name="life" value = "50" required>
<!--Freedom input-->
<label for="freedom">Social Freedom (Scale 0 - 1)</label>
<input type="number" id="freedom" name="freedom" min="0" max="1" value = "0.5" required>
<!--Year input-->
<label for="year">What year is it?</label>
<input type="number" id="year" name="year" value = "2020" required>
<button type="submit">Predict Happiness</button>
</form>
<div id="result"></div>
</div>
As for the JS part of the frontend, we will need an event listener tied to the 'submit' button, a conversion of 'wealth' to its base-10 logarithm, and a fetch call to pass everything to the backend.
The columns need to be renamed to match those in the training set, but that's an easy transformation to do in the backend (the rename in the API code above handles it).
//Adds an event listener for when the submit button is pushed
document.getElementById("predictionForm").addEventListener("submit", function(event) {
event.preventDefault();
predictHappiness();
});
//Get data from the HTML elements
function predictHappiness() {
//Define formData as a JSON containing all of the elements
var formData = {
year: parseInt(document.getElementById("year").value),
//Take base 10 log of "wealth"
money: Math.log10(parseFloat(document.getElementById("wealth").value)),
social: parseFloat(document.getElementById("soc").value),
lifespan: parseFloat(document.getElementById("life").value),
freedom: parseFloat(document.getElementById("freedom").value),
location: parseInt(document.getElementById("loc").value)
};
//Set API url
const apiUrl = "http://127.0.0.1:8086/api/happiness/predict"
//Send data to backend and report prediction
fetch(apiUrl, {
method: "POST",
headers: {
"Content-Type": "application/json",
'Access-Control-Allow-Origin': 'http://127.0.0.1:4100',
'Access-Control-Allow-Credentials': 'true'
},
body: JSON.stringify(formData)
})
.then(response => response.json())
.then(data => {
// Display prediction result
document.getElementById("result").innerHTML = "Your Country's Happiness Score: " + (data.score).toFixed(2);
})
.catch(error => {
console.error("Error:", error);
document.getElementById("surv").innerHTML = "An error occurred. Please try again.";
});
}
Congratulations! If you followed the instructions correctly and didn't catastrophically mess up somewhere, you should now have a working API-integrated machine learning model!