Data Scientist Job Recommendation System — Mini Glassdoor

Anish Nitin Somaiah
6 min read · Dec 14, 2020

Objective

Recommendation systems are a class of data science applications that suggest a product to a user based on that user's (or similar users') activity. This project takes a similar approach: it recommends data science jobs to users based on their activity, using a classification-based machine learning recommender.

Steps involved

  1. Data Extraction
  2. Data Cleaning
  3. Exploratory data analysis
  4. Data Preparation for Modeling
  5. Machine learning classifier model implementation
  6. Optimizing model for better validation
  7. Hyperparameter Tuning
  8. Prediction using the best fit model and checking validation metrics
  9. Recommendation system based on predictions for different users
  10. Productionalizing into a web application (future scope)

Step 1: Data Extraction

For this data science job recommendation project, I extracted 1,000 data science job postings across multiple US states and loaded them into a pandas data frame. The scraper code is adapted from the following article

The data extraction process looks like this:
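The original scraper gist is embedded in the post; below is a minimal sketch of the same idea, assuming Selenium with ChromeDriver. The CSS selectors and the loop structure are placeholders, since Glassdoor's markup changes frequently:

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_jobs(keyword="data scientist", num_jobs=1000):
    """Collect Glassdoor job postings into a DataFrame (illustrative sketch)."""
    driver = webdriver.Chrome()  # ChromeDriver must match your installed Chrome version
    driver.get(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={keyword}")
    jobs = []
    while len(jobs) < num_jobs:
        # Placeholder selectors; inspect the live page for the real ones
        for card in driver.find_elements(By.CSS_SELECTOR, "li.react-job-listing"):
            card.click()  # opens the job's detail pane
            jobs.append({
                "Job Title": driver.find_element(By.CSS_SELECTOR, ".job-title").text,
                "Company Name": driver.find_element(By.CSS_SELECTOR, ".employer-name").text,
                "Location": driver.find_element(By.CSS_SELECTOR, ".location").text,
                "Salary Estimate": driver.find_element(By.CSS_SELECTOR, ".salary-estimate").text,
                "Job Description": driver.find_element(By.CSS_SELECTOR, ".job-description").text,
            })
            if len(jobs) >= num_jobs:
                break
        driver.find_element(By.CSS_SELECTOR, "button.nextButton").click()  # next results page
    driver.quit()
    return pd.DataFrame(jobs)

df = scrape_jobs(num_jobs=1000)
```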

Our extracted data frame looks as follows:

Please note: download the ChromeDriver build that matches the Chrome version installed on your local machine, from https://chromedriver.chromium.org/downloads

Step 2: Data Cleaning

The steps done as part of data cleaning are as follows (a condensed pandas sketch follows the list):

  1. Drop unwanted columns
  2. Check null values
  3. Take average salary from given lower and upper salary limit estimates
  4. Replace State name with state code for readability
  5. Replace “-1” with “Unknown” in multiple columns also for readability
  6. Simplify job titles for EDA and our future model to restrict the number of classes
  7. Extract technologies required from each description for EDA
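
A condensed sketch of these steps in pandas; the column names (Salary Estimate, State, Job Title, Job Description) follow the usual Glassdoor-scrape conventions and may differ slightly from the actual repository:

```python
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")  # assumed filename

# 1. Drop unwanted columns (examples; adjust to the real data frame)
df = df.drop(columns=["Headquarters", "Competitors"], errors="ignore")

# 2. Check null values
print(df.isnull().sum())

# 3. Average salary from estimates such as "$80K-$120K (Glassdoor est.)"
bounds = (df["Salary Estimate"]
          .str.replace(r"[^0-9\-]", "", regex=True)
          .str.split("-", expand=True)
          .apply(pd.to_numeric, errors="coerce"))
df["avg_salary"] = (bounds[0] + bounds[1]) / 2

# 4. Replace state names with state codes for readability
state_codes = {"California": "CA", "New York": "NY", "Texas": "TX"}  # etc.
df["State"] = df["State"].replace(state_codes)

# 5. Replace the scraper's "-1" sentinel with "Unknown" for readability
df = df.replace(["-1", -1], "Unknown")

# 6. Simplify job titles to restrict the number of classes
def simplify_title(title):
    title = title.lower()
    if "scientist" in title:
        return "data scientist"
    if "engineer" in title:
        return "data engineer"
    if "analyst" in title:
        return "analyst"
    return "other"

df["job_simp"] = df["Job Title"].apply(simplify_title)

# 7. Flag technologies mentioned in each job description
desc = df["Job Description"].fillna("")
for tech in ["python", "sql", "spark"]:
    df[tech] = desc.str.contains(tech, case=False).astype(int)
df["r"] = desc.str.contains(r"\bR\b").astype(int)  # word boundary so "R" doesn't match everywhere
```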

Step 3: Exploratory data analysis

Exploratory data analysis serves two purposes here: extracting business intelligence, and understanding the nature of the data set for modeling.

The EDA in this project covers the following:

  1. Bar charts to analyze the frequency distribution of the classes in each categorical column
  2. Pie charts for categorical columns with only two classes (not a recommended visualization practice, but compact here)
  3. Histograms to understand the distribution of the continuous variables, taken as bins
  4. A multivariate correlation heatmap to measure the degree of correlation between the continuous variables

The code and some of the plots are shown below for reference:
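A minimal sketch of this plotting code, assuming matplotlib and seaborn plus the columns from the cleaning step (Rating and Founded are assumed Glassdoor fields):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart: frequency distribution of a multiclass categorical column
df["job_simp"].value_counts().plot(kind="bar", title="Job title frequencies")
plt.show()

# Pie chart for a two-class column (kept for parity with the post)
df["python"].value_counts().plot(kind="pie", autopct="%1.1f%%", title="Python required")
plt.show()

# Histogram of a continuous variable, taken as bins
df["avg_salary"].plot(kind="hist", bins=20, title="Average salary distribution")
plt.show()

# Correlation heatmap across the continuous variables
sns.heatmap(df[["avg_salary", "Rating", "Founded"]].corr(), annot=True, cmap="coolwarm")
plt.show()
```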

Frequency distributions for two-class columns

Data scientist postings show roughly equal demand for Python and SQL, lower demand for R, and growing demand for Spark.

Frequency distributions for multiclass columns

We can see that the classes are heavily imbalanced in their frequencies.

Histograms

Key inferences: most jobs offer an average salary of $100k–$110k, most companies are less than 20 years old (i.e., founded in 2000 or later), and most companies have an average rating of 3.5–4.

Correlation Heatmap

Step 4: Data preparation for Modeling

The data scraped from Glassdoor contains no user activity, and a recommender needs user behavior to learn from.

Hence, I created a user_id column and randomly assigned each posting a value from 1 to 4 to simulate user behavior. As a result, this recommender's accuracy will be lower than that of the traditional recommenders used by big companies such as Netflix and Amazon, which leaves room for improvement if the project is extended or commercialized.
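A one-liner suffices for the synthetic users; using numpy's random generator here is an assumption, as the repository may seed differently:

```python
import numpy as np

rng = np.random.default_rng(42)
df["user_id"] = rng.integers(1, 5, size=len(df))  # random user ids 1-4
```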

Since I am implementing the recommender with classification models, I encode all the categorical variables using a multi-column label encoder.
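The multi-column label encoder is a thin wrapper around scikit-learn's LabelEncoder, along these lines (the column list is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    """Label-encode several categorical DataFrame columns at once."""
    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, df):
        out = df.copy()
        for col in self.columns:
            out[col] = LabelEncoder().fit_transform(out[col].astype(str))
        return out

cat_cols = ["Company Name", "Location", "State", "job_simp"]  # assumed categorical columns
df_encoded = MultiColumnLabelEncoder(cat_cols).fit_transform(df)
```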

Once the encoding is done, the updated correlation heatmap with all the features is as below:

Here we find some correlation between variables, which would normally justify discarding them as features. But since the user data is randomized, all features are kept for the time being.

Finally, the data set is split 80–20 into training and test sets for model training, validation, and prediction.
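A sketch of the split, assuming the synthetic user_id is the prediction target so the classifier learns which user a posting suits (see Step 9); swap in the repository's actual target if it differs:

```python
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=["user_id", "Job Description"], errors="ignore")
y = df_encoded["user_id"]

# 80-20 split, stratified to preserve class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```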

Step 5: Machine learning classifier model implementation

After data preparation, I applied several classification models: logistic regression, K-NN, decision tree, random forest, support vector machine, and naive Bayes. The initial accuracies are shown below:
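A sketch of the comparison loop across the six classifiers (scikit-learn defaults, apart from a higher iteration cap for logistic regression):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```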

We can infer that all the models' accuracies are too low; some likely reasons for this are:

  1. randomized user_id
  2. lack of data
  3. imbalanced classes

Step 6: Optimizing model for better validation

Among the reasons mentioned above, the one within my control is class imbalance, so I applied the SMOTE oversampling technique to generate a resampled training set for my models.
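A sketch with imbalanced-learn's SMOTE; note that only the training split is resampled, so the test metrics still reflect the real class distribution:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Synthesize minority-class samples in the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_res, y_train_res)
print(rf.score(X_test, y_test))  # accuracy on the untouched test set
```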

After this step, my random forest model's classification accuracy rose to 73.5%, so I chose it to fit and validate predictions. However, for a class-imbalanced data set, the better validation metrics are recall (the true positive rate) and the F1 score, which averaged around 73%.

Step 7: Hyperparameter Tuning

For a random forest model, tunable hyperparameters include tree depth, n_estimators, and others. The best parameter values, obtained with GridSearchCV, are as follows:
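A sketch of the grid search; the parameter grid values here are illustrative, not the ones from the repository:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 suits the imbalanced classes
    cv=5,
)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_)
best_rf = grid.best_estimator_
```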

Step 8: Prediction using the best fit model and checking validation metrics

The classification report for the hyperparameter-optimized model is below; the recall and F1 score across all classes averaged around 75%.
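Generating the report is a one-liner with scikit-learn:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, best_rf.predict(X_test)))
```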

Step 9: Recommendation system based on predictions for a specific user

The output of the project looks as follows: for a given user_id, jobs are recommended based on that user's activity:
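A sketch of the recommendation step under the same assumption as Step 4: the model predicts a user_id for each posting, and a user's recommendations are the postings classified under their id, ranked here by company rating (the ranking criterion is an assumption):

```python
def recommend_jobs(user_id, model, jobs, feature_cols, top_n=10):
    """Return the top postings whose predicted user class matches user_id."""
    preds = model.predict(jobs[feature_cols])
    matches = jobs[preds == user_id]
    return matches.sort_values("Rating", ascending=False).head(top_n)

print(recommend_jobs(2, best_rf, df_encoded, list(X.columns)))
```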

The code for the recommendation system can be found in my GitHub repository

Step 10: Future Scope

As next steps, I would like to pursue the following:

  1. Increase the data size and use additional sources such as LinkedIn, which also avoids IP blocking when extracting large volumes of data from a single website.
  2. Deploy the model as an online web portal using Flask and JavaScript, a sort of Glassdoor for data scientists.
  3. Improve model performance by collecting real user behavior instead of randomizing user_id.

Conclusion

This project was fun to work on, employing all the phases of a data science project in a recommendation engine, one of the most popular applications in the data science and machine learning space.

Thanks to Mr. Ken Jee, whose YouTube playlist walks through an end-to-end data science project with a different objective, and to Mr. Andrew Mao for the continuous motivation. Please check out their YouTube channels:

Connect with me on LinkedIn
