Data Scientist Job Recommendation System — Mini Glassdoor

Anish Nitin Somaiah
6 min read · Dec 14, 2020

Objective

Recommendation systems are a class of data science applications that suggest a product to a user based on that user's (or similar users') activity. This project takes a similar approach: it recommends data science jobs to users based on their activity, using a classification-based machine learning recommender.

Steps involved

  1. Data Extraction
  2. Data Cleaning
  3. Exploratory data analysis
  4. Data Preparation for Modeling
  5. Machine learning classifier model implementation
  6. Optimizing model for better validation
  7. Hyperparameter Tuning
  8. Prediction using the best fit model and checking validation metrics
  9. Recommendation system based on predictions for different users
  10. Productionalizing into a web application (future scope)

Step 1: Data Extraction

For this data science job recommendation project, I extracted 1,000 data science job postings across multiple US states and loaded them into a pandas data frame. The scraper code is adapted from the following article

The data extraction process looks like this:
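The original scraper gist is embedded in the post; below is a minimal sketch of the same idea, assuming Selenium with ChromeDriver. The CSS selectors and the loop structure are placeholders, since Glassdoor's markup changes frequently:

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_jobs(keyword="data scientist", num_jobs=1000):
    """Collect Glassdoor job postings into a DataFrame (illustrative sketch)."""
    driver = webdriver.Chrome()  # ChromeDriver must match your installed Chrome version
    driver.get(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={keyword}")
    jobs = []
    while len(jobs) < num_jobs:
        # Placeholder selectors; inspect the live page for the real ones
        for card in driver.find_elements(By.CSS_SELECTOR, "li.react-job-listing"):
            card.click()  # opens the job's detail pane
            jobs.append({
                "Job Title": driver.find_element(By.CSS_SELECTOR, ".job-title").text,
                "Company Name": driver.find_element(By.CSS_SELECTOR, ".employer-name").text,
                "Location": driver.find_element(By.CSS_SELECTOR, ".location").text,
                "Salary Estimate": driver.find_element(By.CSS_SELECTOR, ".salary-estimate").text,
                "Job Description": driver.find_element(By.CSS_SELECTOR, ".job-description").text,
            })
            if len(jobs) >= num_jobs:
                break
        driver.find_element(By.CSS_SELECTOR, "button.nextButton").click()  # next results page
    driver.quit()
    return pd.DataFrame(jobs)

df = scrape_jobs(num_jobs=1000)
```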

Our extracted data frame looks as follows:

Please note: download the ChromeDriver build that matches the Chrome version installed on your local machine, from https://chromedriver.chromium.org/downloads

Step 2: Data Cleaning

The steps done as part of data cleaning are as follows (a condensed pandas sketch follows the list):

  1. Drop unwanted columns
  2. Check null values
  3. Take average salary from given lower and upper salary limit estimates
  4. Replace State name with state code for readability
  5. Replace “-1” with “Unknown” in multiple columns also for readability
  6. Simplify job titles for EDA and our future model to restrict the number of classes
  7. Extract technologies required from each description for EDA
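
A condensed sketch of these steps in pandas; the column names (Salary Estimate, State, Job Title, Job Description) follow the usual Glassdoor-scrape conventions and may differ slightly from the actual repository:

```python
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")  # assumed filename

# 1. Drop unwanted columns (examples; adjust to the real data frame)
df = df.drop(columns=["Headquarters", "Competitors"], errors="ignore")

# 2. Check null values
print(df.isnull().sum())

# 3. Average salary from estimates such as "$80K-$120K (Glassdoor est.)"
bounds = (df["Salary Estimate"]
          .str.replace(r"[^0-9\-]", "", regex=True)
          .str.split("-", expand=True)
          .apply(pd.to_numeric, errors="coerce"))
df["avg_salary"] = (bounds[0] + bounds[1]) / 2

# 4. Replace state names with state codes for readability
state_codes = {"California": "CA", "New York": "NY", "Texas": "TX"}  # etc.
df["State"] = df["State"].replace(state_codes)

# 5. Replace the scraper's "-1" sentinel with "Unknown" for readability
df = df.replace(["-1", -1], "Unknown")

# 6. Simplify job titles to restrict the number of classes
def simplify_title(title):
    title = title.lower()
    if "scientist" in title:
        return "data scientist"
    if "engineer" in title:
        return "data engineer"
    if "analyst" in title:
        return "analyst"
    return "other"

df["job_simp"] = df["Job Title"].apply(simplify_title)

# 7. Flag technologies mentioned in each job description
desc = df["Job Description"].fillna("")
for tech in ["python", "sql", "spark"]:
    df[tech] = desc.str.contains(tech, case=False).astype(int)
df["r"] = desc.str.contains(r"\bR\b").astype(int)  # word boundary so "R" doesn't match everywhere
```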

Step 3: Exploratory data analysis

Exploratory data analysis serves two purposes here: extracting business intelligence, and understanding the nature of the data set for modeling.

The EDA in this project covers the following:

  1. Bar charts to analyze the frequency distribution of the classes in each categorical column
  2. Pie charts for categorical columns with only two classes (not a recommended visualization practice, but compact here)
  3. Histograms to understand the distribution of the continuous variables, taken as bins
  4. A multivariate correlation heatmap to measure the degree of correlation between the continuous variables

The code and some of the plots are shown below for reference:
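A minimal sketch of this plotting code, assuming matplotlib and seaborn plus the columns from the cleaning step (Rating and Founded are assumed Glassdoor fields):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart: frequency distribution of a multiclass categorical column
df["job_simp"].value_counts().plot(kind="bar", title="Job title frequencies")
plt.show()

# Pie chart for a two-class column (kept for parity with the post)
df["python"].value_counts().plot(kind="pie", autopct="%1.1f%%", title="Python required")
plt.show()

# Histogram of a continuous variable, taken as bins
df["avg_salary"].plot(kind="hist", bins=20, title="Average salary distribution")
plt.show()

# Correlation heatmap across the continuous variables
sns.heatmap(df[["avg_salary", "Rating", "Founded"]].corr(), annot=True, cmap="coolwarm")
plt.show()
```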

Frequency distributions for two-class columns

Data scientist postings show roughly equal demand for Python and SQL, lower demand for R, and growing demand for Spark.

Frequency distributions for multiclass columns

We can see that the classes are heavily imbalanced in their frequencies.

Histograms

Key inferences: most jobs offer an average salary of $100k–$110k, most companies are less than 20 years old (i.e., founded in 2000 or later), and most companies have an average rating of 3.5–4.

Correlation Heatmap

Step 4: Data preparation for Modeling

The data scraped from Glassdoor contains no user activity, and a recommender needs user behavior to learn from.

Hence, I created a user_id column and randomly assigned each posting a value from 1 to 4 to simulate user behavior. As a result, this recommender's accuracy will be lower than that of the traditional recommenders used by big companies such as Netflix and Amazon, which leaves room for improvement if the project is extended or commercialized.
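A one-liner suffices for the synthetic users; using numpy's random generator here is an assumption, as the repository may seed differently:

```python
import numpy as np

rng = np.random.default_rng(42)
df["user_id"] = rng.integers(1, 5, size=len(df))  # random user ids 1-4
```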

Since I am implementing the recommender with classification models, I encode all the categorical variables using a multi-column label encoder.
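The multi-column label encoder is a thin wrapper around scikit-learn's LabelEncoder, along these lines (the column list is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    """Label-encode several categorical DataFrame columns at once."""
    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, df):
        out = df.copy()
        for col in self.columns:
            out[col] = LabelEncoder().fit_transform(out[col].astype(str))
        return out

cat_cols = ["Company Name", "Location", "State", "job_simp"]  # assumed categorical columns
df_encoded = MultiColumnLabelEncoder(cat_cols).fit_transform(df)
```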

Once the encoding is done, the updated correlation heatmap with all the features is as below:

Here we find some correlation between variables, which would normally justify discarding them as features. But since the user data is randomized, all features are kept for the time being.

Finally, the data set is split 80–20 into training and test sets for model training, validation, and prediction.
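A sketch of the split, assuming the synthetic user_id is the prediction target so the classifier learns which user a posting suits (see Step 9); swap in the repository's actual target if it differs:

```python
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=["user_id", "Job Description"], errors="ignore")
y = df_encoded["user_id"]

# 80-20 split, stratified to preserve class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```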

Step 5: Machine learning classifier model implementation

After data preparation, I applied several classification models: logistic regression, K-NN, decision tree, random forest, support vector machine, and naive Bayes. The initial accuracies are shown below:
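A sketch of the comparison loop across the six classifiers (scikit-learn defaults, apart from a higher iteration cap for logistic regression):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```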

We can infer that all the models' accuracies are too low; some likely reasons for this are:

  1. randomized user_id
  2. lack of data
  3. imbalanced classes

Step 6: Optimizing model for better validation

Among the reasons mentioned above, the one within my control is class imbalance, so I applied the SMOTE oversampling technique to generate a resampled training set for my models.
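A sketch with imbalanced-learn's SMOTE; note that only the training split is resampled, so the test metrics still reflect the real class distribution:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Synthesize minority-class samples in the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_res, y_train_res)
print(rf.score(X_test, y_test))  # accuracy on the untouched test set
```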

After this step, my random forest model's classification accuracy rose to 73.5%, so I chose it to fit and validate predictions. However, for a class-imbalanced data set, the better validation metrics are recall (the true positive rate) and the F1 score, which averaged around 73%.

Step 7: Hyperparameter Tuning

For a random forest model, tunable hyperparameters include tree depth, n_estimators, and others. The best parameter values, obtained with GridSearchCV, are as follows:
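A sketch of the grid search; the parameter grid values here are illustrative, not the ones from the repository:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 suits the imbalanced classes
    cv=5,
)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_)
best_rf = grid.best_estimator_
```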

Step 8: Prediction using the best fit model and checking validation metrics

The classification report for the hyperparameter-optimized model is below; the recall and F1 score across all classes averaged around 75%.
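Generating the report is a one-liner with scikit-learn:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, best_rf.predict(X_test)))
```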

Step 9: Recommendation system based on predictions for a specific user

The output of the project looks as follows: for a given user_id, jobs are recommended based on that user's activity:
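A sketch of the recommendation step under the same assumption as Step 4: the model predicts a user_id for each posting, and a user's recommendations are the postings classified under their id, ranked here by company rating (the ranking criterion is an assumption):

```python
def recommend_jobs(user_id, model, jobs, feature_cols, top_n=10):
    """Return the top postings whose predicted user class matches user_id."""
    preds = model.predict(jobs[feature_cols])
    matches = jobs[preds == user_id]
    return matches.sort_values("Rating", ascending=False).head(top_n)

print(recommend_jobs(2, best_rf, df_encoded, list(X.columns)))
```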

The code for the recommendation system can be found in my GitHub repository

Step 10: Future Scope

As next steps, I would like to pursue the following:

  1. Increase the data size and use additional sources such as LinkedIn, which also avoids IP blocking when extracting large volumes of data from a single website.
  2. Deploy the model as an online web portal using Flask and JavaScript, a sort of Glassdoor for data scientists.
  3. Improve model performance by collecting real user behavior instead of randomizing user_id.

Conclusion

This project was fun to work on, employing all the phases of a data science project in a recommendation engine, one of the most popular applications in the data science and machine learning space.

Thanks to Mr. Ken Jee, whose YouTube playlist walks through an end-to-end data science project with a different objective, and to Mr. Andrew Mao for the continuous motivation. Please check out their YouTube channels:

Connect with me on LinkedIn
