SQL-The excel way and data analysis using it
SQL has been one of the primary tools used by various levels in the data science space from data analysts to data engineers to data scientists. Its purposes have also expanded from just maintaining RDBMS to data analysis for quick business intelligence considering the ease of the language.
However, to my knowledge (which can be wrong) many aspiring data scientists who start off with MS Excel and are progressing towards SQL find it difficult to grasp during the first introduction of the query language.
Here are some of the important SQL queries along with their equivalent excel operations just to give that refresher which would be useful not only for data science interviews but for future jobs as well.
The following article is based on the course from the following link. The course is paid but totally worth the money.
JomaClass | SQL for Data Analysis
SQL for Data Analysis is a beginners' course that will teach you everything you need to start using SQL to query data…
And one of the best free resources to learn and practice SQL with real data would be Mode. Check the link below:
Mode SQL Tutorial
Learn to answer questions with data using SQL. No coding experience necessary.
This is just for beginner to intermediate data scientists and does not include high-level SQL operations such as Transaction operations, query optimizations, nested subqueries, etc.
The contents of the article are as follows:
- SELECT with WHERE
- GROUP BY, AGGREGATE, ORDER BY
- JOINS (Left and Inner)
- WINDOW Functions
The datasets that I use to demonstrate are from mode public warehouse available to everyone:
SELECT statements with WHERE clause:
Select statements are used to select all rows of the entire data set or all rows of specific columns or select rows that satisfy a certain condition.
SELECT * FROM <tablename> is equivalent to opening an excel file from your computer where you can see the entire data frame.
Any SELECT statement with a WHERE clause is similar to applying filters over columns in Excel.
For example: if we are to find out all the rows & columns from the sat_scores dataset where the hours studied is 40, the query would be as below
WHERE hrs_studied = 40;
The excel action for this query looks like the image
GROUP BY, AGGREGATE, and ORDER BY:
GROUP BY clause is generally used to accumulate classes in a categorical column, Based on AGGREGATE functions such as sum, count, average, etc which can later be sorted in the desired order based on the same or different columns using ORDER BY
SELECT agg_fun(column_name) FROM <table_name> GROUP BY <categorical_column_name> ORDER BY <desired_column_name> ASC/DESC
the above is the basic syntax and this is equivalent to Pivot tables in Excel where you drop the <categorical_column_name> in the Rows field, agg_fun and <column_name> in the values field where we can change the aggregate function under “value field settings”. Later for ordering the result, we can select the sorting using a filter option.
For example: if we are to find out the total no. of hours spent by the students of each teacher in a dataset and sort it from largest to smallest, the query would be as follows:
GROUP BY 1
ORDER BY 2 DESC;
The equivalent excel action is as below:
JOINS(LEFT & INNER):
JOINS are mainly used in cases where there are 2 or more tables and you have to generate an output combining both the tables. Joins can vary from Just joining two tables to merging Group By, Aggregates, Where filters, and so on.
SELECT table1alias.column name,
FROM table1 table1alias INNER/
LEFT JOIN table2 table2alias
ON table1alias.key = table2alias.key
The above is the general query structure for Joins. Now the most commonly used joins are inner and left joins where the inner join returns rows with common keys in both tables (default) and the left join keeps all the rows of the first table and, blanks from the second table for non-matching keys.
The above query syntax is equivalent to vlookup operation in Excel where we lookup certain values of the first table’s key in the second table, the resulting table will have all values for matching keys and #N/A for non-matching keys.
In the vlookup formula, the lookup value is table1 key, table array is the entire table, column_index_num is the index of the table2 key.
This table with #N/A is an example of left join and if the #N/A rows are removed that output is an example of an inner join.
Note: this can also be executed using IF and MATCH excel formulas but this gives a rough idea of joins and their use.
The main application of joins usually lies in databases that are normalized with different tables containing different types of information linked with primary and foreign keys. Such keys are used to derive certain tables with meaningful information for a use case with the help of the joins.
For example, Let's say we have 2 tables, one has details about players, school name and another table about the school name and the associated conferences. If we need a query that displays player names, school names and conferences for schools. The query looks as follows:
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
The excel operation looks as follows:
In the above example, the conference column from table 2 i.e football_teams was joined to table 1 i.e football_players based on the common key in both the tables i.e school_name
To understand joins visually please refer
Window functions are generally used to calculate rolling aggregates of a column for analysis purposes. One of the main visual applications of the window function is a Pareto chart that signifies the trend of one variable in correspondence to another.
The 2 main keywords used in window functions are
- Over(): this is used to calculate the rolling aggregate of a particular column across the entire table
- Partition by(): this is used to calculate the rolling aggregates of each class in a categorical column
Window functions are basic maths that can be logically applied using formulae and pivot tables in excel which is straightforward.
The following is a sample of how querying can be used to perform data analysis and create useful business intelligence. The nature of the table used is as below and can be found in the public database of the mode platform linked above.
Here we try to find the efficiency of teachers based on the total marks scored by students and total hours studied.
SELECT teacher, SUM(sat_writing + sat_verbal + sat_math) AS total_score, SUM(hrs_studied) AS total_hrs
GROUP BY 1
Here we try to find the relation between hours studied and the percentage scored using a Pareto chart with the help of window functions.
running_sum_score*1.0/total_sum AS score_percent
SUM(total_score) over (
ORDER BY total_score DESC) AS running_sum_score,
SUM(total_score) over() AS total_sum,
SUM(total_hrs) over (
ORDER BY total_score DESC) AS running_sum_hrs,
SUM(total_hrs) over() AS total_hours
SUM(sat_writing + sat_verbal + sat_math) AS total_score,
SUM(hrs_studied) AS total_hrs
GROUP BY 1) t1
ORDER BY 1 DESC
As mentioned earlier, the above article covers essential topics to crack the data science interview and give an understanding of the fundamental concepts of SQL using the language of EXCEL for beginners to understand, and also a small demo of how useful business intelligence can be drawn from SQL. Happy read….