SQL Notes
SQL and BigQuery
BigQuery is a Google web service that lets you run SQL queries on very large datasets.
Getting started with BigQuery
First we import the library and set up a BigQuery client:

```python
from google.cloud import bigquery

client = bigquery.Client()
```
Then we can access the datasets we want. For this example we'll use Hacker News.
We start by:
- constructing a reference to the dataset
- using the get_dataset() method with that reference to fetch the dataset
```python
dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')
dataset = client.get_dataset(dataset_ref)
```
The dataset contains a number of tables, which we can list:
```python
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
```
Table schema
```python
dataset_ref = client.dataset('chicago_crime', project='bigquery-public-data')
table_ref = dataset_ref.table("crime")  # grab the individual table you're looking for
table = client.get_table(table_ref)     # turn it into a Table object
print(table.schema)
```
Each schema field gives us info about a specific column (see the sketch after this list):
- name
- field type
- mode (e.g. NULLABLE, meaning the column allows NULL values)
- description
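For instance, a small sketch that prints these attributes for the first few fields, assuming the `table` object fetched above:

```python
# Inspect the attributes of each SchemaField on the crime table
for field in table.schema[:3]:
    print(field.name, field.field_type, field.mode, field.description)
```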
We can then grab the first 5 rows of the table to check that the data makes sense in relation to the schema:
```python
client.list_rows(table, max_results=5).to_dataframe()
```
- We can fetch however many rows we want using list_rows().
- BigQuery returns a RowIterator object.
- We can quickly convert the RowIterator to a pandas DataFrame using the to_dataframe() method.
We can also use list_rows() to look at specific columns:
```python
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()
```
This grabs just the columns we want; selected_fields expects a list of SchemaField objects, which is why we pass a slice of table.schema.
Return object
If we want to access specific fields, we need to know their names:
table_ref = dataset_ref.table("crime") table = client.get_table(table_ref) fields_for_plotting = ["latitude", "longitude"]
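As a sketch, those names can then be matched against the schema to fetch only the columns we care about (assumes the `table` object from above):

```python
# Pick the SchemaFields whose names match the columns we want to plot
selected = [field for field in table.schema if field.name in fields_for_plotting]
client.list_rows(table, selected_fields=selected, max_results=5).to_dataframe()
```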
SELECT, FROM & WHERE with python
After we use python to get a good understanding of the dataset we can the make some query on the data.
query = """ SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US' """ client = bigquery.Client() query_job = client.query(query) us_cities = query_job.to_dataframe()#get in to panda dataframe
Submitting the query to the dataset returns a pandas DataFrame with the results we asked for.
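From there it behaves like any DataFrame; for example, counting the most frequent cities in the result (standard pandas):

```python
# Five most common cities in the query result
us_cities.city.value_counts().head()
```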
Checking the size of a query before querying
Some datasets are very large, and a query might scan a huge amount of data, so you have to be careful. You can run a preliminary dry-run check before executing:
query = """ SELECT score, title FROM `bigquery-public-data.hacker_news.full` WHERE type = "job" """ dry_run_config = bigquery.QueryJobConfig(dry_run = True) dry_run_query_job = client.query(query, job_config = dry_run_config) print("process {} bytes in ".format(dry_run_query_job.total_bytes_processed))
This returns a rough estimate of how much data the query will process, expressed in bytes.
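Bytes are hard to read at a glance; a small sketch converting the estimate to megabytes:

```python
# Convert the dry-run estimate from bytes to megabytes
mb = dry_run_query_job.total_bytes_processed / 10**6
print("Estimated scan: {:.1f} MB".format(mb))
```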
You can then configure your query to only run if it stays under a given data-size limit:
```python
ONE_MB = 1000 * 1000
# ONE_GB = 1000 * 1000 * 1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)
safe_query_job = client.query(query, job_config=safe_config)
safe_query_job.to_dataframe()  # raises an error because the query exceeds the allowed size
```
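In practice you would guard that call; a minimal sketch (the exact exception class depends on the client version, so this catches broadly):

```python
try:
    results = safe_query_job.to_dataframe()
except Exception as err:  # exact exception type varies by client version
    print("Query aborted:", err)
```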
Getting distinct countries in a query
query = """ SELECT DISTINCT country FROM `bigquery-public-data.openaq.global_air_quality` WHERE units = 'ppm' """
Group By, Having & Count with Python and BigQuery
COUNT()
→ aggregate function → takes many values and returns 1
Other aggregate functions that work the same way are (sketched below):
SUM() AVG() MIN() MAX()
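As a sketch of how they slot in, using the course's pets table (seen in the next example) and a hypothetical Age column:

```python
query = """
        SELECT Animal,
               COUNT(ID) AS num_pets,
               MIN(Age)  AS youngest,   -- Age is a hypothetical column
               MAX(Age)  AS oldest,
               AVG(Age)  AS avg_age
        FROM `bigquery-public-data.pet_records.pets`
        GROUP BY Animal
        """
```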
GROUP BY & HAVING
The two are used together
query = """ SELECT Animal, COUNT(ID) FROM `bigquery-public-data.pet_records.pets` GROUP BY Animal HAVING COUNT(ID) > 1 """
Hacker News example
From the comments table, find the parent comments that received more than 10 replies:
query = """ SELECT parent, COUNT(ID) AS numPosts FROM `bigquery-public-data.hacker_news.comments` GROUP BY parent HAVING COUNT(id) > 10 """
Order By
When we have a table, we may want to order it by some column to better understand the data. For example, ordering by ID:
```sql
SELECT ID, Animal
FROM table
ORDER BY ID
```
SQL is also very good at ordering by different kinds of values, such as alphabetically:
```sql
SELECT ID, Animal
FROM table
ORDER BY Animal
```
You can also sort in descending order:
```sql
SELECT ID, Animal
FROM table
ORDER BY Animal DESC
```
Dates are also a common ordering key. A column might hold a DATE or a DATETIME value.
Extract
We can pull out part of a date with EXTRACT before ordering. For example, we can order by the week of the year (1-53):
```sql
SELECT name, EXTRACT(WEEK FROM Date) AS Week
FROM table
ORDER BY Week
```
Combined with GROUP BY, this lets us aggregate and then order the data:
```sql
SELECT COUNT(ID) AS numEvents,
       EXTRACT(DAYOFWEEK FROM date) AS day
FROM table
GROUP BY day
ORDER BY numEvents
```
You can then build complex queries to get exactly the info you want:
```sql
SELECT i_code, i_name, COUNT(1) AS num_rows
FROM table
WHERE y = 2016
GROUP BY i_code, i_name
HAVING COUNT(1) > 175
ORDER BY COUNT(1) DESC
```
The query above works as follows:
- Select i_code and i_name, count the elements in each group, and call that count num_rows
- Keep only the rows from the year 2016
- Group those rows by unique occurrences of (i_code, i_name)
- Count the elements in each group
- Exclude groups with a count of 175 or less
- Order by the element count of the remaining (non-excluded) groups, descending
AS & With
As can be used to rename some selection in your table selection
```sql
SELECT name_id_etet AS name
FROM myTable
```
This becomes even more powerful with WITH, when you need to reduce the complexity of a query:
```sql
WITH overEighties AS
(
    SELECT name, age, ID
    FROM nursing_home_residents
    WHERE age > 80
)
SELECT name, ID
FROM overEighties
```
This creates a meta table that we perform some operation on to produce a new final table.
This is known as a common table expression (CTE).
We might want the date of each transaction without complicating our code with the timestamp-to-date conversion:
```sql
WITH time AS
(
    SELECT DATE(timestamp) AS transaction_time
    FROM mainTable
)
SELECT COUNT(1) AS transaction,
       transaction_time
FROM time
GROUP BY transaction_time
ORDER BY transaction_time
```
Another example of a more complex query
```sql
WITH RelevantRides AS
(
    SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day,
           trip_seconds,
           trip_miles
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    WHERE trip_start_timestamp > '2017-01-01'
      AND trip_start_timestamp < '2017-07-01'
      AND trip_seconds > 0
      AND trip_miles > 0
)
SELECT COUNT(1) AS num_trips,
       hour_of_day,
       3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
FROM RelevantRides
GROUP BY hour_of_day
ORDER BY avg_mph DESC
```
This first refines the data into the meta table RelevantRides, then performs the aggregate calculations on that table to produce the final result.
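To run it from Python, the same pattern as before applies; a sketch assuming the query above is stored in a string named query (byte limit arbitrary):

```python
# Assumes the RelevantRides query above is stored in a string named `query`
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
rides = client.query(query, job_config=safe_config).to_dataframe()
rides.head()
```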
Joins
Joining is just taking data from two different tables that have some reference to each other.
You start by specifying the two tables you want to pull data from:
```sql
SELECT a.id, a.body, a.owner_user_id
FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
    ON q.id = a.parent_id
WHERE q.tags LIKE '%bigquery%'
```
The INNER JOIN of the two tables is the important part. We alias each table (AS q, AS a) and then select the pieces of data we need.
- ON tells SQL which columns to match when making the join
- LIKE matches text against a pattern; the % wildcard matches any sequence of characters, so '%bigquery%' matches any tags value containing 'bigquery' (see the sketch below)
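A quick sketch of LIKE patterns, reusing the pets table from earlier (the Name column is assumed):

```python
query = """
        SELECT Name
        FROM `bigquery-public-data.pet_records.pets`
        WHERE Name LIKE 'R%'  -- names starting with R; % matches any run of characters
        """
```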
Here is another example of how joining could work:
```sql
WITH experts AS
(
    SELECT a.id, a.body, a.owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
    INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
        ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
)
SELECT COUNT(1) AS number_of_answers,
       owner_user_id AS user_id
FROM experts
GROUP BY user_id
ORDER BY number_of_answers DESC
```
Above we are doing much the same, joining the data we want, but placing it into a meta layer with the WITH statement.
Then we do some more transformations on it.
If we wanted a Python function to fetch this data, the full operation would look something like this:
```python
def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow
    answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the
                Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers,
                 following similar logic to the experts query shown above.
    '''
    # f-string so the topic is interpolated into the LIKE pattern
    my_query = f"""
                SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%{topic}%'
                GROUP BY a.owner_user_id
                """

    # Set up the query (a real service would have good error handling for
    # queries that scan too much data)
    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
    my_query_job = client.query(my_query, job_config=safe_config)

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
```
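Usage would look something like this ('javascript' is just an arbitrary example topic):

```python
client = bigquery.Client()
js_experts = expert_finder('javascript', client)
print(js_experts.head())
```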