Introduction

This notebook shows how to store a dataset into a database using and analyze data using SQL and Python. In this lab you will:

  1. Understand a dataset of selected socioeconomic indicators in Chicago
  2. Learn how to store data in an Db2 database on IBM Cloud instance
  3. Solve example problems to practice your SQL skills

Selected Socioeconomic Indicators in Chicago

The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal. This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.

A detailed description of the dataset can be found on the city of Chicago's website, but to summarize, the dataset has the following variables:

  • Community Area Number (ca): Used to uniquely identify each row of the dataset

  • Community Area Name (community_area_name): The name of the region in the city of Chicago

  • Percent of Housing Crowded (percent_of_housing_crowded): Percent of occupied housing units with more than one person per room

  • Percent Households Below Poverty (percent_households_below_poverty): Percent of households living below the federal poverty line

  • Percent Aged 16+ Unemployed (percent_aged_16_unemployed): Percent of persons over the age of 16 years that are unemployed

  • Percent Aged 25+ without High School Diploma (percent_aged_25_without_high_school_diploma): Percent of persons over the age of 25 years without a high school education

  • Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (percent_aged_under_18_or_over_64): (ie. dependents)

  • Per Capita Income (per_capita_income_): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population

  • Hardship Index (hardship_index): Score that incorporates each of the six selected socioeconomic indicators

In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.

Connect to the database

Let us first load the SQL extension and establish a connection with the database

In [2]:
%load_ext sql
In [3]:
dsn_driver = "{IBM DB2 ODBC DRIVER}"
dsn_database = "BLUDB"            # e.g. "BLUDB"
dsn_port = "50000"                # e.g. "50000" 
dsn_protocol = "TCPIP"            # i.e. "TCPIP"

%sql ibm_db_sa://rjz07540:xwc76jpv64jfwb%40v@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Out[3]:
'Connected: rjz07540@BLUDB'

Store the dataset in a Table

In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.
We will first read the dataset source .CSV from the internet into pandas dataframe
Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL "magic" simplifies the process of table creation and writing the data from a pandas dataframe into the table
In [4]:
import pandas
chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
%sql PERSIST chicago_socioeconomic_data
 * ibm_db_sa://rjz07540:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Out[4]:
'Persisted chicago_socioeconomic_data'
You can verify that the table creation was successful by making a basic query like:
In [5]:
%sql SELECT * FROM chicago_socioeconomic_data limit 5;
 * ibm_db_sa://rjz07540:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
Out[5]:
index ca community_area_name hardship_index per_capita_income_ percent_aged_16_unemployed percent_aged_25_without_high_school_diploma percent_aged_under_18_or_over_64 percent_households_below_poverty percent_of_housing_crowded
0 1.0 Rogers Park 39.0 23939 8.7 18.2 27.5 23.6 7.7
1 2.0 West Ridge 46.0 23040 8.8 20.8 38.5 17.2 7.8
2 3.0 Uptown 20.0 35787 8.9 11.8 22.2 24.0 3.8
3 4.0 Lincoln Square 17.0 37524 8.2 13.4 25.5 10.9 3.4
4 5.0 North Center 6.0 57123 5.2 4.5 26.2 7.5 0.3

Problems

Problem 1

How many rows are in the dataset?
In [7]:
#len(chicago_socioeconomic_data)
%sql SELECT count(*) FROM chicago_socioeconomic_data;
 * ibm_db_sa://rjz07540:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
Out[7]:
1
78

Double-click here for the solution.

Problem 2

How many community areas in Chicago have a hardship index greater than 50.0?
In [9]:
#%sql SELECT count(*) FROM chicago_socioeconomic_data WHERE HARDSHIP_INDEX > 50.0;
print(len(chicago_socioeconomic_data[chicago_socioeconomic_data['hardship_index'] > 50.0]))
38

Double-click here for the solution.

Problem 3

What is the maximum value of hardship index in this dataset?
In [11]:
#%sql SELECT MAX(HARDSHIP_INDEX) FROM chicago_socioeconomic_data;chicago_socioeconomic_data['hardship_index'].max()
Out[11]:
98.0

Double-click here for the solution.

Problem 4

Which community area which has the highest hardship index?
In [16]:
max_hardship = chicago_socioeconomic_data['hardship_index'].max()
max_com = chicago_socioeconomic_data[chicago_socioeconomic_data['hardship_index'] == max_hardship]
print(max_com['community_area_name'])
53    Riverdale
Name: community_area_name, dtype: object

Double-click here for the solution.

Problem 5

Which Chicago community areas have per-capita incomes greater than $60,000?
In [22]:
greater_incomes = chicago_socioeconomic_data[chicago_socioeconomic_data['per_capita_income_'] >= 60000]
print(greater_incomes['community_area_name'])
#%sql SELECT community_area_name,per_capita_income_ FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;
5           Lake View
6        Lincoln Park
7     Near North Side
31               Loop
Name: community_area_name, dtype: object

Double-click here for the solution.

Problem 6

Create a scatter plot using the variables per_capita_income_ and hardship_index. Explain the correlation between the two variables.
In [23]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

income_vs_hardship = %sql SELECT per_capita_income_, hardship_index FROM chicago_socioeconomic_data;
plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship.DataFrame())
 * ibm_db_sa://rjz07540:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
/home/jupyterlab/conda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Double-click here for the solution.

Conclusion

Summary

In this lab you learned how to store a real world data set from the internet in a database (Db2 on IBM Cloud), gain insights into data using SQL queries. You also visualized a portion of the data in the database to see what story it tells.

Copyright © 2018 cognitiveclass.ai. This notebook and its source code are released under the terms of the MIT License.