This notebook shows how to store a dataset into a database using and analyze data using SQL and Python. In this lab you will:
The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal. This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.
Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.
A detailed description of the dataset can be found on the city of Chicago's website, but to summarize, the dataset has the following variables:
Community Area Number (ca
): Used to uniquely identify each row of the dataset
Community Area Name (community_area_name
): The name of the region in the city of Chicago
Percent of Housing Crowded (percent_of_housing_crowded
): Percent of occupied housing units with more than one person per room
Percent Households Below Poverty (percent_households_below_poverty
): Percent of households living below the federal poverty line
Percent Aged 16+ Unemployed (percent_aged_16_unemployed
): Percent of persons over the age of 16 years that are unemployed
Percent Aged 25+ without High School Diploma (percent_aged_25_without_high_school_diploma
): Percent of persons over the age of 25 years without a high school education
Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (percent_aged_under_18_or_over_64
): (ie. dependents)
Per Capita Income (per_capita_income_
): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population
Hardship Index (hardship_index
): Score that incorporates each of the six selected socioeconomic indicators
In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.
Let us first load the SQL extension and establish a connection with the database
%load_ext sql
dsn_driver = "{IBM DB2 ODBC DRIVER}"
dsn_database = "BLUDB" # e.g. "BLUDB"
dsn_port = "50000" # e.g. "50000"
dsn_protocol = "TCPIP" # i.e. "TCPIP"
%sql ibm_db_sa://rjz07540:xwc76jpv64jfwb%40v@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
pandas
dataframe into the table¶import pandas
chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
%sql PERSIST chicago_socioeconomic_data
%sql SELECT * FROM chicago_socioeconomic_data limit 5;
#len(chicago_socioeconomic_data)
%sql SELECT count(*) FROM chicago_socioeconomic_data;
Double-click here for the solution.
#%sql SELECT count(*) FROM chicago_socioeconomic_data WHERE HARDSHIP_INDEX > 50.0;
print(len(chicago_socioeconomic_data[chicago_socioeconomic_data['hardship_index'] > 50.0]))
Double-click here for the solution.
#%sql SELECT MAX(HARDSHIP_INDEX) FROM chicago_socioeconomic_data;chicago_socioeconomic_data['hardship_index'].max()
Double-click here for the solution.
max_hardship = chicago_socioeconomic_data['hardship_index'].max()
max_com = chicago_socioeconomic_data[chicago_socioeconomic_data['hardship_index'] == max_hardship]
print(max_com['community_area_name'])
Double-click here for the solution.
greater_incomes = chicago_socioeconomic_data[chicago_socioeconomic_data['per_capita_income_'] >= 60000]
print(greater_incomes['community_area_name'])
#%sql SELECT community_area_name,per_capita_income_ FROM chicago_socioeconomic_data WHERE per_capita_income_ > 60000;
Double-click here for the solution.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
income_vs_hardship = %sql SELECT per_capita_income_, hardship_index FROM chicago_socioeconomic_data;
plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship.DataFrame())
Double-click here for the solution.
per_capita_income_
is related to percent_households_below_poverty
and percent_aged_16_unemployed
. Try to create interesting visualizations!¶Copyright © 2018 cognitiveclass.ai. This notebook and its source code are released under the terms of the MIT License.