China Vs US Vs Middle East – Who smokes more?

Nothing helps like practice. I’ve been fiddling around with machine learning and data visualization libraries to become proficient in them, so I wanted to try my hand at coaxing information from data. That’s when I came across the CO2 emission index of countries on Quandl, one of the excellent sources of open data. So I thought: how would it be to take other attributes of the countries and find a relation between those attributes and the CO2 emission index?

Below is the code for scraping the data from Quandl. We are scraping because the data we are interested in comes from the summary page and not from an individual data source per se. At this point, I’m obligated to mention that Quandl provides the ability to download data as CSV for any timeframe and at your choice of frequency; the same can also be obtained through their excellent Python library.
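The scraper below leans on the fact that Quandl’s summary table embeds machine-readable values in ‘data-sort-value’ attributes, so we read those instead of the display text. In miniature, with a hypothetical fragment modelled on that page (the markup here is made up for illustration):

```python
from lxml import html

# hypothetical fragment shaped like one row of Quandl's summary table
doc = html.fromstring(
    '<table><tr>'
    '<td><a data-sort-value="albania">Albania</a></td>'
    '<td><span data-sort-value="3006.94">3,006.94</span></td>'
    '</tr></table>')
row = doc.xpath('//tr')[0]
tds = row.xpath('td')
# pull the clean machine-readable values, not the formatted display text
country = tds[0].xpath('a/@data-sort-value')[0]
value = float(tds[1].xpath('span/@data-sort-value')[0])
print(country, value)  # albania 3006.94
```

Reading the attribute sidesteps having to strip thousands separators from the visible cell text.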

    import csv
    import os
    import sys
    import traceback

    import requests
    from lxml import html

    def main(url, outfile):
        dirname = os.path.dirname(outfile)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        source = requests.get(url)
        if source.status_code != 200:
            print("Got status code", source.status_code, "Exiting")
            sys.exit(1)

        doc = html.fromstring(source.text)
        records = [('country', 'value', 'unit', 'year')]
        table = doc.xpath('//div[@class="datatable"]')[-1]  # the last table contains World data
        rows = table.xpath('.//tr')
        print(len(rows))
        for row in rows:
            try:
                tds = row.xpath('td')
                if not tds:
                    continue
                # the country name lives in either an <a> or a <span>
                country = str((tds[0].xpath('a/@data-sort-value') +
                               tds[0].xpath('span/@data-sort-value'))[0])
                value = str(tds[1].xpath('span/@data-sort-value')[0])
                if value:
                    value = float(value)
                unit = tds[2].text
                year = str(tds[3].xpath('span/@data-sort-value')[0])
                if year:
                    year = int(year)
            except Exception:
                print(''.join(traceback.format_exception(*sys.exc_info())))
                print(html.tostring(row))
            else:
                records.append((country, value, unit, year))
        with open(outfile, 'w') as f:
            csv.writer(f).writerows(records)

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])

The above code can be used to download tabular data for ‘All Countries’ from Quandl. The datasets we are interested in are:
* GDP – All Countries
* Fuelwood Production – All Countries
* Urban Population – All Countries
* Gini Index – All Countries
* Passenger Vehicle – All Countries
* Cellphone Subscription – All Countries
* GDP -PPP – All Countries
* Population – All Countries

And then there is the main dataset that we are going to work on: “CO2 Emissions – All Countries”.

We are going to find which of the above attributes best predicts the CO2 emissions a country would have.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from mpltools import style

    style.use('ggplot')

    from pprint import pprint
    from glob import glob
    csvs = glob('*.csv')
    pprint(csvs)

    ['cellphone.csv',
     'urbanpop.csv',
     'population.csv',
     'gdp.csv',
     'gdp_ppp.csv',
     'fuelwood.csv',
     'co2_emissions.csv',
     'vehicles.csv',
     'gini.csv']

First, let’s read in the CO2 emissions for each country and calculate CO2 emissions per 1000 people. Then let’s bring the rest of the parameters to the same denomination of 1000 people where possible.

    co2_em = pd.read_csv('co2_emissions.csv', index_col=0)
    co2_em[:5]
    population = pd.read_csv('population.csv', index_col=0)
    population[:5]
    data = pd.DataFrame(co2_em['value'])
    data['pop'] = population['value'] * 1000  # population is in millions; convert to thousands
    data['co2_1000'] = data['value'] / data['pop']
    data[:5]

                              value    pop  co2_1000
    country
    afghanistan            6314.574  32017  0.197226
    albania                3006.940   3243  0.927209
    algeria              121311.694  36494  3.324154
    angola                26655.423  20213  1.318727
    antigua and barbuda     462.042     88  5.250477
    cellphone_usage = pd.read_csv('cellphone.csv', index_col=0)  # cellphone connections per 100 persons
    data['cellphone'] = cellphone_usage['value'] * 10  # convert to per 1000 persons
    urbanpop = pd.read_csv('urbanpop.csv', index_col=0)  # urban population as a percentage
    data['urbanpop'] = urbanpop['value']
    gdp = pd.read_csv('gdp.csv', index_col=0)  # GDP in billions
    data['gdp'] = gdp['value'] * 1000000000 / (data['pop'] * 1000)  # GDP per person
    fuelwood = pd.read_csv('fuelwood.csv', index_col=0)  # kilotonnes
    data['fuelwood'] = fuelwood['value'] * 1000 / data['pop']  # fuelwood in tonnes per 1000 people
    gini = pd.read_csv('gini.csv', index_col=0)
    data['gini'] = gini['value']
    vehicles = pd.read_csv('vehicles.csv', index_col=0)  # vehicles per 1000 people
    data['vehicles'] = vehicles['value']
    gdp_ppp = pd.read_csv('gdp_ppp.csv', index_col=0)
    data['gdp_ppp'] = gdp_ppp['value']
    data[:10]

                              value    pop   co2_1000    cellphone  urbanpop           gdp    fuelwood   gini    vehicles    gdp_ppp
    country
    afghanistan            6314.574  32017   0.197226   538.969630   23.8554    621.732205   49.896899  27.82   20.112597   1053.813
    albania                3006.940   3243   0.927209  1084.473347   54.4472   3912.426765  301.031517  34.51   91.979675   8052.178
    algeria              121311.694  36494   3.324154  1033.058644   73.7060   5693.922289    7.184152  35.33   75.872877   7477.069
    angola                26655.423  20213   1.318727   486.100024   59.9062   5873.398308  982.806239  42.66    8.000000   6346.742
    antigua and barbuda     462.042     88   5.250477  1986.178323   29.8672  13363.636364         NaN    NaN  153.276256  18026.662
    argentina            174717.882  41028   4.258504  1425.117584   92.6406  11576.338111  113.385980  44.49         NaN  18112.330
    armenia                4492.075   3366   1.334544  1068.789619   64.1636   2990.493167   11.883541  31.30   94.000000   5838.258
    australia            400194.378  22766  17.578599  1061.928152   89.3372  67723.666872  471.270955  35.19  556.185481  42640.275
    austria               62313.331   8466   7.360422  1612.069881   67.8808  47081.738720  733.052303  29.15  529.338286  42408.575
    azerbaijan            49075.461   9235   5.314073  1074.721324   53.8886   7450.351922   32.658365  33.71   83.849698  10478.231

Pandas’ DataFrame provides a convenient method ‘corr’ that computes the
correlation between every pair of columns in the dataframe.
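To get a feel for what ‘corr’ returns, here is a toy frame (made-up numbers) where one column is a perfect positive and another a perfect negative linear function of the first:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],    # b = 2a    -> correlation +1
                   'c': [4, 3, 2, 1]})   # c = 5 - a -> correlation -1
print(df.corr().round(2))
```

By default ‘corr’ computes Pearson correlation and skips missing pairs, which matters for our table since columns like ‘gini’ and ‘vehicles’ contain NaNs.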

    data.corr()

                  value       pop  co2_1000  cellphone  urbanpop       gdp  fuelwood      gini  vehicles   gdp_ppp
    value      1.000000  0.795840  0.156284   0.017960  0.100557  0.101115 -0.096348 -0.032174  0.070360  0.116638
    pop        0.795840  1.000000 -0.013529  -0.078453 -0.039648 -0.046795 -0.061197 -0.043269 -0.093295 -0.051414
    co2_1000   0.156284 -0.013529  1.000000   0.459959  0.483094  0.699502 -0.098430 -0.228570  0.639473  0.764942
    cellphone  0.017960 -0.078453  0.459959   1.000000  0.537769  0.401796 -0.071408 -0.092684  0.432711  0.505035
    urbanpop   0.100557 -0.039648  0.483094   0.537769  1.000000  0.607351 -0.123386 -0.110808  0.625057  0.670027
    gdp        0.101115 -0.046795  0.699502   0.401796  0.607351  1.000000 -0.043769 -0.314087  0.761361  0.945029
    fuelwood  -0.096348 -0.061197 -0.098430  -0.071408 -0.123386 -0.043769  1.000000  0.120487 -0.169399 -0.054676
    gini      -0.032174 -0.043269 -0.228570  -0.092684 -0.110808 -0.314087  0.120487  1.000000 -0.423783 -0.275202
    vehicles   0.070360 -0.093295  0.639473   0.432711  0.625057  0.761361 -0.169399 -0.423783  1.000000  0.776197
    gdp_ppp    0.116638 -0.051414  0.764942   0.505035  0.670027  0.945029 -0.054676 -0.275202  0.776197  1.000000

As we can see from the correlation matrix, raw CO2 emission levels are heavily correlated with population and almost unrelated to the rest. When we discount the population factor by normalizing to CO2 emissions per 1000 people, we find that it is somewhat correlated with cellphone connections per 1000 people and urban population percentage, and well correlated with gdp, vehicles per 1000 people and gdp_ppp. Now let’s try plotting it.

    import matplotlib.pyplot as plt
    fig = plt.figure(tight_layout=True)
    fields = ['cellphone', 'urbanpop', 'gdp', 'fuelwood', 'gini', 'vehicles', 'gdp_ppp']
    fig.set_size_inches(12, 12)
    for index, field in enumerate(fields):
        # matplotlib labels use mathtext, not HTML, for subscripts
        ax = fig.add_subplot(3, 3, index + 1, xlabel=field,
                             ylabel='CO$_2$ emissions per 1000 people')
        ax.scatter(data[field], data['co2_1000'])

    plt.show()
Factors that correlate the most with CO2 emissions.
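As a quick sanity check on the normalization step, here is a synthetic experiment (random numbers, not the Quandl data): when per-capita intensity is independent of population, raw totals still correlate strongly with population, while the normalized figure does not.

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.uniform(1e3, 1e6, 200)         # population in thousands
intensity = rng.uniform(0.1, 20.0, 200)  # per-1000-people emissions, independent of pop
total = pop * intensity                  # raw emission totals

r_total = np.corrcoef(total, pop)[0, 1]       # strongly positive
r_norm = np.corrcoef(total / pop, pop)[0, 1]  # near zero
print(round(r_total, 2), round(r_norm, 2))
```

This mirrors what the matrix showed for ‘value’ vs ‘co2_1000’ against ‘pop’.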

Now let’s remove the records which have NaN as the value for CO2 emissions.

    clean_data = data[np.isfinite(data['co2_1000'])]
    y = clean_data['co2_1000'].values
    x = clean_data[['cellphone', 'urbanpop', 'gdp', 'fuelwood', 'gini', 'vehicles', 'gdp_ppp']].values.transpose()
    x.shape

(7, 178)
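With ‘y’ and the (7, 178) feature matrix ‘x’ in hand, an obvious next step is an ordinary least-squares fit. Here is a sketch on synthetic stand-in data (none of these numbers come from the real dataset, which would first need its remaining NaNs handled):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 178
gdp_ppp = rng.uniform(1.0, 50.0, n)
urbanpop = rng.uniform(10.0, 90.0, n)
y = 0.3 * gdp_ppp + rng.normal(0.0, 1.0, n)  # co2_1000 driven only by gdp_ppp here

X = np.column_stack([np.ones(n), gdp_ppp, urbanpop])  # intercept + two features
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(2))  # the gdp_ppp coefficient comes back near 0.3
```

The fit recovers a large coefficient for the truly predictive feature and a near-zero one for the irrelevant feature, which is exactly the question we are asking of the real attributes.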

Now let’s see which countries have the highest carbon emissions per 1000 people.

    plt.cool()
    sorted_data = clean_data.sort_values(by=['co2_1000'], ascending=False)
    top40_data = sorted_data[:40]
    labels = top40_data.index
    co2_1000 = top40_data['co2_1000'].values
    plt.figure(figsize=(10, 10))
    colors = ['seashell', 'pink', 'plum', 'PapayaWhip', 'violet', 'PeachPuff', 'MediumOrchid']
    plt.pie(co2_1000, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
    # set the aspect ratio to equal so that the pie is drawn as a circle
    plt.axis('equal')
    plt.show()


CO2 Emissions per 1000 people

It is interesting to see that the countries that contribute the highest carbon footprint per capita are from the Middle East, followed by the developed countries. So which countries contribute the least? As expected, the countries with the smallest carbon footprint are mostly the least developed.

    ', '.join([x.title() for x in sorted_data[-30:].index])

    'Kenya, Laos, Ivory Coast, Togo, Sierra Leone, Haiti, Afghanistan, Guinea-Bissau, Comoros, Myanmar, Timor-Leste, Tanzania, Zambia, Liberia, Mozambique, Nepal, Guinea, Uganda, Burkina Faso, Ethiopia, Eritrea, Madagascar, Niger, Rwanda, Malawi, Central African Republic, Chad, Mali, Congo, Burundi'

Let’s make a map of countries and their carbon footprint. For that we will need to map the countries to their 3-letter abbreviations.

But first, let’s look at the distribution of emissions among the countries. As is apparent from the histogram, the carbon emissions are distributed asymmetrically.

    plt.hist(sorted_data['co2_1000'], color='green', bins=15)

    (array([91, 23, 29, 18,  4,  4,  2,  1,  2,  1,  0,  1,  0,  0,  2]),
     array([  2.17303704e-02,   2.57036546e+00,   5.11900054e+00,
             7.66763563e+00,   1.02162707e+01,   1.27649058e+01,
             1.53135409e+01,   1.78621760e+01,   2.04108111e+01,
             2.29594461e+01,   2.55080812e+01,   2.80567163e+01,
             3.06053514e+01,   3.31539865e+01,   3.57026216e+01,
             3.82512567e+01]),
     <a list of 15 Patch objects>)
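The tuple above is just what ‘plt.hist’ returns: bin counts, bin edges and the drawn patches. The same binning is available without a figure through ‘numpy.histogram’ (illustrative skewed sample, not the real column):

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.exponential(3.0, 180)  # right-skewed, like the emission distribution
counts, edges = np.histogram(values, bins=15)
print(counts.sum(), len(counts), len(edges))  # 180 15 16
```

Every observation lands in exactly one of the 15 bins, and there is always one more edge than bins.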

Biggest Absolute contributors of CO2 emissions

This begs the question, who are the biggest absolute contributors?

    carbon_absolute = clean_data.sort_values(by=['value'], ascending=False)
    total_emission = carbon_absolute['value'].sum()
    print('Total Global Emission is:', total_emission)
    carbon_absolute['percentage'] = 100 * carbon_absolute['value'] / total_emission
    print(carbon_absolute['percentage'][:10])

    Total Global Emission is: 29948374.332
    country
    china           25.667883
    usa             17.695662
    india            6.609456
    russia           5.257000
    japan            3.676774
    germany          2.452885
    iran             2.010311
    canada           1.716078
    south korea      1.700846
    south africa     1.666255
    Name: percentage, dtype: float64
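A cumulative sum over this percentage column shows how much of the total the biggest emitters cover; using the three values printed above:

```python
import pandas as pd

pct = pd.Series([25.667883, 17.695662, 6.609456],
                index=['china', 'usa', 'india'])
print(pct.cumsum().round(2))  # china + usa + india cover ~49.97% together
```
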

    plt.hist(carbon_absolute['percentage'], color='green', bins=15)

    (array([170,   3,   1,   2,   0,   0,   0,   0,   0,   0,   1,   0,   0,
             0,   1]),
     array([  1.71421659e-04,   1.71135221e+00,   3.42253300e+00,
             5.13371379e+00,   6.84489458e+00,   8.55607537e+00,
             1.02672562e+01,   1.19784370e+01,   1.36896177e+01,
             1.54007985e+01,   1.71119793e+01,   1.88231601e+01,
             2.05343409e+01,   2.22455217e+01,   2.39567025e+01,
             2.56678833e+01]),
     <a list of 15 Patch objects>)

It is interesting to see that even though the Middle East has the largest per capita contribution, it is insignificant compared to the giants China, USA and India, because of the sheer population of the latter. China, USA and India together contribute nearly 50% of the world’s carbon emissions. Let’s generate a map of the countries of the world by net emissions and by per-1000-people footprint. The code used for generating the required JavaScript is below the maps.



Total Emission Map

Emission per 1000 People Map

And here is the code for generating the maps.
    import csv
    import json
    from collections import defaultdict

    def generate_total_emission_map():

        def get_fill(carb_em):
            # bucket countries by their share (percent) of global emissions
            if carb_em >= 10:
                return 'VERYHIGH'
            elif carb_em >= 2:
                return 'HIGH'
            elif carb_em >= 1:
                return 'MEDIUM'
            else:
                return 'LOW'

        fills = {'VERYHIGH': '#FF003C',
                 'HIGH': '#FF8A00',
                 'MEDIUM': '#FABE28',
                 'LOW': '#88C100',
                 'VERYLOW': '#00C176',
                 'defaultFill': '#87796F'}
        # abbr.csv maps country names to their 3-letter abbreviations
        matched_abbr = dict(csv.reader(open('abbr.csv', 'r')))
        map_data = defaultdict(lambda: defaultdict(str))
        columns = carbon_absolute.columns

        for index, item in enumerate(carbon_absolute.index):
            abbr = matched_abbr[item]
            map_data[abbr]['rank'] = index + 1
            map_data[abbr]['fillKey'] = get_fill(carbon_absolute.loc[item]['percentage'])
            map_data[abbr]['cname'] = item
            map_data[abbr]['population'] = carbon_absolute.loc[item]['pop'] / 1000
            for col in columns:
                map_data[abbr][col] = "%2.2f" % carbon_absolute.loc[item][col]

        return map_data, fills


    map_data, fills = generate_total_emission_map()
    jscode = open('javascript_template.js').read().replace('#FILLS#', json.dumps(fills))\
            .replace('#DATA#', json.dumps(map_data))\
            .replace('#MAPNAME#', 'total_emission')

    open('total_emission.js', 'w').write(jscode)





    def generate_co2_1000_map():

        def get_fill(carb_em, country):
            # bucket countries by emissions per 1000 people
            if carb_em >= 20:
                print('veryhigh', country)
                return 'VERYHIGH'
            elif carb_em >= 10:
                return 'HIGH'
            elif carb_em >= 5:
                return 'MEDIUM'
            elif carb_em >= 2:
                return 'LOW'
            else:
                return 'VERYLOW'

        fills = {'VERYHIGH': '#FF003C',
                 'HIGH': '#FF8A00',
                 'MEDIUM': '#FABE28',
                 'LOW': '#88C100',
                 'VERYLOW': '#00C176',
                 'defaultFill': '#87796F'}
        matched_abbr = dict(csv.reader(open('abbr.csv', 'r')))
        map_data = defaultdict(lambda: defaultdict(str))
        columns = carbon_absolute.columns

        for index, item in enumerate(carbon_absolute.index):
            abbr = matched_abbr[item]
            map_data[abbr]['rank'] = index + 1
            map_data[abbr]['fillKey'] = get_fill(carbon_absolute.loc[item]['co2_1000'], item)
            map_data[abbr]['cname'] = item
            map_data[abbr]['population'] = carbon_absolute.loc[item]['pop'] / 1000
            for col in columns:
                map_data[abbr][col] = "%2.2f" % carbon_absolute.loc[item][col]

        return map_data, fills


    map_data, fills = generate_co2_1000_map()
    jscode = open('javascript_template.js').read().replace('#FILLS#', json.dumps(fills))\
            .replace('#DATA#', json.dumps(map_data))\
            .replace('#MAPNAME#', 'co2_1000')

    open('co2_1000.js', 'w').write(jscode)
