|
@@ -1,172 +0,0 @@
|
|
-{
|
|
|
|
- "cells": [
|
|
|
|
- {
|
|
|
|
- "cell_type": "markdown",
|
|
|
|
- "metadata": {},
|
|
|
|
- "source": [
|
|
|
|
- "# Tutorial: Part 3 - Loading and merging data\n",
|
|
|
|
- "\n",
|
|
|
|
- "\n",
|
|
|
|
- "\n",
|
|
|
|
- "## GeoNames dataset\n",
|
|
|
|
- "\n",
|
|
|
|
- "Dataset source: http://download.geonames.org/export/dump/\n",
|
|
|
|
- "\n",
|
|
|
|
- "Features:\n",
|
|
|
|
- "- **geonameid:** integer id of record in geonames database\n",
|
|
|
|
- "- **name:** name of geographical point (utf8) varchar(200)\n",
|
|
|
|
- "- **asciiname:** name of geographical point in plain ascii characters, varchar(200)\n",
|
|
|
|
- "- **alternatenames:** alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)\n",
|
|
|
|
- "- **latitude:** latitude in decimal degrees (wgs84)\n",
|
|
|
|
- "- **longitude:** longitude in decimal degrees (wgs84)\n",
|
|
|
|
- "- **feature class:** see http://www.geonames.org/export/codes.html, char(1)\n",
|
|
|
|
- "- **feature code:** see http://www.geonames.org/export/codes.html, varchar(10)\n",
|
|
|
|
- "- **country code:** ISO-3166 2-letter country code, 2 characters\n",
|
|
|
|
- "- **cc2:** alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters\n",
|
|
|
|
- "- **admin1 code:** fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)\n",
|
|
|
|
- "- **admin2 code:** code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) \n",
|
|
|
|
- "- **admin3 code:** code for third level administrative division, varchar(20)\n",
|
|
|
|
- "- **admin4 code:** code for fourth level administrative division, varchar(20)\n",
|
|
|
|
- "- **population:** bigint (8 byte int) \n",
|
|
|
|
- "- **elevation:** in meters, integer\n",
|
|
|
|
- "- **dem:** digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.\n",
|
|
|
|
- "- **timezone:** the iana timezone id (see file timeZone.txt) varchar(40)\n",
|
|
|
|
- "- **modification date:** date of last modification in yyyy-MM-dd format"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "import pandas\n",
|
|
|
|
- "\n",
|
|
|
|
- "cities = pandas.read_csv('data/cities1000.zip',\n",
|
|
|
|
- " sep='\\t',\n",
|
|
|
|
- " names=['geonameid', 'name', 'asciiname', 'alternatenames',\n",
|
|
|
|
- " 'latitude', 'longitude', 'feature class',\n",
|
|
|
|
- " 'feature code', 'country code', 'cc2',\n",
|
|
|
|
- " 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code',\n",
|
|
|
|
- " 'population', 'elevation', 'dem', 'timezone', 'modification date'],\n",
|
|
|
|
- " index_col='geonameid',\n",
|
|
|
|
- " parse_dates=['modification date'],\n",
|
|
|
|
- " low_memory=False)"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "markdown",
|
|
|
|
- "metadata": {},
|
|
|
|
- "source": [
|
|
|
|
- "<div class=\"alert alert-success\">\n",
|
|
|
|
- " <p><b>EXERCISE:</b> Explore the dataset</p>\n",
|
|
|
|
- " <p>Tasks:\n",
|
|
|
|
- " <ul>\n",
|
|
|
|
- " <li>Visualize the first rows.</li>\n",
|
|
|
|
- " <li>Check if there are missing values.</li>\n",
|
|
|
|
- " <li>Plot the distribution of the <code>population</code>.</li>\n",
|
|
|
|
- " </ul>\n",
|
|
|
|
- " </p>\n",
|
|
|
|
- " <p>Hints:\n",
|
|
|
|
- " <ul>\n",
|
|
|
|
- " <li>You can change the number of rows shown by <code>.head()</code> by passing a number (e.g. <code>.head(3)</code>).</li>\n",
|
|
|
|
- " <li>You can see the number of non-null values per column with <code>.info()</code>, or with <code>.notnull().sum()</code>.</li>\n",
|
|
|
|
- " <li>When you display data with many columns and few rows, you can use <code>.T</code> to transpose it.</li>\n",
|
|
|
|
- " <li>You can visualize a distribution with <code>.boxplot()</code> or with <code>.hist()</code>.</li>\n",
|
|
|
|
- " </ul>\n",
|
|
|
|
- " </p>\n",
|
|
|
|
- "</div>"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "cities.info()"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "cities.nlargest(n=5, columns='population')"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "cities.nlargest(n=20, columns='elevation')"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "markdown",
|
|
|
|
- "metadata": {},
|
|
|
|
- "source": [
|
|
|
|
- "Dataset source: https://github.com/mledoze/countries"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "countries = pandas.read_json('data/countries.json.gz')"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "countries.head()"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": [
|
|
|
|
- "pandas.read_table('http://download.geonames.org/export/dump/featureCodes_en.txt',\n",
|
|
|
|
- " names=('code', 'name', 'description'), index_col='code')"
|
|
|
|
- ]
|
|
|
|
- },
|
|
|
|
- {
|
|
|
|
- "cell_type": "code",
|
|
|
|
- "execution_count": null,
|
|
|
|
- "metadata": {},
|
|
|
|
- "outputs": [],
|
|
|
|
- "source": []
|
|
|
|
- }
|
|
|
|
- ],
|
|
|
|
- "metadata": {
|
|
|
|
- "kernelspec": {
|
|
|
|
- "display_name": "Python 3",
|
|
|
|
- "language": "python",
|
|
|
|
- "name": "python3"
|
|
|
|
- },
|
|
|
|
- "language_info": {
|
|
|
|
- "codemirror_mode": {
|
|
|
|
- "name": "ipython",
|
|
|
|
- "version": 3
|
|
|
|
- },
|
|
|
|
- "file_extension": ".py",
|
|
|
|
- "mimetype": "text/x-python",
|
|
|
|
- "name": "python",
|
|
|
|
- "nbconvert_exporter": "python",
|
|
|
|
- "pygments_lexer": "ipython3",
|
|
|
|
- "version": "3.7.0"
|
|
|
|
- }
|
|
|
|
- },
|
|
|
|
- "nbformat": 4,
|
|
|
|
- "nbformat_minor": 2
|
|
|
|
-}
|
|
|