From 8a1dc5ea34d7b9ab52a257671304231e26f309c7 Mon Sep 17 00:00:00 2001 From: Jeremy Teitelbaum Date: Mon, 16 Apr 2018 07:01:31 -0400 Subject: [PATCH] 2.11.13 in progress --- BDA 2.11.13.ipynb | 438 ++++++++++++++++++++++++++++++++++++++++++++++ BDA 2.11.13.md | 32 ++++ BDA 2.11.13.txt | 9 - 3 files changed, 470 insertions(+), 9 deletions(-) create mode 100644 BDA 2.11.13.ipynb create mode 100644 BDA 2.11.13.md delete mode 100644 BDA 2.11.13.txt diff --git a/BDA 2.11.13.ipynb b/BDA 2.11.13.ipynb new file mode 100644 index 0000000..46867cc --- /dev/null +++ b/BDA 2.11.13.ipynb @@ -0,0 +1,438 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Problem 2.11.13\n", + "\n", + "Discrete data: Table 2.2 gives the number of fatal accidents and deaths on scheduled airline flights per year over a ten-year period. \n", + "We use these data as a numerical example for fitting discrete data models. \n", + "\n", + "1. Assume that the numbers of fatal accidents in each year are independent with a Poisson(theta) distribution. Set a prior distribution for theta and determine the posterior distribution based on the data from 1976 through 1985. Under this model, give a 95% predictive interval for the number of fatal accidents in 1986. You can use the normal approximation to the gamma and Poisson or compute using simulation.\n", + "2. Assume that the numbers of fatal accidents in each year follow independent Poisson distributions with a constant rate and an exposure in each year proportional to the number of passenger miles flown. Set a prior distribution for theta and determine the posterior distribution based on the data for 1976–1985. (Estimate the number of passenger miles flown in each year by dividing the appropriate columns of Table 2.2 and ignoring round-off errors.) Give a 95% predictive interval for the number of fatal accidents in 1986 under the assumption that 8 × 10^11 passenger miles are flown that year.\n", + "3. 
Repeat (1) above, replacing ‘fatal accidents’ with ‘passenger deaths.’\n", + "4. Repeat (2) above, replacing ‘fatal accidents’ with ‘passenger deaths.’\n", + "5. In which of the cases (1)–(4) above does the Poisson model seem more or less reasonable? Why? Discuss based on general principles, without specific reference to the numbers in Table 2.2. Incidentally, in 1986, there were 22 fatal accidents, 546 passenger deaths, and a death rate of 0.06 per 100 million miles flown. We return to this example in Exercises 3.12, 6.2, 6.3, and 8.14.\n", + "\n", + "|Year |Fatal accidents |Passenger deaths |Death rate\n", + "|---|---|---|---| \n", + "|1976 | 24 | 734 | 0.19 \n", + "|1977 |25 |516 |0.12 \n", + "|1978 |31 |754 |0.15 \n", + "|1979 |31 |877 |0.16 \n", + "|1980 |22 |814 |0.14 \n", + "|1981 |21 |362 |0.06 \n", + "|1982 |26 |764 |0.13 \n", + "|1983 |20 |809 |0.13 \n", + "|1984 |16 |223 |0.03 \n", + "|1985 |22 |1066 |0.15 \n", + "\n", + "+ Table 2.2 Worldwide airline fatalities, 1976–1985.\n", + "+ Death rate is passenger deaths per 100 million passenger miles.\n", + "+ Source: Statistical Abstract of the United States.\n", + "\n", + "Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 60). CRC Press. Kindle Edition. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   Deaths  Fatal  Rate  year   Miles
0     734     24  0.19  1976  3863.0
1     516     25  0.12  1977  4300.0
2     754     31  0.15  1978  5027.0
3     877     31  0.16  1979  5481.0
4     814     22  0.14  1980  5814.0
5     362     21  0.06  1981  6033.0
6     764     26  0.13  1982  5877.0
7     809     20  0.13  1983  6223.0
8     223     16  0.03  1984  7433.0
9    1066     22  0.15  1985  7107.0
\n", + "
" + ], + "text/plain": [ + " Deaths Fatal Rate year Miles\n", + "0 734 24 0.19 1976 3863.0\n", + "1 516 25 0.12 1977 4300.0\n", + "2 754 31 0.15 1978 5027.0\n", + "3 877 31 0.16 1979 5481.0\n", + "4 814 22 0.14 1980 5814.0\n", + "5 362 21 0.06 1981 6033.0\n", + "6 764 26 0.13 1982 5877.0\n", + "7 809 20 0.13 1983 6223.0\n", + "8 223 16 0.03 1984 7433.0\n", + "9 1066 22 0.15 1985 7107.0" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from scipy.stats import poisson\n", + "from sklearn.linear_model import LinearRegression\n", + "import pystan\n", + "airline_df=pd.DataFrame(dict({'year':[x for x in range(1976,1986)],'Fatal':[24,25,31,31,22,21,26,20,16,22],'Deaths':[734,516,754,877,814,362,764,809,223,1066],'Rate':[.19,.12,.15,.16,.14,.06,.13,.13,.03,.15]}))\n", + "airline_df.set_index('year')\n", + "airline_df['Miles']=np.round(airline_df['Deaths']/airline_df['Rate'],0)\n", + "airline_df" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_62649727c9442f9bc7cbfce75826d859 NOW.\n" + ] + } + ], + "source": [ + "stan_code='''\n", + "data {\n", + " int deaths[10];\n", + "}\n", + "parameters {\n", + " real theta ; \n", + "}\n", + "model {\n", + "\n", + " // no prior here, what should we use?\n", + " deaths~poisson(theta);\n", + "}\n", + "\n", + "'''\n", + "sm_simple=pystan.StanModel(model_code=stan_code)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [], + "source": [ + "deaths=sm_simple.sampling(data=dict({'deaths':airline_df['Deaths']}))" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Inference for 
Stan model: anon_model_6ace3ad0d872dac6795ff2d2317d4760.\n", + "4 chains, each with iter=2000; warmup=1000; thin=1; \n", + "post-warmup draws per chain=1000, total post-warmup draws=4000.\n", + "\n", + " mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat\n", + "theta 691.94 0.21 8.16 676.54 686.33 691.83 697.4 708.53 1553 1.0\n", + "lp__ 3.8e4 0.02 0.69 3.8e4 3.8e4 3.8e4 3.8e4 3.8e4 1551 1.0\n", + "\n", + "Samples were drawn using NUTS at Sun Apr 15 18:46:11 2018.\n", + "For each parameter, n_eff is a crude measure of effective sample size,\n", + "and Rhat is the potential scale reduction factor on split chains (at \n", + "convergence, Rhat=1).\n" + ] + } + ], + "source": [ + "print(deaths)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the Stan output above, the 95% posterior interval for the Poisson rate theta is (676.5, 708.5)." + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_5173738091a79a17032e9cb1d2d99cd3 NOW.\n" + ] + } + ], + "source": [ + "stan_code='''\n", + "data {\n", + " int deaths[10];\n", + " vector[10] miles;\n", + "}\n", + "parameters {\n", + " real theta ; \n", + "}\n", + "model {\n", + "\n", + " // no prior here, what should we use?\n", + " deaths~poisson(miles*theta);\n", + "}\n", + "\n", + "'''\n", + "sm_weights=pystan.StanModel(model_code=stan_code)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/jet08013/anaconda3/lib/python3.6/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", + " elif np.issubdtype(np.asarray(v).dtype, float):\n" + ] + } + ], + "source": [ + "deaths=sm_weights.sampling(data=dict({'deaths':airline_df['Deaths'],'miles':airline_df['Miles']}))" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Inference for Stan model: anon_model_5173738091a79a17032e9cb1d2d99cd3.\n", + "4 chains, each with iter=2000; warmup=1000; thin=1; \n", + "post-warmup draws per chain=1000, total post-warmup draws=4000.\n", + "\n", + " mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat\n", + "theta 0.12 4.4e-5 1.5e-3 0.12 0.12 0.12 0.12 0.12 1144 1.01\n", + "lp__ 3.8e4 0.02 0.71 3.8e4 3.8e4 3.8e4 3.8e4 3.8e4 1640 1.0\n", + "\n", + "Samples were drawn using NUTS at Mon Apr 16 06:52:19 2018.\n", + "For each parameter, n_eff is a crude measure of effective sample size,\n", + "and Rhat is the potential scale reduction factor on split chains (at \n", + "convergence, Rhat=1).\n" + ] + } + ], + "source": [ + "\n", + "print(deaths)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig,ax=plt.subplots(1)\n", + "ax.scatter(airline_df['Miles'],airline_df['Deaths'] )\n", + "ax.plot(np.linspace(4000,8000,10),(.12)*np.linspace(4000,8000,10))\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}" + ] + }, + "execution_count": 90, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lr=LinearRegression()\n", + "lr.fit(airline_df['Miles'].values.reshape(-1,1),airline_df['Deaths'].values.reshape(-1,1))\n", + "deaths_pred=lr.predict(np.linspace(4000,8000,10).reshape(-1,1))\n", + "ax.plot(np.linspace(4000,8000,10),deaths_pred)\n", + "plt.show()\n", + "lr.get_params()" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "ename": "AttributeError", + "evalue": "'LinearRegression' object has no attribute 'slope'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mlr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mslope\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mAttributeError\u001b[0m: 'LinearRegression' object has no attribute 'slope'" + ] + } + ], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": 
"python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/BDA 2.11.13.md b/BDA 2.11.13.md new file mode 100644 index 0000000..f9126c8 --- /dev/null +++ b/BDA 2.11.13.md @@ -0,0 +1,32 @@ +### Problem 2.11.13 + +Discrete data: Table 2.2 gives the number of fatal accidents and deaths on scheduled airline flights per year over a ten-year period. +We use these data as a numerical example for fitting discrete data models. + +1. Assume that the numbers of fatal accidents in each year are independent with a Poisson(θ) distribution. Set a prior distribution for θ and determine the posterior distribution based on the data from 1976 through 1985. Under this model, give a 95% predictive interval for the number of fatal accidents in 1986. You can use the normal approximation to the gamma and Poisson or compute using simulation. +2. Assume that the numbers of fatal accidents in each year follow independent Poisson distributions with a constant rate and an exposure in each year proportional to the number of passenger miles flown. Set a prior distribution for θ and determine the posterior distribution based on the data for 1976–1985. (Estimate the number of passenger miles flown in each year by dividing the appropriate columns of Table 2.2 and ignoring round-off errors.) Give a 95% predictive interval for the number of fatal accidents in 1986 under the assumption that 8 × 10^11 passenger miles are flown that year. +3. Repeat (1) above, replacing ‘fatal accidents’ with ‘passenger deaths.’ +4. Repeat (2) above, replacing ‘fatal accidents’ with ‘passenger deaths.’ +5. In which of the cases (1)–(4) above does the Poisson model seem more or less reasonable? Why? Discuss based on general principles, +without specific reference to the numbers in Table 2.2. Incidentally, in 1986, there were 22 fatal accidents, +546 passenger deaths, and a death rate of 0.06 per 100 million miles flown. 
We return to this example in Exercises 3.12, 6.2, 6.3, and 8.14. + +|Year |Fatal accidents |Passenger deaths |Death rate +|---|---|---|---| +|1976 | 24 | 734 | 0.19 +|1977 |25 |516 |0.12 +|1978 |31 |754 |0.15 +|1979 |31 |877 |0.16 +|1980 |22 |814 |0.14 +|1981 |21 |362 |0.06 +|1982 |26 |764 |0.13 +|1983 |20 |809 |0.13 +|1984 |16 |223 |0.03 +|1985 |22 |1066 |0.15 + +Table 2.2 Worldwide airline fatalities, 1976–1985. + +*Death rate is passenger deaths per 100 million passenger miles. +Source: Statistical Abstract of the United States.* + +Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 60). CRC Press. Kindle Edition. diff --git a/BDA 2.11.13.txt b/BDA 2.11.13.txt deleted file mode 100644 index 37d948a..0000000 --- a/BDA 2.11.13.txt +++ /dev/null @@ -1,9 +0,0 @@ -13. Discrete data: Table 2.2 gives the number of fatal accidents and deaths on scheduled airline flights per year over a ten-year period. We use these data as a numerical example for fitting discrete data models. (a) Assume that the numbers of fatal accidents in each year ar e independent with a Poisson(θ) distribution. Set a prior distribution forθand determine the posterior distribution based on the data from 1976 through 1985. Under this model, give a 95% predictive interval for the number of fatal accidents in 1986. You can use the normal approximation to the gamma and Poisson or compute using simulation. (b) Assume that the numbers of fatal accidents in each year fo llow independent Poisson distributions with a constant rate and an exposure in each year proportional to the number of passenger miles flown. Set a prior distribution forθand determine the posterior distribution based on the data for 1976–1985. (Estimate the number of passenger miles flown in each year by dividing the appropriate columns of Table 2.2 and ignoring round-off errors.) 
Give a 95% predictive interval for the number of fatal iaccidents in 1986 under the assumption that 8 × 10 11 passenger miles are flown that year. (c) Repeat (a) above, replacing ‘fatal accidents’ with ‘passenger deaths.’ (d) Repeat (b) above, replacing ‘fatal accidents’ with ‘passenger deaths.’ (e) In which of the cases (a)–(d) above does the Poisson model seem more or less reasonable? Why? Discuss based on general principles, without specific reference to the numbers in Table 2.2. Incidentally, in 1986, there were 22 fatal accidents, 546 passenger deaths, and a death rate of 0.06 per 100 million miles flown. We return to this example in Exercises 3.12, 6.2, 6.3, and 8.14. - -Year Fatal Passenger Death accidents deaths rate 1976 24 734 0.19 1977 25 516 0.12 1978 31 754 0.15 1979 31 877 0.16 1980 22 814 0.14 1981 21 362 0.06 1982 26 764 0.13 1983 20 809 0.13 1984 16 223 0.03 1985 22 1066 0.15 Table 2.2 Worldwide airline fatalities, 1976–1985. Death rate is passen ger deaths per 100 million passenger miles. Source: Statistical Abstract of the United States. - -Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 60). CRC Press. Kindle Edition. - -Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 60). CRC Press. Kindle Edition. - -Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B.. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science) (Page 59). CRC Press. Kindle Edition.