# standard imports
import pandas as pd
import numpy as np
# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', None)
First, import the data from the survivor.xlsx
file, calling the respective DataFrames the same as the sheet name but with lowercase and snake case. For example, the sheet called Castaway Details
should be saved as a DataFrame called castaway_details
. Make sure that the data files are in the same folder as your notebook.
Note: You may or may not need to install openpyxl for the code below to work. You can use: $ pip install openpyxl
#pip install openpyxl
# import data from Excel
# setup Filename and Object
fileName = "survivor.xlsx"
xls = pd.ExcelFile(fileName)
# import individual sheets /// next will use copies of each df
castaway_details1 = pd.read_excel(xls, 'Castaway Details')
castaways1 = pd.read_excel(xls, 'Castaways')
challenge_description1 = pd.read_excel(xls, 'Challenge Description')
challenge_results1 = pd.read_excel(xls, 'Challenge Results')
confessionals1 = pd.read_excel(xls, 'Confessionals')
hidden_idols1 = pd.read_excel(xls, 'Hidden Idols')
jury_votes1 = pd.read_excel(xls, 'Jury Votes')
tribe_mapping1 = pd.read_excel(xls, 'Tribe Mapping')
viewers1 = pd.read_excel(xls, 'Viewers')
vote_history1 = pd.read_excel(xls, 'Vote History')
season_summary1 = pd.read_excel(xls, 'Season Summary')
season_palettes1 = pd.read_excel(xls, 'Season Palettes')
tribe_colours1 = pd.read_excel(xls, 'Tribe Colours')
#Copies of the original dataframes
castaway_details = castaway_details1.copy()
castaways = castaways1.copy()
challenge_description = challenge_description1.copy()
challenge_results = challenge_results1.copy()
confessionals = confessionals1.copy()
hidden_idols = hidden_idols1.copy()
jury_votes = jury_votes1.copy()
tribe_mapping = tribe_mapping1.copy()
viewers = viewers1.copy()
vote_history = vote_history1.copy()
season_summary = season_summary1.copy()
season_palettes = season_palettes1.copy()
tribe_colours = tribe_colours1.copy()
## Checking for the shape of each dataframes
print('castaway_details', castaway_details.shape)
print('castaways', castaways.shape)
print('challenge_description', challenge_description.shape)
print('challenge_results', challenge_results.shape)
print('confessionals', confessionals.shape)
print('hidden_idols', hidden_idols.shape)
print('jury_votes',jury_votes.shape)
print('tribe_mapping', tribe_mapping.shape)
print('viewers', viewers.shape)
print('vote_history', vote_history.shape)
print('season_summary', season_summary.shape)
print('season_palettes', season_palettes.shape)
print('tribe_colours', tribe_colours.shape)
castaway_details (608, 10) castaways (762, 20) challenge_description (892, 14) challenge_results (4441, 13) confessionals (6684, 6) hidden_idols (159, 10) jury_votes (933, 7) tribe_mapping (7322, 8) viewers (610, 9) vote_history (4751, 15) season_summary (41, 20) season_palettes (205, 3) tribe_colours (145, 5)
Exercise1: Change every column name of every DataFrame to lowercase and snake case. This is a standard first step for some programmers as lowercase makes it easier to write and snake case makes it easier to copy multiple-word column names.
For example, Castaway Id
should end up being castaway_id
. You should try doing this using a for
loop instead of manually changing the names for each column. It should take you no more than a few lines of code. Use stackoverflow if you need help.
for columns in castaway_details.columns:
castaway_details.columns = castaway_details.columns.str.lower()
castaway_details.columns = castaway_details.columns.str.replace(' ','_')
castaway_details.head()
castaway_id | full_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Sonja Christopher | Sonja | 1937-01-28 | NaT | Female | NaN | NaN | Musician | ENFP |
1 | 2 | B.B. Anderson | B.B. | 1936-01-18 | 2013-10-29 | Male | NaN | NaN | Real Estate Developer | ESTJ |
2 | 3 | Stacey Stillman | Stacey | 1972-08-11 | NaT | Female | NaN | NaN | Attorney | ENTJ |
3 | 4 | Ramona Gray | Ramona | 1971-01-20 | NaT | Female | Black | NaN | Biochemist/Chemist | ISTJ |
4 | 5 | Dirk Been | Dirk | 1976-06-15 | NaT | Male | NaN | NaN | Dairy Farmer | ISFP |
for columns in castaways.columns:
castaways.columns = castaways.columns.str.lower()
castaways.columns = castaways.columns.str.replace(' ','_')
castaways.head()
season_name | season | full_name | castaway_id | castaway | age | city | state | personality_type | episode | day | order | result | jury_status | original_tribe | swapped_tribe | swapped_tribe_2 | merged_tribe | total_votes_received | immunity_idols_won | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Survivor: 41 | 41 | Erika Casupanan | 594 | Erika | 32 | Toronta | Ontario | ENFP | 13 | 26 | 18 | Sole Survivor | NaN | Luvu | NaN | NaN | Via Kana | 2 | 8 |
1 | Survivor: 41 | 41 | Deshawn Radden | 601 | Deshawn | 26 | Miami | Florida | ENTP | 13 | 26 | 17 | Runner-up | NaN | Luvu | NaN | NaN | Via Kana | 7 | 6 |
2 | Survivor: 41 | 41 | Xander Hastings | 597 | Xander | 20 | Chicago | Illinois | INFJ | 13 | 26 | 16 | 2nd runner-up | NaN | Yase | NaN | NaN | Via Kana | 2 | 6 |
3 | Survivor: 41 | 41 | Heather Aldret | 593 | Heather | 52 | Charleston | South Carolina | ISFJ | 13 | 25 | 15 | 15th voted out | 8th jury member | Luvu | NaN | NaN | Via Kana | 4 | 6 |
4 | Survivor: 41 | 41 | Ricard Foye | 596 | Ricard | 31 | Sedro-Woolley | Washington | ENTJ | 13 | 24 | 14 | 14th voted out | 7th jury member | Ua | NaN | NaN | Via Kana | 9 | 5 |
for columns in challenge_description.columns:
challenge_description.columns = challenge_description.columns.str.lower()
challenge_description.columns = challenge_description.columns.str.replace(' ','_')
challenge_description.head()
challenge_id | challenge_name | puzzle | race | precision | endurance | strength | turn_based | balance | food | knowledge | memory | fire | water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CH0001 | Quest for Fire | False | True | False | False | False | False | False | False | False | False | True | True |
1 | CH0002 | Bridging the Gap | False | True | False | False | False | False | False | False | False | False | True | True |
2 | CH0003 | Trail Blazer | False | True | False | False | False | False | False | False | False | False | True | False |
3 | CH0004 | Buggin' Out | False | False | False | False | False | False | False | True | False | False | False | False |
4 | CH0005 | Tucker'd Out | False | True | True | False | False | False | False | True | False | False | False | False |
for columns in challenge_results.columns:
challenge_results.columns = challenge_results.columns.str.lower()
challenge_results.columns = challenge_results.columns.str.replace(' ','_')
challenge_results.head()
season_name | season | episode | day | episode_title | challenge_name | challenge_type | outcome_type | challenge_id | winner_id | winner | winning_tribe | outcome_status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 2.0 | B.B. | Pagong | Winner |
1 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 4.0 | Ramona | Pagong | Winner |
2 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 6.0 | Joel | Pagong | Winner |
3 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 7.0 | Gretchen | Pagong | Winner |
4 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 8.0 | Greg | Pagong | Winner |
for columns in confessionals.columns:
confessionals.columns = confessionals.columns.str.lower()
confessionals.columns = confessionals.columns.str.replace(' ','_')
confessionals.head()
season_name | season | episode | castaway | castaway_id | confessional_count | |
---|---|---|---|---|---|---|
0 | Survivor: 41 | 41 | 1 | JD | 603 | 11 |
1 | Survivor: 41 | 41 | 1 | Evvie | 598 | 9 |
2 | Survivor: 41 | 41 | 1 | Danny | 599 | 8 |
3 | Survivor: 41 | 41 | 1 | Xander | 597 | 5 |
4 | Survivor: 41 | 41 | 1 | Deshawn | 601 | 4 |
for columns in hidden_idols.columns:
hidden_idols.columns = hidden_idols.columns.str.lower()
hidden_idols.columns = hidden_idols.columns.str.replace(' ','_')
hidden_idols.head()
season_name | season | castaway_id | castaway | idol_number | idols_held | votes_nullified | day_found | day_played | legacy_advantage | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Survivor: Guatemala | 11 | 161 | Gary | 1 | 1 | 0.0 | 24.0 | NaN | False |
1 | Survivor: Panama | 12 | 180 | Terry | 1 | 1 | 0.0 | NaN | NaN | False |
2 | Survivor: Cook Islands | 13 | 202 | Yul | 1 | 1 | 0.0 | NaN | NaN | False |
3 | Survivor: Fiji | 14 | 218 | Yau-Man | 1 | 1 | 4.0 | 17.0 | 36.0 | False |
4 | Survivor: Fiji | 14 | 214 | Mookie | 1 | 1 | 0.0 | 20.0 | NaN | False |
for columns in jury_votes.columns:
jury_votes.columns = jury_votes.columns.str.lower()
jury_votes.columns = jury_votes.columns.str.replace(' ','_')
jury_votes.head()
season_name | season | castaway | finalist | vote | castaway_id | finalist_id | |
---|---|---|---|---|---|---|---|
0 | Survivor: 41 | 41 | Heather | Deshawn | 0 | 593 | 601 |
1 | Survivor: 41 | 41 | Ricard | Deshawn | 0 | 596 | 601 |
2 | Survivor: 41 | 41 | Danny | Deshawn | 1 | 599 | 601 |
3 | Survivor: 41 | 41 | Liana | Deshawn | 0 | 608 | 601 |
4 | Survivor: 41 | 41 | Shan | Deshawn | 0 | 606 | 601 |
for columns in tribe_mapping.columns:
tribe_mapping.columns = tribe_mapping.columns.str.lower()
tribe_mapping.columns = tribe_mapping.columns.str.replace(' ','_')
tribe_mapping.head()
season_name | season | episode | day | castaway_id | castaway | tribe | tribe_status | |
---|---|---|---|---|---|---|---|---|
0 | Survivor: Borneo | 1 | 1 | 3 | 2.0 | B.B. | Pagong | Original |
1 | Survivor: Borneo | 1 | 1 | 3 | 4.0 | Ramona | Pagong | Original |
2 | Survivor: Borneo | 1 | 1 | 3 | 6.0 | Joel | Pagong | Original |
3 | Survivor: Borneo | 1 | 1 | 3 | 7.0 | Gretchen | Pagong | Original |
4 | Survivor: Borneo | 1 | 1 | 3 | 8.0 | Greg | Pagong | Original |
for columns in viewers.columns:
viewers.columns = viewers.columns.str.lower()
viewers.columns = viewers.columns.str.replace(' ','_')
viewers.head()
season_name | season | episode_number_overall | episode | episode_title | episode_date | viewers | rating_18_49 | share_18_49 | |
---|---|---|---|---|---|---|---|---|---|
0 | Survivor: 41 | 41 | 597.0 | 1 | A New Era | 2021-09-22 | 6.25 | 1.1 | 8.0 |
1 | Survivor: 41 | 41 | 598.0 | 2 | Juggling Chainsaws | 2021-09-29 | 5.90 | 1.0 | 7.0 |
2 | Survivor: 41 | 41 | 599.0 | 3 | My Million Dollar Mistake | 2021-10-06 | 5.79 | 0.9 | 6.0 |
3 | Survivor: 41 | 41 | 600.0 | 4 | They Hate Me Because They Ain't Me | 2021-10-13 | 5.68 | 0.9 | 7.0 |
4 | Survivor: 41 | 41 | 601.0 | 5 | The Strategist or the Loyalist | 2021-10-20 | 5.62 | 1.0 | 7.0 |
for columns in vote_history.columns:
vote_history.columns = vote_history.columns.str.lower()
vote_history.columns = vote_history.columns.str.replace(' ','_')
vote_history.head()
season_name | season | episode | day | tribe_status | castaway | immunity | vote | nullified | voted_out | order | vote_order | castaway_id | vote_id | voted_out_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Survivor: 41 | 41 | 1 | 3 | Original | Evvie | NaN | Abraham | False | Abraham | 1 | 1 | 598 | 591.0 | 591.0 |
1 | Survivor: 41 | 41 | 1 | 3 | Original | Liana | NaN | Abraham | False | Abraham | 1 | 1 | 608 | 591.0 | 591.0 |
2 | Survivor: 41 | 41 | 1 | 3 | Original | Tiffany | NaN | Abraham | False | Abraham | 1 | 1 | 604 | 591.0 | 591.0 |
3 | Survivor: 41 | 41 | 1 | 3 | Original | Voce | NaN | Abraham | False | Abraham | 1 | 1 | 607 | 591.0 | 591.0 |
4 | Survivor: 41 | 41 | 1 | 3 | Original | Xander | NaN | Abraham | False | Abraham | 1 | 1 | 597 | 591.0 | 591.0 |
for columns in season_summary.columns:
season_summary.columns = season_summary.columns.str.lower()
season_summary.columns = season_summary.columns.str.replace(' ','_')
season_summary.head()
season_name | season | location | country | tribe_setup | full_name | winner_id | winner | runner_ups | final_vote | timeslot | premiered | ended | filming_started | filming_ended | viewers_premier | viewers_finale | viewers_reunion | viewers_mean | rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Survivor: Borneo | 1 | Pulau Tiga, Sabah, Malaysia | Malaysia | Two tribes of eight new players | Richard Hatch | 16 | Richard | Kelly Wiglesworth | 4-3 | Wednesday 8:00 pm | 2000-05-31 | 2000-08-23 | 2000-03-13 | 2000-04-20 | 15.51 | 51.69 | 36.70 | 28.30 | 2.0 |
1 | Survivor: The Australian Outback | 2 | Herbert River at Goshen Station, Queensland, A... | Australia | Two tribes of eight new players | Tina Wesson | 32 | Tina | Colby Donaldson | 4-3 | Thursday 8:00 pm | 2001-01-28 | 2001-05-03 | 2000-10-23 | 2000-12-03 | 45.37 | 36.35 | 28.01 | 29.80 | 1.0 |
2 | Survivor: Africa | 3 | Shaba National Reserve, Kenya | Kenya | Two tribes of eight new players | Ethan Zohn | 48 | Ethan | Kim Johnson | 5-2 | Thursday 8:00 pm | 2001-10-11 | 2002-01-10 | 2001-07-11 | 2001-08-18 | 23.84 | 27.26 | 19.05 | 20.69 | 8.0 |
3 | Survivor: Marquesas | 4 | Nuku Hiva, Marquesas Islands, French Polynesia | Polynesia | Two tribes of eight new players | Vecepia Towery | 64 | Vecepia | Neleh Dennis | 4-3 | Thursday 8:00 pm | 2002-02-28 | 2002-05-19 | 2001-11-12 | 2001-12-20 | 23.19 | 25.87 | 19.05 | 20.77 | 6.0 |
4 | Survivor: Thailand | 5 | Ko Tarutao, Satun Province, Thailand | Thailand | Two tribes of eight new players; picked by the... | Brian Heidik | 80 | Brian | Clay Jordan | 4-3 | Thursday 8:00 pm | 2002-09-19 | 2002-12-19 | 2002-06-10 | 2002-07-18 | 23.05 | 24.08 | 20.43 | 21.21 | 4.0 |
for columns in season_palettes.columns:
season_palettes.columns = season_palettes.columns.str.lower()
season_palettes.columns = season_palettes.columns.str.replace(' ','_')
season_palettes.head()
season_name | season | palette | |
---|---|---|---|
0 | Survivor: 41 | 41 | #ABC9E4 |
1 | Survivor: 41 | 41 | #6990A7 |
2 | Survivor: 41 | 41 | #1D4164 |
3 | Survivor: 41 | 41 | #F3A92B |
4 | Survivor: 41 | 41 | #E8E086 |
for columns in tribe_colours.columns:
tribe_colours.columns = tribe_colours.columns.str.lower()
tribe_colours.columns = tribe_colours.columns.str.replace(' ','_')
tribe_colours.head()
season_name | season | tribe | tribe_colour | tribe_status | |
---|---|---|---|---|---|
0 | Survivor: Borneo | 1 | Pagong | #FFFF05 | original |
1 | Survivor: Borneo | 1 | Rattana | #7CFC00 | merged |
2 | Survivor: Borneo | 1 | Tagi | #FF9900 | original |
3 | Survivor: The Australian Outback | 2 | Barramundi | #FF6600 | merged |
4 | Survivor: The Australian Outback | 2 | Kucha | #32CCFF | original |
Q2: Who is the oldest contestant that is still alive? We want to look at their current age and NOT the age they were when they played their season. Select their row from the castaway_details
DataFrame and save this as Q2
. This should return a DataFrame and the index and missing values should be left as is.
Important: Remember to show your work (i.e. how you found this answer by using Python/Pandas/Numpy). See note in the instructions above. This is true for all of the following questions as well.
### ENTER CODE HERE ###
Q2 = castaway_details[pd.isnull(castaway_details['date_of_death'])]
Q2 = Q2[Q2['date_of_birth'] == Q2['date_of_birth'].min()]
Q2
castaway_id | full_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Sonja Christopher | Sonja | 1937-01-28 | NaT | Female | NaN | NaN | Musician | ENFP |
Q3: What contestant was the oldest at the time of their season? We want to look at their age at the time of the season and NOT their current age. Select their row from the castaway_details
DataFrame and save this as Q3
. This should return a DataFrame and the index and missing values should be left as is.
x = castaways.groupby(['season','castaway_id'])['age'].max()
y = x.loc[x.groupby(level=0).idxmax()]
y = y.to_frame().reset_index()
y = y['castaway_id'].to_list()
Q3 = castaway_details[castaway_details['castaway_id'].isin(y)]
print(Q3)
castaway_id full_name short_name date_of_birth date_of_death \ 13 14 Rudy Boesch Rudy 1928-01-20 2019-11-01 27 28 Rodger Bingham Rodger 1947-07-05 NaT 31 32 Tina Wesson Tina 1960-12-26 NaT 46 47 Kim Johnson Kim J. 1944-09-18 NaT 60 61 Paschal English Paschal 1945-03-05 NaT 74 75 Jake Billingsley Jake 1941-08-21 NaT 86 87 Roger Sexton Roger 1946-09-26 NaT 98 99 Lillian Morris Lillian 1952-04-03 NaT 127 128 Scout Cloud Lee Scout 1944-11-08 NaT 136 137 Willard Smith Willard 1947-11-30 NaT 150 151 Jim Lynch Jim 1942-01-07 NaT 175 176 Bruce Kanegai Bruce 1947-12-17 NaT 179 180 Terry Deitz Terry 1959-10-10 NaT 182 183 Sekou Bunch Sekou 1960-09-09 NaT 205 206 Gary Stritesky Gary 1951-09-16 NaT 217 218 Yau-Man Chan Yau-Man 1952-08-26 NaT 221 222 Steve Morris Chicken 1959-09-02 NaT 248 249 Gillian Larson Gillian 1947-02-24 NaT 257 258 Randy Bailey Randy 1959-03-31 NaT 268 269 Sandy Burgin Sandy 1955-07-01 NaT 282 283 Mike Borassi Mike 1947-03-13 NaT 303 304 Jimmy Johnson Jimmy J. 1943-07-16 NaT 336 337 Phillip Sheppard Phillip 1958-03-12 NaT 350 351 Rick Nelson Rick 1959-10-24 NaT 363 364 Troy Robertson Troyzan 1961-07-30 NaT 365 366 Greg Smith Tarzan 1947-12-12 NaT 379 380 Artis Silvester Artis 1959-04-18 NaT 385 386 Denise Stapley Denise 1971-01-01 NaT 419 420 Trish Hegarty Trish 1965-10-09 NaT 429 430 Dale Wentworth Dale 1959-03-07 NaT 458 459 Carolyn Rivera Carolyn 1961-10-07 NaT 473 474 Joseph Del Campo Joe 1943-07-04 NaT 480 481 Paul Wachter Paul 1963-10-10 NaT 498 499 Katrina Radke Katrina 1970-12-17 NaT 532 533 Angela Perkins Angela 1974-11-02 NaT 541 542 Natalie Cole Natalie 1961-02-26 NaT 556 557 Reem Daly Reem 1971-07-08 NaT 574 575 Tom Laidlaw Tom 1958-04-15 NaT 592 593 Heather Aldret Heather 1969-03-01 NaT gender race ethnicity occupation \ 13 Male NaN NaN Retired Navy SEAL 27 Male NaN NaN Teacher/Farmer 31 Female NaN NaN Personal Nurse;Motivational Speaker 46 Female NaN NaN Retired Teacher 60 Male NaN NaN Judge 74 Male NaN NaN Land Broker 86 Male NaN NaN Construction Company V.P. 98 Female NaN NaN Scout Troop Leader 127 Female NaN NaN Rancher 136 Male NaN NaN Lawyer 150 Male NaN NaN Retired Fire Captain 175 Male Asian Japanese American Art Teacher 179 Male NaN NaN Pilot 182 Male Black NaN Jazz Musician 205 Male NaN NaN School Bus Driver 217 Male Asian Bornean American Computer Engineer 221 Male NaN NaN Chicken Farmer 248 Female NaN NaN Retired Nurse 257 Male NaN NaN Wedding Videographer 268 Female NaN NaN Bus Driver 282 Male NaN NaN Personal Chef 303 Male NaN NaN Former NFL Coach 336 Male Black NaN Former Federal Agent;Software Sales 350 Male NaN NaN Rancher 363 Male NaN NaN Swimsuit Photographer 365 Male NaN NaN Plastic Surgeon 379 Male Black NaN Computer Engineer 385 Female NaN NaN Sex Therapist 419 Male NaN NaN Pilates Trainer 429 Male NaN NaN Farmer 458 Female NaN NaN Corporate Executive 473 Male NaN NaN Former FBI Agent 480 Male NaN NaN Boat Mechanic 498 Female NaN NaN Olympian 532 Female NaN NaN Army Veteran 541 Female Black NaN Publishing CEO 556 Female NaN NaN Sales Representative 574 Male NaN NaN Former NHL Player 592 Female NaN NaN Stay-at-home Mom personality_type 13 ISTJ 27 ISFJ 31 ESFJ 46 ISFJ 60 ISFJ 74 ISFJ 86 ISTJ 98 INFP 127 INFJ 136 INTP 150 ISTJ 175 INFP 179 ESTJ 182 ESFJ 205 ISFJ 217 INTP 221 ISFP 248 ENFP 257 INTJ 268 ESFP 282 ESTP 303 ESFJ 336 ESTJ 350 ISFJ 363 ESFP 365 INTP 379 ISTJ 385 INFJ 419 ENFJ 429 ISTJ 458 ENTP 473 ISTJ 480 ESTJ 498 ENFP 532 ISTJ 541 ESTJ 556 ESTP 574 ESFJ 592 ISFJ
Q4: What contestant played in the most number of seasons? Select their row from the castaway_details
DataFrame and save this as Q4
. This should return a DataFrame and the index and missing values should be left as is.
combined1 = castaway_details.merge(castaways, how='right', left_on= 'castaway_id', right_on='castaway_id')
combined3 = combined1.groupby(['castaway_id','full_name_x'],as_index=False)[['season']].nunique().sort_values(by = 'season', ascending = False)
combined3 = combined3.reset_index(drop = True)
combined3
most_played = combined3['castaway_id'].head(1).to_list()
Q4 = castaway_details[castaway_details['castaway_id'].isin(most_played)]
print(Q4)
castaway_id full_name short_name date_of_birth date_of_death gender \ 54 55 Rob Mariano Boston Rob 1975-12-25 NaT Male race ethnicity occupation personality_type 54 NaN NaN Construction Worker ESTJ
Q5: Create a DataFrame of all the contestants that won their season (aka their final result in the castaways
DataFrame was the 'Sole Survivor'). Call this DataFrame sole_survivor
. Note that contestants may appear more than one time in this DataFrame if they won more than one season. Make sure that the index goes from 0 to n-1 and that the DataFrame is sorted ascending by season number.
The DataFrame should have the same columns, and the columns should be in the same order, as the castaways
DataFrame.
sole_survivor = castaways[castaways['result'] == 'Sole Survivor'].sort_values(by = 'season')
print(sole_survivor)
season_name season full_name \ 746 Survivor: Borneo 1 Richard Hatch 730 Survivor: The Australian Outback 2 Tina Wesson 714 Survivor: Africa 3 Ethan Zohn 698 Survivor: Marquesas 4 Vecepia Towery 682 Survivor: Thailand 5 Brian Heidik 666 Survivor: The Amazon 6 Jenna Morasca 648 Survivor: Pearl Islands 7 Sandra Diaz-Twine 630 Survivor: All-Stars 8 Amber Brkich 612 Survivor: Vanuatu 9 Chris Daugherty 592 Survivor: Palau 10 Tom Westman 574 Survivor: Guatemala 11 Danni Boatwright 558 Survivor: Panama 12 Aras Baskauskas 538 Survivor: Cook Islands 13 Yul Kwon 519 Survivor: Fiji 14 Earl Cole 503 Survivor: China 15 Todd Herzog 483 Survivor: Micronesia 16 Parvati Shallow 465 Survivor: Gabon 17 Robert Crowley 449 Survivor: Tocantins 18 James Thomas Jr. 429 Survivor: Samoa 19 Natalie White 409 Survivor: Heroes vs. Villains 20 Sandra Diaz-Twine 389 Survivor: Nicaragua 21 Jud Birza 369 Survivor: Redemption Island 22 Rob Mariano 349 Survivor: South Pacific 23 Sophie Clarke 331 Survivor: One World 24 Kim Spradlin 313 Survivor: Philippines 25 Denise Stapley 293 Survivor: Caramoan 26 John Cochran 270 Survivor: Blood vs. Water 27 Tyson Apostol 252 Survivor: Cagayan 28 Tony Vlachos 234 Survivor: San Juan del Sur 29 Natalie Anderson 216 Survivor: Worlds Apart 30 Mike Holloway 196 Survivor: Cambodia 31 Jeremy Collins 178 Survivor: Kaoh Rong 32 Michele Fitzgerald 158 Survivor: Millennials vs. Gen X 33 Adam Klein 138 Survivor: Game Changers 34 Sarah Lacina 120 Survivor: Heroes vs. Healers vs. Hustlers 35 Ben Driebergen 100 Survivor: Ghost Island 36 Wendell Holland 80 Survivor: David vs. Goliath 37 Nick Wilson 60 Survivor: Edge of Extinction 38 Chris Underwood 40 Survivor: Island of the Idols 39 Tommy Sheehan 18 Survivor: Winners at War 40 Tony Vlachos 0 Survivor: 41 41 Erika Casupanan castaway_id castaway age city state \ 746 16 Richard 39 Newport Rhode Island 730 32 Tina 40 Knoxville Tennessee 714 48 Ethan 27 Lexington Massachusetts 698 64 Vecepia 36 Hayward California 682 80 Brian 34 Quartz Hill California 666 96 Jenna 21 Bridgeville Pennsylvania 648 112 Sandra 29 Fort Lewis Washington 630 27 Amber 25 Beaver Pennsylvania 612 130 Chris 33 South Vienna Ohio 592 150 Tom 40 Sayville New York 574 166 Danni 29 Tonganoxie Kansas 558 182 Aras 24 Santa Monica California 538 202 Yul 31 San Mateo California 519 221 Earl 35 Santa Monica California 503 237 Todd 22 Pleasant Grove Utah 483 197 Parvati 25 Los Angeles California 465 265 Bob 57 South Portland Maine 449 281 J.T. 24 Mobile Alabama 429 301 Natalie 26 Van Buren Arkansas 409 112 Sandra 35 Fayetteville North Carolina 389 321 Fabio 21 Venice California 369 55 Boston Rob 34 Pensacola Florida 349 353 Sophie 22 Willsboro New York 331 371 Kim 29 San Antonio Texas 313 386 Denise 41 Cedar Rapids Iowa 293 348 Cochran 25 Washington D.C. 270 274 Tyson 34 Provo Utah 252 424 Tony 39 Jersey City New Jersey 234 442 Natalie 28 Edgewater New Jersey 216 460 Mike 38 North Richland Hills Texas 196 433 Jeremy 37 Foxboro Massachusetts 178 478 Michele 24 Freehold New Jersey 158 498 Adam 25 San Francisco California 138 414 Sarah 32 Marion Iowa 120 516 Ben 34 Boise Idaho 100 536 Wendell 33 Philadelphia Pennsylvania 80 556 Nick 27 Williamsburg Kentucky 60 559 Chris 25 Greenville South Carolina 40 590 Tommy 26 Long Beach New York 18 424 Tony 45 Allendale New Jersey 0 594 Erika 32 Toronta Ontario personality_type episode day order result jury_status \ 746 ENTP 14 39 16 Sole Survivor NaN 730 ESFJ 16 42 16 Sole Survivor NaN 714 ISFP 15 39 16 Sole Survivor NaN 698 ISTJ 15 39 16 Sole Survivor NaN 682 ISTP 15 39 16 Sole Survivor NaN 666 ISTP 15 39 16 Sole Survivor NaN 648 ESTP 15 39 18 Sole Survivor NaN 630 ISFP 17 39 18 Sole Survivor NaN 612 ENTP 15 39 18 Sole Survivor NaN 592 ESTJ 15 39 20 Sole Survivor NaN 574 ENFJ 15 39 18 Sole Survivor NaN 558 INFP 16 39 16 Sole Survivor NaN 538 INTJ 16 39 20 Sole Survivor NaN 519 INFJ 15 39 19 Sole Survivor NaN 503 ENFP 15 39 16 Sole Survivor NaN 483 ENFJ 15 39 20 Sole Survivor NaN 465 INTP 14 39 18 Sole Survivor NaN 449 ESTP 15 39 16 Sole Survivor NaN 429 ISFJ 16 39 20 Sole Survivor NaN 409 ESTP 15 39 20 Sole Survivor NaN 389 ESFP 16 39 20 Sole Survivor NaN 369 ESTJ 15 39 20 Sole Survivor NaN 349 INTJ 16 39 20 Sole Survivor NaN 331 INFJ 15 39 18 Sole Survivor NaN 313 INFJ 15 39 18 Sole Survivor NaN 293 INTP 15 39 20 Sole Survivor NaN 270 ESTP 15 39 23 Sole Survivor NaN 252 ESTP 14 39 18 Sole Survivor NaN 234 ESTP 15 39 18 Sole Survivor NaN 216 ESTP 15 39 18 Sole Survivor NaN 196 ESTJ 15 39 20 Sole Survivor NaN 178 ESFP 15 39 18 Sole Survivor NaN 158 ENTP 14 39 20 Sole Survivor NaN 138 ENTJ 14 39 20 Sole Survivor NaN 120 ESFP 14 39 18 Sole Survivor NaN 100 INFJ 14 39 20 Sole Survivor NaN 80 ENTP 14 39 20 Sole Survivor NaN 60 ENTP 14 39 20 Sole Survivor NaN 40 ENFJ 14 39 20 Sole Survivor NaN 18 ESTP 15 39 22 Sole Survivor NaN 0 ENFP 13 26 18 Sole Survivor NaN original_tribe swapped_tribe swapped_tribe_2 merged_tribe \ 746 Tagi NaN NaN Rattana 730 Ogakor NaN NaN Barramundi 714 Boran Boran NaN Moto Maji 698 Maraamu Rotu NaN Soliantu 682 Chuay Gahn NaN NaN Chuay Jai 666 Jaburu Jaburu NaN Jacaré 648 Drake Drake NaN Balboa 630 Chapera Chapera Chapera Chaboga Mogo 612 Lopevi Lopevi NaN Alinta 592 Koror NaN NaN Koror 574 Nakúm Yaxhá NaN Xhakúm 558 Viveros Casaya NaN Gitanos 538 Puka Puka Aitutaki Aitutaki Aitutonga 519 Ravu Moto NaN Bula Bula 503 Fei Long Fei Long NaN Hae Da Fung 483 Malakal Airai NaN Dabu 465 Kota Kota Kota Nobag 449 Jalapao NaN NaN Forza 429 Foa Foa NaN NaN Aiga 409 Villains NaN NaN Yin Yang 389 La Flor La Flor NaN Libertad 369 Ometepe NaN NaN Murlonio 349 Upolu NaN NaN Te Tuna 331 Salani Salani NaN Tikiano 313 Matsing Kalabaw NaN Dangrayne 293 Bikal Bikal NaN Enil Edam 270 Galang Tadhana NaN Kasama 252 Aparri Solana NaN Solarrion 234 Hunahpu Hunahpu NaN Huyopa 216 Escameca Escameca NaN Merica 196 Bayon Bayon Bayon Orkun 178 Gondol Chan Loh NaN Dara 158 Vanua Takali NaN Vinaka 138 Nuku Tavua NaN Maku Maku 120 Levu Yawa NaN Solewa 100 Naviti Naviti Yanuya Lavita 80 David Jabeni NaN Kalokalo 60 Manu NaN NaN Vata 40 Vokai Vokai NaN Lumuwaku 18 Dakal Dakal NaN Koru 0 Luvu NaN NaN Via Kana total_votes_received immunity_idols_won 746 6 4 730 0 2 714 0 4 698 2 4 682 0 8 666 3 7 648 0 3 630 6 6 612 3 6 592 0 12 574 1 5 558 9 4 538 5 6 519 1 5 503 5 4 483 4 7 465 2 8 449 0 5 429 8 1 409 3 4 389 2 8 369 7 9 349 5 6 331 3 6 313 6 1 293 0 7 270 2 8 252 5 6 234 0 6 216 4 7 196 3 7 178 1 7 158 6 6 138 0 4 120 11 4 100 5 8 80 0 4 60 9 1 40 2 6 18 0 9 0 2 8
Q6: Have any contestants won more than one time? If so, select their records from the sole_survivor
DataFrame, sorting the rows by season. Save this as Q6
. If no contestant has won twice, save Q6 as the string None
.
sole_survivor.groupby('full_name', as_index=False)['season'].count().sort_values(by = 'season', ascending = False)
Q6 = sole_survivor[(sole_survivor['full_name'] == 'Sandra Diaz-Twine') | (sole_survivor['full_name'] == 'Tony Vlachos')].sort_values(by = 'season')
print(Q6)
season_name season full_name castaway_id \ 648 Survivor: Pearl Islands 7 Sandra Diaz-Twine 112 409 Survivor: Heroes vs. Villains 20 Sandra Diaz-Twine 112 252 Survivor: Cagayan 28 Tony Vlachos 424 18 Survivor: Winners at War 40 Tony Vlachos 424 castaway age city state personality_type episode \ 648 Sandra 29 Fort Lewis Washington ESTP 15 409 Sandra 35 Fayetteville North Carolina ESTP 15 252 Tony 39 Jersey City New Jersey ESTP 14 18 Tony 45 Allendale New Jersey ESTP 15 day order result jury_status original_tribe swapped_tribe \ 648 39 18 Sole Survivor NaN Drake Drake 409 39 20 Sole Survivor NaN Villains NaN 252 39 18 Sole Survivor NaN Aparri Solana 18 39 22 Sole Survivor NaN Dakal Dakal swapped_tribe_2 merged_tribe total_votes_received immunity_idols_won 648 NaN Balboa 0 3 409 NaN Yin Yang 3 4 252 NaN Solarrion 5 6 18 NaN Koru 0 9
Q7: Using value_counts(), what is the normalized relative frequencies (percentage) breakdown of gender for all the contestants that have played before? Count someone who played in multiple seasons only once. Round the results to 3 decimal places.
Q7 = castaway_details[castaway_details['full_name'] == castaway_details['full_name'].unique()]
Q7 = round(Q7['gender'].value_counts(normalize='True'),3)
print(Q7)
Male 0.502 Female 0.497 Non-binary 0.002 Name: gender, dtype: float64
Q8:
Q8A
. Q8B
. 55.57
).combined4 = castaway_details.merge(sole_survivor, how='inner', left_on= 'castaway_id', right_on='castaway_id')
combined4['gender'].value_counts()
Male 25 Female 16 Name: gender, dtype: int64
Q8A = round(25/41*100,2)
Q8B = round(16/41*100,2)
print(Q8A,Q8B)
60.98 39.02
Q9: What is the average age of contestants when they appeared on the show? Save this as Q9
. Round to nearest integer.
Q9 = round(castaways['age'].mean(),0)
Q9 = int(Q9)
print(Q9)
33
Q10: Let's say we wanted to analyze the types of occupations that make a good winner on Survivor. Create a DataFrame that includes the occupation of every winner (sorted by season). If a contestant won more than one time, the occupation should only appear once for the first time the contestant was on the show. The DataFrame index will be the respective season number and should be called Q10
.
Code Check: The first five rows of the Q10
DataFrame should look similar to this:
season | occupation |
---|---|
1 | Corporate Trainer |
2 | Personal Nurse; Motivational Speaker |
3 | Professional Soccer Player;Social Entrepreneur... |
4 | Office Manager |
5 | Used Car Salesman |
Q10 = combined4.groupby('castaway_id',as_index = False)[['season','occupation']].min()
Q10 = Q10[['season','occupation']].sort_values(by='season').reset_index(drop=True)
print(Q10)
season occupation 0 1 Corporate Trainer 1 2 Personal Nurse;Motivational Speaker 2 3 Professional Soccer Player;Social Entrepreneur... 3 4 Office Manager 4 5 Used Car Salesman 5 6 Swimsuit Model 6 7 Office Assistant;Case Manager 7 8 Administrative Assistant;Director of Marketing... 8 9 Highway Construction Worker 9 10 NYC Firefighter;Motivational Speaker 10 11 Sports Radio Host;Owner of Sideline Chic 11 12 Yoga Instructor;Musician 12 13 Management Consultant;Product Management 13 14 Ad Executive 14 15 Flight Attendant 15 16 Boxer;Charity Organizer;Yoga Teacher/Life Coac... 16 17 Physics Teacher 17 18 Cattle Rancher 18 19 Pharmaceutical Sales 19 21 Student 20 22 Construction Worker 21 23 Medical Student;Healthcare Consultant 22 24 Bridal Shop Owner;Interior Designer 23 25 Sex Therapist 24 26 Harvard Law Student 25 27 Former Professional Cyclist/Missionary;Shop Ma... 26 28 Police Officer 27 29 Crossfit Coach/Physical Therapy Student;CrossF... 28 30 Oil Driller 29 31 Cambridge Firefighter 30 32 Bartender;Business Development Manager 31 33 Homeless Shelter Manager;Keynote Speaker and Host 32 34 Police Officer 33 35 Marine;Real Estate/Stay-at-home Dad 34 36 Furniture Company Owner;Furniture Designer 35 37 Public Defender;Attorney 36 38 District Sales Manager 37 39 4th Grade Teacher 38 41 Communications Manager
Q11: Who played the most total number of days of Survivor? If a contestant appeared on more than one season, you would add their total days for each season together. Save the top five contestants in terms of total days played as a DataFrame and call it Q11
, sorted in descending order by total days played.
The following columns should be included: castaway_id
, full_name
, and total_days_played
where total_days_played
is the sum of all days a contestant played. The index should go from 0 to n-1.
Note: Be careful because on some seasons, the contestant was allowed to come back into the game after being voted off. Take a look at Season 23's contestant Oscar Lusth in the castaways
DataFrame as an example. He was voted out 7th and then returned to the game. He was then voted out 9th and returned to the game a second time. He was then voted out 17th the final time. Be aware of this in your calculations and make sure you are counting the days according to the last time they were voted off or won.
Q11 = castaways.groupby(['season','castaway_id','full_name'],as_index=False)['day'].max().sort_values(by='day',ascending=False)
Q11 = Q11.groupby(['castaway_id','full_name'],as_index=False)['day'].sum().sort_values(by='day',ascending=False)
Q11 = Q11.rename(columns={'day':'total_days_played'})
print(Q11)
castaway_id full_name total_days_played 55 55 Rob Mariano 131 198 197 Parvati Shallow 130 202 201 Oscar Lusth 128 179 179 Cirie Fields 121 112 112 Sandra Diaz-Twine 110 .. ... ... ... 133 133 Jolanda Jones 3 0 1 Sonja Christopher 3 131 131 Jonathan Libby 2 132 132 Wanda Shirk 2 195 195 Candice Cody 1 [611 rows x 3 columns]
Q12A & Q12B: What is the percentage of total extroverts and introverts that have played the game (count players only once even if they have played in more than one season). Save these percentages as Q12A
and Q12B
respectively. Note: Round all percentages to two decimal points and write as a float (example: 55.57).
For more information on personality types check this Wikipedia article.
total = combined1.groupby(['castaway_id','personality_type_x'],as_index=False)['season'].sum()
count_total = total['castaway_id'].count()
extroverts = total[total['personality_type_x'].str.contains('E')]
count_extroverts = extroverts['personality_type_x'].count()
Q12A = round(count_extroverts/count_total*100,2)
print(Q12A)
53.63
Q12B = 100 - Q12A
print(Q12B)
46.37
What is the percentage of total extroverts and introverts that have won the game (count players only once even if they have won more than one season)? Save these percentages as Q13A
and Q13B
respectively. Note: Round all percentages to two decimal points and write as a float (example: 55.57).
count_total = combined4['castaway_id'].nunique()
extroverts = combined4[combined4['personality_type_x'].str.contains('E')]
count_extroverts = extroverts['castaway_id'].nunique()
Q13A = round(count_extroverts/count_total*100,2)
print(Q13A)
61.54
Q13B = 100 - Q13A
print(Q13B)
38.46
Q14: Which contestants have never received a tribal council vote (i.e. a vote to be voted out of the game as shown in the vote_id
column in the vote_history
DataFrame)? Note that there are various reasons for a contestant to not receive a tribal vote: they quit, made it to the end, medical emergency, etc. Select their rows from the castaway_details
DataFrame and save this as Q14
in ascending order by castaway_id
. This should return a DataFrame and the index and missing values should be left as is.
not_voted = castaways.groupby('castaway_id',as_index=False)['total_votes_received'].sum()
not_voted = not_voted[not_voted['total_votes_received'] == 0]
not_voted_list = not_voted['castaway_id'].to_list()
Q14 = castaway_details[castaway_details['castaway_id'].isin(not_voted_list)]
print(Q14)
castaway_id full_name short_name date_of_birth date_of_death \ 21 22 Michael Skupin Michael 1962-01-29 NaT 60 61 Paschal English Paschal 1945-03-05 NaT 79 80 Brian Heidik Brian 1968-03-09 NaT 130 131 Jonathan Libby Jonathan 1981-09-05 NaT 131 132 Wanda Shirk Wanda 1949-08-24 NaT 205 206 Gary Stritesky Gary 1951-09-16 NaT 241 242 Kathleen Sleckman Kathy 1962-08-14 NaT 282 283 Mike Borassi Mike 1947-03-13 NaT 313 314 Kelly Shinn Purple Kelly 1990-05-05 NaT 353 354 Kourtney Moon Kourtney 1982-02-27 NaT 374 375 Dana Lambert Dana 1979-12-13 NaT 384 385 Lisa Whelchel Lisa 1963-05-29 NaT 411 412 Lindsey Ogle Lindsey 1983-09-04 NaT 467 468 Neal Gottlieb Neal 1977-02-03 NaT 536 537 Pat Cusack Pat 1977-02-25 NaT 539 540 Bi Nguyen Bi 1989-10-30 NaT 552 553 Kara Kay Kara 1987-11-20 NaT 569 570 Gavin Whitson Gavin 1994-07-20 NaT gender race ethnicity \ 21 Male NaN NaN 60 Male NaN NaN 79 Male NaN NaN 130 Male NaN NaN 131 Female NaN NaN 205 Male NaN NaN 241 Female NaN NaN 282 Male NaN NaN 313 Female NaN NaN 353 Female NaN NaN 374 Female NaN NaN 384 Female NaN NaN 411 Female NaN NaN 467 Male NaN NaN 536 Male NaN NaN 539 Female Asian NaN 552 Female NaN NaN 569 Male NaN NaN occupation personality_type 21 Software Publisher;Part-Time Professional Speaker ESFJ 60 Judge ISFJ 79 Used Car Salesman ISTP 130 Sales & Marketing Associate ISTP 131 English Teacher ENFP 205 School Bus Driver ISFJ 241 Golf Course Vendor ENFP 282 Personal Chef ESTP 313 Nursing Student ENFP 353 Motorcycle Repair ISFP 374 Cosmetologist ISTP 384 Former TV Teen Star INFP 411 Hairstylist ESFP 467 Ice Cream Entrepreneur INTP 536 Maintenance Manager ESTP 539 MMA Fighter ISFP 552 Realtor ENFJ 569 YMCA Program Director ISFJ
Q15: What contestant has won the most number of challenges? Select their row from the castaway_details
DataFrame and save this as Q15
. This should return a DataFrame and the index and missing values should be left as is.
x = challenge_results.groupby(['winner_id','winner'],as_index = False)['winning_tribe'].count().sort_values(by = 'winning_tribe',ascending=False)
combined6 = tribe_mapping.merge(x, how='right', left_on= 'castaway', right_on='winner')
most_wins = combined6.groupby('castaway_id',as_index=False)['winning_tribe'].max().sort_values(by='winning_tribe',ascending=False).head(1)
most_wins = most_wins['castaway_id'].to_list()
Q15 = castaway_details[castaway_details['castaway_id'].isin(most_wins)]
print(Q15)
castaway_id full_name short_name date_of_birth date_of_death gender \ 200 201 Oscar Lusth Ozzy 1981-08-23 NaT Male race ethnicity occupation \ 200 Mexican American Hispanic or Latino Waiter;Photographer personality_type 200 ISFP
Q16: What challenge has been played the most number of times in all seasons? To make it easier, a challenge should only count one time for each season it was played even if it was played more than once during a season (although I don't think that is the case for any of these challenges). Select their row from the challenge_description
DataFrame and save this as Q16
. This should return a DataFrame and the index and missing values should be left as is.
challenge_results['challenge_id'].value_counts()
CH0631 157 CH0004 84 CH0388 75 CH0765 73 CH0430 70 ... CH0462 1 CH0764 1 CH0695 1 CH0361 1 CH0606 1 Name: challenge_id, Length: 506, dtype: int64
Q16 = challenge_description[challenge_description['challenge_id'] == 'CH0631']
print(Q16)
challenge_id challenge_name puzzle race precision endurance \ 548 CH0631 Blind Leading the Blind False True False False strength turn_based balance food knowledge memory fire water 548 False False False False False False False False
Q17: Let's see if the use of hidden immunity idols has increased or decreased over the seasons. Create a Series of the number of hidden idols held per season. The season number should be the index and the values should be the sum of the number of idols that were held. Save this as Q17
, sorted by season in ascending order.
Q17 = hidden_idols.notna()
Q17 = hidden_idols.groupby('season')['idols_held'].sum()
print(Q17)
season 11 1 12 1 13 1 14 4 15 3 16 4 17 4 18 3 19 4 20 11 21 4 22 3 23 2 24 3 25 3 26 7 27 3 28 6 29 5 30 3 31 4 32 5 33 7 34 8 35 10 36 9 37 7 38 8 39 13 40 10 Name: idols_held, dtype: int64
Q18: Which contestant held the most number of hidden immunity idols in a single season? Select their row from the castaway_details
DataFrame and save this as Q18
. This should return a DataFrame and the index and missing values should be left as is.
hidden_idols.groupby(['season','castaway_id'],as_index = False)['idols_held'].sum().sort_values(by = 'idols_held',ascending = False).head(1)
season | castaway_id | idols_held | |
---|---|---|---|
102 | 38 | 560 | 4 |
Q18 = castaway_details[castaway_details['castaway_id'] == 560]
print(Q18)
castaway_id full_name short_name date_of_birth date_of_death gender \ 559 560 Rick Devens Rick 1984-04-05 NaT Male race ethnicity occupation personality_type 559 NaN NaN Morning News Anchor ENTP
Q19: What was the largest number of days between when a hidden immunity idol was found and played. Don't count instances with missing values in days found or the days played column. Save the largest number of days as Q19
(as an int).
highest_days = hidden_idols[hidden_idols['day_found'].notna()]
highest_days = hidden_idols[hidden_idols['day_played'].notna()]
highest_days = highest_days.copy()
highest_days['difference'] = hidden_idols['day_played'] - hidden_idols['day_found']
Q19 = highest_days['difference'].max()
Q19 = int(Q19)
print(Q19)
33
Q20: Let's find out which finalist received zero votes from the jury (remember the jury votes are good -- you want jury votes to win the game but these players did not receive any votes).
jury_votes
DataFrame grouped by season, then finalist id. Q20
.votes = jury_votes.groupby(['season','finalist_id'],as_index=False)['vote'].sum()
votes = votes[votes['vote'] == 0]
Q20 = votes.groupby(['season','finalist_id'])['vote'].sum()
print(Q20)
season finalist_id 13 200 0 14 219 0 220 0 17 263 0 18 280 0 19 299 0 20 300 0 21 319 0 22 336 0 23 352 0 24 369 0 26 346 0 396 0 27 10 0 31 419 0 421 0 32 476 0 33 496 0 497 0 34 364 0 36 534 0 37 554 0 38 569 0 39 588 0 40 478 0 41 597 0 Name: vote, dtype: int64
Q21: Let's see how many winners ended up getting unanimous jury votes to win the game. Create a Dataframe that shows the survivors that got unanimous jury votes with these columns in the final output: season
, season_name
, winner_id
, full_name
. The DataFrame should be sorted by season and the index should go from 0 to n-1. Save this as Q21
.
unanimous_list = ['10-0-0','7-0','8-0-0','9-0-0']
Q21 = season_summary[season_summary['final_vote'].isin(unanimous_list)]
Q21 = Q21[['season','season_name','winner_id','full_name']].reset_index(drop=True)
print(Q21)
season season_name winner_id full_name 0 14 Survivor: Fiji 221 Earl Cole 1 18 Survivor: Tocantins 281 James Thomas Jr. 2 26 Survivor: Caramoan 348 John Cochran 3 31 Survivor: Cambodia 433 Jeremy Collins 4 33 Survivor: Millennials vs. Gen X 498 Adam Klein
Q22: Sometimes a contestant might win the game even though they have a lot of other contestants trying to eliminate them. What survivor that won their season had the most votes against them to get voted out during the season (represented as "total_votes_received" from the sole_survivor
DataFrame). Select their row from the castaway_details
DataFrame and save this as Q22
. This should return a DataFrame and the index and missing values should be left as is.
### ENTER CODE HERE ###
combined4.groupby('castaway_id',as_index=False)['total_votes_received'].max().sort_values(by='total_votes_received',ascending=False).head(1)
castaway_id | total_votes_received | |
---|---|---|
33 | 516 | 11 |
Q22 = castaway_details[castaway_details['castaway_id'] == 516]
print(Q22)
castaway_id full_name short_name date_of_birth date_of_death \ 515 516 Ben Driebergen Ben 1983-01-01 NaT gender race ethnicity occupation \ 515 Male NaN NaN Marine;Real Estate/Stay-at-home Dad personality_type 515 ESFP
Q23: Let's see how many times each country was used as a location. Create a Series where the country name is the index and the total number of times a season was played in that country are the values. Sort in descending order and call this Q23
.
Q23 = season_summary.groupby('country')['season'].count()
print(Q23)
country Australia 1 Brazil 2 Cambodia 2 China 1 Fiji 10 Gabon 1 Guatemala 1 Islands 1 Kenya 1 Malaysia 1 Nicaragua 6 Palau 2 Panama 3 Philippines 4 Polynesia 1 Samoa 2 Thailand 1 Vanuatu 1 Name: season, dtype: int64
Q24: For the castaway_details
DataFrame, there is a full_name
column and a short_name
column. It would be helpful for future analysis to have the contestants first and last name split into separate columns. First copy the castaway_details
DataFrame to a new DataFrame called Q24
so that we do not change the original DataFrame.
Create two new columns and add the contestant's first name to a new column called first_name
and their last name to a new column called last_name
. Add these columns to the Q24
DataFrame. Put the first_name
and last_name
columns between the full_name
and short_name
columns.
Note: Be careful as some players have last names with multiple spaces. For example, Lex van den Berghe
. You should code Lex
as his first name and van den Berghe
as his last name.
Q24 = castaway_details.copy()
Q24[['first_name', 'last_name']] = Q24['full_name'].str.split(" ", 1, expand=True)
Q24 = Q24[['castaway_id', 'full_name', 'first_name', 'last_name', 'short_name','date_of_birth','date_of_death','gender','race'
,'ethnicity','occupation','personality_type']]
print(Q24)
castaway_id full_name first_name last_name short_name \ 0 1 Sonja Christopher Sonja Christopher Sonja 1 2 B.B. Anderson B.B. Anderson B.B. 2 3 Stacey Stillman Stacey Stillman Stacey 3 4 Ramona Gray Ramona Gray Ramona 4 5 Dirk Been Dirk Been Dirk .. ... ... ... ... ... 603 604 Tiffany Seely Tiffany Seely Tiffany 604 605 Sydney Segal Sydney Segal Sydney 605 606 Shantel Smith Shantel Smith Shan 606 607 David Voce David Voce Voce 607 608 Liana Wallace Liana Wallace Liana date_of_birth date_of_death gender race ethnicity \ 0 1937-01-28 NaT Female NaN NaN 1 1936-01-18 2013-10-29 Male NaN NaN 2 1972-08-11 NaT Female NaN NaN 3 1971-01-20 NaT Female Black NaN 4 1976-06-15 NaT Male NaN NaN .. ... ... ... ... ... 603 1973-12-08 NaT Female White Jewish 604 1995-07-19 NaT Female White Jewish 605 1987-03-11 NaT Female Black NaN 606 1986-05-01 NaT Male NaN NaN 607 2000-10-25 NaT Female Black Jewish occupation personality_type 0 Musician ENFP 1 Real Estate Developer ESTJ 2 Attorney ENTJ 3 Biochemist/Chemist ISTJ 4 Dairy Farmer ISFP .. ... ... 603 Teacher ENTP 604 Law Student ESTP 605 Pastor ENFJ 606 Neurosurgeon ENTJ 607 College Student ESTJ [608 rows x 12 columns]
Q24[Q24['full_name'] == 'Lex van den Berghe']
castaway_id | full_name | first_name | last_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | 46 | Lex van den Berghe | Lex | van den Berghe | Lex | 1963-06-18 | NaT | Male | NaN | NaN | Marketing Manager | ENTJ |
Q25: Let's say that we have a theory that contestants that stay in their original tribe when the tribes are swapped have a better chance at winning their season.
sole_survivor
DataFrame that you created earlier, drop any castaways that have a missing value in the "swapped tribe" column. Use a copy of the data so that you do not drop rows from the original DataFrame.Q25A
.Q25B
.Note: In reality, it is more complicated than this as you would really want to see how many contestants ended up with the majority of their original tribe on the swapped tribe regardless of whether their tribe stayed the same. You are welcome to research this on your own for further practice.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy()
if needed.
cond = sole_survivor.copy()
cond = cond[cond['swapped_tribe'].notna()]
Q25A = cond[cond['swapped_tribe'] == cond['original_tribe']]
print(Q25A)
season_name season full_name castaway_id \ 714 Survivor: Africa 3 Ethan Zohn 48 666 Survivor: The Amazon 6 Jenna Morasca 96 648 Survivor: Pearl Islands 7 Sandra Diaz-Twine 112 630 Survivor: All-Stars 8 Amber Brkich 27 612 Survivor: Vanuatu 9 Chris Daugherty 130 503 Survivor: China 15 Todd Herzog 237 465 Survivor: Gabon 17 Robert Crowley 265 389 Survivor: Nicaragua 21 Jud Birza 321 331 Survivor: One World 24 Kim Spradlin 371 293 Survivor: Caramoan 26 John Cochran 348 234 Survivor: San Juan del Sur 29 Natalie Anderson 442 216 Survivor: Worlds Apart 30 Mike Holloway 460 196 Survivor: Cambodia 31 Jeremy Collins 433 100 Survivor: Ghost Island 36 Wendell Holland 536 40 Survivor: Island of the Idols 39 Tommy Sheehan 590 18 Survivor: Winners at War 40 Tony Vlachos 424 castaway age city state personality_type \ 714 Ethan 27 Lexington Massachusetts ISFP 666 Jenna 21 Bridgeville Pennsylvania ISTP 648 Sandra 29 Fort Lewis Washington ESTP 630 Amber 25 Beaver Pennsylvania ISFP 612 Chris 33 South Vienna Ohio ENTP 503 Todd 22 Pleasant Grove Utah ENFP 465 Bob 57 South Portland Maine INTP 389 Fabio 21 Venice California ESFP 331 Kim 29 San Antonio Texas INFJ 293 Cochran 25 Washington D.C. INTP 234 Natalie 28 Edgewater New Jersey ESTP 216 Mike 38 North Richland Hills Texas ESTP 196 Jeremy 37 Foxboro Massachusetts ESTJ 100 Wendell 33 Philadelphia Pennsylvania INFJ 40 Tommy 26 Long Beach New York ENFJ 18 Tony 45 Allendale New Jersey ESTP episode day order result jury_status original_tribe \ 714 15 39 16 Sole Survivor NaN Boran 666 15 39 16 Sole Survivor NaN Jaburu 648 15 39 18 Sole Survivor NaN Drake 630 17 39 18 Sole Survivor NaN Chapera 612 15 39 18 Sole Survivor NaN Lopevi 503 15 39 16 Sole Survivor NaN Fei Long 465 14 39 18 Sole Survivor NaN Kota 389 16 39 20 Sole Survivor NaN La Flor 331 15 39 18 Sole Survivor NaN Salani 293 15 39 20 Sole Survivor NaN Bikal 234 15 39 18 Sole Survivor NaN Hunahpu 216 15 39 18 Sole Survivor NaN Escameca 196 15 39 20 Sole Survivor NaN Bayon 100 14 39 20 Sole Survivor NaN Naviti 40 14 39 20 Sole Survivor NaN Vokai 18 15 39 22 Sole Survivor NaN Dakal swapped_tribe swapped_tribe_2 merged_tribe total_votes_received \ 714 Boran NaN Moto Maji 0 666 Jaburu NaN Jacaré 3 648 Drake NaN Balboa 0 630 Chapera Chapera Chaboga Mogo 6 612 Lopevi NaN Alinta 3 503 Fei Long NaN Hae Da Fung 5 465 Kota Kota Nobag 2 389 La Flor NaN Libertad 2 331 Salani NaN Tikiano 3 293 Bikal NaN Enil Edam 0 234 Hunahpu NaN Huyopa 0 216 Escameca NaN Merica 4 196 Bayon Bayon Orkun 3 100 Naviti Yanuya Lavita 5 40 Vokai NaN Lumuwaku 2 18 Dakal NaN Koru 0 immunity_idols_won 714 4 666 7 648 3 630 6 612 6 503 4 465 8 389 8 331 6 293 7 234 6 216 7 196 7 100 8 40 6 18 9
Q25B = cond[cond['swapped_tribe'] != cond['original_tribe']]
print(Q25B)
season_name season full_name \ 698 Survivor: Marquesas 4 Vecepia Towery 574 Survivor: Guatemala 11 Danni Boatwright 558 Survivor: Panama 12 Aras Baskauskas 538 Survivor: Cook Islands 13 Yul Kwon 519 Survivor: Fiji 14 Earl Cole 483 Survivor: Micronesia 16 Parvati Shallow 313 Survivor: Philippines 25 Denise Stapley 270 Survivor: Blood vs. Water 27 Tyson Apostol 252 Survivor: Cagayan 28 Tony Vlachos 178 Survivor: Kaoh Rong 32 Michele Fitzgerald 158 Survivor: Millennials vs. Gen X 33 Adam Klein 138 Survivor: Game Changers 34 Sarah Lacina 120 Survivor: Heroes vs. Healers vs. Hustlers 35 Ben Driebergen 80 Survivor: David vs. Goliath 37 Nick Wilson castaway_id castaway age city state personality_type \ 698 64 Vecepia 36 Hayward California ISTJ 574 166 Danni 29 Tonganoxie Kansas ENFJ 558 182 Aras 24 Santa Monica California INFP 538 202 Yul 31 San Mateo California INTJ 519 221 Earl 35 Santa Monica California INFJ 483 197 Parvati 25 Los Angeles California ENFJ 313 386 Denise 41 Cedar Rapids Iowa INFJ 270 274 Tyson 34 Provo Utah ESTP 252 424 Tony 39 Jersey City New Jersey ESTP 178 478 Michele 24 Freehold New Jersey ESFP 158 498 Adam 25 San Francisco California ENTP 138 414 Sarah 32 Marion Iowa ENTJ 120 516 Ben 34 Boise Idaho ESFP 80 556 Nick 27 Williamsburg Kentucky ENTP episode day order result jury_status original_tribe \ 698 15 39 16 Sole Survivor NaN Maraamu 574 15 39 18 Sole Survivor NaN Nakúm 558 16 39 16 Sole Survivor NaN Viveros 538 16 39 20 Sole Survivor NaN Puka Puka 519 15 39 19 Sole Survivor NaN Ravu 483 15 39 20 Sole Survivor NaN Malakal 313 15 39 18 Sole Survivor NaN Matsing 270 15 39 23 Sole Survivor NaN Galang 252 14 39 18 Sole Survivor NaN Aparri 178 15 39 18 Sole Survivor NaN Gondol 158 14 39 20 Sole Survivor NaN Vanua 138 14 39 20 Sole Survivor NaN Nuku 120 14 39 18 Sole Survivor NaN Levu 80 14 39 20 Sole Survivor NaN David swapped_tribe swapped_tribe_2 merged_tribe total_votes_received \ 698 Rotu NaN Soliantu 2 574 Yaxhá NaN Xhakúm 1 558 Casaya NaN Gitanos 9 538 Aitutaki Aitutaki Aitutonga 5 519 Moto NaN Bula Bula 1 483 Airai NaN Dabu 4 313 Kalabaw NaN Dangrayne 6 270 Tadhana NaN Kasama 2 252 Solana NaN Solarrion 5 178 Chan Loh NaN Dara 1 158 Takali NaN Vinaka 6 138 Tavua NaN Maku Maku 0 120 Yawa NaN Solewa 11 80 Jabeni NaN Kalokalo 0 immunity_idols_won 698 4 574 5 558 4 538 6 519 5 483 7 313 1 270 8 252 6 178 7 158 6 138 4 120 4 80 4
Q26: Let's say that we wanted to predict a contestants personality type based on the information in the data files. Your task is to create a DataFrame that lists the castaway_id
, full_name
and personality_type
for each castaway contestant. However, since most machine learning algorithms use numeric data, you want to change the personality types to the following numbers:
Save this new DataFrame as Q26
and sort based on castaway_id
in ascending order.
Q26 = castaway_details.copy()
Q26 = Q26[['castaway_id','full_name','personality_type']]
Q26["personality_type"].replace({"ISTJ": 1, "ISTP": 2, "ISTP": 2, "ISFJ": 3, "ISFP": 4, "INFJ": 5, "INFP": 6,
"INTJ": 7, "INTP": 8, "ESTP": 9, "ESTJ": 10, "ESFP": 11, "ESFJ": 12, "ENFP": 13,
"ENFJ": 14, "ENTP": 15, "ENTJ": 16}, inplace=True)
Q26['personality_type'] = Q26['personality_type'].fillna(17)
Q26['personality_type'] = Q26['personality_type'].astype('int')
print(Q26)
castaway_id full_name personality_type 0 1 Sonja Christopher 13 1 2 B.B. Anderson 10 2 3 Stacey Stillman 16 3 4 Ramona Gray 1 4 5 Dirk Been 4 .. ... ... ... 603 604 Tiffany Seely 15 604 605 Sydney Segal 9 605 606 Shantel Smith 14 606 607 David Voce 16 607 608 Liana Wallace 10 [608 rows x 3 columns]
Q27: After thinking about this some more, you realize that you don't want to code the personality traits as you did in problem 26 since the data is not ordinal. Some machine learning algorithms will assume that numbers close to each other are more alike than those that are away from each other and that is not the case with these personality types.
Let's create a new DataFrame called Q27
that creates dummy columns (using get_dummies
) for the original personality type column. Add a prefix called "type" and drop the first column to help prevent multicollinearity. The columns should be castaway_id
, full_name
followed by the various dummy columns for the personality types. Don't worry about any missing values in this step.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy()
if needed.
Q27 = castaway_details.copy()
Q27 = Q27[['castaway_id','full_name','personality_type']]
Q27 = pd.get_dummies(data=Q27,columns=['personality_type'], prefix='type',drop_first = True)
print(Q27)
castaway_id full_name type_ENFP type_ENTJ type_ENTP \ 0 1 Sonja Christopher 1 0 0 1 2 B.B. Anderson 0 0 0 2 3 Stacey Stillman 0 1 0 3 4 Ramona Gray 0 0 0 4 5 Dirk Been 0 0 0 .. ... ... ... ... ... 603 604 Tiffany Seely 0 0 1 604 605 Sydney Segal 0 0 0 605 606 Shantel Smith 0 0 0 606 607 David Voce 0 1 0 607 608 Liana Wallace 0 0 0 type_ESFJ type_ESFP type_ESTJ type_ESTP type_INFJ type_INFP \ 0 0 0 0 0 0 0 1 0 0 1 0 0 0 2 0 0 0 0 0 0 3 0 0 0 0 0 0 4 0 0 0 0 0 0 .. ... ... ... ... ... ... 603 0 0 0 0 0 0 604 0 0 0 1 0 0 605 0 0 0 0 0 0 606 0 0 0 0 0 0 607 0 0 1 0 0 0 type_INTJ type_INTP type_ISFJ type_ISFP type_ISTJ type_ISTP 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 3 0 0 0 0 1 0 4 0 0 0 1 0 0 .. ... ... ... ... ... ... 603 0 0 0 0 0 0 604 0 0 0 0 0 0 605 0 0 0 0 0 0 606 0 0 0 0 0 0 607 0 0 0 0 0 0 [608 rows x 17 columns]
Q28: After running your data above through your machine learning model, you determine that a better prediction might come from breaking the personality type into its four parts (one part for each character in the type). Your task is now to create a DataFrame called Q28
that splits the personality type into the various parts and creates a new column for each part (these columns should be called interaction
that will represent the first letter in the personality type, information
for the second letter, decision
for the third, and organization
for the fourth).
Again, since most machine learning algorithms work with numeric data, perform the following on the four new columns:
interaction
--> code all I
's as 0
and E
's as 1
information
--> code all S
's as 0
and N
's as 1
decision
--> code all T
's as 0
and F
's as 1
organization
--> code as J
's with 0
and P
's as 1
2
For example, if a contestant's personality type was ENTJ
, your columns for that row would be:
1
for interaction
because of the E
1
for information
because of the N
0
for decision
because of the T
0
for organization
because of the J
The new DataFrame should be sorted in castaway_id
order and have the following columns in this order: castaway_id
, full_name
, personality_type
, interaction
, information
, decision
, organization
.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy()
if needed.
Q28 = castaway_details.copy()
Q28 = Q28[['castaway_id','full_name','personality_type']]
Q28[['0','interaction', 'information', 'decision', 'organization','1']] = Q28['personality_type'].str.split('', expand=True)
Q28 = Q28.drop(['0','1'], axis = 1)
Q28["interaction"].replace({"E": '1', "I": '0'}, inplace=True)
Q28["information"].replace({"N": '1', "S": '0'}, inplace=True)
Q28["decision"].replace({"F": '1', "T": '0'}, inplace=True)
Q28["organization"].replace({"P": '1', "J": '0'}, inplace=True)
print(Q28)
castaway_id full_name personality_type interaction information \ 0 1 Sonja Christopher ENFP 1 1 1 2 B.B. Anderson ESTJ 1 0 2 3 Stacey Stillman ENTJ 1 1 3 4 Ramona Gray ISTJ 0 0 4 5 Dirk Been ISFP 0 0 .. ... ... ... ... ... 603 604 Tiffany Seely ENTP 1 1 604 605 Sydney Segal ESTP 1 0 605 606 Shantel Smith ENFJ 1 1 606 607 David Voce ENTJ 1 1 607 608 Liana Wallace ESTJ 1 0 decision organization 0 1 1 1 0 0 2 0 0 3 0 0 4 1 1 .. ... ... 603 0 1 604 0 1 605 1 0 606 0 0 607 0 0 [608 rows x 7 columns]
Q29: Using data from castaways
, create a DataFrame called Q29
that bins the contestant ages (their age when they were on the season, not their current age) into the following age categories:
The final DataFrame should have the following columns in this order: season
, castaway_id
, full_name
, age
, and age_category
. The DataFrame should be sorted by age and then castaway_id. The index should be 0 through n-1. You should have the same amount of rows as in the castaways
DataFrame.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy()
if needed.
bins = [17,24,34,44,54,64,100]
group_names = ["18-24",'25-34','35-44','45-54','55-64','65+']
Q29 = castaways.copy()
Q29['age_category'] = pd.cut(Q29['age'], bins, labels = group_names)
Q29 = Q29[['season','castaway_id','full_name','age','age_category']].sort_values(by = ['age','castaway_id']).reset_index(drop = True)
print(Q29)
season castaway_id full_name age age_category 0 33 491 Will Wahl 18 18-24 1 36 528 Michael Yerger 18 18-24 2 18 270 Spencer Duhm 19 18-24 3 22 336 Natalie Tenerelli 19 18-24 4 23 350 Brandon Hantz 19 18-24 .. ... ... ... ... ... 757 24 366 Greg Smith 64 55-64 758 21 304 Jimmy Johnson 67 65+ 759 32 474 Joseph Del Campo 71 65+ 760 1 14 Rudy Boesch 72 65+ 761 8 14 Rudy Boesch 75 65+ [762 rows x 5 columns]
Q30: Based on the age categories you created above, what are the normalized percentages for the various age categories using value_counts()
. Sort the value counts by index. Save this as Q30
.
Q30= Q29['age_category'].value_counts(normalize=True).sort_index()
print(Q30)
18-24 0.190289 25-34 0.437008 35-44 0.211286 45-54 0.124672 55-64 0.031496 65+ 0.005249 Name: age_category, dtype: float64
Q31: Which contestant(s) played a perfect game? A perfect game is considered when the contestant:
Save this DataFrame as Q31
with the following columns: season_name
, season
, castaway_id
, full_name
, tribal_council_votes
, jury_votes
. The DataFrame should be sorted by season and the index should be 0 to n-1.
unanimous_season = [14,18,26,31,33]
perfect_game = combined4[(combined4['total_votes_received'] == 0)]
Q31 = perfect_game[perfect_game['season'].isin(unanimous_season)]
Q31 = Q31.merge(season_summary, how='inner', left_on= 'season', right_on='season')
Q31 = Q31[['season_name_x','season','castaway_id','full_name_x','total_votes_received','final_vote']]
Q31 = Q31.rename(columns={'season_name_x':'season_name','full_name_x':'full_name','total_votes_received':'tribal_council_votes','final_vote':'jury_votes'})
Q31['jury_votes'] = Q31['jury_votes'].str.split('-').str[0]
print(Q31)
season_name season castaway_id full_name \ 0 Survivor: Tocantins 18 281 James Thomas Jr. 1 Survivor: Caramoan 26 348 John Cochran tribal_council_votes jury_votes 0 0 7 1 0 8