Visualize
# Importing Required Libraries
import pandas as pd
import matplotlib.pylab as plt
plt.style.use('ggplot')
# Loading Data in to Pandas Dataframe
df = pd.read_csv('Data_Survey.csv',skiprows=[1,2])
# Viewing First 5 Rows
df.head()
StartDate | EndDate | Status | IPAddress | Progress | Duration (in seconds) | Finished | RecordedDate | ResponseId | RecipientLastName | ... | Q14 | Q15 | Q16 | Q17_1 | Q18 | Q19 | Q20 | Q21 | Q21_7_TEXT | Q22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4/4/2022 12:09 | 4/4/2022 12:09 | IP Address | 71.176.116.110 | 100 | 9 | True | 4/4/2022 12:09 | R_3iOTGKdgC3iKADr | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 4/4/2022 12:09 | 4/4/2022 12:17 | IP Address | 71.176.116.110 | 100 | 469 | True | 4/4/2022 12:17 | R_3kgJOdcw2NW5yLy | NaN | ... | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 4/4/2022 12:17 | 4/4/2022 12:24 | IP Address | 71.176.116.110 | 100 | 415 | True | 4/4/2022 12:24 | R_3meha5WatdczRqh | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4/4/2022 12:24 | 4/4/2022 12:28 | IP Address | 71.176.116.110 | 100 | 253 | True | 4/4/2022 12:28 | R_pGEDdFForjtw9xv | NaN | ... | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4/5/2022 19:04 | 4/5/2022 19:05 | IP Address | 71.115.229.93 | 100 | 97 | True | 4/5/2022 19:06 | R_eEeXiduEZEDzSx3 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 65 columns
# Dropping All Null values in Q2 column
df.dropna(subset=["Q2"], axis=0, inplace=True)
# Filtering all rows which are equal to Yes
df = df[df['Q2'] == 'Yes']
df = df.sort_values(by='EndDate')
# Dropping all duplicate IPAddress keeping the first
df.drop_duplicates(subset='IPAddress', keep="first")
StartDate | EndDate | Status | IPAddress | Progress | Duration (in seconds) | Finished | RecordedDate | ResponseId | RecipientLastName | ... | Q14 | Q15 | Q16 | Q17_1 | Q18 | Q19 | Q20 | Q21 | Q21_7_TEXT | Q22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
168 | 4/10/2022 0:28 | 4/10/2022 0:28 | IP Address | 194.182.8.81 | 9 | 6 | False | 4/17/2022 0:28 | R_3kNi7S7qB8ixPSc | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
169 | 4/10/2022 10:28 | 4/10/2022 10:29 | IP Address | 173.171.17.193 | 18 | 31 | False | 4/17/2022 10:29 | R_XGNLE1n8XUPkqbv | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
58 | 4/10/2022 14:09 | 4/10/2022 14:13 | IP Address | 108.20.152.110 | 100 | 225 | True | 4/10/2022 14:13 | R_2bPs0M8EkNErq2l | NaN | ... | Northeast | Yes | Yes | 3.0 | Graduate or professional degree (MA, MS, MBA, ... | $150,000 or more | 1923.0 | Working (paid employee) | NaN | Management, professional, and related |
171 | 4/10/2022 14:42 | 4/10/2022 14:43 | IP Address | 73.17.224.217 | 36 | 60 | False | 4/17/2022 14:43 | R_2PaEJgsGQyPJHH9 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
173 | 4/10/2022 16:09 | 4/10/2022 16:10 | IP Address | 71.174.126.166 | 18 | 63 | False | 4/17/2022 16:10 | R_e3Z87iJLpTC8KIx | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
141 | 4/8/2022 9:08 | 4/8/2022 9:09 | IP Address | 163.116.135.114 | 18 | 45 | False | 4/15/2022 9:09 | R_3Icb95xa1FwfQQr | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
34 | 4/8/2022 9:49 | 4/8/2022 9:52 | IP Address | 67.242.27.6 | 100 | 157 | True | 4/8/2022 9:52 | R_Z8iDkYwlA4rg4SJ | NaN | ... | Northeast | Yes | Yes | 3.0 | Graduate or professional degree (MA, MS, MBA, ... | $75,000-$99,999 | 13214.0 | Working (paid employee) | NaN | Service |
52 | 4/9/2022 16:11 | 4/9/2022 16:18 | IP Address | 74.69.106.171 | 100 | 405 | True | 4/9/2022 16:18 | R_2zIx0HaoRH3XVli | NaN | ... | Southwest | Yes | Yes | 4.0 | Some high school or less | Less than $25,000 | 13205.0 | Not working (disabled) | NaN | NaN |
50 | 4/9/2022 2:26 | 4/9/2022 2:33 | IP Address | 24.128.116.150 | 100 | 388 | True | 4/9/2022 2:33 | R_25vcxKvPInRKpPe | NaN | ... | West | Yes | Doesn't make a difference | 3.0 | Bachelor’s degree | $100,000-$149,999 | 80501.0 | Working (self-employed) | NaN | Sales and office |
51 | 4/9/2022 8:33 | 4/9/2022 8:38 | IP Address | 107.77.223.18 | 100 | 302 | True | 4/9/2022 8:38 | R_1kOdhmJiF5NV7l9 | NaN | ... | Northeast | No | Yes | NaN | Graduate or professional degree (MA, MS, MBA, ... | $150,000 or more | 13078.0 | Working (paid employee) | NaN | Management, professional, and related |
250 rows × 65 columns
# Filtering all rows where Q4_1 is greater than 6 and less than 19
df = df[df['Q4_1'] > 6.0]
df = df[df['Q4_1'] < 19.0]
df = df.reset_index()
df.shape
(96, 66)
# Dropping all rows where age is greater than 6 and less than 19 in other columns
rows = df.index[[2,6,7,8,9,10,20,24,30,35,38,40,41,42,47,53,88,94]]
df.drop(rows, inplace=True)
df.head()
index | StartDate | EndDate | Status | IPAddress | Progress | Duration (in seconds) | Finished | RecordedDate | ResponseId | ... | Q14 | Q15 | Q16 | Q17_1 | Q18 | Q19 | Q20 | Q21 | Q21_7_TEXT | Q22 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | 4/10/2022 14:09 | 4/10/2022 14:13 | IP Address | 108.20.152.110 | 100 | 225 | True | 4/10/2022 14:13 | R_2bPs0M8EkNErq2l | ... | Northeast | Yes | Yes | 3.0 | Graduate or professional degree (MA, MS, MBA, ... | $150,000 or more | 1923.0 | Working (paid employee) | NaN | Management, professional, and related |
1 | 178 | 4/10/2022 21:58 | 4/10/2022 21:59 | IP Address | 174.242.131.198 | 18 | 19 | False | 4/17/2022 21:59 | R_3oTprtbTdAv5MRp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 65 | 4/11/2022 15:25 | 4/11/2022 15:27 | IP Address | 128.230.194.156 | 100 | 141 | True | 4/11/2022 15:27 | R_22qXuvuOs8iCQQK | ... | Northeast | Yes | Yes | 4.0 | Graduate or professional degree (MA, MS, MBA, ... | $50,000-$74,999 | 13244.0 | Working (paid employee) | NaN | Management, professional, and related |
4 | 180 | 4/11/2022 6:38 | 4/11/2022 6:39 | IP Address | 174.192.0.57 | 18 | 26 | False | 4/18/2022 6:39 | R_12areyXnWfxZDsR | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 62 | 4/11/2022 8:44 | 4/11/2022 8:48 | IP Address | 128.230.79.168 | 100 | 291 | True | 4/11/2022 8:48 | R_2QJlWph9SoEoY0N | ... | Northeast | Yes | Doesn't make a difference | 1.0 | Graduate or professional degree (MA, MS, MBA, ... | Prefer not to say | 13090.0 | Working (paid employee) | NaN | Management, professional, and related |
5 rows × 66 columns
df.drop(df.columns[[0]], axis = 1, inplace = True)
# Saving file in csv format
df.to_csv('Data_Survey_Cleaned.csv',index=False)
# Creating Bar Graph of BG Column
df['Q17_1'].value_counts().plot(kind='bar',title="BG Column Bar Chart", backend = 'plotly')
# Creating Pie Chart of BD Column
df_pie = df.Q14.value_counts().plot(kind='pie',title="BD Column Pie Chart")