Baby Names in the U.S.
For this assignment, you will get a set of txt files. Each file lists the number of babies
given each name in one year, and together the files cover 1880 to 2021. You can download a zip
file containing all of these .txt files. This assignment has three parts: data preparation,
cleaning, and visualization.
Data Preparation:
1. The data comes in multiple txt files. You need to merge all of those files into a single
CSV file and then load it into a Pandas DataFrame. (You cannot do this manually; you
need to write Pandas code to do it.)
2. Each txt file has three columns, but none of the files come with column labels. After
loading all data into one single DataFrame, give the following labels to the columns:
“BabyName,” “gender,” and “Population”.
3. Txt files follow a naming pattern: yobXXXX.txt, where XXXX represents the year the data
belongs to. For example, yob1999.txt belongs to 1999. You need to add the year as the
fourth column. You can label this new column as ‘year’.
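The three preparation steps above can be sketched as follows. This is only one possible approach, not a required solution. It assumes that every file name ends in a four-digit year followed by .txt, and that each file holds three comma-separated columns. The function name load_baby_names and the csv_path argument are hypothetical names chosen for illustration.

```python
import re
from pathlib import Path

import pandas as pd

COLUMNS = ["BabyName", "gender", "Population"]


def load_baby_names(folder, csv_path):
    """Merge every .txt file in `folder` into one CSV, then load that CSV."""
    frames = []
    # This loop runs once per file (about 140 times), not once per data row.
    for path in sorted(Path(folder).glob("*.txt")):
        match = re.search(r"(\d{4})\.txt$", path.name)
        if match is None:
            continue  # skip files that do not end in a four-digit year
        # Assumption: three comma-separated columns and no header row.
        frame = pd.read_csv(path, header=None, names=COLUMNS)
        frame["year"] = int(match.group(1))
        frames.append(frame)
    # pd.concat raises ValueError if no matching file was found.
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)
```

Calling load_baby_names("names", "all_names.csv") would then return one DataFrame with the columns BabyName, gender, Population, and year, assuming the unzipped files sit in a folder called names.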
Data Cleaning:
1. The data includes NaN values. Those values can occur in any of the initial fields. In the
first step, you need to remove NaN values. You can drop the rows that contain NaN
values.
2. The gender field can only have two values, F and M, for names given to females and
males. Anything else is incorrect data and needs to be removed.
3. The population must be a positive integer. You can remove anything else.
4. It does not make sense to have the same baby name with the same gender and year more
than once. To fix this issue, keep the row with the highest population and drop the other
repetitions.
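The four cleaning rules above could be applied with vectorized Pandas operations along these lines. This is a sketch under the assumption that the columns are already labeled BabyName, gender, Population, and year. The function name clean_baby_names is hypothetical.

```python
import pandas as pd


def clean_baby_names(df):
    """Apply the four cleaning rules without looping over rows."""
    # Rule 1: drop rows with a missing value in any field.
    df = df.dropna(subset=["BabyName", "gender", "Population", "year"])
    # Rule 2: keep only the two valid gender codes.
    df = df[df["gender"].isin(["F", "M"])]
    # Rule 3: coerce non-numeric populations to NaN, then keep positive
    # whole numbers only.
    population = pd.to_numeric(df["Population"], errors="coerce")
    valid = population.notna() & (population > 0) & (population % 1 == 0)
    df = df[valid].assign(Population=population[valid].astype("int64"))
    # Rule 4: for a repeated (name, gender, year), keep the largest population.
    df = (
        df.sort_values("Population", ascending=False)
        .drop_duplicates(subset=["BabyName", "gender", "year"], keep="first")
        .sort_values(["year", "gender", "Population"], ascending=[True, True, False])
        .reset_index(drop=True)
    )
    return df
```

Sorting by Population before drop_duplicates is one way to make "keep the highest" explicit; a groupby with max would be an alternative.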
Data Visualization:
1. Plot the top-10 most popular names for a year given by the user. Users may choose to see
only male names, only female names, or the top-10 regardless of gender. You can use a
bar chart for this part.
2. The user can visualize the population of a specific baby name over a specific range of
years. For example, show the number of babies named ‘Mary’ from 1990 to 2001. Users
can also enter multiple names (up to 3 names). You can ignore empty names.
3. Users can enter a specific year (from 1880 to 2021) and plot the total number of baby
boys and baby girls. For this part, you can use a slider.
4. Draw the trend of the total number of baby boys and baby girls over time. More
specifically, you need to draw two lines, one per gender, each showing the total number
of babies in each year.
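The aggregation behind these plots might look like the sketch below. It assumes a cleaned DataFrame with the columns BabyName, gender, Population, and year, and it uses matplotlib for drawing, which is an assumption since the assignment does not name a plotting library. All function names are hypothetical.

```python
import pandas as pd


def top_names(df, year, gender="All", n=10):
    """Return the n most popular names for `year`, optionally for one gender."""
    subset = df[df["year"] == year]
    if gender in ("F", "M"):
        subset = subset[subset["gender"] == gender]
    # Sum per name so a name used for both genders appears once in "All".
    return subset.groupby("BabyName")["Population"].sum().nlargest(n)


def name_counts(df, names, start, end):
    """Return yearly counts (one column per name) between start and end."""
    mask = df["BabyName"].isin(names) & df["year"].between(start, end)
    return df[mask].pivot_table(index="year", columns="BabyName",
                                values="Population", aggfunc="sum", fill_value=0)


def totals_by_year(df):
    """Return a year-indexed table with one column of totals per gender."""
    return df.groupby(["year", "gender"])["Population"].sum().unstack(fill_value=0)


def plot_top_names(df, year, gender="All"):
    """Bar chart of the top names, with title, axis labels, and legend."""
    import matplotlib.pyplot as plt  # local import: only needed for plotting

    counts = top_names(df, year, gender)
    fig, ax = plt.subplots()
    ax.bar(counts.index, counts.values, label=f"Gender: {gender}")
    ax.set_title(f"Top {len(counts)} baby names in {year}")
    ax.set_xlabel("Baby name")
    ax.set_ylabel("Number of babies")
    ax.legend()
    return ax


def plot_gender_trend(df):
    """Line chart with one line per gender showing yearly totals."""
    import matplotlib.pyplot as plt  # local import: only needed for plotting

    totals = totals_by_year(df)
    labels = {"F": "Girls", "M": "Boys"}
    fig, ax = plt.subplots()
    for code in totals.columns:
        ax.plot(totals.index, totals[code], label=labels.get(code, code))
    ax.set_title("Total babies per year by gender")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of babies")
    ax.legend()
    return ax
```

Keeping the aggregation in small functions that return Series or DataFrames makes each result easy to check before it is plotted. The single-year totals for part 3 can be read from one row of totals_by_year(df).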
Some considerations for this homework:
1. Some parts of this assignment may not have been discussed explicitly in class; however,
you can search online to find the best way to do them.
2. You need to merge the txt files automatically. I will run that part to create the data
frame. You also need to do data cleaning using Python and cannot fix the data outside of
Python (for example, using Excel).
3. Your code should be efficient. For example, you cannot have loops that iterate over the
entire dataset row by row.
4. You need to use Google CoLab forms to get the user’s input for this assignment. You
cannot use the input() command for this homework.
5. When using Google CoLab forms, make sure to set the attributes appropriately and
validate the input data where possible. For example, for the year, you can use a slider
from 1880 to 2021.
6. Make sure your visual output is professional. It should have a title, x-axis and y-axis
labels, and a legend.
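As an illustration of considerations 4 and 5, a CoLab form cell could look like the sketch below. The #@param annotations are CoLab form markers and are treated as ordinary comments outside CoLab. The variable names, default values, and the two helper functions are hypothetical.

```python
#@title Baby name explorer
year = 1999  #@param {type:"slider", min:1880, max:2021, step:1}
gender = "All"  #@param ["All", "F", "M"]
names = "Mary, John"  #@param {type:"string"}


def parse_names(text, limit=3):
    """Split a comma-separated string, dropping empty names and any extras."""
    cleaned = [part.strip() for part in text.split(",") if part.strip()]
    return cleaned[:limit]


def validate_year(value, first=1880, last=2021):
    """Return the year as an int, or raise if it is outside the data range."""
    if not first <= value <= last:
        raise ValueError(f"year must be between {first} and {last}")
    return int(value)


selected_year = validate_year(year)
selected_names = parse_names(names)
```

The slider already restricts the year in the form itself, so validate_year mainly guards against the cell being edited by hand.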
