This article shows my first attempt at using Python for data storytelling.
The dataset I used is the insurance regression CSV file. It is available on my GitHub account, together with the code I used for this project.
To start, import seaborn, numpy, and matplotlib into your notebook. Seaborn is a straightforward and very handy module for plotting in Python. It is built on top of numpy and matplotlib, so I import those modules as well, which also makes it possible to adjust the figures directly.
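The imports for this setup might look like the sketch below. Importing pandas is my assumption, since the article does not name it, but it is the usual way to read a CSV file into a notebook. The aliases are common conventions and not requirements.

```python
import numpy as np               # numerical arrays used under the hood
import pandas as pd              # assumption: used to read the CSV file
import matplotlib.pyplot as plt  # direct control over figures and axes
import seaborn as sns            # high-level statistical plotting
```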
To check that the file loaded into the notebook successfully, I displayed the first few rows of the data.
I showed only the first five rows to keep the output short, since the goal is just to confirm that the .csv file was read correctly.
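A minimal sketch of that check, assuming pandas reads the file. The rows below are made-up stand-ins so the example runs on its own, and the file name `insurance.csv` in the comment is an assumption. The column names are the ones mentioned in this article.

```python
import io
import pandas as pd

# Made-up rows for illustration only, so the sketch runs without the file.
# With the real data you would call pd.read_csv("insurance.csv") instead
# (the file name is an assumption).
sample_csv = io.StringIO(
    "age,bmi,children,smoker,charges\n"
    "19,27.9,0,yes,16900\n"
    "23,33.8,1,no,1800\n"
    "28,33.0,3,no,4500\n"
    "33,22.7,0,no,6000\n"
    "45,28.9,0,no,8500\n"
    "52,30.1,2,yes,40000\n"
)
df = pd.read_csv(sample_csv)

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```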
I started by plotting the number of customers against the number of children, split by smoker status.
As we can see from the graph, the majority of customers have zero dependents and are nonsmokers. It is likely that people with more children can barely afford insurance, which would explain the decreasing number of customers. We can also infer that smokers tend not to take out insurance because the charges are higher for smokers. The next graphs help check whether these inferences hold.
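A count plot is one way to draw this chart. The sketch below uses seaborn's `countplot` with `hue` for the smoker split. Whether the original figure used exactly this call is an assumption, and the rows are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only; use the real DataFrame in practice.
df = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 3],
    "smoker": ["no", "no", "yes", "no", "no", "yes", "no", "no"],
})

# One group of bars per number of children, split by smoker status.
ax = sns.countplot(data=df, x="children", hue="smoker")
ax.set_xlabel("Number of children")
ax.set_ylabel("Number of customers")
```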
I then graphed the distribution of insurance customers by age. The graph shows a noticeably high number of customers in their early to mid-20s. It could mean that, even at an early age, they already know the benefits of having insurance. Or it could mean that parents are paying for their children's insurance in preparation for the future.
Next, I drew a scatterplot of BMI against charges, with age and smoker status encoded in the markers.
I also enlarged the markers and the legend text to make them easier to read.
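A sketch of such a scatterplot follows. Mapping smoker status to color (`hue`) and age to marker size (`size`) is an assumption about how the two variables were encoded. The size range, figure size, and font size are illustrative values, and the rows are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows for illustration only; use the real DataFrame in practice.
df = pd.DataFrame({
    "age": [19, 25, 33, 41, 48, 55, 60, 29],
    "bmi": [21.5, 24.0, 27.3, 31.2, 35.8, 29.4, 38.1, 18.2],
    "smoker": ["no", "no", "yes", "no", "yes", "no", "yes", "no"],
    "charges": [1800, 2500, 19000, 7000, 42000, 11000, 47000, 3200],
})

fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(
    data=df, x="bmi", y="charges",
    hue="smoker",     # color encodes smoker status (assumption)
    size="age",       # marker size encodes age (assumption)
    sizes=(40, 300),  # smallest and largest marker area
    alpha=0.7,
    ax=ax,
)
ax.set_xlabel("BMI")
ax.set_ylabel("Charges")

# Enlarge the legend text so it is easier to read.
plt.setp(ax.get_legend().get_texts(), fontsize=12)
```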
This scatterplot supports our first inference about why so few smokers take out insurance. It is apparent in the graph that insurance charges for smokers are costly compared to those for nonsmokers.
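The visual comparison could also be backed by a quick summary table. The sketch below groups charges by smoker status. The rows are made up for illustration, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; use the real DataFrame in practice.
df = pd.DataFrame({
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes"],
    "charges": [1800, 2500, 7000, 11000, 19000, 42000, 47000],
})

# Mean and median charge for each smoker group.
charge_summary = df.groupby("smoker")["charges"].agg(["mean", "median"])
print(charge_summary)
```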
Also based on this plot, we can divide the results into three clusters.
The first cluster has the following attributes:
- majority of smokers
- highest insurance charge
- all ages are represented: the older the customer, the higher the charge
- BMIs range from overweight to obese
The second cluster has the following attributes:
- combination of smokers and nonsmokers
- mid range insurance charge
- all ages are represented: the older the customer, the higher the charge
- healthy BMI
The third cluster has these attributes:
- the majority of nonsmokers belong in this range
- very low insurance charge
- mostly people in their 20s to late 30s
- BMIs range from underweight to healthy
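The three clusters above were read off the plot by eye. One hedged way to make a similar grouping reproducible is to cut charges into three bands and summarize each band. Equal-width bands are my assumption and not the method used in this article, and the rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; use the real DataFrame in practice.
df = pd.DataFrame({
    "age": [19, 25, 29, 33, 41, 48, 55, 60],
    "bmi": [18.2, 21.5, 24.0, 22.7, 24.5, 27.3, 35.8, 38.1],
    "smoker": ["no", "no", "no", "no", "yes", "no", "yes", "yes"],
    "charges": [2000, 3500, 4500, 9000, 20000, 24000, 40000, 47000],
})

# Split charges into three equal-width bands (an assumption, see above).
df["band"] = pd.cut(df["charges"], bins=3, labels=["low", "mid", "high"])
df["is_smoker"] = df["smoker"].eq("yes")

# Per-band customer count, share of smokers, and average BMI and age.
band_summary = df.groupby("band", observed=True).agg(
    customers=("charges", "size"),
    smoker_share=("is_smoker", "mean"),
    mean_bmi=("bmi", "mean"),
    mean_age=("age", "mean"),
)
print(band_summary)
```

On the real data, other cut points or a clustering algorithm may match the visual clusters better than equal-width bands.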
Based on the graphs presented, we can conclude that insurance charges depend strongly on lifestyle and health. Smokers and people with higher BMI tend to have higher insurance charges than nonsmokers and people with a healthy BMI. It is also evident that age, which can also be associated with health, is another factor to consider when taking out insurance. It is advisable to get insurance as early as your 20s, when the price is low.