TidyTuesday 2025/08/05

Happy inaugural blog post!

Something you will probably figure out about me, if you somehow find this blog and decide to keep reading, is that I enjoy digging into a topic for several weeks or months at a time and then letting it go. This is often cyclical, as with Magic: The Gathering or Dungeons & Dragons; other interests have yet to come back (a brief 1 1/2 year Muay Thai stint 2 years ago). Powerlifting lasted an incredible 5 years. The only topic I have never grown bored of is, of course, science. The more I get into science, the more I realize that my initial reticence towards 'dry lab' work, and an (appalling, in retrospect) near-disdain for bioinformaticians and data analysts, likely stemmed from my inability to perform these tasks myself. So I decided to learn R and build a foundation in statistics, for a few reasons:

Perform my own reproducible and shareable data analysis
Create beautiful graphs
Expand my marketability beyond a wet-lab hyperspecialist
Learn something and minimize my reliance on LLMs
Enhance my writing and defensibility of chosen analyses

No mean feat! I know that I love to learn by doing. TidyTuesday inspired me to do this via mini-projects, which take me around 3-4 hours each week. I think this makes it digestible and easier to stick to - this blog is also a way to keep me accountable and learning.

My objective is not only to practice data wrangling and visualization skills but also to learn a new R technique/command each week and to improve my data storytelling. I apologize in advance if the code is inexpertly written or if my 'revelations' are child's play to some readers. Let's get right to it!

TidyTuesday 2025/08/05 - Income Inequality Before and After Taxes

Please find all sources for the data in the original GitHub link from TT

Disclaimer: Please note the dataset presented does not reflect in any way my personal views or affiliations

Context: how is income distributed? How a population's income is distributed, and how (or whether) to redistribute it, has long been studied. The dataset provided this week contains historical population data alongside the GINI index.

The GINI index is a measure of income inequality.

I found it worthwhile to read a little into exactly what the GINI index is. Wikipedia defines the GINI index in terms of the Lorenz Curve. The Lorenz Curve is a graph which plots the cumulative share of income against the cumulative share of the population, ordered from poorest to richest.

Thus, the more that a concentrated core of people controls a large percentage of the income, the more bowed the Lorenz Curve is. B is the area under the Lorenz Curve. A is the area between the Lorenz Curve and the diagonal line of perfect equality, where every group of people holds a share of the income equal to its share of the population. A represents the 'jump' that a population would need to make to reach that idealized, completely equal distribution. Thus, GINI is defined as:

GINI = A / (A + B)

It describes how bowed this curve is. The higher GINI is, the bigger the jump the population has to make, i.e. the further away it is from the idealized distribution. A GINI index of 1 would mean an infinitesimally small proportion of the population holding all the income, while an index of 0 would be a perfectly even distribution. The dataset reports 2 variants of the GINI index: 'pre-tax' (market income) and 'post-tax' (disposable household income, which accounts not just for taxes paid to the government but also for benefits and subsidies received from it).

The higher the GINI index, the more uneven the distribution of income.
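
To make the A / (A + B) definition concrete, here is a minimal sketch in base R that builds a Lorenz Curve from a vector of incomes and computes GINI from the two areas. The function name `gini_from_lorenz` and the toy incomes are my own assumptions for illustration; they are not part of the TidyTuesday dataset, and the dataset's own values are computed on equivalized household incomes, not like this.

```r
# Hypothetical helper: GINI from the areas around the Lorenz Curve
gini_from_lorenz <- function(income) {
  income <- sort(income)                 # order from poorest to richest
  n <- length(income)
  # Lorenz Curve points: cumulative population share vs cumulative income share
  pop_share <- c(0, seq_len(n) / n)
  inc_share <- c(0, cumsum(income) / sum(income))
  # B: area under the Lorenz Curve, using the trapezoid rule
  B <- sum(diff(pop_share) * (head(inc_share, -1) + tail(inc_share, -1)) / 2)
  # A + B is the triangle under the line of perfect equality (area 0.5)
  A <- 0.5 - B
  A / (A + B)
}

even   <- gini_from_lorenz(c(10, 10, 10, 10))  # everyone earns the same
uneven <- gini_from_lorenz(c(0, 0, 0, 40))     # one person earns everything
c(even = even, uneven = uneven)
```

With only four people, the 'one person earns everything' case gives (n - 1) / n = 0.75 and not 1; the value approaches 1 only as the group holding everything becomes an ever smaller share of the population.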

First analysis

The first thing that I did was simply to plot population growth graphs out of curiosity for continents and countries I was interested in.

A graph depicting the population curves of a select few countries

Figure 1

My next graph mapped the latest post-tax GINI indices onto a map of the world.

A world map depicting the post-tax GINI indices

Figure 2

I produced this map with the following code:

    
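    library(dplyr)    # %>%, mutate(), case_when(), left_join(), slice_max()
    library(ggplot2)  # map_data(), ggplot(), geom_polygon()
    library(maps)     # supplies the "world" outlines read by map_data()
    
    world <- map_data("world")
    
    # rename regions so that they match the country names in the
    # inequality data, which allows the two tables to be joined
    world <- world %>%
      mutate(
        region = case_when(
          region == "USA" ~ "United States",
          region == "UK" ~ "United Kingdom",
          region == "Ivory Coast" ~ "Cote d'Ivoire",
          TRUE ~ region
        )
      )
    
    # attach the inequality data to every polygon vertex of each country
    inequality_change_map_df <- left_join(world,
                                          income_inequality_processed_cleaned,
                                          by = "region")
    
    # choose only the latest year with a post-tax GINI for each country
    # (ties are kept, so all polygon vertices for that year survive)
    inequality_change_map_df <- inequality_change_map_df %>%
      filter(!is.na(gini_dhi_eq)) %>%
      group_by(region) %>%
      slice_max(order_by = year, n = 1) %>%
      ungroup()
    
    ggplot() +
      # grey base layer, so countries without data are still drawn
      geom_polygon(data = world,
                   aes(x = long, y = lat, group = group),
                   fill = "gray90") +
      # countries with data, coloured by post-tax GINI
      geom_polygon(data = inequality_change_map_df,
                   aes(x = long, y = lat, group = group, fill = gini_dhi_eq),
                   color = "black", linewidth = 0.1) +
      ...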

I learnt that the maps package existed and how to access it for rudimentary map usage and plotting. In the future, I want to use a map more seriously.

These results offer a good snapshot of the world. The map easily highlighted countries with high inequality: South Africa, Colombia, and Ivory Coast particularly stood out to me.

Second analysis: GINI over time

How has GINI changed over time? This was a question I thought was interesting. I first plotted the GINI pre- and post-tax.

A graph depicting time series of GINI indices.

Figure 3

This graph shows the overall world tendency for the GINI index (for the available n = 59 countries). It seems that mean pre-tax income inequality has risen by almost 0.1 since the 1970s (a 25% increase from 0.4). However, we must bear in mind that the set of countries changes over time: not all countries' data were available as early as the 1970s, as we will see below. The rise may therefore simply reflect the later addition of countries with high GINI indices (such as those mentioned above), which would pull the mean up even if no individual country changed.
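
To illustrate this caveat, here is a minimal sketch in base R with entirely made-up numbers: three hypothetical countries whose GINI never changes, one of which only starts reporting in 2000. The yearly mean still rises, while the mean over countries present in every year stays flat. The column names and values are assumptions for illustration only.

```r
# Hypothetical panel: A and B report from 1970, C joins in 2000.
# Every country's GINI is constant over time (all values are made up).
panel <- data.frame(
  region = c("A", "A", "B", "B", "C"),
  year   = c(1970, 2000, 1970, 2000, 2000),
  gini   = c(0.40, 0.40, 0.44, 0.44, 0.60)
)

# Mean over whichever countries happen to report in each year
mean_all <- tapply(panel$gini, panel$year, mean)

# Mean restricted to countries that report in every year (a balanced panel)
n_years <- length(unique(panel$year))
always_present <- names(which(table(panel$region) == n_years))
balanced <- panel[panel$region %in% always_present, ]
mean_balanced <- tapply(balanced$gini, balanced$year, mean)

mean_all       # rises from 1970 to 2000, though no country changed
mean_balanced  # identical in both years
```

Comparing the two means is a quick way to check whether a trend in an average comes from changes within countries or from a change in which countries are included.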

A graph depicting time series of GINI indices for a variety of countries.

Figure 4

On the other hand, if we look at the data country-by-country, we can see that both effects may be playing a role:

Countries such as the United States, the United Kingdom, Norway, Finland, and, to a lesser extent, Canada all trend upwards in pre-tax GINI (blue lines). The trend persists in post-tax GINI (red lines), although (qualitatively) it appears to be weaker.
Countries like Brazil and Costa Rica raise the mean trend, especially around the 2000s.

Overall, while conclusions should not be drawn without more rigorous analysis, the world appears to be trending towards more income inequality. This is not the case in all countries (see Brazil - decreasing, yet still high). The country-by-country analysis also let me spot some interesting moments in GINI changes which may or may not be linked to historical events:

The UK's rapid increase in GINI in the 1980s coincides with the Iron Lady's (Margaret Thatcher's) tenure as Prime Minister (1979-90).
Turkey's missing 2016 data point and subsequent increase in GINI coincide with the failed 2016 coup attempt, attributed to Gülenists, which I did not know had occurred until the time of writing.
The Finnish depression of the 1990s (unemployment rose from 3.5% to 18.9%), caused by economic bubbles, the collapse of the USSR, and, interestingly, an overreliance on paper production, also coincides with an increase in GINI.

To produce this graph, I merged the 'processed' and 'raw' datasets as some countries were not represented in the processed one provided by the author. The code used was:

    
    # the two data frames in rbind() are the cleaned-up processed and raw
    # datasets, respectively
    gini_by_age <- rbind(gini_by_age_1, gini_by_age_2) %>%
      # some country-years appear in both datasets; sort so that the row
      # with the higher pre-tax GINI comes first, then keep one row per
      # country-year (the raw dataset underestimated some observations
      # substantially, esp. Finland, which is why I think this is justified)
      arrange(region, desc(year), desc(gini_mi_eq)) %>%
      distinct(region, year, .keep_all = TRUE) %>%
      group_by(region) %>%
      mutate(
        redistribution = gini_mi_eq - gini_dhi_eq,
        # order_by = year makes first/last independent of the row order
        current_redistribution = last(redistribution, order_by = year),
        original_redistribution = first(redistribution, order_by = year)
      ) %>%
      ungroup()
    # this taught me to use arrange() and distinct() with the .keep_all
    # argument to get rid of data points duplicated in both datasets
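
As a small, self-contained illustration of the arrange-then-distinct pattern, here is a sketch on made-up data. The two toy tables stand in for the processed and raw datasets; their values are assumptions and do not come from the real data.

```r
library(dplyr)

# Hypothetical rows: both tables report Finland 1990, and the 'raw'
# value is lower (all numbers are made up)
processed <- tibble(region = "Finland",
                    year = c(1990, 1995),
                    gini_mi_eq = c(0.40, 0.45))
raw <- tibble(region = c("Finland", "Finland", "Peru"),
              year = c(1990, 2000, 2000),
              gini_mi_eq = c(0.30, 0.47, 0.55))

merged <- bind_rows(processed, raw) %>%
  # within each region/year, put the row we want to keep first
  arrange(region, year, desc(gini_mi_eq)) %>%
  # keep the first row per region/year, along with all other columns
  distinct(region, year, .keep_all = TRUE)

merged
```

Because `distinct()` keeps the first row it meets for each combination, the preceding `arrange()` decides which duplicate survives; here that is the higher of the two Finland 1990 values.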

The next thing to look at was the difference between the pre- and post-tax GINI indices. I labelled this difference 'redistribution' and calculated it as:

GINI pre-tax - GINI post-tax

It is interesting to note that (see Figure 4) many countries have similar pre-tax GINI indices but differ substantially in their post-tax indices (e.g. compare the USA with Germany). This gives us an indication of how much the government intervenes in the economy. As the author of the original dataset puts it:

While the change in inequality before and after tax gives us a measure of the extent of government redistribution, it does not tell us the total reduction in inequality caused by this redistribution.

A graph depicting time series of redistribution index for a variety of countries.

Figure 5

Belgium, Austria, Germany, and Finland appear to be among the countries with the highest redistribution metrics. The United States shows a peak in redistribution which coincides with President Biden's term in office and the COVID-19 pandemic.

Finally, I looked at how much each country's pre- and post-tax GINI has changed, in raw terms, since that country first started reporting them.

A bar chart depicting the change in GINI indices for a variety of countries.

Figure 6

Once more bearing in mind the age of the data (here highlighted by the color), we can see that the UK has had the largest increase in GINI both pre- and post-tax, while Brazil has shown the largest decrease in both metrics! Others, like Germany or Japan, have greatly increased pre-tax GINI, but redistribution has kept the post-tax GINI lower.

It is worth finishing by stressing that many factors go into how GINI is calculated and reported, and it would be an oversimplification to take any of the data presented here and draw sweeping conclusions about methods of governance or economics. To find out more about GINI, the data, and how to properly interpret it, I suggest you read the original post and the references therein.

That's all for this post. Until the next one, best of luck.