Data Cleaning and Handling Missing Data

The starwars dataset, provided by the dplyr package, is a structured dataset containing information about characters from the Star Wars universe. In this section, we will clean and transform the dataset to make it more suitable for analysis. The operations performed include column selection, handling missing values, and converting variables into appropriate data types.

Loading the Necessary Libraries

library(tidyverse)  # Loads core tidyverse packages for data manipulation and visualization

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The tidyverse package is a collection of R packages designed for data manipulation, visualization, and analysis. The dplyr and tidyr packages, included in the tidyverse, are essential for data cleaning.

Previewing the Data

starwars |> view()   # Opens the dataset in a viewer window
starwar <- starwars  # Create a copy of the dataset for modifications
colnames(starwars)   # Display column names

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"

view(): Provides an interactive table view of the dataset for manual inspection.
colnames(): Lists all column names in the dataset to identify unnecessary columns.

1. Removing Unnecessary Columns

starwar <- starwar |> dplyr::select(
  -"homeworld", -"films", -"vehicles", -"starships"
)

dplyr::select(): Selects or drops specific columns. The - sign before column names removes them from the dataset.
Columns such as “homeworld”, “films”, “vehicles”, and “starships” are removed because they are not needed for the analysis.

2. Handling Missing Values

Viewing Complete Cases (No Missing Values)

starwar %>% filter(complete.cases(.))

## # A tibble: 32 × 10
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  6 Biggs D…    183    84 black      light      brown           24   male  mascu…
##  7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
##  8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
## 10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
## # ℹ 22 more rows
## # ℹ 1 more variable: species <chr>

complete.cases(): Returns rows where all columns have non-missing values (NA).
This helps identify data ready for analysis without requiring imputation.

Viewing Rows with Missing Values

starwar %>% filter(!complete.cases(.))

## # A tibble: 55 × 10
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
##  2 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
##  3 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
##  4 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
##  5 Greedo      173    74 <NA>       green      black             44 male  mascu…
##  6 Jabba D…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
##  7 Jek Ton…    180   110 brown      fair       blue              NA <NA>  <NA>  
##  8 Mon Mot…    150    NA auburn     fair       blue              48 fema… femin…
##  9 Arvel C…     NA    NA brown      fair       brown             NA male  mascu…
## 10 Nien Nu…    160    68 none       grey       black             NA male  mascu…
## # ℹ 45 more rows
## # ℹ 1 more variable: species <chr>

!complete.cases(): Returns rows with at least one missing value.
‘!’ gives the opposite result, therefore, called as ‘not’. Here, the given function gives not-complete cases, just reverse to the previous ones.
This is useful for assessing where and why missing data occurs.

3. Replacing Missing Values

starwar <- starwar |> 
  mutate(hair_color = replace_na(hair_color, "none"))

mutate(): Used to create or modify variables in the dataset.
replace_na(): Replaces NA values in the “hair_color” column with “none”. After understanding the data, we replace missing values for “hair_color” to reflect characters with no hair.

4. Selecting Specific Columns

starwar <- starwar |>
  select(height, mass, hair_color, birth_year, sex, gender)

str(starwar)

## tibble [87 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:87] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:87] "male" "none" "none" "male" ...
##  $ gender    : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...

The dataset is filtered to include only the relevant columns for analysis:

5. Handling Missing Values in Specific Columns

Removing Rows with Missing mass

starwar <- starwar |> drop_na(mass)

str(starwar)

## tibble [59 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:59] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:59] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:59] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:59] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:59] "male" "none" "none" "male" ...
##  $ gender    : chr [1:59] "masculine" "masculine" "masculine" "masculine" ...

drop_na(mass): Removes rows where the mass column contains NA.

Removing All Rows with Missing Values

starwar <- starwar |> drop_na()
str(starwar)

## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : chr [1:36] "male" "none" "none" "male" ...
##  $ gender    : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...

drop_na(): Removes rows with any NA values across all columns.

Alternatively:

starwar <- starwar |> na.omit()
str(starwar)

## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : chr [1:36] "male" "none" "none" "male" ...
##  $ gender    : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...

na.omit(): Base R function for removing rows with missing values.

6. Converting Columns to Factors

starwar <- starwar |>
  mutate(
    hair_color = as.factor(hair_color),
    sex = as.factor(sex),
    gender = as.factor(gender)
  )

as.factor(): Converts variables to factors, which are categorical data types in R.
Converting columns like “hair_color”, “sex”, and “gender” to factors is useful for analysis, especially for statistical models or visualizations.

str(starwar)

## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...

str(): Displays the structure of the cleaned dataset, showing column names, data types, and sample values.

7. Encoding Nominal Values

Assigning Numeric Values to Categorical Variables To perform numerical analysis on categorical variables, we can recode them into numeric equivalents. Here, “masculine” is assigned a value of 1, and “feminine” is assigned a value of 2.

starwar <- starwar |>
  mutate(gender_new = recode(
    gender, 
    "masculine" = 1,
    "feminine" = 2
  ))

str(starwar)

## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
##  $ gender_new: num [1:36] 1 1 1 1 2 1 2 1 1 1 ...

recode(): Used to replace specific values in a variable.
mutate(): Creates a new column, “gender_new”, with encoded values.

Converting the new column into a factor

starwar$gender_new <- as.factor(starwar$gender_new)

as.factor(): Converts the numeric column back into a categorical variable for further analysis.

str(starwar)

## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
##  $ gender_new: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 2 1 1 1 ...

8. Handling Duplicates

Duplicate rows can skew data analysis and lead to incorrect conclusions. Here, we demonstrate identifying and removing duplicates.

Creating a Dataset with Duplicates

# Example dataset with duplicate rows
Name <- c("A", "B", "C", "A")
value <- c(1, 2, 3, 1)
combined_df <- data.frame(Name, value)

Identifying Duplicates

# Check for duplicate rows
duplicated(combined_df)

## [1] FALSE FALSE FALSE  TRUE

View duplicate rows

combined_df[duplicated(combined_df), ]

##   Name value
## 4    A     1

View non-duplicate rows

combined_df[!duplicated(combined_df), ]

##   Name value
## 1    A     1
## 2    B     2
## 3    C     3

Removing Duplicate rows

combined_df <- combined_df |> distinct()

combined_df

##   Name value
## 1    A     1
## 2    B     2
## 3    C     3

Verifying the data doesn’t have any duplicated results.

duplicated(combined_df) |> table()

## 
## FALSE 
##     3

duplicated(): Identifies rows that are duplicates of previous rows.
distinct(): Removes duplicate rows and retains only unique rows.
Table Output: Displays a count of duplicates (TRUE) vs. unique rows (FALSE).

9. Correcting Data Types

Sometimes, data imported into R may have incorrect data types. For instance, numeric variables might be stored as strings. Here’s how to correct such issues:

Example Dataset with Incorrect Types

data_types <- data.frame(
  ID = c("A", "B", "C"),
  Age = c("25", "30", "35")
)

Check the structure of the dataset

str(data_types)

## 'data.frame':    3 obs. of  2 variables:
##  $ ID : chr  "A" "B" "C"
##  $ Age: chr  "25" "30" "35"

Converting Data Types. Convert Age from character to numeric

data_types$Age <- as.numeric(data_types$Age)

as.numeric(): Converts character variables into numeric format. Conversely, as.character() is used to convert any data types into character format.

Check the updated structure

str(data_types)

## 'data.frame':    3 obs. of  2 variables:
##  $ ID : chr  "A" "B" "C"
##  $ Age: num  25 30 35

10. Removing Outliers

Outliers are the extreme values that are present in data. They can distort statistical analysis. One common method to identify and remove outliers is the Interquartile Range (IQR).

Removing Outliers in mpg (miles per gallon) from the mtcars Dataset

# Preview the dataset
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Calculating Interquartile Range (IQR). It is the difference between third quartile and the first quartile.

# Calculate the IQR
Q1 <- quantile(mtcars$mpg, 0.25)  # First quartile (25th percentile)
Q3 <- quantile(mtcars$mpg, 0.75)  # Third quartile (75th percentile)
IQR <- Q3 - Q1                    # Interquartile range

Determining outlier thresholds. This approach was Proposed by John Tukey in the 1980s, which identifies outliers as data points lying outside 1.5 * IQR below the first quartile or above the third quartile.

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

Removing outliers

no_outliers <- mtcars %>%
  filter(mpg >= lower_bound & mpg <= upper_bound)


# Check the cleaned dataset
head(no_outliers)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

str(no_outliers)

## 'data.frame':    31 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

There are other ways of removing outliers as well