The starwars dataset, provided by the dplyr package, is a structured dataset containing information about characters from the Star Wars universe. In this section, we will clean and transform the dataset to make it more suitable for analysis. The operations performed include column selection, handling missing values, and converting variables into appropriate data types.

Loading the Necessary Libraries

library(tidyverse)  # Loads core tidyverse packages for data manipulation and visualization
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The tidyverse package is a collection of R packages designed for data manipulation, visualization, and analysis. The dplyr and tidyr packages, included in the tidyverse, are essential for data cleaning.


Previewing the Data

starwars |> view()   # Opens the dataset in a viewer window
starwar <- starwars  # Create a copy of the dataset for modifications
colnames(starwars)   # Display column names
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"


1. Removing Unnecessary Columns

starwar <- starwar |> dplyr::select(
  -"homeworld", -"films", -"vehicles", -"starships"
)


2. Handling Missing Values

Viewing Complete Cases (No Missing Values)
starwar %>% filter(complete.cases(.))
## # A tibble: 32 × 10
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  6 Biggs D…    183    84 black      light      brown           24   male  mascu…
##  7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
##  8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
## 10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
## # ℹ 22 more rows
## # ℹ 1 more variable: species <chr>
  • complete.cases(): Returns rows where all columns have non-missing values (NA).
  • This helps identify data ready for analysis without requiring imputation.


Viewing Rows with Missing Values
starwar %>% filter(!complete.cases(.))
## # A tibble: 55 × 10
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 C-3PO       167    75 <NA>       gold       yellow           112 none  mascu…
##  2 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
##  3 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
##  4 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
##  5 Greedo      173    74 <NA>       green      black             44 male  mascu…
##  6 Jabba D…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
##  7 Jek Ton…    180   110 brown      fair       blue              NA <NA>  <NA>  
##  8 Mon Mot…    150    NA auburn     fair       blue              48 fema… femin…
##  9 Arvel C…     NA    NA brown      fair       brown             NA male  mascu…
## 10 Nien Nu…    160    68 none       grey       black             NA male  mascu…
## # ℹ 45 more rows
## # ℹ 1 more variable: species <chr>
  • !complete.cases(): Returns rows with at least one missing value.
  • ‘!’ gives the opposite result, therefore, called as ‘not’. Here, the given function gives not-complete cases, just reverse to the previous ones.
  • This is useful for assessing where and why missing data occurs.


3. Replacing Missing Values

starwar <- starwar |> 
  mutate(hair_color = replace_na(hair_color, "none"))


4. Selecting Specific Columns

starwar <- starwar |>
  select(height, mass, hair_color, birth_year, sex, gender)

str(starwar)
## tibble [87 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:87] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:87] "male" "none" "none" "male" ...
##  $ gender    : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...

The dataset is filtered to include only the relevant columns for analysis:



5. Handling Missing Values in Specific Columns

Removing Rows with Missing mass
starwar <- starwar |> drop_na(mass)

str(starwar)
## tibble [59 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:59] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:59] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:59] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:59] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:59] "male" "none" "none" "male" ...
##  $ gender    : chr [1:59] "masculine" "masculine" "masculine" "masculine" ...
  • drop_na(mass): Removes rows where the mass column contains NA.


Removing All Rows with Missing Values
starwar <- starwar |> drop_na()
str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : chr [1:36] "male" "none" "none" "male" ...
##  $ gender    : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...
  • drop_na(): Removes rows with any NA values across all columns.


Alternatively:

starwar <- starwar |> na.omit()
str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : chr [1:36] "male" "none" "none" "male" ...
##  $ gender    : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...
  • na.omit(): Base R function for removing rows with missing values.


6. Converting Columns to Factors

starwar <- starwar |>
  mutate(
    hair_color = as.factor(hair_color),
    sex = as.factor(sex),
    gender = as.factor(gender)
  )


str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...


7. Encoding Nominal Values

Assigning Numeric Values to Categorical Variables To perform numerical analysis on categorical variables, we can recode them into numeric equivalents. Here, “masculine” is assigned a value of 1, and “feminine” is assigned a value of 2.

starwar <- starwar |>
  mutate(gender_new = recode(
    gender, 
    "masculine" = 1,
    "feminine" = 2
  ))

str(starwar)
## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
##  $ gender_new: num [1:36] 1 1 1 1 2 1 2 1 1 1 ...


Converting the new column into a factor

starwar$gender_new <- as.factor(starwar$gender_new)

as.factor(): Converts the numeric column back into a categorical variable for further analysis.


str(starwar)
## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
##  $ height    : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
##  $ mass      : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
##  $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
##  $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
##  $ sex       : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
##  $ gender    : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
##  $ gender_new: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 2 1 1 1 ...


8. Handling Duplicates

Duplicate rows can skew data analysis and lead to incorrect conclusions. Here, we demonstrate identifying and removing duplicates.

Creating a Dataset with Duplicates

# Example dataset with duplicate rows
Name <- c("A", "B", "C", "A")
value <- c(1, 2, 3, 1)
combined_df <- data.frame(Name, value)


Identifying Duplicates

# Check for duplicate rows
duplicated(combined_df)
## [1] FALSE FALSE FALSE  TRUE


View duplicate rows

combined_df[duplicated(combined_df), ]
##   Name value
## 4    A     1


View non-duplicate rows

combined_df[!duplicated(combined_df), ]
##   Name value
## 1    A     1
## 2    B     2
## 3    C     3


Removing Duplicate rows

combined_df <- combined_df |> distinct()

combined_df
##   Name value
## 1    A     1
## 2    B     2
## 3    C     3


Verifying the data doesn’t have any duplicated results.

duplicated(combined_df) |> table()
## 
## FALSE 
##     3


9. Correcting Data Types

Sometimes, data imported into R may have incorrect data types. For instance, numeric variables might be stored as strings. Here’s how to correct such issues:

Example Dataset with Incorrect Types

data_types <- data.frame(
  ID = c("A", "B", "C"),
  Age = c("25", "30", "35")
)


Check the structure of the dataset

str(data_types)
## 'data.frame':    3 obs. of  2 variables:
##  $ ID : chr  "A" "B" "C"
##  $ Age: chr  "25" "30" "35"


Converting Data Types. Convert Age from character to numeric

data_types$Age <- as.numeric(data_types$Age)


Check the updated structure

str(data_types)
## 'data.frame':    3 obs. of  2 variables:
##  $ ID : chr  "A" "B" "C"
##  $ Age: num  25 30 35


10. Removing Outliers

Outliers are the extreme values that are present in data. They can distort statistical analysis. One common method to identify and remove outliers is the Interquartile Range (IQR).

Removing Outliers in mpg (miles per gallon) from the mtcars Dataset

# Preview the dataset
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


Calculating Interquartile Range (IQR). It is the difference between third quartile and the first quartile.

# Calculate the IQR
Q1 <- quantile(mtcars$mpg, 0.25)  # First quartile (25th percentile)
Q3 <- quantile(mtcars$mpg, 0.75)  # Third quartile (75th percentile)
IQR <- Q3 - Q1                    # Interquartile range


Determining outlier thresholds. This approach was Proposed by John Tukey in the 1980s, which identifies outliers as data points lying outside 1.5 * IQR below the first quartile or above the third quartile.

lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR


Removing outliers

no_outliers <- mtcars %>%
  filter(mpg >= lower_bound & mpg <= upper_bound)


# Check the cleaned dataset
head(no_outliers)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(no_outliers)
## 'data.frame':    31 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

There are other ways of removing outliers as well