The starwars dataset, provided by the dplyr package, is a structured dataset containing information about characters from the Star Wars universe. In this section, we will clean and transform the dataset to make it more suitable for analysis. The operations performed include column selection, handling missing values, and converting variables into appropriate data types.
Loading the Necessary Libraries
library(tidyverse) # Loads core tidyverse packages for data manipulation and visualization
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The tidyverse package is a collection of R packages designed for data manipulation, visualization, and analysis. The dplyr and tidyr packages, included in the tidyverse, are essential for data cleaning.
Previewing the Data
starwars |> view() # Opens the dataset in a viewer window
starwar <- starwars # Create a copy of the dataset for modifications
colnames(starwars) # Display column names
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships"
starwar <- starwar |> dplyr::select(
-"homeworld", -"films", -"vehicles", -"starships"
)
starwar %>% filter(complete.cases(.))
## # A tibble: 32 × 10
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 Darth V… 202 136 none white yellow 41.9 male mascu…
## 3 Leia Or… 150 49 brown light brown 19 fema… femin…
## 4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 5 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 6 Biggs D… 183 84 black light brown 24 male mascu…
## 7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## 8 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 9 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 10 Han Solo 180 80 brown fair brown 29 male mascu…
## # ℹ 22 more rows
## # ℹ 1 more variable: species <chr>
starwar %>% filter(!complete.cases(.))
## # A tibble: 55 × 10
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 3 R5-D4 97 32 <NA> white, red red NA none mascu…
## 4 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
## 5 Greedo 173 74 <NA> green black 44 male mascu…
## 6 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 7 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
## 8 Mon Mot… 150 NA auburn fair blue 48 fema… femin…
## 9 Arvel C… NA NA brown fair brown NA male mascu…
## 10 Nien Nu… 160 68 none grey black NA male mascu…
## # ℹ 45 more rows
## # ℹ 1 more variable: species <chr>
starwar <- starwar |>
mutate(hair_color = replace_na(hair_color, "none"))
starwar <- starwar |>
select(height, mass, hair_color, birth_year, sex, gender)
str(starwar)
## tibble [87 × 6] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
## $ mass : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
## $ hair_color: chr [1:87] "blond" "none" "none" "none" ...
## $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
## $ sex : chr [1:87] "male" "none" "none" "male" ...
## $ gender : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...
The dataset is filtered to include only the relevant columns for analysis:
starwar <- starwar |> drop_na(mass)
str(starwar)
## tibble [59 × 6] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:59] 172 167 96 202 150 178 165 97 183 182 ...
## $ mass : num [1:59] 77 75 32 136 49 120 75 32 84 77 ...
## $ hair_color: chr [1:59] "blond" "none" "none" "none" ...
## $ birth_year: num [1:59] 19 112 33 41.9 19 52 47 NA 24 57 ...
## $ sex : chr [1:59] "male" "none" "none" "male" ...
## $ gender : chr [1:59] "masculine" "masculine" "masculine" "masculine" ...
starwar <- starwar |> drop_na()
str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
## $ mass : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
## $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
## $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
## $ sex : chr [1:36] "male" "none" "none" "male" ...
## $ gender : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...
Alternatively:
starwar <- starwar |> na.omit()
str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
## $ mass : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
## $ hair_color: chr [1:36] "blond" "none" "none" "none" ...
## $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
## $ sex : chr [1:36] "male" "none" "none" "male" ...
## $ gender : chr [1:36] "masculine" "masculine" "masculine" "masculine" ...
starwar <- starwar |>
mutate(
hair_color = as.factor(hair_color),
sex = as.factor(sex),
gender = as.factor(gender)
)
str(starwar)
## tibble [36 × 6] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
## $ mass : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
## $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
## $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
## $ sex : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
## $ gender : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
Assigning Numeric Values to Categorical Variables To perform numerical analysis on categorical variables, we can recode them into numeric equivalents. Here, “masculine” is assigned a value of 1, and “feminine” is assigned a value of 2.
starwar <- starwar |>
mutate(gender_new = recode(
gender,
"masculine" = 1,
"feminine" = 2
))
str(starwar)
## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
## $ mass : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
## $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
## $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
## $ sex : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
## $ gender : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
## $ gender_new: num [1:36] 1 1 1 1 2 1 2 1 1 1 ...
Converting the new column into a factor
starwar$gender_new <- as.factor(starwar$gender_new)
as.factor(): Converts the numeric column back into a categorical variable for further analysis.
str(starwar)
## tibble [36 × 7] (S3: tbl_df/tbl/data.frame)
## $ height : int [1:36] 172 167 96 202 150 178 165 183 182 188 ...
## $ mass : num [1:36] 77 75 32 136 49 120 75 84 77 84 ...
## $ hair_color: Factor w/ 8 levels "auburn, white",..: 3 7 7 7 4 5 4 2 1 3 ...
## $ birth_year: num [1:36] 19 112 33 41.9 19 52 47 24 57 41.9 ...
## $ sex : Factor w/ 4 levels "female","hermaphroditic",..: 3 4 4 3 1 3 1 3 3 3 ...
## $ gender : Factor w/ 2 levels "feminine","masculine": 2 2 2 2 1 2 1 2 2 2 ...
## $ gender_new: Factor w/ 2 levels "1","2": 1 1 1 1 2 1 2 1 1 1 ...
Duplicate rows can skew data analysis and lead to incorrect conclusions. Here, we demonstrate identifying and removing duplicates.
Creating a Dataset with Duplicates
# Example dataset with duplicate rows
Name <- c("A", "B", "C", "A")
value <- c(1, 2, 3, 1)
combined_df <- data.frame(Name, value)
Identifying Duplicates
# Check for duplicate rows
duplicated(combined_df)
## [1] FALSE FALSE FALSE TRUE
View duplicate rows
combined_df[duplicated(combined_df), ]
## Name value
## 4 A 1
View non-duplicate rows
combined_df[!duplicated(combined_df), ]
## Name value
## 1 A 1
## 2 B 2
## 3 C 3
Removing Duplicate rows
combined_df <- combined_df |> distinct()
combined_df
## Name value
## 1 A 1
## 2 B 2
## 3 C 3
Verifying the data doesn’t have any duplicated results.
duplicated(combined_df) |> table()
##
## FALSE
## 3
Sometimes, data imported into R may have incorrect data types. For instance, numeric variables might be stored as strings. Here’s how to correct such issues:
Example Dataset with Incorrect Types
data_types <- data.frame(
ID = c("A", "B", "C"),
Age = c("25", "30", "35")
)
Check the structure of the dataset
str(data_types)
## 'data.frame': 3 obs. of 2 variables:
## $ ID : chr "A" "B" "C"
## $ Age: chr "25" "30" "35"
Converting Data Types. Convert Age from character to numeric
data_types$Age <- as.numeric(data_types$Age)
as.numeric()
: Converts character variables into numeric
format. Conversely, as.character()
is used to convert any
data types into character format.Check the updated structure
str(data_types)
## 'data.frame': 3 obs. of 2 variables:
## $ ID : chr "A" "B" "C"
## $ Age: num 25 30 35
Outliers are the extreme values that are present in data. They can distort statistical analysis. One common method to identify and remove outliers is the Interquartile Range (IQR).
Removing Outliers in mpg (miles per gallon) from the mtcars Dataset
# Preview the dataset
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Calculating Interquartile Range (IQR). It is the difference between third quartile and the first quartile.
# Calculate the IQR
Q1 <- quantile(mtcars$mpg, 0.25) # First quartile (25th percentile)
Q3 <- quantile(mtcars$mpg, 0.75) # Third quartile (75th percentile)
IQR <- Q3 - Q1 # Interquartile range
Determining outlier thresholds. This approach was Proposed by John Tukey in the 1980s, which identifies outliers as data points lying outside 1.5 * IQR below the first quartile or above the third quartile.
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
Removing outliers
no_outliers <- mtcars %>%
filter(mpg >= lower_bound & mpg <= upper_bound)
# Check the cleaned dataset
head(no_outliers)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(no_outliers)
## 'data.frame': 31 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
There are other ways of removing outliers as well