Common commands that I use for data analysis and simulation.
First, libraries and a data frame.
using DataFrames, DataFramesMeta, Query, Statistics, CSV, CSVFiles, Distributions, Random, PrettyTables

df = DataFrame(
    name = ["josh", "adam", "jenny", "rory", "thomas", "bill", "sarah", "allie", "shannon", "cloe"],
    id = collect(1:10),
    satisfaction = [2, 4, 5, 1, 3, 3, 3, 2, 4, 5],
    need_achievement = [8, 3, 4, 7, 2, 9, 10, 5, 4, 6],
    fav_color = ["red", "red", "green", "orange", "blue", "yellow", "purple", "orange", "silver", "blue"],
    number_of_kids = [2, 3, 0, 0, 0, 0, 2, 1, 1, 0],
    instrument = ["piano", "guitar", "none", "piano", "saxophone", "none", "none", "none", "none", "clarinet"],
    team_id = vcat(
        repeat(["awesome"], 5),
        repeat(["best"], 5)
    ),
    dyad_id = vcat(
        repeat(["v"], 2),
        repeat(["w"], 2),
        repeat(["x"], 2),
        repeat(["y"], 2),
        repeat(["z"], 2)
    ),
    dyad_trust = [2, 2, 5, 5, 7, 7, 8, 8, 3, 3],
    team_performance = [7, 7, 7, 7, 7, 4, 4, 4, 4, 4],
    nation = repeat(["united_states"], 10),
    common_value = repeat([3], 10)
)
| Row | name | id | satisfaction | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | nation | common_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 |
| 1 | josh | 1 | 2 | 8 | red | 2 | piano | awesome | v | 2 | 7 | united_states | 3 |
| 2 | adam | 2 | 4 | 3 | red | 3 | guitar | awesome | v | 2 | 7 | united_states | 3 |
| 3 | jenny | 3 | 5 | 4 | green | 0 | none | awesome | w | 5 | 7 | united_states | 3 |
| 4 | rory | 4 | 1 | 7 | orange | 0 | piano | awesome | w | 5 | 7 | united_states | 3 |
| 5 | thomas | 5 | 3 | 2 | blue | 0 | saxophone | awesome | x | 7 | 7 | united_states | 3 |
| 6 | bill | 6 | 3 | 9 | yellow | 0 | none | best | x | 7 | 4 | united_states | 3 |
| 7 | sarah | 7 | 3 | 10 | purple | 2 | none | best | y | 8 | 4 | united_states | 3 |
| 8 | allie | 8 | 2 | 5 | orange | 1 | none | best | y | 8 | 4 | united_states | 3 |
| 9 | shannon | 9 | 4 | 4 | silver | 1 | none | best | z | 3 | 4 | united_states | 3 |
| 10 | cloe | 10 | 5 | 6 | blue | 0 | clarinet | best | z | 3 | 4 | united_states | 3 |
Identify characteristics of the data.
names(df)
size(df)
describe(df)
first(df)
last(df)
show(df, allrows = true, allcols = true)
10×13 DataFrames.DataFrame
│ Row │ name    │ id    │ satisfaction │ need_achievement │ fav_color │
│     │ String  │ Int64 │ Int64        │ Int64            │ String    │
├─────┼─────────┼───────┼──────────────┼──────────────────┼───────────┤
│ 1   │ josh    │ 1     │ 2            │ 8                │ red       │
│ 2   │ adam    │ 2     │ 4            │ 3                │ red       │
│ 3   │ jenny   │ 3     │ 5            │ 4                │ green     │
│ 4   │ rory    │ 4     │ 1            │ 7                │ orange    │
│ 5   │ thomas  │ 5     │ 3            │ 2                │ blue      │
│ 6   │ bill    │ 6     │ 3            │ 9                │ yellow    │
│ 7   │ sarah   │ 7     │ 3            │ 10               │ purple    │
│ 8   │ allie   │ 8     │ 2            │ 5                │ orange    │
│ 9   │ shannon │ 9     │ 4            │ 4                │ silver    │
│ 10  │ cloe    │ 10    │ 5            │ 6                │ blue      │

│ Row │ number_of_kids │ instrument │ team_id │ dyad_id │ dyad_trust │
│     │ Int64          │ String     │ String  │ String  │ Int64      │
├─────┼────────────────┼────────────┼─────────┼─────────┼────────────┤
│ 1   │ 2              │ piano      │ awesome │ v       │ 2          │
│ 2   │ 3              │ guitar     │ awesome │ v       │ 2          │
│ 3   │ 0              │ none       │ awesome │ w       │ 5          │
│ 4   │ 0              │ piano      │ awesome │ w       │ 5          │
│ 5   │ 0              │ saxophone  │ awesome │ x       │ 7          │
│ 6   │ 0              │ none       │ best    │ x       │ 7          │
│ 7   │ 2              │ none       │ best    │ y       │ 8          │
│ 8   │ 1              │ none       │ best    │ y       │ 8          │
│ 9   │ 1              │ none       │ best    │ z       │ 3          │
│ 10  │ 0              │ clarinet   │ best    │ z       │ 3          │

│ Row │ team_performance │ nation        │ common_value │
│     │ Int64            │ String        │ Int64        │
├─────┼──────────────────┼───────────────┼──────────────┤
│ 1   │ 7                │ united_states │ 3            │
│ 2   │ 7                │ united_states │ 3            │
│ 3   │ 7                │ united_states │ 3            │
│ 4   │ 7                │ united_states │ 3            │
│ 5   │ 7                │ united_states │ 3            │
│ 6   │ 4                │ united_states │ 3            │
│ 7   │ 4                │ united_states │ 3            │
│ 8   │ 4                │ united_states │ 3            │
│ 9   │ 4                │ united_states │ 3            │
│ 10  │ 4                │ united_states │ 3            │
Create new columns, one documenting that everyone speaks english and another representing each individual's favorite season.
# with base
df[!, :language] = repeat(["english"], 10)

# with data frames meta
df = @linq df |>
    transform(fav_season = vcat(
        repeat(["summer"], 5),
        repeat(["winter"], 5)
    ))

@pt df[:, [:team_performance, :language, :fav_season]]
┌──────────────────┬──────────┬────────────┐
│ team_performance │ language │ fav_season │
│            Int64 │   String │     String │
├──────────────────┼──────────┼────────────┤
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
└──────────────────┴──────────┴────────────┘
Rename columns. Change "satisfaction" to "job_sat" and "nation" to "country."
# with DataFrames
rename!(df, :satisfaction => :job_sat)

# with Query
df = df |> @rename(:nation => :country) |> DataFrame

@pt df[:, [:job_sat, :country]]
┌─────────┬───────────────┐
│ job_sat │       country │
│   Int64 │        String │
├─────────┼───────────────┤
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
│       1 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
└─────────┴───────────────┘
Reorder columns. Not easy to do unless you want to re-type every column name. I'm waiting for something like "df = df[:, [:team_id, :country, rest_of_columns()]]", which doesn't exist yet; a hedged workaround is sketched below.
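That said, newer releases of DataFrames.jl accept ":" inside select() to mean "every column not yet selected," which comes close. A sketch, assuming one of those newer versions (the df_reordered name is mine):

# move :team_id and :country to the front; `:` appends the remaining
# columns in their original order (newer DataFrames.jl releases only)
df_reordered = select(df, :team_id, :country, :)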
Pull out a single column in one of several ways. The last implementation also demonstrates how to select columns.
id_vec1 = df.id
id_vec2 = df[:, :id]

id_vec3 = @linq df |> select(:id)
id_vec3 = Vector(id_vec3.id)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
Randomly select columns.
num_cols = 4
possible_values = collect(1:1:10)
use_cols = sample(possible_values, num_cols, replace = false)

rand_df = @linq df |> select(use_cols)
@pt rand_df
┌─────────┬──────────────────┬─────────┬─────────┐
│ team_id │ need_achievement │    name │ job_sat │
│  String │            Int64 │  String │   Int64 │
├─────────┼──────────────────┼─────────┼─────────┤
│ awesome │                8 │    josh │       2 │
│ awesome │                3 │    adam │       4 │
│ awesome │                4 │   jenny │       5 │
│ awesome │                7 │    rory │       1 │
│ awesome │                2 │  thomas │       3 │
│    best │                9 │    bill │       3 │
│    best │               10 │   sarah │       3 │
│    best │                5 │   allie │       2 │
│    best │                4 │ shannon │       4 │
│    best │                6 │    cloe │       5 │
└─────────┴──────────────────┴─────────┴─────────┘
Create new columns. The first centers the job satisfaction scores around a hand-picked reference mean, whereas the second adds the trust and performance columns together to create a new variable. (A full z-score version is sketched after the output below.)
mean_jobsat = 2.5

df = @linq df |> transform(job_sat_standardized = :job_sat .- mean_jobsat)
df = @linq df |> transform(trust_plus_performance = :dyad_trust .+ :team_performance)

@pt df[:, [:dyad_trust, :team_performance, :trust_plus_performance]]
┌────────────┬──────────────────┬────────────────────────┐
│ dyad_trust │ team_performance │ trust_plus_performance │
│      Int64 │            Int64 │                  Int64 │
├────────────┼──────────────────┼────────────────────────┤
│          2 │                7 │                      9 │
│          2 │                7 │                      9 │
│          5 │                7 │                     12 │
│          5 │                7 │                     12 │
│          7 │                7 │                     14 │
│          7 │                4 │                     11 │
│          8 │                4 │                     12 │
│          8 │                4 │                     12 │
│          3 │                4 │                      7 │
│          3 │                4 │                      7 │
└────────────┴──────────────────┴────────────────────────┘
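Subtracting a hand-entered mean only centers the scores. A minimal sketch of a true standardization, computing the mean and sd from the column itself (the job_sat_z name is mine):

# z-score: subtract the observed mean and divide by the observed sd
df = @linq df |>
    transform(job_sat_z = (:job_sat .- mean(:job_sat)) ./ std(:job_sat))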
Sort the data based on the values within one or more columns.
df = @linq df |> orderby(:number_of_kids, :fav_color)
@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     5 │ awesome │       3 │              0 │      blue │
│    10 │    best │       5 │              0 │      blue │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     6 │    best │       3 │              0 │    yellow │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│     7 │    best │       3 │              2 │    purple │
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
└───────┴─────────┴─────────┴────────────────┴───────────┘
sort!(df, [order(:id), order(:team_id)])
@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     5 │ awesome │       3 │              0 │      blue │
│     6 │    best │       3 │              0 │    yellow │
│     7 │    best │       3 │              2 │    purple │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│    10 │    best │       5 │              0 │      blue │
└───────┴─────────┴─────────┴────────────────┴───────────┘
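To sort in descending order, order() also takes a rev keyword. A small sketch; note the non-mutating sort() so df keeps its id ordering:

# largest families first; sort() returns a sorted copy, sort!() works in place
sort(df, order(:number_of_kids, rev = true))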
Filter the data according to values within rows.
# simple row commands
@linq df |> where(:job_sat .== 2)
@linq df |> where(:fav_color .== "red", :team_id .== "awesome")

# specify which columns to keep
df[df.job_sat .> 3, [:name, :id, :job_sat, :need_achievement]]
@linq df |> where(:job_sat .> 3) |> select(:name, :id, :job_sat, :need_achievement)

# multiple filtering commands
df[(df.job_sat .> 3) .& (df.need_achievement .> 4), :]
@linq df |> where(:job_sat .> 3, :need_achievement .> 4)
| Row | name | id | job_sat | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | country | common_value | language | fav_season | job_sat_standardized | trust_plus_performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 | String | String | Float64 | Int64 |
| 1 | cloe | 10 | 5 | 6 | blue | 0 | clarinet | best | z | 3 | 4 | united_states | 3 | english | winter | 2.5 | 7 |
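DataFrames also supports Base's filter directly on a data frame (the same pattern appears again below), which gives a hedged one-line equivalent of the where calls:

# keep the rows whose predicate returns true
filter(row -> row.job_sat > 3, df)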
Obtain a subset of rows. Similar to R's "%in%" operator.
use_people = ["josh", "rory"]
small_df = df

# using DataFramesMeta
@linq df |> where(:name .∈ Ref(use_people))

# using Base
small_df[findall(in(use_people), small_df.name), :]

# a third option...
filter(row -> any(occursin.(use_people, row.name)), small_df)
| Row | name | id | job_sat | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | country | common_value | language | fav_season | job_sat_standardized | trust_plus_performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 | String | String | Float64 | Int64 |
| 1 | josh | 1 | 2 | 8 | red | 2 | piano | awesome | v | 2 | 7 | united_states | 3 | english | summer | -0.5 | 9 |
| 2 | rory | 4 | 1 | 7 | orange | 0 | piano | awesome | w | 5 | 7 | united_states | 3 | english | summer | -1.5 | 12 |
Identify a row and column value, use it in a computation, then replace the original value with the new computed value.
This seems archaic for a basic cheat sheet but I use this kind of script in almost every computational model that I create.
# filter to a row
person_i_need = "josh"
row_i_need = @linq df |> where(:name .== person_i_need)

# get josh's value of job satisfaction
# the trailing [1] matters: Vector(row_i_need.job_sat) alone returns a one-element vector
value_i_need = Vector(row_i_need.job_sat)[1]

# do something with his job sat value
new_js = value_i_need + 10 * 4

# put the new value back into the data (use .= to broadcast the scalar)
df[(df.name .== person_i_need) .& (df.team_id .== "awesome"), :job_sat] .= new_js

@pt df[:, [:name, :job_sat]]
┌─────────┬─────────┐
│    name │ job_sat │
│  String │   Int64 │
├─────────┼─────────┤
│    josh │      42 │
│    adam │       4 │
│   jenny │       5 │
│    rory │       1 │
│  thomas │       3 │
│    bill │       3 │
│   sarah │       3 │
│   allie │       2 │
│ shannon │       4 │
│    cloe │       5 │
└─────────┴─────────┘
Relabel rows or create cutoffs. Similar to R's case_when() function.
First example. Turn time points such as 1, 2, and 3 into character-labeled time points such as "affect1", "affect2", and "affect3." This seems odd but it's necessary for creating wide data later on.
# data frame
affect_df = DataFrame(
    person = [1,1,1, 2,2,2, 3,3,3],
    time = [1,2,3, 1,2,3, 1,2,3],
    affect = [3,4,2, 5,3,2, 4,3,2]
)
@pt affect_df
┌────────┬───────┬────────┐
│ person │  time │ affect │
│  Int64 │ Int64 │  Int64 │
├────────┼───────┼────────┤
│      1 │     1 │      3 │
│      1 │     2 │      4 │
│      1 │     3 │      2 │
│      2 │     1 │      5 │
│      2 │     2 │      3 │
│      2 │     3 │      2 │
│      3 │     1 │      4 │
│      3 │     2 │      3 │
│      3 │     3 │      2 │
└────────┴───────┴────────┘
# function specifying how the labels should change
function convert_rows(val)
    if val == 1
        result = "affect1"
    elseif val == 2
        result = "affect2"
    else
        result = "affect3"
    end
    return result
end

# use that function within a transform call to create a new column
# use f.() syntax to vectorize
affect_df = @linq affect_df |>
    transform(label_time = convert_rows.(:time))
@pt affect_df
┌────────┬───────┬────────┬────────────┐
│ person │  time │ affect │ label_time │
│  Int64 │ Int64 │  Int64 │     String │
├────────┼───────┼────────┼────────────┤
│      1 │     1 │      3 │    affect1 │
│      1 │     2 │      4 │    affect2 │
│      1 │     3 │      2 │    affect3 │
│      2 │     1 │      5 │    affect1 │
│      2 │     2 │      3 │    affect2 │
│      2 │     3 │      2 │    affect3 │
│      3 │     1 │      4 │    affect1 │
│      3 │     2 │      3 │    affect2 │
│      3 │     3 │      2 │    affect3 │
└────────┴───────┴────────┴────────────┘
Second example. Change character values into different character values.
# data
labels_df = DataFrame(
    labels = ["dog", "shoe", "barn", "door", "cash"]
)

# function specifying how the values should change
function convert_labels(label)
    if label == "dog"
        result = "blue"
    elseif label == "shoe"
        result = "green"
    elseif label == "barn"
        result = "orange"
    elseif label == "door"
        result = "black"
    elseif label == "cash"
        result = "yellow"
    end
    return result
end

# transform a column and use f.() syntax
labels_df = @linq labels_df |>
    transform(color_labels = convert_labels.(:labels))
@pt labels_df
┌────────┬──────────────┐
│ labels │ color_labels │
│ String │       String │
├────────┼──────────────┤
│    dog │         blue │
│   shoe │        green │
│   barn │       orange │
│   door │        black │
│   cash │       yellow │
└────────┴──────────────┘
Third example. Create cutoffs with boolean logic.
# data
intermediates_df = DataFrame(
    person = [1,2,3,4,5,6,7,8],
    score = [1,2,3,7,8,9,20,30]
)

# function specifying the cutoffs
function create_limits(x)
    if x < 5
        result = "low"
    elseif x < 10   # covers 5 <= x < 10 so no value falls through
        result = "medium"
    else
        result = "high"
    end
    return result
end

# transform a column and use f.() syntax
intermediates_df = @linq intermediates_df |>
    transform(cutoffs = create_limits.(:score))
@pt intermediates_df
┌────────┬───────┬─────────┐
│ person │ score │ cutoffs │
│  Int64 │ Int64 │  String │
├────────┼───────┼─────────┤
│      1 │     1 │     low │
│      2 │     2 │     low │
│      3 │     3 │     low │
│      4 │     7 │  medium │
│      5 │     8 │  medium │
│      6 │     9 │  medium │
│      7 │    20 │    high │
│      8 │    30 │    high │
└────────┴───────┴─────────┘
Fourth example: logic using multiple columns.
# data
weather_df = DataFrame(
    sun = [3,2,1,8,9,7],
    clouds = [8,5,7,2,1,3],
    day = [1,2,3,4,5,6]
)

# function specifying logic
function determine_weather(col1, col2)
    if col1 < col2
        result = "cloudy"
    elseif col1 > col2
        result = "sunny"
    else
        result = "hazy"
    end
    return result
end

# transform the data using a create new column command and f.() syntax
weather_df = @linq weather_df |>
    transform(weather = determine_weather.(:sun, :clouds))
@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘
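For two- or three-way logic like this, a broadcasted ifelse can replace the named function entirely. A sketch (the weather2 column name is mine):

# nested ifelse.() calls reproduce the cloudy/sunny/hazy logic
weather_df[!, :weather2] = ifelse.(weather_df.sun .> weather_df.clouds, "sunny",
    ifelse.(weather_df.sun .< weather_df.clouds, "cloudy", "hazy"))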
Preallocate storage. Using the "zeros" command seems to be the safest approach in Julia. I haven't yet learned the "Vector{Int}(undef, 10)" approach well; a tentative sketch appears after the output below.
periods = 10
scores = zeros(periods)

dm = DataFrame(
    performance = zeros(periods),
    satisfaction = zeros(periods)
)
| Row | performance | satisfaction |
|---|---|---|
| | Float64 | Float64 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 |
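For reference, a tentative sketch of the undef approach mentioned above: it allocates without zeroing, so every slot must be written before it is read.

# uninitialized vector; contents are garbage until assigned
v = Vector{Int}(undef, periods)
for i in eachindex(v)
    v[i] = i^2
end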
time = 20

# example 1: append an entire data frame to an empty df
dfzz = DataFrame(
    group_p = Float64[],
    performance = Float64[]
)
for i in 1:time
    append!(dfzz, DataFrame(
        group_p = i + 3,
        performance = i - 3
    ))
end
@pt dfzz
┌─────────┬─────────────┐
│ group_p │ performance │
│ Float64 │     Float64 │
├─────────┼─────────────┤
│     4.0 │        -2.0 │
│     5.0 │        -1.0 │
│     6.0 │         0.0 │
│     7.0 │         1.0 │
│     8.0 │         2.0 │
│     9.0 │         3.0 │
│    10.0 │         4.0 │
│    11.0 │         5.0 │
│    12.0 │         6.0 │
│    13.0 │         7.0 │
│    14.0 │         8.0 │
│    15.0 │         9.0 │
│    16.0 │        10.0 │
│    17.0 │        11.0 │
│    18.0 │        12.0 │
│    19.0 │        13.0 │
│    20.0 │        14.0 │
│    21.0 │        15.0 │
│    22.0 │        16.0 │
│    23.0 │        17.0 │
└─────────┴─────────────┘
# example 2: add to an initialized df
dfyy = DataFrame(
    group_p = repeat([0], length(1:time)),
    performance = repeat([0], length(1:time))
)
for i in 1:time
    dfyy[i, :group_p] = i + 2
    dfyy[i, :performance] = i - 3
end
@pt dfyy
┌─────────┬─────────────┐
│ group_p │ performance │
│   Int64 │       Int64 │
├─────────┼─────────────┤
│       3 │          -2 │
│       4 │          -1 │
│       5 │           0 │
│       6 │           1 │
│       7 │           2 │
│       8 │           3 │
│       9 │           4 │
│      10 │           5 │
│      11 │           6 │
│      12 │           7 │
│      13 │           8 │
│      14 │           9 │
│      15 │          10 │
│      16 │          11 │
│      17 │          12 │
│      18 │          13 │
│      19 │          14 │
│      20 │          15 │
│      21 │          16 │
│      22 │          17 │
└─────────┴─────────────┘
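A hedged third option: push! grows a data frame one row at a time from a named tuple (the dfxx name is mine).

# example 3 (a sketch): push a named tuple per iteration
dfxx = DataFrame(
    group_p = Float64[],
    performance = Float64[]
)
for i in 1:time
    push!(dfxx, (group_p = i + 3, performance = i - 3))
end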
Use dot notation to apply a function to every value in a vector.
vec1 = [0, 3, 4, 5, 6]

function addtwo(x)
    y = x + 2
    return y
end

# don't use addtwo(vec1) -- adding an integer to a vector throws an error
# use the broadcast form instead
addtwo.(vec1)
5-element Array{Int64,1}:
 2
 5
 6
 7
 8
Calculate column means.
# mean of a single column
mean(df.need_achievement)

# within each team, calculate the mean job satisfaction
@linq df |>
    groupby(:team_id) |>
    based_on(mean_jobsat = mean(:job_sat))

# within each dyad, calculate the mean and sd of need for achievement
@linq df |>
    groupby(:dyad_id) |>
    based_on(
        mean_need_achievement = mean(:need_achievement),
        sd_need_achievement = std(:need_achievement)
    )
| Row | dyad_id | mean_need_achievement | sd_need_achievement |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | v | 5.5 | 3.53553 |
| 2 | w | 5.5 | 2.12132 |
| 3 | x | 5.5 | 4.94975 |
| 4 | y | 7.5 | 3.53553 |
| 5 | z | 5.0 | 1.41421 |
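For reference, newer DataFrames.jl versions express grouped summaries with combine rather than based_on. A sketch of the dyad example, assuming one of those versions:

# newer DataFrames.jl grouping API: source column => function => output name
combine(
    groupby(df, :dyad_id),
    :need_achievement => mean => :mean_need_achievement,
    :need_achievement => std => :sd_need_achievement
)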
What if there is missing data?
missing_df = DataFrame(
    team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
    name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
    cohesion = [3,6,missing,7,8,missing]
)
@pt missing_df
┌─────────┬─────────┬───────────────────────┐
│ team_id │    name │              cohesion │
│  String │  String │ Union{Missing, Int64} │
├─────────┼─────────┼───────────────────────┤
│  team_a │     ted │                     3 │
│  team_a │ stewart │                     6 │
│  team_a │  jessie │               missing │
│  team_b │   susan │                     7 │
│  team_b │   carie │                     8 │
│  team_b │  rachel │               missing │
└─────────┴─────────┴───────────────────────┘
@linq missing_df |>
    groupby(:team_id) |>
    based_on(
        mean_cohesion = mean(skipmissing(:cohesion)),
        sd_cohesion = std(skipmissing(:cohesion))
    )
| Row | team_id | mean_cohesion | sd_cohesion |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | team_a | 4.5 | 2.12132 |
| 2 | team_b | 7.5 | 0.707107 |
What if the missing data is labeled as NaN?
missing_df_nan = DataFrame(
    team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
    name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
    cohesion = [3,6,NaN,7,8,NaN]
)
@pt missing_df_nan
┌─────────┬─────────┬──────────┐
│ team_id │    name │ cohesion │
│  String │  String │  Float64 │
├─────────┼─────────┼──────────┤
│  team_a │     ted │      3.0 │
│  team_a │ stewart │      6.0 │
│  team_a │  jessie │      NaN │
│  team_b │   susan │      7.0 │
│  team_b │   carie │      8.0 │
│  team_b │  rachel │      NaN │
└─────────┴─────────┴──────────┘
nanmean(x) = mean(filter(!isnan, x))
nansd(x) = std(filter(!isnan, x))

@linq missing_df_nan |>
    groupby(:team_id) |>
    based_on(
        mean_cohesion = nanmean(:cohesion),
        sd_cohesion = nansd(:cohesion)
    )
| Row | team_id | mean_cohesion | sd_cohesion |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | team_a | 4.5 | 2.12132 |
| 2 | team_b | 7.5 | 0.707107 |
Keep in mind that missingness in julia should be labeled with "missing", not "NaN".
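If a column does arrive with NaN placeholders, Base's replace can relabel them; a minimal sketch:

# relabel NaN as missing so that skipmissing() behaves as intended
missing_df_nan[!, :cohesion] = replace(missing_df_nan.cohesion, NaN => missing)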
Take data stored in a separate data set ("team_cohesion") and merge it with the larger, full data ("full_data").
A left join retains only the rows in the first df. For example, if "team_cohesion" contained a team 4 that was absent from "full_data", that team would not appear in the join shown here.
The key is specified with the "on" argument.
full_data = DataFrame(
    team_id = ["team_1", "team_2", "team_3", "team_1", "team_2", "team_3"],
    person_id = [1,2,3,4,5,6],
    openness = [6,7,8,5,6,7]
)
@pt full_data
┌─────────┬───────────┬──────────┐
│ team_id │ person_id │ openness │
│  String │     Int64 │    Int64 │
├─────────┼───────────┼──────────┤
│  team_1 │         1 │        6 │
│  team_2 │         2 │        7 │
│  team_3 │         3 │        8 │
│  team_1 │         4 │        5 │
│  team_2 │         5 │        6 │
│  team_3 │         6 │        7 │
└─────────┴───────────┴──────────┘
team_cohesion = DataFrame(
    team_id = ["team_1", "team_2", "team_3"],
    team_cohesion = [30,20,10]
)
@pt team_cohesion
┌─────────┬───────────────┐
│ team_id │ team_cohesion │
│  String │         Int64 │
├─────────┼───────────────┤
│  team_1 │            30 │
│  team_2 │            20 │
│  team_3 │            10 │
└─────────┴───────────────┘
full_df = join(full_data, team_cohesion, on = :team_id, kind = :left)
@pt full_df
┌─────────┬───────────┬──────────┬───────────────────────┐
│ team_id │ person_id │ openness │         team_cohesion │
│  String │     Int64 │    Int64 │ Union{Missing, Int64} │
├─────────┼───────────┼──────────┼───────────────────────┤
│  team_1 │         1 │        6 │                    30 │
│  team_2 │         2 │        7 │                    20 │
│  team_3 │         3 │        8 │                    10 │
│  team_1 │         4 │        5 │                    30 │
│  team_2 │         5 │        6 │                    20 │
│  team_3 │         6 │        7 │                    10 │
└─────────┴───────────┴──────────┴───────────────────────┘
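A hedged note: newer DataFrames.jl versions split join into dedicated verbs, so the same left join would read as below.

# newer DataFrames.jl API for kind = :left
full_df = leftjoin(full_data, team_cohesion, on = :team_id)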
Take data in long form and make it wide. Doing so in julia is more cumbersome than the reshape command in R, especially if the goal of creating wide data is to run structural equation models.
# a long df
long_df = DataFrame(
    person_id = [1,1,1, 2,2,2, 3,3,3, 4,4,4],
    time = [1,2,3, 1,2,3, 1,2,3, 1,2,3],
    affect = [2,3,1, 2,2,2, 1,3,1, 1,2,3],
    esteem = [5,6,7, 6,6,6, 3,4,5, 4,8,9],
    stable_measure = [5,5,5, 7,7,7, 3,3,3, 2,2,2]
)
@pt long_df
┌───────────┬───────┬────────┬────────┬────────────────┐
│ person_id │  time │ affect │ esteem │ stable_measure │
│     Int64 │ Int64 │  Int64 │  Int64 │          Int64 │
├───────────┼───────┼────────┼────────┼────────────────┤
│         1 │     1 │      2 │      5 │              5 │
│         1 │     2 │      3 │      6 │              5 │
│         1 │     3 │      1 │      7 │              5 │
│         2 │     1 │      2 │      6 │              7 │
│         2 │     2 │      2 │      6 │              7 │
│         2 │     3 │      2 │      6 │              7 │
│         3 │     1 │      1 │      3 │              3 │
│         3 │     2 │      3 │      4 │              3 │
│         3 │     3 │      1 │      5 │              3 │
│         4 │     1 │      1 │      4 │              2 │
│         4 │     2 │      2 │      8 │              2 │
│         4 │     3 │      3 │      9 │              2 │
└───────────┴───────┴────────┴────────┴────────────────┘
# create wide data with esteem as the "widen" variable
print(
    unstack(long_df, :person_id, :time, :esteem)
)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 5      │ 6      │ 7      │
│ 2   │ 2         │ 6      │ 6      │ 6      │
│ 3   │ 3         │ 3      │ 4      │ 5      │
│ 4   │ 4         │ 4      │ 8      │ 9      │
# create wide data with affect as the "widen" variable
print(
    unstack(long_df, :person_id, :time, :affect)
)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 2      │ 3      │ 1      │
│ 2   │ 2         │ 2      │ 2      │ 2      │
│ 3   │ 3         │ 1      │ 3      │ 1      │
│ 4   │ 4         │ 1      │ 2      │ 3      │
So, there are two issues with reshaping in julia. First, I can't make affect and esteem the wide variables simultaneously. Second, the names of the new columns become integers such as 1, 2, and 3, so they need to be relabeled.
Here is how I would prepare the data for SEM.
wide_affect = unstack(long_df, [:person_id, :stable_measure], :time, :affect)
rename!(wide_affect, "1" => "affect1", "2" => "affect2", "3" => "affect3")

wide_esteem = unstack(long_df, [:person_id], :time, :esteem)
rename!(wide_esteem, "1" => "esteem1", "2" => "esteem2", "3" => "esteem3")

wide_df = join(wide_affect, wide_esteem, on = :person_id, kind = :left)
| Row | person_id | stable_measure | affect1 | affect2 | affect3 | esteem1 | esteem2 | esteem3 |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | 1 | 5 | 2 | 3 | 1 | 5 | 6 | 7 |
| 2 | 2 | 7 | 2 | 2 | 2 | 6 | 6 | 6 |
| 3 | 3 | 3 | 1 | 3 | 1 | 3 | 4 | 5 |
| 4 | 4 | 2 | 1 | 2 | 3 | 4 | 8 | 9 |
Disclaimer. I'm not saying that julia is incapable of unstacking multiple variables or that there is no workaround (I just demonstrated one). I'm simply saying that the reshape command in R is currently more user-friendly for SEM.
What about wide to long?
long_affect = stack(wide_affect, [:affect1, :affect2, :affect3])
| Row | variable | value | person_id | stable_measure |
|---|---|---|---|---|
| | Symbol | Int64⍰ | Int64 | Int64 |
| 1 | affect1 | 2 | 1 | 5 |
| 2 | affect1 | 2 | 2 | 7 |
| 3 | affect1 | 1 | 3 | 3 |
| 4 | affect1 | 1 | 4 | 2 |
| 5 | affect2 | 3 | 1 | 5 |
| 6 | affect2 | 2 | 2 | 7 |
| 7 | affect2 | 3 | 3 | 3 |
| 8 | affect2 | 2 | 4 | 2 |
| 9 | affect3 | 1 | 1 | 5 |
| 10 | affect3 | 2 | 2 | 7 |
| 11 | affect3 | 1 | 3 | 3 |
| 12 | affect3 | 3 | 4 | 2 |
Identify the working directory and adjust it.
# current directory
pwd()

# change directory (insert your own path)
cd("")

# change directory to wherever the opened file is located
cd(dirname(@__FILE__))

# move up a level and into the data folder
cd("../data")

# if you know the file name but not the directory
filename = "whatever_name.csv"
filepath = joinpath(@__DIR__, filename)
println(filepath)
Working with csv files, either with CSV.jl or CSVFiles.jl.
# read csv
df = CSV.File(file) |> DataFrame!

# if the file contains data with no header
CSV.File(file; header = false)

# or to incorporate your own column names
CSV.File(file; header = ["col1", "col2", "col3"])

# if the csv has spaces in the header names
# e.g., column names like "column one" will become "column_one"
CSV.File(file; normalizenames = true)

# if the csv has meta data within the first few rows that needs to be skipped
CSV.File(file; datarow = 4)

# if something like NA or 999 indicates missingness in the csv
CSV.File(file; missingstrings = ["999", "NA"])

# identify column types
CSV.File(file; types = [Int, Int, String])

# the CSV package has many of the same commands
## append two csvs
df = CSV.read(file)
df = CSV.read(file2, df; append = true)

# write csvs
## specify that missing values should be written as "NA"
CSV.write("out.csv", df; missingstring = "NA")
Identify all files within a folder.
# files in the current directory
readdir()

# files within a given directory
readdir(path)
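To keep only certain file types, Base's filter and endswith combine nicely; a small sketch:

# list only the csv files in the current directory
csv_files = filter(f -> endswith(f, ".csv"), readdir())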
Run R code within julia. I don't evaluate these commands to save space.
using RCall

# run R code
R"""
library(ggplot2)
ggplot($df, aes(x = time, y = y)) +
    geom_point()

library(actuar)
total <- 4 + 9
power_law <- rpareto(1, 2, 3)
"""

# pull objects from the r environment into the julia environment
@rget total
@rget power_law

# put objects from the julia environment into the r environment
jul_variable = 4
@rput jul_variable
R"""
new_in_r <- jul_variable + 10
"""
@rget new_in_r
Add packages. In the terminal, press ] and then type add package_name. Do not include .jl after the package name.
In a script, do below.
using Pkg
Pkg.add("RCall")

# from github
Pkg.add(PackageSpec(url = "https://github.com/JuliaDiffEq/ParameterizedFunctions.jl"))
Create vectors and arrays.
# vector of length n with all zeros
n = 10
result_store = zeros(n)

# undefined vector of length n
Vector(undef, n)

# array from 0 to 100 in steps of 10
print(
    collect(0:10:100)
)
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# array from 4 to 1 decreasing by 1
print(
    collect(4:-1:1)
)
[4, 3, 2, 1]
# array from 1 to 100 with 12 steps
print(
    collect(range(1, stop = 100, length = 12))
)
[1.0, 10.0, 19.0, 28.0, 37.0, 46.0, 55.0, 64.0, 73.0, 82.0, 91.0, 100.0]
# start at 1, add 3, do that for 20 steps
print(
    range(1, step = 3, length = 20) |> collect
)
[1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58]
Count the number of "trues" in a vector. For this example, refer back to the weather data frame.
@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘
# how many times is it sunny?
days_sunny = count(i == "sunny" for i in weather_df.weather)
print(days_sunny)
3
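Because the comparison broadcasts to a vector of trues and falses, summing that vector gives the same count:

# sum() of a boolean vector counts the trues
days_sunny2 = sum(weather_df.weather .== "sunny")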
Create distributions and pull samples.
# set seed
Random.seed!(4)

# draw 100 random samples from a normal distribution with mean 1 and sd 2
draw_values = rand(Normal(1, 2), 100)

# draw a single sample but retain it as a scalar rather than an array
draw_scalar = rand(Normal(1, 2), 1)[1]

# estimate the parameters of the empirical distribution draw_values
fit(Normal, draw_values)

# the same pattern works for any distribution
# (see the "continuous" section of the Distributions.jl documentation)
## abstract_distribution = Normal(parameter1, parameter2)
## x = rand(abstract_distribution, 100)

# multivariate normal with 2 variables, draw 5 values
mu = [10.0; 20.1]
covariance_matrix = [2.0 1.2; 1.2 3.5]
d = MvNormal(mu, covariance_matrix)
x = rand(d, 5)

# random number from a uniform between 0 and 1
rand()

# 5 random numbers drawn from a uniform between 0 and 1
rand(5)

# matrix of random numbers, 4 by 3
rand(4, 3)

# random number from Uniform[0, 100]
rand() * 100

# randomly choose an index within a vector
rand(1:10)

# 2 random draws from the integers 3 through 9
rand(3:9, 2)
2-element Array{Int64,1}:
 3
 3
Read a text file.
datafilename = "data.txt"
datafile = open(datafilename)
data = readlines(datafile)
close(datafile)
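A hedged alternative that closes the file automatically, using Base's function-first open method:

# open applies readlines to the file handle and closes it afterward
data = open(readlines, datafilename)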