Julia Basics For Data Wrangling and Computational Modeling

Christopher R. Dishop

Common commands that I use for data analysis and simulation.

First, libraries and a data frame.

using DataFrames, DataFramesMeta, Query, Statistics, CSV, CSVFiles, Distributions, Random, PrettyTables

df = DataFrame(
  name = ["josh", "adam", "jenny", "rory", "thomas", "bill", "sarah", "allie", "shannon", "cloe"],
  id = collect(1:10),
  satisfaction = [2, 4, 5, 1, 3, 3, 3, 2, 4, 5],
  need_achievement = [8, 3, 4, 7, 2, 9, 10, 5, 4, 6],
  fav_color = ["red", "red", "green", "orange", "blue", "yellow", "purple", "orange", "silver", "blue"],
  number_of_kids = [2, 3, 0, 0, 0, 0, 2, 1, 1, 0],
  instrument = ["piano", "guitar", "none", "piano", "saxophone", "none", "none", "none", "none", "clarinet"],
  team_id = vcat(
                repeat(["awesome"], 5),
                repeat(["best"], 5)
                ),
  dyad_id = vcat(
                repeat(["v"], 2),
                repeat(["w"], 2),
                repeat(["x"], 2),
                repeat(["y"], 2),
                repeat(["z"], 2)
                ),
  dyad_trust =   [2, 2,
                  5, 5,
                  7, 7,
                  8, 8,
                  3, 3],
  team_performance = [7, 7, 7, 7, 7,
                        4, 4, 4, 4, 4],
  nation = repeat(["united_states"], 10),
  common_value = repeat([3], 10)

)

10 rows × 13 columns

nameidsatisfactionneed_achievementfav_colornumber_of_kidsinstrumentteam_iddyad_iddyad_trustteam_performancenationcommon_value
StringInt64Int64Int64StringInt64StringStringStringInt64Int64StringInt64
1josh128red2pianoawesomev27united_states3
2adam243red3guitarawesomev27united_states3
3jenny354green0noneawesomew57united_states3
4rory417orange0pianoawesomew57united_states3
5thomas532blue0saxophoneawesomex77united_states3
6bill639yellow0nonebestx74united_states3
7sarah7310purple2nonebesty84united_states3
8allie825orange1nonebesty84united_states3
9shannon944silver1nonebestz34united_states3
10cloe1056blue0clarinetbestz34united_states3

Structure

Identify characteristics about the data.

names(df)
size(df)
describe(df)
first(df)
last(df)
show(df, allrows = true, allcols = true)
10×13 DataFrames.DataFrame
│ Row │ name    │ id    │ satisfaction │ need_achievement │ fav_color │
│     │ String  │ Int64 │ Int64        │ Int64            │ String    │
├─────┼─────────┼───────┼──────────────┼──────────────────┼───────────┤
│ 1   │ josh    │ 1     │ 2            │ 8                │ red       │
│ 2   │ adam    │ 2     │ 4            │ 3                │ red       │
│ 3   │ jenny   │ 3     │ 5            │ 4                │ green     │
│ 4   │ rory    │ 4     │ 1            │ 7                │ orange    │
│ 5   │ thomas  │ 5     │ 3            │ 2                │ blue      │
│ 6   │ bill    │ 6     │ 3            │ 9                │ yellow    │
│ 7   │ sarah   │ 7     │ 3            │ 10               │ purple    │
│ 8   │ allie   │ 8     │ 2            │ 5                │ orange    │
│ 9   │ shannon │ 9     │ 4            │ 4                │ silver    │
│ 10  │ cloe    │ 10    │ 5            │ 6                │ blue      │

│ Row │ number_of_kids │ instrument │ team_id │ dyad_id │ dyad_trust │
│     │ Int64          │ String     │ String  │ String  │ Int64      │
├─────┼────────────────┼────────────┼─────────┼─────────┼────────────┤
│ 1   │ 2              │ piano      │ awesome │ v       │ 2          │
│ 2   │ 3              │ guitar     │ awesome │ v       │ 2          │
│ 3   │ 0              │ none       │ awesome │ w       │ 5          │
│ 4   │ 0              │ piano      │ awesome │ w       │ 5          │
│ 5   │ 0              │ saxophone  │ awesome │ x       │ 7          │
│ 6   │ 0              │ none       │ best    │ x       │ 7          │
│ 7   │ 2              │ none       │ best    │ y       │ 8          │
│ 8   │ 1              │ none       │ best    │ y       │ 8          │
│ 9   │ 1              │ none       │ best    │ z       │ 3          │
│ 10  │ 0              │ clarinet   │ best    │ z       │ 3          │

│ Row │ team_performance │ nation        │ common_value │
│     │ Int64            │ String        │ Int64        │
├─────┼──────────────────┼───────────────┼──────────────┤
│ 1   │ 7                │ united_states │ 3            │
│ 2   │ 7                │ united_states │ 3            │
│ 3   │ 7                │ united_states │ 3            │
│ 4   │ 7                │ united_states │ 3            │
│ 5   │ 7                │ united_states │ 3            │
│ 6   │ 4                │ united_states │ 3            │
│ 7   │ 4                │ united_states │ 3            │
│ 8   │ 4                │ united_states │ 3            │
│ 9   │ 4                │ united_states │ 3            │
│ 10  │ 4                │ united_states │ 3            │

Columns

Create new columns, one documenting that everyone speaks english and another representing each individual's favorite season.

# with base
df[!, :language] = repeat(["english"], 10)

# with data frames meta
df = @linq df |>
    transform(fav_season = vcat(
                                repeat(["summer"], 5),
                                repeat(["winter"], 5)
    ))


@pt df[:, [:team_performance, :language, :fav_season]]
┌──────────────────┬──────────┬────────────┐
│ team_performance │ language │ fav_season │
│            Int64 │   String │     String │
├──────────────────┼──────────┼────────────┤
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
└──────────────────┴──────────┴────────────┘

Rename columns. Change "satisfaction" to "job_sat" and "nation" to "country."

rename!(df, :satisfaction => :job_sat)

df = df |> @rename(:nation => :country) |> DataFrame

@pt df[:, [:job_sat, :country]]
┌─────────┬───────────────┐
│ job_sat │       country │
│   Int64 │        String │
├─────────┼───────────────┤
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
│       1 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
└─────────┴───────────────┘

Reorder columns. Not easy to do unless you want to re-type every column name. Waiting for something like "df = df[, [:team.id, :country, rest.of.columns()]" but that doesn't exist yet.

Pull out a single column, in one of many ways. The last implementation also demonstrates how to select columns.

id_vec1 = df.id

id_vec2 = df[:, :id]

id_vec3 = @linq df |>
        select(:id)
id_vec3 = Vector(id_vec3.id)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

Randomly select columns.

num_cols = 4
possible_values = collect(1:1:10)

use_cols = sample(possible_values, num_cols, replace = false)

rand_df = @linq df |>
        select(use_cols)

@pt rand_df
┌─────────┬──────────────────┬─────────┬─────────┐
│ team_id │ need_achievement │    name │ job_sat │
│  String │            Int64 │  String │   Int64 │
├─────────┼──────────────────┼─────────┼─────────┤
│ awesome │                8 │    josh │       2 │
│ awesome │                3 │    adam │       4 │
│ awesome │                4 │   jenny │       5 │
│ awesome │                7 │    rory │       1 │
│ awesome │                2 │  thomas │       3 │
│    best │                9 │    bill │       3 │
│    best │               10 │   sarah │       3 │
│    best │                5 │   allie │       2 │
│    best │                4 │ shannon │       4 │
│    best │                6 │    cloe │       5 │
└─────────┴──────────────────┴─────────┴─────────┘

Create new columns. The first standardizes the job satisfaction scores, whereas the second adds the trust and performance columns together to create a new variable.

mean_jobsat = 2.5

df = @linq df |>
    transform(job_sat_standardized = :job_sat .- mean_jobsat)

df = @linq df |>
    transform(trust_plus_performance = :dyad_trust .+ :team_performance)


@pt df[:, [:dyad_trust, :team_performance, :trust_plus_performance]]
┌────────────┬──────────────────┬────────────────────────┐
│ dyad_trust │ team_performance │ trust_plus_performance │
│      Int64 │            Int64 │                  Int64 │
├────────────┼──────────────────┼────────────────────────┤
│          2 │                7 │                      9 │
│          2 │                7 │                      9 │
│          5 │                7 │                     12 │
│          5 │                7 │                     12 │
│          7 │                7 │                     14 │
│          7 │                4 │                     11 │
│          8 │                4 │                     12 │
│          8 │                4 │                     12 │
│          3 │                4 │                      7 │
│          3 │                4 │                      7 │
└────────────┴──────────────────┴────────────────────────┘

Sort the data based on the values within a given column.

df = @linq df |>
    orderby(:number_of_kids, :fav_color)

@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     5 │ awesome │       3 │              0 │      blue │
│    10 │    best │       5 │              0 │      blue │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     6 │    best │       3 │              0 │    yellow │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│     7 │    best │       3 │              2 │    purple │
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
└───────┴─────────┴─────────┴────────────────┴───────────┘
sort!(df, [order(:id), order(:team_id)])

@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     5 │ awesome │       3 │              0 │      blue │
│     6 │    best │       3 │              0 │    yellow │
│     7 │    best │       3 │              2 │    purple │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│    10 │    best │       5 │              0 │      blue │
└───────┴─────────┴─────────┴────────────────┴───────────┘

Rows

Filter the data according to values within rows.

# simple row commands

@linq df |>
    where(:job_sat .== 2)

@linq df |>
    where(:fav_color .== "red", :team_id .== "awesome")

# specify which columns to keep

df[df.job_sat .> 3, [:name, :id, :job_sat, :need_achievement]]

@linq df |>
    where(:job_sat .> 3) |>
    select(:name, :id, :job_sat, :need_achievement)

# multiple filtering commands

df[(df.job_sat .> 3) .& (df.need_achievement .> 4), :]

@linq df |>
    where(:job_sat .> 3, :need_achievement .> 4)

1 rows × 17 columns

nameidjob_satneed_achievementfav_colornumber_of_kidsinstrumentteam_iddyad_iddyad_trustteam_performancecountrycommon_valuelanguagefav_seasonjob_sat_standardizedtrust_plus_performance
StringInt64Int64Int64StringInt64StringStringStringInt64Int64StringInt64StringStringFloat64Int64
1cloe1056blue0clarinetbestz34united_states3englishwinter2.57

Obtain a subset of rows. Similar to R's "%in%" function.

use_people = ["josh", "rory"]
small_df = df

# using DataFramesMeta
@linq df |>
    where(:name .∈ Ref(use_people))

# using Base
small_df[findall(in(use_people), small_df[:name]), :]

# a third option...
filter(small_df -> any(occursin.(use_people, small_df.name)), small_df)

2 rows × 17 columns

nameidjob_satneed_achievementfav_colornumber_of_kidsinstrumentteam_iddyad_iddyad_trustteam_performancecountrycommon_valuelanguagefav_seasonjob_sat_standardizedtrust_plus_performance
StringInt64Int64Int64StringInt64StringStringStringInt64Int64StringInt64StringStringFloat64Int64
1josh128red2pianoawesomev27united_states3englishsummer-0.59
2rory417orange0pianoawesomew57united_states3englishsummer-1.512

Identify a row and column value, use it in a computation, then replace the original value with the new computed value.

This seems archaic for a basic cheat sheet but I use this kind of script in almost every computational model that I create.

# filter to a row

person_i_need = "josh"

row_i_need = @linq df |>
            where(:name .== person_i_need)

# get josh's value of job satisfaction

value_i_need = Vector(row_i_need.job_sat)[1]

### don't do this
### Vector(row_i_need.job_sat) - doing so will give you a vector

# do something with his job sat value

new_js = value_i_need + 10 * 4

# put the new value back into the data

df[ (df.name .== person_i_need) .& (df.team_id .== "awesome"), :job_sat] = new_js

@pt df[:, [:name, :job_sat]]
┌─────────┬─────────┐
│    name │ job_sat │
│  String │   Int64 │
├─────────┼─────────┤
│    josh │      42 │
│    adam │       4 │
│   jenny │       5 │
│    rory │       1 │
│  thomas │       3 │
│    bill │       3 │
│   sarah │       3 │
│   allie │       2 │
│ shannon │       4 │
│    cloe │       5 │
└─────────┴─────────┘

Re-lable rows or create cutoffs. Similar to R's case_when() function.

First example. Turn time points such as 1, 2, and 3 into character-labeled time points such as "affect1", "affect2", and "affect3." This seems odd but it's necessary for creating wide data later on.

# data frame

affect_df = DataFrame(
    person = [1,1,1,2,2,2,3,3,3],
    time = [1,2,3,1,2,3,1,2,3],
    affect = [3,4,2,5,3,2,4,3,2]
)

@pt affect_df
┌────────┬───────┬────────┐
│ person │  time │ affect │
│  Int64 │ Int64 │  Int64 │
├────────┼───────┼────────┤
│      1 │     1 │      3 │
│      1 │     2 │      4 │
│      1 │     3 │      2 │
│      2 │     1 │      5 │
│      2 │     2 │      3 │
│      2 │     3 │      2 │
│      3 │     1 │      4 │
│      3 │     2 │      3 │
│      3 │     3 │      2 │
└────────┴───────┴────────┘
# function specifying how the labels should change

function convert_rows(val)
    if val == 1
        result = "affect1"
    elseif val == 2
        result = "affect2"
    else
        result = "affect3"
    return(result)
end
end



# use that function within a transform call to create a new column
# use f.() syntax to vectorize

affect_df = @linq affect_df |>
            transform(label_time = convert_rows.(:time))


@pt affect_df
┌────────┬───────┬────────┬────────────┐
│ person │  time │ affect │ label_time │
│  Int64 │ Int64 │  Int64 │     String │
├────────┼───────┼────────┼────────────┤
│      1 │     1 │      3 │    affect1 │
│      1 │     2 │      4 │    affect2 │
│      1 │     3 │      2 │    affect3 │
│      2 │     1 │      5 │    affect1 │
│      2 │     2 │      3 │    affect2 │
│      2 │     3 │      2 │    affect3 │
│      3 │     1 │      4 │    affect1 │
│      3 │     2 │      3 │    affect2 │
│      3 │     3 │      2 │    affect3 │
└────────┴───────┴────────┴────────────┘

Second example. Change character values into different character values.

# data

labels_df = DataFrame(
    labels = ["dog", "shoe", "barn", "door", "cash"]
)

# function specifying how the values should change

function convert_labels(label)
    if label == "dog"
        result = "blue"
    elseif label == "shoe"
        result = "green"
    elseif label == "barn"
        result = "orange"
    elseif label == "door"
        result = "black"
    elseif label == "cash"
        result = "yellow"
    return(result)
end
end

# transform a column and use f.() syntax

labels_df = @linq labels_df |>
        transform(color_labels = convert_labels.(:labels))


@pt labels_df
┌────────┬──────────────┐
│ labels │ color_labels │
│ String │       String │
├────────┼──────────────┤
│    dog │         blue │
│   shoe │        green │
│   barn │       orange │
│   door │        black │
│   cash │       yellow │
└────────┴──────────────┘

Third example. Create cutoffs with boolean logic.

# data

intermediates_df = DataFrame(
    person = [1,2,3,4,5,6,7,8],
    score = [1,2,3,7,8,9,20,30]
)

# function specifying the cutoffs

function create_limits(x)
    if x < 5
        result = "low"
    elseif 5 < x < 10
        result = "medium"
    elseif x > 10
        result = "high"
    return(result)
end
end

# transform a column and use f.() syntax

intermediates_df = @linq intermediates_df |>
            transform(cutoffs = create_limits.(:score))


@pt intermediates_df
┌────────┬───────┬─────────┐
│ person │ score │ cutoffs │
│  Int64 │ Int64 │  String │
├────────┼───────┼─────────┤
│      1 │     1 │     low │
│      2 │     2 │     low │
│      3 │     3 │     low │
│      4 │     7 │  medium │
│      5 │     8 │  medium │
│      6 │     9 │  medium │
│      7 │    20 │    high │
│      8 │    30 │    high │
└────────┴───────┴─────────┘

Fourth example: logic using multiple columns.

# data

weather_df = DataFrame(
    sun = [3,2,1,8,9,7],
    clouds = [8,5,7,2,1,3],
    day = [1,2,3,4,5,6]
)

# function specifying logic

function determine_weather(col1, col2)
    if col1 < col2
        result = "cloudy"
    elseif col1 > col2
        result = "sunny"
    elseif col1 == col2
        result = "hazzy"
    end
end

# transform the data using a create new column command and f.() syntax

weather_df = @linq weather_df |>
    transform(weather = determine_weather.(:sun, :clouds))

@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘

Initialize Storage Vectors

Using the "zeros" command seems to be the safest approach in Julia. I haven't yet learned how to use "Vector{Int}(undef, 10)" commands.

periods = 10
scores = zeros(periods)
dm = DataFrame(
    performance = zeros(periods),
    satisfaction = zeros(periods)
)

10 rows × 2 columns

performancesatisfaction
Float64Float64
10.00.0
20.00.0
30.00.0
40.00.0
50.00.0
60.00.0
70.00.0
80.00.0
90.00.0
100.00.0

Build Data Frame Iteratively

time = 20


# example 1: append an entire data frame to an empty df

dfzz = DataFrame(
    group_p = Float64[],
    performance = Float64[]
)



for i in 1:time

    append!(dfzz, DataFrame(
    group_p = i + 3,
    performance = i - 3
    ))

end

@pt dfzz
┌─────────┬─────────────┐
│ group_p │ performance │
│ Float64 │     Float64 │
├─────────┼─────────────┤
│     4.0 │        -2.0 │
│     5.0 │        -1.0 │
│     6.0 │         0.0 │
│     7.0 │         1.0 │
│     8.0 │         2.0 │
│     9.0 │         3.0 │
│    10.0 │         4.0 │
│    11.0 │         5.0 │
│    12.0 │         6.0 │
│    13.0 │         7.0 │
│    14.0 │         8.0 │
│    15.0 │         9.0 │
│    16.0 │        10.0 │
│    17.0 │        11.0 │
│    18.0 │        12.0 │
│    19.0 │        13.0 │
│    20.0 │        14.0 │
│    21.0 │        15.0 │
│    22.0 │        16.0 │
│    23.0 │        17.0 │
└─────────┴─────────────┘
# example 2: add to an initialized df

dfyy = DataFrame(
    group_p = repeat([0], length(1:time)),
    performance = repeat([0], length(1:time))
)



for i in 1:time

    dfyy[i, :group_p] = i + 2
    dfyy[i, :performance] = i - 3

end

@pt dfyy
┌─────────┬─────────────┐
│ group_p │ performance │
│   Int64 │       Int64 │
├─────────┼─────────────┤
│       3 │          -2 │
│       4 │          -1 │
│       5 │           0 │
│       6 │           1 │
│       7 │           2 │
│       8 │           3 │
│       9 │           4 │
│      10 │           5 │
│      11 │           6 │
│      12 │           7 │
│      13 │           8 │
│      14 │           9 │
│      15 │          10 │
│      16 │          11 │
│      17 │          12 │
│      18 │          13 │
│      19 │          14 │
│      20 │          15 │
│      21 │          16 │
│      22 │          17 │
└─────────┴─────────────┘

Element-Wise Functions

Use dot notation to apply a function to every value in a vector.

vec1 = [0, 3, 4, 5, 6]

function addtwo(x)
    y = x + 2
    return(y)
end

# don't use addtwo(vec1)
# use addtwo.(vec1)
addtwo.(vec1)
5-element Array{Int64,1}:
 2
 5
 6
 7
 8

Summary Calculations

Calculate column means.

# mean of single column

mean(df.need_achievement)

# within each team, calculate the mean job satisfaction

@linq df |>
    groupby(:team_id) |>
    based_on(mean_jobsat = mean(:job_sat))

# within each dyad, calculate the mean and sd need for achievement

@linq df |>
    groupby(:dyad_id) |>
    based_on(
        mean_need_achievement = mean(:need_achievement),
        sd_need_achievement = std(:need_achievement)
    )

5 rows × 3 columns

dyad_idmean_need_achievementsd_need_achievement
StringFloat64Float64
1v5.53.53553
2w5.52.12132
3x5.54.94975
4y7.53.53553
5z5.01.41421

What if there is missing data?

missing_df = DataFrame(
    team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
    name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
    cohesion = [3,6,missing,7,8,missing]
)

@pt missing_df
┌─────────┬─────────┬───────────────────────┐
│ team_id │    name │              cohesion │
│  String │  String │ Union{Missing, Int64} │
├─────────┼─────────┼───────────────────────┤
│  team_a │     ted │                     3 │
│  team_a │ stewart │                     6 │
│  team_a │  jessie │               missing │
│  team_b │   susan │                     7 │
│  team_b │   carie │                     8 │
│  team_b │  rachel │               missing │
└─────────┴─────────┴───────────────────────┘
@linq missing_df |>
    groupby(:team_id) |>
    based_on(
        mean_cohesion = mean(skipmissing(:cohesion)),
        sd_cohesion = std(skipmissing(:cohesion))
    )

2 rows × 3 columns

team_idmean_cohesionsd_cohesion
StringFloat64Float64
1team_a4.52.12132
2team_b7.50.707107

What if the missing data is labeled as NaN?

missing_df_nan = DataFrame(
        team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
        name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
        cohesion = [3,6,NaN,7,8,NaN]
    )

@pt missing_df_nan
┌─────────┬─────────┬──────────┐
│ team_id │    name │ cohesion │
│  String │  String │  Float64 │
├─────────┼─────────┼──────────┤
│  team_a │     ted │      3.0 │
│  team_a │ stewart │      6.0 │
│  team_a │  jessie │      NaN │
│  team_b │   susan │      7.0 │
│  team_b │   carie │      8.0 │
│  team_b │  rachel │      NaN │
└─────────┴─────────┴──────────┘
nanmean(x) = mean(filter(!isnan,x))
nansd(x) = std(filter(!isnan,x))

@linq missing_df_nan |>
        groupby(:team_id) |>
        based_on(
            mean_cohesion = nanmean(:cohesion),
            sd_cohesion = nansd(:cohesion)
        )

2 rows × 3 columns

team_idmean_cohesionsd_cohesion
StringFloat64Float64
1team_a4.52.12132
2team_b7.50.707107

Keep in mind that missingness in julia should be labeled with missing, not NaN.

Merging

Take data stored in a separate data set (team_cohesion) and merge it with the larger, full data (full_data).

Left join retains only rows in the first df. For example, if "team_cohesion" had data on team 4 then it would not be included in the join that I show here.

The key is specified with the on command.

full_data = DataFrame(
    team_id = ["team_1", "team_2", "team_3", "team_1", "team_2", "team_3"],
    person_id = [1,2,3,4,5,6],
    openness = [6,7,8,5,6,7]
)

@pt full_data
┌─────────┬───────────┬──────────┐
│ team_id │ person_id │ openness │
│  String │     Int64 │    Int64 │
├─────────┼───────────┼──────────┤
│  team_1 │         1 │        6 │
│  team_2 │         2 │        7 │
│  team_3 │         3 │        8 │
│  team_1 │         4 │        5 │
│  team_2 │         5 │        6 │
│  team_3 │         6 │        7 │
└─────────┴───────────┴──────────┘
team_cohesion = DataFrame(
    team_id = ["team_1", "team_2", "team_3"],
    team_cohesion = [30,20,10]

)

@pt team_cohesion
┌─────────┬───────────────┐
│ team_id │ team_cohesion │
│  String │         Int64 │
├─────────┼───────────────┤
│  team_1 │            30 │
│  team_2 │            20 │
│  team_3 │            10 │
└─────────┴───────────────┘
full_df = join(full_data, team_cohesion, on = :team_id, kind = :left)

@pt full_df
┌─────────┬───────────┬──────────┬───────────────────────┐
│ team_id │ person_id │ openness │         team_cohesion │
│  String │     Int64 │    Int64 │ Union{Missing, Int64} │
├─────────┼───────────┼──────────┼───────────────────────┤
│  team_1 │         1 │        6 │                    30 │
│  team_2 │         2 │        7 │                    20 │
│  team_3 │         3 │        8 │                    10 │
│  team_1 │         4 │        5 │                    30 │
│  team_2 │         5 │        6 │                    20 │
│  team_3 │         6 │        7 │                    10 │
└─────────┴───────────┴──────────┴───────────────────────┘

Wide & Long

Take data in long form and make it wide. Doing so in julia is more cumbersome than the reshape command in R, especially if the goal of creating wide data is to run structural equations models.

# a long df


long_df = DataFrame(
    person_id = [1,1,1,
                 2,2,2,
                 3,3,3,
                 4,4,4],
    time = [1,2,3,
            1,2,3,
            1,2,3,
            1,2,3],
    affect = [2,3,1,
              2,2,2,
              1,3,1,
              1,2,3],
    esteem = [5,6,7,
              6,6,6,
              3,4,5,
              4,8,9],
    stable_measure = [5,5,5,
                      7,7,7,
                      3,3,3,
                      2,2,2]
)

@pt long_df
┌───────────┬───────┬────────┬────────┬────────────────┐
│ person_id │  time │ affect │ esteem │ stable_measure │
│     Int64 │ Int64 │  Int64 │  Int64 │          Int64 │
├───────────┼───────┼────────┼────────┼────────────────┤
│         1 │     1 │      2 │      5 │              5 │
│         1 │     2 │      3 │      6 │              5 │
│         1 │     3 │      1 │      7 │              5 │
│         2 │     1 │      2 │      6 │              7 │
│         2 │     2 │      2 │      6 │              7 │
│         2 │     3 │      2 │      6 │              7 │
│         3 │     1 │      1 │      3 │              3 │
│         3 │     2 │      3 │      4 │              3 │
│         3 │     3 │      1 │      5 │              3 │
│         4 │     1 │      1 │      4 │              2 │
│         4 │     2 │      2 │      8 │              2 │
│         4 │     3 │      3 │      9 │              2 │
└───────────┴───────┴────────┴────────┴────────────────┘
# create wide data with esteem as the "widen" variable

print(

unstack(long_df, :person_id, :time, :esteem)

)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 5      │ 6      │ 7      │
│ 2   │ 2         │ 6      │ 6      │ 6      │
│ 3   │ 3         │ 3      │ 4      │ 5      │
│ 4   │ 4         │ 4      │ 8      │ 9      │
# create wide data with affect as the "widen" variable

print(

unstack(long_df, :person_id, :time, :affect)

)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 2      │ 3      │ 1      │
│ 2   │ 2         │ 2      │ 2      │ 2      │
│ 3   │ 3         │ 1      │ 3      │ 1      │
│ 4   │ 4         │ 1      │ 2      │ 3      │

So, there are two issues with reshaping in julia. First, I can't make affect and esteem the wide variables simultaneously.

Second, the names of the columns become integers such as 1, 2, 3 and so they need to be relabeled.

Here is how I would prepare the data for SEM.

wide_affect = unstack(long_df, [:person_id, :stable_measure], :time, :affect)
rename!(wide_affect, "1" => "affect1", "2" => "affect2", "3" => "affect3")

wide_esteem = unstack(long_df, [:person_id], :time, :esteem)
rename!(wide_esteem, "1" => "esteem1", "2" => "esteem2", "3" => "esteem3")

wide_df = join(wide_affect, wide_esteem, on = :person_id, kind = :left)

4 rows × 8 columns

person_idstable_measureaffect1affect2affect3esteem1esteem2esteem3
Int64Int64Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰Int64⍰
115231567
227222666
333131345
442123489

Disclaimer. I'm not saying that julia is incapable of unstacking multiple variables or that there is not a workaround (I just demonstrated one).

I'm simply saying that the reshape command in R is currently more user-friendly for SEM.

What about wide to long?

long_affect = stack(wide_affect, [:affect1, :affect2, :affect3])

12 rows × 4 columns

variablevalueperson_idstable_measure
SymbolInt64⍰Int64Int64
1affect1215
2affect1227
3affect1133
4affect1142
5affect2315
6affect2227
7affect2333
8affect2242
9affect3115
10affect3227
11affect3133
12affect3342

Input & Output

Identify directory and adjust.

# current directory

pwd()

# change directory

cd("")

# change directory to wherever the opened file is located

cd(dirname(@__FILE__))

# move up a level

cd("../data")

# if you know the file name but not the directory

filename = "whatever_name.csv"
filepath = joinpath(@__DIR__, filename)
println(filepath)

Working with csv files, either with CSV.jl or CSVFiles.jl.

# read csv

df = CSV.File(file) |> DataFrame!

# if the file contains data with no header

CSV.File(file; header = false)

# or to incorporate your own column names

CSV.File(file; header = ["col1", "col2", "col3"])

# if the csv has spaces in the header names
# e.g., column names like "column one" will become "column_one"

CSV.File(file; normalizenames=true)

# if the csv has meta data within the first few rows that needs to be skipped

CSV.File(file; datarow = 4)

# if something like NA or 999 indicate missingness in csv

CSV.File(file; missingstrings=["999", "NA"])

# identify column types

CSV.File(file; types =[Int, Int, String])


# the csv package has many of the same commands
## append two csvs

df = CSV.read(file)
df = CSV.read(file2, df; apppend=true)


# write csvs
## specify that missing values should be "NA"

CSV.write("out.csv", df; null = "NA")

Identify all files within a folder.

readdir([path])

RCall

Run R code within julia. I don't evaluate these commands to save space.

# run R code

R"""

library(ggplot2)
ggplot($df, aes(x = time, y = y)) +
	geom_point()


library(actuar)
total <- 4 + 9
power_law <- rpareto(1,2,3

"""

# pull objects from r environment to julia environment

@rget total
@rget power_law

# put objects from julia environment into r environment

jul_variable = 4

@rput jul_variables

R"""
new_in_r <- jul_variable + 10
"""

@rget new_in_r

Packages

Add packages. In terminal, press ] and then type add package_name. Do not include .jl after the package name.

In a script, do below.

Pkg.Add("RCall")

# from github

Pkg.cone("https://github.com/JuliaDiffEq/ParameterizedFunctions.jl")

Vectors & Sampling

Create vectors and arrays.

# vector of length n with all zeros

n=10
result_store = zeros(n)

# undefined vector of length n

Vector(undef, n)

# array from 0 to 100 in steps of 10

print(
collect(0:10:100)
)
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# array from 4 to 1 decreasing by 1

print(
collect(4:-1:1)
)
[4, 3, 2, 1]
# array from 1 to 100 with 12 steps

print(
collect(range(1, stop = 100, length = 12))
)
[1.0, 10.0, 19.0, 28.0, 37.0, 46.0, 55.0, 64.0, 73.0, 82.0, 91.0, 100.0]
# start at 1, add 3, do that 20 times or for 20 steps

print(
range(1, step = 3, length = 20) |> collect
)
[1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 5
8]

Count the number of "trues" in a vector. For this example, refer back to the weather data frame.

@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘
# how many times is it sunny?

days_sunny = count(i == "sunny" for i in weather_df.weather)

print(days_sunny)
3

Sampling and Distributions

Create distributions and pull samples.

# set seed

Random.seed!(4)

# draw 100 random samples from a normal distribution with mean 1 and sd 2

draw_values = rand(Normal(1,2), 100)

# draw a single sample but retain it as an integer rather than an array

draw_integer = rand(Normal(1,2), 1)[1]

# estimate parameters of empirical distribution draw_values

fit(Normal, draw_values)

# for any distribution (see distributions "continuous" documentation

## abstract_distribution = Normal(parameter1, parameter2)
## x = rand(abstract_distribution, 100)


# multivariate normal with 2 variables, draw 5 values

mu = [10.0; 20.1]
covariance_matrix = [2.0 1.2; 1.2 3.5]
d = MvNormal(mu, covariance_matrix)
x = rand(d, 5)


# random number from uniform between 0 and 1

rand()

# 5 random numbers drawn from a uniform betwen 0 and 1

rand(5)

# matrix of random numbers, 4 by 3

rand(4,3)

# random numbers from Uniform[0,100]

rand()*100


# randomly choose an index within a vector

rand(1:10)

# 2 random draws from Uniform[3:9]

rand(3:9, 2)
2-element Array{Int64,1}:
 3
 3

Miscillaneous

Read a text file.

datafilename = "data.txt"
datafile = open(datafilename)
data = readlines(datafile)
close(datafile)