Common commands that I use for data analysis and simulation.
First, libraries and a data frame.
using DataFrames, DataFramesMeta, Query, Statistics, CSV, CSVFiles, Distributions, Random, PrettyTables

df = DataFrame(
    name = ["josh", "adam", "jenny", "rory", "thomas", "bill", "sarah", "allie", "shannon", "cloe"],
    id = collect(1:10),
    satisfaction = [2, 4, 5, 1, 3, 3, 3, 2, 4, 5],
    need_achievement = [8, 3, 4, 7, 2, 9, 10, 5, 4, 6],
    fav_color = ["red", "red", "green", "orange", "blue", "yellow", "purple", "orange", "silver", "blue"],
    number_of_kids = [2, 3, 0, 0, 0, 0, 2, 1, 1, 0],
    instrument = ["piano", "guitar", "none", "piano", "saxophone", "none", "none", "none", "none", "clarinet"],
    team_id = vcat(
        repeat(["awesome"], 5),
        repeat(["best"], 5)
    ),
    dyad_id = vcat(
        repeat(["v"], 2),
        repeat(["w"], 2),
        repeat(["x"], 2),
        repeat(["y"], 2),
        repeat(["z"], 2)
    ),
    dyad_trust = [2, 2, 5, 5, 7, 7, 8, 8, 3, 3],
    team_performance = [7, 7, 7, 7, 7, 4, 4, 4, 4, 4],
    nation = repeat(["united_states"], 10),
    common_value = repeat([3], 10)
)
| Row | name | id | satisfaction | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | nation | common_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 |
| 1 | josh | 1 | 2 | 8 | red | 2 | piano | awesome | v | 2 | 7 | united_states | 3 |
| 2 | adam | 2 | 4 | 3 | red | 3 | guitar | awesome | v | 2 | 7 | united_states | 3 |
| 3 | jenny | 3 | 5 | 4 | green | 0 | none | awesome | w | 5 | 7 | united_states | 3 |
| 4 | rory | 4 | 1 | 7 | orange | 0 | piano | awesome | w | 5 | 7 | united_states | 3 |
| 5 | thomas | 5 | 3 | 2 | blue | 0 | saxophone | awesome | x | 7 | 7 | united_states | 3 |
| 6 | bill | 6 | 3 | 9 | yellow | 0 | none | best | x | 7 | 4 | united_states | 3 |
| 7 | sarah | 7 | 3 | 10 | purple | 2 | none | best | y | 8 | 4 | united_states | 3 |
| 8 | allie | 8 | 2 | 5 | orange | 1 | none | best | y | 8 | 4 | united_states | 3 |
| 9 | shannon | 9 | 4 | 4 | silver | 1 | none | best | z | 3 | 4 | united_states | 3 |
| 10 | cloe | 10 | 5 | 6 | blue | 0 | clarinet | best | z | 3 | 4 | united_states | 3 |
Identify characteristics of the data.
names(df)
size(df)
describe(df)
first(df)
last(df)
show(df, allrows = true, allcols = true)
10×13 DataFrames.DataFrame
│ Row │ name    │ id    │ satisfaction │ need_achievement │ fav_color │
│     │ String  │ Int64 │ Int64        │ Int64            │ String    │
├─────┼─────────┼───────┼──────────────┼──────────────────┼───────────┤
│ 1   │ josh    │ 1     │ 2            │ 8                │ red       │
│ 2   │ adam    │ 2     │ 4            │ 3                │ red       │
│ 3   │ jenny   │ 3     │ 5            │ 4                │ green     │
│ 4   │ rory    │ 4     │ 1            │ 7                │ orange    │
│ 5   │ thomas  │ 5     │ 3            │ 2                │ blue      │
│ 6   │ bill    │ 6     │ 3            │ 9                │ yellow    │
│ 7   │ sarah   │ 7     │ 3            │ 10               │ purple    │
│ 8   │ allie   │ 8     │ 2            │ 5                │ orange    │
│ 9   │ shannon │ 9     │ 4            │ 4                │ silver    │
│ 10  │ cloe    │ 10    │ 5            │ 6                │ blue      │

│ Row │ number_of_kids │ instrument │ team_id │ dyad_id │ dyad_trust │
│     │ Int64          │ String     │ String  │ String  │ Int64      │
├─────┼────────────────┼────────────┼─────────┼─────────┼────────────┤
│ 1   │ 2              │ piano      │ awesome │ v       │ 2          │
│ 2   │ 3              │ guitar     │ awesome │ v       │ 2          │
│ 3   │ 0              │ none       │ awesome │ w       │ 5          │
│ 4   │ 0              │ piano      │ awesome │ w       │ 5          │
│ 5   │ 0              │ saxophone  │ awesome │ x       │ 7          │
│ 6   │ 0              │ none       │ best    │ x       │ 7          │
│ 7   │ 2              │ none       │ best    │ y       │ 8          │
│ 8   │ 1              │ none       │ best    │ y       │ 8          │
│ 9   │ 1              │ none       │ best    │ z       │ 3          │
│ 10  │ 0              │ clarinet   │ best    │ z       │ 3          │

│ Row │ team_performance │ nation        │ common_value │
│     │ Int64            │ String        │ Int64        │
├─────┼──────────────────┼───────────────┼──────────────┤
│ 1   │ 7                │ united_states │ 3            │
│ 2   │ 7                │ united_states │ 3            │
│ 3   │ 7                │ united_states │ 3            │
│ 4   │ 7                │ united_states │ 3            │
│ 5   │ 7                │ united_states │ 3            │
│ 6   │ 4                │ united_states │ 3            │
│ 7   │ 4                │ united_states │ 3            │
│ 8   │ 4                │ united_states │ 3            │
│ 9   │ 4                │ united_states │ 3            │
│ 10  │ 4                │ united_states │ 3            │
Create new columns, one documenting that everyone speaks english and another representing each individual's favorite season.
# with base
df[!, :language] = repeat(["english"], 10)

# with data frames meta
df = @linq df |>
    transform(fav_season = vcat(
        repeat(["summer"], 5),
        repeat(["winter"], 5)
    ))

@pt df[:, [:team_performance, :language, :fav_season]]
┌──────────────────┬──────────┬────────────┐
│ team_performance │ language │ fav_season │
│            Int64 │   String │     String │
├──────────────────┼──────────┼────────────┤
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                7 │  english │     summer │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
│                4 │  english │     winter │
└──────────────────┴──────────┴────────────┘
Rename columns. Change "satisfaction" to "job_sat" and "nation" to "country."
# with DataFrames
rename!(df, :satisfaction => :job_sat)

# with Query
df = df |> @rename(:nation => :country) |> DataFrame

@pt df[:, [:job_sat, :country]]
┌─────────┬───────────────┐
│ job_sat │       country │
│   Int64 │        String │
├─────────┼───────────────┤
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
│       1 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       3 │ united_states │
│       2 │ united_states │
│       4 │ united_states │
│       5 │ united_states │
└─────────┴───────────────┘
Reorder columns. Not easy to do unless you want to re-type every column name. I'm waiting for something like "df = df[:, [:team_id, :country, rest_of_columns()]]", which doesn't exist yet; a hedged workaround is sketched below.
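That said, newer releases of DataFrames.jl accept ":" inside select() to mean "every column not yet selected," which comes close. A sketch, assuming one of those newer versions (the df_reordered name is mine):

# move :team_id and :country to the front; `:` appends the remaining
# columns in their original order (newer DataFrames.jl releases only)
df_reordered = select(df, :team_id, :country, :)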
Pull out a single column in one of several ways. The last implementation also demonstrates how to select columns.
id_vec1 = df.id
id_vec2 = df[:, :id]

id_vec3 = @linq df |> select(:id)
id_vec3 = Vector(id_vec3.id)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
Randomly select columns.
num_cols = 4
possible_values = collect(1:1:10)
use_cols = sample(possible_values, num_cols, replace = false)

rand_df = @linq df |> select(use_cols)
@pt rand_df
┌─────────┬──────────────────┬─────────┬─────────┐
│ team_id │ need_achievement │    name │ job_sat │
│  String │            Int64 │  String │   Int64 │
├─────────┼──────────────────┼─────────┼─────────┤
│ awesome │                8 │    josh │       2 │
│ awesome │                3 │    adam │       4 │
│ awesome │                4 │   jenny │       5 │
│ awesome │                7 │    rory │       1 │
│ awesome │                2 │  thomas │       3 │
│    best │                9 │    bill │       3 │
│    best │               10 │   sarah │       3 │
│    best │                5 │   allie │       2 │
│    best │                4 │ shannon │       4 │
│    best │                6 │    cloe │       5 │
└─────────┴──────────────────┴─────────┴─────────┘
Create new columns. The first centers the job satisfaction scores around a hand-picked reference mean, whereas the second adds the trust and performance columns together to create a new variable. (A full z-score version is sketched after the output below.)
mean_jobsat = 2.5

df = @linq df |> transform(job_sat_standardized = :job_sat .- mean_jobsat)
df = @linq df |> transform(trust_plus_performance = :dyad_trust .+ :team_performance)

@pt df[:, [:dyad_trust, :team_performance, :trust_plus_performance]]
┌────────────┬──────────────────┬────────────────────────┐
│ dyad_trust │ team_performance │ trust_plus_performance │
│      Int64 │            Int64 │                  Int64 │
├────────────┼──────────────────┼────────────────────────┤
│          2 │                7 │                      9 │
│          2 │                7 │                      9 │
│          5 │                7 │                     12 │
│          5 │                7 │                     12 │
│          7 │                7 │                     14 │
│          7 │                4 │                     11 │
│          8 │                4 │                     12 │
│          8 │                4 │                     12 │
│          3 │                4 │                      7 │
│          3 │                4 │                      7 │
└────────────┴──────────────────┴────────────────────────┘
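Subtracting a hand-entered mean only centers the scores. A minimal sketch of a true standardization, computing the mean and sd from the column itself (the job_sat_z name is mine):

# z-score: subtract the observed mean and divide by the observed sd
df = @linq df |>
    transform(job_sat_z = (:job_sat .- mean(:job_sat)) ./ std(:job_sat))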
Sort the data based on the values within one or more columns.
df = @linq df |> orderby(:number_of_kids, :fav_color)
@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     5 │ awesome │       3 │              0 │      blue │
│    10 │    best │       5 │              0 │      blue │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     6 │    best │       3 │              0 │    yellow │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│     7 │    best │       3 │              2 │    purple │
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
└───────┴─────────┴─────────┴────────────────┴───────────┘
sort!(df, [order(:id), order(:team_id)])
@pt df[:, [:id, :team_id, :job_sat, :number_of_kids, :fav_color]]
┌───────┬─────────┬─────────┬────────────────┬───────────┐
│    id │ team_id │ job_sat │ number_of_kids │ fav_color │
│ Int64 │  String │   Int64 │          Int64 │    String │
├───────┼─────────┼─────────┼────────────────┼───────────┤
│     1 │ awesome │       2 │              2 │       red │
│     2 │ awesome │       4 │              3 │       red │
│     3 │ awesome │       5 │              0 │     green │
│     4 │ awesome │       1 │              0 │    orange │
│     5 │ awesome │       3 │              0 │      blue │
│     6 │    best │       3 │              0 │    yellow │
│     7 │    best │       3 │              2 │    purple │
│     8 │    best │       2 │              1 │    orange │
│     9 │    best │       4 │              1 │    silver │
│    10 │    best │       5 │              0 │      blue │
└───────┴─────────┴─────────┴────────────────┴───────────┘
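To sort in descending order, order() also takes a rev keyword. A small sketch; note the non-mutating sort() so df keeps its id ordering:

# largest families first; sort() returns a sorted copy, sort!() works in place
sort(df, order(:number_of_kids, rev = true))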
Filter the data according to values within rows.
# simple row commands
@linq df |> where(:job_sat .== 2)
@linq df |> where(:fav_color .== "red", :team_id .== "awesome")

# specify which columns to keep
df[df.job_sat .> 3, [:name, :id, :job_sat, :need_achievement]]
@linq df |> where(:job_sat .> 3) |> select(:name, :id, :job_sat, :need_achievement)

# multiple filtering commands
df[(df.job_sat .> 3) .& (df.need_achievement .> 4), :]
@linq df |> where(:job_sat .> 3, :need_achievement .> 4)
| Row | name | id | job_sat | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | country | common_value | language | fav_season | job_sat_standardized | trust_plus_performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 | String | String | Float64 | Int64 |
| 1 | cloe | 10 | 5 | 6 | blue | 0 | clarinet | best | z | 3 | 4 | united_states | 3 | english | winter | 2.5 | 7 |
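DataFrames also supports Base's filter directly on a data frame (the same pattern appears again below), which gives a hedged one-line equivalent of the where calls:

# keep the rows whose predicate returns true
filter(row -> row.job_sat > 3, df)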
Obtain a subset of rows. Similar to R's "%in%" operator.
use_people = ["josh", "rory"]
small_df = df

# using DataFramesMeta
@linq df |> where(:name .∈ Ref(use_people))

# using Base
small_df[findall(in(use_people), small_df.name), :]

# a third option...
filter(row -> any(occursin.(use_people, row.name)), small_df)
| Row | name | id | job_sat | need_achievement | fav_color | number_of_kids | instrument | team_id | dyad_id | dyad_trust | team_performance | country | common_value | language | fav_season | job_sat_standardized | trust_plus_performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | Int64 | String | Int64 | String | String | String | Int64 | Int64 | String | Int64 | String | String | Float64 | Int64 |
| 1 | josh | 1 | 2 | 8 | red | 2 | piano | awesome | v | 2 | 7 | united_states | 3 | english | summer | -0.5 | 9 |
| 2 | rory | 4 | 1 | 7 | orange | 0 | piano | awesome | w | 5 | 7 | united_states | 3 | english | summer | -1.5 | 12 |
Identify a row and column value, use it in a computation, then replace the original value with the new computed value.
This seems archaic for a basic cheat sheet but I use this kind of script in almost every computational model that I create.
# filter to a row
person_i_need = "josh"
row_i_need = @linq df |> where(:name .== person_i_need)

# get josh's value of job satisfaction
# the trailing [1] matters: Vector(row_i_need.job_sat) alone returns a one-element vector
value_i_need = Vector(row_i_need.job_sat)[1]

# do something with his job sat value
new_js = value_i_need + 10 * 4

# put the new value back into the data (use .= to broadcast the scalar)
df[(df.name .== person_i_need) .& (df.team_id .== "awesome"), :job_sat] .= new_js

@pt df[:, [:name, :job_sat]]
┌─────────┬─────────┐
│    name │ job_sat │
│  String │   Int64 │
├─────────┼─────────┤
│    josh │      42 │
│    adam │       4 │
│   jenny │       5 │
│    rory │       1 │
│  thomas │       3 │
│    bill │       3 │
│   sarah │       3 │
│   allie │       2 │
│ shannon │       4 │
│    cloe │       5 │
└─────────┴─────────┘
Relabel rows or create cutoffs. Similar to R's case_when() function.
First example. Turn time points such as 1, 2, and 3 into character-labeled time points such as "affect1", "affect2", and "affect3." This seems odd but it's necessary for creating wide data later on.
# data frame
affect_df = DataFrame(
    person = [1,1,1, 2,2,2, 3,3,3],
    time = [1,2,3, 1,2,3, 1,2,3],
    affect = [3,4,2, 5,3,2, 4,3,2]
)
@pt affect_df
┌────────┬───────┬────────┐
│ person │  time │ affect │
│  Int64 │ Int64 │  Int64 │
├────────┼───────┼────────┤
│      1 │     1 │      3 │
│      1 │     2 │      4 │
│      1 │     3 │      2 │
│      2 │     1 │      5 │
│      2 │     2 │      3 │
│      2 │     3 │      2 │
│      3 │     1 │      4 │
│      3 │     2 │      3 │
│      3 │     3 │      2 │
└────────┴───────┴────────┘
# function specifying how the labels should change
function convert_rows(val)
    if val == 1
        result = "affect1"
    elseif val == 2
        result = "affect2"
    else
        result = "affect3"
    end
    return result
end

# use that function within a transform call to create a new column
# use f.() syntax to vectorize
affect_df = @linq affect_df |>
    transform(label_time = convert_rows.(:time))
@pt affect_df
┌────────┬───────┬────────┬────────────┐
│ person │  time │ affect │ label_time │
│  Int64 │ Int64 │  Int64 │     String │
├────────┼───────┼────────┼────────────┤
│      1 │     1 │      3 │    affect1 │
│      1 │     2 │      4 │    affect2 │
│      1 │     3 │      2 │    affect3 │
│      2 │     1 │      5 │    affect1 │
│      2 │     2 │      3 │    affect2 │
│      2 │     3 │      2 │    affect3 │
│      3 │     1 │      4 │    affect1 │
│      3 │     2 │      3 │    affect2 │
│      3 │     3 │      2 │    affect3 │
└────────┴───────┴────────┴────────────┘
Second example. Change character values into different character values.
# data
labels_df = DataFrame(
    labels = ["dog", "shoe", "barn", "door", "cash"]
)

# function specifying how the values should change
function convert_labels(label)
    if label == "dog"
        result = "blue"
    elseif label == "shoe"
        result = "green"
    elseif label == "barn"
        result = "orange"
    elseif label == "door"
        result = "black"
    elseif label == "cash"
        result = "yellow"
    end
    return result
end

# transform a column and use f.() syntax
labels_df = @linq labels_df |>
    transform(color_labels = convert_labels.(:labels))
@pt labels_df
┌────────┬──────────────┐
│ labels │ color_labels │
│ String │       String │
├────────┼──────────────┤
│    dog │         blue │
│   shoe │        green │
│   barn │       orange │
│   door │        black │
│   cash │       yellow │
└────────┴──────────────┘
Third example. Create cutoffs with boolean logic.
# data
intermediates_df = DataFrame(
    person = [1,2,3,4,5,6,7,8],
    score = [1,2,3,7,8,9,20,30]
)

# function specifying the cutoffs
function create_limits(x)
    if x < 5
        result = "low"
    elseif x < 10   # covers 5 <= x < 10 so no value falls through
        result = "medium"
    else
        result = "high"
    end
    return result
end

# transform a column and use f.() syntax
intermediates_df = @linq intermediates_df |>
    transform(cutoffs = create_limits.(:score))
@pt intermediates_df
┌────────┬───────┬─────────┐
│ person │ score │ cutoffs │
│  Int64 │ Int64 │  String │
├────────┼───────┼─────────┤
│      1 │     1 │     low │
│      2 │     2 │     low │
│      3 │     3 │     low │
│      4 │     7 │  medium │
│      5 │     8 │  medium │
│      6 │     9 │  medium │
│      7 │    20 │    high │
│      8 │    30 │    high │
└────────┴───────┴─────────┘
Fourth example: logic using multiple columns.
# data
weather_df = DataFrame(
    sun = [3,2,1,8,9,7],
    clouds = [8,5,7,2,1,3],
    day = [1,2,3,4,5,6]
)

# function specifying logic
function determine_weather(col1, col2)
    if col1 < col2
        result = "cloudy"
    elseif col1 > col2
        result = "sunny"
    else
        result = "hazy"
    end
    return result
end

# transform the data using a create new column command and f.() syntax
weather_df = @linq weather_df |>
    transform(weather = determine_weather.(:sun, :clouds))
@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘
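For two- or three-way logic like this, a broadcasted ifelse can replace the named function entirely. A sketch (the weather2 column name is mine):

# nested ifelse.() calls reproduce the cloudy/sunny/hazy logic
weather_df[!, :weather2] = ifelse.(weather_df.sun .> weather_df.clouds, "sunny",
    ifelse.(weather_df.sun .< weather_df.clouds, "cloudy", "hazy"))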
Preallocate storage. Using the "zeros" command seems to be the safest approach in Julia. I haven't yet learned the "Vector{Int}(undef, 10)" approach well; a tentative sketch appears after the output below.
periods = 10
scores = zeros(periods)

dm = DataFrame(
    performance = zeros(periods),
    satisfaction = zeros(periods)
)
| Row | performance | satisfaction |
|---|---|---|
| | Float64 | Float64 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 |
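For reference, a tentative sketch of the undef approach mentioned above: it allocates without zeroing, so every slot must be written before it is read.

# uninitialized vector; contents are garbage until assigned
v = Vector{Int}(undef, periods)
for i in eachindex(v)
    v[i] = i^2
end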
time = 20

# example 1: append an entire data frame to an empty df
dfzz = DataFrame(
    group_p = Float64[],
    performance = Float64[]
)
for i in 1:time
    append!(dfzz, DataFrame(
        group_p = i + 3,
        performance = i - 3
    ))
end
@pt dfzz
┌─────────┬─────────────┐
│ group_p │ performance │
│ Float64 │     Float64 │
├─────────┼─────────────┤
│     4.0 │        -2.0 │
│     5.0 │        -1.0 │
│     6.0 │         0.0 │
│     7.0 │         1.0 │
│     8.0 │         2.0 │
│     9.0 │         3.0 │
│    10.0 │         4.0 │
│    11.0 │         5.0 │
│    12.0 │         6.0 │
│    13.0 │         7.0 │
│    14.0 │         8.0 │
│    15.0 │         9.0 │
│    16.0 │        10.0 │
│    17.0 │        11.0 │
│    18.0 │        12.0 │
│    19.0 │        13.0 │
│    20.0 │        14.0 │
│    21.0 │        15.0 │
│    22.0 │        16.0 │
│    23.0 │        17.0 │
└─────────┴─────────────┘
# example 2: add to an initialized df
dfyy = DataFrame(
    group_p = repeat([0], length(1:time)),
    performance = repeat([0], length(1:time))
)
for i in 1:time
    dfyy[i, :group_p] = i + 2
    dfyy[i, :performance] = i - 3
end
@pt dfyy
┌─────────┬─────────────┐
│ group_p │ performance │
│   Int64 │       Int64 │
├─────────┼─────────────┤
│       3 │          -2 │
│       4 │          -1 │
│       5 │           0 │
│       6 │           1 │
│       7 │           2 │
│       8 │           3 │
│       9 │           4 │
│      10 │           5 │
│      11 │           6 │
│      12 │           7 │
│      13 │           8 │
│      14 │           9 │
│      15 │          10 │
│      16 │          11 │
│      17 │          12 │
│      18 │          13 │
│      19 │          14 │
│      20 │          15 │
│      21 │          16 │
│      22 │          17 │
└─────────┴─────────────┘
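A hedged third option: push! grows a data frame one row at a time from a named tuple (the dfxx name is mine).

# example 3 (a sketch): push a named tuple per iteration
dfxx = DataFrame(
    group_p = Float64[],
    performance = Float64[]
)
for i in 1:time
    push!(dfxx, (group_p = i + 3, performance = i - 3))
end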
Use dot notation to apply a function to every value in a vector.
vec1 = [0, 3, 4, 5, 6]

function addtwo(x)
    y = x + 2
    return y
end

# don't use addtwo(vec1) -- adding an integer to a vector throws an error
# use the broadcast form instead
addtwo.(vec1)
5-element Array{Int64,1}:
 2
 5
 6
 7
 8
Calculate column means.
# mean of a single column
mean(df.need_achievement)

# within each team, calculate the mean job satisfaction
@linq df |>
    groupby(:team_id) |>
    based_on(mean_jobsat = mean(:job_sat))

# within each dyad, calculate the mean and sd of need for achievement
@linq df |>
    groupby(:dyad_id) |>
    based_on(
        mean_need_achievement = mean(:need_achievement),
        sd_need_achievement = std(:need_achievement)
    )
| Row | dyad_id | mean_need_achievement | sd_need_achievement |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | v | 5.5 | 3.53553 |
| 2 | w | 5.5 | 2.12132 |
| 3 | x | 5.5 | 4.94975 |
| 4 | y | 7.5 | 3.53553 |
| 5 | z | 5.0 | 1.41421 |
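For reference, newer DataFrames.jl versions express grouped summaries with combine rather than based_on. A sketch of the dyad example, assuming one of those versions:

# newer DataFrames.jl grouping API: source column => function => output name
combine(
    groupby(df, :dyad_id),
    :need_achievement => mean => :mean_need_achievement,
    :need_achievement => std => :sd_need_achievement
)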
What if there is missing data?
missing_df = DataFrame(
    team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
    name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
    cohesion = [3,6,missing,7,8,missing]
)
@pt missing_df
┌─────────┬─────────┬───────────────────────┐
│ team_id │    name │              cohesion │
│  String │  String │ Union{Missing, Int64} │
├─────────┼─────────┼───────────────────────┤
│  team_a │     ted │                     3 │
│  team_a │ stewart │                     6 │
│  team_a │  jessie │               missing │
│  team_b │   susan │                     7 │
│  team_b │   carie │                     8 │
│  team_b │  rachel │               missing │
└─────────┴─────────┴───────────────────────┘
@linq missing_df |>
    groupby(:team_id) |>
    based_on(
        mean_cohesion = mean(skipmissing(:cohesion)),
        sd_cohesion = std(skipmissing(:cohesion))
    )
| Row | team_id | mean_cohesion | sd_cohesion |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | team_a | 4.5 | 2.12132 |
| 2 | team_b | 7.5 | 0.707107 |
What if the missing data is labeled as NaN?
missing_df_nan = DataFrame(
    team_id = ["team_a", "team_a", "team_a", "team_b", "team_b", "team_b"],
    name = ["ted", "stewart", "jessie", "susan", "carie", "rachel"],
    cohesion = [3,6,NaN,7,8,NaN]
)
@pt missing_df_nan
┌─────────┬─────────┬──────────┐
│ team_id │    name │ cohesion │
│  String │  String │  Float64 │
├─────────┼─────────┼──────────┤
│  team_a │     ted │      3.0 │
│  team_a │ stewart │      6.0 │
│  team_a │  jessie │      NaN │
│  team_b │   susan │      7.0 │
│  team_b │   carie │      8.0 │
│  team_b │  rachel │      NaN │
└─────────┴─────────┴──────────┘
nanmean(x) = mean(filter(!isnan, x))
nansd(x) = std(filter(!isnan, x))

@linq missing_df_nan |>
    groupby(:team_id) |>
    based_on(
        mean_cohesion = nanmean(:cohesion),
        sd_cohesion = nansd(:cohesion)
    )
| Row | team_id | mean_cohesion | sd_cohesion |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | team_a | 4.5 | 2.12132 |
| 2 | team_b | 7.5 | 0.707107 |
Keep in mind that missingness in julia should be labeled with "missing", not "NaN".
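If a column does arrive with NaN placeholders, Base's replace can relabel them; a minimal sketch:

# relabel NaN as missing so that skipmissing() behaves as intended
missing_df_nan[!, :cohesion] = replace(missing_df_nan.cohesion, NaN => missing)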
Take data stored in a separate data set ("team_cohesion") and merge it with the larger, full data ("full_data").
A left join retains only the rows in the first df. For example, if "team_cohesion" contained a team 4 that was absent from "full_data", that team would not appear in the join shown here.
The key is specified with the "on" argument.
full_data = DataFrame(
    team_id = ["team_1", "team_2", "team_3", "team_1", "team_2", "team_3"],
    person_id = [1,2,3,4,5,6],
    openness = [6,7,8,5,6,7]
)
@pt full_data
┌─────────┬───────────┬──────────┐
│ team_id │ person_id │ openness │
│  String │     Int64 │    Int64 │
├─────────┼───────────┼──────────┤
│  team_1 │         1 │        6 │
│  team_2 │         2 │        7 │
│  team_3 │         3 │        8 │
│  team_1 │         4 │        5 │
│  team_2 │         5 │        6 │
│  team_3 │         6 │        7 │
└─────────┴───────────┴──────────┘
team_cohesion = DataFrame(
    team_id = ["team_1", "team_2", "team_3"],
    team_cohesion = [30,20,10]
)
@pt team_cohesion
┌─────────┬───────────────┐
│ team_id │ team_cohesion │
│  String │         Int64 │
├─────────┼───────────────┤
│  team_1 │            30 │
│  team_2 │            20 │
│  team_3 │            10 │
└─────────┴───────────────┘
full_df = join(full_data, team_cohesion, on = :team_id, kind = :left)
@pt full_df
┌─────────┬───────────┬──────────┬───────────────────────┐
│ team_id │ person_id │ openness │         team_cohesion │
│  String │     Int64 │    Int64 │ Union{Missing, Int64} │
├─────────┼───────────┼──────────┼───────────────────────┤
│  team_1 │         1 │        6 │                    30 │
│  team_2 │         2 │        7 │                    20 │
│  team_3 │         3 │        8 │                    10 │
│  team_1 │         4 │        5 │                    30 │
│  team_2 │         5 │        6 │                    20 │
│  team_3 │         6 │        7 │                    10 │
└─────────┴───────────┴──────────┴───────────────────────┘
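A hedged note: newer DataFrames.jl versions split join into dedicated verbs, so the same left join would read as below.

# newer DataFrames.jl API for kind = :left
full_df = leftjoin(full_data, team_cohesion, on = :team_id)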
Take data in long form and make it wide. Doing so in julia is more cumbersome than the reshape command in R, especially if the goal of creating wide data is to run structural equation models.
# a long df
long_df = DataFrame(
    person_id = [1,1,1, 2,2,2, 3,3,3, 4,4,4],
    time = [1,2,3, 1,2,3, 1,2,3, 1,2,3],
    affect = [2,3,1, 2,2,2, 1,3,1, 1,2,3],
    esteem = [5,6,7, 6,6,6, 3,4,5, 4,8,9],
    stable_measure = [5,5,5, 7,7,7, 3,3,3, 2,2,2]
)
@pt long_df
┌───────────┬───────┬────────┬────────┬────────────────┐
│ person_id │  time │ affect │ esteem │ stable_measure │
│     Int64 │ Int64 │  Int64 │  Int64 │          Int64 │
├───────────┼───────┼────────┼────────┼────────────────┤
│         1 │     1 │      2 │      5 │              5 │
│         1 │     2 │      3 │      6 │              5 │
│         1 │     3 │      1 │      7 │              5 │
│         2 │     1 │      2 │      6 │              7 │
│         2 │     2 │      2 │      6 │              7 │
│         2 │     3 │      2 │      6 │              7 │
│         3 │     1 │      1 │      3 │              3 │
│         3 │     2 │      3 │      4 │              3 │
│         3 │     3 │      1 │      5 │              3 │
│         4 │     1 │      1 │      4 │              2 │
│         4 │     2 │      2 │      8 │              2 │
│         4 │     3 │      3 │      9 │              2 │
└───────────┴───────┴────────┴────────┴────────────────┘
# create wide data with esteem as the "widen" variable
print(
    unstack(long_df, :person_id, :time, :esteem)
)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 5      │ 6      │ 7      │
│ 2   │ 2         │ 6      │ 6      │ 6      │
│ 3   │ 3         │ 3      │ 4      │ 5      │
│ 4   │ 4         │ 4      │ 8      │ 9      │
# create wide data with affect as the "widen" variable
print(
    unstack(long_df, :person_id, :time, :affect)
)
4×4 DataFrames.DataFrame
│ Row │ person_id │ 1      │ 2      │ 3      │
│     │ Int64     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼───────────┼────────┼────────┼────────┤
│ 1   │ 1         │ 2      │ 3      │ 1      │
│ 2   │ 2         │ 2      │ 2      │ 2      │
│ 3   │ 3         │ 1      │ 3      │ 1      │
│ 4   │ 4         │ 1      │ 2      │ 3      │
So, there are two issues with reshaping in julia. First, I can't make affect and esteem the wide variables simultaneously. Second, the names of the new columns become integers such as 1, 2, and 3, so they need to be relabeled.
Here is how I would prepare the data for SEM.
wide_affect = unstack(long_df, [:person_id, :stable_measure], :time, :affect)
rename!(wide_affect, "1" => "affect1", "2" => "affect2", "3" => "affect3")

wide_esteem = unstack(long_df, [:person_id], :time, :esteem)
rename!(wide_esteem, "1" => "esteem1", "2" => "esteem2", "3" => "esteem3")

wide_df = join(wide_affect, wide_esteem, on = :person_id, kind = :left)
| Row | person_id | stable_measure | affect1 | affect2 | affect3 | esteem1 | esteem2 | esteem3 |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | 1 | 5 | 2 | 3 | 1 | 5 | 6 | 7 |
| 2 | 2 | 7 | 2 | 2 | 2 | 6 | 6 | 6 |
| 3 | 3 | 3 | 1 | 3 | 1 | 3 | 4 | 5 |
| 4 | 4 | 2 | 1 | 2 | 3 | 4 | 8 | 9 |
Disclaimer. I'm not saying that julia is incapable of unstacking multiple variables or that there is no workaround (I just demonstrated one). I'm simply saying that the reshape command in R is currently more user-friendly for SEM.
What about wide to long?
long_affect = stack(wide_affect, [:affect1, :affect2, :affect3])
| Row | variable | value | person_id | stable_measure |
|---|---|---|---|---|
| | Symbol | Int64⍰ | Int64 | Int64 |
| 1 | affect1 | 2 | 1 | 5 |
| 2 | affect1 | 2 | 2 | 7 |
| 3 | affect1 | 1 | 3 | 3 |
| 4 | affect1 | 1 | 4 | 2 |
| 5 | affect2 | 3 | 1 | 5 |
| 6 | affect2 | 2 | 2 | 7 |
| 7 | affect2 | 3 | 3 | 3 |
| 8 | affect2 | 2 | 4 | 2 |
| 9 | affect3 | 1 | 1 | 5 |
| 10 | affect3 | 2 | 2 | 7 |
| 11 | affect3 | 1 | 3 | 3 |
| 12 | affect3 | 3 | 4 | 2 |
Identify the working directory and adjust it.
# current directory
pwd()

# change directory (insert your own path)
cd("")

# change directory to wherever the opened file is located
cd(dirname(@__FILE__))

# move up a level and into the data folder
cd("../data")

# if you know the file name but not the directory
filename = "whatever_name.csv"
filepath = joinpath(@__DIR__, filename)
println(filepath)
Working with csv files, either with CSV.jl or CSVFiles.jl.
# read csv
df = CSV.File(file) |> DataFrame!

# if the file contains data with no header
CSV.File(file; header = false)

# or to incorporate your own column names
CSV.File(file; header = ["col1", "col2", "col3"])

# if the csv has spaces in the header names
# e.g., column names like "column one" will become "column_one"
CSV.File(file; normalizenames = true)

# if the csv has meta data within the first few rows that needs to be skipped
CSV.File(file; datarow = 4)

# if something like NA or 999 indicates missingness in the csv
CSV.File(file; missingstrings = ["999", "NA"])

# identify column types
CSV.File(file; types = [Int, Int, String])

# the CSV package has many of the same commands
## append two csvs
df = CSV.read(file)
df = CSV.read(file2, df; append = true)

# write csvs
## specify that missing values should be written as "NA"
CSV.write("out.csv", df; missingstring = "NA")
Identify all files within a folder.
# files in the current directory
readdir()

# files within a given directory
readdir(path)
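To keep only certain file types, Base's filter and endswith combine nicely; a small sketch:

# list only the csv files in the current directory
csv_files = filter(f -> endswith(f, ".csv"), readdir())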
Run R code within julia. I don't evaluate these commands to save space.
using RCall

# run R code
R"""
library(ggplot2)
ggplot($df, aes(x = time, y = y)) +
    geom_point()

library(actuar)
total <- 4 + 9
power_law <- rpareto(1, 2, 3)
"""

# pull objects from the r environment into the julia environment
@rget total
@rget power_law

# put objects from the julia environment into the r environment
jul_variable = 4
@rput jul_variable
R"""
new_in_r <- jul_variable + 10
"""
@rget new_in_r
Add packages. In the terminal, press ] and then type add package_name. Do not include .jl after the package name.
In a script, do below.
using Pkg
Pkg.add("RCall")

# from github
Pkg.add(PackageSpec(url = "https://github.com/JuliaDiffEq/ParameterizedFunctions.jl"))
Create vectors and arrays.
# vector of length n with all zeros
n = 10
result_store = zeros(n)

# undefined vector of length n
Vector(undef, n)

# array from 0 to 100 in steps of 10
print(
    collect(0:10:100)
)
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# array from 4 to 1 decreasing by 1
print(
    collect(4:-1:1)
)
[4, 3, 2, 1]
# array from 1 to 100 with 12 steps
print(
    collect(range(1, stop = 100, length = 12))
)
[1.0, 10.0, 19.0, 28.0, 37.0, 46.0, 55.0, 64.0, 73.0, 82.0, 91.0, 100.0]
# start at 1, add 3, do that for 20 steps
print(
    range(1, step = 3, length = 20) |> collect
)
[1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58]
Count the number of "trues" in a vector. For this example, refer back to the weather data frame.
@pt weather_df
┌───────┬────────┬───────┬─────────┐
│   sun │ clouds │   day │ weather │
│ Int64 │  Int64 │ Int64 │  String │
├───────┼────────┼───────┼─────────┤
│     3 │      8 │     1 │  cloudy │
│     2 │      5 │     2 │  cloudy │
│     1 │      7 │     3 │  cloudy │
│     8 │      2 │     4 │   sunny │
│     9 │      1 │     5 │   sunny │
│     7 │      3 │     6 │   sunny │
└───────┴────────┴───────┴─────────┘
# how many times is it sunny?
days_sunny = count(i == "sunny" for i in weather_df.weather)
print(days_sunny)
3
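Because the comparison broadcasts to a vector of trues and falses, summing that vector gives the same count:

# sum() of a boolean vector counts the trues
days_sunny2 = sum(weather_df.weather .== "sunny")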
Create distributions and pull samples.
# set seed
Random.seed!(4)

# draw 100 random samples from a normal distribution with mean 1 and sd 2
draw_values = rand(Normal(1, 2), 100)

# draw a single sample but retain it as a scalar rather than an array
draw_scalar = rand(Normal(1, 2), 1)[1]

# estimate the parameters of the empirical distribution draw_values
fit(Normal, draw_values)

# the same pattern works for any distribution
# (see the "continuous" section of the Distributions.jl documentation)
## abstract_distribution = Normal(parameter1, parameter2)
## x = rand(abstract_distribution, 100)

# multivariate normal with 2 variables, draw 5 values
mu = [10.0; 20.1]
covariance_matrix = [2.0 1.2; 1.2 3.5]
d = MvNormal(mu, covariance_matrix)
x = rand(d, 5)

# random number from a uniform between 0 and 1
rand()

# 5 random numbers drawn from a uniform between 0 and 1
rand(5)

# matrix of random numbers, 4 by 3
rand(4, 3)

# random number from Uniform[0, 100]
rand() * 100

# randomly choose an index within a vector
rand(1:10)

# 2 random draws from the integers 3 through 9
rand(3:9, 2)
2-element Array{Int64,1}:
 3
 3
Read a text file.
datafilename = "data.txt"
datafile = open(datafilename)
data = readlines(datafile)
close(datafile)
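A hedged alternative that closes the file automatically, using Base's function-first open method:

# open applies readlines to the file handle and closes it afterward
data = open(readlines, datafilename)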