LINQ, or how to deal with data more efficiently

If you want to operate on large datasets, Language-INtegrated Queries can be a helpful tool

April 17, 2021
1328 words (7 min read)
Development
UiPath

Introduction

Now that UiPath is migrating to 64-bit, we might see increased interest in performing in-memory data crunching even on larger datasets. There aren’t really any activities in UiPath for this, however, as RPA is too slow for large datasets.

While I don’t think RPA is currently going in the direction of actual big data (for which there are much better specialized tools), I have seen customers ask about intermediate data sets, e.g. with 1 million rows exported from SAP. These often still fit in memory, and therefore should be tractable.

What is LINQ?

LINQ, or language-integrated queries, is a feature of the .NET languages that allows you to specify certain data operations in a relatively readable way. In the context of Workflow Foundation (which UiPath uses), when dealing with larger datasets, you really don’t want to go through the performance hit of doing everything with activities, so LINQ is a nice tool to have in your belt.

There are two different flavors of LINQ (see below for how to construct these), one looks very similar to SQL

from person in lhs 
	join salary in rhs
	on person["ID"] equals salary["ID"]
	select new {ID = person["ID"], Name = person["Name"], Salary = salary["Salary"]};

the other looks more like a series of functional operations being applied, but the compiler will optimize these in the same way

lhs.Join(rhs, 
	person => person["ID"],
	salary => salary["ID"],
	(person,  salary) => new {ID = person["ID"], Name = person["Name"], Salary = salary["Salary"]}
);

Which one you find more readable is a matter of taste and background, I usually prefer the SQL-like syntax.

Use case: joining two data tables

This is a very common operation: you have two normalized tables you get from a database, for example an entity table and a lookup table, and want to combine the two in order to add an additional attribute to your data table. There is an activity for this called Join Data Tables, but LINQ lets you do far more, so bear with me for a moment.

Let’s start with a simple example using the Invoke Code activity: we have two people, Stefan and Elaine. Both of them have an employee ID. In VB.net, that looks like this (of course you can also use the Build Data Table activity, but we’ll need to programmatically do this later):

lhs = New DataTable
lhs.Columns.Add("ID", GetType(Int32))
lhs.Columns.Add("Name", GetType(String))

lhs.Rows.Add({1, "Stefan"})
lhs.Rows.Add({3, "Elaine"})

now, both of these people also have salaries, keyed by their IDsYes, I’m egregiously underpaid

rhs = New DataTable
rhs.Columns.Add("ID", GetType(Int32))
rhs.Columns.Add("Salary", GetType(Decimal))

rhs.Rows.Add({1, 1.54})
rhs.Rows.Add({3, 15.00})

but we don’t want to look up the employee name every time, so we need to join them together. The naive way to do this is to loop through every row in one table, for these rows loop through every item in the other table, and if you find a match, add this to the result.

joinTable = lhs.Clone()
joinTable.Columns.Add("Salary", GetType(Decimal))

For Each r As DataRow In lhs.Rows
	For Each r2 As DataRow In rhs.Rows
		If r2("ID").Equals(r("ID")) Then
			joinTable.Rows.Add({r("ID"), r("Name"), r2("Salary")})
		End If
	Next
Next

or in C# because Option Strict On in VB sucks with LINQ and we want to compare this later

joinTable = lhs.Clone();
joinTable.Columns.Add("Salary", typeof(Decimal));

foreach (DataRow r in lhs.Rows) {
	foreach (DataRow r2 in rhs.Rows) {
		if (r2["ID"].Equals(r["ID"])) {
			joinTable.Rows.Add(new []{r["ID"], r["Name"], r2["Salary"]});
		}
	}
}

However, for large amounts of data, searching through the whole lookup table for each row in the data table is not feasible. Databases have a lot of tricks to handle this sort of thing, the main one is using something called an index. This works pretty much like the index at the end of a book that tells you on which page you can find certain keywords.

For .NET, we can use LINQ to do the join instead, which does some similar optimizations under the hood. Here is something equivalent to the above

joinTable = lhs.Clone();
joinTable.Columns.Add("Salary", typeof(Decimal));

var joinQuery = from person in lhs.AsEnumerable() 
	join salary in rhs.AsEnumerable()
	on person["ID"] equals salary["ID"]
	select new {ID = person["ID"], Name = person["Name"], Salary = salary["Salary"]};

foreach (var r in joinQuery) {
	joinTable.Rows.Add(new [] { r.ID, r.Name, r.Salary});
}

or in the functional flavor

joinTable = lhs.Clone();
joinTable.Columns.Add("Salary", typeof(Decimal));

var joinQuery = lhs.AsEnumerable().Join(rhs.AsEnumerable(), 
	person => person["ID"],
	salary => salary["ID"],
	(person,  salary) => new {ID = person["ID"], Name = person["Name"], Salary = salary["Salary"]}
);
foreach (var r in joinQuery) {
	joinTable.Rows.Add(new [] { r.ID, r.Name, r.Salary});
}

Timing

So what do we gain from this except for a questionably more readable syntax? Well, one advantage is that these will run a lot faster than the nested for loops. Let’s have a look with a larger amount of data and do some timing using the Stopwatch class.

Stopwatch setup
Stopwatch setup

First, let’s kick out the naive version vs the others with 10k randomly generated numbers

lhs = new DataTable();
lhs.Columns.Add("ID", typeof(Int32));
lhs.Columns.Add("Value", typeof(Int32));

rhs = new DataTable();
rhs.Columns.Add("ID", typeof(Int32));
rhs.Columns.Add("Unvalue", typeof(Int32));

var rand = new Random(1329847);
int numInts = 10000;
int maxID = 100000;
int maxValue = 100000;
for (int i = 0; i < numInts; ++i) {
	lhs.Rows.Add(rand.Next(maxID), rand.Next(maxValue));
	rhs.Rows.Add(rand.Next(maxID), rand.Next(maxValue));
	
}

So, what are the results? As you can see, the naive join is really a lot slower than the others, with Join Data Tables being by far the quickest, although the 1s delay is probably due to passing the data into Invoke Code rather than anything with the join execution.

04/17/2021 13:38:26 => [Info] Starting naive join
04/17/2021 13:38:39 => [Info] Naive join rows: 950
04/17/2021 13:38:39 => [Info] Naive join runtime: 13.599
04/17/2021 13:38:39 => [Info] Starting Join Data Tables
04/17/2021 13:38:39 => [Info] Join Data Tables rows: 950
04/17/2021 13:38:39 => [Info] Join Data Tables runtime: 00.12
04/17/2021 13:38:39 => [Info] Starting join with LINQ (SQL)
04/17/2021 13:38:41 => [Info] Join with LINQ (SQL) rows: 950
04/17/2021 13:38:41 => [Info] Join with LINQ (SQL) runtime: 01.257
04/17/2021 13:38:41 => [Info] Starting join with LINQ (Functional)
04/17/2021 13:38:42 => [Info] Join with LINQ (Functional) rows: 950
04/17/2021 13:38:42 => [Info] Join with LINQ (Functional) runtime: 01.358

Let’s see what happens when we go to 500k rows instead, though. Now, Join Data Tables is actually about 50% slower than the LINQ expressions, possibly due to duplicating the ID column, although in practice the difference is unlikely to matter. Naive join would be so slow here that I won’t even run it. Note I tried larger amounts of data, but UiPath started crashing for an unknown reason.

04/17/2021 13:42:32 => [Info] Starting Join Data Tables
04/17/2021 13:42:40 => [Info] Join Data Tables rows: 2499558
04/17/2021 13:42:40 => [Info] Join Data Tables runtime: 07.888
04/17/2021 13:42:40 => [Info] Starting join with LINQ (SQL)
04/17/2021 13:42:45 => [Info] Join with LINQ (SQL) rows: 2499558
04/17/2021 13:42:45 => [Info] Join with LINQ (SQL) runtime: 05.559
04/17/2021 13:42:45 => [Info] Starting join with LINQ (Functional)
04/17/2021 13:42:51 => [Info] Join with LINQ (Functional) rows: 2499558
04/17/2021 13:42:51 => [Info] Join with LINQ (Functional) runtime: 05.558
04/17/2021 13:42:51 => [Info] LINQ execution ended in: 00:00:22

Other Linqs

Of course, the main advantage of LINQ is not so much just data joining (as we saw, the Join Data Tables is sufficient for this in most cases) but being far more flexible than activities. For example, you can use it to group your data by one column and do something for each group, or aggregate them easily.

You can find my test code on github if you’d like to play around with it on your own.

I hope you enjoyed the somewhat more technical post this week. See you next time!


© 2021, Stefan Reutter