Query in database, not in memory

I work in a startup environment. Sometimes features are developed within 2-3 days. So developers give preference to deadline instead of performance. For past few months, I was compiling a list of code snippets where performance is ignored completely.

The common culprit in such code was developers habit to query in memory instead of database.

What do I mean by querying in memory?

I mean instead of getting result from database, we write code to first get data from DB and then write code to extract result.

For Example:

Invoice.where(client_id: 1).pluck(:amount).sum

Above snippet loads all amounts in memory and then sums up them. I consider this bad. We can achieve same result by using sql aggregate functions, without loading any additional data in memory.

Invoice.where(client_id: 1).sum(:amount)

Snippets

I have listed such snippets below, also I have come up with replacement for such code which will improve performance significantly.

Plucking Ids

Do not pluck ids from one table and insert them into another query. Simply write sub query which will leave the work of selection to Postgres(or any other DB). This will result in huge performance boost. You can easily see performance difference on tables with large number of records.

Following query takes more time to return result. Here we first tell Postgres to return profile ids, and again we tell him to get profiles for such profile ids.

BAD

Profile.where(id: Task.where(context: 'test').pluck(:profile_id)).count
=> 20625

Below query gives result really fast.

GOOD

Profile.where(id: Task.where(context: 'test').select(:profile_id)).count
=> 20625

Why?

When you use ActiveRecord::QueryMethods::WhereChain#select, ActiveRecord creates subquery i.e. relation object.

Profile.where(id: Task.where(context: 'test').select(:profile_id)).to_sql

> "SELECT "profiles".* FROM "profiles" WHERE "profiles"."id" IN (SELECT "tasks"."profile_id" FROM "tasks" WHERE "tasks"."context" = 'test')"

Here we tell postgres to select profile ids from tasks table and use those to retrieve profiles. The whole statement acts as a single query and works efficiently than pluck version of code.

Unnecessary aggregations after fetching rows

Somewhere in our code base I found following code.

pending_one_task_profile_ids = Task.pending.pluck(:profile_id)

pending_one_task_profile_ids = pending_one_task_profile_ids.select{|e| pending_one_task_profile_ids.count(e) == 1}

solo_tasks_count = Task.pending.where(profile_id: pending_one_task_profile_ids).count

Above code snippet, calculates solo_tasks_count . The solo_tasks_count is number of tasks who have only one pending task.

Well, what is wrong with this code? Everything.

It gets all profile ids of pending tasks in memory. If you have 10,000 pending tasks then you have array of 10,000 integers in memory.
Then, it runs ruby’s Enumerable#select method, Which runs for each element in the array and selects only those elements which are present only once in the array.
On third line, tasks table is queried to get count according to array of profile ids.

Above code can be replaced by this. And it will be much more faster than previous code.

pending_one_task_profile_ids = Task.pending.group(:profile_id).having('COUNT(profile_id)=1').select(:profile_id)

solo_tasks_count = Task.pending.where(:profile_id => pending_one_task_profile_ids).count

Avoid Array breakup methods

We have our own core extensions which adds extra behavior (monkey patches) to built in ruby classes. For array we have following method:

class Array
  def breakup
    Hash[self.uniq.map{|_x| [_x, self.count(_x)]}]
  end
end

The breakup method returns hash where each element is key and value is count of such element in array. For example,

[1, 2, 3, 1, 1, 4, 2].breakup
> {1=>3, 2=>2, 3=>1, 4=>1}

This method is really handy. But while querying database we can get breakup directly from SQL GROUP BY query.

So instead of writing

@counts_hash = current_user.requests.pluck(:context).breakup

We can write

@counts_hash = current_user.requests.group(:context).count

In first snippet we are getting array of contexts in memory and then computing count of each element. Where second snippet is plain SQL query which returns count of each context.

What do I mean by querying in memory?#

Snippets#

Plucking Ids#

Unnecessary aggregations after fetching rows#

Avoid Array breakup methods#

What do I mean by querying in memory?

Snippets

Plucking Ids

Unnecessary aggregations after fetching rows

Avoid Array breakup methods