I work in a startup environment. Sometimes features are developed within 2-3 days. So developers give preference to deadline instead of performance. For past few months, I was compiling a list of code snippets where performance is ignored completely.
The common culprit in such code was developers habit to query in memory instead of database.
What do I mean by querying in memory?
I mean instead of getting result from database, we write code to first get data from DB and then write code to extract result.
For Example:
Invoice.where(client_id: 1).pluck(:amount).sum
Above snippet loads all amounts in memory and then sums up them. I consider this bad. We can achieve same result by using sql aggregate functions, without loading any additional data in memory.
Invoice.where(client_id: 1).sum(:amount)
Snippets
I have listed such snippets below, also I have come up with replacement for such code which will improve performance significantly.
Plucking Ids
Do not pluck ids from one table and insert them into another query. Simply write sub query which will leave the work of selection to Postgres(or any other DB). This will result in huge performance boost. You can easily see performance difference on tables with large number of records.
Following query takes more time to return result. Here we first tell Postgres to return profile ids, and again we tell him to get profiles for such profile ids.
BAD
Profile.where(id: Task.where(context: 'test').pluck(:profile_id)).count
=> 20625
Below query gives result really fast.
GOOD
Profile.where(id: Task.where(context: 'test').select(:profile_id)).count
=> 20625
Why?
When you use ActiveRecord::QueryMethods::WhereChain#select, ActiveRecord creates subquery i.e. relation object.
Profile.where(id: Task.where(context: 'test').select(:profile_id)).to_sql
> "SELECT "profiles".* FROM "profiles" WHERE "profiles"."id" IN (SELECT "tasks"."profile_id" FROM "tasks" WHERE "tasks"."context" = 'test')"
Here we tell postgres to select profile ids from tasks table and use those to retrieve profiles. The whole statement acts as a single query and works efficiently than pluck version of code.
Unnecessary aggregations after fetching rows
Somewhere in our code base I found following code.
pending_one_task_profile_ids = Task.pending.pluck(:profile_id)
pending_one_task_profile_ids = pending_one_task_profile_ids.select{|e| pending_one_task_profile_ids.count(e) == 1}
solo_tasks_count = Task.pending.where(profile_id: pending_one_task_profile_ids).count
Above code snippet, calculates solo_tasks_count
. The solo_tasks_count
is number of tasks who have only one pending task.
Well, what is wrong with this code? Everything.
- It gets all profile ids of pending tasks in memory. If you have 10,000 pending tasks then you have array of 10,000 integers in memory.
- Then, it runs ruby’s Enumerable#select method, Which runs for each element in the array and selects only those elements which are present only once in the array.
- On third line,
tasks
table is queried to get count according to array of profile ids.
Above code can be replaced by this. And it will be much more faster than previous code.
pending_one_task_profile_ids = Task.pending.group(:profile_id).having('COUNT(profile_id)=1').select(:profile_id)
solo_tasks_count = Task.pending.where(:profile_id => pending_one_task_profile_ids).count
Avoid Array breakup methods
We have our own core extensions which adds extra behavior (monkey patches) to built in ruby classes. For array we have following method:
class Array
def breakup
Hash[self.uniq.map{|_x| [_x, self.count(_x)]}]
end
end
The breakup
method returns hash where each element is key and value is count of such element in array. For example,
[1, 2, 3, 1, 1, 4, 2].breakup
> {1=>3, 2=>2, 3=>1, 4=>1}
This method is really handy. But while querying database we can get breakup directly from SQL GROUP BY
query.
So instead of writing
@counts_hash = current_user.requests.pluck(:context).breakup
We can write
@counts_hash = current_user.requests.group(:context).count
In first snippet we are getting array of contexts in memory and then computing count of each element. Where second snippet is plain SQL query which returns count of each context.