Creating Large Elasticsearch Index by Using Limited Hardware

First of all, let me tell you my current requirement and you'll continue reading if you have a similar use case. I'm trying to submit a paper for SIGIR 2018 and for this paper I need to index 50,000,000 unique documents to 15 different Elasticsearch indexes. According to my calculation, I need 1.5 terrabyte space without replication. So I decided to buy an external spinning disk and building index on this hardware. Good plan, right? Let me tell you it's not!

At this point my setup was:

  • MacBook Pro: 256 GB SSD storage (190 GB storage was available) with 16 GB Ram
  • Seagate Slim Backup 2 TB spinning disk
  • ES's data path was pointing to external disk only
  • ES initiated with 8 GB Ram
  • Index set to refresh_interval -1 to disable near real-time index updates
  • Index set to numberofreplicas 0 to speed up bulk requests

What was in my mind? I thought that when I send index requests, ES will collect a bunch of index requests in memory and process these requests on memory and ES flush periodically these documents from memory to disk so using spinning disk should not be a major bottleneck. But I was wrong :) First I tracked the performance of Elasticsearch by using VisualVM and Elasticsearch wasn't using so much memory and GC was working fine, so I decided to check disk usage of the OS.

wholelottalove:~ soneraltin$ iostat -c 5 -w 10  
              disk0               disk2       cpu    load average
    KB/t  tps  MB/s     KB/t  tps  MB/s  us sy id   1m   5m   15m
  189.90  274 50.72    74.83    5  0.38  27  7 66  3.15 3.08 3.11
  235.29  370 85.07   162.61  147 23.37  19 10 72  3.12 3.07 3.11
  199.63  412 80.42   106.64  181 18.82  19  7 74  3.10 3.07 3.11
  218.48  455 97.10   126.78  173 21.45  12  6 82  3.07 3.07 3.11
  231.82  375 84.79   188.21  147 27.11   7  5 89  2.98 3.05 3.10

It's obvious that disk0, internal SSD, so the Elasticsearch executables are on disk0, disk2 should be busy which shows disk0 is swapping a lot. And I checked external disk's transfer speed and it's 120 Mbyte/sec, which is clearly not enough for fast processing, but at the same time, I was using it for downloading data from CommonCrawl and also processing WARC files which makes SSD so much slower!

More or less Elasticsearch's whole process is like this, writing inverted index to spinning disk is a very high-cost operation which decreases doc/sec index from 6,000 to 50 on average compared to SSD. So using an internal spinning disk should be so much faster than using the external spinning disk, but in my case, it's not an option.

So for indexing 50,000,000 documents to 15 different indicies, I need (50,000,000 * 15 / 50 / 60 / 60 / 24 = 173) days, so it's not acceptable.

So first I checked the ES tuning blogs, I read around 10 different awesome blog posts and tried to make it faster but even though I made a lot of change, their performance effect was not so the external SSD prices and I skipped that option. Then I checked GCP, Azure, and AWS for ES clusters and I calculated the estimated price for this experiment and it's around 300 USD. So yeah, for one time it's acceptable.

But then one of my colleagues came with a great idea. His plan was so good. So he decided to:

  • Build several small indexes by using SSD faster. Like for index A which will have 50,000,000 documents in total, build sub-indexes A0, A1, A2 ... A19
  • Move them to external disk
  • Point out external disk as ES data folder
  • Define alias for each index by using these sub indexes
  • Use alias A for the main and virtually merged index.

So it's an perfect plan isn't it. So maybe I'll compare search performance of the external disk since I need to execute 7 million different queries from AOL logs.

TAGGED IN elasticsearch, thesis