Monday, June 25, 2012

Mysql Optimization - Find Remove Duplicate Indexes

Duplicate index means when we have multiple index in the same column of a table.

Example : duplicate on same column
CREATE TABLE DUPLICATE_INDEX (
  EMAIL varchar(100) NOT NULL,
  PRIMARY KEY (EMAIL),
  KEY EMAIL(EMAIL)
) ENGINE=MyISAM

In this example on column 'EMAIL' we have primary key and index as well and one of them is unnecessary. So ideally we should remove one of index from this table.

Example : duplicate because of composite index
CREATE TABLE DUPLICATE_INDEX (
  EMAIL varchar(100) NOT NULL,
  PHONE varchar(100) NOT NULL,
  UNIQUE KEY (EMAIL,PHONE),
  KEY EMAIL(EMAIL)
) ENGINE=MyISAM

In this example index on column is multiple because of index key EMAIL and UNIQUE key (EMAIL,PHONE) .so key on index EMAIL is irrelevant.

Why we need to remove these duplicate keys
1. They make the optimizer phase slower because MySQL needs to examine more query plans.
2. For every insert / delete / update  , the storage engine needs to maintain, calculate and update more index
3. Disk space will increase and hence the backup time.

How To Find Duplicate Keys
There is a tool in Percona Toolkit that can help you to find all those keys in your schema, its name is pt-duplicate-key-checkerand it can find the  different types of duplicates key


command: pt-duplicate-key-checker --database=test
Mysql Peformance


Sphinx Search Optimization

Sphinx search is an alternative of performing search in mysql .
From my experience sphinx will provide you a great deal of performance .
For Details http://sphinxsearch.com

This Page is about the optimization that i had done and will boost your performance

* Caching (Use Memcache)
sphinx directly does not support caching. An alternative is to use memcache. 
Based on your website requirements , say i fetch records of 1st 10 pages  and store records in memcache. So when user uses pagination then there is no need to peform a search operation , rather we can fetch records directly from memcache

* Use Of Multiquery / faceted searching
Multi-queries, or query batches, let you send multiple queries to Sphinx in one go.Two API methods that implement multi-query mechanism are AddQuery() and RunQueries(). 


Why To Use:
1. decrease in network roundtrips. 
2. automatic internal optimizations.
3. for new version performance will always increase.


Sequential Code
$cl->SetGroupBy('grouping1', SPH_GROUPBY_ATTR,"@count desc");
$cl->AddQuery ('textToSearch_OPTINAL','IndexName');
$res=$cl->RunQueries();

$cl->SetGroupBy('grouping2', SPH_GROUPBY_ATTR,"@count desc");
$cl->AddQuery ('textToSearch_OPTINAL','IndexName');
$res=$cl->RunQueries();

$cl->SetGroupBy('grouping3', SPH_GROUPBY_ATTR,"@count desc");
$cl->AddQuery ('textToSearch_OPTINAL','IndexName');
$res=$cl->RunQueries();

can be optimized by Multiquery
$cl->SetGroupBy('grouping1', SPH_GROUPBY_ATTR,"@count desc");
$cl->SetGroupBy('grouping2', SPH_GROUPBY_ATTR,"@count desc");
$cl->SetGroupBy('grouping3', SPH_GROUPBY_ATTR,"@count desc");
$cl->AddQuery ('textToSearch_OPTINAL','IndexName');
$res=$cl->RunQueries();

* Avoid use of setSelect('limitedcolumns)
setSelect(*) is much faster than setSelect('limitedcolumns)
Reason:
You have a bucket of colored marbels.If I ask you to pass me all (the *), then you just pass me the bucket.But if I ask you to pass me only the red and yellow ones, it takes you time, because you
have to pick out the right ones, and put them in a new bucket.
Smaller bucket (ie less network bandwidth), but takes longer (processing)

Sphinx Search

* Use Of SphinxQL (for large in/not-in queries)

client end computations are very high specially when a lot of values are being sent
in a IN or NOT IN clause because for every value it converts it into some binary format
using pack function and then sends to the sphinx server. This causes high load on our web
server because we have queries wherein we show online users within a search criteria and
sometimes online users are around 15k-20k.

example:  normal use of api is slow and cpu utilization consuming

for($i=100;$i<20000;$i++) {        $x[]=$i; }

$cl->SetFilter('PROFILEID',$x,TRUE);
use of sphinxQL will give cpu utilization improvement by 10 times

Distributed searching
To scale well, Sphinx has distributed searching capabilities. Distributed searching is useful to improve query latency (ie. search time) and throughput (ie. max queries/sec) in multi-server, multi-CPU or multi-core environments. This is essential for applications which need to search through huge amounts data (ie. billions of records and terabytes of text).

* Explore sphinx.conf
1. mem_limit                       = 32M (default allocation)
change it to around 2048MB based on your usage.