rq tool now on fedorahosted

So the RPM Query (rq) tool that I inherited as ‘srpm’ from Stew Benedict back in the day at Mandriva is now on fedorahosted,org. Unfortunately, it’s missing history because I had a dickens of a time trying to extract it from my existing svn repo (I got it so just the relevant commits where in there, but the entire history of the other repo came with it which does me no good). You can check it out at https://fedorahosted.org/rq/wiki.

But the problem I’m having now, and maybe someone on the lazyweb can help me with this, is that it is absolutely freaking slow. Querying the database is fast enough when you’re searching for something, but doing the imports and especially handling the updates is a process of hours. For instance, to have it manage the updates for Fedora 14 (the SRPMs), I’m looking at 1311 package updates and 353 new packages being added. It’s been running since last night (I don’t know the specific time it started, but it has to be well over 10hrs ago) and it’s still in the “remove old package records” stage. That seems highly excessive to me.

But the database is big:

rqs 0.6 ($Id: rqs 425 2011-01-21 23:37:19Z vdanen@ANNVIX.CA $)

Database statistics:

   Database  => User: rq, Host: localhost, Database: rqs
   Data size => 9.96 GB

   Tag records  : 5                Package records : 21629
   File records : 9554598          Source records  : 76031
   Ctags records: 56300434         Requires records: 100345

And what it is doing is first removing all of the old records (packages, files, ctags, requires, source files) before it can add in the new. Of course, it does this one at a time:

if self.type == 'source':
  tables = ('packages', 'sources', 'files', 'ctags', 'buildreqs')
for rnum in to_remove:
  r_count = r_count + 1
  for table in tables:
    query  = "DELETE FROM %s WHERE p_record = %d" % (table, rnum)
    result = self.db.do_query(query)

I think the problem is I’ve reached my limit of understanding of efficient MySQL as I’ve never dealt with data sets this huge. What would be the best way to handle this? Would using temporary tables work best? Would having each rpm repo on its own table work better? That would logistically be more work as I’d have to create SQL tables on the fly (and while I would like this thing to be portable between db types at some point, it currently isn’t — but this would have an impact when I want to make it portable). The problem then is if I want to query and look for ‘sshd.c’ for instance, I’d have to look across multiple tables (such as ‘rqs_f14_source’ and ‘rqs_rhel6_source’ instead of just ‘source’), but my concern is that might increase the query time (querying right now is pretty fast, and I don’t want to negatively impact querying times, but this import/update time now is ridiculous).

But maybe that is the answer? Multiple tables, grouped by repo (or tag, as rq calls them) as opposed to giant tables now? I mean, looking at the number of rows in the ctags table (56,300,434), this is a pretty huge dataset. Or can you remove from multiple tables at the same time (the p_record column is identical across tables… same name, same data).

Any hints to speed this sucker up?

7 Comments

  1. Niels de Vos

    Well, one suggestion to speed up the deletion is replacing

    for table in tables:
    query = “DELETE FROM %s WHERE p_record = %d” % (table, rnum)
    result = self.db.do_query(query)

    by

    query = “DELETE FROM %s WHERE p_record = %d” % (“,”.join(tables), rnum)
    result = self.db.do_query(query)

    It will still delete all the rows one-by-one, but at least you have only 1/5 of the SQL-statements.

    I guess you can also change the WHERE-clause in something like

    WHERE p_record IN ( )

    This can be a sub-query that constructs the to_remove list, or a temporary table that contains this information. You could even list all rnum’s as a list on the SQL-statement, but if the list is big, you probably shouldn’t ;)

    HTH,
    Niels

    Jan 29, 2011 @ 08:55:30
  2. vdanen

    No, I don’t think I want to ist all of the rnum’s. =) That would be a pretty big list. Thanks for the tip on cutting down the SQL statements; I didn’t know you could remove from multiple tables in one statement. I don’t know if that will help with the speed, but it probably can’t hurt.

    Sub-queries is totally new to me. I’ll have to look into that and see what can be done. Sounds interesting.

    FWIW, to finish that import it took:

    ./rqs -u f14 -P  4422.99s user 2641.62s system 8% cpu 22:29:35.69 total
    

    Which is an insanely long time. I _know_ the majority of that time was due to the deletions.

    Jan 29, 2011 @ 17:07:21
  3. Derek Williams

    I am much more familiar with Postgres, haven’t worked with MySQL (or Python for that matter) for a while and am not too familiar with optimizing for it, but I have some ideas.

    First, does each delete get immediatly synced to disk? In other words, are you waiting on IO? Running ‘top’ should show you the percentage of cpu that is waiting. Also running ‘ watch “cat /proc/vmstat | grep -A 1 dirty”‘ should let you keep track of how much dirty data you have. If it stays low (a few thousand) it probably means that it is getting written out to disk too often instead of being cached.

    I’m not sure how the python client interacts with MySQL, but wrapping the entire for loop in a transaction or a lock might speed things up as changes to the db are not done until the transaction is commited or tables are unlocked.

    As Niels mentioned previously, you might be able to use WHERE .. IN () depending on the size of to_remove:

    query = “DELETE FROM %s WHERE p_record IN (%s)” % (“,”.join(tables), “,”.join(rnum))

    I don’t know the query size limit, but you can always try splitting to_remove up into smaller chunks.

    I haven’t looked at the source code yet, only the sql table definitions. If I get time later I’ll check it out and see if I can find anything in there that could help.

    Jan 30, 2011 @ 10:07:10
  4. vdanen

    Hmmm… did a bit of googling today and I think I need to start looking at transactions. I’ve got to admit that most of the database work I’ve done, whether in postgres or MySQL, has been pretty light and hasn’t had datasets this huge, so I’m sure I need to normalize a bit more and optimize a heck of a lot.

    Yeah, each DELETE would likely be writing to disk immediately. The database is on an SSD drive which is making it faster, but I think I need to do something like:

    START TRANSACTION;
    [code to do lots of DELETEs]
    COMMIT;
    

    I _believe_ that will get things written to disk once for the whole thing. I can't do this on the additions I don't think because a lot of stuff depends on other things being committed (to get IDs to relate, etc.) and the actual importing is fairly quick all things considered (the bulk of the CPU/time is on unpacking rpms/tarballs and running ctags on everything). But I think the transaction stuff may really help here. Was looking at this:

    http://dev.mysql.com/doc/refman/5.0/en/commit.html

    Will have to do some playing if I get some time tomorrow. I think I also need to determine whether using MyISAM or InnoDB would be a better fit for this as well. Lots of reading to do I think. =)

    Found this too, today:

    http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/

    which may also have some good info (it looked decent at first glance).

    The other thing that I may look at and probably should have a while ago is SQLAlchemy, which is a database toolkit for python. There is probably stuff in there that would make all of this easier. The downside is it might take a bit of effort to change the code to use it rather than what I had already concocted.

    Jan 30, 2011 @ 21:46:09
  5. vdanen

    Ok, tried to do the multi-table delete, and it doesn’t work:

    
    mysql> DELETE FROM packages, sources, files, ctags, buildreqs WHERE p_record = 25071;
    ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE p_record = 25071' at line 1
    

    I think to do that I have to build a more complex query with a lot of JOINs and such… at least everytime I google “delete from multiple tables in mysql” that’s the kind of code I’m seeing in the results.

    I hate JOINs. Not bad when you’re only joining two tables, but I’m scared I’ll bungle it and instead of removing entire rows like I want, start putting holes in my data and leaving rows semi-intact.

    I’m going to see if transactions will speed things up first before trying to optimize what could turn into a really ugly query.

    Jan 31, 2011 @ 21:17:21
  6. vdanen

    Aha! Ok, I think I may have this figured out. My query times look like this:

    Jan 31 21:21:13 query is: DELETE FROM packages WHERE p_record = 25094
    Jan 31 21:21:13 query is: DELETE FROM sources WHERE p_record = 25094
    Jan 31 21:21:13 query is: DELETE FROM files WHERE p_record = 25094
    Jan 31 21:21:13 query is: DELETE FROM ctags WHERE p_record = 25094
    Jan 31 21:22:11 query is: DELETE FROM buildreqs WHERE p_record = 25094
    

    and looking at my SQL schema, there is no index on p_record! When I did it, I didn’t really need to do any lookups by p_record and certainly no deletes, so I think if I add this to my schema:

    @@ -101,7 +101,8 @@
       `c_file` text NOT NULL,
       PRIMARY KEY  (`c_record`),
       KEY `rec` USING BTREE (`c_record`),
    -  KEY `trec` USING BTREE (`t_record`)
    +  KEY `trec` USING BTREE (`t_record`),
    +  KEY `prec` USING BTREE (`p_record`)
     ) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
    

    then perhaps I’ll get better performance. I also need to see if InnoDB will give me better performance. I don’t know much about the difference between MyISAM and InnoDB so I’ll have to do some reading. By the sounds of things, transactions don’t really work in MyISAM (everything is written to disk immediately).

    Jan 31, 2011 @ 21:29:55
  7. vdanen

    It’s funny how once you start writing out or talking about a problem that the solution becomes more apparent, and the mistake is stupidly obvious. I love these one-character mistakes.

    The problem was definitely the lack of p_record being indexed, but it wasn’t because I didn’t think of it… c_record was indexed twice which makes me think that many months ago, I made a typo and put c_record when I meant to put p_record.

    With that change and adding the p_record index and removing the duplicate c_record index, what took ~60 seconds per package is now taking about one second. So the 1300+ package update that took about 22hrs (which makes sense, given each of those 1300 packages took about a minute each to remove), should turn into 22 minutes or so.

    Looks like I won’t need to spend the time looking into SQLAlchemy just yet, or trying to see if InnoDB is more advantageous than MyISAM. =) Yaaa me.

    Jan 31, 2011 @ 22:30:51

Leave a Reply

*