The following points should be considered for RDF query performance:
For running with large data sets, one should configure the Virtuoso process to use between 3/5 and 2/3 of system RAM and to stripe storage on all available disks. See the NumberOfBuffers and Striping ini parameters.
Also, if running with a large database, setting MaxCheckpointRemap to about 1/4 of the database size is recommended. Both values are counted in pages of 8K.
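As a sketch, on a 16 GB Linux box with two data disks, the relevant virtuoso.ini entries could look like the following. The figures and file paths are illustrative assumptions, not prescriptions; size NumberOfBuffers against your own RAM (8K per buffer) and MaxCheckpointRemap against your own database page count.

[Database]
Striping = 1

[Parameters]
NumberOfBuffers    = 1200000   ; 8K each, ~9.6 GB, i.e. roughly 3/5 of 16 GB RAM
MaxCheckpointRemap = 250000    ; ~1/4 of a hypothetical 1M-page (8 GB) database

[Striping]
Segment1 = 1024M, /disk1/db/virt-seg1-1.db, /disk2/db/virt-seg1-2.db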
If the graph is always given, as one or more FROM or FROM NAMED clauses, and there are no patterns where only the graph and predicate are given, then the default indices are appropriate. If the predicate and graph are given but the subject is not, then it is sometimes useful to add:
create bitmap index RDF_QUAD_PGOS on DB.DBA.RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));
If the server is pre 5.0.7, leave out the partitioning clause.
Making the PGOS index can help in some cases even if it is not readily apparent from the queries that one is needed. This is so for example if the predicate by itself is selective, i.e. there is a predicate that occurs in only a few triples.
If the graph itself is not given in the queries, then the default index scheme will be unworkable. For this, the appropriate scheme is:
create table RDF_QUAD (G iri_id_8, S iri_id_8, P iri_id_8, O any,
    primary key (S, P, O, G))
alter index RDF_QUAD on RDF_QUAD partition (S int (0hexffff00));

create bitmap index RDF_QUAD_OPGS on DB.DBA.RDF_QUAD (O, P, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_GPOS on DB.DBA.RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));
For a pre 5.0.7 server, leave the partition clauses and the alter index statement out.
If there are existing triples and one does not wish to reload them, then the following sequence will convert the data:
log_enable (2);

drop index RDF_QUAD_OGPS;
checkpoint;

create table R2 (G iri_id_8, S iri_id_8, P iri_id_8, O any,
    primary key (S, P, O, G))
alter index R2 on R2 partition (S int (0hexffff00));

insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table RDF_QUAD;
checkpoint;
alter table r2 rename RDF_QUAD;

create bitmap index RDF_QUAD_OPGS on DB.DBA.RDF_QUAD (O, P, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_POGS on RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_GPOS on RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));

checkpoint;
log_enable (1);
First drop the OGPS index to make space. Then, in row autocommit mode and without logging, copy the quads into a new primary key layout. Drop the old and rename the new over the old. Make the additional indices. Do a checkpoint after the drops so as to actually free the space also in the checkpointed state. Finish with a checkpoint so as to finalize the changes, since logging was turned off. Even if logging had been on, one would not wish to have to replay the reindexing if the server terminated abnormally. Finally turn logging back on for the session.
This is all meant to be done with a SQL client like isql and not through a web interface. The web interface has no real session and the log_enables do nothing there.
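For instance, assuming the statements above are saved in a file named convert.sql and the server listens on the default port 1111 (file name and credentials are placeholders), the sequence can be run as:

isql 1111 dba dba convert.sql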
Other indexing schemes may be tried. We note however that in all cases, one or other of the indices should begin with G. This is because for schema operations it is necessary to read through a graph. If no index begins with G, this becomes a full table scan and is unworkable, leading to an extremely slow server start and making operations like drop graph as good as unusable.
Public web service endpoints are a proven source of especially bad queries. While local application developers can get instructions from the database administrator and use ISQL access to the database to tune execution plans, "external" clients do not know the details of the configuration and/or lack the appropriate skills. The most common problem is that public endpoints usually receive queries that do not mention the required graph, because those queries were originally written for use with triple stores. If the web service provides access to a single graph (or to a short list of graphs), it is strongly recommended to configure it by adding a row to DB.DBA.SYS_SPARQL_HOST. The idea is that a client who specifies a default graph in the request, or uses named graphs and group graph patterns, is probably smarter than average and will provide meaningful queries. If no graph names are specified, the query will benefit from the preset graph, because it gives the compiler more indexes to choose from -- indexes that begin with G.
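As a sketch, assuming the stock layout of DB.DBA.SYS_SPARQL_HOST (SH_HOST, SH_GRAPH_URI, SH_USER_URI, SH_DEFINES) and a hypothetical host name and graph IRI, presetting the graph for an endpoint could look like:

-- requests arriving at this host default to the given graph
-- (host and graph IRI are illustrative)
insert into DB.DBA.SYS_SPARQL_HOST (SH_HOST, SH_GRAPH_URI)
  values ('www.example.com:8890', 'http://example.com/dataset');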
Sometimes a web service endpoint is used to access the data of only one application, not all data in the system. In that case one may wish to declare a separate storage consisting of only the RDF Views made by that application, and set input:storage in the appropriate row of DB.DBA.SYS_SPARQL_HOST.
The selectivity of triple patterns is determined at query compile time by sampling the data. The sampling can occasionally produce misleading estimates. To see whether the cardinality guesses are generally valid, look at the query plan with explain ().
Below is a sample from the LUBM qualification data set in the Virtuoso distribution. After running make test in binsrc/test/lubm, there is a loaded database with the data. Start a server in the same directory to see the data.
SQL> explain ('sparql prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    select * from <lubm> where { ?x rdf:type ub:GraduateStudent }');
REPORT
VARCHAR
_______________________________________________________________________________

{
Precode:
      0: $25 "callret" := Call __BOX_FLAGS_TWEAK (<constant (lubm)>, <constant (1)>)
      5: $26 "lubm" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($25 "callret")
     12: $27 "callret" := Call __BOX_FLAGS_TWEAK (<constant (http://www.w3.org/1999/02/22-rdf-syntax-ns#type)>, <constant (1)>)
     17: $28 "-ns#type" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($27 "callret")
     24: $29 "callret" := Call __BOX_FLAGS_TWEAK (<constant (http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent)>, <constant (1)>)
     29: $30 "owl#GraduateStudent" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ($29 "callret")
     36: BReturn 0
from DB.DBA.RDF_QUAD by RDF_QUAD_OGPS    1.9e+03 rows
Key RDF_QUAD_OGPS  ASC  ($32 "s-3-1-t0.S")
<col=415 O = $30 "owl#GraduateStudent"> , <col=412 G = $26 "lubm"> , <col=414 P = $28 "-ns#type">
 row specs: <col=415 O LIKE <constant (T)>>
 Current of: <$34 "<DB.DBA.RDF_QUAD s-3-1-t0>" spec 5>

After code:
      0: $35 "x" := Call ID_TO_IRI ($32 "s-3-1-t0.S")
      5: BReturn 0
Select ($35 "x", <$34 "<DB.DBA.RDF_QUAD s-3-1-t0>" spec 5>)
}

22 Rows. -- 1 msec.
This finds the graduate student instances in the lubm graph. First the query converts the IRI literals to ids. Then, matching O and G on the OGPS index, it finds the IRIs of the graduate students. Finally it converts each IRI id back to its string form for the result.
The cardinality estimate of 1.9e+03 rows is on the FROM line.
Doing an explain on the queries will show the cardinality estimates. To drill down further, one can split the query into smaller chunks and see the estimates for these, up to doing it at the triple pattern level. To indicate a variable that is bound but whose value is not a literal known at compile time, one can use the parameter marker ??.
explain ('sparql define sql:table-option "order" prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> select * from <lubm> where { ?x rdf:type ?? }');
This will not know the type but will know that a type will be provided. So instead of guessing 1900 matches, this will guess a smaller number, which is obviously less precise. Thus literals are generally better.
In some cases, generally to work around an optimization error, one can specify an explicit join order. This is done with the sql:select-option "order" clause in the SPARQL query prefix.
select sparql_to_sql_text (' define sql:select-option "order" prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> select * from <lubm> where { ?x rdf:type ub:GraduateStudent . ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> }');
shows the SQL text with the order option at the end.
If an estimate is radically wrong then this should be reported as a bug.
If there is a FROM with a KEY on the next line and no column specs, then this is a full table scan. The more columns are specified, the fewer rows will be passed to the next operation in the chain. In the example above, there are three columns whose values are known before reading the table, and these columns are the leading columns of the index in use, so the column specs are:
<col=415 O = $30 "owl#GraduateStudent"> , <col=412 G = $26 "lubm"> , <col=414 P = $28 "-ns#type">
A KEY with only a row spec is a full table scan with the row spec applied as a filter. This is usually not good unless this is specifically intended.
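As an illustration, a query that binds no leading key parts of any index, such as the one below, would be expected to compile to such a scan. This is a hypothetical check, not output from the source data set:

explain ('sparql select * where { ?s ?p ?o }');
-- expect a FROM ... by <index> line with no column specs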
If queries are compiled to make full table scans when this is not specifically intended, this should be reported as a bug. The explain output and the query text should be included in the report.
An explicit join order is specified by the define sql:select-option "order" clause in the SPARQL query prefix. Consider:
explain ('sparql define sql:select-option "order, loop" prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> select * from <lubm> where { ?x ub:takesCourse ?c . ?x rdf:type ub:GraduateStudent }');
One will see in the output that the first table access retrieves everything in the lubm graph that takes some course, and only later checks whether each such subject is a graduate student. This is obviously not the preferred order, but sql:select-option "order" forces the optimizer to join from left to right.
It is very easy to end up with completely unworkable query plans in this manner but if the optimizer really is in error, then this is the only way of overriding its preferences. The effect of sql:select-option is pervasive, extending inside unions, optionals, subqueries etc within the statement.
We note that if, in the above query, both the course taken by the student and the type of the student are given, the query compilation will be, at least for all non-cluster cases, an index intersection. This is not overridden by the sql:select-option clause since an index intersection is always a safe guess, regardless of the correctness of the cardinality guesses of the patterns involved.
There are many functions for loading RDF text, in RDF/XML and Turtle.
For loading RDF/XML, the best way is to split the data to be loaded into multiple streams and load these in parallel using rdf_load_rdfxml (). To avoid running out of rollback space with large files, and so that multiple concurrent loads do not interfere with each other, the row autocommit mode should be enabled.
For example,
log_enable (2); -- switch row-by-row autocommit on and logging off for this session

db..rdf_load_rdfxml (file_to_string_output ('file.xml'), 'base_uri', 'target_graph');
-- more files here
...
checkpoint;
Executing a file of statements like the above through isql loads the data. Since transaction logging is off, make a manual checkpoint at the end to ensure the data is persisted across a server restart, as there is no roll forward log.
If large amounts of data are to be loaded, run multiple such streams in parallel. One may have for example 6 streams for 4 cores. This means that if up to two threads wait for disk, there is still work for all cores.
Having substantially more threads than processors or disks is not particularly useful.
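A minimal sketch of such parallel streams, assuming the load calls have been split across files stream1.sql ... stream6.sql and a server on port 1111 (file names and credentials are placeholders):

#!/bin/sh
# start six concurrent load streams, one isql client per file
for i in 1 2 3 4 5 6
do
  isql 1111 dba dba stream$i.sql &
done
wait
# logging was off in the streams, so persist the result explicitly
isql 1111 dba dba exec="checkpoint;"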
There exist multithreaded load functions which will load one file on multiple threads. Experience shows that loading multiple files on one thread per file is better.
For loading Turtle, some platforms may have a non-reentrant Turtle parser. This means that only one load may run at a time. One can try this by calling ttlp () from two sessions at the same time. If these do not execute concurrently, then the best way may be to try ttlp_mt and see if this runs faster than a single threaded ttlp call.
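For reference, the two calls compare as below; the file and graph names are placeholders, and the trailing argument is the parser flags value used elsewhere in this section:

-- single-threaded parse and load
ttlp (file_to_string_output ('data.ttl'), '', 'http://example.com/g', 255);
-- multithreaded variant, same leading arguments
ttlp_mt (file_to_string_output ('data.ttl'), '', 'http://example.com/g', 255);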
To load RDF data into a LOD instance, perform the following steps:
-- Alternate RDF index scheme for cases where G unspecified

drop index RDF_QUAD_OGPS;
checkpoint;

create table R2 (G iri_id_8, S iri_id_8, P iri_id_8, O any,
    primary key (S, P, O, G))
alter index R2 on R2 partition (S int (0hexffff00));

log_enable (2);

insert into R2 (G, S, P, O) select G, S, P, O from rdf_quad;

drop table RDF_QUAD;
alter table r2 rename RDF_QUAD;
checkpoint;

create bitmap index RDF_QUAD_OPGS on RDF_QUAD (O, P, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_POGS on RDF_QUAD (P, O, G, S) partition (O varchar (-1, 0hexffff));
create bitmap index RDF_QUAD_GPOS on RDF_QUAD (G, P, O, S) partition (O varchar (-1, 0hexffff));
checkpoint;
create table load_list (
  ll_file varchar,
  ll_graph varchar,
  ll_state int default 0, -- 0 not started, 1 going, 2 done
  ll_started datetime,
  ll_done datetime,
  ll_host int,
  ll_work_time integer,
  ll_error varchar,
  primary key (ll_file))
alter index load_list on load_list partition (ll_file varchar)
;

create index ll_state on load_list (ll_state, ll_file, ll_graph) partition (ll_state int)
;

create table ldlock (id int primary key)
alter index ldlock on ldlock partition (id int)
;

insert into ldlock values (0);

create procedure ld_dir (in path varchar, in mask varchar, in graph varchar)
{
  declare ls any;
  declare inx int;
  ls := sys_dirlist (path, 1);
  for (inx := 0; inx < length (ls); inx := inx + 1)
    {
      if (ls[inx] like mask)
        {
          set isolation = 'serializable';
          if (not (exists (select 1 from DB.DBA.LOAD_LIST where LL_FILE = path || '/' || ls[inx] for update)))
            {
              declare gfile, cgfile, ngraph varchar;
              gfile := path || '/' || replace (ls[inx], '.gz', '') || '.graph';
              cgfile := path || '/' || regexp_replace (replace (ls[inx], '.gz', ''), '\\-[0-9]+\\.n', '.n') || '.graph';
              if (file_stat (gfile) <> 0)
                ngraph := trim (file_to_string (gfile), ' \r\n');
              else if (file_stat (cgfile) <> 0)
                ngraph := trim (file_to_string (cgfile), ' \r\n');
              else if (file_stat (path || '/' || 'global.graph') <> 0)
                ngraph := trim (file_to_string (path || '/' || 'global.graph'), ' \r\n');
              else
                ngraph := graph;
              if (ngraph is not null)
                {
                  insert into DB.DBA.LOAD_LIST (ll_file, ll_graph) values (path || '/' || ls[inx], ngraph);
                }
            }
          commit work;
        }
    }
}
;

create procedure ld_dir_all (in path varchar, in mask varchar, in graph varchar)
{
  declare ls any;
  declare inx int;
  ls := sys_dirlist (path, 0);
  ld_dir (path, mask, graph);
  for (inx := 0; inx < length (ls); inx := inx + 1)
    {
      if (ls[inx] <> '.' and ls[inx] <> '..')
        {
          ld_dir_all (path || '/' || ls[inx], mask, graph);
        }
    }
}
;

create procedure ld_add (in _fname varchar, in _graph varchar)
{
  --log_message (sprintf ('ld_add: %s, %s', _fname, _graph));
  set isolation = 'serializable';
  if (not (exists (select 1 from DB.DBA.LOAD_LIST where LL_FILE = _fname for update)))
    {
      insert into DB.DBA.LOAD_LIST (LL_FILE, LL_GRAPH) values (_fname, _graph);
    }
  commit work;
}
;

create procedure ld_ttlp_flags (in fname varchar)
{
  if (fname like '%/btc-2009%' or fname like '%.nq%')
    return 255 + 512;
  return 255;
}
;

create procedure ld_file (in f varchar, in graph varchar)
{
  declare gzip_name varchar;
  declare exit handler for sqlstate '*'
    {
      rollback work;
      update DB.DBA.LOAD_LIST
        set LL_STATE = 2, LL_DONE = curdatetime (), LL_ERROR = __sql_state || ' ' || __sql_message
        where LL_FILE = f;
      commit work;
      log_message (sprintf (' File %s error %s %s', f, __sql_state, __sql_message));
      return;
    };
  if (f like '%.grdf' or f like '%.grdf.gz')
    {
      load_grdf (f);
    }
  else if (f like '%.gz')
    {
      gzip_name := regexp_replace (f, '\.gz\x24', '');
      if (gzip_name like '%.xml' or gzip_name like '%.owl' or gzip_name like '%.rdf')
        DB.DBA.RDF_LOAD_RDFXML (gz_file_open (f), graph, graph);
      else if (gzip_name like '%.n4')
        TTLP (gz_file_open (f), graph, graph, 512 + 255);
      else
        TTLP (gz_file_open (f), graph, graph, ld_ttlp_flags (f));
    }
  else
    {
      if (f like '%.xml' or f like '%.owl' or f like '%.rdf')
        DB.DBA.RDF_LOAD_RDFXML (file_open (f), graph, graph);
      else if (f like '%.n4')
        TTLP (file_open (f), graph, graph, 512 + 255);
      else
        TTLP (file_open (f), graph, graph, ld_ttlp_flags (f));
    }
  --log_message (sprintf ('loaded %s', f));
}
;

create procedure rdf_load_dir (in path varchar, in mask varchar := '%.nt', in graph varchar := 'http://dbpedia.org')
{
  delete from DB.DBA.LOAD_LIST where LL_FILE = '##stop';
  commit work;
  ld_dir (path, mask, graph);
  rdf_loader_run ();
}
;

create procedure ld_array ()
{
  declare first, last, arr, len, local any;
  declare cr cursor for
      select top 100 LL_FILE, LL_GRAPH
        from DB.DBA.LOAD_LIST table option (index ll_state)
        where LL_STATE = 0
        for update;
  declare fill int;
  declare f, g varchar;
  declare r any;
  whenever not found goto done;
  first := 0;
  last := 0;
  arr := make_array (100, 'any');
  fill := 0;
  open cr;
  len := 0;
  for (;;)
    {
      fetch cr into f, g;
      if (0 = first) first := f;
      last := f;
      arr[fill] := vector (f, g);
      len := len + cast (file_stat (f, 1) as int);
      fill := fill + 1;
      if (len > 2000000)
        goto done;
    }
done:
  if (0 = first) return 0;
  if (1 <> sys_stat ('cl_run_local_only'))
    local := sys_stat ('cl_this_host');
  update load_list set ll_state = 1, ll_started = curdatetime (), LL_HOST = local
    where ll_file >= first and ll_file <= last;
  return arr;
}
;

create procedure rdf_loader_run (in max_files integer := null, in log_enable int := 2)
{
  declare sec_delay float;
  declare _f, _graph varchar;
  declare arr any;
  declare xx, inx, tx_mode, ld_mode int;
  ld_mode := log_enable;
  if (0 = sys_stat ('cl_run_local_only'))
    {
      if (log_enable = 2 and cl_this_host () = 1)
        {
          cl_exec ('checkpoint_interval (0)');
          cl_exec ('__dbf_set (''cl_non_logged_write_mode'', 1)');
        }
      if (cl_this_host () = 1)
        cl_exec ('__dbf_set (''cl_max_keep_alives_missed'', 3000)');
    }
  tx_mode := bit_and (1, log_enable);
  log_message ('Loader started');
  delete from DB.DBA.LOAD_LIST where LL_FILE = '##stop';
  commit work;
  while (1)
    {
      set isolation = 'repeatable';
      declare exit handler for sqlstate '40001'
        {
          rollback work;
          sec_delay := rnd (1000) * 0.001;
          log_message (sprintf ('deadlock in loader, waiting %d milliseconds', cast (sec_delay * 1000 as integer)));
          delay (sec_delay);
          goto again;
        };
    again:;
      if (exists (select 1 from DB.DBA.LOAD_LIST where LL_FILE = '##stop'))
        {
          log_message ('File load stopped by rdf_load_stop.');
          return;
        }
      log_enable (tx_mode, 1);
      if (max_files is not null and max_files <= 0)
        {
          commit work;
          log_message ('Max_files reached. Finishing.');
          return;
        }
      whenever not found goto looks_empty;
      -- log_message ('Getting next file.');
      set isolation = 'serializable';
      select id into xx from ldlock where id = 0 for update;
      arr := ld_array ();
      commit work;
      if (0 = arr)
        goto looks_empty;
      log_enable (ld_mode, 1);
      for (inx := 0; inx < 100; inx := inx + 1)
        {
          if (0 = arr[inx])
            goto arr_done;
          ld_file (arr[inx][0], arr[inx][1]);
          update DB.DBA.LOAD_LIST set LL_STATE = 2, LL_DONE = curdatetime () where LL_FILE = arr[inx][0];
        }
    arr_done:
      log_enable (tx_mode, 1);
      if (max_files is not null)
        max_files := max_files - 100;
      commit work;
    }
looks_empty:
  commit work;
  log_message ('No more files to load. Loader has finished.');
  return;
}
;

create procedure rdf_load_stop (in force int := 0)
{
  insert into DB.DBA.LOAD_LIST (LL_FILE) values ('##stop');
  commit work;
  if (force)
    cl_exec ('txn_killall (1)');
}
;

create procedure RDF_LOADER_RUN_1 (in x int, in y int)
{
  rdf_loader_run (x, y);
}
;

create procedure rdf_ld_srv (in log_enable int)
{
  declare aq any;
  aq := async_queue (1);
  aq_request (aq, 'DB.DBA.RDF_LOADER_RUN_1', vector (null, log_enable));
  aq_wait_all (aq);
}
;

create procedure load_grdf (in f varchar)
{
  declare line any;
  declare inx int;
  declare ses any;
  declare gr varchar;
  if (f like '%.gz')
    ses := gz_file_open (f);
  else
    ses := file_open (f);
  inx := 0;
  line := '';
  while (line <> 0)
    {
      gr := ses_read_line (ses, 0, 0, 1);
      if (gr = 0)
        return;
      line := ses_read_line (ses, 0, 0, 1);
      if (line = 0)
        return;
      DB.DBA.RDF_LOAD_RDFXML (line, gr, gr);
      inx := inx + 1;
    }
}
;

-- cl_exec ('set lock_escalation_pct = 110');
-- cl_exec ('DB.DBA.RDF_LD_SRV (1)') &
-- cl_exec ('DB.DBA.RDF_LD_SRV (2)') &
SQL>cl_exec ('checkpoint;');
SQL>ld_dir ('/dbs/data', '*.gz', 'http://dbpedia.org');
SQL>rdf_loader_run();
To load the UniProt data, create a function such as:
create function DB.DBA.UNIPROT_LOAD (in log_mode integer := 1)
{
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename1'), 'http://base_uri_1', 'destination_graph_1', log_mode, 3);
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename2'), 'http://base_uri_2', 'destination_graph_2', log_mode, 3);
  ...
  DB.DBA.RDF_LOAD_RDFXML_MT (file_to_string_output ('filename9'), 'http://base_uri_9', 'destination_graph_9', log_mode, 3);
}
If you are starting with a blank database that can be dropped and re-created should an error be signaled, use it this way:
checkpoint;
checkpoint_interval (6000);
DB.DBA.UNIPROT_LOAD (0);
checkpoint;
checkpoint_interval (60);
If the database already contains important data and there is no way to stop it and take a backup before the load, then use:
checkpoint;
checkpoint_interval (6000);
DB.DBA.UNIPROT_LOAD ();
checkpoint;
checkpoint_interval (60);
Note that the 'number of threads' parameter of DB.DBA.RDF_LOAD_RDFXML_MT() refers to the threads used to process data from the file; an extra thread reads and parses the text, so for 4 CPU cores there is no need for a value greater than 3. Three processing threads per parsing thread is usually a good ratio, because parsing is roughly three times faster than the rest of the loading, so the CPU load stays well balanced. If, for example, you are using a 2 x quad-core Xeon, you can choose between 8 single-threaded parsers or 2 parsers with 3 processing threads each. With 4 cores you may simply load file after file with 3 processing threads. The most important performance tuning is the [Parameters] section of the Virtuoso configuration file:
NumberOfBuffers    = 1000000
MaxDirtyBuffers    = 800000
MaxCheckpointRemap = 1000000
DefaultIsolation   = 2
Note: these numbers are reasonable for a 16 GB RAM Linux box. When no massive operation such as loading a huge database is under way, you can raise the values to:
NumberOfBuffers    = 1500000
MaxDirtyBuffers    = 1200000
MaxCheckpointRemap = 1500000
DefaultIsolation   = 2
Tip: After loading all data you may therefore wish to shut the server down, adjust these parameters, and start it again. If you have an ext2 or ext3 filesystem, it is also better to keep enough free space that the disk is no more than 80% full. When the disk is almost full, the database file may be allocated badly, with a measurable loss of disk access speed. This is not specific to Virtuoso, but a common hint for all database-style applications doing random access to big files.
Here is an example awk script for splitting a big RDF/XML file into smaller ones:
BEGIN {
  file_part = 1000
  e_line = "</rdf:RDF>"
  cur = 0
  cur_o = 0
  file = 0
  part = file_part
}
{
  res_file_i = "res/" FILENAME
  line = $0
  s = $1
  res_file = res_file_i "_" file ".rdf"
  if (index (s, "</rdf:Description>") == 1)
    {
      cur = cur + 1
      part = part - 1
    }
  if (part > 0)
    {
      print line >> res_file
    }
  if (part == 0)
    {
      # print "===================== " cur
      print line >> res_file
      print e_line >> res_file
      close (res_file)
      file = file + 1
      part = file_part
      res_file = res_file_i "_" file ".rdf"
      system ("cp beg.txt " res_file)
    }
}
END {
}
You can use the following script as an example for loading the DBpedia RDF data into Virtuoso:
#!/bin/sh

PORT=$1
USER=$2
PASS=$3
file=$4
g=$5
LOGF=`basename $0`.log

if [ -z "$PORT" -o -z "$USER" -o -z "$PASS" -o -z "$file" -o -z "$g" ]
then
  echo "Usage: `basename $0` [DSN] [user] [password] [ttl-file] [graph-iri]"
  exit
fi

if [ ! -f "$file" -a ! -d "$file" ]
then
  echo "$file does not exist"
  exit 1
fi

mkdir READY 2>/dev/null
rm -f $LOGF $LOGF.*

echo "Starting..."
echo "Logging into: $LOGF"

DOSQL ()
{
  isql $PORT $USER $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="$1" > $LOGF
}

LOAD_FILE ()
{
  f=$1
  g=$2
  echo "Loading $f (`cat $f | wc -l` lines) `date \"+%H:%M:%S\"`" | tee -a $LOGF
  DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF
  if [ $? != 0 ]
  then
    echo "An error occurred, please check $LOGF"
    exit 1
  fi
  line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
  newf=$f.part
  inx=1
  while [ ! -z "$line_no" ]
  do
    cat $f | awk "BEGIN { i = 1 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }" >> bad.nt
    line_no=`expr $line_no + 1`
    echo "Retrying from line $line_no"
    echo "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ." > tmp.nt
    cat $f | awk "BEGIN { i = 1 } { if (i>=$line_no) print \$0; i = i + 1 }" >> tmp.nt
    mv tmp.nt $newf
    f=$newf
    mv $LOGF $LOGF.$inx
    DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 17); checkpoint;" > $LOGF
    if [ $? != 0 ]
    then
      echo "An error occurred, please check $LOGF"
      exit 1
    fi
    line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
    inx=`expr $inx + 1`
  done
  rm -f $newf 2>/dev/null
  echo "Loaded. "
}

echo "======================================="
echo "Loading started."
echo "======================================="

if [ -f "$file" ]
then
  LOAD_FILE $file $g
  mv $file READY 2>> /dev/null
elif [ -d "$file" ]
then
  for ff in `find $file -name '*.nt'`
  do
    LOAD_FILE $ff $g
    mv $ff READY 2>> /dev/null
  done
else
  echo "The input is not file or directory"
fi

echo "======================================="
echo "Final checkpoint."
DOSQL "checkpoint;" > temp.res
echo "======================================="
echo "Check bad.nt file for skipped triples."
echo "======================================="

exit 0
The shell script below was used to import files in N3 notation into the OpenLink Virtuoso RDF store.
When a syntax error is encountered, it cuts the content from the offending line and retries from the next line. It was used on Ubuntu Linux to import the Bio2RDF and Freebase dumps.
Note that it uses gawk, which must be available on the system where it is tried. Error recovery also needs additional free disk space, up to the size of the original file.
#!/bin/bash

PASS=$1
f=$2
g=$3

# Usage
if [ -z "$PASS" -o -z "$f" -o -z "$g" ]
then
  echo "Usage: $0 [password] [ttl-file] [graph-iri]"
  exit
fi

if [ ! -f "$f" ]
then
  echo "$f does not exist"
  exit
fi

# Your port here
PORT=1111 #`inifile -f dbpedia.ini -s Parameters -k ServerPort`
if test -z "$PORT"
then
  echo "Cannot find INI and inifile command"
  exit
fi

# Initial run
isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log

# If disconnect etc.
if [ $? != 0 ]
then
  echo "An error occurred, please check $0.log"
  exit
fi

# Check for error
line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
newf=$f.part
inx=1

# Error recovery
while [ ! -z "$line_no" ]
do
  cat $f | awk "BEGIN { i = 0 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }" >> bad.nt
  line_no=`expr $line_no + 1`
  echo "Retrying from line $line_no"
  cat $f | awk "BEGIN { i = 0 } { if (i>=$line_no) print \$0; i = i + 1 }" > tmp.nt
  mv tmp.nt $newf
  f=$newf
  mv $0.log $0.log.$inx
  # Run the recovered part
  isql $PORT dba $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="ttlp_mt (file_to_string_output ('$f'), '', '$g'); checkpoint;" > $0.log
  if [ $? != 0 ]
  then
    echo "An error occurred, please check $0.log"
    exit
  fi
  line_no=`grep Error $0.log | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
  inx=`expr $inx + 1`
done
Since SPARUL updates are generally not meant to be transactional, it is best to run them in log_enable (2) mode, which commits each operation as it is done. This prevents running out of rollback space. For bulk updates, transaction logging can also be turned off; if so, do a manual checkpoint after the operation to ensure persistence across a server restart, since there is no roll forward log.
To have a roll forward log and row-by-row autocommit, one may use log_enable (3). This writes constantly to the log, which takes extra time. Having no logging and doing a checkpoint when the whole job is finished is faster.
Many SPARUL operations can be run in parallel in this way. If they are independent with respect to their input and output, they can run concurrently, and row-by-row autocommit ensures they do not end up waiting for each other's locks.
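A sketch of the pattern, with an illustrative graph and predicates:

log_enable (2);  -- row-by-row autocommit, no transaction log

SPARQL INSERT INTO GRAPH <http://example.com/g>
  { ?s <http://example.com/derived> ?o }
WHERE { GRAPH <http://example.com/g>
        { ?s <http://example.com/source> ?o } };

checkpoint;      -- nothing was logged, so persist explicitly
log_enable (1);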
We ran the DBpedia benchmark queries again with different configurations of Virtuoso. Comparing numbers given by different parties is a constant problem. In the case reported here, we loaded the full DBpedia 3, all languages, about 198M triples, onto Virtuoso v5 and Virtuoso Cluster v6, all on the same 4 core 2GHz Xeon with 8G RAM. All databases were striped on 6 disks. The Cluster configuration was 4 processes in the same box. We ran the queries in two variants: first with the graph specified in the queries, then with no graph specified.
The times below are for the sequence of 5 queries. One query in the set specifies no condition on S or O, only on P; this cannot be answered with the default indices on Virtuoso v5. Virtuoso Cluster v6 can answer it, because v6 is more space efficient. So we added the index:
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s);
|      | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with gspo, ogps | Virtuoso Cluster v6 with gspo, ogps, pogs |
|------|---|---|---|
| cold | 210 s | 136 s | 33.4 s |
| warm | 0.600 s | 4.01 s | 0.628 s |
Now let us do it without a graph being specified. Note that alter index is valid for v6 or higher. For all platforms, we drop any existing indices, and:
create table r2 (g iri_id_8, s iri_id_8, p iri_id_8, o any,
    primary key (s, p, o, g))
alter index R2 on R2 partition (s int (0hexffff00));

log_enable (2);

insert into r2 (g, s, p, o) select g, s, p, o from rdf_quad;

drop table rdf_quad;
alter table r2 rename RDF_QUAD;

create bitmap index rdf_quad_opgs on rdf_quad (o, p, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_pogs on rdf_quad (p, o, g, s) partition (o varchar (-1, 0hexffff));
create bitmap index rdf_quad_gpos on rdf_quad (g, p, o, s) partition (o varchar (-1, 0hexffff));
The code is identical for v5 and v6, except that with v5 we use iri_id (32 bit) for the type, not iri_id_8 (64 bit). We note that we run out of IDs with v5 around a few billion triples, so with v6 we have double the ID length and still manage to be vastly more space efficient.
With the above 4 indices, we can query the data pretty much in any combination without hitting a full scan of any index. We note that all indices that do not begin with s end with s as a bitmap. This takes about 60% of the space of a non-bitmap index for data such as DBpedia.
If you intend to do completely arbitrary RDF queries in Virtuoso, then chances are you are best off with the above index scheme.
|      | Virtuoso v5 with gspo, ogps, pogs | Virtuoso Cluster v6 with gspo, ogps, pogs |
|------|---|---|
| warm | 0.595 s | 0.617 s |
The cold times were about the same as above, so not reproduced.
It is in the SPARQL spirit to specify a graph and for pretty much any application, there are entirely sensible ways of keeping the data in graphs and specifying which ones are concerned by queries. This is why Virtuoso is set up for this by default.
On the other hand, for the open web scenario, dealing with an unknown large number of graphs, enumerating graphs is not possible and questions like which graph of which source asserts x become relevant. We have two distinct use cases which warrant different setups of the database, simple as that.
The latter use case is not really within the SPARQL spec, so implementations may or may not support this.
Once the indices are right, there is no difference between specifying a graph and not specifying a graph with the queries considered. With more complex queries, specifying a graph or set of graphs does allow some optimizations that cannot be done with no graph specified. For example, bitmap intersections are possible only when all leading key parts are given.
The best warm cache time is with v5; the five queries run under 600 ms after the first go. This is noted to show that all-in-memory with a single thread of execution is hard to beat.
Cluster v6 performs the same queries in 623 ms. What is gained in parallelism is lost in latency if all operations complete in microseconds. On the other hand, Cluster v6 leaves v5 in the dust in any situation that has less than 100% hit rate. This is due to actual benefit from parallelism if operations take longer than a few microseconds, such as in the case of disk reads. Cluster v6 has substantially better data layout on disk, as well as fewer pages to load for the same content.
This makes it possible to run the queries without the pogs index on Cluster v6 even when v5 takes prohibitively long.
The point is to fit as much as possible into RAM, hence the value of a space-efficient data representation.
For reference, the query texts specifying the graph are below. To run without specifying the graph, just drop the FROM <http://dbpedia.org> from each query. The returned row counts are indicated below each query's text.
SQL> SPARQL
SELECT ?p ?o
FROM <http://dbpedia.org>
WHERE { <http://dbpedia.org/resource/Metropolitan_Museum_of_Art> ?p ?o . };

p                                                o
VARCHAR                                          VARCHAR
_______________________________________________________________________________

http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://umbel.org/umbel/ac/Artifact
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/MuseumsInNewYorkCity
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/ArtMuseumsAndGalleriesInTheUnitedStates
http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://dbpedia.org/class/yago/Museum103800563
..
-- 335 rows

SQL> SPARQL
PREFIX p: <http://dbpedia.org/property/>
SELECT ?film1 ?actor1 ?film2 ?actor2
FROM <http://dbpedia.org>
WHERE
  {
    ?film1 p:starring <http://dbpedia.org/resource/Kevin_Bacon> .
    ?film1 p:starring ?actor1 .
    ?film2 p:starring ?actor1 .
    ?film2 p:starring ?actor2 .
  };

film1  actor1  film2  actor2
VARCHAR  VARCHAR  VARCHAR  VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Meryl_Streep
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Joseph_Mazzello
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/David_Strathairn
http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/Kevin_Bacon  http://dbpedia.org/resource/The_River_Wild  http://dbpedia.org/resource/John_C._Reilly
...
-- 23910 rows

SQL> SPARQL
PREFIX p: <http://dbpedia.org/property/>
SELECT ?artist ?artwork ?museum ?director
FROM <http://dbpedia.org>
WHERE
  {
    ?artwork p:artist ?artist .
    ?artwork p:museum ?museum .
    ?museum p:director ?director
  };

artist  artwork  museum  director
VARCHAR  VARCHAR  VARCHAR  VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/Paul_C%C3%A9zanne  http://dbpedia.org/resource/The_Basket_of_Apples  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Paul_Signac  http://dbpedia.org/resource/Neo-impressionism  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Georges_Seurat  http://dbpedia.org/resource/Neo-impressionism  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Edward_Hopper  http://dbpedia.org/resource/Nighthawks  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
http://dbpedia.org/resource/Mary_Cassatt  http://dbpedia.org/resource/The_Child%27s_Bath  http://dbpedia.org/resource/Art_Institute_of_Chicago  James Cuno
..
-- 303 rows

SQL> SPARQL
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?homepage
FROM <http://dbpedia.org>
WHERE
  {
    <http://dbpedia.org/resource/Berlin> geo:lat ?berlinLat .
    <http://dbpedia.org/resource/Berlin> geo:long ?berlinLong .
    ?s geo:lat ?lat .
    ?s geo:long ?long .
    ?s foaf:homepage ?homepage .
    FILTER ( ?lat  <= ?berlinLat  + 0.03190235436 &&
             ?long >= ?berlinLong - 0.08679199218 &&
             ?lat  >= ?berlinLat  - 0.03190235436 &&
             ?long <= ?berlinLong + 0.08679199218 )
  };

s  homepage
VARCHAR  VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/Berlin_University_of_the_Arts  http://www.udk-berlin.de/
http://dbpedia.org/resource/Berlin_University_of_the_Arts  http://www.udk-berlin.de/
http://dbpedia.org/resource/Berlin_Zoological_Garden  http://www.zoo-berlin.de/en.html
http://dbpedia.org/resource/Federal_Ministry_of_the_Interior_%28Germany%29  http://www.bmi.bund.de
http://dbpedia.org/resource/Neues_Schauspielhaus  http://www.goya-berlin.com/
http://dbpedia.org/resource/Bauhaus_Archive  http://www.bauhaus.de/english/index.htm
http://dbpedia.org/resource/Canisius-Kolleg_Berlin  http://www.canisius-kolleg.de
http://dbpedia.org/resource/Franz%C3%B6sisches_Gymnasium_Berlin  http://www.fg-berlin.cidsnet.de
..
-- 48 rows

SQL> SPARQL
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX p: <http://dbpedia.org/property/>
SELECT ?s ?a ?homepage
FROM <http://dbpedia.org>
WHERE
  {
    <http://dbpedia.org/resource/New_York_City> geo:lat ?nyLat .
    <http://dbpedia.org/resource/New_York_City> geo:long ?nyLong .
    ?s geo:lat ?lat .
    ?s geo:long ?long .
    ?s p:architect ?a .
    ?a foaf:homepage ?homepage .
    FILTER ( ?lat  <= ?nyLat  + 0.3190235436 &&
             ?long >= ?nyLong - 0.8679199218 &&
             ?lat  >= ?nyLat  - 0.3190235436 &&
             ?long <= ?nyLong + 0.8679199218 )
  };

s  a  homepage
VARCHAR  VARCHAR  VARCHAR
_______________________________________________________________________________

http://dbpedia.org/resource/GE_Building  http://dbpedia.org/resource/Associated_Architects  http://www.associated-architects.co.uk
http://dbpedia.org/resource/Giants_Stadium  http://dbpedia.org/resource/HNTB  http://www.hntb.com/
http://dbpedia.org/resource/Fort_Tryon_Park_and_the_Cloisters  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Central_Park  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Prospect_Park_%28Brooklyn%29  http://dbpedia.org/resource/Frederick_Law_Olmsted  http://www.asla.org/land/061305/olmsted.html
http://dbpedia.org/resource/Meadowlands_Stadium  http://dbpedia.org/resource/360_Architecture  http://oakland.athletics.mlb.com/oak/ballpark/new/faq.jsp
http://dbpedia.org/resource/Citi_Field  http://dbpedia.org/resource/HOK_Sport_Venue_Event  http://www.hoksve.com/
http://dbpedia.org/resource/Citigroup_Center  http://dbpedia.org/resource/Hugh_Stubbins_Jr.  http://www.klingstubbins.com
http://dbpedia.org/resource/150_Greenwich_Street  http://dbpedia.org/resource/Fumihiko_Maki  http://www.pritzkerprize.com/maki2.htm
http://dbpedia.org/resource/Freedom_Tower  http://dbpedia.org/resource/David_Childs  http://www.som.com/content.cfm/www_david_m_childs
http://dbpedia.org/resource/7_World_Trade_Center  http://dbpedia.org/resource/David_Childs  http://www.som.com/content.cfm/www_david_m_childs
http://dbpedia.org/resource/The_New_York_Times_Building  http://dbpedia.org/resource/Renzo_Piano  http://www.rpbw.com/
http://dbpedia.org/resource/Trump_World_Tower  http://dbpedia.org/resource/Costas_Kondylis  http://www.kondylis.com

13 Rows. -- 2183 msec.
In RDF store benchmarks it makes a difference whether the queries are executed with a specified graph or with no graph given. Since Virtuoso is a quad store, not a triple store with many tables, it runs queries inefficiently when no graph is specified and there are no indexes besides the pre-set GSPO and OGPS. Proper use of the FROM clause, or adding suitable extra indexes, gives better results.
If it is known in advance that some users will not name specific graphs in their queries, one of the following should be done: either preset the default graph for the endpoint through DB.DBA.SYS_SPARQL_HOST, as described above, or add extra indexes that do not require the graph to be given, as below.
Neither method requires any changes to the query texts.
The use of additional bitmap indexes is strongly recommended:
SQL> create bitmap index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P, O, G, S);
SQL> create bitmap index RDF_QUAD_PSOG on DB.DBA.RDF_QUAD (P, S, O, G);
You can create other indexes as well. Bitmap indexes are preferable, but if O is the last column then the index cannot be bitmap. So it can be, for example:
create index RDF_QUAD_PSGO on DB.DBA.RDF_QUAD (P, S, G, O);
but cannot be:
create bitmap index RDF_QUAD_PSGO on DB.DBA.RDF_QUAD (P, S, G, O);