Bug 783764 – Code Search for GNOME Builder

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 783764 - Code Search for GNOME Builder


Summary:	Code Search for GNOME Builder


Status:	RESOLVED FIXED

Product:	gnome-builder
Classification:	Other
Component:	general
Version:	3.25.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	GNOME Builder Maintainers
QA Contact:	GNOME Builder Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-06-13 17:41 UTC by Anoop Chandu
Modified:	2017-11-27 04:49 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Indexing code and storing that into disk (74.65 KB, patch) 2017-06-13 17:46 UTC, Anoop Chandu	none	Details \| Review
Indexing tool (30.98 KB, patch) 2017-06-23 17:34 UTC, Anoop Chandu	none	Details \| Review
Indexer plugin (44.00 KB, patch) 2017-06-26 14:28 UTC, Anoop Chandu	none	Details \| Review
Indexing code and storing that into disk (65.75 KB, patch) 2017-06-29 10:24 UTC, Anoop Chandu	none	Details \| Review
Adding code-index plugin and IdeCodeIndexService (25.24 KB, patch) 2017-07-22 06:58 UTC, Anoop Chandu	reviewed	Details \| Review
Adding IdeCodeIndexBuilder to code-index (37.67 KB, patch) 2017-07-22 07:01 UTC, Anoop Chandu	reviewed	Details \| Review
Adding serializable map (24.11 KB, patch) 2017-08-15 17:44 UTC, Anoop Chandu	reviewed	Details \| Review
Adding IdeCodeIndexIndex (21.27 KB, patch) 2017-08-15 17:44 UTC, Anoop Chandu	reviewed	Details \| Review
Implement IdeCodeIndexer (14.26 KB, patch) 2017-08-21 22:25 UTC, Anoop Chandu	none	Details \| Review
Implement GListModel & IdeCodeIndexEntry (24.29 KB, patch) 2017-08-21 22:25 UTC, Anoop Chandu	none	Details \| Review

Description Anoop Chandu 2017-06-13 17:41:30 UTC

This is for tracking GSOC Project Code Search for GNOME Builder. Project link
https://summerofcode.withgoogle.com/projects/#4574885478137856. Goal of this project is to implement Global search and Jump to definition by indexing source code using libclang.

Comment 1 Anoop Chandu 2017-06-13 17:46:21 UTC

Created attachment 353700 [details] [review]
Indexing code and storing that into disk

This will index code and store that index into disk. All files in a
directory are indexed into a single file. Whole indexing is divided into
2 parallel processes, indexing all directories into a files and loading
those files.

Indexing a directory into a file is done by,
  Getting all declarations and inserting those into DzlFuzzyIndexBuilder
     This is done in 2 parallel processes, getting translation unit of file and 
     traversing that TU for inserting declarations into DzlFuzzyIndexBuilder
  When all declarations are inserted into DzlFuzzyIndexBuilder, it is written into 
  a file.

Comment 2 Anoop Chandu 2017-06-14 06:15:46 UTC

Just a note about few classes:

IdeClangAstIndexer - This will index a list of files and store that index into a file. This is done by generating AST for each file, traversing that to extract declarations and inserting those into DzlFuzzyIndexBuilder. 

IdeIndexerIndex - This will load and store DzlFuzzyIndex when a file path is given to that. This will maintain a map for file path and DzlFuzzyIndex.

IdeIndexerIndexBuilder - This will index a directory (recursively) and load those indexes. First it will look for changes in directory. If an index is new enough then it won't index that directory and just load index. If it needs to reindex then it gives list of files in that directory to IdeClangAstIndexer to index. After indexing is done by IdeClangAstIndexer it loads that index.

IdeIndexerservice - looks for changes in working directory and gives IdeIndexerIndexBuilder a directory to index (recursively).

Comment 3 Anoop Chandu 2017-06-23 17:34:24 UTC

Created attachment 354338 [details] [review]
Indexing tool

This tool will index a set of files into given destination folder.
Index for each set contain 2 files. One file for storing keys of
global declarations where they can be searched by exact match. Other
for storing names of all symbols where they can be searched fuzzily.

Comment 4 Anoop Chandu 2017-06-26 14:28:15 UTC

Created attachment 354506 [details] [review]
Indexer plugin

This is a plugin which has service that traverse working directory for
changes and indexes those. Indexing is done using indexing tool in a
subprocess. After indexing it will load all indexes and be ready to
process queries.

Comment 5 Christian Hergert 2017-06-28 23:28:45 UTC

Review of attachment 354506 [details] [review]:

Nice work thus far!

There are a lot of mechanics to look over to make this easier for future reviews. In particular, we want to be very explicit about how we transfer ownership of structures during asynchronous processes. Builder is a large program and it is much easier to track this stuff down during development than in the wild were so much is going on at once.

I think you will find the ide-indexer-builder code can be much simplified if you build out async/finish pairs for each series of async operation. Then your "primary" async/finish pair will be scripting the execution of those helper async/finish operations.

::: plugins/indexer/ide-indexer-builder.c
@@ +1,1 @@
+#define G_LOG_DOMAIN "ide-indexer-builder"

Don't forget a license for your file. Something that is GPLv3+ compatible. My preference is GPLv3, but yours may be different.

@@ +17,3 @@
+struct _IdeIndexerBuilder
+{
+	IdeObject         parent;

2 space indentation

@@ +40,3 @@
+{
+  GFile    *directory;
+  gboolean  recursive;

Generally we use "guint foo : 1" for storing booleans in structures in case we need to add things in the future. It won't really save us any allocation size from GSlice since it will always hand out >=2*sizeof(void*) allocations back, but nonetheless it's good habit. Note that it needs to be unsigned when using a 1-bit flag since otherwise the 1 bit would be used for the sign bit.

@@ +41,3 @@
+  GFile    *directory;
+  gboolean  recursive;
+} SubTask1Data;

Structures with numbers in them aren't very helpful to the next person that needs to come and work on the code. Use a name that describes the work being done.

@@ +44,3 @@
+
+const gchar *extensions[] = {".c",".h",".cc",".cp",".cxx",".cpp",".CPP",".c++",".C",
+                            ".hh",".H",".hp",".hxx",".hpp",".HPP",".h++",".tcc", "\0"};

Use NULL for your final item and just compare against NULL when iterating the extensions list.

@@ +54,3 @@
+                                                          GTask             *main_task);
+
+enum

enum { cuddled on the same line

@@ +58,3 @@
+  PROP_0,
+  PROP_INDEX,
+  LAST_PROP

We're trying to be consistent and use N_PROPS throughout the code-base.

@@ +66,3 @@
+changes_queue_data_free (ChangesQueueData *data)
+{
+  g_ptr_array_unref (data->files);

We like to clear pointers so that when debugging we can find leaks by checking the value of struct fields. So use something like:

  g_clear_pointer(&data->files, g_ptr_array_unref);

@@ +67,3 @@
+{
+  g_ptr_array_unref (data->files);
+  g_object_unref (data->destination);

g_clear_object (&data->destination);

@@ +74,3 @@
+sub_task1_data_free (SubTask1Data *data)
+{
+  g_object_unref (data->directory);

g_clear_object (&data->directory);

@@ +81,3 @@
+indexing_data_free (IndexingData *data)
+{
+  g_ptr_array_unref (data->changes_queue);

g_clear_pointer

@@ +93,3 @@
+{
+  IdeIndexerIndex *index = (IdeIndexerIndex *)source_object;
+  GTask *main_task = (GTask *)user_data;

Tracking down referencing counting issues in async tasks is a very difficult problem once the code has been written. Because of that, we try to be explicit about passing around ownership.

Every where you have a GTask, we want to see g_autoptr() used and g_steal_pointer(). For example, when creating the task, use something like:

 g_autoptr(GTask) task = NULL;

 task = g_task_new (self, cancellable, callback, user_data);
 ...

 do_something_else_async (..., g_steal_pointer (&task));

If you find that you need to pass the GTask to multiple async operations, use g_object_ref() instead of g_steal_pointer(). The cost of an extra atomic unref is not worth the headache of tracking down ownership transfers.

@@ +94,3 @@
+  IdeIndexerIndex *index = (IdeIndexerIndex *)source_object;
+  GTask *main_task = (GTask *)user_data;
+  GError *error = NULL;

error is leaked. use

g_autoptr(GError) error = NULL;

@@ +100,3 @@
+
+  ide_indexer_index_load_indexes_finish (index, result, &error);
+  g_debug ("sub task 3 - loading finished");

You might want to log the failure with something like:

if (!ide_inexer_index_load_indexes_finish (index, result, &error))
  g_warning ("%s", error->message);
else
  g_debug ("sub task 3 - loading finished");

@@ +117,3 @@
+  IdeIndexerBuilder *self;
+  IdeSubprocess *subprocess = (IdeSubprocess *)source_object;
+  GTask *main_task = (GTask *)user_data;

Same here, make sure to transfer ownership of the task (which would be passed to you by the g_steal_pointer()/g_object_ref() for user_data).

 g_autoptr(GTask) main_task = user_data;

@@ +118,3 @@
+  IdeSubprocess *subprocess = (IdeSubprocess *)source_object;
+  GTask *main_task = (GTask *)user_data;
+  GError *error = NULL;

error is leaked, g_autoptr(GError) error = NULL

@@ +124,3 @@
+  g_assert (G_IS_TASK (main_task));
+
+  self = g_task_get_source_object (main_task);

g_assert (IDE_IS_INDEXER (self));

@@ +125,3 @@
+
+  self = g_task_get_source_object (main_task);
+  loading_queue = g_task_get_task_data (main_task);

I like to add some assertions after getting task data, since it's possible to change the data after the task has been created. Just an extra paranoia check.

@@ +127,3 @@
+  loading_queue = g_task_get_task_data (main_task);
+
+  ide_subprocess_wait_finish (subprocess, result, &error);

if (!ide_subprocess_wait_finish (subprocess, result, &error))
  g_warning ("Subprocess failure: %s", error->message);

Do you want to keep processing task 3 if 2 has failed?

@@ +133,3 @@
+  ide_indexer_index_load_indexes_async (self->index,
+                                        loading_queue,
+                                        NULL,

You probably want to pass the cancellable down from the main task. Since you'll likely be stealing ownership of the task in the last parameter here, you'll need to extract the cancellable before this function call as the order parameters are processed in C is undefined.

cancellable = g_task_get_cancellable (task);

ide_indexer_index_load_indexes_async (..., cancellable, ..., g_steal_pointer (&main_task));

@@ +153,3 @@
+  IdeIndexerBuilder *self;
+  IdeBuildSystem *build_system = (IdeBuildSystem *)object;
+  gchar **argv = NULL;

argv is being leaked.

g_auto(GStrv) argv = NULL;

@@ +154,3 @@
+  IdeBuildSystem *build_system = (IdeBuildSystem *)object;
+  gchar **argv = NULL;
+  GError *error = NULL;

g_autoptr(GError) error = NULL;

@@ +155,3 @@
+  gchar **argv = NULL;
+  GError *error = NULL;
+  GTask *main_task = (GTask *)user_data;

Same here, g_autoptr(GTask) task = user_data and ensure you pass a reference when calling the async func.

@@ +191,3 @@
+      cxxflags = ide_configuration_getenv (config, "CXXFLAGS");
+
+      if (cflags && *cflags)

You probably want to break out the logic for using cflags or cxxflags into another function, because you'll have to guess based on the file extension. Probably something like:

 - Always CFLAGS for .c files
 - Everything else gets CXXFLAGS if it exists, otherwise fallback to CFLAGS. .h files are difficult becaues C++ projects often use .h instead of .hh or .hpp, etc.

@@ +203,3 @@
+  for (int i = 0; argv[i] != NULL; i++)
+    g_string_append_printf (indexer_input, "%s ", argv[i]);
+  g_string_append_printf (indexer_input, "%s\n", "-I/usr/lib/llvm-4.0/bin/../lib/clang/4.0.0/include");

This code should be unconcerned with the clang include path. That should be done by the code that is actually using libclang since those sort of include files are tied to the compiler version.

@@ +208,3 @@
+    {
+      GFile *file;
+      g_autofree gchar *file_name;

Always set g_autofree, g_autoptr(), and g_auto() variables to NULL when declaring.

@@ +213,3 @@
+
+      file_name = g_file_get_path (file);
+      g_string_append_printf (indexer_input, "%s ", file_name);

This isn't safe to do like this because you can't handle filenames with spaces. If you want to pass the list of files on STDIN, you need to use \n or something to delimit each filename safely.

@@ +216,3 @@
+
+      ide_build_system_get_build_flags_async (build_system,
+                                              ide_file_new (ide_object_get_context (IDE_OBJECT (self)), g_object_ref (file)),

This leaks file. ide_file_new() will take it's own reference to file.

@@ +219,3 @@
+                                              NULL,
+                                              get_build_flags_cb,
+                                              main_task);

After adding the autoptr, g_steal_pointer(&main_task).

@@ +227,3 @@
+      gsize num_bytes;
+
+      indexer_pipe = (GOutputStream *)ide_subprocess_get_stdin_pipe (idata->subprocess);

The cast is unnecessary.

@@ +254,3 @@
+    {
+      gsize num_bytes;
+      GError *error = NULL;

g_autoptr(GError) error = NULL;

@@ +260,3 @@
+      output = ide_subprocess_get_stdin_pipe (idata->subprocess);
+
+      g_output_stream_printf (output, &num_bytes, NULL, &error, "0\n");

Why the 0?

Presumably just closing the stream is enough to indicated that the caller is done passing in filenames?

@@ +267,3 @@
+      g_task_set_task_data (main_task, loading_queue, (GDestroyNotify)g_ptr_array_unref);
+
+      ide_subprocess_wait_async (idata->subprocess, NULL, indexing_finished_cb, main_task);

Explicit task ownership transfers, g_object_ref(main_task).

@@ +274,3 @@
+      GPtrArray *files;
+      GFile *destination, *file;
+      GString *indexer_input;

g_autoptr(GString) indexer_input = NULL; and then use g_steal_pointer(&indexer_input) when assigning it to task data. This makes it easier for us when reviewing to ensure that ownership is explicit.

@@ +275,3 @@
+      GFile *destination, *file;
+      GString *indexer_input;
+      g_autofree gchar *dest_name, *file_name;

Always assign NULL with autofree

@@ +290,3 @@
+
+      file_name = g_file_get_path (file);
+      g_string_append_printf (indexer_input, "%s ", file_name);

Use \n to delimit filenames instead of space.

@@ +298,3 @@
+      context = ide_object_get_context (IDE_OBJECT (self));
+      ide_build_system_get_build_flags_async (ide_context_get_build_system (context),
+                                              ide_file_new (context, g_object_ref (file)),

file is leaked.

@@ +301,3 @@
+                                              NULL,
+                                              get_build_flags_cb,
+                                              main_task);

Explicit ownership transfer.

@@ +302,3 @@
+                                              get_build_flags_cb,
+                                              main_task);
+      g_ptr_array_remove_index (files, files->len - 1);

I like to add assertions before operations like this such as:

 g_assert (files->len > 0);
 g_assert (idata->changes_queue->len > 0);

@@ +316,3 @@
+{
+  IdeIndexerBuilder *self = (IdeIndexerBuilder *)object;
+  GTask *main_task = (GTask *)user_data;

g_autoptr(GTask) main_task = user_data;

@@ +317,3 @@
+  IdeIndexerBuilder *self = (IdeIndexerBuilder *)object;
+  GTask *main_task = (GTask *)user_data;
+  GError *error = NULL;

g_autoptr(GError) error = NULL;

@@ +330,3 @@
+      g_debug ("No changes are there, completing task");
+      g_task_return_boolean (main_task, TRUE);
+      g_object_unref (main_task);

Use the autoptr for releasing reference when scope leaves this function.

@@ +334,3 @@
+  else
+    {
+      g_autoptr(IdeSubprocessLauncher) launcher;

= NULL

@@ +335,3 @@
+    {
+      g_autoptr(IdeSubprocessLauncher) launcher;
+      IdeSubprocess *subprocess;

g_autoptr(IdeSubprocess) subprocess = NULL;

@@ +348,3 @@
+          g_debug ("%s", error->message);
+          g_task_return_boolean (main_task, FALSE);
+          g_object_unref (main_task);

rely on autoptr

@@ +354,3 @@
+      idata = g_slice_new0 (IndexingData);
+      idata->changes_queue = g_ptr_array_ref (changes_queue);
+      idata->subprocess = subprocess;

g_steal_pointer (&subprocess)

@@ +376,3 @@
+  g_autoptr(GFileEnumerator) enumerator = NULL;
+  gpointer infoptr = NULL;
+  GError *error = NULL;

g_autoptr(GError) error = NULL;

@@ +382,3 @@
+  DzlFuzzyIndex *fzy_index = NULL;
+  guint num_files = 0;
+  GTimeVal indexing_time = {0,0};

{ 0 } is sufficient for setting all struct fields to 0.

@@ +396,3 @@
+  g_clear_error (&error);
+
+  dest_file = g_file_get_child (destination, "index2");

I expect later on we'll come up with some names here that are more descriptive.

I also think we'll want to keep the files in an alternate directory such as /var/tmp but I need to think about that more.

@@ +427,3 @@
+                                          G_FILE_QUERY_INFO_NOFOLLOW_SYMLINKS,
+                                          NULL,
+                                          &error);

If you're not going to log or use the error, just pass NULL for the error parameter.

@@ +430,3 @@
+  g_clear_error (&error);
+
+  while ((infoptr = g_file_enumerator_next_file (enumerator, NULL, &error)) != NULL)

Put the NULL check near the assignment so it's clear what is being compared.

 if (NULL != (infoptr = ...)))

@@ +441,3 @@
+      if (type == G_FILE_TYPE_DIRECTORY && recursive)
+        {
+          g_autoptr(GFile) sub_dir, sub_dest;

Always assign to NULL. Also I really prefer to see every variable on it's own line in the Builder codebase. Doubly so with autoptr.

@@ +450,3 @@
+      else if (type == G_FILE_TYPE_REGULAR)
+        {
+          for (int i=0; extensions[i][0]; i++)

Spaces and after changing the extensions list:

  for (guint i = 0; extensions[i] != NULL; i++)

@@ +452,3 @@
+          for (int i=0; extensions[i][0]; i++)
+            {
+              if (g_str_has_suffix (name, extensions[i]))

This does a lot of redundant scanning to the end of the name string. You might simply call strrchr(name, '.') once up front and compare equality using that pointer.

@@ +499,3 @@
+  GFile *workdir;
+  g_autoptr(GFile) destination = NULL;
+  g_autofree gchar *relative_path, *destination_path;

Two separate lines, = NULL for each. It's not obvious syntactically that the g_autofree cleanup attribute applies to both variables not just the first, so use two separate declarations.

@@ +512,3 @@
+  destination_path = g_build_filename (g_get_user_cache_dir (),
+                                       ide_get_program_name (),
+                                       "index",

We should probably use something more descriptive like "code-index" or "code-search" or something like that.

::: plugins/indexer/ide-indexer-builder.h
@@ +1,1 @@
+#ifndef IDE_INDEXER_BUILDER_H

Always include a license at the top. The GPLv3 license is a file-based license.

::: plugins/indexer/ide-indexer-index.c
@@ +41,3 @@
+{
+  DirectoryIndex *dir_index;
+  g_autoptr(GFile) file1 = NULL, file2 = NULL;

One declaration per line in our codebase.

@@ +75,3 @@
+
+  if (dir_index == NULL)
+  {

2-space indentation for {}

@@ +79,3 @@
+    dir_index->names = g_object_ref (names);
+    dir_index->keys = g_object_ref (keys);
+    g_hash_table_insert (self->indexes, g_file_dup (directory), dir_index);

Why does g_file_dup() need to be used instead of g_object_ref()?

@@ +92,3 @@
+{
+  IdeIndexerIndex *self = source_object;
+  GPtrArray *directories = (GPtrArray *)task_data_ptr;

The cast is unnecessary.

@@ +93,3 @@
+  IdeIndexerIndex *self = source_object;
+  GPtrArray *directories = (GPtrArray *)task_data_ptr;
+  gint num_dirs;

guint num_dirs.

@@ +99,3 @@
+  g_assert (IDE_IS_INDEXER_INDEX (self));
+  g_assert (G_IS_TASK (task));
+

g_assert (directories != NULL);

@@ +102,3 @@
+  num_dirs = directories->len;
+
+  for (int i = 0; i < num_dirs; i++)

for (guint i = 0

Always try to avoid comparisons and transfer of signed/unsigned. It will save you CVEs in your future days. :)

Also, you don't modify directories here, so just do:

 for (guint i = 0; i < directories->len; i++)

The compiler is smart enough to hold directories->len in a register, storing it to a local doesn't make anything faster.

@@ +107,3 @@
+
+      directory = g_ptr_array_index (directories, i);
+      if (ide_indexer_index_load_index (self, directory, cancellable, &error) == NULL)

It's weird to me that a function called load_index that passes a pointer back is not transfered to the caller. How about "ensure_index()"?

@@ +128,3 @@
+
+  task = g_task_new (self, cancellable, callback, user_data);
+  g_task_set_task_data (task, g_ptr_array_ref (directories), (GDestroyNotify)g_ptr_array_unref);

Also add

 g_task_set_task_priority (task, G_PRIORITY_LOW);
 g_task_set_source_tag (task, ide_indexer_index_load_indexes_async);

So that we have some useful data to track down issues at runtime and the task operation is lower-priority than the UI.

@@ +169,3 @@
+  IdeIndexerIndex *self = (IdeIndexerIndex *)object;
+
+  g_hash_table_unref (self->indexes);

g_clear_pointer

::: plugins/indexer/ide-indexer-index.h
@@ +1,1 @@
+#ifndef IDE_INDEXER_INDEX_H

License

::: plugins/indexer/ide-indexer-service.c
@@ +18,3 @@
+
+  /* Queue of directories which needs to be indexed */
+  GQueue                 *building_queue;

Generally, when storing a GQueue in a structure you can omit the * and just use the queue inline.

Just use g_queue_clear() instead of g_queue_free().

@@ +24,3 @@
+{
+  GFile   *directory;
+  gboolean recursive;

guint recursive : 1

@@ +25,3 @@
+  GFile   *directory;
+  gboolean recursive;
+}BuildingData;

Space after }

@@ +35,3 @@
+building_data_free (BuildingData *data)
+{
+  g_object_unref (data->directory);

g_clear_object

@@ +44,3 @@
+                  gpointer      user_data)
+{
+  IdeIndexerService *self = (IdeIndexerService *)user_data;

Transfer ownership explicitly. Pass a reference and g_autoptr(IdeIndexerService) self = user_data;

@@ +46,3 @@
+  IdeIndexerService *self = (IdeIndexerService *)user_data;
+  IdeIndexerBuilder *builder = (IdeIndexerBuilder *)object;
+  GError *error = NULL;

g_autoptr(GError) error = NULL;

@@ +50,3 @@
+  g_assert (IDE_IS_INDEXER_SERVICE (self));
+
+  ide_indexer_builder_build_finish (builder, result, &error);

Log the error somewhere so we can track down bugs. Either g_debug("%s", error->message) or IDE_TRACE_MSG ("%s", error->message) if you think it can be spurious and only useful in developer/nightly builds.

@@ +113,3 @@
+  GFile *file;
+  gboolean flag = FALSE;
+  g_autofree gchar *file_name;

= NULL;

@@ +118,3 @@
+  file_name = g_file_get_uri (file);
+
+  for (int i=0; extensions[i][0]; i++)

for (guint i = 0; extensions[i] != NULL; i++)

@@ +120,3 @@
+  for (int i=0; extensions[i][0]; i++)
+    {
+      if (g_str_has_suffix (file_name, extensions[i]))

Stash the strrchr(file_name, '.') ahead of time so the comparisons are only comparing the applicable data. (And use g_str_equal or strcmp()==0

@@ +127,3 @@
+    }
+
+  if (flag)

This can probably just be put inside the loop followed by return. The compiler is smart enough to optimize it out of the loop.

@@ +137,3 @@
+{
+  GFileType type;
+

We always assert our parameters at the beginning of functions. They compile out of production builds and help us catch things early.

@@ +197,3 @@
+
+  bdata = g_slice_new (BuildingData);
+  bdata->directory = g_file_dup (workdir);

g_object_ref() should be sufficient.

@@ +206,3 @@
+                                   NULL,
+                                   builder_build_cb,
+                                   self);

You need to maintain ownership of a reference to self during the async operation. So g_object_ref (self) and appropriate g_autoptr(...) self = user_data in the callback.

::: plugins/indexer/ide-indexer-service.h
@@ +1,1 @@
+#ifndef IDE_INDEXER_SERVICE_H

License.

::: plugins/indexer/indexer-plugin.c
@@ +1,1 @@
+#include <libpeas/peas.h>

License.

::: plugins/indexer/meson.build
@@ +21,3 @@
+)
+
+install_data('indexer.plugin', install_dir: plugindir)

The other plugins have been updated to use configure_data so that the plugin file gets copied to the builddir (allowing us to run our tests and access the plugin files/modules). So copy what they do.

Comment 6 Christian Hergert 2017-06-28 23:39:43 UTC

Review of attachment 354338 [details] [review]:

Take into account the other rather large review I've done and make sure to fix things you see in this patch series too.

As far as design. One concern I have is that this is manually linking against clang and therefore will not be useful to other languages for which we can extract AST information.

I think an indexer interface should be added to libide, so that plugins may implement the necessary components to extract features from their respective AST. Then the plugin that will do the actual indexing can callback into other plugins to extract those features via the interface in libide+libpeas.

Something like:

 IdeCodeIndexer interface in libide/

Then your plugin:

 plugins/indexer/ which knows how to create IdeCodeIndexer instances for a given file. That interface will deliver you the data necessary to build the index.

Then the clang plugin can implement IdeCodeIndexer to extract features from the CXIndex. This way clang code continues to live in the clang plugin (and should GCC give us the features to do it some day, we can use GCC instead of clang). Additionally we can have the python plugin do a Python AST extraction, etc.

Make sense? Anywhere I can elaborate?

Comment 7 Anoop Chandu 2017-06-29 10:24:13 UTC

Created attachment 354680 [details] [review]
Indexing code and storing that into disk

Comment 8 Christian Hergert 2017-07-04 08:36:24 UTC

Review of attachment 354680 [details] [review]:

It looks, based on what I've read, that the IdeAstIndexer interface is meant to perform all of the indexing work. One difficulty with this design is that to support multiple languages, we'd have to duplicate a lot of plumbing and we wouldn't be sharing the indexer files. Having an unlimited number of inputs into a search engine is problematic, because each time we add a new file source, the performance of the lookup engine needs to inspect another table. So we need to find the right balance here.

An alternate approach might be to create an indexer plugin interface and index entry in libide, that can then be used by a plugin that knows how to store that index data to disk and retrieve it as a search provider.

If plugins implement the indexer plugin interface for their respective languages, all of the index data for all the different languages can live within the same fuzzy index by directory.

The tricky part here is to get the right plugin for a given file. Libpeas provides us the peas_plugin_info_get_external_data() which we can use to determine which plugin to use to extract features for a given file.

Say we come across "foo-bar.c". We can guess the gtksourceview id for this file using gtk_source_language_manager_guess_language() (which is "c" in this case). Then look across all of our PeasPluginInfo for one that has an external data for "Ast-Indexer-Languages" containing "c". If we add "X-Ast-Indexer-Languages=c" to clang.plugin, that would ensure we use it's implementation to extract features from the file.

An interface in libide might look like:

struct _IdeAstIndexerInterface
{
  GTypeInterface parent;

  void (*index_async) (IdeAstIndexer *self,
                       GFile *file,
                       GCancellable *cancellable,
                       GAsyncReadyCallback callback,
                       gpointer user_data);
  GPtrArray *(index_finish) (IdeAstIndexer *self,
                             GAsyncResult *result,
                             GError **error);
};

index_finish() might return an array of IdeAstIndexEntry and what that contains is probably going to depend on what you decide to index. I'm not sure if you want that to be a plain boxed struct, or a full on GObject that allows for inheritance.

The clang plugin would gain an IdeClangAstIndexer, we might add one for vala, python, etc.

Then your indexer-plugin would be responsible for walking the file-system, guessing the file-types, using the proper plugin for the AstIndexer interface, and converting the array of IndexEntry into the fuzzy search index. (And also implement the search provider interface).

Comment 9 Christian Hergert 2017-07-04 08:56:02 UTC

I tried to sketch up a few things that might be a bit clearer that my attempt
to use bugzilla's patch review to describe what I meant.

Here is a rough outline of what I was describing.

 - Add IdeAstIndexer to libide with an interface like:

   struct _IdeAstIndexerInterface
   {
     GTypeInterface parent;

     void       (*index_async)  (IdeAstIndexer        *self,
                                 GFile                *file,
                                 LGAsyncReadyCallback   callback,
                                 gpointer              user_data);
     GPtrArray *(*index_finish) (IdeAstIndexer        *self, 
                                 GAsyncResult         *result,
                                 GError              **error);
   };

 - Add IdeAstIndexerEntry object or interface to libide which
   describes a feature that was extracted from the file. Using an
   interface might allow for you to make the data-set lazy and
   there-by save some memory/cpu.

   struct _IdeAstIndexerEntryInterface
   {
     GTypeInterface parent;

     gchar          *(*get_usr)   (IdeAstIndexer *self);
     gchar          *(*get_name)  (IdeAstIndexer *self);
     IdeSymbolKind   (*get_kind)  (IdeAstIndexer *self);
     IdeSymbolFlags  (*get_flags) (IdeAstIndexer *self);
     void            (*get_range) (IdeAstIndexer *self,
                                   guint         *begin_line,
                                   guint         *begin_line_offset,
                                   guint         *end_line,
                                   guint         *end_line_offset);

     /* etc */
   };

 - Implement the above interface(s) in the clang plugin.

 - Refactor your indexer plugin to locate the proper plugin-based
   IdeAstIndexer to instantiate based on the content-type (or preferrably
   the GtkSourceLanguage id since we use that in other places) and
   use that to index the file.

   The key will be "X-Ast-Indexer-Languages" in the .plugin file, but the
   word you pass to peas_plugin_info_get_external_data() does not have the "X-
   prefix, so just "Ast-Indexer-Languages".

   See libide/plugins/ide-extension-util.c:67 for how we match based on
   external data in other plugins. It's basically just a split on ","
   and allow * for matching all. The main thing here is that we can specify
   a value like "c,cplusplus,chdr,cplusplushdr" for the clang.plugin file.

 - Take the information in the IdeAstIndexerEntry and add it to your
   fuzzy index.

 - Implement the search provider using said indexes (also in indexer-plugin).

Comment 10 Anoop Chandu 2017-07-22 06:58:49 UTC

Created attachment 356157 [details] [review]
Adding code-index plugin and IdeCodeIndexService

This patch adds code index plugin and a service which indexes project using other classes.

code-index plugin will indexes project and stores them in disk.
It will then use that index to do fuzzy search of symbols and
finding definitions.

IdeCodeIndexService will look for changes and index source files.
It will also maintain index which is loaded into memory.

Comment 11 Anoop Chandu 2017-07-22 07:01:58 UTC

Created attachment 356158 [details] [review]
Adding IdeCodeIndexBuilder to code-index

This patch adds IdeCodeIndexBuilder class which actually indexes project an loads that index.

Comment 12 Christian Hergert 2017-07-23 10:48:49 UTC

Review of attachment 356158 [details] [review]:

Hrmm, I see simple-table-builder.h but code-index-builder.c. Was that intended?

Great work, it's really coming together well!

::: plugins/code-index/ide-code-index-builder.c
@@ +65,3 @@
+{
+  GFile     *directory;
+  guint      recursive : 1;

We try to put similarly aligned types near each other in the struct. In the case here, for 64-bit machines, the compiler will insert 63-bits of padding after "recursive" because pointers must be aligned to 8-bytes. So the :1 bitflag is not of much use. Instead, put recursive at the tail of the structure. That allows the compiler to use less space for the structure by placing the two pointers next to each other.

@@ +81,3 @@
+  DzlFuzzyIndexBuilder  *index_builder;
+  IdeSimpleTableBuilder *table_builder;
+  guint                  num_files_indexed;

Same here, move this field after file_index_entries.

@@ +148,3 @@
+{
+  IdeCodeIndexBuilder *self = (IdeCodeIndexBuilder *)source_object;
+  GPtrArray *directories = (GPtrArray *)task_data_ptr;

Both of these casts are unnecessary in C. gpointer is just void*.

@@ +156,3 @@
+
+  for (guint i = 0; i < directories->len; i++)
+    {

One thing we try to do elsewhere in the code-base is to assign an element from a container to a typed variable. (In this case, a GFile*). That reduces the chances the void* parameter gets used in the wrong place in future refactoring.

@@ +158,3 @@
+    {
+      if (!ide_code_index_index_load (self->index,g_ptr_array_index (directories, i),
+                                      cancellable, &error))

The use of error here is improper. The g_autoptr(GError) is in the parent scope, so the value is only freed when the parent scope exits (not each assignment as might be the case with some C++ frameworks). So if you have two failures, the second one would stop over and leak the first GError.

To fix this, just move the g_autoptr(GError) assignment inside the loop's scope.

@@ +174,3 @@
+  g_autoptr(GTask) task = NULL;
+
+  g_assert (IDE_IS_CODE_INDEX_BUILDER (self));

Add an assertion for directories != NULL

@@ +184,3 @@
+
+  ide_thread_pool_push_task (IDE_THREAD_POOL_INDEXER, task,
+                             ide_code_index_builder_load_indexes_worker);

\o/ finally something using the indexer thread pool!

@@ +247,3 @@
+
+  self = g_task_get_source_object (task);
+  data = g_task_get_task_data (task);

Use g_assert() after these to validate expectations. Assertion traps like this can be a canary for us should we run into reference counting troubles.

@@ +273,3 @@
+      g_task_set_return_on_cancel (task, TRUE);
+      ide_thread_pool_push_task (IDE_THREAD_POOL_INDEXER, task, write_dir_index_worker);
+      g_object_unref (task);

I don't think it is a good idea from a maintainability standpoint to manually unref task in this function. If it is taking ownership of the parameter (and based on the g_steal_pointer() above, that seems like what it is), then I would suggest the following pattern to make the transfer more clear.

g_autoptr(GTask) local_task = task;

And then only use local_task throughout the function. (So g_steal_pointer(&local_task) is clear it's taking the reference and passing it along).

@@ +359,3 @@
+      data->num_files_indexed++;
+
+      sprintf (num, "%d", files->len);

Use g_snprintf(). Obviously sprintf() can be used safely, but locales can come into play and so it's always tricky and we'd rather just have proper bounds checking.

@@ +410,3 @@
+    }
+
+  data = g_slice_new (IndexDirTaskData);

We always use g_slice_new0 except for the most performance critical code to reduce the chances of future mistakes.

@@ +512,3 @@
+  data->indexing_data_ar = g_ptr_array_ref (indexing_data_ar);
+  /* directories will be there in this */
+  data->loading_data_ar = g_ptr_array_new_with_free_func ((GDestroyNotify)g_object_unref);

I don't think you need the cast here, g_object_unref's signature matches GDestroyNotify

@@ +598,3 @@
+          sub_dest = g_file_get_child (destination, file_name);
+
+          ide_code_index_builder_get_changes (self, sub_dir, sub_dest,

You're holding open a directory FD while enumerating the children and then recursing. You might want to first collect all the recursive directories and then process them after exhausting the current directory. Otherwise, you handle the deeply nested cases before the more toplevel cases.

@@ +609,3 @@
+              g_file_info_get_modification_time (info, &mod_time);
+
+              if ((mod_time.tv_sec > max_mod_time.tv_sec) ||

You might consider breaking this out into a timeval_newer_than() helper func for readability sake.

@@ +629,3 @@
+      IndexingData *idata;
+
+      idata = g_slice_new (IndexingData);

new0

@@ +739,3 @@
+    {
+      g_debug ("%s", error->message);
+      g_task_return_error (main_task, error);

Use g_steal_pointer() to avoid a segfault.

@@ +749,3 @@
+{
+  IdeCodeIndexBuilder *self = (IdeCodeIndexBuilder *)object;
+  g_autoptr(GTask) main_task = (GTask *)user_data;

No cast necessary

@@ +759,3 @@
+    {
+      g_debug ("%s", error->message);
+      g_task_return_error (main_task, error);

g_steal_pointer() here too

@@ +791,3 @@
+{
+  IdeCodeIndexBuilder *self = (IdeCodeIndexBuilder *)object;
+  g_autoptr(GTask) main_task = (GTask *)user_data;

remove cast

@@ +801,3 @@
+    {
+      g_debug ("%s", error->message);
+      g_task_return_boolean (main_task, FALSE);

Any reason for not propagating the error back? If so, warrants a comment. If not, steal the error and pass it along.

@@ +874,3 @@
+    {
+    case PROP_INDEX:
+      self->index = g_value_get_object (value);

This isn't safe. g_value_get_object() does not take a reference, so the object could be freed upon returning from set_property(). Use g_value_dup_object() to get your own reference. Also, you need a finalize (or dispose) function to free that reference.

@@ +877,3 @@
+      break;
+    case PROP_SERVICE:
+      self->service = g_value_get_object (value);

Same.

::: plugins/code-index/ide-simple-table-builder.h
@@ +39,3 @@
+ * Index format in disk:
+ * -------------------------
+ * NumberofRows NumberofColumns SizeofKeysArray

This is a bit short on information to describe an encoding format on disk. It needs to describe the structure layouts and padding, a version field in case it needs to change, what endian is the data stored in, etc.

Because of these reasons, I would suggest storing the data in a GVariant format. It's relative self describing, has a known bit/byte format, we can store strings, have fixed size or dynamically sized content within arrays and maintain O(1) access times, and known endianness. Further more, we can mmap() the file and access it's contents directly, using pointers to strings in the data set without allocating them ourselves. (Like we do for the fuzzy search).

Comment 13 Christian Hergert 2017-07-23 11:06:57 UTC

Review of attachment 356157 [details] [review]:

::: plugins/code-index/ide-code-index-service.c
@@ +75,3 @@
+                  gpointer      user_data)
+{
+  g_autoptr(IdeCodeIndexService) self = (IdeCodeIndexService *)user_data;

Lose the cast

@@ +119,3 @@
+  g_assert (G_IS_FILE (directory));
+
+  bdata = g_slice_new (BuildData);

new0

@@ +139,3 @@
+      g_queue_push_tail (&self->building_queue, bdata);
+    }
+

A pattern you might consider using here is to introduce some delay before flushing operations. That gives you the chance to do a few things. You can coalesce operations that might come in (from frequent writes by an external tool, etc) as well as prevent overload on the system from rapid changes coming in.

Usually what I do is to just push_tail(), and then call something_queue_flush(). The queue_flush() function should be smart enough to know if it is already processing and instead wait for the active callback to complete (like you've done above). This makes it easier to introduce delay by using a g_timeout_add_seconds() from the queue_flush() function.

@@ +314,3 @@
+                  const gchar         *lang)
+{
+  g_autoptr(IdeCodeIndexer) code_indexer = NULL;

I think you can remove most of this code and just use IdeExtensionAdapater. After creating it and setting the language as the value, call _get_extension() and take a reference to it. Then you can dispose the adapter and return your extension reference.

@@ +420,3 @@
+
+  manager = gtk_source_language_manager_get_default ();
+  language = gtk_source_language_manager_guess_language (manager, g_strdup (file_name), NULL);

This is leaking the result of the strdup.

Comment 14 Anoop Chandu 2017-08-15 17:44:15 UTC

Created attachment 357661 [details] [review]
Adding serializable map

This map associates a GVariant with a key. With IdeMapBuilder a map
can be built and stored on to disk. Using IdeMap map on disk can
be loaded and searched.

Comment 15 Anoop Chandu 2017-08-15 17:44:46 UTC

Created attachment 357662 [details] [review]
Adding IdeCodeIndexIndex

With this object we can load indexes from disk. Using the loaded
index search results for a query can be retrieved which will be
displayed from global search bar and definition of a reference at
some location can be found.

Comment 16 Christian Hergert 2017-08-20 00:29:02 UTC

Review of attachment 357661 [details] [review]:

I think we should rename this type IdePersistentMap (or similar).

I think we should be able to remove the need for the value_size field. If I'm understanding this correctly, that is for getting a "fixed size" array that is used to access direct offsets. While it costs a function call, g_variant_get_child_value() is O(1) and so I don't think we need value_size and instead just hold onto a GVariant representing that array of elements.

By making this more generic (to work with variable-size GVariants), we can use it as generic data storage for future projects.

::: plugins/code-index/ide-map-builder.c
@@ +167,3 @@
+
+  /* Insert value size */
+  g_variant_dict_insert_value (&dict, "value_size",

There doesn't seem to be any enforcement that the elements are of the same size, so I'm not sure this is safe.

Also, I don't think we need to "get_fixed_array" in the reader for everything because g_variant_get_child_value() is O(1) even without fixed size elements (using a back-pointer table at the end of the array).

::: plugins/code-index/ide-map.h
@@ +39,3 @@
+                                                 GAsyncResult        *result,
+                                                 GError             **error);
+GVariant   *ide_map_get_value                   (IdeMap              *self,

Let's try to avoid "get_" functions for things that aren't properties. Instead, let's use "lookup".

Comment 17 Christian Hergert 2017-08-20 00:54:56 UTC

Review of attachment 357662 [details] [review]:

Nice work!

 - Some safety issues to work through (GHashTableIter is not safe between mutations)
 - Some lifetime tracking issues
 - Some signed vs unsigned comparisons to fix
 - Possibly some performance improvements to be made in the search hot path

::: plugins/code-index/ide-code-index-index.c
@@ +104,3 @@
+                       const SearchResult *b)
+{
+  return (dzl_fuzzy_index_match_get_score (a->match) >

This won't result in a stable sort due to the equality case. Not a big deal, but something to think about as you could end up in a situation with rows bouncing back and forth depending on the use case.

@@ +199,3 @@
+   * indexed previously in the same directory.
+   */
+  if (dzl_fuzzy_index_get_metadata_uint32 (fuzzy_index, "Num_Files") != files->len)

dir_index is leaked

@@ +211,3 @@
+                                              error)))
+    {
+      return FALSE;

here too

@@ +219,3 @@
+     ((mod_time.tv_sec == index_mod_time.tv_sec) &&
+      (mod_time.tv_usec > index_mod_time.tv_usec)))
+    return FALSE;

and here

@@ +222,3 @@
+
+  /* Return false if all current files in directory are not there in index. */
+  for (int i = 0; i < files->len; i++)

guint i = 0;

You can't always avoid it, but try to avoid signed vs unsigned integer comparisons. They'll bite back at the worst of times.

@@ +229,3 @@
+
+      if (!dzl_fuzzy_index_get_metadata_uint32 (fuzzy_index, file_name))
+        return FALSE;

and here

@@ +232,3 @@
+    }
+
+  g_hash_table_insert (self->indexes, g_object_ref (directory), dir_index);

probably want to either have a "goto cleanup" or add an autoptr cleanup function (and then g_steal_pointer()) for dir_index.

@@ +251,3 @@
+  heap = dzl_heap_new (sizeof (SearchResult), (GCompareFunc)search_result_compare);
+
+  for (int i = 0; i < dirs_matches->len; i++)

guint i

@@ +265,3 @@
+      result.match = g_list_model_get_item (dir_matches->matches, result.match_number);
+
+      dzl_heap_insert_val (heap, result);

At this point, the result must already have a score (if search_result_compare is correct).

And if that is the case, I *think* the use of the Heap is unnecessary altogether.

The ideal use-case of IdeSearchReducer is to use the score to determine if we need to do any further processing of the item by checking it's score. If we already reached "max items" and the score is lower than our lowest item in the reducer, then we can skip any future processing of the item.

@@ +268,3 @@
+    }
+
+  while (dzl_heap_extract (heap, &sresult) && max_results)

This is doing a lot of work for every potential search result, we need to be very judicious with what happens in such a tight loop.

@@ +367,3 @@
+    }
+
+  if (g_hash_table_iter_next (&tdata->iter, NULL, (gpointer *)&dir_index))

This isn't safe. You can't hold onto a GHashTableIter across potential mutations of the hashtable (except for when using g_hash_table_iter_* mutation functions).

@@ +439,3 @@
+    }
+
+  g_hash_table_iter_init (&tdata->iter, self->indexes);

As noted elsewhere, this isn't safe unless you can guarantee self->indexes stays unmutated.

@@ +536,3 @@
+{
+  self->indexes = g_hash_table_new_full (g_file_hash,
+                                         (GEqualFunc)indexer_file_equal,

Just use g_file_equal() ?

Comment 18 Anoop Chandu 2017-08-21 22:25:25 UTC

Created attachment 358095 [details] [review]
Implement IdeCodeIndexer

Comment 19 Anoop Chandu 2017-08-21 22:25:59 UTC

Created attachment 358096 [details] [review]
Implement GListModel & IdeCodeIndexEntry