Tuesday, 20 October 2015

Matlab: Match elements of a column in two arrays with different number of rows and move data from one to the other

Problem:

    For example, there are 'Name', 'Count' in inventory tables Sample1, Sample2, Sample3. Table Sample1, Sample2, Sample3 share some common variables of 'Names', but with different number of  'Count'. Now we want to combine all three tables into one table T_sum, in which 'Name' is a unique list and in the row of each name have the count for each sample.

Eg. from the many of the following table

Name Count
Bread 234
Pop 123
... ...

to a summary table like the following
Name_all Sample1-Count Sample2-Count ...
Bread 234 236 ...
Pop 123 456 ...
... ... ... ...

Solution #1:  use for loop

%% find the list of Names
Name_all = unique([Sample1.Name; Sample2.Name ...]);

%% do a for loop to migrate data from individual tables to a summary table
% preallocate the memory for the summary table
Counts = zeros(size(Name_all, 1),1);
T_sum = table(Name_all, Counts, Counts, ... );

% set table column names
T_sum.Properties.VariableNames(2:end) = ['Sample1-Count', 'Sample2-Count', ...] 

% for loop for coping data
for  jj = 1: N_samples
    for ii = 1:size(Name_all, 1)
        % find the index of the Name in Sample{jj}
        tmpIndex = Sample{jj}.Name == Name_all(ii);
        % if there Name_all(ii) is in the list of Sample{jj}.Name, copy Count to T_sum under the column of Samplejj-Count
        if sum(tmpIndex) ==1
            T_sum(ii+1,jj) = Sample{jj}.Count(tmpIndex);
        end
    end
end

This code is good for small set of data (<10 a="" becomes="" br="" comes="" data="" it="" large="" of="" set="" slow.="" to="" very="" when="">

Solution #2:  Vectorization (faster)

Same as the codes above for initialization and preallocation. We just need to replace the for-loops to the following codes:

%% use vectorized indexing to replace the for-loops

for jj = 1: N_samples
     % use ismember to find the order (locations) for each name in Sample{jj} that appears in Name_all
     [tmpIndex, tmploc] = ismember(Name_all, Sample{jj}.Name);
     % delete zeros in tmploc
     tmploc(~tmploc) = [];
     % Copying counts to the Summary table
     T_sum(tmp, jj+1) = Sample{jj}.Count(tmploc);
end


Now, it is much faster.

Please leave a comment if you have any question.




No comments:

Post a Comment