Tuesday, August 14, 2012

Big data sets and ScriptDb (hack)

First of all, ScriptDb just may be the best thing that has happened to GAS. It allows for fast data retrieval and storage (more or less).

But ScriptDb has its problems, and probably the biggest one is saving big data sets. This was a problem I encountered when working on pushData.

A quick intro:

pushData is supposed to save spreadsheets to ScriptDb and use that data to copy entries from some source spreadsheets to some destination spreadsheets.

To make it simpler, imagine having a huge spreadsheet with student data, going from first and last name all the way to place of birth.

But you also need smaller sheets which hold only parts of the data for each student, say a roster of all the students in the school. pushData lets you define a process for copying data: what data should be copied, from where to where, and in which order you want the columns to appear. You can also decide when the copying should happen, or rather the time interval.

The interval can go from 1 to 24 hours, and for one process it can be set to real time (as soon as a change is made to the source sheet it is reflected in the destination sheet).

 

Now my problem is easily defined: I have a handful of huge sheets (anywhere from one to five) plus a few smaller ones, and all the data from those sheets has to be saved to ScriptDb. If you have ever tried to save a big sheet to ScriptDb, you can see the problem.

At the time of writing, saving that much data is very difficult. I am confident that the GAS engineers will have a solution soon, but until then it is a real quest to do such a thing.

 

My initial approach was to create objects: every row would be represented by an object, the object keys would be the column names, and all those objects would be neatly saved to the database (I used something very similar to the function in the documentation). The problem is that saving objects one by one is just very slow, so slow that it exceeds the execution time limit for any big data set. That I knew, and my plan was to use the saveBatch function; however, saveBatch will only manage to save small arrays of such objects, and by small I mean roughly 20 objects at a time. I did try slicing the array and saving only parts of it, but that did not go so well either.
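For context, here is a minimal sketch of that row-per-object approach (assuming sheet is the source Sheet and that the first row holds the column names; this is just an illustration, not pushData's actual code):

// Sketch of the row-per-object approach (illustration only, not pushData's code).
var db = ScriptDb.getMyDb();
var data = sheet.getDataRange().getValues();
var keys = data[0];                       // first row = column names
var rowObjects = [];

for (var r = 1; r < data.length; r++) {   // one object per data row
  var obj = {};
  for (var c = 0; c < keys.length; c++) {
    obj[keys[c]] = data[r][c];
  }
  rowObjects.push(obj);
}

// Saving these one by one with db.save is far too slow for big sheets,
// and db.saveBatch fails once the array of objects gets large.
var results = db.saveBatch(rowObjects, false);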

NOTE: “Data Storage error” is the error most frequently shown, but it can (and does) mean very different things.

After a few days of trying to solve this, and a few periods of thinking "Ooook, it just is not possible", I did find a solution. This solution, however, renders some ScriptDb possibilities useless, but it works and it is acceptable for the script I am building.

 

My Solution (a simplified version)

My solution is a divide-and-conquer approach. Also, I no longer save rows; I save columns.

The data from the sheet is taken with getDataRange().getValues(), and using 2DArrayLib's transpose function (ArrayLib.transpose in the code below) I swap the rows and columns.

I go through the array row by row, temporarily turn each row into a JSON string, and find its length. (There is another approach that avoids turning the data into strings, but this one works much faster, and I needed fast.)

In this process the largest array is found.

 

var db = ScriptDb.getMyDb();

var sheetData = sheet.getDataRange().getValues();
var keys = sheetData[0];

checkDates(sheetData);

sheetData = ArrayLib.transpose(sheetData);

var largestSize = 0;
var indexOfLargest = 0;

for (var i = 0; i < sheetData.length; i++) {
  var thisSize = Utilities.jsonStringify(sheetData[i]).length;
  if (thisSize > largestSize) {
    indexOfLargest = i;
    largestSize = thisSize;
  }
}


checkDates(sheetData) is a function that will find columns containing dates and convert the dates to a ScriptDb-friendly format.
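The real checkDates is not shown in this post; a minimal sketch of what such a function might do (assuming dates are simply turned into millisecond timestamps, which ScriptDb can store without trouble):

// Sketch only; the real checkDates may use a different date format.
function checkDates(sheetData) {
  for (var r = 0; r < sheetData.length; r++) {
    for (var c = 0; c < sheetData[r].length; c++) {
      if (sheetData[r][c] instanceof Date) {
        sheetData[r][c] = sheetData[r][c].getTime(); // milliseconds since epoch
      }
    }
  }
}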


Now the "guess" work comes. I assume that all the columns have the same, largest, length. Knowing that there is a limit on the size of an object ScriptDb will save, I calculate how many objects it will take to save the whole sheet. From that I find the number of elements I can save from every sheet column per object.


var elementPerObject;
var numberOfObjects;
var storePerObject = 5 * 8 * 1024; // to get size in bytes
var additional = 0;
var type = "sheetData";

numberOfObjects = parseInt(Math.ceil((largestSize * keys.length + additional) / storePerObject));
elementPerObject = parseInt(Math.floor(sheetData[indexOfLargest].length / numberOfObjects));

 



This is by no means a surgically correct calculation, far from it. But it will find a SAFE number of elements you can save. It kind of works like sending files over a network when you think of it (much less sophisticated, though).
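To put some made-up numbers on it: with 13 columns of 10,000 rows each, and the largest column serializing to roughly 12,000 characters, numberOfObjects = ceil(12,000 * 13 / 40,960) = 4 and elementPerObject = floor(10,000 / 4) = 2,500, so every stored object ends up holding 13 slices of 2,500 cells each.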


All that is left now is to loop from 0 to the number of objects needed and save those parts of the sheet.


// begin/end mark the current slice; itemId keeps the original order of the pieces.
var items = [];
var itemId = 0;
var begin = 0;
var end = elementPerObject;

for (var count = 0; count < numberOfObjects; count++) {

  var thisItem = {};
  thisItem.headers = keys;
  thisItem.itemType = "sheetData";
  thisItem.itemId = itemId;
  thisItem.sheetName = sheet.getName();
  thisItem.spreadsheetId = spreadsheet.getId();
  itemId++;

  for (var i = 0; i < keys.length; i++) {
    thisItem[keys[i]] = Utilities.jsonStringify(sheetData[i].slice(begin, end));
  }

  items.push(thisItem);

  begin = end;
  end += elementPerObject;
}

Initially begin and end are set to 0 and elementPerObject. As you can see, every run of the loop makes a new object, slices each column array, loads the slices into the object's keys, and pushes the object onto the items array.


The number of objects can (and will) get somewhat large, and some of the space will be wasted. But the script will never fail because an object was too large, and that is what I wanted to achieve.


Finally the array of items created has to be saved:


begin = 0;
end = 20;
if (end > items.length)
  end = items.length;

while (end <= items.length) {

  var subItems = items.slice(begin, end);
  var results = db.saveBatch(subItems, false);

  if (!db.allOk(results)) {
    // partial or no success: retry any failed item individually
    for (var i = 0; i < results.length; i++) {
      Logger.log(i);
      var item = results[i];
      if ((typeof item.success) == 'function' && !item.success()) {
        db.save(item);
      }
    }
  }

  if (end == items.length)
    break;
  else {
    begin = end;
    end += 18;

    if (end > items.length)
      end = items.length;
  }
}

Again the array is sliced into more digestible pieces and saved. I used the fail-safe from the ScriptDb documentation to make sure all the elements get saved.


And that is it, the sheet is saved. This function will save a sheet of 10,000 rows x 13 columns in 10-14 seconds (tested over a few days). Even better, using the saved data to write a sheet finishes in 6-10 seconds.


Which makes it possible to save several really big sheets. :)


 


Usually when you save something you want to retrieve it, so this is how I do it. The retrieval function depends on how the data was saved above…


My function queries the database for all the objects associated with a sheet (created by the previous function), sorting the results by item ID. (In my case it was crucial to write the data in the same order it originally appeared.)


Again, in this case I am not interested in all the columns from the original sheet (although retrieving them all is not excluded).


 


var db = ScriptDb.getMyDb();

var items = db.query({itemType: "sheetData", spreadsheetId: sourceSpreadsheet.getId(), sheetName: sourceSheet}).sortBy("itemId", db.NUMERIC);

var completeData = [];

for (var i = 0; i < headers.length; i++) {
  completeData[i] = [];
  completeData[i][0] = headers[i];
}

In this case headers is an array filled with the column names I want to have on my new sheet. E.g. the original sheet could have "Name, Surname, ID, Date of Birth" but I only want "Name" and "ID" on my new sheet; then headers will contain "Name, ID".
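For that example (the column names are of course yours to choose), the definition is as simple as:

var headers = ["Name", "ID"];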


After setting the first element of every column array to its column name, I go through the saved items and merge the slice stored in each item with the values the column array already contains.


while (items.hasNext()) {

  var thisItem = items.next();

  for (var i = 0; i < headers.length; i++) {
    if (!completeData[i])
      completeData[i] = [];

    mergeArray(completeData[i], Utilities.jsonParse(thisItem[headers[i]]));
  }
}
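mergeArray is a little helper of mine that is not shown here; a plausible sketch of what it boils down to (assuming it simply appends the parsed slice onto the end of the column array; the real helper may differ in detail):

// Sketch only: append the values from one saved slice to the column being rebuilt.
function mergeArray(target, slice) {
  if (!slice) return target;              // guard against a missing column in an item
  for (var i = 0; i < slice.length; i++) {
    target.push(slice[i]);
  }
  return target;
}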

Finally, using 2DArrayLib I transpose the array again and write the data to a new sheet. Simple enough, effective, and in some cases useless. But for this app and similar ones this approach is gold.
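That final step looks roughly like this (a sketch; destinationSheet is a placeholder for whichever sheet you are writing to, and it assumes every column in completeData ended up the same length):

// Transpose the columns back into rows and write them in one call (sketch only).
var output = ArrayLib.transpose(completeData);
destinationSheet.getRange(1, 1, output.length, output[0].length).setValues(output);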


 


Cheers
