Efficient way to store and search through large data set?

I’m trying to create an app that determines the closest air quality sensor to your current location. I have all of the coding working but I’m not sure it’s the most efficient. Due to the JSON format of the data, I can only retrieve the latitude/longitude pairs and I have to then compare them to the device’s location pairs.

The full JSON set has 19,000 entries (not a typo). The search through that data is actually pretty quick. It takes about 7-10 seconds using a “for each j in list” block. But storing that data is the issue I’m struggling with.

I thought that a Thunkable list would max out before 19,000 entries. And maybe it does but it’s still doing a good job of finding the closest items in testing I’ve done. It works on my iMac but not on my iPhone. Occasionally, I can get it to work on the phone but rarely. My guess is either the processor isn’t fast enough or the list can’t handle that many items on a phone.

My solution was going to be to feed the data to AirTable but the free version maxes out at under 2,000 rows. I tried a Google Sheet which should be able to handle that length but the Create Row block is soooo slow. Using that “for each j in list” block, it creates a new row in the Google Sheet about once every second. So if I let it go, it should finish in about, oh, 5 hours.

Is there a faster method I should consider? Or is is just an outlandish idea to store and search through 19,000 records on a phone/web app?

Some of my initial thoughts:

Are these air quality sensors all in the same state? If they are across the country, you could group them by state in your database. Then get the current users state through an API or something and just look at coordinates in that state. That would definitely cut down your list size.

Also Google Maps is definitely the king pin here. I have not looked through there APIs, but I’m sure there is something that helps with this.

Another brute force method would be narrowing down your search by the first to digits in the lat and long. If the users Lat is 49.xxxxx look at 49 sensors only. If there are none, try 50 and 48.

For database storage. Since your data structure is JSON, I would suggest Firebase since the Realtime DB stores the data in JSON format anyways.

1 Like

I was going to use Google Maps’ Distance Matrix API to determine distance between points but I opted for a math equation instead since it’s a free option.

I’ll look at Firebase. I haven’t used it before.

I had the same thought about narrowing down the lat/lon ranges. Thanks for suggesting that. I think I’ll give it a go.

As for the state, etc., the API (PurpleAir) returns sensor dats sorted by sensor label name which is a field that anyone can alter when they configure a sensor. So the names are random things like “Outdoor.” I wish I could sort by something else but I can’t figure out a way to sort a JSON response without parsing it first.

Looks like I logged in with the wrong forum account. I’ll fix that shortly.

Performance has been an issue in some of my apps as well. Are you loading your JSON into a variable?
If so, what type: App, Stored, or Cloud? I have found the best performance is to store the JSON in the cloud, then copy it into an app variable, then perform my search on the app variable.

Additionally, although it is a bit of a hack, consider using the Break out of Loop block. That will halt the search as soon as you find a match.

As others have suggested, if you can cluster your sensors in some way (state, lat.long rounded to 10s, etc.) then you can efficiently focus your search. If you create a JSON hierarchy by some sort of grouping, you can then use the GetProperty Block as described here (Generating/ changing JSON object) to pull a sub-hierarchy and perform you loop through the limited hierarchy.

P.S. The techniques I describe above work quite well for an app with over 6,000 entries, so it scales reasonably well. It definitely scales much better than the Data Source blocks.

P.P.S. Feel free to DM me if you would like to talk about a solution in detail

1 Like

@drted Break out of Loop. Brilliant. Of course. If the 3rd record in the JSON is 1/4 mile from the user’s location, there’s no need to search the other 18,997 records. That should actually help a lot. Especially if I follow @darren’s suggestion to limit lat/lon ranges.

I’ve been using app variables. I’ll see if cloud variables make a difference in terms of performance.

Do you use a service that only provides station coordinates and doesn’t have an API for filtering data? This is very strange. I use a service where you can specify the city or coordinates and it will give you the nearest station and data.

I’m using PurpleAir.com’s API. There doesn’t seem to be a way to specify a sort order or to search by city or lat/lon coordinates. They are releasing a new API soon and I’m hoping it can do some of that. The current API seems limited: https://docs.google.com/document/d/15ijz94dXJ-YAZLi9iZ_RaBwrZ4KtYeCy08goGBwnbCU/edit.

They allow you to retrieve ALL sensor data (19,000 records) or one or more by sensor ID. The sensor ID is a five-digit number that can be found on their website’s map feature. So I had to go in and pick the IDs I wanted by physically clicking on their sensor icons on a map of locations near me. I then constructed the JSON urls from those IDs.

I’ve gotten the search by nearest lat/lon pair to work but it takes forever to loop through JSON records. I really wish their API would allow for something like that. I thought I found a way to sort a JSON response by a field/property within the request url but I haven’t been able to find it again so maybe I dreamt it!

The thing is, I’ve put a lot of research into how to determine accurate air quality readings in California, where I live. PurpleAir seems to be the best. But I’m open to suggestions! I know I definitely don’t want to use the U.S. EPA’s AirNow.gov data. It’s problematic for several reasons which I could go into.

@actech Looks like I picked the wrong API. IQAir/AirVisual allows for a nearest city request: http://api.airvisual.com/v2/nearest_city?key=

There goes 10 hours of work I put into my app! Oh well… although AirVisual limits calls per minute much more than PurpleAir. I was able to get a few dozen calls with PurpleAir. I may end up having to pay for something to do what I want easily. Hmm.

Edit: their API returns a “nearest location” that’s over three miles away from me. That’s no good. My app can find the sensor within 1/2 a mile – or closer – to the phone’s location.

There are sensors within a couple miles of my house that are 100 points higher than where I live because of canyons, ridges, etc. that affect wind and collection of pollutants. So I guess I’ll stick with the difficult method unless I find something better. :wink:

I understand, but I use http://aqicn.org/api/ru/. I’ll look at your project.

Now I understand your problem. In fact, everything is very simple - you are using a service with a test API, which is hardly convenient to use. Its developers did a good job, but they created a lot of problems for themselves and users.

Tell me, do you keep all the information about the station in your list, or only a few parameters from them? Why am I asking this? If you store all the data, this is a very large amount. I have a list with 20 thousand entries, but each entry in it consists of no more than 10 characters, so it works well. You need to see how much data you have. My experiments show that blocks start to work unstable with data over 1 Mb. In other words, the problem is not with the phone, but with restrictions in blocks. But I’m afraid that no one will tell you how much data blocks and components can reliably work with.

If I understand correctly, your problem is that the amount of data is too large to store in the application itself. In this case, there is only one way out - to create your own server and request data from it, so that the application does not load all the data, but only the found data. This is how a high-quality API service should work. Simply put, you have to make your own Web service and API, or wait for the developers to make it themselves.

I really appreciate all your knowledge and advice.

I’m not storing the entire JSON response. Within the API call Get block, I check a bunch of conditions (lat/lon pair within 10 miles of my phone’s location, sensor not “hidden” – which means it’s not on the PurpleAir.com map, sensor age less than 6 hours old – which is the last time it reported data to their servers, and the device type to make sure the sensor is outside rather than inside) and then I store the latitude, longitude, sensor name/label and sensor ID #. Those four values are stored in four separate lists. In my testing, the lists have between 800-2000 items. That’s out of a total of about 19,000 possible entries in the full JSON data.

With those constraints, it’s working on my phone but it takes about 30 seconds to return the nearest sensor. That seems really slow. So I’m looking for more ways to make the coding more efficient. An API call seems like the obvious way to speed things up and make the coding much leaner.

One thought I had is to pull the full JSON data but only keep the lat/lon pairs because that’s what I need for the distance search. And then each time I need to find the closest sensor, I just loop over the data locally on the device. It would still take 30 seconds or so to built that first set of data but after that it should be really fast. The JSON call seems to take a good portion of the time.

If I was doing this I’d aim to limit the number of individual write to database or spreadsheet events, and I’d want all the calculations to be done by the spreadsheet not the app. I’m assuming there are many users not just one or two (or just you!).
What I’d look at is:

  • get your response from the API with conditions as you describe
  • save the whole JSON to a single cell in your spreadsheet
  • parse data in that cell into individual lines - lat/long pairs plus any other properties (how hard this is depends on how regular the data is) - I’d start using find, mid and such formulae, with each line continuing the search from where the last one found the previous lat/long pair.
  • create columns to calculate distance from a fixed point (with necessary exclusions)
  • write the phone’s lat/long to another cell
  • find the shortest distance (eg sort by calculated distance)
  • read that single line of data back to the app.

This approach would mean lots of spreadsheet code and very little app code. I don’t know if this will work but the principle of doing the calculations in the spreadsheet should.

Another approach would be to create a static, regular grid - eg a one mile grid and design a spreadsheet which, based on the latest API download, finds the sensors that are within a grid length of each grid point. Then in your app calculate which are the four nearest grid points to your location, and then get only the data from the sensors that are within a grid length of those four grid points. That would spread the workload between the app and the spreadsheet. I’m imagining a spreadsheet with two tables in it - one with the sensor data as a list, another with a grid of lat/long values with each cell containing a ref to any sensors within the distance range.
These are ideas, not solutions. I’ll leave it to you to decide if they’re worth pursuing!

1 Like

@shane.holohan1864qsw Thanks for the algorithm! That makes a lot of sense. And yes, I do want this to work for multiple users (currently about 50 but possibly many more). I’m curious about this suggestion:

Do you have any advice around how to get started with something like that? I’m pretty comfortable in Thunkable and I know my way around an Excel spreadsheet but running formulas from an app isn’t something I’ve done before. Is there a way to do that from within Thunkable with just a Google Spreadsheet or would I need an intermediary server of some sort?

Unfortunately, that site pulls data from the EPA, at least where I live. An API call I made for my location returned a sensor value that is 10 miles away. That IS the closest EPA sensor (AirNow.gov) but it’s not an accurate measure of air quality where I live.

This means that it is better to perform such operations outside of the application. Why is everything slow for you? Because due to the lack of a good API, you are trying to perform heavy server operations on a weak client. You can try using WebViewer to perform these operations using JavaScript and then return the data to the app. From the point of view of the solution architecture, this will also be wrong, but I think this should significantly increase the speed of work, because after loading the response in JSON format, all other operations will occur locally.

@tatiang I haven’t done this myself from an app. And my experience with Excel/Google sheets is cell based formulae not scripts. With those caveats this is what I’d do:

  • if the json is regular, i.e. if the number of characters is the same for each sensor then you can set up a series of rows and populate the cells using using a formula in each cell that extracts (for example) the ten characters after the point where each sensor’s data begins.
  • if the length of each sensor response is different you’ll have to write formulae that finds the delimiters and use those character counts to specify what to extract from the JSON into a particular cell.
  • I’d create a passive sheet populated with these formulae with enough lines in it for all the sensors in the JSON. And I’d set it to autocalculate, so that once the JSON is in the input cell everything else is calculated automatically.
  • then I’d consider whether you want to make the API call to the sensors on demand or at specified intervals. I’d suggest specified intervals to start with. (see below about timed triggers also for a script based approach)
  • The app would then write the user’s location into a cell in a separate table. In that table you’d calculate the nearest sensor(s). Again using passive cell based formulae.

All that said I googled scripting for Sheets and you could it that way too. But it seems that you can’t trigger a script via the API but you could script the Sensor API call on a timed trigger, and then use cell based formulae to calculate the nearest sensor. See these links:
example 2 in this is similar in the need to parse a JSON into cells:

And this explains the limitations on triggering scripts
https://developers.google.com/apps-script/guides/triggers/installable
“API requests do not cause triggers to run.”
“Time-driven triggers let scripts execute at a particular time or on a recurring interval, as frequently as every minute or as infrequently as once per month.”

Hopefully that helps. Good luck!

1 Like

@tatiang I’ve resorted to doing a lot of my data processing work in the cloud. Specifically AWS Lambda. You could write a simple python script that gets the data and processes it and then just sends the pertinent information to your device. Using Lambda you’ll probably get your response in under a second.

Maybe Thunkable could do blocks for Lambda next :smiley: .

2 Likes

Okay, I think I’m getting this now. For some reason, I was thinking of Excel/Google Sheets data as static. That the formulas don’t update until you view the sheet. But that’s silly. If I create a formula, I don’t have to trigger it. It just takes whatever cells as input and stays up-to-date. So I can update the cell values through Thunkable and the formulas will do their thing.

1 Like

I tried storing the full JSON response in a Google Spreadsheet but it wouldn’t allow a single cell with that much data (19,000 JSON records!!!). So then I tried to loop over the full JSON as a list and populate the spreadsheet rows with JSON values. But it only filled about 30 cells and then stopped.

It’s also SO slow. It creates a new row about twice per second.

@tatiang Do I understand correctly that the data in the list of JSON stations is updated from time to time and you need to download all these 19000 records every time? Or is it enough for you to upload this data (station coordinates and other necessary parameters) rarely - once in several months?