mirror of https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data.git
treat graticules as standard
This commit is contained in:
parent 3d8a50f27c
commit 3152466769
4 changed files with 19 additions and 14 deletions
@@ -9,7 +9,7 @@ import shutil
obtain_atp_data = __import__("2_obtain_atp_data")
import matcher
import show_data
process_planet = __import__("6_experimental_process_planet_file")
process_planet = __import__("4_process_planet_file")
config = __import__("0_config")
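A side note on these `__import__` calls: Python module names that start with a digit cannot be loaded with a plain `import` statement, so the numbered scripts are imported by name. A minimal, purely illustrative equivalent:

```python
import importlib

# "import 2_obtain_atp_data" would be a SyntaxError, as identifiers cannot
# start with a digit, so the module is loaded by its name given as a string.
obtain_atp_data = __import__("2_obtain_atp_data")

# importlib.import_module is the more explicit spelling of the same thing.
obtain_atp_data = importlib.import_module("2_obtain_atp_data")
```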
README.MD (29 changed lines)
@@ -49,9 +49,14 @@ run in order:
* 1_obtain_osm_data.py
* 2_obtain_atp_data.py
* 3_delete_old_output.py
* ??????
* 4_process_planet_file.py
* 5_generate_graticule_reports.py
Note that you can modify and rerun, say, ???????????? - it will reuse cached data.
Note that you can stop `5_generate_graticule_reports.py` and start it again - it will reuse cached data rather than generating it from scratch.
To get rid of what was generated already, run `3_delete_old_output`.
`1_obtain_osm_data.py` and `2_obtain_atp_data.py` will also reuse already obtained data; to use new data you will need to reset the cached files.
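Purely as an illustration (this driver is not part of the repository), the steps above could be run in order with a few lines of Python; it assumes the working directory is the repository root:

```python
import subprocess
import sys

# The pipeline steps listed above, in order. Each step caches its results,
# so rerunning the driver reuses already obtained data where possible.
STEPS = [
    "1_obtain_osm_data.py",
    "2_obtain_atp_data.py",
    "3_delete_old_output.py",
    "4_process_planet_file.py",
    "5_generate_graticule_reports.py",
]

for step in STEPS:
    print("running", step)
    # check=True aborts the run as soon as one step fails
    subprocess.run([sys.executable, step], check=True)
```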
## Run tests
@@ -135,34 +140,34 @@ note: if anything is unclear, missing or leaves you confused or unsure - please
* Done in [2_obtain_atp_data.py](2_obtain_atp_data.py)
* The ATP data is unpacked and otherwise remains unchanged
* OSM data is filtered, as we care only about locations of shops and shop-like objects. Areas, including multipolygons, are reduced to points to make further processing easier (a sketch of such a reduction follows after this list).
* This processing is triggered by ????
* Done in [4_process_planet_file.py](4_process_planet_file.py)
* This processing uses the [shops](https://codeberg.org/matkoniecz/shop_listing) package, built specifically for this project
* This package in turn uses [osmium](https://osmcode.org/pyosmium/) for processing the dataset
* This produces a CSV file listing POIs that were detected to be of interest. This data is cached, as generation of this file is time-consuming. For details on the format, refer to the documentation for the shops package.
* ATP data is also prepared for the matching process
* This processing is triggered by ????
* Done in [5_generate_graticule_reports.py](5_generate_graticule_reports.py)
* ATP data is filtered - some entries are ignored altogether; for example, there is some data about specific lamp posts and trees. Such POIs are skipped.
* ATP data is processed (for example some fields are renamed, as part of a workaround for data quality issues)
* Data is cleaned to throw away bogus data that could be detected. See [0_config.py](0_config.py) if you prefer more or fewer logs. The current approach is to report problems to ATP developers (using PRs or issues) and then hide the given type of log for the specific spider. If there is a backlog for a given issue, the warning about that specific problem is fully hidden.
* Actual matching between OSM and ATP data happens as the next step, with [Match](serializing.py) objects built for each processed POI from ATP - with OSM objects matched to it. This data is saved to .csv files. Matching [takes into account many factors](test_matching_logic.py)
* [4_show_data.py](4_show_data.py) generates reports, including maps
* Note: [6_experimental_process_planet_file.py](6_experimental_process_planet_file.py) + [7_experimental_graticule_splitter.py](7_experimental_graticule_splitter.py) is the planned replacement, not limited to spiders unique to a given country.
* This report generation also generates a list of cases where the cache was checked for Nominatim/link status responses
* Matching between OSM and ATP data happens in this step ([5_generate_graticule_reports.py](5_generate_graticule_reports.py)), with [Match](serializing.py) objects built for each processed POI from ATP - with OSM objects matched to it. This data is saved to .csv files. Matching [takes into account many factors](test_matching_logic.py)
* Also in this step, reports in the form of HTML files are generated (called from [5_generate_graticule_reports.py](5_generate_graticule_reports.py))
* [link_scan_worker.py](link_scan_worker.py) - checks whether specified links exist and whether their targets redirect (a sketch follows after this list)
* Uses the list of cases where the link cache was missing a link
* Avoids hammering the same site repeatedly
* In other places only the cache will be used, no outbound connections will be made (as the initial run can take a loooooooong time)
* [nominatim_worker.py](nominatim_worker.py)
* Runs Nominatim queries that were logged as missing from the cache
* Respects the [Nominatim usage policy](https://operations.osmfoundation.org/policies/nominatim/) while making many queries without pause (a sketch follows after this list)
* You can start your cache with the dataset from https://codeberg.org/matkoniecz/nominatim_collected_data - though obviously some query results may now be outdated...
* In other places only the cache will be used, no outbound connections will be made (as the initial run can take a loooooooong time)
* You can start your cache with the dataset from https://codeberg.org/matkoniecz/nominatim_collected_data - though obviously some query results will be outdated...
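The area-to-point reduction mentioned in the list above is handled inside the shops package; the sketch below only illustrates the general technique with pyosmium and shapely, and its tag filter, file name and output format are simplified assumptions rather than the actual implementation:

```python
import osmium
import shapely.wkb

class ShopToPointHandler(osmium.SimpleHandler):
    """Collect shop POIs as (lon, lat, tags), reducing areas to points."""

    def __init__(self):
        super().__init__()
        self.wkb_factory = osmium.geom.WKBFactory()
        self.points = []

    def node(self, n):
        if "shop" in n.tags:
            self.points.append((n.location.lon, n.location.lat, {t.k: t.v for t in n.tags}))

    def area(self, a):
        # the area callback covers both closed ways and multipolygon relations
        if "shop" in a.tags:
            geometry = shapely.wkb.loads(self.wkb_factory.create_multipolygon(a), hex=True)
            point = geometry.representative_point()  # guaranteed to lie inside the area
            self.points.append((point.x, point.y, {t.k: t.v for t in a.tags}))

handler = ShopToPointHandler()
# locations=True lets pyosmium assemble way and area geometries
handler.apply_file("extract.osm.pbf", locations=True)
```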
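link_scan_worker.py itself is not reproduced here; as a minimal sketch of the idea (does a link exist, does it redirect, and do not hit the same host too often), with the names and the delay value being assumptions:

```python
import time
import urllib.parse
import requests

last_request_time = {}  # hostname -> time of the most recent request to it

def check_link(url, min_delay_per_host=5.0):
    """Return (exists, redirect_target) for a single URL; redirect_target may be None."""
    host = urllib.parse.urlparse(url).netloc
    wait = min_delay_per_host - (time.monotonic() - last_request_time.get(host, float("-inf")))
    if wait > 0:
        time.sleep(wait)  # avoid hammering the same site repeatedly
    last_request_time[host] = time.monotonic()
    try:
        # allow_redirects=False so the redirect itself is visible in the response
        response = requests.head(url, allow_redirects=False, timeout=30)
    except requests.RequestException:
        return False, None
    if response.is_redirect:
        return True, response.headers.get("Location")
    return response.status_code < 400, None
```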
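Likewise for nominatim_worker.py, the following sketch only shows how a backlog of queries can be worked through while staying within the [Nominatim usage policy](https://operations.osmfoundation.org/policies/nominatim/) (at most one request per second, an identifiable User-Agent); the User-Agent string and query parameters here are placeholders:

```python
import time
import requests

# identify the application, as the Nominatim usage policy requires
HEADERS = {"User-Agent": "example-atp-osm-matcher/0.1 (replace with your contact)"}

def run_missing_queries(queries):
    """Resolve queries that were logged as missing from the cache."""
    results = {}
    for query in queries:
        response = requests.get(
            "https://nominatim.openstreetmap.org/search",
            params={"q": query, "format": "jsonv2", "limit": 1},
            headers=HEADERS,
            timeout=30,
        )
        response.raise_for_status()
        results[query] = response.json()
        time.sleep(1)  # the policy allows at most one request per second
    return results
```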
Note that such processing may take several hours or multiple days due to the scale of the OSM dataset, and may require 200 GB for caches when processing the global dataset. Though if you process a smaller area, storage and time requirements will drop accordingly.
Note that such processing may take several hours or multiple days due to the scale of the OSM dataset and may require 200 GB for caches when processing the global dataset.
This process caches significant amounts of data. As a result, subsequent runs will be much faster.
This process caches significant amounts of data. As a result, subsequent runs with the same data will be much faster.
Processing should be doable also on old laptops and does not require high-end hardware - this software was developed and run on a laptop with 16 GB RAM and an i7-7600U CPU.
It is currently an experimental proof of concept. Limited effort was put into performance optimization, as it is unclear where the main bottlenecks are. The current priority is producing data of quality high enough to be usable. If you have any specific ideas on how to improve performance - comments are highly welcome!
It is currently an experimental proof of concept. Limited effort was put into performance optimization, as it is unclear where the main bottlenecks are. The current priority is producing data of quality high enough to be usable. If you have any specific ideas on how to improve performance - comments or pull requests are highly welcome!
## Design decisions
@@ -2,7 +2,7 @@ import nominatim_worker
import link_scan_worker
import unittest
import shared
experimental_graticule_splitter = __import__("7_experimental_graticule_splitter")
generate_graticule_reports = __import__("5_generate_graticule_reports")
class SmokeTest(unittest.TestCase):