Code for generating a comparison between OSM and ATP, focused on how OSM can be improved - missing shops and tags, skipping dubious data

All The Places <-> OpenStreetMap matcher

Pulls data from ATP and OSM, processes it, matches datasets and reports how OpenStreetMap can be improved using ATP data.

This is an experimental version of the software. If anything is broken: please report it. Currently I am not aware of anyone else using this code, so setup/use/configuration is likely untested.

If you tried to look at it and became confused, or you see some terrible code: please open an issue.

Currently the main effort goes toward making a usable proof of concept and confirming that data obtained with this software is in fact usable, useful and welcome in OpenStreetMap.

But I am also interested in making this software usable by others. I am also interested in hints about which parts are the worst offenders.

Disclaimers

For a disclaimer about data quality, see the main data listing website hosted in this repo. Or see index.html in the website that will be generated by these scripts.

I also still have some licensing worries, see issue #8790 (especially this comment).

See generated listing

main data listing website - hosted in a separate repository as a static site (yes, its structure should be changed - there should be no need to generate such an enormous number of separate files)

Contributing

Working on making it clearer what is going on and on sorting out some licensing issues (pending replies in the #alltheplaces US Slack channel).

For now contributions are welcome but anyone contributing agrees to license their work as

  • AGPL
  • GPL
  • MIT

All kinds of contributions are welcome - from typo fixes and code style improvements to adding missing features and improving performance.

For pull requests more time-consuming than fixing a typo, I would highly recommend opening an issue first to OK/confirm/discuss the change.

If you tried to use this software or tried to modify it but got stuck due to confusing or unclear code or missing documentation - please create an issue.

Setup

Install

apt-get install git python pip # needed to install code and run it, equivalents on your OS are fine
apt-get install curl unzip # used by the running code, specifically these commands are needed
git clone https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data.git
cd list_how_openstreetmap_can_be_improved_with_alltheplaces_data
pip install --requirement requirements.txt # install runtime dependencies
pip install --requirement requirements-dev.txt # install development dependencies

error: externally-managed-environment

you may need to run python3 -m venv ~/venv, use ~/venv/bin/pip instead of pip, and run source ~/venv/bin/activate before running Python scripts

or perform some other workaround to get rid of error: externally-managed-environment

Configuration

Then customize the config files - at a minimum you need to specify where the cache will be:

cp .env-template .env # copy config file template for customization
codium .env # open with whatever text editor you are using, you do not need this specific one

Obtain precached data

You can start your Nominatim cache with the dataset from https://codeberg.org/matkoniecz/nominatim_collected_data - though obviously some query results will be outdated...

You can also start the POI link cache with https://codeberg.org/matkoniecz/link_scan_collected_data

TODO: include a script fetching such data in this repository

cd ~/ATP_matcher_cache # or other folder where config.cache_folder() is pointing
git clone https://codeberg.org/matkoniecz/nominatim_collected_data.git
mv nominatim_collected_data nominatim_cache
git clone https://codeberg.org/matkoniecz/link_scan_collected_data.git
mv link_scan_collected_data url_check_cache

Running code

run in order:

  • 3_delete_old_build_and_output.py
  • 1_obtain_osm_data.py
  • 2_obtain_atp_data.py
  • 4_process_planet_file.py
  • 5_generate_graticule_reports.py
  • 21_list_import_status.py

Note that you can stop 5_generate_graticule_reports.py and start it again - it will reuse cached data rather than generating it from scratch. TODO: which scripts do not support interruptions?

1_obtain_osm_data.py and 2_obtain_atp_data.py will also reuse already obtained data; to use fresh data you will need to run 3_delete_old_build_and_output.py first.

Run tests

You can run them with python3 -m unittest or python3 -m pytest.

Note to self by @matkoniecz: on my computer I have unittest and mypytest aliases set up in my interactive shell.

Funding

This project is funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.

Data flow

Note: if anything is unclear, missing, or leaves you confused or unsure - please open an issue and I will improve the documentation (or restructure the code to make it clearer)

  • Various configuration is in 0_config.py - part of it is set via the .env file or environment variables, see Setup for more info. This configuration data is used to control the behaviour of the program in various ways, for example:
    • Where various caches are stored
    • What kind of issues detected in ATP data will be listed.
  • To compare OpenStreetMap data with another source, we obviously need to download OSM data.
  • ATP data is also downloaded
  • OSM data is filtered, as we care only about locations of shops and shoplike objects. Areas, including multipolygons, are reduced to points to make further processing easier. (See the pyosmium sketch after this list.)
    • Done in 4_process_planet_file.py
    • This processing uses the shops package, built specifically for this project
    • This package in turn uses osmium for processing the dataset
    • This produces a CSV file listing POIs that were detected to be of interest. This data is cached, as generation of this file is time-consuming. For details on the format, refer to the documentation of the shops package.
  • ATP data is also prepared for the matching process
    • Done in 5_generate_graticule_reports.py
    • ATP data is filtered - some entries are skipped altogether; for example, there is some data about specific lamp posts and trees, and such POIs are ignored
    • ATP data is processed (for example some fields are renamed, as part of workarounds for data quality issues)
    • Data is cleaned to throw away bogus data that could be detected. See 0_config.py if you prefer more or fewer logs. The current approach is to report problems to ATP developers (using PRs or issues) and then hide the given type of log output for the specific spider. If there is a backlog for a given issue, the warning about that specific problem is fully hidden.
  • Matching between OSM and ATP data happens in this step (5_generate_graticule_reports.py), with a Match object built for each processed POI from ATP, together with the OSM objects matched to it. This data is saved to .csv files. Matching takes many factors into account. (See the matching sketch after this list.)
  • Also in this step, reports in the form of HTML files are generated (called from 5_generate_graticule_reports.py)
  • link_scan_worker.py - checks whether specified links exist and whether their targets redirect
    • Uses the list of cases where the link cache was missing a link
    • Avoids hammering the same site repeatedly
    • In other places only the cache will be used, and no outbound connections will be made (as the initial run can take a loooooooong time)
  • nominatim_worker.py (see the Nominatim sketch after this list)
    • Runs Nominatim queries that were logged as missing from the cache
    • Respects the Nominatim usage policy even while making many queries in a row
    • In other places only the cache will be used, and no outbound connections will be made (as the initial run can take a loooooooong time)
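
To make the OSM filtering step more concrete, here is a minimal sketch of reducing a planet file to a CSV of shoplike POIs with pyosmium. This is not the actual shops package: the tag filter is oversimplified, areas/multipolygons are not handled, and the file names are assumptions.

import csv
import osmium

class ShopHandler(osmium.SimpleHandler):
    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def node(self, node):
        # keep only nodes that look shoplike - the real filter is far more involved
        if 'shop' in node.tags or 'amenity' in node.tags:
            self.writer.writerow([node.id, node.location.lat, node.location.lon, node.tags.get('name', ''), node.tags.get('shop', node.tags.get('amenity', ''))])

with open('pois.csv', 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['osm_id', 'lat', 'lon', 'name', 'poi_type'])
    ShopHandler(writer).apply_file('planet.osm.pbf')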
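
And a deliberately naive sketch of the matching idea - the real matcher in matcher.py weighs many more factors; the 100 m radius and the exact-name comparison here are illustrative assumptions only.

import math

def distance_in_meters(lat1, lon1, lat2, lon2):
    # equirectangular approximation - accurate enough at shop-to-shop distances
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y) * 6_371_000

def find_match(atp_poi, osm_pois, search_radius=100):
    def distance_to(osm):
        return distance_in_meters(atp_poi['lat'], atp_poi['lon'], osm['lat'], osm['lon'])
    candidates = [osm for osm in osm_pois if distance_to(osm) <= search_radius]
    # prefer candidates with a matching name, otherwise take the closest one
    named = [osm for osm in candidates if osm.get('name', '').casefold() == atp_poi.get('name', '').casefold()]
    pool = named or candidates
    if not pool:
        return None  # possibly a shop missing from OSM - worth listing in the report
    return min(pool, key=distance_to)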
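
Finally, a minimal Nominatim sketch - not the actual nominatim_worker.py; the example query and User-Agent are made up. The key point is the pause, as the Nominatim usage policy allows at most one request per second.

import json
import time
import urllib.parse
import urllib.request

def geocode(query):
    url = 'https://nominatim.openstreetmap.org/search?' + urllib.parse.urlencode({'q': query, 'format': 'jsonv2'})
    request = urllib.request.Request(url, headers={'User-Agent': 'ATP-matcher-example/0.1'})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

for query in ['Opernring 2, Wien']:  # the real worker iterates over queries logged as missing from the cache
    print(geocode(query))
    time.sleep(1)  # at most one request per second, per the usage policy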

Note that such processing may take several hours or multiple days due to the scale of the OSM dataset, and it requires around 200 GB for caches when processing the global dataset.

This process caches significant amounts of data. As a result, subsequent runs with the same data will be much faster.

Processing should be doable also on old laptops and does not require high-end hardware - this software was developed and run on a laptop with 16 GB RAM and an i7-7600U CPU.

It is currently an experimental proof of concept. Limited effort was put into performance optimization, as it is unclear where the main bottlenecks are. The current priority is producing data of high enough quality to be usable. If you have any specific ideas for how to improve performance - comments or pull requests are highly welcome!

code at https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data

issues about code/data/website at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues

repository with web listings at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset

output published on https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/

Discussion of this project

Similar projects

Design decisions

  • POI scraping is done by a separate project (All The Places)
    • There is no known viable replacement, but in principle a different data source could be used instead
    • There is some ATP-specific code, but it is relatively easy to replace
  • I want this comparison builder to become part of ATP, if the maintainers are interested
    • Though from https://github.com/alltheplaces/alltheplaces/issues/6787 I am not sure whether there is interest
    • Maybe I should make a new, clearer issue or a PR? Or treat no response as rejection?
    • TODO: investigate how ATP is deployed

pyosmium is used

osm2pgsql was considered but it has vastly higher hardware requirements

duckdb_spatial was considered, but at the time of development there was an open game-breaking bug in it, see https://github.com/duckdb/duckdb_spatial/issues/349#issuecomment-2300652683

Overpass API would require at least running my own instance - fetching all shops worldwide will fail on any existing public instance (and would likely fail even if I tried to run it on a dedicated instance)

pyrosm was mentioned, but I have not tried it. Its documentation does not mention planet-scale processing, but it also does not warn against it.

Someone mentioned planetiler; I am not sure whether it is even applicable for such processing.

I considered using osmium and writing filtering code in C. I also considered writing everything in Rust. I have not done either.

Reformat code to follow Python coding standards

autopep8 --in-place --max-line-length=420 --ignore E26 --recursive .

E26 - spaces after #

Note: these suggestions should not be blindly accepted.

PEP 8 -- Style Guide for Python Code

Detect code style issues

See the command at the end of this section.

Applicable in this project; these should not be applied elsewhere.

W0201 - disabling it is maybe a bad idea, but it makes sense in this specific code. It probably should not be disabled in other projects.

E0401 - buggy. I have not tried reporting it, as https://github.com/pylint-dev/pylint/issues/9077 has been stuck at triage since 2023.

Generally applicable

Disables R1702: Too many nested blocks - TODO: reenable it

Disables W0621: Redefining name from outer scope - TODO: reenable it

Detect also TODO_LOW_PRIORITY and TODO_ATP - maybe via https://stackoverflow.com/a/71036231/4130619 or a custom checker (https://pylint.pycqa.org/en/latest/development_guide/how_tos/custom_checkers.html#how-to-write-a-checker)? Maybe add grep? That would also catch log entries, but maybe those are avoidable. A possible fallback is sketched below.
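
A minimal grep-style fallback sketch, pending a proper pylint checker - this is a hypothetical helper, not part of the repository:

import pathlib
import re

marker = re.compile(r'TODO_LOW_PRIORITY|TODO_ATP')
for path in sorted(pathlib.Path('.').rglob('*.py')):
    for line_number, line in enumerate(path.read_text(encoding='utf-8').splitlines(), start=1):
        if marker.search(line):
            print(f'{path}:{line_number}: {line.strip()}')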

E1136 is hopelessly buggy, see https://github.com/pylint-dev/pylint/issues/1498#issuecomment-1872189118

Disable W1514 as such OSes are not supported by me anyway, and it is fixed by https://peps.python.org/pep-0686/ making UTF-8 the default everywhere.

Disables rule C0103 with many false positives (too eager to convert variables into constants).

Disables R0902 as this does not seem to be an actual problem to me.

Disables C0411 as low priority decoration.

Disables C1803 as unwanted syntactic sugar (reconsider after pressing issues are eliminated)

Disables R1705 - as it is unclear what is wrong with else after return

Disables C0301 complaining about long lines (TODO: consider reenabling; see the autopep8 invocation above, which allows long lines).

Disables W0613 complaining about unused arguments. (TODO: reenable? consider)

Disables R0911, R0912, R0913, R0914, R0915, C0302 complaining about complexity/size of code. (TODO: reenable)

Disables C0114, C0115, C0116 asking for docstrings (TODO: reenable)

Disables R0801 as it is a bit of false positive now (TODO: reenable)

Disables C0209: Formatting a regular string which could be an f-string (TODO: maybe reenable)

Disables C0121 complaining about == None (TODO: learn about why it is bad; see the sketch below)
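
For the record, the usual argument against == None: the == operator dispatches to __eq__, which a class may override, while is None checks object identity and cannot be fooled.

class Weird:
    def __eq__(self, other):
        # a class is free to claim equality with anything, including None
        return True

value = Weird()
print(value == None)  # True - misleading, value is a real object
print(value is None)  # False - identity check is reliable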

Disables R1710 asking for explicit returning of None

Disables W0719 asking for more specific exceptions

Disables R1713 as it is overeager and yells about stuff where applying it would damage readability

Disables W0101 as unreachable code is deliberate in some config functions.

Disables W0611 as unused imports are of basically zero importance (TODO: reenable after more important ones are fixed)

Disable R0904 as it complains about too many tests, which is a false positive.

Disable E1124 as it complained about something that was actually helpful.

Disables W0706 as I do not get why it is wrong to code this way or why this report may be helpful.

Disables C0413 as import order is of very low importance.

Disables R0903 as it complains about harmless things.

Disables R0402 as I fail to get why it is an improvement.

Disables C0325 as such parens improve readability.

Disables C0206 as I see no real improvement in readability.

R1724 is overcleaning and will result in bugs if the continue is ever changed. It also results in less clear code.

R1737 - more syntactic sugar, let's ignore it for now. TODO: rethink this

R1714 - minimal if any gains, problems if elements are not hashable

Disable R1730 - I see no improvement here, TODO: recheck and reconsider

Disable W0105 at least for now (as Python has no multiline comments :( so string statements get used as such). Separately, check-str-concat-over-line-jumps = true is enabled (see https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/implicit-str-concat.html) - removing implicit string concatenation, a design bug, was proposed but rejected in https://peps.python.org/pep-3126/ :(

Disable C0201 as the suggested form is more cryptic, for a minor improvement if any

Disable W0603 as limited use of global is not a deal breaker, though ideally it would get fixed

Disable W0511 (notifying about TODOs)

Command

pylint *.py --include-naming-hint=y --variable-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --argument-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --disable=R0917,E0401,W0201,R0902,C0103,C0301,C0114,C0115,C0116,C0121,W0613,R0911,R0912,R0913,R0915,C0302,C1803,R1710,W0719,R1713,R1705,C0411,W1514,E1136,W0101,W0611,R0904,E1124,R0801,W0706,C0413,R0903,R0402,C0325,C0206,R1724,R1737,R1714,R1702,W0621,C0209,R0914,R1730,W0105,W0603,W0511,C0201 --check-str-concat-over-line-jumps=y