Code for generating a comparison between OSM and ATP, focused on how OSM can be improved - missing shops and tags, skipping dubious data

All The Places <-> OpenStreetMap matcher

Pulls data from ATP and OSM, processes it, matches datasets and reports how OpenStreetMap can be improved using ATP data.

This is an experimental version of the software. If anything is broken: please report it. Currently I am not aware of anyone else using this code, so setup/use/configuration is likely untested.

If you tried to look at it and became confused, or you see some terrible code: please open an issue.

Currently the main effort goes toward making a usable proof of concept and confirming that data obtained with this software is in fact usable, useful and welcome in OpenStreetMap.

But I am also interested in making this software usable by others. I am also interested in hints about which parts are the worst offenders.

Disclaimers

For a disclaimer about data quality, see the main data listing website hosted in this repo. Or see index.html in the website that will be generated by these scripts.

I also still have some licensing worries, see issue #8790 (especially this comment).

See generated listing

main data listing website - hosted in a separate repository as a static site (yes, its structure should be changed - there should be no need to generate such an enormous number of separate files)

Contributing

Working on making it clearer what is going on and on sorting out some licensing issues (pending replies in the #alltheplaces US Slack channel).

For now contributions are welcome but anyone contributing agrees to license their work as

  • AGPL
  • GPL
  • MIT

All kinds of contributions are welcome - from typo fixes and code style improvements to adding missing features and improving performance.

For pull requests more time-consuming than fixing a typo, I would highly recommend opening an issue first to OK/confirm/discuss the change.

If you tried to use this software or tried to modify it but got stuck due to confusing or unclear code or missing documentation - please create an issue.

Setup

Install

apt-get install git python pip # needed to install code and run it, equivalents on your OS are fine
apt-get install curl unzip # used by the running code, specifically these commands are needed
git clone https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data.git
cd list_how_openstreetmap_can_be_improved_with_alltheplaces_data
pip install --requirement requirements.txt # install runtime dependencies
pip install --requirement requirements-dev.txt # install development dependencies

error: externally-managed-environment

you may need to run python3 -m venv ~/venv, use ~/venv/bin/pip instead of pip, and run source ~/venv/bin/activate before running Python scripts

or perform some other workaround to get rid of error: externally-managed-environment

Configuration

Then customize the config files - at a minimum you need to specify where the cache will be:

cp .env-template .env # copy config file template for customization
codium .env # open with whatever text editor you are using, you do not need this specific one

Obtain precached data

You can start your Nominatim cache with the dataset from https://codeberg.org/matkoniecz/nominatim_collected_data - though obviously some query results will be outdated...

You can also start the POI link cache with https://codeberg.org/matkoniecz/link_scan_collected_data

TODO: include a script fetching such data in this repository

cd ~/ATP_matcher_cache # or other folder where config.cache_folder() is pointing
git clone https://codeberg.org/matkoniecz/nominatim_collected_data.git
mv nominatim_collected_data nominatim_cache
git clone https://codeberg.org/matkoniecz/link_scan_collected_data.git
mv link_scan_collected_data url_check_cache

Running code

run in order:

  • 3_delete_old_build_and_output.py
  • 1_obtain_osm_data.py
  • 2_obtain_atp_data.py
  • 4_process_planet_file.py
  • 5_generate_graticule_reports.py
  • 21_list_import_status.py

Note that you can stop 5_generate_graticule_reports.py and start it again - it will reuse cached data rather than generating it from scratch. TODO: which scripts do not support interruptions?

1_obtain_osm_data.py and 2_obtain_atp_data.py will also reuse already obtained data; to use fresh data you will need to run 3_delete_old_build_and_output.py first.

Run tests

You can run them with python3 -m unittest or python3 -m pytest.

Note to self by @matkoniecz: on my computer I have unittest and mypytest aliases set up in my interactive shell.

Funding

This project is funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.

Data flow

Note: if anything is unclear, missing, or leaves you confused or unsure - please open an issue and I will improve the documentation (or restructure the code to make it clearer)

  • Various configuration is in 0_config.py - part of it is set via the .env file or environment variables, see Setup for more info. This configuration data is used to control the behaviour of the program in various ways, for example:
    • Where various caches are stored
    • What kind of issues detected in ATP data will be listed.
  • To compare OpenStreetMap data with another source, we obviously need to download OSM data.
  • ATP data is also downloaded
  • OSM data is filtered, as we care only about locations of shops and shoplike objects. Areas, including multipolygons, are reduced to points to make further processing easier. (See the pyosmium sketch after this list.)
    • Done in 4_process_planet_file.py
    • This processing uses the shops package, built specifically for this project
    • This package in turn uses osmium for processing the dataset
    • This produces a CSV file listing POIs that were detected to be of interest. This data is cached, as generation of this file is time-consuming. For details on the format, refer to the documentation of the shops package.
  • ATP data is also prepared for the matching process
    • Done in 5_generate_graticule_reports.py
    • ATP data is filtered - some entries are skipped altogether; for example, there is some data about specific lamp posts and trees, and such POIs are ignored
    • ATP data is processed (for example some fields are renamed, as part of workarounds for data quality issues)
    • Data is cleaned to throw away bogus data that could be detected. See 0_config.py if you prefer more or fewer logs. The current approach is to report problems to ATP developers (using PRs or issues) and then hide the given type of log output for the specific spider. If there is a backlog for a given issue, the warning about that specific problem is fully hidden.
  • Matching between OSM and ATP data happens in this step (5_generate_graticule_reports.py), with a Match object built for each processed POI from ATP, together with the OSM objects matched to it. This data is saved to .csv files. Matching takes many factors into account. (See the matching sketch after this list.)
  • Also in this step, reports in the form of HTML files are generated (called from 5_generate_graticule_reports.py)
  • link_scan_worker.py - checks whether specified links exist and whether their targets redirect
    • Uses the list of cases where the link cache was missing a link
    • Avoids hammering the same site repeatedly
    • In other places only the cache will be used, and no outbound connections will be made (as the initial run can take a loooooooong time)
  • nominatim_worker.py (see the Nominatim sketch after this list)
    • Runs Nominatim queries that were logged as missing from the cache
    • Respects the Nominatim usage policy even while making many queries in a row
    • In other places only the cache will be used, and no outbound connections will be made (as the initial run can take a loooooooong time)
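
To make the OSM filtering step more concrete, here is a minimal sketch of reducing a planet file to a CSV of shoplike POIs with pyosmium. This is not the actual shops package: the tag filter is oversimplified, areas/multipolygons are not handled, and the file names are assumptions.

import csv
import osmium

class ShopHandler(osmium.SimpleHandler):
    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def node(self, node):
        # keep only nodes that look shoplike - the real filter is far more involved
        if 'shop' in node.tags or 'amenity' in node.tags:
            self.writer.writerow([node.id, node.location.lat, node.location.lon, node.tags.get('name', ''), node.tags.get('shop', node.tags.get('amenity', ''))])

with open('pois.csv', 'w', newline='', encoding='utf-8') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['osm_id', 'lat', 'lon', 'name', 'poi_type'])
    ShopHandler(writer).apply_file('planet.osm.pbf')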
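
And a deliberately naive sketch of the matching idea - the real matcher in matcher.py weighs many more factors; the 100 m radius and the exact-name comparison here are illustrative assumptions only.

import math

def distance_in_meters(lat1, lon1, lat2, lon2):
    # equirectangular approximation - accurate enough at shop-to-shop distances
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y) * 6_371_000

def find_match(atp_poi, osm_pois, search_radius=100):
    def distance_to(osm):
        return distance_in_meters(atp_poi['lat'], atp_poi['lon'], osm['lat'], osm['lon'])
    candidates = [osm for osm in osm_pois if distance_to(osm) <= search_radius]
    # prefer candidates with a matching name, otherwise take the closest one
    named = [osm for osm in candidates if osm.get('name', '').casefold() == atp_poi.get('name', '').casefold()]
    pool = named or candidates
    if not pool:
        return None  # possibly a shop missing from OSM - worth listing in the report
    return min(pool, key=distance_to)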
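
Finally, a minimal Nominatim sketch - not the actual nominatim_worker.py; the example query and User-Agent are made up. The key point is the pause, as the Nominatim usage policy allows at most one request per second.

import json
import time
import urllib.parse
import urllib.request

def geocode(query):
    url = 'https://nominatim.openstreetmap.org/search?' + urllib.parse.urlencode({'q': query, 'format': 'jsonv2'})
    request = urllib.request.Request(url, headers={'User-Agent': 'ATP-matcher-example/0.1'})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

for query in ['Opernring 2, Wien']:  # the real worker iterates over queries logged as missing from the cache
    print(geocode(query))
    time.sleep(1)  # at most one request per second, per the usage policy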

Note that such processing may take several hours or multiple days due to the scale of the OSM dataset, and it requires around 200 GB for caches when processing the global dataset.

This process caches significant amounts of data. As a result, subsequent runs with the same data will be much faster.

Processing should be doable also on old laptops and does not require high-end hardware - this software was developed and run on a laptop with 16 GB RAM and an i7-7600U CPU.

It is currently an experimental proof of concept. Limited effort was put into performance optimization, as it is unclear where the main bottlenecks are. The current priority is producing data of high enough quality to be usable. If you have any specific ideas for how to improve performance - comments or pull requests are highly welcome!

code at https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data

issues about code/data/website at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues

repository with web listings at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset

output published on https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/

Discussion of this project

Similar projects

Design decisions

  • POI scraping is done by a separate project (All The Places)
    • There is no known viable replacement, but in principle a different data source could be used instead
    • There is some ATP-specific code, but it is relatively easy to replace
  • I want this comparison builder to become part of ATP, if the maintainers are interested
    • Though from https://github.com/alltheplaces/alltheplaces/issues/6787 I am not sure whether there is interest
    • Maybe I should make a new, clearer issue or a PR? Or treat no response as rejection?
    • TODO: investigate how ATP is deployed

pyosmium is used

osm2pgsql was considered but it has vastly higher hardware requirements

duckdb_spatial was considered, but at the time of development there was an open game-breaking bug in it, see https://github.com/duckdb/duckdb_spatial/issues/349#issuecomment-2300652683

Overpass API would require at least running my own instance - fetching all shops worldwide will fail on any existing public instance (and would likely fail even if I tried to run it on a dedicated instance)

pyrosm was mentioned, but I have not tried it. Its documentation does not mention planet-scale processing, but it also does not warn against it.

Someone mentioned planetiler; I am not sure whether it is even applicable for such processing.

I considered using osmium and writing filtering code in C. I also considered writing everything in Rust. I have not done either.

Reformat code to follow Python coding standards

autopep8 --in-place --max-line-length=420 --ignore E26 --recursive .

E26 - spaces after #

Note: these suggestions should not be blindly accepted.

PEP 8 -- Style Guide for Python Code

Detect code style issues

See the command at the end of this section.

Applicable in this project; these should not be applied elsewhere.

W0201 - disabling it is maybe a bad idea, but it makes sense in this specific code. It probably should not be disabled in other projects.

E0401 - buggy. I have not tried reporting it, as https://github.com/pylint-dev/pylint/issues/9077 has been stuck at triage since 2023.

Generally applicable

Disables R1702: Too many nested blocks - TODO: reenable it

Disables W0621: Redefining name from outer scope - TODO: reenable it

Detect also TODO_LOW_PRIORITY and TODO_ATP - maybe via https://stackoverflow.com/a/71036231/4130619 or a custom checker (https://pylint.pycqa.org/en/latest/development_guide/how_tos/custom_checkers.html#how-to-write-a-checker)? Maybe add grep? That would also catch log entries, but maybe those are avoidable. A possible fallback is sketched below.
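
A minimal grep-style fallback sketch, pending a proper pylint checker - this is a hypothetical helper, not part of the repository:

import pathlib
import re

marker = re.compile(r'TODO_LOW_PRIORITY|TODO_ATP')
for path in sorted(pathlib.Path('.').rglob('*.py')):
    for line_number, line in enumerate(path.read_text(encoding='utf-8').splitlines(), start=1):
        if marker.search(line):
            print(f'{path}:{line_number}: {line.strip()}')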

E1136 is hopelessly buggy, see https://github.com/pylint-dev/pylint/issues/1498#issuecomment-1872189118

Disable W1514 as such OSes are not supported by me anyway, and it is fixed by https://peps.python.org/pep-0686/ making UTF-8 the default everywhere.

Disables rule C0103 with many false positives (too eager to convert variables into constants).

Disables R0902 as this does not seem to be an actual problem to me.

Disables C0411 as low priority decoration.

Disables C1803 as unwanted syntactic sugar (reconsider after pressing issues are eliminated)

Disables R1705 - as it is unclear what is wrong with else after return

Disables C0301 complaining about long lines (TODO: consider reenabling; see the autopep8 invocation above, which allows long lines).

Disables W0613 complaining about unused arguments. (TODO: reenable? consider)

Disables R0911, R0912, R0913, R0914, R0915, C0302 complaining about complexity/size of code. (TODO: reenable)

Disables C0114, C0115, C0116 asking for docstrings (TODO: reenable)

Disables R0801 as it is a bit of false positive now (TODO: reenable)

Disables C0209: Formatting a regular string which could be an f-string (TODO: maybe reenable)

Disables C0121 complaining about == None (TODO: learn about why it is bad; see the sketch below)
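
For the record, the usual argument against == None: the == operator dispatches to __eq__, which a class may override, while is None checks object identity and cannot be fooled.

class Weird:
    def __eq__(self, other):
        # a class is free to claim equality with anything, including None
        return True

value = Weird()
print(value == None)  # True - misleading, value is a real object
print(value is None)  # False - identity check is reliable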

Disables R1710 asking for explicit returning of None

Disables W0719 asking for more specific exceptions

Disables R1713 as it is overeager and yells about stuff where applying it would damage readability

Disables W0101 as unreachable code is deliberate in some config functions.

Disables W0611 as unused imports are of basically zero importance (TODO: reenable after more important ones are fixed)

Disable R0904 as it complains about too many tests, which is a false positive.

Disable E1124 as it complained about something that was actually helpful.

Disables W0706 as I do not get why it is wrong to code this way or why this report may be helpful.

Disables C0413 as import order is of very low importance.

Disables R0903 as it complains about harmless things.

Disables R0402 as I fail to get why it is an improvement.

Disables C0325 as such parens improve readability.

Disables C0206 as I see no real improvement in readability.

R1724 is overcleaning and will result in bugs if the continue is ever changed. It also results in less clear code.

R1737 - more syntactic sugar, let's ignore it for now. TODO: rethink this

R1714 - minimal if any gains, problems if elements are not hashable

Disable R1730 - I see no improvement here, TODO: recheck and reconsider

Disable W0105 at least for now (as Python has no multiline comments :( so string statements get used as such). Separately, check-str-concat-over-line-jumps = true is enabled (see https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/implicit-str-concat.html) - removing implicit string concatenation, a design bug, was proposed but rejected in https://peps.python.org/pep-3126/ :(

Disable C0201 as the suggested form is more cryptic, for a minor improvement if any

Disable W0603 as limited use of global is not a deal breaker, though ideally it would get fixed

Disable W0511 (notifying about TODOs)

Command

pylint *.py --include-naming-hint=y --variable-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --argument-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --disable=R0917,E0401,W0201,R0902,C0103,C0301,C0114,C0115,C0116,C0121,W0613,R0911,R0912,R0913,R0915,C0302,C1803,R1710,W0719,R1713,R1705,C0411,W1514,E1136,W0101,W0611,R0904,E1124,R0801,W0706,C0413,R0903,R0402,C0325,C0206,R1724,R1737,R1714,R1702,W0621,C0209,R0914,R1730,W0105,W0603,W0511,C0201 --check-str-concat-over-line-jumps=y