All The Places <-> OpenStreetMap matcher
Pulls data from ATP and OSM, processes it, matches datasets and reports how OpenStreetMap can be improved using ATP data.
This is an experimental version of the software. If anything is broken: please report it. Currently I am not aware of anyone else using this code, so setup/use/configuration is likely untested.
If you tried to look at it and became confused, or you see some terrible code: please open an issue.
Currently the main effort goes toward making a usable proof of concept and confirming that data obtained with this software is in fact usable, useful and welcome in OpenStreetMap.
But I am also interested in making this software usable by others, and in hints about which parts are the worst offenders.
Disclaimers
For a disclaimer about data quality, see the main data listing website, or see index.html in the website that will be generated by these scripts.
I also still have some licensing worries, see issue #8790 (especially this comment).
See generated listing
hosted in a separate repository as a static site (yes, its structure should be changed - there should be no need to generate such an enormous number of separate files)
Contributing
I am working on making it more clear what is going on and sorting out some licensing issues (pending replies in the #alltheplaces US Slack channel).
For now contributions are welcome, but anyone contributing agrees to license their work as:
- AGPL
- GPL
- MIT
All kinds of contributions are welcome - from typo fixes and code style improvements to improving performance and adding missing features.
For pull requests more time-consuming than fixing a typo, I would highly recommend opening an issue first to OK/confirm/discuss the change.
If you tried to use this software or tried to modify it but you are stuck due to confusing or unclear code or missing documentation - please create an issue.
Setup
git clone https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data.git
cd list_how_openstreetmap_can_be_improved_with_alltheplaces_data
pip install --requirement requirements.txt # install dependencies
pip install --requirement requirements-dev.txt # install development dependencies
and then customize the config files - at minimum you need to specify where the cache will be stored
cp .env-template .env # copy config file template for customization
codium .env # open with whatever text editor you are using, you do not need this specific one
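The configuration is read by 0_config.py from the .env file (or from environment variables). Below is a minimal sketch of how such a value can be read in Python, assuming a hypothetical CACHE_DIR variable and the python-dotenv package - the real variable names are listed in .env-template:

```python
# Minimal sketch, assuming a hypothetical CACHE_DIR variable -
# check .env-template and 0_config.py for the real names.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
cache_dir = os.environ.get("CACHE_DIR", os.path.expanduser("~/.cache/atp_osm_matcher"))
print("caches will be stored in", cache_dir)
```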
Running code
run in order:
- 1_obtain_osm_data.py
- 2_obtain_atp_data.py
- 3_delete_old_build_and_output.py
- 4_process_planet_file.py
- 5_generate_graticule_reports.py
Note that you can stop 5_generate_graticule_reports.py and start it again - it will reuse cached data rather than regenerating it from scratch. To get rid of what was generated already, run 3_delete_old_build_and_output.py.
1_obtain_osm_data.py and 2_obtain_atp_data.py will also reuse already obtained data; to use new data you will need to reset the cached files.
Run tests
You can also run them with python3 -m unittest or python3 -m pytest (note to self by @matkoniecz - I have unittest and mypytest aliases in an interactive shell).
Funding
This project is funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.
Data flow
note: if anything is unclear, missing, or leaves you confused or unsure - please open an issue and I will improve the documentation (or restructure the code to make it more clear)
- Various configuration is in 0_config.py - part of it is set with the .env file or environment variables, see setup for more info. This configuration data is used to amend and control the behaviour of the program in various ways:
  - Where various caches are stored
  - What kind of issues detected in ATP data will be listed
- To compare OpenStreetMap data with another source, we obviously need to download OSM data.
  - Done in 1_obtain_osm_data.py
  - At this stage the raw OSM data is in .pbf files and not processed at all
- ATP data is also downloaded
  - Done in 2_obtain_atp_data.py
  - The ATP data is unpacked and otherwise remains unchanged
- OSM data is filtered, as we care only about locations of shops and shoplike objects. Areas, including multipolygons, are reduced to points to make further processing easier.
  - Done in 4_process_planet_file.py
  - This processing uses the shops package, built specifically for this project (see the pyosmium sketch after this list)
  - This package in turn uses osmium for processing the dataset
  - This produces a CSV file listing POIs that were detected to be of interest. This data is cached, as generation of this file is time-consuming. For details on the format, refer to the documentation for the shops package.
- ATP data is also prepared for the matching process
  - Done in 5_generate_graticule_reports.py
  - ATP data is filtered - some entries are ignored altogether, for example there is some data about specific lamp posts and trees. Such POIs are skipped.
  - ATP data is processed (for example some fields are renamed, as part of workarounds for data quality issues)
  - Data is cleaned to throw away bogus data that could be detected. See 0_config.py if you prefer more or fewer logs. The current approach is to report problems to ATP developers (using PRs or issues) and then hide the given type of log for the specific spider. If there is a backlog for a given issue, the warning about that specific problem is fully hidden.
  - Matching between OSM and ATP data happens in this step (5_generate_graticule_reports.py), with a Match object built for each processed POI from ATP, together with the OSM objects matched to it. This data is saved to .csv files. Matching takes many factors into account (see the illustrative sketch after this list).
  - Also in this step, reports in the form of HTML files are generated (called from 5_generate_graticule_reports.py)
- link_scan_worker.py - checks whether the specified links exist and whether their targets redirect (see the sketch after this list)
  - Uses a list of cases where the link cache was missing a link
  - Avoids hammering the same site repeatedly
  - In other places only the cache will be used, no outbound connections will be made (as the initial run can take a loooooooong time)
- nominatim_worker.py
  - Runs Nominatim queries that were logged as missing in the cache
  - Respects the Nominatim usage policy even while making many queries in a row
  - In other places only the cache will be used, no outbound connections will be made (as the initial run can take a loooooooong time)
  - You can start your cache with the dataset from https://codeberg.org/matkoniecz/nominatim_collected_data - though obviously some query results will be outdated...
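The heavy OSM processing lives in the shops package, so the exact code differs, but a minimal pyosmium sketch of the general idea (extracting shop-like nodes from a .pbf file into a CSV) could look like the following - the file names and the tag selection here are illustrative assumptions, not the project's actual filter:

```python
# Minimal sketch, not the actual shops package: extract shop nodes from a .pbf
# file into a CSV. The real processing also reduces ways/multipolygons to points.
import csv

import osmium  # pyosmium


class ShopNodeHandler(osmium.SimpleHandler):
    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def node(self, n):
        shop = n.tags.get("shop")
        if shop is None:
            return
        self.writer.writerow([n.id, n.location.lat, n.location.lon, shop, n.tags.get("name", "")])


with open("osm_shop_nodes.csv", "w", newline="", encoding="utf-8") as f:  # illustrative output path
    writer = csv.writer(f)
    writer.writerow(["osm_id", "lat", "lon", "shop", "name"])
    ShopNodeHandler(writer).apply_file("planet.osm.pbf")  # illustrative input path
```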
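The actual matching logic lives in matcher.py and takes many factors into account. As a purely illustrative sketch (not the project's algorithm), weighing distance against name similarity for a single ATP entry and one OSM candidate might look roughly like this, with made-up data and a hypothetical distance threshold:

```python
# Purely illustrative sketch of distance + name-similarity scoring,
# not the logic actually implemented in matcher.py.
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def match_score(atp, osm, max_distance_m=300):
    distance = haversine_m(atp["lat"], atp["lon"], osm["lat"], osm["lon"])
    if distance > max_distance_m:
        return None  # too far away to be considered a match
    name_similarity = SequenceMatcher(None, atp["name"].lower(), osm["name"].lower()).ratio()
    # closer and more similarly named candidates score higher
    return name_similarity * (1 - distance / max_distance_m)


atp_entry = {"name": "Example Cafe", "lat": 50.0600, "lon": 19.9400}  # made-up data
osm_candidate = {"name": "Example Café", "lat": 50.0601, "lon": 19.9401}
print(match_score(atp_entry, osm_candidate))
```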
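link_scan_worker.py additionally caches results and rate-limits its requests, but as a rough sketch of the core check (does a link exist, and does it redirect), a single request with the standard requests library might look like this - the URL and timeout are illustrative:

```python
# Rough sketch of checking whether a link works and whether it redirects;
# the real link_scan_worker.py additionally caches results and rate-limits requests.
import requests


def check_link(url, timeout=10):
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=False)
    except requests.RequestException as e:
        return {"url": url, "ok": False, "error": str(e)}
    return {
        "url": url,
        "ok": response.status_code < 400,
        "status_code": response.status_code,
        "redirects_to": response.headers.get("Location"),  # None if not a redirect
    }


print(check_link("https://www.openstreetmap.org"))  # illustrative URL
```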
Note that such processing may take several hours or multiple days due to the scale of the OSM dataset, and requires 200 GB of cache space when processing the global dataset.
This process caches significant amounts of data. As a result, subsequent runs with the same data will be much faster.
Processing should also be doable on old laptops and does not require high-end hardware - this software was developed and run on a laptop with 16 GB RAM and an i7-7600U CPU.
It is currently an experimental proof of concept. Limited effort was put into performance optimization, as it is unclear where the main bottlenecks are. The current priority is producing data of high enough quality to be usable. If you have any specific ideas on how to improve performance - comments or pull requests are highly welcome!
Links
code at https://codeberg.org/matkoniecz/list_how_openstreetmap_can_be_improved_with_alltheplaces_data
issues about code/data/website at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues
repository with web listings at https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset
output published on https://matkoniecz.codeberg.page/improving_openstreetmap_using_alltheplaces_dataset/
installation: install dependencies from requirements.txt
- maybe also requirements-dev.txt
Discussion of this project
- https://community.openstreetmap.org/t/improving-openstreetmap-shop-coverage-with-alltheplaces/119979
Similar projects
- https://gitlab.com/atpsync - an attempt to achieve the same goal, with different design decisions; also at an early experimental stage
- https://github.com/02JanDal/osm-bjk - more complex, database-based
- http://sk53-osm.blogspot.com/2014/05/fuzzy-ideas-on-fuzzy-matching.html - someone planning a similar project
Design decisions
- POI scraping is done by a separate project (All The Places)
  - There is no known viable replacement, but in principle a different data source could be used instead
  - There is some ATP-specific code but it is relatively easy to replace
- I want this comparison builder to become part of ATP, if the maintainers are interested
  - Though from https://github.com/alltheplaces/alltheplaces/issues/6787 I am not sure whether there is interest
  - Maybe I should make a new, more clear issue or a PR? Or treat no response as rejection?
  - TODO: investigate how ATP is deployed
- pyosmium is used
- osm2pgsql was considered but it has vastly higher hardware requirements
- duckdb_spatial was considered but at the time of development there was an open game-breaking bug in it, see https://github.com/duckdb/duckdb_spatial/issues/349#issuecomment-2300652683
- Overpass API would require at least running my own instance - fetching all shops worldwide will fail on any existing public instance (and likely would also fail if I tried to run it on a dedicated instance)
- pyrosm was mentioned, but I have not tried it. They do not mention planet-scale processing, but they also do not warn against it.
- Someone mentioned planetiler, I am not sure whether it is even applicable for such processing.
- I considered using osmium and writing the filtering code in C. I also considered writing everything in Rust. I have not done either.
Reformat code to follow Python coding standards
autopep8 --in-place --max-line-length=420 --recursive .
Note: these suggestions should not be blindly accepted.
PEP 8 -- Style Guide for Python Code
Detect code style issues
See the command at the end of this section.
Applicable in this project, should not be applied elsewhere:
W0201 - maybe a bad idea, but it makes sense in this specific code. Probably should not be disabled in other projects.
E0401 - buggy. I have not tried reporting it, as https://github.com/pylint-dev/pylint/issues/9077 has been stuck at triage since 2023.
Generally applicable
Disables R1702: Too many nested blocks - TODO: reenable it
Disables W0621: Redefining name from outer scope - TODO: reenable it
Detect also TODO_LOW_PRIORITY, TODO_ATP - maybe https://stackoverflow.com/a/71036231/4130619 https://pylint.pycqa.org/en/latest/development_guide/how_tos/custom_checkers.html#how-to-write-a-checker ? Maybe add grep? That would also catch log entries, but maybe those are avoidable.
E1136 is hopelessly buggy, see https://github.com/pylint-dev/pylint/issues/1498#issuecomment-1872189118
Disables W1514 as such OSes are not supported by me anyway, and it is fixed by https://peps.python.org/pep-0686/ making UTF-8 the default everywhere.
Disables rule C0103 with many false positives (too eager to convert variables into constants).
Disables R0902 as this does not seem to be an actual problem to me.
Disables C0411 as low priority decoration.
Disables C1803 as unwanted syntactic sugar (reconsider after pressing issues are eliminated)
Disables R1705 - as unclear what is wrong with else after return
Disables C0301 complaining about long lines (TODO: reenable? consider, see autopep8 allowing long lines above).
Disables W0613 complaining about unused arguments. (TODO: reenable? consider)
Disables R0911, R0912, R0913, R0914, R0915, C0302 complaining about complexity/size of code. (TODO: reenable)
Disables C0114, C0115, C0116 asking for docstrings (TODO: reenable)
Disables R0801 as it is a bit of false positive now (TODO: reenable)
Disables C0209: Formatting a regular string which could be an f-string (TODO: maybe reenable)
Disables C0121 complaining about == None (TODO: learn about why it is bad)
Disables R1710 asking for explicit returning of None
Disables W0719 asking for more specific exceptions
Disables R1713 as it is overeager and yells about stuff where applying it would damage readability.
Disables W0101 as unreachable code is deliberate in some config functions.
Disables W0611 as unused imports are of basically zero importance (TODO: reenable after more important ones are fixed)
Disable R0904 as it complains about too many tests, which is a false positive.
Disable E1124 as it complained about something helpful.
Disables W0706 as I do not get why it is wrong to code this way or why the report may be helpful.
Disables C0413 as import order is at most really unimportant.
Disables R0903 as it complains about harmless things.
Disables R0402 as I fail to get why it is an improvement.
Disables C0325 as such parens improve readability.
Disables C0206 as I see no real improvement in readability.
R1724 is overcleaning and will result in bugs if the continue is ever changed. And it results in less clear code.
R1737 - more syntactic sugar, let's ignore it for now. TODO: rethink this
R1714 - minimal if any gains, problems if elements are not hashable
Disables R0914: Too many local variables - TODO: reenable it.
Disable R1730 - I see no improvement here, TODO: recheck and reconsider
Disable W0105 at least for now (as Python has no multiline comments :( )
check-str-concat-over-line-jumps = true - see https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/implicit-str-concat.html; this design bug was proposed to be removed but that was rejected in https://peps.python.org/pep-3126/ :(
Command
pylint *.py --include-naming-hint=y --variable-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --argument-rgx="^[a-z][a-z0-9]*((_[a-z0-9]+)*)?$" --disable=E0401,W0201,R0902,C0103,C0301,C0114,C0115,C0116,C0121,W0613,R0911,R0912,R0913,R0915,C0302,C1803,R1710,W0719,R1713,R1705,C0411,W1514,E1136,W0101,W0611,R0904,E1124,R0801,W0706,C0413,R0903,R0402,C0325,C0206,R1724,R1737,R1714,R1702,W0621,C0209,R0914,R1730,W0105 --check-str-concat-over-line-jumps=y