gh-146256: Add --jsonl collector to the profiling.sampling#146257
gh-146256: Add --jsonl collector to the profiling.sampling#146257maurycy wants to merge 40 commits intopython:mainfrom
--jsonl collector to the profiling.sampling#146257Conversation
--ndjson flag to the profiling.sampling--jsonl flag to the profiling.sampling
--jsonl flag to the profiling.sampling--jsonl collector to the profiling.sampling
…ktor * upstream/main: (119 commits) pythongh-144270: Make SubElement parent and tag positional-only (pythonGH-144845) pythongh-146558: JIT optimize dict access for objects with known hash (python#146559) pythongh-139922: always run MSVC 64-bit tail-calling CI (pythonGH-146570) pythongh-126835: Fix _PY_IS_SMALL_INT() macro (python#146631) pythongh-146587: fix type slot assignment incase of multiple slots for same name (python#146593) pythongh-138122: Add differential flame graph (python#145785) pythongh-146416: Emscripten: Improve standard stream handling in node_entry.mjs (python#146417) pythongh-146444: Don't package as part of iOS 'build hosts' target (python#146628) pythongh-138850: Add --disable-epoll to configure (pythonGH-145768) pythongh-146444: Make Platforms/Apple/ compatible with Python 3.9 (python#146624) pythongh-138577: Fix keyboard shortcuts in getpass with echo_char (python#141597) pythongh-146556: Fix infinite loop in annotationlib.get_annotations() on circular __wrapped__ (python#146557) pythongh-146579: _zstd: Fix decompression options dict error message (python#146577) pythongh-146083: Upgrade bundled Expat to 2.7.5 (python#146085) pythongh-146080: fix a crash in SNI callbacks when the SSL object is gone (python#146573) pythongh-146090: fix memory management of internal `sqlite3` callback contexts (python#146569) pythongh-145876: Do not mask KeyErrors raised during dictionary unpacking in call (pythonGH-146472) pythongh-146004: fix test_args_from_interpreter_flags on windows (python#146580) pythongh-139003: Use frozenset for module level attributes in _pyrepl.utils (python#139004) pythonGH-146527: Add more data to GC statistics and add it to PyDebugOffsets (python#146532) ...
|
@ivonastojanovic can you take a look? |
|
@ivonastojanovic @pablogsal Thank you. Please note that I've started adding test coverage, so it might be worth waiting a day with a proper review (it's already interesting: confused myself with I will mark it as Ready for review immediately. Perhaps #146256 and #145464 are the best places to discuss the format and the ideas. |
* main: pythongh-145458: use `self.skip_idle` consistently in the tachyon profiler (python#145459) pythongh-146615: Fix format specifiers in Objects/ directory (pythonGH-146620) pythongh-146615: Fix format specifiers in Python/ directory (pythonGH-146619) pythongh-146615: Fix format specifiers in test cextensions (pythonGH-146618) pythongh-146615: Fix format specifiers in extension modules (pythonGH-146617) pythongh-146615: Fix crash in __get__() for METH_METHOD descriptors with invalid type argument (pythonGH-146634) pythongh-146376: Reduce timeout in Emscripten GHA workflow (python#146378) pythongh-146442: Fix various bugs in compiler pipeline (python#146443) pythongh-146238: Support half-floats in the array module (python#146242) pythongh-145056: Add support for merging collections.UserDict and frozendict (pythonGH-146465) pythongh-145056: Fix merging of collections.OrderedDict and frozendict (pythonGH-146466) pythongh-139633: Run netrc file permission check only once per parse (pythonGH-139634)
| from .stack_collector import StackTraceCollector | ||
|
|
||
|
|
||
| _CHUNK_SIZE = 256 |
There was a problem hiding this comment.
More interesting than it looks!
The collector is not streaming yet. The values I found are on par with 256-512:
- https://github.com/DataDog/dd-trace-py/blob/32a7d167b7628589ae1605aa5165ab4290836da2/ddtrace/internal/settings/_config.py#L454
- https://github.com/open-telemetry/opentelemetry-specification/blob/3a4cba26572558609bdcb51dfcbb2d8259085387/specification/trace/sdk.md#L1111-L1114
- https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html
- https://github.com/elastic/elasticsearch-py/blob/dc2f5f0662bc47e9c3072eddcbeb8564d7a766da/elasticsearch/helpers/actions.py#L421-L424
| if (frame_id := self._frame_to_id.get(frame_key)) is not None: | ||
| return frame_id | ||
|
|
||
| frame_id = len(self._frames) + 1 |
| if (string_id := self._string_to_id.get(value)) is not None: | ||
| return string_id | ||
|
|
||
| string_id = len(self._strings) + 1 |
| if location is None: | ||
| return DEFAULT_LOCATION | ||
| if isinstance(location, int): | ||
| return (location, location, -1, -1) |
There was a problem hiding this comment.
This is now handled per collector:
- https://github.com/python/cpython/blob/main/Lib/profiling/sampling/live_collector/collector.py#L319
- https://github.com/python/cpython/blob/main/Lib/profiling/sampling/pstats_collector.py#L31
- https://github.com/python/cpython/blob/main/Lib/profiling/sampling/gecko_collector.py#L470-L478
- https://github.com/python/cpython/blob/main/Lib/profiling/sampling/gecko_collector.py#L274-L279
I don't know if it's necessary but I simply didn't want to make https://github.com/python/cpython/pull/146257/changes#diff-58ccdb8421c89943862c73d1cbeae3e961873b55ed2adb7efc875dafd549c01bR177 too defensive.
| location | ||
| ) | ||
|
|
||
| fields = {"line": lineno} |
There was a problem hiding this comment.
Should it be -1 or 0 for synthetic?
To quote the Markdown:
lineno = -1: Synthetic frame (no source location)
On the other hand, DEFAULT_LINE settled on 0:
https://github.com/python/cpython/blob/main/Lib/profiling/sampling/constants.py#L24
This is the reason for adding the synthetic here:
| def _create_collector(format_type, sample_interval_usec, skip_idle, opcodes=False, | ||
| output_file=None, compression='auto', diff_baseline=None): | ||
| mode=None, output_file=None, compression='auto', diff_baseline=None): |
There was a problem hiding this comment.
This is already very complex. The collector constructor signature supports all collectors at once.
I've added mode for the purpose of meta but I don't think this scales for other meta.
(Truth to be told, I think that that complex signatures are also the underlying reason for the issue fixed by #145459)
| } | ||
|
|
||
|
|
||
| class JsonlCollector(StackTraceCollector): |
There was a problem hiding this comment.
Maybe the collectors should be separate from renderers?
| "v": 1, | ||
| "run_id": self.run_id, | ||
| "kind": "frame", | ||
| "scope": "final", |
There was a problem hiding this comment.
The very big thing here is ensuring that the format is future-proof. That's the reason for v.
For example, in the future: "window" for streaming, including timestamps.
|
@ivonastojanovic @pablogsal Done; ready for review. Thank you. |
This PR adds
--jsonldiscussed in #146256.The aim is to introduce a subset of JSONL format that will be also used in streaming. I made some decisions but highlighted possible questions in the above PR.
The class is below 2**8 lines of code and does not touch existing
profiling.samplingcode, so I took a leap.Usage
macOS:
sudo -E \ uv run \ --python /Users/maurycy/src/github.com/maurycy/cpython/python.exe \ python \ -m profiling.sampling \ run \ --jsonl \ -o /tmp/profile.jsonl /tmp/hello_world.pyWhere
/tmp/hello_world.pycould be:Visual Studio Code Extension
For the purpose of demonstrating the
--jsonlusefulness, I have vibe-coded (with Claude Code) a simple VSCode Extension (only that) that displays a JSONL profile in the editor:I think that, once we have
--streamit will be much more exciting.Apart from headless profilers: updating the real-time hot spots from the production in VSCode, or, well, making agents' life easier.
You can fetch the vibe-coded VSCode Extension here (no guarantees):
Or:
Please do not forget about removing
~/.vscode/extensions/profiling-heatmap/after tests.--jsonlflag to theprofiling.sampling#146256