<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script>

ABSTRACT

Call graph construction is the foundation of inter-procedural static analysis. However, constructing precise call graphs for Python programs while maintaining high efficiency remains a significant challenge. For instance, the state-of-the-art approach, PyCG, fails to scale to large programs. In our preliminary experiments, it ran out of memory or exceeded the time limit for programs exceeding 2,000 lines of code. This limitation stems from the costly global fixed-point iterations required during analysis. In addition, PyCG is flow-insensitive and does not fully support Python’s dynamic features, which further limits its accuracy.

To overcome these drawbacks, we propose a scalable and precise approach for constructing application-centered call graphs for Python programs, implemented as a prototype named Jarvis. Jarvis maintains a type graph (i.e., the type relations of program identifiers) for each function in a program to enable type inference. Taking a set of functions as input, Jarvis generates the call graph on the fly, alternating flow-sensitive intra-procedural analysis with inter-procedural analysis and performing strong updates along the way. Unlike traditional whole-program approaches (e.g., PyCG) that rely on costly global fixed-point iterations, Jarvis propagates type and call information in a single pass and reuses function-level type relations, achieving precise, flow-sensitive call graph construction without repeated whole-program iteration. Our evaluation on a micro-benchmark of 135 small Python programs and a macro-benchmark of 6 real-world Python applications demonstrates that Jarvis significantly improves on PyCG: at least 67% faster in time, 84% higher in precision, and at least 20% higher in recall.
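To illustrate the flow-sensitivity and strong-update idea mentioned above, here is a toy sketch (not Jarvis's actual data structures) of a function-scoped type graph in which reassignment replaces, rather than unions, an identifier's type set:

```python
# Toy "type graph": each identifier maps to the set of types it may hold.
# A flow-sensitive analysis with strong updates discards stale types on
# reassignment; a flow-insensitive one can only accumulate them.
# Illustrative only -- the class and method names here are hypothetical.

class TypeGraph:
    def __init__(self):
        self.types = {}  # identifier -> set of type names

    def assign(self, target, type_set):
        # Strong update: the previous type set of `target` is discarded.
        self.types[target] = set(type_set)

    def weak_assign(self, target, type_set):
        # Flow-insensitive behavior for comparison: union in new types.
        self.types.setdefault(target, set()).update(type_set)

    def points(self, ident):
        return self.types.get(ident, set())


strong = TypeGraph()
strong.assign("x", {"A"})
strong.assign("x", {"B"})       # strong update: x can now only be B
print(strong.points("x"))        # {'B'}

weak = TypeGraph()
weak.weak_assign("x", {"A"})
weak.weak_assign("x", {"B"})     # weak update keeps the stale type A
print(weak.points("x"))          # {'A', 'B'}
```

The strong update is what lets a flow-sensitive analysis rule out calls through types an identifier no longer holds at a given program point.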

The paper has been submitted to TOSEM. The Jarvis artifact is provided here.

Transfer rules

$$\begin{align*}
&\textbf{Import:}~\textit{from}~m^\prime~\textit{import}~x,~\textit{import}~m^\prime \\
&\frac{t_1=new\_type(m,x),~t_2=new\_type(m^\prime,x),~t_3=t_m,~t_4=new\_type(m^\prime)}{\Delta_{e} \leftarrow \langle t_1, t_2, e\rangle,~\Delta_{e} \leftarrow \langle t_3, t_4, e\rangle}\\
&\textbf{Assign:}~x=y \\
&\frac{t_1=new\_type(x),~t_2=new\_type(y)}{\Delta_{e} \leftarrow \langle t_1, t_2, e\rangle} \\
&\textbf{Store:}~x.field=y\\
&\frac{t_i \in points(x),~t_2 = new\_type(y)}{\Delta_{e} \leftarrow \langle t_i.field, t_2, e\rangle}\\
&\textbf{Load:}~y=x.field\\
&\frac{t_1 = new\_type(y),~t_j \in points(x)}{\Delta_{e} \leftarrow \langle t_1, t_j.field, e \rangle} \\
&\textbf{New:}~y=x(...) \\
&\frac{t_1=new\_type(y),~t_2=new\_type(x)}{\Delta_{e}\leftarrow inter\_analysis(f,e,FTG^f_{e.p}),~\Delta_{e}\leftarrow\langle t_1, t_2, e\rangle} \\
&\textbf{Call:}~a=x.m(...) \\
&\frac{t_1=new\_type(x),~t_2=new\_type(t_1.m),~t_3=new\_type(a)}{\Delta_{call}\leftarrow inter\_analysis(f,e,FTG^f_{e.p}),~\Delta_{call}\leftarrow\langle t_3, t_2.\textit{ret}, e\rangle}\\
&\textbf{Func:}~\textit{def}~m^\prime(args...)~... \\
&\frac{d=new\_type(m^\prime),~t_{1...n}=new\_type(args_{1...n})}{\Delta_{e} \leftarrow \langle d, \varnothing, e \rangle,~F \leftarrow \langle d, args_{1...n} \rangle}\\
&\textbf{Class:}~\textit{class}~cls(base...)~... \\
&\frac{d=new\_type(cls),~base_{1...n}=new\_type(base_{1...n})}{\Delta_{e} \leftarrow \langle d, \varnothing, e \rangle,~C \leftarrow \langle d, base_{1...n} \rangle}\\
&\textbf{Return:}~\textit{def}~m^\prime~...~\textit{return}~x\\
&\frac{t_1=new\_type(m^\prime),~t_2=new\_type(x)}{\Delta_{e} \leftarrow \langle t_1.\textit{ret}, t_2, e \rangle}\\
&\textbf{With:}~\textit{with}~cls()~\textit{as}~f\\
&\frac{t_1=new\_type(cls),~t_2=new\_type(cls.\_\_enter\_\_),~t_3=new\_type(f)}{\Delta_{call}\leftarrow inter\_analysis(f,e,FTG^f_{e.p}),~\Delta_{call}\leftarrow\langle t_3, t_2.\textit{ret}, e\rangle}\\
&\textbf{For:}~\textit{for}~x~\textit{in}~cls()\\
&\frac{t_1=new\_type(cls),~t_2=new\_type(cls.\_\_iter\_\_),~t_3=new\_type(x)}{\Delta_{call}\leftarrow inter\_analysis(f,e,FTG^f_{e.p}),~\Delta_{call}\leftarrow\langle t_3, t_2.\textit{ret}, e\rangle}\\
&\textbf{If:}~\textit{if}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, if, \varnothing\rangle} \\
&\textbf{If-Else:}~\textit{if}~...~\textit{else}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, \langle if,else\rangle, \varnothing\rangle} \\
&\textbf{Elif:}~\textit{elif}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, else, \varnothing\rangle,~\langle Expr, Ctrl, if, \varnothing\rangle} \\
&\textbf{While:}~\textit{while}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, while, \varnothing\rangle} \\
&\textbf{While-Else:}~\textit{while}~...~\textit{else}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, \langle while,else\rangle, \varnothing\rangle} \\
&\textbf{Exception:}~\textit{try}~...~\textit{except}~...\\
&\frac{}{CFG \leftarrow \langle Expr, Ctrl, \langle try,except\rangle, \varnothing\rangle} \\
\end{align*}$$
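As a concrete reading of the Assign, Store, and Load rules, each rule emits edges between type nodes into a per-expression delta. A toy sketch (the names `new_type` and `delta` mirror the notation above; the data structures themselves are hypothetical):

```python
# Toy application of the Assign / Store / Load transfer rules:
# each rule appends <lhs-type, rhs-type> edges to the delta.
# Illustrative only -- a real analysis tracks contexts and program points.

delta = []  # list of (lhs, rhs) type-graph edges

def new_type(name):
    return name  # in this toy, a "type node" is just its qualified name

def on_assign(x, y):               # Assign: x = y
    delta.append((new_type(x), new_type(y)))

def on_store(x_points, field, y):  # Store: x.field = y
    for t in x_points:             # one edge per type x may point to
        delta.append((f"{t}.{field}", new_type(y)))

def on_load(y, x_points, field):   # Load: y = x.field
    for t in x_points:
        delta.append((new_type(y), f"{t}.{field}"))

on_assign("a", "b")                # a = b
on_store({"C"}, "f", "a")          # c.f = a, where c points to C
on_load("z", {"C"}, "f")           # z = c.f
print(delta)  # [('a', 'b'), ('C.f', 'a'), ('z', 'C.f')]
```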

Dataset and Ground truth

The micro-benchmark and macro-benchmark are provided in the dataset and ground_truth directories.

Getting Jarvis to run

Prerequisites:

  • Python 3.8
  • PyCG: tool/PyCG
  • Jarvis: tool/Jarvis

To run Jarvis, invoke jarvis_cli.py.

Jarvis usage:

$ python3 tool/Jarvis/jarvis_cli.py [module_path1 module_path2 module_path3...] [--package] [--decy] [-o output_path]

Jarvis help:

$ python3 tool/Jarvis/jarvis_cli.py -h
  usage: jarvis_cli.py [-h] [--package PACKAGE] [--decy] [--precision]
                       [--moduleEntry [MODULEENTRY ...]]
                       [--operation {call-graph,key-error}] [-o OUTPUT]
                       [module ...]

  positional arguments:
    module                modules to be processed, which are also application entries in A.W. mode 

  options:
    -h, --help            show this help message and exit
    --package PACKAGE     Package containing the code to be analyzed
    --decy                whether analyze the dependencies
    --precision           whether flow-sensitive
    --entry-point [MODULEENTRY ...]
                          Entry functions to be processed
    -o OUTPUT, --output OUTPUT
                          Output call graph path

Example 1: analyze bpytop.py in E.A. mode.

$ python3 tool/Jarvis/jarvis_cli.py dataset/macro_benchmark/pj/bpytop/bpytop.py --package dataset/macro_benchmark/pj/bpytop -o jarvis.json

Example 2: analyze bpytop.py in A.W. mode. Note that all dependencies must be installed in the virtual environment.

# create and activate a virtualenv environment
$ virtualenv venv --python=python3.8
$ source venv/bin/activate
# install dependencies in the virtualenv environment
$ python3 -m pip install psutil
# run Jarvis
$ python3 tool/Jarvis/jarvis_cli.py dataset/macro_benchmark/pj/bpytop/bpytop.py --package dataset/macro_benchmark/pj/bpytop --decy -o jarvis.json
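The resulting jarvis.json can then be inspected. Assuming it follows PyCG's JSON convention of mapping each fully qualified function name to a list of callees (an assumption; check your actual output), a small query helper might look like:

```python
import json
import os
import tempfile

def callees(cg_path, func):
    """Return the direct callees of `func` in a PyCG-style JSON call graph."""
    with open(cg_path) as f:
        cg = json.load(f)
    return cg.get(func, [])

# Demo with a synthetic call graph standing in for a real jarvis.json;
# the function names below are made up for illustration.
sample = {"bpytop.main": ["bpytop.Config.__init__", "psutil.cpu_percent"]}
path = os.path.join(tempfile.mkdtemp(), "jarvis.json")
with open(path, "w") as f:
    json.dump(sample, f)

print(callees(path, "bpytop.main"))
# ['bpytop.Config.__init__', 'psutil.cpu_percent']
```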

Evaluation

RQ1 and RQ2 Setup

cd to the root directory of the unzipped files.

# 1. run micro_benchmark
$ ./reproducing_RQ12_setup/micro_benchmark/test_All.sh
# 2. run macro_benchmark
$ ./reproducing_RQ12_setup/macro_benchmark/pycg_EA.sh
#     PyCG iterates once
$ ./reproducing_RQ12_setup/macro_benchmark/pycg_EW.sh 1
#     PyCG iterates twice
$ ./reproducing_RQ12_setup/macro_benchmark/pycg_EW.sh 2
#     PyCG iterates to convergence 
$ ./reproducing_RQ12_setup/macro_benchmark/pycg_EW.sh
$ ./reproducing_RQ12_setup/macro_benchmark/jarvis_AA.sh
$ ./reproducing_RQ12_setup/macro_benchmark/jarvis_EA.sh
$ ./reproducing_RQ12_setup/macro_benchmark/jarvis_AW.sh

RQ1. Scalability Evaluation

Scalability results

Run

$ python3 ./reproducing_RQ1/gen_table.py

The results are shown below:

scalability

AGs and FTGs

Run

$ pip3 install matplotlib
$ pip3 install numpy
$ python3 ./reproducing_RQ1/FTG/plot.py

The generated graphs are pycg-ag.pdf, pycg-change-ag.pdf and jarvis-ftg.pdf, which correspond to Fig. 9a, Fig. 9b and Fig. 10, respectively.

RQ2. Accuracy Evaluation

Accuracy results

Run

$ python3 ./reproducing_RQ2/gen_table.py     

The generated results:

accuracy

Comparing to Code2Flow

Scalability results (RQ1), where AE denotes AssertionError:

scalability

Accuracy results (RQ2):

accuracy

Case Study: Fine-grained Tracking of Vulnerable Dependencies

The 43 Python projects selected from the top 200 highly-starred projects are listed in the file.

1. Target projects

Fastapi, Httpie, Scrapy, Lightning, Airflow, Sherlock, Wagtail

2. Vulnerable libraries in Top 10 dependencies

The CVEs of html, numpy, lxml, and psutil are not related to Python code, so we do not consider them.

3. Vulnerable projects using dependency analysis

sherlock
- sherlock.sherlock
  - requests(v2.28.0)
    - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- sherlock.sites
  - requests(v2.28.0)
    - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
airflow
- airflow.kubernetes.kube_client
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- airflow.providers.cncf.kubernetes.operators.pod
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- airflow.providers.cncf.kubernetes.utils.pod_manager
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- airflow.executors.kubernetes_executor
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
......
wagtail
- wagtail.contrib.frontend_cache.backends
  - requests(v2.28.0)
    - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
Httpie
- httpie.client
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- httpie.ssl_
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
- httpie.models
  - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
Scrapy
- scrapy.downloadermiddlewares.cookies
  - tldextract(v3.4.4)
    - requests(v2.28.0)
      - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
Lightning
- lightning.app.utilities.network
  - requests(v2.28.0)
    - urllib3(v1.26.0) ---- [CVE-2021-33503,CVE-2019-11324,CVE-2019-11236,CVE-2020-7212]
...
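The dependency-level analysis above amounts to a transitive reachability query over the dependency graph. A sketch with toy data (the graph and CVE map below are illustrative stand-ins, not the study's real inputs):

```python
# Walk the dependency graph transitively and report every path that
# reaches a package with known CVEs. Toy data modeled on the sherlock
# entry above; real analysis would read lockfiles and a CVE database.

deps = {
    "sherlock.sherlock": ["requests"],
    "requests": ["urllib3"],
    "urllib3": [],
}
cves = {"urllib3": ["CVE-2021-33503", "CVE-2019-11324"]}

def vulnerable_paths(graph, start, cve_map):
    """Return (path, cve_list) for every dependency path ending in a vulnerable package."""
    found, stack = [], [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in cve_map:
            found.append((path, cve_map[node]))
        for dep in graph.get(node, []):
            if dep not in path:          # avoid dependency cycles
                stack.append(path + [dep])
    return found

print(vulnerable_paths(deps, "sherlock.sherlock", cves))
# [(['sherlock.sherlock', 'requests', 'urllib3'],
#   ['CVE-2021-33503', 'CVE-2019-11324'])]
```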

4. Vulnerable projects using method-level invocation analysis

Fastapi

According to the patch commit, the vulnerable code of CVE-2021-33503 in urllib3 resides in the urllib3.util.url module.

Below is the method-level invocation path:

Httpie
- httpie.adapters.<main>
  - requests.adapters.<main>
    - urllib3.contrib.socks.<main>
      - urllib3.util.url.<main> ---- CVE-2021-33503
Scrapy
- scrapy.downloadermiddlewares.cookies.<main>
  - tldextract.__init__.<main>
    - tldextract.tldextract.<main>
      - tldextract.suffix_list.<main>
        - requests_file.<main>
          - requests.adapters.<main>
            - urllib3.util.url.<main> ---- CVE-2021-33503
Lightning
- lightning.app.utilities.network.<main>
  - requests.adapters.<main>
    - urllib3.contrib.socks.<main>
      - urllib3.util.url.<main> ---- CVE-2021-33503
Airflow
- airflow.providers.amazon.aws.hooks.base_aws.BaseSessionFactory._get_idp_response
  - requests.adapters.<main>
    - urllib3.contrib.socks.<main>
      - urllib3.util.url.<main> ---- CVE-2021-33503
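The method-level check above is a path search in the call graph from a project entry point to the patched vulnerable function. A sketch using a toy call graph modeled on the Httpie path above:

```python
# BFS over a caller -> callees call graph to find the shortest invocation
# path from an entry point to the vulnerable function. The graph below is
# a toy mirroring the Httpie chain above, not Jarvis's real output.

from collections import deque

cg = {
    "httpie.adapters.<main>": ["requests.adapters.<main>"],
    "requests.adapters.<main>": ["urllib3.contrib.socks.<main>"],
    "urllib3.contrib.socks.<main>": ["urllib3.util.url.<main>"],
}

def invocation_path(graph, entry, target):
    """Return the shortest caller->callee path from `entry` to `target`, or None."""
    queue = deque([[entry]])
    seen = {entry}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for callee in graph.get(path[-1], []):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None

print(invocation_path(cg, "httpie.adapters.<main>", "urllib3.util.url.<main>"))
# ['httpie.adapters.<main>', 'requests.adapters.<main>',
#  'urllib3.contrib.socks.<main>', 'urllib3.util.url.<main>']
```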

PS:

`<main>` represents the module-level (body) code block of a Python file, since Python does not require an explicit entry function.

Acknowledgements

Our artifact reuses part of the functionality of third-party tools, i.e., PyCG.

Vitalis Salis et al. PyCG: Practical Call Graph Generation in Python. In Proceedings of the 43rd International Conference on Software Engineering (ICSE), 25-28 May 2021.