This project uses Python, Pandas, Matplotlib, and Gitpod to analyze job data available in CSV format.
Pandas is a powerful data manipulation and analysis library for Python. It provides data structures and functions to efficiently handle and analyze structured data.
With Gitpod, you can easily set up the project environment and start using these libraries for your data analysis.
Before getting started, make sure you have the following prerequisites:
- Git: Git is a version control system that allows you to track changes in your code and collaborate with others. If you don't have Git installed, you can download and install it from the official website: Git Downloads.
- Gitpod account: Gitpod is an online integrated development environment (IDE) that makes it easy to set up and work on projects without the need to install any software locally on your computer. With Gitpod, you can quickly start coding in a fully-featured development environment right from your web browser. To get started, you'll need to sign up for a free Gitpod account on the Gitpod website: Gitpod Sign Up
Once you have Git and a Gitpod account set up, you'll be ready to fork and work on projects using Gitpod's online IDE.
To start using this project in Gitpod, follow these steps:
- Fork the repository by clicking the "Fork" button in the top right corner of the repository page. This creates a copy of the repository in your GitHub account.
- Copy the URL below, replace `YOUR_USERNAME` with your GitHub username, and open it in your browser to launch this repository in Gitpod: https://gitpod.io/#https://github.com/YOUR_USERNAME/hackathon-2023.data-analysis.base-example
- Gitpod will open the project in a new workspace, providing you with an integrated development environment (IDE) for writing and running code.
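If you prefer not to edit the URL by hand, the substitution can be sketched in a few lines of Python. The function name `gitpod_url` is a hypothetical helper, not part of this repository; the URL pattern is the one given in the steps above.

```python
GITPOD_PREFIX = "https://gitpod.io/#"
REPO_NAME = "hackathon-2023.data-analysis.base-example"


def gitpod_url(username: str) -> str:
    """Return the Gitpod launch URL for a user's fork of this repository."""
    username = username.strip()
    if not username:
        raise ValueError("username must not be empty")
    return f"{GITPOD_PREFIX}https://github.com/{username}/{REPO_NAME}"


# Replace YOUR_USERNAME with your GitHub username before opening the URL.
print(gitpod_url("YOUR_USERNAME"))
```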
To use the Pandas and Matplotlib libraries for data analysis, follow these steps:
- When you open the Gitpod workspace for the first time, the terminal will already be running a command. This is expected: it is installing the required Python packages.
- In the Gitpod workspace terminal, type the command below and press Enter to run the `import_job_posting_data.py` script: `python import_job_posting_data.py`
  This command calls a URL, downloads the job posting data into your Gitpod environment as a CSV file, and stores it in a newly created `assets` folder.
- Open the `data_analysis.py` file in the IDE interface provided by Gitpod. You can find it in the "Explorer" section of the project files (top right of the IDE). This file contains the data analysis steps.
- In the Gitpod terminal, run the `data_analysis.py` script using the following command: `python data_analysis.py`
- The script generates a bar plot from the CSV data and saves it as `test_bar_plot.png`. A "plot" is a visual representation of data using a specific chart or graph type.
- You can view the generated plot by clicking the "Preview" button in the top right corner of the Gitpod IDE. The plot will be displayed in the Gitpod preview.
- Feel free to explore and modify the code to perform additional data analysis tasks using Pandas and Matplotlib.
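The import step above downloads a CSV file and stores it in an `assets` folder. The actual contents of `import_job_posting_data.py` are not shown in this README, so the following is only a minimal sketch of how such a step might look using the standard library. `DATA_URL` is a hypothetical placeholder, and the function name and parameters are assumptions.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical placeholder: the real download URL is not given in this README.
DATA_URL = "https://example.com/job-postings.csv"


def import_job_posting_data(url: str, dest_dir: str = "assets",
                            filename: str = "sourcestack-data.csv") -> Path:
    """Download the CSV at `url` and store it as `dest_dir/filename`."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create assets/ if missing
    target = target_dir / filename
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target
```

Calling `import_job_posting_data(DATA_URL)` with a real URL would leave the file at `assets/sourcestack-data.csv`, the path that `data_analysis.py` reads.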
The `.gitpod.yml` file is used to configure your Gitpod development environment and automate certain tasks.
The `image` entry specifies the base image for your Gitpod workspace. Here, the development environment is based on Python 3.9.
The `tasks` section defines tasks that are executed when your Gitpod workspace starts. In this case, there is only one task:
- `command: "pip install pandas matplotlib"`: This task runs a command to install the required packages `pandas` and `matplotlib` using `pip`, which is the package manager for Python. These packages are commonly used for data analysis and plotting in Python.
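To confirm that the install task has finished, you could check which versions of the two packages are present. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["pandas", "matplotlib"]).items():
    print(f"{name}: {version or 'not installed'}")
```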
The `vscode` section configures the Visual Studio Code (VS Code) editor in your Gitpod workspace. It lets you specify extensions that are installed and activated when you open your workspace in Gitpod. In this case:
- `extensions: - ms-python.python`: This line specifies the extension `ms-python.python`, which provides Python language support in VS Code.
The `ports` section specifies which ports are exposed in your Gitpod workspace, so that you can access services running on those ports. In this case:
- `port: 8000`: This line specifies that port `8000` should be exposed. Any program running in your Gitpod workspace that listens on port 8000 can then receive network traffic from outside the workspace.
- `onOpen: open-preview`: This line indicates that a preview should open automatically in Gitpod when port `8000` is opened.
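This README does not say which program listens on port 8000. One possibility, shown here purely as an assumption, is Python's built-in `http.server`, which can serve the project folder (including a generated image) over HTTP. The helper name `make_file_server` is hypothetical.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_file_server(directory: str = ".", port: int = 8000) -> ThreadingHTTPServer:
    """Create (but do not start) an HTTP server that serves files from `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to all interfaces so that traffic from outside the workspace can reach it.
    return ThreadingHTTPServer(("0.0.0.0", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
# make_file_server(".", 8000).serve_forever()
```

Once a server like this starts listening on port 8000, the `onOpen: open-preview` setting described above would apply to it.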
Overall, this `.gitpod.yml` file sets up a Python 3.9 environment, installs the required packages (`pandas` and `matplotlib`), configures VS Code with the Python extension, and exposes port `8000`.
The `data_analysis.py` file performs some basic data analysis and visualization tasks on the CSV file `assets/sourcestack-data.csv` located in this repository.
- Importing libraries:
  - `pandas` is imported as `pd` to handle data manipulation and analysis.
  - `matplotlib.pyplot` is imported as `plt` to create visualizations.
- Reading the CSV file:
  - The code reads a CSV file named `'assets/sourcestack-data.csv'` using the `pd.read_csv()` function.
- Creating a new figure:
  - A new figure with a size of 15x10 inches is created using `plt.figure(figsize=(15, 10))`.
- Generating a bar plot:
  - The code counts the occurrences of each value in the 'hours' column of the DataFrame (`df['hours']`) using `value_counts()`.
  - The result is plotted as a bar plot using `df['hours'].value_counts().plot(kind='bar')`.
- Rotating x-axis labels:
  - The x-axis labels of the bar plot are rotated by 90 degrees for better readability using `ax.set_xticklabels(ax.get_xticklabels(), rotation=90)`.
- Adding margins to the plot:
  - A margin of 0.1 is added to the plot to prevent the labels from being cut off using `plt.margins(0.1)`.
- Saving the plot as a PNG image:
  - The plot is saved as a PNG image file named `'test_bar_plot.png'` using `plt.savefig(image_path)`.
- Script completion message:
  - After all the tasks are completed, a message is printed to indicate that the script execution is completed.
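To see what the counting step produces before it is plotted, here is a tiny standalone example of `value_counts()`. The values in the series are made up for illustration; the real 'hours' column may contain different labels.

```python
import pandas as pd

# Illustrative values only; not taken from the actual dataset.
hours = pd.Series(
    ["full-time", "part-time", "full-time", "contract", "full-time"],
    name="hours",
)

# value_counts() returns one row per distinct value, most frequent first.
counts = hours.value_counts()
print(counts)
```

Each distinct value becomes one bar in the plot, and its count becomes the bar height.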
This script essentially reads a CSV file, creates a bar plot of the values in the 'hours' column, and saves the plot as a PNG image file named `'test_bar_plot.png'`. Throughout the script, print statements are used to provide information about the progress of each step.
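Put together, the steps described above could look roughly like the sketch below. It is assembled from the calls named in this README and is not a copy of the actual `data_analysis.py`. The function wrapper, its parameter names, the exact print messages, and the `Agg` backend selection are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_hours(csv_path="assets/sourcestack-data.csv", image_path="test_bar_plot.png"):
    """Count the values in the 'hours' column and save them as a bar plot."""
    print(f"Reading {csv_path} ...")
    df = pd.read_csv(csv_path)

    print("Creating the bar plot ...")
    plt.figure(figsize=(15, 10))
    ax = df["hours"].value_counts().plot(kind="bar")

    # Rotate the category labels and add margins around the bars.
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
    plt.margins(0.1)

    plt.savefig(image_path)
    plt.close()
    print(f"Script execution completed, plot saved to {image_path}")
    return image_path
```

Running `plot_hours()` from the project root after the import step would write `test_bar_plot.png` next to the script.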
Contributions to this project are welcome! If you find a bug, have a feature request, or want to contribute code, please follow these guidelines:
- Fork the repository and clone it to your local machine.
- Create a new branch for your changes: `git checkout -b my-branch-name`
- Make your changes, and test them thoroughly.
- Commit your changes with a descriptive commit message.
- Push your changes to your forked repository: `git push origin my-branch-name`
- Open a pull request against the main branch of this repository, explaining the changes you've made and why they are important.
Thank you for your contributions!
The job data has been retrieved using SourceStack.
This project is licensed under the MIT License - see the LICENSE file for details.