How To Draw Regression Line In Stata

Evaluating Regression Lines Lab

Introduction

In the previous lesson, we learned to evaluate how well a regression line estimated our actual data. In this lab, we will turn these formulas into code. In doing so, we'll build lots of useful functions for both calculating and displaying our errors for a given regression line and dataset.

In moving through this lab, we'll have access to the functions that we previously built out to plot our data, available in the graph file here.

Determining Quality

In the file movie_data.py you will find movie information written as a Python list of dictionaries, with each dictionary representing a movie. The movies are derived from the first 30 entries of the dataset containing 538 movies provided here.

    from movie_data import movies
    len(movies)

Press shift + enter

    {'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'}
    movies[0]['budget'] / 1000000

The numbers are in millions, so we will simplify things by dividing everything by a million.

    scaled_movies = list(map(lambda movie: {
        'title': movie['title'],
        'budget': round(movie['budget'] / 1000000, 0),
        'domgross': round(movie['domgross'] / 1000000, 0)
    }, movies))
    scaled_movies[0]
    {'title': '21 & Over', 'budget': 13.0, 'domgross': 26.0}

Note that, like in previous lessons, the budget is our explanatory value and the revenue is our dependent variable. Here revenue is represented as the key domgross.

Plotting our data

Let's write the code to plot this data set.

As a first task, convert the budget values of our scaled_movies to x_values, and convert the domgross values of the scaled_movies to y_values.

    x_values = None
    y_values = None
    x_values and x_values[0]  # 13.0
    y_values and y_values[0]  # 26.0

Assign a variable called titles equal to the titles of the movies.
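
One way to approach these three assignments is with list comprehensions. The sketch below is only an illustration: sample_movies is a hypothetical stand-in for scaled_movies, and its second entry is made up so that the example runs without movie_data.py.

```python
# Hypothetical stand-in for scaled_movies; only the first entry comes from the lab data.
sample_movies = [
    {'title': '21 & Over', 'budget': 13.0, 'domgross': 26.0},
    {'title': 'Made-Up Movie', 'budget': 40.0, 'domgross': 95.0},  # invented for illustration
]

# Pull one field out of every dictionary, preserving the original order.
x_values = [movie['budget'] for movie in sample_movies]
y_values = [movie['domgross'] for movie in sample_movies]
titles = [movie['title'] for movie in sample_movies]

print(x_values[0], y_values[0], titles[0])
```

Because all three lists are built in the same order, the value at a given index in each list refers to the same movie.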

Great! Now we have the data necessary to make a trace of our data.

    from plotly.offline import iplot, init_notebook_mode
    init_notebook_mode(connected=True)

    from graph import trace_values, plot

    movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
    plot([movies_trace])

Plotting a regression line

Now let's add a regression line to make a prediction of output (revenue) based on an input (the budget). We'll use the following regression formula:

  • $\hat{y} = m x + b$, with $m = 1.7$, and $b = 10$.

  • $\hat{y} = 1.7x + 10$

Write a function called regression_formula that calculates our $\hat{y}$ for any provided value of $x$.

    def regression_formula(x):
        pass

Check to see that the regression formula generates the correct outputs.

    regression_formula(100)  # 180.0
    regression_formula(250)  # 435.0
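
A minimal sketch of one possible regression_formula, using the slope and intercept given above. The m and b keyword arguments are an addition for illustration; the lab's version takes only x.

```python
def regression_formula(x, m=1.7, b=10):
    """Return the predicted revenue (y-hat) for a budget x on the line y = m*x + b."""
    return m * x + b

print(regression_formula(100))
print(regression_formula(250))
```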

Let's plot the data as well as the regression line to get a sense of what we are looking at.

    from plotly.offline import iplot, init_notebook_mode
    init_notebook_mode(connected=True)

    from graph import trace_values, m_b_trace, plot

    if x_values and y_values:
        movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
        regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
        plot([movies_trace, regression_trace])

Calculating errors of a regression Line

Now that we have our regression formula, we can move towards calculating the error. We provide a function called y_actual that, given a data set of x_values and y_values, finds the actual y value for a provided value of x.

    def y_actual(x, x_values, y_values):
        # Pair each x with its y, then keep the first pair whose x matches
        combined_values = list(zip(x_values, y_values))
        point_at_x = list(filter(lambda point: point[0] == x, combined_values))[0]
        return point_at_x[1]

    x_values and y_values and y_actual(13, x_values, y_values)  # 26.0

Write a function called error that, given a list of x_values, a list of y_values, the values m and b of a regression line, and a value of x, returns the error at that x value. Remember ${\varepsilon_i} = y_i - \hat{y}_i$.

    def error(x_values, y_values, m, b, x):
        pass

    error(x_values, y_values, 1.7, 10, 13)  # -6.099999999999994
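
The error at a point is the actual value minus the value predicted by the line. A possible sketch follows. It repeats a compact y_actual so that it runs on its own, and it uses three (budget, revenue) pairs that appear in this lab's expected outputs rather than the full dataset; the names sample_x and sample_y are made up for the example.

```python
def y_actual(x, x_values, y_values):
    # First y whose paired x matches the requested x
    return next(y for x_i, y in zip(x_values, y_values) if x_i == x)

def error(x_values, y_values, m, b, x):
    # Residual: actual value minus the value predicted by the regression line
    expected = m * x + b
    return y_actual(x, x_values, y_values) - expected

sample_x = [13.0, 120.0, 200.0]   # budgets, in millions
sample_y = [26.0, 93.0, 409.0]    # domestic gross, in millions

print(error(sample_x, sample_y, 1.7, 10, 13))   # close to -6.1
```

A negative error means the line overestimates the revenue at that budget; a positive error means it underestimates.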

Now that we have a formula to calculate our errors, write a function called error_line_trace that returns a trace of an error at a given point. So for a given movie budget, it will display the difference between the regression line and the actual movie revenue.

Ok, so the function error_line_trace takes our dataset of x_values as the first argument and y_values as the second argument. It also takes in values of $m$ and $b$ as the next two arguments to represent the regression line we will calculate errors from. Finally, the last argument is the value $x$ it is drawing an error for.

The return value is a dictionary that represents a trace, and looks like the following:

    {'marker': {'color': 'red'},
     'mode': 'lines',
     'name': 'error at 120',
     'x': [120, 120],
     'y': [93.0, 214.0]}

The trace represents the error line above. The data in x and y represent the starting point and ending point of the error line. Note that the x value is the same for the starting and ending point, just as it is for every vertical line. It's just the y values that differ - representing the actual value and the expected value. The mode of the trace equals 'lines'.

    def error_line_trace(x_values, y_values, m, b, x):
        pass

    error_at_120m = error_line_trace(x_values, y_values, 1.7, 10, 120)
    # {'marker': {'color': 'red'},
    #  'mode': 'lines',
    #  'name': 'error at 120',
    #  'x': [120, 120],
    #  'y': [93.0, 214.0]}
    error_at_120m
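
One way error_line_trace could be written is sketched below. It is a self-contained illustration, not the lab's reference solution: the actual-value lookup is inlined instead of calling y_actual, and sample_x and sample_y are hypothetical names holding three points taken from this lab's expected outputs.

```python
def error_line_trace(x_values, y_values, m, b, x):
    # Actual revenue recorded for this budget (first matching point)
    actual = next(y for x_i, y in zip(x_values, y_values) if x_i == x)
    # Revenue predicted by the regression line
    expected = m * x + b
    return {
        'marker': {'color': 'red'},
        'mode': 'lines',
        'name': 'error at ' + str(x),
        'x': [x, x],               # vertical line: same x at both ends
        'y': [actual, expected],   # from the data point to the regression line
    }

sample_x = [13.0, 120.0, 200.0]
sample_y = [26.0, 93.0, 409.0]

trace = error_line_trace(sample_x, sample_y, 1.7, 10, 120)
print(trace)
```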

We just ran our function to draw a trace of the error for the movie Elysium. Let's see how it looks.

    from plotly.offline import iplot, init_notebook_mode
    init_notebook_mode(connected=True)

    from graph import trace_values, m_b_trace, plot

    if x_values and y_values:
        movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
        regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
        plot([movies_trace, regression_trace, error_at_120m])

From there, we can write a function called error_line_traces that takes in a list of x_values as an argument, y_values as an argument, along with the m and b values of the regression line, and returns a list of traces for every x value provided.

    def error_line_traces(x_values, y_values, m, b):
        pass

    errors_for_regression = error_line_traces(x_values, y_values, 1.7, 10)
    errors_for_regression and len(errors_for_regression)  # 30
    errors_for_regression and errors_for_regression[-1]
    # {'x': [200.0, 200.0],
    #  'y': [409.0, 350.0],
    #  'mode': 'lines',
    #  'marker': {'color': 'red'},
    #  'name': 'error at 200.0'}

    from plotly.offline import iplot, init_notebook_mode
    init_notebook_mode(connected=True)

    from graph import trace_values, m_b_trace, plot

    if x_values and y_values:
        movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
        regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
        plot([movies_trace, regression_trace, *errors_for_regression])

Don't worry about some of the points that don't have associated error lines. It is a complexity with Plotly and not our functions.
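
For error_line_traces itself, a possible sketch is a single list comprehension over the paired points. In the lab you would more likely call error_line_trace once per x value; the trace dictionary is inlined here only so the example runs on its own, and sample_x and sample_y are hypothetical names for three points taken from the expected outputs above.

```python
def error_line_traces(x_values, y_values, m, b):
    # One vertical error line per (x, y) point in the dataset
    return [
        {
            'marker': {'color': 'red'},
            'mode': 'lines',
            'name': 'error at ' + str(x),
            'x': [x, x],
            'y': [y, m * x + b],   # actual value, then value on the regression line
        }
        for x, y in zip(x_values, y_values)
    ]

sample_x = [13.0, 120.0, 200.0]
sample_y = [26.0, 93.0, 409.0]

traces = error_line_traces(sample_x, sample_y, 1.7, 10)
print(len(traces))
print(traces[-1])
```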

Calculating RSS

Now write a function called squared_error that, given a value of x, returns the squared error at that x value.

${\varepsilon_i}^2 = (y_i - \hat{y}_i)^2$

    def squared_error(x_values, y_values, m, b, x):
        pass

    x_values and y_values and squared_error(x_values, y_values, 1.7, 10, x_values[0])  # 37.20999999999993

Now write a function that will iterate through the x and y values to create a list of squared errors at each point, $(x_i, y_i)$, of the dataset.

    def squared_errors(x_values, y_values, m, b):
        pass

    x_values and y_values and squared_errors(x_values, y_values, 1.7, 10)

Next, write a function called residual_sum_squares that, provided a list of x_values, y_values, and the m and b values of a regression line, returns the sum of the squared errors for the movies in our dataset.

    def residual_sum_squares(x_values, y_values, m, b):
        pass

    residual_sum_squares(x_values, y_values, 1.7, 10)  # 327612.2800000001
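
The three functions in this section build on one another, which the following sketch tries to make visible. It is one possible implementation under the same signatures as the stubs above, run on three hypothetical sample points (sample_x, sample_y) taken from this lab's expected outputs rather than on the full 30-movie dataset.

```python
def squared_error(x_values, y_values, m, b, x):
    # Actual y for this x (first matching point), minus the prediction, squared
    actual = next(y for x_i, y in zip(x_values, y_values) if x_i == x)
    return (actual - (m * x + b)) ** 2

def squared_errors(x_values, y_values, m, b):
    # Squared error at every x in the dataset
    return [squared_error(x_values, y_values, m, b, x) for x in x_values]

def residual_sum_squares(x_values, y_values, m, b):
    # RSS: the sum of all squared errors
    return sum(squared_errors(x_values, y_values, m, b))

sample_x = [13.0, 120.0, 200.0]
sample_y = [26.0, 93.0, 409.0]

print(squared_errors(sample_x, sample_y, 1.7, 10))
print(residual_sum_squares(sample_x, sample_y, 1.7, 10))
```

Like y_actual, the lookup returns the first point whose x matches, so two movies with the same budget would both be scored against the first one's revenue.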

Finally, write a function called root_mean_squared_error that calculates the RMSE for the movies in the dataset, provided the same parameters as RSS. Remember that root_mean_squared_error is a way for us to measure the approximate error per data point.

    import math

    def root_mean_squared_error(x_values, y_values, m, b):
        # RMSE is the square root of the mean squared error: sqrt(RSS / n)
        return math.sqrt(residual_sum_squares(x_values, y_values, m, b) / len(x_values))

    root_mean_squared_error(x_values, y_values, 1.7, 10)  # approximately 104.5
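
Written out with the symbols used above, the usual definition of RMSE averages the squared errors over the $n$ data points first and takes the square root last:

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} = \sqrt{\frac{RSS}{n}}$

As a rough check, taking the RSS of 327612.28 reported above and assuming $n = 30$ movies, this gives $\sqrt{10920.41}$, or about 104.5. Dividing the square root of the RSS by $n$ instead would give a much smaller number that is no longer in the same units as a typical per-movie error.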

Some functions for your understanding

Now we'll provide a couple functions for you. Note that we can represent multiple regression lines by a list of m and b values:

    regression_lines = [(1.7, 10), (1.9, 20)]

Then we can return a list of the regression lines along with the associated RMSE.

    def root_mean_squared_errors(x_values, y_values, regression_lines):
        errors = []
        for regression_line in regression_lines:
            error = root_mean_squared_error(x_values, y_values, regression_line[0], regression_line[1])
            errors.append([regression_line[0], regression_line[1], round(error, 0)])
        return errors

Now let's generate the RMSE values for each of these lines.

    x_values and y_values and root_mean_squared_errors(x_values, y_values, regression_lines)

Now we'll provide a couple functions for you:

  • a function called trace_rmse, that builds a bar chart displaying the value of the RMSE. The return value is a dictionary with keys of x and y, both of which point to lists. The $x$ key points to a list with one element per regression line: a string containing that regression line's m and b values. The $y$ key points to a list of the RMSE values for each respective regression line.
    import plotly.graph_objs as go

    def trace_rmse(x_values, y_values, regression_lines):
        errors = root_mean_squared_errors(x_values, y_values, regression_lines)
        x_values_bar = list(map(lambda error: 'm: ' + str(error[0]) + ' b: ' + str(error[1]), errors))
        y_values_bar = list(map(lambda error: error[-1], errors))
        return dict(
            x=x_values_bar,
            y=y_values_bar,
            type='bar'
        )

    x_values and y_values and trace_rmse(x_values, y_values, regression_lines)

Once this is built, we can create a subplot showing the two regression lines, as well as the related RMSE for each line.

    import plotly
    from plotly.offline import iplot
    from plotly import tools
    import plotly.graph_objs as go

    def regression_and_rss(scatter_trace, regression_traces, rss_calc_trace):
        fig = tools.make_subplots(rows=1, cols=2)
        for reg_trace in regression_traces:
            fig.append_trace(reg_trace, 1, 1)
        fig.append_trace(scatter_trace, 1, 1)
        fig.append_trace(rss_calc_trace, 1, 2)
        iplot(fig)

    ### add more regression lines here, by adding new elements to the list
    regression_lines = [(1.7, 10), (1, 50)]

    if x_values and y_values:
        regression_traces = list(map(lambda line: m_b_trace(line[0], line[1], x_values, name='m: ' + str(line[0]) + ' b: ' + str(line[1])), regression_lines))
        scatter_trace = trace_values(x_values, y_values, text=titles, name='movie data')
        rmse_calc_trace = trace_rmse(x_values, y_values, regression_lines)
        regression_and_rss(scatter_trace, regression_traces, rmse_calc_trace)

As we can see above, the second line (m: 1.0, b: 50) has the lower RMSE. We thus can conclude that the second line "fits" our set of movie data better than the first line. Ultimately, our goal will be to choose the regression line with the lowest RMSE or RSS. We will learn how to accomplish this goal in the following lessons and labs.

Source: https://githubhelp.com/BGScoones/evaluating-regression-lines-lab-data-science-intro-000

Posted by: cashhincir.blogspot.com
